CNI Fall 2013 – SHARE Update: Higher Education And Public Access To Research

Tyler Walters, Virginia Tech, MacKenzie Smith, UC-Davis

SHARE – Ensuring broad and continuing access to research is central to the mission of higher education.

Catalyst was the February OSTP memo.

SHARE Tenets – How do we see the research world and our role in it? Independent of the operationalization of the OSTP directive, the higher ed community is uniquely positioned to play a leading role in stewardship of research. How can we help PIs and researchers meet compliance requirements?

Higher ed also has an interest in collecting and preserving scholarly output.

Publications, data, and metadata should be publicly accessible to accelerate research and discovery.

Complying with multiple requirements from multiple funding sources will place a significant burden on principal investigators and offices of sponsored research.  The rumors are that different agencies will have different approaches and repositories, complicating the issue.

We need to talk more about workflows and policies. We can rely on existing standards where available.

SHARE is a cross-institutional framework to ensure access to, preservation and reuse of, and policy compliance for funded research. SHARE will be mostly a workflow architecture that can be implemented differently at different institutions. The framework will enable PIs to submit their funded research to any of the deposit locations designated by federal agencies using a common UI. It will package and deliver relevant metadata and files. Institutions implementing SHARE may elect to store copies in local repositories. Led by ARL, with support from AAU and APLU. Guided by a steering committee drawing from libraries, university administration, and other core constituencies.

Researchers – current funder workflows have 20+ steps. Multiple funders = tangle of compliance requirements and procedures, with potential to overwhelm PIs. Single deposit workflow = more research time, less hassle.

Funding Agencies – Streamlines receipt of information about research outputs; Increases likelihood of compliance

Universities – Optimize interaction among funded research, research officers, and granting agency; Creates organic link between compliance and analytics

General Public – Makes it easier for the public to access, reuse, and mine research outputs and research data; adoption of standards and protocols will make things easier for search engines.

Project map: 1. Develop Project Roadmap (hopefully in January); 2. Assemble working groups; 3. Build prototypes; 4. Launch prototypes; 5. Refine; 6. Expand

MacKenzie – The great thing about SHARE is that it means something different to everyone you talk to. 🙂

Architecture (very basic at present) – 4 layers:

Thinking about how to federate content being collected in institutional repositories – content will be everywhere in a distributed content storage layer. Want customized discovery layers (e.g. DPLA) above that. Notification layer above that. Content aggregations for things like text mining (future), in order to support that will need to identify content aggregation mechanisms (great systems out there already).

Raises lots of issues:

  • Researchers don’t want to do anything new, but want to be able to apply. Want to embed notification layer into tools they already use.
  • Sponsored research offices are terrified of a giant unfunded mandate. So we have to provide value back to researcher and institution, and leverage what we already have rather than building new infrastructure.
  • Who should we look at as existing infrastructure to leverage?

What’s the balance between publications and data? Both were covered in the memo, but it was very vague. Most agencies and institutions have some idea how to deal with publications, but not the data piece. Whatever workflow SHARE deals with will have to incorporate data handling.

Want to leverage workflow in sponsored project offices to feed SHARE.

What do we know about CHORUS (the publishers’ response) at this point? Something will exist. It would be good to have notifications coming out of CHORUS – they are part of the ecosystem.

Faculty will get a lot more interested in what’s being put out on the web about their activities. Some campuses are tracking a lot of good data in promotion and tenure dossier systems, but  that may not be able to be used for other purposes.

Will there be metadata shared for data or publications that can’t be made public? Interesting issues to deal with.

What does success look like? We don’t know yet – the immediate problem is the OSTP mandate, which will come soon. At base the notification system is very important – the researcher notifying the people who need to know that output has been produced from their research. Other countries have had notions of accountability for funded research for a long time. Even deans don’t know what publications are being produced by their faculty. In Europe they don’t have that problem.

Want to be in a position by end of 2014 to invite people to look at and touch system. Send thoughts to


CNI Fall 2013 – Creating A Data Interchange Standard For Researchers, Research, And Research Resources: VIVO-ISF

Dean B. Krafft, Brian Lowe, Cornell University

What is VIVO?

  • Software: an open-source semantic-web-based researcher and research discovery tool
  • Data: Institution-wide, publicly-visible information about research and researchers
  • Standards: A standard ontology (VIVO data) that interconnects researchers

VIVO normalizes complex inputs, connecting scientists and scholars with and through their research and scholarship.

Why is VIVO important?

  • The only standard way to exchange information about research and researchers across diverse institutions
  • Provides authoritative data from institutional databases of record as Linked Open Data
  • Supports search, analysis, and visualization of data
  • Extensible

An HTTP request can return HTML or RDF data, depending on what the client asks for.
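
As a rough illustration of that point, the sketch below picks a representation from a request's Accept header. The function name, the media-type table, and the simplified header parsing are assumptions made for this example; this is not VIVO's actual code.

```python
# Hypothetical sketch of HTTP content negotiation; not VIVO's implementation.

# Media types this illustrative server can produce, mapped to a short label.
SUPPORTED = {
    "text/html": "html",
    "application/rdf+xml": "rdf",
    "text/turtle": "rdf",
}


def choose_representation(accept_header: str, default: str = "html") -> str:
    """Return "html" or "rdf" for a given Accept header value.

    Simplified: honors q-values, but the only wildcard handled is */*.
    """
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        candidates.append((-quality, position, media_type))

    # Highest q-value first; ties keep the order the client listed them in.
    for neg_quality, _, media_type in sorted(candidates):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        if media_type in SUPPORTED:
            return SUPPORTED[media_type]
        if media_type == "*/*":
            return default
    return default
```

A browser sending `text/html` would get the profile page, while a harvester sending `text/turtle` would get the same resource as RDF.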

Value for institutions and consortia

  • Common data substrate
  • Distributed curation beyond what is officially tracked
  • Data that is visible gets fixed

US Dept. of Agriculture is implementing VIVO for 45,000 intramural researchers to link to Land Grant universities and international agricultural research institutions.

VIVO exploration and Analytics

  • structured data can be navigated, analyzed, and visualized within or across institutions.
  • VIVO can visualize strengths of networks
  • Create dashboards to understand impact

Providing the context for research data

  • Context is critical to find, understand, and reuse research data
  • Contexts include: narrative publications, research grant data, etc.
  • VIVO dataset registries: Australian National Data Registry, Datastar tool at Cornell

Currently hiring a full-time VIVO project director.

VIVO and the Integrated Semantic Framework

What is the ISF?

  • A semantic infrastructure to represent people based on all the products of their research and activities
  • A partnership between VIVO, eagle-i, and ShareCenter
  • A Clinical and Translational Information Exchange Project (CTSAConnect): 18 months (Feb 2012–Aug 2013), funded by NIH

People and Resources – VIVO interested primarily in people, eagle-i interested in genes, anatomy, manufacturer. Overlap in techniques, training, publications, protocols.

ISF Ontology about making relationships – connecting researchers, resources, and clinical activities. Not about classification and applying terms, but about linking things together.

Going beyond static CVs – distributed data, research and scholarship in context, context aids in disambiguation, contributor roles, outputs and outcomes beyond publications.

Linked Data Vocabularies: FOAF (Friend of a Friend) for people, organizations, groups; VCard (Contact info) (new version); BIBO (publications); SKOS (terminologies, controlled vocabularies, etc).

Open biomedical Ontologies (OBO family): OBI (Ontology of biomedical investigations); ERO (eagle-i Research Resource Ontology); RO (Relationship Ontology); IAO (Information Artifact Ontology – goes beyond bibliographic)

Basic Formal Ontology from OBO – Process, Role, Occurrent, Continuant, Spatial Region, Site.

Reified Relationships – Person-Position-Org, Person-Authorship-Article. RDF Subject/predicate model breaks down for some things, like trying to model different position relationships over time.  So use a triple so the relationship gets treated as an entity of its own with its own metadata. Allows aggregation over time, e.g. Position can be held over a particular time interval. Allows building of a distributed CV over time.  Allows aggregating name change data over time by applying time data to multiple VCards with time properties.
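
A minimal sketch of the reification idea, using plain Python tuples rather than an RDF library. All identifiers, property names, titles, and years here are made up for illustration and are not actual VIVO-ISF URIs or predicates.

```python
# Illustrative only: identifiers and property names are hypothetical,
# not actual VIVO-ISF URIs or predicates.

# A plain triple has no room for *when* the relationship held.
direct = [("person:jane", "worksFor", "org:university-a")]

# Reified form: each position is its own node and carries its own metadata.
triples = [
    ("person:jane", "holdsPosition", "position:1"),
    ("position:1", "inOrganization", "org:university-a"),
    ("position:1", "title", "Postdoctoral Associate"),
    ("position:1", "startYear", 2008),
    ("position:1", "endYear", 2011),
    ("person:jane", "holdsPosition", "position:2"),
    ("position:2", "inOrganization", "org:university-b"),
    ("position:2", "title", "Assistant Professor"),
    ("position:2", "startYear", 2011),
    ("position:2", "endYear", None),  # still held
]


def objects(subject, predicate):
    """Return every object of triples matching (subject, predicate, ?)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def first(subject, predicate):
    """Return the first value for a property, or None if there is none."""
    values = objects(subject, predicate)
    return values[0] if values else None


def positions_over_time(person):
    """Assemble a chronological list of positions, CV style."""
    rows = []
    for position in objects(person, "holdsPosition"):
        rows.append({
            "org": first(position, "inOrganization"),
            "title": first(position, "title"),
            "start": first(position, "startYear"),
            "end": first(position, "endYear"),
        })
    return sorted(rows, key=lambda row: row["start"])
```

Because the position node carries the time interval, the same pattern extends to Person-Authorship-Article or to names that change over time.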

Beyond publication bylines – What are people doing? Roles are important in VIVO ISF. Person-Role-Project. Roles and outputs: Person-Role-Project-document, resource, etc.

Application examples: search (can pull in data from distributed software, e.g. Harvard Profiles) using VIVO ontologies.

Use cases: Find publications supported by grants; discover and reuse expensive equipment and resources; demonstrate importance of facilities services to research results; discover people with access to resources or expertise in techniques.

Humanities and Artistic Works – performances of a work, translations, collections and exhibits. Steven McCauley and Theodore Lawless at Brown.

Collaborative development – DuraSpace VIVO-ISF Working Group. Biweekly calls Wed 2 pm ET.

Linked Data for Libraries

December 5, 2013 Mellon made a 2 year grant to Cornell, Harvard, and Stanford starting Jan 2014 to develop Scholarly Resource Semantic Information Store model to capture the intellectual value that librarians and other domain experts add to information resources, together with the social value evident from patterns of research.

Outcomes: Open source extensible SRSIS ontology compatible with VIVO, BIBFRAME and other ontologies for libraries.

Sloan has funded Cornell to integrate ORCID more closely with VIVO. At Cornell they’re turning MARC records into RDF triples indexed with SOLR –


CNI Fall 2013 – Visualizing: A New Data Support Role For Duke University Libraries

Angela Voss – Data Visualization Coordinator, Duke Libraries

Data visualization can be typical types such as maps or tag clouds, or custom visualizations such as parallel axes plots. Helping people match their data to their needs, and what they want to get out of their data. Also help people think about cost/benefits of creating visualizations.

Why visualize?

  • Explore data, uncover hidden patterns. e.g. Anscombe’s Quartet.
  • Translate something typically invisible into the visible – makes the abstract easier to understand, increase engagement. Important to people performing research as well as reporting to others.
  • Communicate results, contextualize data, tell a story, or possibly even mobilize action around a problem (see Hans Rosling: The River of Myths). Important to build context around data, not just think that the numbers speak for themselves.
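
The Anscombe's Quartet point can be checked numerically. The sketch below uses Anscombe's published values and computes the mean of x, the mean of y, and the Pearson correlation for each of the four datasets; the helper names are just for this example.

```python
from math import sqrt

# Anscombe's four datasets; I, II, and III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return covariance / spread


def summarize():
    """Map dataset name to (mean x, mean y, correlation), rounded to 2 places."""
    return {
        name: (round(mean(xs), 2), round(mean(ys), 2), round(pearson_r(xs, ys), 2))
        for name, (xs, ys) in QUARTET.items()
    }


if __name__ == "__main__":
    for name, stats in summarize().items():
        print(name, stats)
```

All four datasets report the same rounded summary (mean x of 9.0, mean y of 7.5, correlation of 0.82), yet their scatter plots look nothing alike, which is why plotting belongs in the exploration step.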

Visualization at Duke

  • No single centralized community, but plenty of distributed groups and projects.
  • Library was already offering GIS help.
  • Who could support visualization? Faculty/department? College/school? Campus-wide organization – was the only option with wide enough reach. There were several options – Duke created a position that reports jointly to Libraries and OIT.
  • Position started in June 2012 – Dual report to Data and GIS Services in the Libraries and Research Computing in OIT.
  • Objectives: instruction and outreach; consultation; develop new visualization services, spaces, programs.

After 18 months, what has been the most successful?

  • Visualization workshop series – software (Tableau (full time students get software free), d3 (Javascript library)), data processing (text analysis, network analysis), best practices (designing academic figures/posters, top 10 dos and don’ts for charts and graphs). The barrier is understanding data transformations to get data into software
  • Online instructional material
  • Just-in-time consulting – crucial to people getting started.
  • Ongoing visualization seminar series – this had been happening since 2002. Helped introduce the community.
  • Student data visualization contest

d3 monthly study group – Using GitHub to share sample code. Using Gist and to see the visualization right away. e.g.

Top 10 Dos and Don’ts for Charts and Graphs:

  • Simplify less important information
  • Don’t use 3D effects.
  • Don’t use rainbows for ordered, numerical variables. Use single hue, varying luminance.
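
One way to follow the "single hue, varying luminance" advice is sketched below with Python's standard `colorsys` module. The function names and default hue, saturation, and lightness values are assumptions chosen for illustration, and HLS lightness is only a rough stand-in for perceived luminance.

```python
import colorsys


def single_hue_ramp(n, hue=0.6, saturation=0.65, lightest=0.9, darkest=0.25):
    """Return n hex colors sharing one hue, running from light to dark.

    hue, saturation, and the lightness bounds are fractions in [0, 1].
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    colors = []
    for i in range(n):
        t = i / (n - 1) if n > 1 else 0.0
        lightness = lightest + t * (darkest - lightest)
        r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors


def lightness_of(hex_color):
    """Recover the HLS lightness of a '#rrggbb' color, for sanity checks."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    return colorsys.rgb_to_hls(r, g, b)[1]
```

Mapping low values to the light end and high values to the dark end gives readers an ordering they can see at a glance, which a rainbow scale does not.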

Just in time consulting

  • Weekly walk-in consulting hours in the Data & GIS Services computer lab
  • Additional appointments outside of walk-in hours
  • Detailed support and troubleshooting via email

Weekly visualization seminars – Lunch provided, speakers from across campus and outside. Regularity helps. Live streaming and archived video.

Student data visualization contest

  • Goal: to advertise new services, take a survey of visualization at Duke – helped build relationships across the campus.
  • Open to Duke students, any type of visualization
  • Judged on insightfulness, narrative, aesthetics, technical merit, novelty
  • Awarded three finalists and two winners. Created posters of the winners to display in the lab, and run them on the monitor wall.

After 18 months, what are the challenges?

  • Marketing and outreach – easy to get overwhelmed by the people already using services at the expense of reaching new communities.
  • Staying current – every week there’s a new tool.
  • Project work, priorities – important to continue work as a visualizer on projects.
  • Disciplinary silos and conventions
  • Curriculum and skill gaps – there aren’t people teaching visualization at Duke as a separate topic. Common skill gaps: visualization types and tools; spreadsheet and/or database familiarity; scripting; robust data management practices; basic graphic design

Hopes for the future

  • Active student training program (courses, independent studies, student employment)
  • Additional physical and digital exhibit opportunities
  • Continued project and workshop development

What should a coordinator know?

  • Data transformations
  • Range of visualization types, tools
  • Range of teaching strategies
  • Marketing

What should a coordinator do?

  • Find access points to different communities
  • Use events to build community
  • Collaborate on research projects
  • Stockpile interesting datasets
  • Beware of unmanaged screens
  • Block out plenty of quiet time for the above

How should an organization establish a new visualization support program?

  • Identify potential early adopters
  • Budget for a few events, materials, etc
  • Involve other service points
  • Provide a support system for the coordinator
  • Expect high demand

Working primarily with staff and grad students; this quarter a lot of undergrads, due to a few courses.

Angela’s background is in communication for the most part. There’s an IEEE visualization conference.

CNI Fall 2013: Capturing the Ephemeral: Collecting Social Media and Supporting Twitter Research with Social Feed Manager

Daniel Chudnov, Bergis Jules, Daniel Kerchner, Laura Wrubel – George Washington University

Save the time of the researcher:

  • Demand from all across the spectrum from researchers for access to historical Twitter data. Library of Congress is archiving Twitter but not giving access.
  • How are current researchers collecting Twitter data? One example from GWU: By hand – Google Reader (RSS feeds from Twitter), copy and paste into Excel, add coding, then pull into SPSS and Stata. Too much work for too little data (not to mention tools which don’t exist anymore). Copy and paste to Excel doesn’t scale. Not an isolated case. Over 5000 theses and dissertations since 2010 using Twitter data.
  • What researchers ask for: specific users, keywords; basic values (user, date, text, retweets and follower counts); thousands, not millions; need delimited files to import into coding software; need historical data.
  • Getting historical data usually requires going through a licensed Twitter reseller: DataSift, Gnip, NTT Data, Topsy (recently purchased by Apple). There are some research platforms for using Twitter data. Data is not cheap, but the resellers are friendly and receptive to working with researchers (hoping to hear about products being developed specifically for the academic community). They are used to dealing with customers who can handle very large datasets.
  • University archives using this to document student organizations, who are highly active on Twitter. Since March tracking 329 accounts with over 299k tweets.
  • Social Feed Manager software is available on GitHub. Django framework with django-social-auth library and tweepy library.
  • Just capturing a single feed doesn’t capture all the depth of interaction – want to expand it further and also use other sources.
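
The request for delimited files with a handful of basic values could look roughly like the sketch below. It uses only Python's standard `csv` module; the function name and the dictionary keys are assumptions for illustration, and this is not Social Feed Manager's actual export code.

```python
import csv
import io

# The basic values researchers asked for; key names are hypothetical.
FIELDS = ["user", "date", "text", "retweets", "followers"]


def tweets_to_delimited(tweets, delimiter="\t"):
    """Render an iterable of tweet-like dicts as delimited text.

    Keys outside FIELDS are dropped; missing keys become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDS,
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    for tweet in tweets:
        row = {field: tweet.get(field, "") for field in FIELDS}
        # Collapse line breaks and tabs so each tweet stays on one row.
        row["text"] = " ".join(str(row["text"]).split())
        writer.writerow(row)
    return buffer.getvalue()
```

A tab-delimited file like this opens directly in Excel and imports into SPSS or Stata, which replaces the copy-and-paste step described above.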

CNI Fall 2013 – Opening Plenary – Cliff Lynch

I’m in DC for the Fall membership meeting of the Coalition for Networked Information, which is always a great place to pick up on the latest goings-on at the intersection of libraries and digital information and technology. As usual, the meeting kicks off with Cliff Lynch, the Executive Director of CNI, giving a summary of the current state of the art.

Some things Cliff won’t talk about:

  • The work Joan Lippincott has been leading on digital scholarship centers. Gets at some of the vehicles for forging and sustaining collaborations within research institutions and stewardship of materials that come out of that. There is a session tomorrow about that.
  • Output of executive roundtable on the acquisition, collection, and curation of e-books at scale by university libraries, as well as interaction between online textbooks and ebooks in research libraries. A summary session on that tomorrow. A general challenge question in this area: Are there examples from the recent landscape of books being published in electronic format only (or perhaps with print on demand) that contain high-impact content? Are we starting to see the market emerge where you HAVE to deal with electronic material because it’s not coming out in print, to get coverage of recent events? Thinking mostly about books outside of extremely narrow scholarly domains. Spring executive roundtable will look at software as a network-based service.
  • MOOCs. A year ago you couldn’t convene a group of five academics and get them not to talk about MOOCs. While discussion continues, it’s at a much lower temperature. There is some interesting preliminary research looking at the characteristics of the folks who seem to be most successful at MOOCs. Invites us to go back to much more fundamental notions of the library as the center of the university (if you connect a learner with a good library they’ll go far). That’s true of a certain type of person, but not all. The delivery of teaching and learning experiences is different than delivering a collection of knowledge. In the early enthusiasm about MOOCs there’s a tendency to see them as courses by other means. We will see MOOCs or MOOC-like things for purposes different than traditional courses, like training and other things that don’t fit well in the traditional academic definition of a course.

Things that are changing the landscape:

Hard to not give prominent place to the OSTP directive for federal funding agencies to develop plans to give access to reports and underlying data produced by funded research. There was an August deadline for submission of plans from agencies, which are not public. OSTP has been forthright that some plans are a bit more mature than others, depending on agency. We don’t have a firm date for when they will become public, but there is momentum. These developments will reshape the landscape for institutions that host the researchers as well as for the researchers themselves. Other nations and non-governmental funders are moving in the same directions.

One way to think about this is in the governmental sector, as a new set of compliance requirements. But we’ve seen the leadership in research and higher education (ARL, AAU, APLU) look at this as an opportunity and a challenge to rationalize the production of scholarly literature and data, which needs to be done. We’re seeing  a lot of changes in the obligations and practices that surround scholarly publishing – a whole range of behaviors that need to be rationalized so that researchers aren’t left scratching their heads. Seen a variety of responses to the OSTP mandate – from SHARE (ARL), CHORUS (STEM publishers), government (based on PubMed Central). All have places in the ecology. One of the attractive things about CHORUS is that it makes articles available in the context of the journals in which they appear, which institutional repositories do not. We need to think about how to take advantage of this, not view it as competition.

A little bit of redundancy is not a bad thing. When the government shut down we got an education about how deeply entwined many scholarly information services are with non-essential government services. Interesting to look as a case study at what was unavailable during the shutdown. At some level PubMed Central was deemed essential and stayed up, though it was not ingesting new contributions. Conversations took place recently under the name ANADP II, in Barcelona, mostly among national libraries, looking at aligning digital preservation strategies. Can see very clear advantages to aligning strategies at a nation state level, but realize that there are some functions that each nation wants to maintain autonomy in, rather than getting into interdependent collaborations. A set of trade-offs. When does collaboration turn into interdependence?

SHARE is not just an opportunity and a challenge to straighten out publication, but it also deals with data. There was a second executive order over the summer that told federal agencies that the default thinking about data systems (modulo security and privacy concerns) is to provide public access to data. The word public is popular in governmental circles (rather than “open”). When you talk about public access (to let the public of the United States have access to data) that can be from access to raw data files, all the way to systems that help the public understand and analyze the data. There are people in government struggling to understand where on that continuum to fall, especially as there is no money associated with these initiatives.

Issues emerging in the data area: Research and higher-ed community is mobilizing to address needs through bit preservation services. Some data is constrained because of personally identifiable information. Anonymization, while a useful tool, has limited power. It is frighteningly easy to de-anonymize data. Need to think about how to handle personal data while we gain the power of recombination and reuse of research data. We are seeing a movement away from commitment to open-ended preservation of data toward the more limited language of data management plans, e.g. preservation of bits for ten years. There are a number of commercial services or consortially based services where you can prepay for ten years. General proposition is we’ll go ten years and then look at what kind of use has been made of the data and then look at alternatives. We have no process for doing that evaluation – we’ll need to involve all sorts of community discussions about value of data, which will need to be cross-institutional (we’ll need registries). It’s not too soon to start thinking about this problem.

This is an example of a broader issue of “transitions of stewardship” – somebody’s been taking care of something, but now their commitment is expiring. We need an orderly way of putting the information resource in front of the scholarly community and evaluating the need for continuing preservation and finding who will step up to it. We’re getting very good at making digital replications of 2-dimensional things like fine art, but the difference is tracking provenance. There’s lots of progress in 3-dimensional work as well (Smithsonian, e.g.). We now have an opportunity to peel off the scholarly side of artifacts for not only exhibition, but as objects of study. There are lots of institutions of cultural memory that are under severe stress – see the discussions around the collection of the Detroit Institute of Art, which again leads to the idea of taking transitions seriously.

We need to be thinking about where we’re assigning resources. Two things that are troublesome: 1) we don’t know how well we’re doing with our digital preservation efforts. How much of the web gets covered by web archiving? We don’t have an inventory of the kinds of things that are out there or what parts are covered, or where the areas of highest risk are. There’s a tendency to go after the easy stuff – our strategy going forward needs to become much more systematic. We have a tendency to continually improve things already in our grasp (like continually improving layers of backup for archives), but we need to look at the tradeoff of resources for this versus focusing on what we’re not yet capturing.

Another place where we’re seeing emerging activities that need to turn into a system is in distributed factual biography. Author identity, citations, aggregation, interchange, and compilation of citations. Connected to compliance issues, with academic processes, social networking among scholars, identifying important work. There’s an enormous amount of siloed work going on. Creeping up towards a place where we have factual biographies that we can break up into smaller parts and reassemble. What degree of assurance do we need on these bits? What role does privacy play? Is the fact that you published something a secret, or should it be able to be? Noteworthiness – Wikipedia has a complicated set of criteria of deciding whether your biography is worthy of Wikipedia. This has a rich wonderful history, going back to the nineteenth century work on national biographical dictionaries. When does someone become a public figure? There’s a question about systems of annual faculty reviews – one of the most hideous examples of siloed activity imaginable. Information often collected in forms that aren’t reusable in multiple ways. These need to be tied together with things like grant management systems, bibliometric systems, etc which are all moving the same data around. Other countries where the government is involved in assessing faculty work to pass out funds are more sophisticated than is typical in the US. One of the things we need to look at hard and quick is interchange formats. There’s good work in Europe and out of the VIVO community.

Notion of coherence at scale – framed by Chuck Henry at CLIR. We’re moving past the era of building fairly little systems and federating them, and we need to be thinking at scale – how do systems depend on each other and interrelate? Look beyond academia – Wikipedia, Google, Microsoft, Internet Archive. Look at incredible accomplishments of DPLA (Digital Public Library of America). They’re being very clear about what they’re not going to do in the near future, by implication saying that someone else needs to worry about those topics. The scale of engineering we’re looking at to manage scholarship and research knowledge is crossing some fundamental thresholds and we’re going to need to do things very differently than we did in the past. Examples are all around – look at the Pentagon Papers, which are now a fundamental reference source for the history of that time. That was a book – the research community knew how to deal with it when it was published. What do we do with things like Wikileaks? What do we do with massive data revelations?