CNI Fall 2015 Day 1

I’m at the fall meeting for the Coalition for Networked Information. For those who don’t know, CNI is a joint initiative of Educause and the Association of Research Libraries and was founded in 1990 to promote the use of digital information technology to advance scholarship and education. I was involved in the early days of CNI and I’m happy to have recently been appointed as a representative of Educause on the CNI Steering Committee.

Cliff Lynch is CNI’s Executive Director, and one of the highlights of the member meetings is his plenary address, where he generally surveys the landscape of digital information and pulls together interesting, intriguing, and sometimes troubling themes that he thinks are worth watching and working on.

In today’s plenary Cliff talked about the evolving landscape of federal mandates for public access to federally funded research results. It is only in 2016 that we will see the actual implementation of the plans the various federal agencies put forward to implement the directive that the Office of Science and Technology Policy put out in 2013. Cliff noted that the implementations of the multiple federal funding agencies are not coordinated, and that some of them are not in sync with existing practices at institutions, and there will be a lot of confusion.

Cliff also had some very interesting observations on the current set of issues surrounding security and privacy. He cited the recent IETF work on pervasive surveillance threat models, noting that if you can watch enough aggregate traffic patterns going to and from network locations you can infer a lot, even if you can’t see into the contents of encrypted traffic.  And with the possible emergence of quantum computing that may be able to break current encryption technologies, security and privacy become much more difficult. Looking at the recent string of data breaches at Sony, the Office of Personnel Management, and several universities, you have to start asking whether we are capable of keeping things secure over time.

He then moved on to discussing privacy issues, noting that all sorts of data is being collected on people’s activities in ways that can be creepy – e-texts that tattle on you, e-companions for children or the elderly that broadcast information. CNI held a workshop in the spring on this topic, and the general consensus was that people should be able to have a reasonable expectation of privacy in their online activities, and they should be informed about use of their data. It’s generally clear that we’re doing a horrible job at this. NISO just issued work on distilling some principles. In our campuses people have different impressions of what’s happening in authorization handoffs between institutions and publishers – it’s confused enough that CNI will be fostering some work to gather some facts about this.

The greatest area of innovation right now that Cliff sees is where technology gets combined with other things (the internet of things) – like drones, autonomous vehicles, machine learning, robotics, etc. But there isn’t a lot of direct technical IT innovation happening, and what we’re seeing is a degree of planned obsolescence where we’re forced to spend lots of time and effort to upgrade software or hardware in ways that don’t get us any increased functionality or productivity. If that continues to be the case we’ll need to figure out how to “slow down the hamster wheel.”

Finally Cliff closed by talking about the complexity of preservation in a world where information is presented in ways increasingly tailored to the individual. How do we document the evolution of experiences that are mediated by changing algorithms? And this is not just a preservation problem but an accountability issue, given the pervasive use of personalized algorithms in important functions like credit ratings.




CNI Fall 2014 Meeting: The NIH Contribution to the Commons

The NIH Contribution to the Commons
Philip E Bourne, Associate Director for Data Science, NIH

There’s a realization that the future of biomedical research will be very different – the whole enterprise is becoming more analytical. How can we maximize rate of discovery in this new environment?

We have come a long way in just one researcher’s career. But there is much to do: too few drugs, not personalized, too long to get to market; rare diseases are ignored; clinical trials are too limited in patients, too expensive, and not retroactive; education and training does not match current market needs; research is not cost effective – not easily replicated, too slow to disseminate. How do we do better?

There is much promise: 100,000 genomes project – goal is to sequence 100k people and use it as a diagnostic tool. Comorbidity network for 6.2 million Danes over 14.9 years – the likelihood that if you have one disease you’ll get another. Incredibly powerful tool based on an entire population. We don’t have the facilities in this country to do that yet – need access and homogenization of data.

What is NIH doing?

Trying to create an ecosystem to support biomedical research and health care. Too much is lost in the system – grants are made, publications are issued, but much of what went into the publication is lost.

Elements of ecosystem: Community, Policy, Infrastructure. On top of that lay a virtuous research cycle – that’s the driver.

Policies – now & forthcoming
Data sharing: NIH takes this seriously, and now has mandates from government on how to move forward with sharing. Genomic data sharing is announced; Data sharing plans on all research awards; Data sharing plan enforcement: machine readable plan, repository requirements to include grant numbers. If you say you’re going to put data in repository x on date y, it should be easy to check that that has happened and then release the next funding without human intervention. Actually looking at that.
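The automated check described here could look something like the following minimal sketch. The plan fields (`grant_number`, `repository`, `due`) and the shape of the deposit records are assumptions made for illustration; the talk did not describe an actual format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SharingPlan:
    """Hypothetical machine-readable data sharing plan entry."""
    grant_number: str
    repository: str
    due: date


@dataclass(frozen=True)
class Deposit:
    """Hypothetical record a repository exposes for each deposited dataset."""
    grant_number: str
    repository: str
    deposited_on: date


def plan_is_met(plan: SharingPlan, deposits: list[Deposit]) -> bool:
    """True if a deposit citing the grant reached the promised repository by the due date."""
    return any(
        d.grant_number == plan.grant_number
        and d.repository == plan.repository
        and d.deposited_on <= plan.due
        for d in deposits
    )


# Placeholder grant number and repository name, not real identifiers.
plan = SharingPlan("R01-EXAMPLE-1", "repository-x", date(2015, 6, 30))
deposits = [Deposit("R01-EXAMPLE-1", "repository-x", date(2015, 5, 12))]
print(plan_is_met(plan, deposits))
```

The point of requiring grant numbers in repository records is visible here: without that shared key there is nothing to match a plan against.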

Data Citation – elevate to be considered by NIH as a legitimate form of scholarship. Process: machine readable standard for data citation (done – JATS (XML ingested by PubMed) extension); endorsement of data citation in NIH biosketch, grants, reports, etc.

Infrastructure –
Data to Knowledge initiative (BD2K). Funded 12 centers of data excellence – each associated with different types of data. Also funded data discovery index consortium, building means of indexing and finding that data. It’s very difficult to find datasets now, which slows process down. Same can be said of software and standards.

The Commons – A conceptual framework for sharing and being FAIR: Finding, Accessing, Integrating, Reusing
Digital research objects with attribution – can be data, software, narrative, etc. The Commons is agnostic of computing platform.

Digital Objects (with UIDs); Search (indexed metadata); Search

Public cloud platforms, super computing platforms, other platforms

Research Object IDs under discussion by the community – BD2K centers, NCI cloud pilots (Google and AWS supported), large public data sets, MODs. Meeting in January in UK – could be DOIs or some other form.

Search – BD2K data and software discovery indices; Google search functions

Appropriate APIs being developed by the community, e.g. Global Alliance for Genomics and Health. I want to know what variation there is in chromosome 7, position x, across the human population. With the Commons more of those kinds of questions can be answered. Beacon is an app being tested – testing people’s willingness to share and people’s willingness to build tools.
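A Beacon-style question boils down to a yes/no lookup: has a given allele been observed at a given position in this dataset? The sketch below illustrates that idea only. The in-memory variant set, the function name, and the values are all made up, and this is not the Global Alliance API.

```python
# Each entry is (chromosome, position, alternate allele). The values are
# made-up placeholders, not real variant data.
OBSERVED_VARIANTS = {
    ("7", 117_000_100, "A"),
    ("7", 117_000_250, "T"),
    ("12", 25_000_000, "G"),
}


def beacon_has_variant(chromosome: str, position: int, allele: str) -> bool:
    """Answer only 'yes' or 'no', revealing nothing else about the dataset."""
    return (chromosome, position, allele.upper()) in OBSERVED_VARIANTS


print(beacon_has_variant("7", 117_000_100, "a"))
print(beacon_has_variant("7", 117_000_100, "C"))
```

Returning a bare boolean is what makes this a low-stakes test of willingness to share: the data holder exposes existence, not records.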

The Commons business model: What happens right now is people write grant to NIH, with line items to manage data resources. If successful they get money – then what happens? Maybe some of it gets siphoned off to do something else. Or equipment gets bought where it’s not heavily utilized. As we move more and more to cloud resources, it’s easier to think of a business model based on credit. Instead of getting hard cash you’re given an amount of credit that you can spend in any Commons compliant service, where compliance means they’ve agreed to share. Could be that institution is part of Commons or it could be public cloud or some other kind of resource. Creates more of a supply and demand environment. Enables a public/private partnership. Want to test idea that more can be done with computing dollars. NIH doesn’t actually know how much they spend on computation and data activities – but undoubtedly over a billion dollars per year.

Community: Training initiatives. Build an OPEN digital framework for data science training: NIH data science workforce development center (call will go out soon). How do you create metadata around physical and virtual courses? Develop short-term training opportunities – e.g. supported workshop with gaming community. Develop the discipline of biomedical data science and support cross-training – OPEN courseware.

What is needed? Some examples from across the ICs:
Homogenization of disparate large unstructured datasets; deriving structure from unstructured data; feature mapping and comparison from image data; visualization and analysis of multi-dimensional phenotypic datasets; causal modeling of large scale dynamic networks and subsequent discovery.

In process of standing up Commons with two or three projects – centers being funded from BD2K who are interested, working with one or two public cloud providers. Looking to pilot specific reference datasets – how to stimulate accessibility and quality. Having discussions with other federal agencies who are also playing around with these kinds of ideas. FDA looking to burst content out into cloud. In Europe ELIXIR is a set of countries standing up nodes to support biomedical research. Having discussions to see if that can work with Commons.

There’s a role for librarians, but it requires significant retraining in order to curate data. You have to understand the data in order to curate it. Being part of a collective that is curating data and working with faculty and students that are using that data is useful, but a cultural shift. The real question is what’s the business model? Where’s the gain for the institution?

We may have problems with the data we already have, but that’s only a tiny fraction of the data we will have. The emphasis on precision medicine will increase dramatically in the next while.

Our model right now is that all data is created equal, which clearly is not the case. But we don’t know which data is created more equal. If grant runs out and there is clear indication that data is still in use, perhaps there should be funding to continue maintenance.

CNI meeting Fall 2014: SHARE update

SHARE update
Tyler Walters, SHARE director and dean of libraries at Virginia Tech
Eric Celeste, SHARE technical director
Jeff Spies, Co-founder/CTO at Center for Open Science

SHARE is a higher education initiative to maximize research impact. (huh?)

Sponsored by ARL, AAU, APLU.

Knowing what’s going on and keeping informed of what’s going on.

Four working groups addressing key tasks: repository, workflow, technical, communication

Received $1 million from IMLS and Sloan to generate a notification service.

SHARE is a response to the OSTP memo, but roots before that.

Infrastructure: Repositories, research network platforms, CRIS systems, standards and protocols, identifiers

Workflow – multiple silos = administrative burden

Policy – public access, open access, copyright, data management and sharing, internal policies.

Institutional context: US federal agencies join growing trend to require public access to funded research; measurable proliferation of institutional and disciplinary repositories; premium on impact and visibility in higher ed.

Research Context – Scholarly outcomes are contextualized by materials generated in the process and aftermath of scholarly inquiry. Research process generates materials covering methods employed, evidence used, and formative discussion.

Research libraries: collaboration among institutions going up; shift from collections as products to collections as components of the academy’s knowledge resources; library is supporting and embedded within the process of scholarship.

Notification Service: Knowing who is producing what, and under whose auspices, is critical to a wide range of stakeholders – funders, sponsored research offices, etc.

Researchers produce articles, preprints, presentations, datasets, and also administrative output like grant reports and data management plans. Research release events. Meant to be public.

Consumers of research release events: repositories, sponsored research offices, funders, public. Interest in process as well as product. Today each entity must make arrangements with each of the others to learn what’s going on. Notification service shares metadata about research release events.

Center for Open Science has partnered with SHARE to implement notification service.

Looking for feedback on proposed metadata schema, though the system is schema agnostic.

API – push API and content harvesters (pulling data in from various sources). Now have 24 providers and adding more. 16 use OAI-PMH while 8 use non-standard metadata formats.
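For the providers that speak OAI-PMH, a harvester’s pull is essentially an HTTP GET with a handful of protocol-defined query parameters. The sketch below builds such a request. `verb`, `metadataPrefix`, and `from` come from the OAI-PMH specification; the base URL and the function name are placeholders, and this is not SHARE’s actual harvester code.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, since: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request for records changed since a date (YYYY-MM-DD)."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since,
    })
    return f"{base_url}?{query}"


# The base URL is a placeholder, not a real provider endpoint.
print(list_records_url("https://repository.example.edu/oai", "2014-12-01"))
```

The eight non-standard providers are the expensive ones: each needs its own equivalent of this function plus custom parsing of whatever comes back.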

Harvested data gets put into the Open Science Framework – pushes out RSS/Atom, PubSubHubbub, etc. Sits on top of Elasticsearch. You can add a Lucene-format full-text search to a data request.

250k research release events so far. arXiv and CrossRef are the largest providers. Averaging about 900 events per day. Now averaging 2-3k per day in last few days as new providers are added.

Developed push protocol for providers to push data rather than waiting for pull.

Public release: Early 2015 beta release, fall 2015 first full release.

Some early lessons: Metadata rights issues – some sites not sure about their right to, for example, share abstracts; Is there an explicit license for metadata (e.g. CC Zero)?

Inclusion of identifiers – need some key identifiers to be available in order to create effective notifications. Most sources do not even collect email addresses of authors, much less ORCID or ISNI. Most sources make no effort to collect funding information or grant award numbers. Guidelines? See

Consistency across providers – reduce errors, simplify preparing for new providers. Required for push reporting.

Next layer: Reconciliation service – takes output of notification service to create enhanced and interrelated data set.

Share Discovery – searchable and friendly.

Phase 2 benefits – Researchers can keep everyone informed by keeping anyone informed; institutions can assemble more comprehensive record of impact; open access advocates can hold publishers accountable for promises; other systems can count on consistency of metadata from SHARE.

Relation to Chorus – when items get into Chorus it is a research release event, hopefully will get into notification service.

Information, Interaction, and Influence – Information intensive research initiatives at the University of Chicago

Sam Volchenboum

Center for Research Informatics – established in 2011 in response to the need of researchers in Biological Sciences for clinical data. Hired Bob Grossman to come in and start a data warehouse. Governance structure – important to set policies on storage, delivery, and use of data. Set up a secure data center with HIPAA and FISMA compliance, got certified. Allowed storage and serving of data with PHI. Got approval of infrastructure from IRB to serve up clinical data. Once you strip out identifiers, it’s not under HIPAA. Set up data feeds, had to prove compliance to hospital. Had to go through lots of maneuvers. Released through open source software called i2b2 to discover cohorts meeting specific criteria. Developed data request process to gain access to data. Seemingly simple requests can require considerable coding. Will start charging for services next month. Next phase is a new UI with Google-like search.

Alison Brizius – Center on Robust Decision Making for Climate and Energy Policy

RDCEP is very much in the user community. Highly multi-disciplinary – eight institutions and 19 disciplines. Provide methods and tools to provide policy makers with information in areas of uncertainty. Research is computationally and information intensive. Recurring challenge is pulling large amounts of data from disparate sources and qualities. One example is how to evaluate how crops might fail in reaction to extreme events. Need massive global data and highly specific local data. Scales are often mismatched, e.g. between Iowa and Rwanda. Have used Computation Institute facilities to help with those issues. Need to merge and standardize data across multiple groups in other fields. Finding data and making it useful can dominate research projects. Want researchers to concentrate on analysis. Challenges: Technical – data access, processing, sharing, reproducibility; Cultural – multiple disciplines, what data sharing and access means, incentives for sharing might be mis-aligned.

Michael Wilde – Computation Institute

Fundamental importance of model of computation in overall process of research and science. If science is focused on delivery of knowledge in papers, lots of computation is embedded in those papers. Lots of disciplinary coding that represents huge amounts of intellectual capital. Done in a chaotic way – don’t have a standard for how computation is expressed. If we had such a standard could expand on the availability of computation. We could also trace back what has been done. Started about ten years ago – Grid Physics Network to apply these concepts to the LHC, the Sloan Sky Survey, and LIGO – virtual data. If we shipped along with findings a standard codified directory of how data was derived, could ship computation anywhere on planet, and once findings were obtained, could pass along recipes to colleagues. Making lots of headway, lots of projects using tools. SWIFT – high level programming/workflow language for expressing how data is derived. Modeled as a high level programming language that can also be expressed visually. Trying to apply the kind of thinking that the Web brought to society to make science easier to navigate.

Kyle Chard – Computation Institute

Collaboration around data – Globus project. Produce a research data management service. Allow researchers to manage big data – transfer, sharing, publishing. Goal is to make research as easy as running a business from a coffee shop. Base service is around transfer of large data – gets complex with multi-institutions, making sure data is the same from one place to the other. Globus helps with that. Allow sharing to happen from existing large data stores. Need ways to describe, organize, discover. Investigated metadata – first effort is around publishing – wrap up data, put in a place, describe the data. Leverage resources within the institution – provide a layer on top of that with publication and workflow, get a DOI. Services improve collaboration by allowing researchers to share data. Publication helps not only with public discoverability, but sharing within research groups.
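“Making sure data is the same from one place to the other” is usually a checksum comparison. The sketch below shows the general technique with Python’s standard `hashlib`; it is an illustration of the idea, not a description of how Globus implements its integrity checks, and the function names are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(source: Path, destination: Path) -> bool:
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of(source) == sha256_of(destination)


# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.dat"
    dst = Path(tmp) / "copy.dat"
    src.write_bytes(b"example payload")
    dst.write_bytes(b"example payload")
    print(same_content(src, dst))
```

In a real transfer the two digests would be computed at each end and only the digests compared, since the whole point is that the files live on different systems.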

James Evans – Knowledge Lab

Sociologist, Computation Institute. Knowledge Lab started about a year ago. Driven by a handful of questions: Where does knowledge come from? What drives attention, imagination? What role do social and institutional factors play in what research gets done? How is knowledge shared? Purpose is to marry these questions with the explosion of digital information and the opportunities that provides. Answering four questions: How do we efficiently harvest and share knowledge harvested from all over?; How do we learn how knowledge is made from these traces?; Represent, recombine knowledge in novel ways; Improve ways of acquiring knowledge. Interested in long view – what kinds of questions could be asked? Providing mid-scale funding for research projects. Questions they’ve been asking: How science as an institution thinks and how scientists pick the next experiment; What’s the balance of tradition and innovation in research?; How people understand safety in their environment, using street-view data; Taking data from millions of cancer papers then drive experiments with a knowledge engine; studying peer review – how does review process happen? Challenges – the corpus of science, working with publishers – how to represent themselves as safe harbor that can provide value back; how to engage in rich data annotations at a level that scientists can engage with them?; how to put this in a platform that fosters sustained engagement over time.

Alison Heath – Institute for Genomics and Systems Biology and Center for Data Intensive Science

Open Science Data Cloud – genomics, earth sciences, social sciences. How to leverage cloud infrastructure? How do you download and analyze petabyte size datasets? Create something that looks kind of like Amazon or Google, but with instant access to large science datasets. What ideas do people come up with that involve multiple datasets? How do you analyze millions of genomes? How do you protect the security of that data? How do you create a protected cloud environment for that data? BioNimbus protected data cloud. Hosts bulk of Cancer Genome Project – expected to be about 2.5 petabytes of data. Looked at building infrastructure, now looking at how to grow it and give people access. In past communities themselves have handled data curation – how to make that easier? Tools for mapping data to identifiers, citing data. But data changes – how do you map that? How far back do you keep it? Tech vs. cultural problems – culturally has been difficult. Some data access controlled by NIH – took months to get them to release attributes about who can access data. Email doesn’t scale for those kinds of purposes. Reproducibility – with virtual machines you can save the snapshot to pass it on.


Engagement needs to be well beyond technical. James Evans engaging with humanities researchers. Having equivalent of focus groups around questions over a sustained period – hammering questions, working with data, reimagining projects. Fund people to do projects that link into data. Groups that have multiple post-docs, data-savvy students can work once you give them access. Artisanal researchers need more interaction and interface work. Simplify the pipeline of research outputs – reduce to 4-5 plug and play bins with menus of analysis options. Alison – helps to be your own first user group. Some user communities are technical, some are not. Globus has Web UI, CLI, APIs, etc. About 95% of community use the web interface, which surprised them. Globus has a user experience team, making it easier to use. Easy to get tripped up on certificates, security, authentication – makes it difficult to create good interfaces. Electronic Medical Record companies have no interest in being able to share data across systems – makes it very difficult. CRI – some people see them as service provider, some as a research group. Success is measured differently so they’re trying to track both sets of metrics, and figure out how to pull them out of day-to-day workstreams. Satisfaction of users will be seen in repeat business and comments to the dean’s office, not through surveys. Doing things like providing boilerplate language on methods and results for grants and writing letters of support go a long way towards making life easier for people. CRI provides results with methods and results section ready to use in paper drafts. Should journals require an archived VM for download? Having recipes at right level of abstraction in addition to data is important. Data stored in repositories is typically not high quality – lacks metadata, curation. Can rerun the exact experiment that was run, but not others.
If toolkits automatically produce that recipe for storage and transmission then people will find it easy.


Information, Interaction, and Influence – Digital Science demos

Digital Science is a UK company that is sponsoring this workshop, and they’re starting off the morning by demoing their family of products.

Julia Hawks – VP North America, Symplectic

Founded in 2003 in London to serve the need of researchers and research administrators. Joined Digital Science in 2010. Works with 50 universities – Duke, UC Boulder, Penn, Cornell, Cambridge, Oxford, Melbourne.

Elements – research information management solution. Capturing and collating quality data on faculty members to fulfill reporting needs: annual review, compliance with open access policies, showcasing institutional research through online profiles, tracking the impact of publications (capture citation and bibliometric scores). Trying to reduce burden on faculty members.

How is it done? – automated data capture, organize into profiles, data types configurable, reporting, external systems for visibility (good APIs).

Where does the data come from? External sources – Web of Science, Scopus, host of others, plus internal sources from courses, HR, grants.

Altmetric –

Help researchers find the broader impact of their work. Collect information on articles online in one place. Company founded in 2011 in London. Work with publishers, supplying data to many journals, including Nature, Science, PNAS, JAMA. Also working with librarians and repositories. Some disciplines have better coverage than others.

Altmetric for institutions – allows users within an institution to get an overview of the attention research outputs are getting. Blogs, mainstream media, smaller local papers, and news sources for specific verticals, patents, policy documents.

Product built with an API to feed research information systems, or have a tool called Explorer to browse bibliographies.


ReadCube –

Built for researchers, but also have products for publishers and institutions. Manages publications and articles for reading. Manages a library of PDFs. Has highlighting, annotations, reference lookup. Recommends other articles based on articles in your library.

ReadCube for Publishers – free indexing and discovery service, embedded PDF viewer + data, Checkout – individual article level ecommerce.

ReadCube Access for Institutions – enables institutions to close the collections gap with affordable supplemental access to content. Institutions can pick and choose by title and access type.

figshare – Dan Valin

Three offerings – researcher, publisher, institutions

Created by an academic, for academics. Further the open science movement, build a collaborative portal, change existing workflows. Provides open APIs.

Cloud-based research data management system. Manage research outputs openly or privately with controlled collaborative spaces. Public repository for data.

For institutions – research outputs management and dissemination. Unlimited collaborative spaces that can be drilled down to on a departmental level.

Steve Leicht – UberResearch

Workflow processes – portfolio analysis and reporting, classification, etc. Example – Modeling a classification semantically. Seeing difference across different funding agencies. Can compare different institutions, can hook researchers to ORCID.


Information, Interaction, and Influence – Research networking and profile platforms

Research networking and profile platforms: design, technology and adoption of networking tools

Tanu Malik, UChicago CI – treating science as an object. Need to record inputs and outputs, which is difficult, but some things are relatively easy to document: publications, patents, people, institutions, grants. Some of this has been taking place, documenting metadata associated with science. How can we integrate this data and establish relationships in order to get meaningful knowledge out of it? There have been a few success stories: VIVO, Harvard Profiles. This panel will discuss the data integration challenges and the deployment challenges. Computational methods exist but still need to be implemented in easy to use ways.

Simon Porter – University of Melbourne

Implemented VIVO as Find an Expert – oriented towards students and industry. Now gets around 19k unique visitors per week.

Serendipity as research activity – the maximum number of research opportunities are possible when we can maximize the number of people discovering or engaging with our research. Enabled by policy, enabled by search, enabled by standards, enabled by syndication. 

Australian universities have had to collect the information on research activity all along. Some of it is private, but some is public and the University can assert publication of it. Most universities have something, but lots of different systems.

Only a small number of people will use the search in your system. Most will come from Google. 

Syndicating information within the university – VIVO – gateway to information – departments take information from VIVO to publish their own web pages. Different brands for different departments. 

Syndication beyond the University – Want to plug into international research profiling efforts. 

New possibilities: Building capability maps. How to support research initiatives. Start from people being appointed to start the effort. Use Find An Expert to identify potential academics. Can put together multiple searches to outline capability sets. Graphing interactions of search results. 

Leslie Yuan – Clinical and Translational Science Institute – UCSF

The Profiles team all came from industry – highly oriented towards execution. When she started they wanted lots of people to use it, so how to get adoption? If you build it, they probably won’t come. Use your data and analyses to drive success with a very lean budget. In four years went to over 90k visits per month. Gets 20% of the traffic of the main UCSF web page.


1. Use Google (both inside and outside the institution).  Used SEO on site. 88% of researcher profiles have been viewed 10+ times. Goal was to get every one of researchers to come up in top 3 results when they type the name in. Partnered with University Relations – any article that the press office writes about a researcher links to their profile.

2. Share the data. APIs provide data to 27 UCSF sites and apps. Has made life easier for IT people across the university, leading to evangelization in the departments. Personalized stats are sent to profile owners – how many times your profile was viewed within the institution, from other universities, from major pharmas. People wanted specifics. Nobody unsubscribed. Vanity trumps all.  Research analytics shared with leadership. Helped epidemiology and biostatistics show that they are the most collaborative unit on campus.

3. Keep looking at the data – monthly traffic reporting, engagement stats (by school, by department, who’s edited profile, who’s got pictures), Network visualizations of co-authorships.

4. Researcher engagement – automated onboarding emails – automatically creating profiles, then letting people know about them as they come on board. Added websites, videos, tweets and more inline. Batch loaded all UCTV videos onto people’s profiles, then got UCTV to send email to researchers letting them know. Changed URLS –

5. Partnerships – University Relations, Development & Alumni, Library, UC TV, Directory,  School of Medicine, Center for AIDS research, Dept. of Radiology. Was able to give data back to Univ Relations on articles by department or specialty, which they weren’t tracking. Automatic email that goes out if people get an article added. 

Took 8 or 9 months of concentrated conversations with chairs, deans, etc to convince them that this was a good thing. Only 7 people asked to be taken off the system. Uptake was slow, but now people are seeing the benefit of having their work out there.  6 people on her team have touched the system in some way, but it’s nobody’s full-time job.

Griffin Weber, Harvard – Research Networking at the School, University, and Global Scale

Added passive and active networking to the profiles system. Passive network provided information that people hadn’t seen before, driving adoption, active networks allowed the site to grow over time. Passive network creates networks based on related concepts. Different ways of visualizing the concept maps – list, timeline, co-authors, geography (map), ego-centric radial graph (social network reach), list of similar people
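A passive co-author network of the kind described can be derived purely from publication author lists, with no action required from the researchers themselves. The following is a minimal sketch of that derivation; the function name and sample data are invented for illustration and this is not the Profiles code.

```python
from collections import Counter
from itertools import combinations


def coauthor_counts(publications: list[list[str]]) -> Counter:
    """Count how many publications each pair of authors shares."""
    pairs = Counter()
    for authors in publications:
        # Sort and de-duplicate so (A, B) and (B, A) land on the same key.
        for a, b in combinations(sorted(set(authors)), 2):
            pairs[(a, b)] += 1
    return pairs


# Placeholder author lists, one per publication.
publications = [
    ["Author A", "Author B", "Author C"],
    ["Author A", "Author B"],
    ["Author C", "Author D"],
]
for (a, b), shared in coauthor_counts(publications).most_common():
    print(a, b, shared)
```

The resulting weighted edge list is what feeds visualizations such as the co-author list or the radial graph.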

Different kinds of data for Harvard Faculty Finder – comets and stars discovered, cases presented to the Supreme Court, classes taught, etc. Pulled in 500k publications from Web of Science. Derived ontologies in 250 disciplines across those publications using statistical methods. – federated search across 70 biomed institutions. 

Faculty affairs uses Profiles to form promotions committees, students using it to find mentors. 

Bart Trawick, NCBI – NLM – Easy come, easy go; SciENcv & my bibliography 

NIH gives $15.5 in grants per year. Until 2007 didn’t have a way of seeing what they were getting from the investment. Public access to publications mandated by Congress in 2007. Started using MyBibliography to track. Over 61k grant applications coming in every year, just flat PDFs.

About 125k US trained scientists in the workforce now. Many have been funded by training grants. Want to see how the scientists continue their career. Over 2500 unemployed PhDs in biomedical science.

My NCBI Overview – tools and preferences integrated with NCBI databases. Connected to PubMed, genomics, etc. Uses federated login (can link Google accounts, e.g.). Can link eRA Commons account – pull in information about profiles, grants linked.

My Bibliography – make it a tool to capture information and link grant data to publications. Set up to monitor many of the databases that information flows through. End result of public access policy is that all NIH-funded research publications get deposited in PubMed Central. MyBibliography lets scientists know if they’re compliant with policy. Send structured data back out to PubMed, allowing searching by grant numbers, etc.

SciENcv – released second version this week. Helps scientists fill out profile – each agency has their own biosketch format. SciENcv is an attempt to standardize that. NIH set up, working on others, NSF next on list. Wanted to make it easy for researchers who are already funded and using MyBibliography. Data exists out there – would like to get to a point of reuse of data for grant reporting. Added inputs – ORCID, eRA Commons (used to manage grants), MyBibliography. Biosketches are required in PDF. Can export from SciENcv in PDF too, with rich metadata attached.

Information, Interaction, and Influence 

I’m attending a workshop on Research Information Technologies and their Role in Advancing Science.

Ian Foster from the UChicago Computation Institute is kicking it off. 

We now have the ability to collect data to study how science is conducted. We can also use that data for other purposes: finding collaborators, easing administrative burdens, etc. Those are the reasons we often get funding to build research information systems, but can use those systems to do more interesting things.

Interested in two aspects of this:

1. Treat science itself as an object of study.
2. Can use this information to improve the way we do science. Don Swanson – research program to discover undiscovered public knowledge. 

The challenge we face as a research community as we create research information systems is to bring together large amounts of information from many different places to create systems that are sustainable, scalable, and usable. Universities can’t build them by themselves, and neither can private companies. 

Victoria Stodden – Columbia University (Statistics) – How Science is Different: Digitizing for Discovery

Slides online at:

Tipping point as science becomes digital – how do we confront issues of transparency and public participation? New area for scientists. Other areas have dealt with this, but what is different about science?

1. Collaboration, reproducibility, and reliability: scientific cornerstones and cyberinfrastructure

Scoping the issue – looking at the June issues of Journal of American Statistical Association – how computational is it? Is the code available? 1996 – about half computational, by 2009 almost all computational. In ’96 none talked about getting the code. In 2011, 21% did. Still 4 out of 5 are black boxes. 

In 2011 ? looked at 500 papers in biological sciences. Was able to get data in 9% of the cases.

The scientific method:

Deductive: math, formal logic; Empirical (or inductive): largely centered around statistical analysis of controlled experiments. Computational, simulations, data-driven science, might be 3rd and 4th branches. The Fourth Paradigm.

Credibility Crisis: Lots of discussion in journals and pop press about dissemination and reliability of scientific record. 

Ubiquity of Error: central motivation of scientific method – we realize that our reasoning may be flawed, so we want to hit it against evidence to get closer to the truth. In deductive branch, we have proofs. In empirical branch, we have the machinery of hypothesis testing. Hundreds of years to come up with standards of reliability and reproducibility. The computational aspect is only a potential new branch, until we develop comparable standards. Jon Claerbout (Stanford): “Really Reproducible Research” – an article about computational science is merely the advertisement of the scholarship. The actual scholarship is the set of code and data that generate the article.

Supporting computational science: Dissemination platforms; Workflow tracking and research environments (prepublication); embedded publishing – documents with ways of interacting with code and data. Mostly being put together by academics without reward because they think these are important problems to solve.

Infrastructure design is important and rapidly moving forward.

Research Compendia – a site with dedicated pages which house code and data, so you can download digital scholarly objects. 

2. Driving forces for Infrastructure Development

ICERM Workshop Dec 2012 – reproducibility in computational mathematics. Workshop report that was collaboratively written by attendees. Tries to lay out guidelines for releasing code and data when publishing results. Details about what needs to be described in the publication. 

3. Re-use and re-purposing: Crowd sourcing and evidence-based-***

Reproducible Research is Grassroots.

External drivers: Open science from the White House. OSTP Exec memorandum: federal funding agencies to submit plans within 6 months to say how they will facilitate access to publications and data; in May, an order to federal agencies doing research directing them to make data publicly available. Software is conspicuously absent. Software has a different legal status than data, which sets it apart from data for federal mandating – the Bayh-Dole Act allows universities to claim patents on software.

Science policy in Congress – how do we fund science and what are the deliverables? Much action around publications.

National Science Board 2011 report on Digital Research Data Sharing and Management

Federal funding agencies have had a long-standing commitment to sharing data and (to a degree) software. NSF grant guidelines expect investigators to share data with other researchers at no more than incremental cost. Also encourages investigators to share software. Largely unenforced. How do you hold people’s feet to the fire when definitions are still very slippery? NIH expects and supports timely release of research data for bigger grants (over $500k). NSF data management plan – looks like it’s trying to put meat on the bones of the grant guidelines.

2003 National Academies report on Data Sharing in the Life Sciences.

Institute of Medicine – report on Evolution of Translational Omics: Lessons Learned and the Path Forward. When people tried to replicate work they couldn’t, and found many mistakes. How did work get approved for clinical trial? New standards were recommended. Recommends standards for locking down software.

Openness in Science: Thinking about infrastructure to support this – not part of our practice as computational scientists. Having some sense of permanence of links to data and code. Standards for sharing data and code so they’re usable by others. Just starting to develop.

Traditional Sources of Error: View of the computer as another possible source of error in the discovery process. 

Reproducibility at Scale – May take specialized hardware and long run times? How do we reproduce that? What do we do with enormous output data?

Stability and Robustness of Results: Are the results stable? If I’m using statistical methods, do they add their own variability to the findings?
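One way to make the variability question concrete is to re-run a randomized method under different seeds and look at how much the answer moves. The sketch below is only an illustration and was not part of the talk: the measurements are made up, and `bootstrap_mean` is a hypothetical helper using a plain bootstrap of the mean.

```python
import random
import statistics

def bootstrap_mean(data, n_resamples, seed):
    """Average of resampled means; the seed fixes the random resampling."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original data
        sample = [rng.choice(data) for _ in data]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means)

# Made-up measurements, for illustration only
data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 3.0, 2.2]

# Same data, same method, ten different seeds
estimates = [bootstrap_mean(data, 200, seed) for seed in range(10)]

# Variability contributed by the method itself rather than by the data
spread = max(estimates) - min(estimates)
```

Recording the seed alongside the result is what lets someone else reproduce a specific run exactly; the spread across seeds says how far that single run can be trusted.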

The Larger Community – Greater transparency opens scholarship to a much larger group – crowd sourcing and public engagement in science. Currently most engagement is in collecting data, much less in the analysis and discovery. How do we provide context for use of data? How do we ingest and evaluate findings coming from new sources? New legal issues – copyright, patenting, etc. Greater access has possibility of increasing trust and help inform the debates around evidence-based policy making. 

Legal Barriers – making code and data available. Immediately run into copyright. In US there is no copyright on raw facts, per Supreme Court. Original selection and arrangement of facts is copyrightable. Datasets munged creatively could be copyrightable, but it’s murky. Easiest to put data in public domain, or use CC license. Different in Europe. GPL – includes “sharealike”, preventing use of open source code in proprietary ways. Science norms are slightly different – share work openly to fuel further development wherever it happens.

CNI Fall 2013: Capturing the Ephemeral: Collecting Social Media and Supporting Twitter Research with Social Feed Manager

Daniel Chudnov, Bergis Jules, Daniel Kerchner, Laura Wrubel – George Washington University

Save the time of the researcher:

  • Demand from all across the spectrum from researchers for access to historical Twitter data. Library of Congress is archiving Twitter but not giving access.
  • How are current researchers collecting Twitter data? One example from GWU: By hand – Google Reader (RSS feeds from Twitter), copy and paste into Excel, add coding, then pull into SPSS and Stata. Too much work for too little data (not to mention tools which don’t exist anymore). Copy and paste to Excel doesn’t scale. Not an isolated case. Over 5000 theses and dissertations since 2010 using Twitter data.
  • What researchers ask for: specific users, keywords; basic values (user, date, text, retweets and follower counts); thousands, not millions; need delimited files to import into coding software; need historical data.
  • Getting historical data usually requires getting from licensed Twitter reseller: DataSift, Gnip, NTT Data, Topsy (recently purchased by Apple). There are some research platforms for using Twitter data. Data is not cheap, but they are friendly and receptive to working with researchers (hoping to hear about products being developed specifically for academic community). Used to dealing with customers who can deal with very large datasets.
  • University archives using this to document student organizations, who are highly active on Twitter. Since March tracking 329 accounts with over 299k tweets.
  • Social Feed Manager software is available on GitHub. Django framework with django-social-auth library and tweepy library.
  • Just capturing a single feed doesn’t capture all the depth of interaction – want to expand it further and also use other sources.
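To give a sense of what "delimited files with basic values" might look like in practice, here is a minimal sketch that flattens tweet records into tab-delimited rows. It is not Social Feed Manager's export code; the sample record, its field names, and the helper `tweets_to_delimited` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical, hand-made record; the field names are an assumption
tweets = [
    {
        "user": {"screen_name": "gwu_org", "followers_count": 120},
        "created_at": "2013-03-01T12:00:00Z",
        "text": "Meeting tonight,\nbring a friend",
        "retweet_count": 3,
    },
]

# The basic values researchers asked for
FIELDS = ["user", "date", "text", "retweets", "followers"]

def tweets_to_delimited(records, delimiter="\t"):
    """Flatten tweet records into delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(FIELDS)
    for t in records:
        writer.writerow([
            t["user"]["screen_name"],
            t["created_at"],
            # Collapse newlines and tabs so each tweet stays on one row
            " ".join(t["text"].split()),
            t["retweet_count"],
            t["user"]["followers_count"],
        ])
    return buf.getvalue()

output = tweets_to_delimited(tweets)
```

A file written this way should open in Excel and import into SPSS or Stata without the copy-and-paste step.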

Cliff Lynch – wrap-up discussion [ #rdlmw ]

One tag line – scholarly practice is changing, and that’s what put us all here. That won’t go away just because we’re having trouble dealing with it.

There’s a great search for leverage points where we can get a lot of return for a little investment. Lots of wondering whether there aren’t things we can do as consortia, for example. Other players, like instrument manufacturers. We have to keep looking for these leverage points, but need to realize that this is a sizable and expensive problem that we can’t make go away with one or two magic leverage points.

The discussion we just had about scaling and being involved up front, but being scared about whether we can deal with demand, is a real look at our problems.

Some new discussions about putting data lifecycle and funding strategies on different timelines that have complex interactions – certain funding strategies can distort the lifecycle by making it attractive or necessary for investigators to hold on to data that should be migrating.

This is not a NSF problem, nor a funding agency problem. We need to come up with a system that accommodates unsponsored research too. There’s a significant amount of work that goes on in social sciences and humanities with little or no funding attached.

One of the ugly facts we need to be mindful of is the systematic defunding of (particularly public) higher education, and the pressure for defunding of scholarly research in government agencies. We need to come up with means of data curation and management that allow us to make intelligent decisions about priorities. Saw a striking example of this in the UK when they applied massive cuts to their funding agencies, including defunding of the national archiving system for arts and humanities.

Pleased to see a session on secure data which said more than “this is hard – let’s run away”, which is the usual response. Secure data is probably not the right word. So much work to do on definitions and common languages, so we don’t spend so much time redefining problems – maybe we need to put some short-term effort into definitions. Also, we’re short on facts on the ground. Serge offered some real data on what’s going on with grants at Princeton – we need that campus by campus and rolled up by discipline and national lines. It’s not that hard to get, and there are various projects talking about it.

What’s the balance between enabling sharing and enabling preservation. Often a lot of the investment starts going into preservation and you never get to sharing. Bill Michener gave us a look at a nice set of investment into discovery and reuse systems (DataONE), maybe that’s something that could be federated so we don’t have to build a ton of them.

Was glad to hear about the PASIG work – developed over 3-4 years between Stanford, a group of other institutions and Sun. As we think about the right kinds of industry/university venues for collaboration that’s one to have a look at. In particular look at the agendas for past meetings. Some of the conversations about expected structure of storage market, things that drive tech refresh cycles in storage, are very helpful.

Panel Discussion – Funding Agencies [ #rdlmw ]

    Michael Huerta – NIH/NLM

Benefits of data sharing include transparency, reanalysis, integration, algorithm development
Data sharing has costs
Sharing data is good, but sharing all data probably isn’t.
Should data be shared? Considerations:
– maturity of science – exploratory vs. well understood (might make more sense to share)
– maturity of means of collections – unique means might not be valuable for others
– amount and complexity of data – more might be better for sharing
– utility of the data – to research community and public
– ethical and policy considerations

At NIH have formulas to guide applicants in formulating data sharing plans – important questions to address (NIH requirements kick in for direct costs > $500k/yr, but are revisiting).
– What data will be shared – domain, file type, format type, QA methods, raw and/or processed, individual/summary etc.
– who will have access? Public? research community? more restricted?
– where will data be located? what’s the plan for maintenance?
– when will data be shared? at collection, or publication? incremental release of longitudinal data?
– How will researchers locate and access data?

NIH success stories –
– Data resources – NLM: GenBank, dbGaP, PubChem; NIH Blueprint for Neuroscience Research: NITRC, NIF, Human Connectome Project; NIH National Database for Autism Research – all autism research data from human subjects, federated with other resources.

    Jennifer Schopf – NSF

NSF Data policy is NOT changing – it just wasn’t enforced very broadly.
What has changed is that since mid-January every proposal that comes in must have a data management plan. A DMP may include – types of data, standards for data and metadata, policies for access, policies for re-use, plans for archiving. Community driven and reviewed – there aren’t generally accepted definitions and practices across all the disciplines. It is acceptable to have a plan that says “I don’t plan to share my data” – but then you should probably explain why. Expected to grow and change over time, the same way impact and review criteria have changed over time.

Within NSF, looking at implications of sharing data from a computer science point of view. There is a cross-NSF task force called ACCI data task force (Tony Hey and Dan Atkins).

Trying to enable data-enabled science.

What are the perceived roles of internal support mechanisms for data lifecycles? How are we looking to interact with libraries, local archives, etc? How do researchers, librarians, CIOs, etc think about linking to regional or national efforts, and how can NSF help support this?

    Don Waters – Mellon Foundation

Most humanist disciplines depend on durable data. The digital humanities are, like e-science, changing. Witnessing massive defunding of higher education across the country, so we need to work to address common problems together.

The definition of data needs to be wider than numerical information, but not as broad as bit-level. Don defines data as being primary sources. Scientific data come in many of the same forms as humanistic data.

Data now depends on sensors and capture instruments. There’s a tendency to treat these curation issues as novel – even if they’re new to scientists, they’re familiar to humanists that have had to interpret very rich types of data – data driven scholarship is not new.

What is new is the formalization of traditional interpretive activities and powerful algorithms that can work on this data. Projects in humanities have moved the needle, but there are problems with curating data.

To achieve promise a flexible and scalable repository structure is needed. Mellon has been experimenting for over a decade. ArtStor is one example. Universities and scholarly societies have been willing to step up and provide places to store these data. Bamboo is a virtual research environment for various forms of humanistic data.

A question is raised about how NSF is working with the National Archives – they’ve been collaborating and expect to continue.

Another question is about who is the ultimate owner of data? From Mellon, data are institutionally owned and grants have explicit agreements that require institutions to gather rights from creators. NSF doesn’t have as formal a process, but NSF makes grants to institutions. From NIH, ultimately the owners are those that pay for it, which are the taxpayers. Cliff notes that it’s not clear in the US whether data (a collection of facts) can be owned. So it comes down to who has control over it. Control obligations can be shaped by contracts between funders and institutions. It’s different overseas. Don notes that in the humanities, at least, the data are often other works that have their own IP issues so rights need to be negotiated.

David asks about opportunities for sharing across institutions and disciplines. Mike answers that bringing together resources is useful, and that requires work to converge on common definitions and formats. The work they’ve done with the autism research is a good example. Once you’ve got things defined, the data can reside anywhere and there’s no onus for supporting a large infrastructure. NSF supports a wide variety of research – solutions for sharing are not easy. Shared metadata is a good idea. Don notes that there are differences that get in the way of sharing data and finding ground for shared community is a big part of the work that needs to be done. Some of that has been done at the repository level, much less at the level of tools for use of data.

Grace asks whether funding agencies are starting to do more assessment of longer-term impacts. It seems like innovation is more key to getting funding than sustainability or impact. At Mellon they separate the infrastructure from the innovative – and in Don’s division grants don’t get made unless they have a sustainability story. At NSF, for grants that support infrastructure, evaluation of impact is becoming more common.