The NIH Contribution to the Commons
Philip E Bourne, Associate Director for Data Science, NIH
There’s a realization that the future of biomedical research will be very different – the whole enterprise is becoming more analytical. How can we maximize the rate of discovery in this new environment?
We have come a long way in just one researcher’s career. But there is much to do: too few drugs, not personalized, too long to get to market; rare diseases are ignored; clinical trials are too limited in patients, too expensive, and not retroactive; education and training do not match current market needs; research is not cost effective – not easily replicated, too slow to disseminate. How do we do better?
There is much promise: 100,000 Genomes Project – the goal is to sequence 100,000 people and use the results as a diagnostic tool. Comorbidity network for 6.2 million Danes over 14.9 years – the likelihood that, if you have one disease, you’ll get another. An incredibly powerful tool based on an entire population. We don’t have the facilities in this country to do that yet – we need access to, and homogenization of, the data.
What is NIH doing?
Trying to create an ecosystem to support biomedical research and health care. Too much is lost in the system – grants are made, publications are issued, but much of what went into the publication is lost.
Elements of ecosystem: Community, Policy, Infrastructure. On top of that lay a virtuous research cycle – that’s the driver.
Policies – now & forthcoming
Data sharing: NIH takes this seriously, and now has mandates from the government on how to move forward with sharing. Genomic data sharing is announced; data sharing plans on all research awards; data sharing plan enforcement: a machine-readable plan, plus repository requirements to include grant numbers. If you say you’re going to put data in repository x on date y, it should be easy to check that this has happened and then release the next round of funding without human intervention. NIH is actively looking at that.
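An automated check of that kind could be sketched as follows. This is a minimal illustration only: the plan schema, field names, and the repository lookup are all invented for the sketch – no actual NIH schema or repository API is implied.

```python
import json
from datetime import date

# A hypothetical machine-readable data sharing plan. Field names are
# illustrative only; no actual NIH schema is implied.
PLAN = json.loads("""{
    "grant_number": "R01-XX-000000",
    "repository": "repository-x",
    "promised_date": "2015-06-01"
}""")

def fetch_deposit_record(repository, grant_number):
    """Stand-in for a repository metadata API call. A real check would
    query the repository's search endpoint for deposits tagged with the
    grant number; here we return a canned record for illustration."""
    return {"grant_numbers": ["R01-XX-000000"], "deposited": "2015-05-20"}

def plan_fulfilled(plan):
    """True if the promised deposit exists, cites the grant number, and
    was made on or before the promised date -- the conditions under which
    the next funding could be released without human intervention."""
    record = fetch_deposit_record(plan["repository"], plan["grant_number"])
    if record is None:
        return False
    cites_grant = plan["grant_number"] in record["grant_numbers"]
    on_time = (date.fromisoformat(record["deposited"])
               <= date.fromisoformat(plan["promised_date"]))
    return cites_grant and on_time

print(plan_fulfilled(PLAN))  # True
```

The point of the machine-readable plan is precisely that a check like `plan_fulfilled` can run without a human in the loop.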
Data citation – elevate it to be considered by NIH as a legitimate form of scholarship. Process: a machine-readable standard for data citation (done – a JATS extension; JATS is the XML ingested by PubMed); endorsement of data citation in the NIH biosketch, grants, reports, etc.
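A machine-readable data citation along those lines can be parsed with standard tooling. The fragment below follows JATS conventions (an `element-citation` flagged as citing data, with a DOI as the persistent identifier), but it is an illustrative example, not a normative one:

```python
import xml.etree.ElementTree as ET

# An illustrative JATS-style data citation: an element-citation whose
# publication-type marks it as citing data, with a DOI identifier.
citation_xml = """
<element-citation publication-type="data">
  <source>Example Repository</source>
  <data-title>Variant calls for cohort X</data-title>
  <year>2015</year>
  <pub-id pub-id-type="doi">10.1234/example.dataset</pub-id>
</element-citation>
"""

elem = ET.fromstring(citation_xml)
is_data_citation = elem.get("publication-type") == "data"
doi = elem.findtext("pub-id")  # text of the first pub-id child
print(is_data_citation, doi)  # True 10.1234/example.dataset
```

Because the citation is structured rather than free text, an indexer can pull out the DOI and citation type without any heuristics.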
Big Data to Knowledge (BD2K) initiative. Funded 12 centers of data excellence – each associated with a different type of data. Also funded a data discovery index consortium, building the means of indexing and finding that data. It’s very difficult to find datasets now, which slows the process down. The same can be said of software and standards.
The Commons – A conceptual framework for sharing and being FAIR: Finding, Accessing, Integrating, Reusing
Digital research objects with attribution – can be data, software, narrative, etc. The Commons is agnostic of computing platform.
Digital Objects (with UIDs); Search (indexed metadata)
Public cloud platforms, super computing platforms, other platforms
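The layered picture above – digital research objects carrying unique IDs and attribution, with a search layer over indexed metadata – can be sketched minimally as follows. All names and fields here are illustrative, not any actual Commons schema:

```python
from dataclasses import dataclass, field
import uuid

# A minimal sketch of two Commons layers: digital objects with UIDs and
# attribution, and a search layer over indexed metadata. Illustrative only.

@dataclass
class DigitalObject:
    kind: str        # "data", "software", "narrative", ...
    creator: str     # attribution
    metadata: dict
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))

class MetadataIndex:
    def __init__(self):
        self.objects = {}

    def register(self, obj):
        """Add an object to the index, keyed by its UID."""
        self.objects[obj.uid] = obj
        return obj.uid

    def search(self, **terms):
        """Return objects whose metadata matches every given term."""
        return [o for o in self.objects.values()
                if all(o.metadata.get(k) == v for k, v in terms.items())]

index = MetadataIndex()
uid = index.register(DigitalObject("data", "Lab A",
                                   {"organism": "human", "assay": "WGS"}))
hits = index.search(organism="human")
print(len(hits), hits[0].uid == uid)  # 1 True
```

Note the platform layer does not appear here at all: because objects are addressed by UID and found via metadata, the sketch is agnostic of where they are stored, which is the point made above.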
Research object IDs are under discussion by the community – BD2K centers, NCI cloud pilots (Google and AWS supported), large public data sets, MODs (model organism databases). A meeting in January in the UK – could be DOIs or some other form of identifier.
Search – BD2K data and software discovery indices; Google search functions
Appropriate APIs are being developed by the community, e.g. by the Global Alliance for Genomics and Health. I want to know what variation there is at chromosome 7, position x, across the human population. With the Commons, more of those kinds of questions can be answered. Beacon is an app being tested – a test both of people’s willingness to share and of people’s willingness to build tools.
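The essence of that kind of query can be shown in a toy Beacon-style lookup. The data and the interface below are made up for illustration; the real GA4GH Beacon protocol defines its own request and response formats:

```python
# Toy Beacon-style lookup: does any shared dataset contain variation at
# a given chromosome and position? The variant set is invented.
VARIANTS = {("7", 140453136), ("17", 41276045)}  # (chromosome, position)

def beacon_query(chromosome, position):
    """Answer only yes/no. Returning a boolean rather than records is
    the key design choice: it lets sites participate in sharing without
    exposing individual-level data."""
    return (chromosome, position) in VARIANTS

print(beacon_query("7", 140453136))  # True
print(beacon_query("7", 1))          # False
```

Even this trivial yes/no interface illustrates why Beacon is a good test of willingness to share: the cost of participation is low and the disclosure risk is deliberately minimized.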
The Commons business model: what happens right now is that people write a grant to NIH, with line items to manage data resources. If successful they get money – then what happens? Maybe some of it gets siphoned off to do something else. Or equipment gets bought that is not heavily utilized. As we move more and more to cloud resources, it’s easier to think about a business model based on credit. Instead of getting hard cash, you’re given an amount of credit that you can spend at any Commons-compliant service, where compliance means they’ve agreed to share. The institution could be part of the Commons, or it could be a public cloud or some other kind of resource. This creates more of a supply-and-demand environment and enables public/private partnership. We want to test the idea that more can be done with the same computing dollars. NIH doesn’t actually know how much it spends on computation and data activities – but it is undoubtedly over a billion dollars per year.
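The credit idea can be modeled in a few lines. This is purely a sketch of the concept described above; the service names and the compliance list are invented:

```python
# Toy model of the credit-based funding idea: awardees receive compute
# credits instead of cash, spendable only at Commons-compliant services
# (those that have agreed to share). Entirely illustrative.

COMPLIANT_SERVICES = {"public-cloud-a", "campus-cluster-b"}

class CreditAccount:
    def __init__(self, credits):
        self.credits = credits

    def spend(self, service, amount):
        """Debit credits, but only at a Commons-compliant service."""
        if service not in COMPLIANT_SERVICES:
            raise ValueError(f"{service} is not Commons compliant")
        if amount > self.credits:
            raise ValueError("insufficient credits")
        self.credits -= amount
        return self.credits

acct = CreditAccount(100)
print(acct.spend("public-cloud-a", 30))  # 70
```

The restriction to compliant services is what turns the award into leverage: providers compete for the credits, but only if they agree to share, which is the supply-and-demand environment the model aims for.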
Community: training initiatives. Build an OPEN digital framework for data science training: an NIH data science workforce development center (a call will go out soon). How do you create metadata around physical and virtual courses? Develop short-term training opportunities – e.g. a supported workshop with the gaming community. Develop the discipline of biomedical data science and support cross-training – OPEN courseware.
What is needed? Some examples from across the ICs:
Homogenization of disparate large unstructured datasets; deriving structure from unstructured data; feature mapping and comparison from image data; visualization and analysis of multi-dimensional phenotypic datasets; causal modeling of large-scale dynamic networks and subsequent discovery.
In the process of standing up the Commons with two or three projects – centers funded from BD2K who are interested, working with one or two public cloud providers. Looking to pilot specific reference datasets – how to stimulate accessibility and quality. Having discussions with other federal agencies who are also exploring these kinds of ideas. FDA is looking to burst content out into the cloud. In Europe, ELIXIR is a set of countries standing up nodes to support biomedical research. Having discussions to see if that can work with the Commons.
There’s a role for librarians, but it requires significant retraining in order to curate data. You have to understand the data in order to curate it. Being part of a collective that curates data and works with the faculty and students who use that data is useful, but it’s a cultural shift. The real question is: what’s the business model? Where’s the gain for the institution?
We may have problems with the data we already have, but that’s only a tiny fraction of the data we will have. The emphasis on precision medicine will increase dramatically in the coming years.
Our model right now is that all data is created equal, which is clearly not the case. But we don’t know which data is created more equal. If a grant runs out and there is a clear indication that its data is still in use, perhaps there should be funding to continue maintenance.