Benefits of data sharing include transparency, reanalysis, integration, and algorithm development
Data sharing has costs
Sharing data is good, but sharing all data probably isn’t.
Should data be shared? Considerations:
– maturity of science – exploratory vs. well understood (might make more sense to share)
– maturity of means of collection – unique means might not be valuable for others
– amount and complexity of data – more might be better for sharing
– utility of the data – to research community and public
– ethical and policy considerations
NIH has formulas to guide applicants in formulating data sharing plans – important questions to address (NIH requirements kick in for direct costs > $500k/yr, but they are revisiting this).
– What data will be shared? Domain, file type, format type, QA methods, raw and/or processed, individual/summary, etc.
– Who will have access? Public? Research community? More restricted?
– Where will data be located? What’s the plan for maintenance?
– When will data be shared? At collection, or publication? Incremental release of longitudinal data?
– How will researchers locate and access data?
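The questions above can be read as a checklist. The following is a minimal sketch in Python of how an applicant might track which questions a draft plan still leaves open. The class name, field names, and the plan_required helper are hypothetical illustrations, not an NIH schema or tool; the only figure taken from the notes is the $500k/yr direct-cost threshold, which NIH was said to be revisiting.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DataSharingPlan:
    """Hypothetical checklist mirroring the five questions in the notes."""
    what: Optional[str] = None   # what data will be shared (type, format, QA, raw/processed)
    who: Optional[str] = None    # who will have access
    where: Optional[str] = None  # where the data will live and how it is maintained
    when: Optional[str] = None   # when the data will be released
    how: Optional[str] = None    # how researchers will locate and access the data

    def unanswered(self) -> List[str]:
        """Return the names of questions that have no answer yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def plan_required(direct_costs_per_year: float, threshold: float = 500_000) -> bool:
    """Assumed rule from the notes: required when direct costs exceed the threshold."""
    return direct_costs_per_year > threshold


draft = DataSharingPlan(what="processed data, summary level", who="research community")
print(draft.unanswered())
print(plan_required(600_000))
```

A draft that answers only the first two questions reports the remaining three as open, which is a cheap way to check a plan before submission.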
NIH success stories –
– Data resources – NLM: GenBank, dbGaP, PubChem, ClinicalTrials.gov; NIH Blueprint for Neuroscience Research: NITRC, NIF, Human Connectome Project; NIH National Database for Autism Research – all autism research data from human subjects, federated with other resources.
NSF Data policy is NOT changing – it just wasn’t enforced very broadly.
What has changed is that since mid-January every proposal that comes in must have a data management plan (DMP). The DMP may include types of data, standards for data and metadata, policies for access, policies for re-use, and plans for archiving. Plans are community driven and reviewed, since there aren’t generally accepted definitions and practices across all the disciplines. It is acceptable to have a plan that says “I don’t plan to share my data” – but then you should probably explain why. The requirement is expected to grow and change over time, the same way impact and review criteria have changed over time.
Within NSF, looking at implications of sharing data from a computer science point of view. There is a cross-NSF task force called ACCI data task force (Tony Hey and Dan Atkins).
Trying to enable data-enabled science.
What are the perceived roles of internal support mechanisms for data lifecycles? How are we looking to interact with libraries, local archives, etc.? How do researchers, librarians, CIOs, etc. think about linking to regional or national efforts, and how can NSF help support this?
Don Waters – Mellon Foundation
Most humanist disciplines depend on durable data. The digital humanities are, like e-science, changing. We are witnessing massive defunding of higher education across the country, so we need to work together to address common problems.
The definition of data needs to be wider than numerical information, but not as broad as bit-level. Don defines data as being primary sources. Scientific data come in many of the same forms as humanistic data.
Data now depends on sensors and capture instruments. There’s a tendency to treat these curation issues as novel – even if they’re new to scientists, they’re familiar to humanists that have had to interpret very rich types of data – data driven scholarship is not new.
What is new is the formalization of traditional interpretive activities and powerful algorithms that can work on this data. Projects in humanities have moved the needle, but there are problems with curating data.
To achieve this promise, a flexible and scalable repository structure is needed. Mellon has been experimenting for over a decade. ARTstor is one example. Universities and scholarly societies have been willing to step up and provide places to store these data. Bamboo is a virtual research environment for various forms of humanistic data.
A question is raised about how NSF is working with the National Archives – they’ve been collaborating and expect to continue.
Another question is about who is the ultimate owner of data? From Mellon, data are institutionally owned and grants have explicit agreements that require institutions to gather rights from creators. NSF doesn’t have as formal a process, but NSF makes grants to institutions. From NIH, ultimately the owners are those that pay for it, which are the taxpayers. Cliff notes that it’s not clear in the US whether data (a collection of facts) can be owned. So it comes down to who has control over it. Control obligations can be shaped by contracts between funders and institutions. It’s different overseas. Don notes that in the humanities, at least, the data are often other works that have their own IP issues so rights need to be negotiated.
David asks about opportunities for sharing across institutions and disciplines. Mike answers that bringing together resources is useful, and that requires work to converge on common definitions and formats. The work they’ve done with the autism research is a good example. Once you’ve got things defined, the data can reside anywhere and there’s no onus for supporting a large infrastructure. NSF supports a wide variety of research – solutions for sharing are not easy. Shared metadata is a good idea. Don notes that there are differences that get in the way of sharing data and finding ground for shared community is a big part of the work that needs to be done. Some of that has been done at the repository level, much less at the level of tools for use of data.
Grace asks whether funding agencies are starting to do more assessment of longer-term impacts. It seems like innovation is more key to getting funding than sustainability or impact. At Mellon they separate the infrastructure from the innovative – and in Don’s division grants don’t get made unless they have a sustainability story. At NSF, for grants that support infrastructure, evaluation of impact is becoming more common.