Panel Discussion – Funding Agencies [ #rdlmw ]

    Michael Huerta – NIH/NLM

Benefits of data sharing include transparency, reanalysis, integration, and algorithm development.
Data sharing has costs
Sharing data is good, but sharing all data probably isn’t.
Should data be shared? Considerations:
– maturity of science – exploratory vs. well understood (might make more sense to share)
– maturity of means of collection – unique means might not be valuable for others
– amount and complexity of data – more might be better for sharing
– utility of the data – to research community and public
– ethical and policy considerations

NIH has formulas to guide applicants in formulating data sharing plans – important questions to address (NIH requirements kick in for direct costs > $500k/yr, but they are revisiting this).
– What data will be shared – domain, file type, format type, QA methods, raw and/or processed, individual/summary etc.
– who will have access? Public? research community? more restricted?
– where will data be located? what’s the plan for maintenance?
– when will data be shared? at collection, or publication? incremental release of longitudinal data?
– How will researchers locate and access data?
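
The five questions above map naturally onto a structured record. The following is a minimal sketch in Python of how an applicant might check a draft plan for unanswered questions. The class name, field names, and sample values are hypothetical and chosen only for illustration; this is not an NIH form or schema.

```python
from dataclasses import dataclass, fields


@dataclass
class DataSharingPlan:
    """Hypothetical record; field names are illustrative, not an NIH schema."""
    what: str = ""   # domain, file type, format, QA methods, raw and/or processed
    who: str = ""    # public, research community, or more restricted
    where: str = ""  # location of the data and the plan for maintenance
    when: str = ""   # at collection, at publication, or incremental release
    how: str = ""    # how researchers will locate and access the data


def unanswered(plan: DataSharingPlan) -> list[str]:
    """Return the names of the questions the plan leaves blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# Made-up draft that answers only the first two questions.
draft = DataSharingPlan(what="processed summary data", who="research community")
print(unanswered(draft))  # ['where', 'when', 'how']
```

A flat record like this is enough for a completeness check; a real plan would need free-text justification for each answer.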

NIH success stories –
– Data resources – NLM: GenBank, dbGaP, PubChem; NIH Blueprint for Neuroscience Research: NITRC, NIF, Human Connectome Project; NIH National Database for Autism Research – all autism research data from human subjects, federated with other resources.

    Jennifer Schopf – NSF

NSF Data policy is NOT changing – it just wasn’t enforced very broadly.
What has changed is that since mid-January every proposal that comes in must have a data management plan (DMP). A DMP may include types of data, standards for data and metadata, policies for access, policies for re-use, and plans for archiving. Plans are community driven and reviewed, since there aren’t generally accepted definitions and practices across all the disciplines. It is acceptable to have a plan that says “I don’t plan to share my data” – but then you should probably explain why. The requirement is expected to grow and change over time, the same way impact and review criteria have changed over time.

Within NSF, looking at implications of sharing data from a computer science point of view. There is a cross-NSF task force called ACCI data task force (Tony Hey and Dan Atkins).

Trying to enable data-enabled science.

What are the perceived roles of internal support mechanisms for data lifecycles? How are we looking to interact with libraries, local archives, etc.? How do researchers, librarians, CIOs, etc. think about linking to regional or national efforts, and how can NSF help support this?

    Don Waters – Mellon Foundation

Most humanist disciplines depend on durable data. The digital humanities are, like e-science, changing. We are witnessing massive defunding of higher education across the country, so we need to work to address common problems together.

The definition of data needs to be wider than numerical information, but not as broad as bit-level. Don defines data as being primary sources. Scientific data come in many of the same forms as humanistic data.

Data now depends on sensors and capture instruments. There’s a tendency to treat these curation issues as novel – even if they’re new to scientists, they’re familiar to humanists who have had to interpret very rich types of data – data driven scholarship is not new.

What is new is the formalization of traditional interpretive activities and powerful algorithms that can work on this data. Projects in humanities have moved the needle, but there are problems with curating data.

To achieve this promise, a flexible and scalable repository structure is needed. Mellon has been experimenting for over a decade. ArtStor is one example. Universities and scholarly societies have been willing to step up and provide places to store these data. Bamboo is a virtual research environment for various forms of humanistic data.

A question is raised about how NSF is working with the National Archives – they’ve been collaborating and expect to continue.

Another question is about who is the ultimate owner of data? From Mellon, data are institutionally owned and grants have explicit agreements that require institutions to gather rights from creators. NSF doesn’t have as formal a process, but NSF makes grants to institutions. From NIH, ultimately the owners are those that pay for it, which are the taxpayers. Cliff notes that it’s not clear in the US whether data (a collection of facts) can be owned. So it comes down to who has control over it. Control obligations can be shaped by contracts between funders and institutions. It’s different overseas. Don notes that in the humanities, at least, the data are often other works that have their own IP issues so rights need to be negotiated.

David asks about opportunities for sharing across institutions and disciplines. Mike answers that bringing together resources is useful, and that requires work to converge on common definitions and formats. The work they’ve done with the autism research is a good example. Once you’ve got things defined, the data can reside anywhere and there’s no onus for supporting a large infrastructure. NSF supports a wide variety of research – solutions for sharing are not easy. Shared metadata is a good idea. Don notes that there are differences that get in the way of sharing data and finding ground for shared community is a big part of the work that needs to be done. Some of that has been done at the repository level, much less at the level of tools for use of data.

Grace asks whether funding agencies are starting to do more assessment of longer-term impacts. It seems like innovation is more key to getting funding than sustainability or impact. At Mellon they separate the infrastructure from the innovative – and in Don’s division grants don’t get made unless they have a sustainability story. At NSF, for grants that support infrastructure, evaluation of impact is becoming more common.


Brian Athey – Big Data 2011 [ #rdlmw ]

Brian Athey is a professor in the Medical School at the University of Michigan.

It’s difficult to incentivize researchers to share data.

Agile data integration is an engine that drives discovery.

Developing a personal health system requires combining data extracted from genomics with data extracted from the clinical record of the individual.

There’s a disconnect between classic IT’s “command and control” approach and what actually happens in research labs. We want to achieve a focused collaboration balancing high levels of focus and participation.

Next gen sequencing – turning out around 10 terabytes per day at Michigan, from 1500 users.

In 2006 there was a knee in the curve where it became more economical to generate the genomic data than to store it. We have to make decisions about what we store – we can’t save everything.

Brian is working on a Federated Enterprise Data Warehouse that stores both clinical and research data. There’s an “honest broker” that mediates the data accessible to the research side.
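
The talk did not describe how the honest broker works internally. As a rough illustration of the general idea only, the sketch below assumes a broker that strips direct identifiers and substitutes a keyed pseudonym before a record reaches the research side. The class, the field names, and the identifier list are all hypothetical, and this is not a description of the Michigan warehouse.

```python
import hashlib
import hmac

# Assumption: which fields count as direct identifiers. A real list would be
# set by policy, not hard-coded.
DIRECT_IDENTIFIERS = {"name", "mrn", "date_of_birth"}


class HonestBroker:
    """Hypothetical broker sitting between clinical and research data."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key  # held by the broker only, never by researchers

    def pseudonym(self, patient_id: str) -> str:
        # Keyed hash: stable for the same patient, not reversible without the key.
        digest = hmac.new(self._key, patient_id.encode(), hashlib.sha256)
        return digest.hexdigest()[:16]

    def release(self, record: dict) -> dict:
        # Drop direct identifiers, then attach the pseudonymous subject ID.
        out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        out["subject"] = self.pseudonym(record["mrn"])
        return out


broker = HonestBroker(b"demo-key-not-for-real-use")
# Made-up clinical record for illustration.
clinical = {"mrn": "A-0001", "name": "Jane Doe",
            "date_of_birth": "1970-01-01", "genotype": "variant-x"}
released = broker.release(clinical)
print(sorted(released))  # ['genotype', 'subject']
```

Because the pseudonym is stable, research records for the same individual can still be linked to each other, while re-identification requires going back through the broker.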

PCAST NITRD “Big Data” report from November. Has a list of recommendations.

We are all challenged by having to bring heterogeneous data together. Working with Johnson and Johnson on something called tranSMART – J&J have over 400 pharma research databases.

Clinicians have workflow – researchers don’t.

Discussion items:
IT doesn’t own the problem.
The rise of “architecture”
Data governance – who owns the data? Bring them into the room. But there also have to be top-down convenors.
Privacy, security, confidentiality – the idea of the “honest broker” could be a model.
Cost and value-centered models – if we remain just a cost center we’re cooked.

Question – why can’t we keep all the data? The “Best Buy conundrum” – why do you charge me so much for storage when I can get it elsewhere cheap? It takes money to curate and level out the chaos. Maybe we should let the researchers decide what stays and what goes. The questioner, who deals with crystallography data and works with people handling NASA data, says that they’ve learned that getting rid of raw data is a huge mistake. Vijay notes that the cost of hardware is now only 5% of the cost of storage – it’s people and facilities that cost.
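
Vijay’s 5% figure goes some way toward answering the “Best Buy conundrum”: if hardware is 5% of the total, the full cost of storage is about 20 times the hardware price. A minimal sketch of that arithmetic in Python follows; the function name and the sample price are made up for illustration, and only the 5% share comes from the discussion.

```python
def total_storage_cost(hardware_cost: float, hardware_share: float = 0.05) -> float:
    """Scale a hardware price up to a total cost, given hardware's share of the total."""
    if not 0 < hardware_share <= 1:
        raise ValueError("hardware_share must be in (0, 1]")
    return hardware_cost / hardware_share


# Hypothetical retail price per terabyte; not a figure from the talk.
retail_price_per_tb = 100.0
total = total_storage_cost(retail_price_per_tb)
print(f"hardware: {retail_price_per_tb:.0f}, total: {total:.0f}, "
      f"people and facilities: {total - retail_price_per_tb:.0f}")
```

The gap between the shelf price and the total is the share attributed to people and facilities.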