Defining the problem space –
Grand challenges are difficult. We’re using different languages and standards for how we deal with our domain data. Most scientists complain that they’re spending most of their time doing mundane data management and integration. Only a small part of their time is spent on the analysis.
The data deluge – lots of sensors.
Proliferation of citizen science programs. A whole new way of doing science.
Data silos. Lots of big repositories, tons of small ones, each using its own, non-interoperable data standards. This creates the long tail of orphan data – scattered worldwide.
Data entropy – most scientists are really familiar with their data just prior to publication. They may or may not document the intricacies of the data, and we lose the ability to use the data over time.
Community Engagement – funded for 1.5 years by NSF. Will be releasing infrastructure in December 2011. Started interacting with the scientific community two years ago via interviews, surveys, etc. What are the challenges scientists are facing?
Recent study found (in earth sciences) >80% would be willing to share data, across a broad group of researchers.
Stakeholder needs – what are data management plans? How do I describe and preserve my data?
Brought an array of people into the room to look at continental bird migration. What do we need to answer this? 31 different data layers, including a single researcher in Utah with data in his desk. Data discovery is an issue. Needed lots of compute cycles, which was a shock. Took an initial 0.5 million hours on TeraGrid, and more later. Also needed visualization tools. One of the datasets used, eBird, is a citizen science data source. Produced State of the Birds report 2011.
Cyberinfrastructure support – goal is to enable new sciences through universal access to data about life on earth, the environment, plus access to key tools. Three precepts: 1. Build on existing cyberinfrastructure; 2. Create new cyberinfrastructure; 3. Support communities of practice – we’ve ignored this over time.
Member nodes – data repositories that already exist. Coordinating nodes – retain complete metadata catalog, indexing for search, network-wide services, ensure content availability (preservation), replication services (would like to see data in 3 or more repositories). Investigator toolkit – familiar to scientists, integrated with data resources.
First three member node prototypes – ORNL-DAAC, Dryad, KNB.
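The split between member nodes and coordinating nodes can be pictured with a small sketch. This is not DataONE's actual API: the CoordinatingNode class, its methods, and the dataset identifiers are all hypothetical, and the catalog is reduced to a map from dataset to the member nodes holding a copy. It only illustrates how a coordinating node might flag data that falls short of the "3 or more repositories" goal.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Target taken from the notes: data should live in 3 or more repositories.
MIN_REPLICAS = 3


@dataclass
class CoordinatingNode:
    """Hypothetical catalog: dataset identifier -> member nodes holding a copy."""
    catalog: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, dataset_id: str, member_node: str) -> None:
        """Record that a member node holds a copy of the dataset."""
        self.catalog.setdefault(dataset_id, set()).add(member_node)

    def under_replicated(self) -> Dict[str, int]:
        """Return dataset identifier -> number of additional copies needed."""
        return {
            dataset_id: MIN_REPLICAS - len(nodes)
            for dataset_id, nodes in self.catalog.items()
            if len(nodes) < MIN_REPLICAS
        }


cn = CoordinatingNode()
# Member node names come from the notes; the dataset identifiers are made up.
for node in ("ORNL-DAAC", "Dryad", "KNB"):
    cn.register("dataset-a", node)
cn.register("dataset-b", "KNB")

print(cn.under_replicated())
```

Here dataset-a already sits on three member nodes, while dataset-b is held only by KNB and would be reported as needing two more copies.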
Some evidence is beginning to show that when you share your data, citation rates to your publications go up.
Working with Microsoft Research to make Excel a more powerful tool.
Added (impending release) a Data Management Planning Tool (DMPTool) for building a data management plan – wizard-driven.
DataONE includes a powerful data discovery tool.
Education and Training – there’s a lot to do! In DataONE, created DataONEpedia – best practices. Scientists want one-pagers, not detailed manuals.