Center for Research Informatics – established in 2011 in response to the need of researchers in Biological Sciences for clinical data. Hired Bob Grossman to come in and start a data warehouse. Governance structure – important to set policies on storage, delivery, and use of data. Set up a secure, HIPAA- and FISMA-compliant data center, got certified. Allowed storage and serving of data with PHI. Got approval of infrastructure from IRB to serve up clinical data. Once you strip out identifiers, it’s not under HIPAA. Set up data feeds, had to prove compliance to hospital. Had to go through lots of maneuvers. Released data through open source software called i2b2, used to discover cohorts meeting specific criteria. Developed data request process to gain access to data. Seemingly simple requests can require considerable coding. Will start charging for services next month. Next phase is a new UI with Google-like search.
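A minimal Python sketch of the two ideas above: stripping identifiers, then finding a cohort that meets specific criteria. This is not the i2b2 API and not CRI's process. The field names, function names, and records are all hypothetical, and a real de-identification has to cover every identifier category the governing policy lists.

```python
# Hypothetical identifier fields; a real policy lists many more categories.
IDENTIFIER_FIELDS = {"name", "mrn", "address"}


def strip_identifiers(record):
    """Return a copy of the record without the fields listed as identifiers."""
    return {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}


def find_cohort(records, min_age, diagnosis):
    """Return de-identified records at or above min_age with the given diagnosis."""
    return [
        strip_identifiers(r)
        for r in records
        if r["age"] >= min_age and r["diagnosis"] == diagnosis
    ]


# Made-up toy records, for illustration only.
records = [
    {"name": "Patient A", "mrn": "0001", "address": "n/a", "age": 67, "diagnosis": "diabetes"},
    {"name": "Patient B", "mrn": "0002", "address": "n/a", "age": 45, "diagnosis": "diabetes"},
    {"name": "Patient C", "mrn": "0003", "address": "n/a", "age": 71, "diagnosis": "asthma"},
]
cohort = find_cohort(records, min_age=60, diagnosis="diabetes")
```

Even this toy query needs explicit decisions about fields and thresholds, which is one reason seemingly simple requests can require considerable coding.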
Alison Brizious – Center on Robust Decision Making for Climate and Energy Policy
RDCEP is very much in the user community. Highly multi-disciplinary – eight institutions and 19 disciplines. Provide methods and tools that give policy makers information in areas of uncertainty. Research is computationally and information intensive. Recurring challenge is pulling together large amounts of data from disparate sources and of varying quality. One example is how to evaluate how crops might fail in reaction to extreme events. Need massive global data and highly specific local data. Scales are often mismatched, e.g. between Iowa and Rwanda. Have used Computation Institute facilities to help with those issues. Need to merge and standardize data across multiple groups in other fields. Finding data and making it useful can dominate research projects. Want researchers to concentrate on analysis. Challenges: Technical – data access, processing, sharing, reproducibility; Cultural – multiple disciplines, what data sharing and access means, incentives for sharing might be misaligned.
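One common way to reconcile mismatched scales is to aggregate fine-grained local observations onto the coarser grid used by a global dataset. A minimal Python sketch, assuming point observations given as (latitude, longitude, value); the function names, cell size, and values are hypothetical and are not RDCEP's method or data.

```python
import math
from collections import defaultdict


def grid_cell(lat, lon, cell_size_deg):
    """Map a coordinate to the index of the grid cell that contains it."""
    return (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))


def aggregate_to_grid(observations, cell_size_deg):
    """Average point observations (lat, lon, value) within each grid cell."""
    buckets = defaultdict(list)
    for lat, lon, value in observations:
        buckets[grid_cell(lat, lon, cell_size_deg)].append(value)
    return {cell: sum(vals) / len(vals) for cell, vals in buckets.items()}


# Made-up observations, for illustration only.
observations = [(41.6, -93.6, 10.0), (41.9, -93.1, 12.0), (-1.9, 30.1, 4.0)]
gridded = aggregate_to_grid(observations, cell_size_deg=1.0)
```

Once local data sits on the same grid as the global data, the two can be merged cell by cell.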
Michael Wilde – Computation Institute
Fundamental importance of model of computation in overall process of research and science. If science is focused on delivery of knowledge in papers, lots of computation is embedded in those papers. Lots of disciplinary coding that represents huge amounts of intellectual capital. Done in a chaotic way – don’t have a standard for how computation is expressed. If we had such a standard, could expand on the availability of computation. We could also trace back what has been done. Started about ten years ago – Grid Physics Network to apply these concepts to the LHC, the Sloan Sky Survey, and LIGO – virtual data. If we shipped along with findings a standard codified directory of how data was derived, could ship computation anywhere on planet, and once findings were obtained, could pass along recipes to colleagues. Making lots of headway, lots of projects using tools. Swift – high level programming/workflow language for expressing how data is derived. Modeled as a high level programming language that can also be expressed visually. Trying to apply the kind of thinking that the Web brought to society to make science easier to navigate.
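A rough illustration of shipping a recipe along with a finding. This is plain Python, not Swift, and not how Swift or the virtual data tools work internally; the decorator `recorded`, the `RECIPE` log, and the two toy steps are hypothetical names made up for this sketch.

```python
import functools
import json

RECIPE = []  # hypothetical in-memory log of derivation steps


def recorded(step):
    """Wrap a function so each call appends its name, inputs, and output to RECIPE."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        result = step(*args, **kwargs)
        RECIPE.append({
            "step": step.__name__,
            "args": list(args),
            "kwargs": kwargs,
            "result": result,
        })
        return result
    return wrapper


@recorded
def calibrate(readings, offset):
    """Toy derivation step: subtract a fixed offset from each reading."""
    return [r - offset for r in readings]


@recorded
def mean(values):
    """Toy derivation step: arithmetic mean."""
    return sum(values) / len(values)


finding = mean(calibrate([10.5, 11.5, 12.5], offset=0.5))
recipe_json = json.dumps(RECIPE, indent=2)  # the recipe that travels with the finding
```

The JSON lists each step in the order it ran, so a colleague can trace back how the finding was derived.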
Kyle Chard – Computation Institute
Collaboration around data – Globus project. Produce a research data management service. Allow researchers to manage big data – transfer, sharing, publishing. Goal is to make research as easy as running a business from a coffee shop. Base service is around transfer of large data – gets complex across multiple institutions, making sure data is the same from one place to the other. Globus helps with that. Allow sharing to happen from existing large data stores. Need ways to describe, organize, discover. Investigated metadata – first effort is around publishing – wrap up data, put in a place, describe the data. Leverage resources within the institution – provide a layer on top of that with publication and workflow, get a DOI. Services improve collaboration by allowing researchers to share data. Publication helps not only with public discoverability, but also with sharing within research groups.
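One standard way to make sure data is the same from one place to the other is to compare checksums at both ends. A minimal Python sketch using the standard library; this is not the Globus API or its implementation, and the function names `file_checksum` and `verify_transfer` are hypothetical.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source, destination):
    """Return True when the source and destination copies have identical checksums."""
    return file_checksum(source) == file_checksum(destination)
```

In practice the two files sit on different machines, so each side computes its own checksum and only the short digests are compared.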
James Evans – Knowledge Lab
Sociologist, Computation Institute. Knowledge Lab started about a year ago. Driven by a handful of questions: Where does knowledge come from? What drives attention, imagination? What role do social and institutional factors play in what research gets done? How is knowledge shared? Purpose is to marry questions with the explosion of digital information and the opportunities that provides. Answering four questions: How do we efficiently harvest and share knowledge from all over?; How do we learn how knowledge is made from these traces?; How do we represent and recombine knowledge in novel ways?; How do we improve ways of acquiring knowledge? Interested in long view – what kinds of questions could be asked? Providing mid-scale funding for research projects. Questions they’ve been asking: How science as an institution thinks and how scientists pick the next experiment; What’s the balance of tradition and innovation in research?; How people understand safety in their environment, using street-view data; Taking data from millions of cancer papers to drive experiments with a knowledge engine; Studying peer review – how does review process happen? Challenges – the corpus of science, working with publishers – how to represent themselves as safe harbor that can provide value back; how to engage in rich data annotations at a level that scientists can engage with; how to put this in a platform that fosters sustained engagement over time.
Alison Heath – Institute for Genomics and Systems Biology and Center for Data Intensive Science
Open Science Data Cloud – genomics, earth sciences, social sciences. How to leverage cloud infrastructure? How do you download and analyze petabyte-size datasets? Create something that looks kind of like Amazon or Google, but with instant access to large science datasets. What ideas do people come up with that involve multiple datasets? How do you analyze millions of genomes? How do you protect the security of that data? How do you create a protected cloud environment for that data? BioNimbus protected data cloud. Hosts bulk of Cancer Genome Project – expected to be about 2.5 petabytes of data. Looked at building infrastructure, now looking at how to grow it and give people access. In past communities themselves have handled data curation – how to make that easier? Tools for mapping data to identifiers, citing data. But data changes – how do you map that? How far back do you keep it? Tech vs. cultural problems – culturally has been difficult. Some data access controlled by NIH – took months to get them to release attributes about who can access data. Email doesn’t scale for those kinds of purposes. Reproducibility – with virtual machines you can save the snapshot to pass it on.
Engagement needs to be well beyond technical. James Evans engaging with humanities researchers. Having equivalent of focus groups around questions over a sustained period – hammering questions, working with data, reimagining projects. Fund people to do projects that link into data. Groups that have multiple post-docs, data-savvy students can work once you give them access. Artisanal researchers need more interaction and interface work. Simplify the pipeline of research outputs – reduce to 4-5 plug and play bins with menus of analysis options. Alison – helps to be your own first user group. Some user communities are technical, some are not. Globus has Web UI, CLI, APIs, etc. About 95% of community uses the web interface, which surprised them. Globus has a user experience team, making it easier to use. Easy to get tripped up on certificates, security, authentication – makes it difficult to create good interfaces. Electronic Medical Record companies have no interest in being able to share data across systems – makes it very difficult. CRI – some people see them as service provider, some as a research group. Success is measured differently so they’re trying to track both sets of metrics, and figure out how to pull them out of day-to-day workstreams. Satisfaction of users will be seen in repeat business and comments to the dean’s office, not through surveys. Doing things like providing boilerplate language on methods and results for grants and writing letters of support goes a long way towards making life easier for people. CRI provides results with methods and results section ready to use in paper drafts. Should journals require an archived VM for download? Having recipes at right level of abstraction in addition to data is important. Data stored in repositories is typically not high quality – lacks metadata, curation. Can rerun the exact experiment that was run, but not others.
If toolkits automatically produce that recipe for storage and transmission, then people will find it easy.