I’m attending a workshop on Research Information Technologies and their Role in Advancing Science.
Ian Foster from the UChicago Computation Institute is kicking it off.
We now have the ability to collect data to study how science is conducted. We can also use that data for other purposes: finding collaborators, easing administrative burdens, etc. Those are the reasons we often get funding to build research information systems, but we can use those systems to do more interesting things.
Interested in two aspects of this:
1. Treat science itself as an object of study.
2. Use this information to improve the way we do science. Don Swanson – a research program to discover undiscovered public knowledge.
The challenge we face as a research community as we create research information systems is to bring together large amounts of information from many different places to create systems that are sustainable, scalable, and usable. Universities can’t build them by themselves, and neither can private companies.
Victoria Stodden – Columbia University (Statistics) – How Science is Different: Digitizing for Discovery
Slides online at: http://www.stanford.edu/~vcs/talks/UCCI-May182014-STODDEN.pdf
Tipping point as science becomes digital – how do we confront issues of transparency and public participation? New area for scientists. Other areas have dealt with this, but what is different about science?
1. Collaboration, reproducibility, and reliability: scientific cornerstones and cyberinfrastructure
Scoping the issue – looking at the June issues of the Journal of the American Statistical Association – how computational is it? Is the code available? In 1996 about half the papers were computational; by 2009 almost all were. In ’96 none talked about getting the code. In 2011, 21% did. Still, 4 out of 5 are black boxes.
In 2011, ? looked at 500 papers in the biological sciences and was able to get the data in 9% of the cases.
The scientific method:
Deductive: math, formal logic. Empirical (or inductive): largely centered on statistical analysis of controlled experiments. Computational science (simulations) and data-driven science might be the third and fourth branches. The Fourth Paradigm.
Credibility Crisis: Lots of discussion in journals and pop press about dissemination and reliability of scientific record.
Ubiquity of Error: the central motivation of the scientific method – we realize that our reasoning may be flawed, so we want to test it against evidence to get closer to the truth. In the deductive branch, we have proofs. In the empirical branch, we have the machinery of hypothesis testing. It took hundreds of years to come up with standards of reliability and reproducibility. The computational aspect is only a potential new branch until we develop comparable standards. Jon Claerbout (Stanford): “Really Reproducible Research” – an article about computational science is merely the advertisement of the scholarship. The actual scholarship is the set of code and data that generate the article.
Supporting computational science: dissemination platforms; workflow tracking and research environments (prepublication); embedded publishing – documents with ways of interacting with code and data. Mostly being put together by academics without reward, because they think these are important problems to solve.
Infrastructure design is important and rapidly moving forward.
Research Compendia – a site with dedicated pages that house code and data, so you can download digital scholarly objects.
2. Driving forces for Infrastructure Development
ICERM Workshop, Dec 2012 – reproducibility in computational mathematics. A workshop report was collaboratively written by attendees. It tries to lay out guidelines for releasing code and data when publishing results, with details about what needs to be described in the publication.
3. Re-use and re-purposing: crowd sourcing and evidence-based-***
Reproducible Research is Grassroots.
External drivers: open science from the White House. OSTP executive memorandum: federal funding agencies to submit plans within 6 months saying how they will facilitate access to publications and data; in May, an order to federal agencies doing research directing them to make data publicly available. Software is conspicuously absent. Software has a different legal status than data, which makes it different from data for federal mandating – the Bayh-Dole Act allows universities to claim patents on software.
Science policy in congress – how do we fund science and what are the deliverables? Much action around publications.
National Science Board 2011 report on Digital Research Data Sharing and Management
Federal funding agencies have had a long-standing commitment to sharing data and (to a degree) software. NSF grant guidelines expect investigators to share data with other researchers at no more than incremental cost. They also encourage investigators to share software. Largely unenforced – how do you hold people’s feet to the fire when definitions are still very slippery? NIH expects and supports timely release of research data for bigger grants (over $500k). The NSF data management plan looks like it’s trying to put meat on the bones of the grant guidelines.
2003 National Academies report on Data Sharing in the Life Sciences.
Institute of Medicine – report on the Evolution of Translational Omics: Lessons Learned and the Path Forward. When people tried to replicate work they couldn’t, and found many mistakes. How did the work get approved for clinical trial? New standards were recommended, including standards for locking down software.
Openness in Science: thinking about infrastructure to support this – not part of our practice as computational scientists. Having some sense of permanence of links to data and code. Standards for sharing data and code so they’re usable by others. Just starting to develop.
Traditional Sources of Error: a view of the computer as another possible source of error in the discovery process.
Reproducibility at Scale – results may take specialized hardware and long run times. How do we reproduce those? What do we do with enormous output data?
Stability and Robustness of Results: are the results stable? If I’m using statistical methods, do they add their own variability to the findings?
The Larger Community – greater transparency opens scholarship to a much larger group – crowd sourcing and public engagement in science. Currently most engagement is in collecting data, much less in the analysis and discovery. How do we provide context for use of data? How do we ingest and evaluate findings coming from new sources? New legal issues – copyright, patenting, etc. Greater access has the possibility of increasing trust and helping inform the debates around evidence-based policy making.
Legal Barriers – making code and data available. You immediately run into copyright. In the US there is no copyright on raw facts, per the Supreme Court, but an original selection and arrangement of facts is copyrightable. Datasets munged creatively could be copyrightable, but it’s murky. Easiest to put data in the public domain, or use a CC license. Different in Europe. GPL – includes a “sharealike” provision preventing the use of open-source code in proprietary ways. Science norms are slightly different – share work openly to fuel further development wherever it happens.