CSG Fall Meeting 2012 – Big Data – open discussion

Raj opens the discussion by asking about who’s responsible for big data on campus, and do we even know who makes that decision?

In talking about repositories, are there campuses that are trying hard to fill their repositories? A few raise their hands, but most don’t.

Moving to the cloud for storage – will there still be a need for local storage expertise? John says the complexity and heterogeneity are increasing and won’t go away any time soon. Bruce would like to see us positioned to provide expertise and advice on advanced data handling technologies like Hadoop. Mark notes that we should be using those technologies to mine our own data sets – e.g. mining application and network logs for security.
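
A minimal sketch of the kind of log mining Mark has in mind, written in the Hadoop Streaming style (a mapper and reducer over text logs). The syslog-style “Failed password” pattern and the local-pipe test harness are illustrative assumptions, not anyone’s production setup.

```python
#!/usr/bin/env python
# Sketch: count failed SSH logins per source IP from syslog-style auth logs.
# Written as a mapper/reducer pair so it could run under Hadoop Streaming or
# be piped locally for testing. The log format is an assumption.
import re
import sys
from collections import defaultdict

FAILED = re.compile(r"Failed password .* from (\d+\.\d+\.\d+\.\d+)")

def mapper(lines):
    """Emit (ip, 1) for every failed-login line."""
    for line in lines:
        m = FAILED.search(line)
        if m:
            yield m.group(1), 1

def reducer(pairs):
    """Sum counts per IP (works on a local stream of (ip, count) pairs)."""
    counts = defaultdict(int)
    for ip, n in pairs:
        counts[ip] += n
    return counts

if __name__ == "__main__":
    # Local test: python mine_logs.py < auth.log
    # Under Hadoop Streaming the mapper and reducer would be separate scripts.
    totals = reducer(mapper(sys.stdin))
    for ip, n in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{n}\t{ip}")
```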

In a discussion of data management and the need to engage faculty in planning for data management, Kitty notes that she’s had some success hiring PhDs who didn’t want to go into faculty positions but had a lot of experience with data. They could talk to faculty in ways that neither librarians nor IT people can. Curt adds that last year’s data management workshop identified the need for grad students to be trained in data management as part of learning research methods.

John asks whether people are planning central services for systems that store secure data that cannot be made public. Bill notes that Stanford is definitely building that as part of their repository, including ways to store data that cannot be released for a certain number of years. Carnegie Mellon is also planning to do this. Columbia has a pilot project in this area.

Raj notes that on the admin side, Arizona State is doing a lot of mining of their student data and providing recommendations for students on courses, etc.

Mark notes that we don’t have a good story to tell about using groups to control access. Michael says we do have a fairly good story on group technology (thanks to Tom’s work on Grouper), but we still need to work across boundaries and to develop new standards, such as OAuth and SCIM.

Mark also suggests that the volume of data generated by massive open online courses would be a really interesting place to think about big data.

There’s some general discussion about the discoverability of data in the future and the ability to understand and use data in the future when the creators might not be available. That’s where data curation becomes important.  Data curation should include standards for metadata and annotation, and also processes for migrating data forward as formats change. Ken quotes a scientist as saying “I would rather use someone else’s toothbrush than their ontology”.

On to lunch.

Cliff Lynch – wrap-up discussion [ #rdlmw ]

One tag line – scholarly practice is changing, and that’s what put us all here. That won’t go away just because we’re having trouble dealing with it.

There’s a great search for leverage points where we can get a lot of return for a little investment. Lots of wondering whether there aren’t things we can do as consortia, for example. Other players, like instrument manufacturers. We have to keep looking for these leverage points, but need to realize that this is a sizable and expensive problem that we can’t make go away with one or two magic leverage points.

The discussion we just had about scaling and being involved up front, but being scared about whether we can deal with demand, is a real look at our problems.

Some new discussions about putting data lifecycle and funding strategies on different timelines that have complex interactions – certain funding strategies can distort the lifecycle by making it attractive or necessary for investigators to hold on to data that should be migrating.

This is not an NSF problem, nor a funding agency problem. We need to come up with a system that accommodates unsponsored research too. There’s a significant amount of work that goes on in the social sciences and humanities with little or no funding attached.

One of the ugly facts we need to be mindful of is the systematic defunding of (particularly public) higher education, and the pressure to defund scholarly research in government agencies. We need to come up with means of data curation and management that allow us to make intelligent decisions about priorities. We saw a striking example of this in the UK when massive cuts were applied to their funding agencies, including defunding of the national archiving system for the arts and humanities.

Pleased to see a session on secure data that said more than “this is hard – let’s run away”, which is the usual response. “Secure data” is probably not the right term. There’s so much work to do on definitions and common languages, so that we don’t spend so much time redefining problems – maybe we need to put some short-term effort into definitions. Also, we’re short on facts on the ground. Serge offered some real data on what’s going on with grants at Princeton – we need that campus by campus, rolled up by discipline and along national lines. It’s not that hard to get, and there are various projects talking about it.

What’s the balance between enabling sharing and enabling preservation? Often a lot of the investment starts going into preservation and you never get to sharing. Bill Michener gave us a look at a nice set of investments in discovery and reuse systems (DataONE); maybe that’s something that could be federated so we don’t have to build a ton of them.

Was glad to hear about the PASIG work – developed over 3-4 years between Stanford, a group of other institutions and Sun. As we think about the right kinds of industry/university venues for collaboration that’s one to have a look at. In particular look at the agendas for past meetings. Some of the conversations about expected structure of storage market, things that drive tech refresh cycles in storage, are very helpful.

Panel Discussion – Funding Agencies [ #rdlmw ]

    Michael Huerta – NIH/NLM

Benefits of data sharing include transparency, reanalysis, integration, and algorithm development.
Data sharing has costs
Sharing data is good, but sharing all data probably isn’t.
Should data be shared? Considerations:
– maturity of science – exploratory vs. well understood (might make more sense to share)
– maturity of the means of collection – unique means might not be valuable for others
– amount and complexity of data – more might be better for sharing
– utility of the data – to research community and public
– ethical and policy considerations

NIH provides guidance to applicants on formulating data sharing plans – important questions to address (NIH requirements kick in for direct costs > $500k/yr, but they are revisiting that threshold).
– What data will be shared – domain, file type, format type, QA methods, raw and/or processed, individual/summary etc.
– who will have access? Public? research community? more restricted?
– where will data be located? what’s the plan for maintenance?
– when will data be shared? at collection, or publication? incremental release of longitudinal data?
– How will researchers locate and access data?

NIH success stories –
– Data resources – NLM: GenBank, dbGaP, PubChem, ClinicalTrials.gov; NIH Blueprint for Neuroscience Research: NITRC, NIF, Human Connectome Project; NIH National Database for Autism Research – all autism research data from human subjects, federated with other resources.

    Jennifer Schopf – NSF

NSF Data policy is NOT changing – it just wasn’t enforced very broadly.
What has changed is that since mid-January every proposal that comes in must have a data management plan. The DMP may include: types of data, standards for data and metadata, policies for access, policies for re-use, and plans for archiving. It is community driven and reviewed – there aren’t generally accepted definitions and practices across all disciplines. It is acceptable to have a plan that says “I don’t plan to share my data” – but then you should probably explain why. The requirement is expected to grow and change over time, the same way impact and review criteria have changed over time.

Within NSF, looking at implications of sharing data from a computer science point of view. There is a cross-NSF task force called ACCI data task force (Tony Hey and Dan Atkins).

Trying to enable data-enabled science.

What are the perceived roles of internal support mechanisms for data lifecycles? How are we looking to interact with libraries, local archives, etc.? How do researchers, librarians, CIOs, and others think about linking to regional or national efforts, and how can NSF help support this?

    Don Waters – Mellon Foundation

Most humanist disciplines depend on durable data. The digital humanities, like e-science, are changing. We are witnessing massive defunding of higher education across the country, so we need to work to address common problems together.

The definition of data needs to be wider than numerical information, but not as broad as bit-level. Don defines data as being primary sources. Scientific data come in many of the same forms as humanistic data.

Data now depend on sensors and capture instruments. There’s a tendency to treat these curation issues as novel – even if they’re new to scientists, they’re familiar to humanists who have had to interpret very rich types of data. Data-driven scholarship is not new.

What is new is the formalization of traditional interpretive activities and powerful algorithms that can work on this data. Projects in humanities have moved the needle, but there are problems with curating data.

To achieve this promise, a flexible and scalable repository structure is needed. Mellon has been experimenting for over a decade; ArtStor is one example. Universities and scholarly societies have been willing to step up and provide places to store these data. Bamboo is a virtual research environment for various forms of humanistic data.

A question is raised about how NSF is working with the National Archives – they’ve been collaborating and expect to continue.

Another question is about who is the ultimate owner of data. From Mellon’s perspective, data are institutionally owned, and grants have explicit agreements that require institutions to gather rights from creators. NSF doesn’t have as formal a process, but NSF makes grants to institutions. From NIH’s perspective, the ultimate owners are those who pay for it, which are the taxpayers. Cliff notes that it’s not clear in the US whether data (a collection of facts) can be owned, so it comes down to who has control over it. Control obligations can be shaped by contracts between funders and institutions. It’s different overseas. Don notes that in the humanities, at least, the data are often other works that have their own IP issues, so rights need to be negotiated.

David asks about opportunities for sharing across institutions and disciplines. Mike answers that bringing together resources is useful, and that requires work to converge on common definitions and formats. The work they’ve done with autism research is a good example. Once you’ve got things defined, the data can reside anywhere and there’s no onus to support a large infrastructure. NSF supports a wide variety of research – solutions for sharing are not easy. Shared metadata is a good idea. Don notes that there are differences that get in the way of sharing data, and finding ground for a shared community is a big part of the work that needs to be done. Some of that has been done at the repository level, much less at the level of tools for using the data.

Grace asks whether funding agencies are starting to do more assessment of longer-term impacts – it seems like innovation is more key to getting funding than sustainability or impact. At Mellon they separate the infrastructure from the innovative, and in Don’s division grants don’t get made unless they have a sustainability story. At NSF, evaluation of impact is becoming more common for grants that support infrastructure.

Panel Discussion – Vendor and Corporate Relationships [ #rdlmw ]

    Ray Clarke, Oracle

SNIA – 100 Year Archive Requirements Study – key concerns and observations:
– Logical and physical migration do not scale cost-effectively
– A never ending, costly cycle of migration across technology generations

Lots of challenges – Oracle (of course) offers solutions across the stack.
The ability to monitor workflow as data moves through a system is important.
There will always be a plethora of different types of media and storage – it’s important to manage that. There are technology considerations around shelf life and power/cooling consumption that favor tape over disk, as does bit error detection. The cost ratio for a terabyte stored long term on SATA disk vs. LTO-4 tape is about 23:1; for energy cost it is about 290:1. Tapes can now hold 5 TB uncompressed on a single cartridge. There are ways to deploy a tiered architecture of different kinds of media, from flash, through disk, to tape for archival storage.
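
To make those ratios concrete, a tiny back-of-the-envelope calculation; only the 23:1 and 290:1 ratios come from the talk, and the per-terabyte tape figure is a made-up placeholder.

```python
# Back-of-the-envelope illustration of the quoted ratios. Only the 23:1
# (total cost) and 290:1 (energy) ratios come from the talk; the $40/TB
# long-term tape figure is a hypothetical placeholder input.
TAPE_COST_PER_TB = 40.0          # hypothetical long-term cost on LTO-4 tape
COST_RATIO_DISK_TO_TAPE = 23     # quoted: SATA disk vs. LTO-4 tape
ENERGY_RATIO_DISK_TO_TAPE = 290  # quoted: energy cost ratio

disk_cost_per_tb = TAPE_COST_PER_TB * COST_RATIO_DISK_TO_TAPE
print(f"Tape: ${TAPE_COST_PER_TB:.0f}/TB  Disk: ${disk_cost_per_tb:.0f}/TB "
      f"(energy alone is {ENERGY_RATIO_DISK_TO_TAPE}x worse on disk)")
```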

We need more data classification to understand how best to store it.

Oracle Preservation and Archiving Special Interest Group (PASIG), founded 2007 by Michael Keller at Stanford and Art Pasquinelli at Oracle.

    Jeff Layton, Dell

Looking at three aspects of DLM:
Data availability – how do you make your repository accessible to users? A perfect example is iRODS.
Data preservation – the “infamous problem of bit-rot” – make sure that data stays the same. Experiments with extended file attributes, and being able to restore in the background.
Metadata techniques – the how, what, when, and why of data. The key is getting users to fill it out. How do we make this easy for users? It should be part of the workflow as you go. Investigating this as part of job scheduling.

Dell is acquiring pieces – Ocarina (data compression), EqualLogic (data tiering), Exanet (scalable file system). Ocarina can actually compress even already-compressed data by another 20%.

Dell is prototyping and testing data access/search methods, extended file attributes for metadata, and data checksums for fighting bit-rot. The idea is putting the metadata with the data.
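
A minimal sketch of the “metadata travels with the data” idea: stamp a checksum into an extended file attribute at ingest and re-verify it later to detect bit-rot. This assumes a Linux filesystem with xattr support; the attribute name user.sha256 and the file name are my own choices, not a Dell convention.

```python
# Sketch: store a SHA-256 checksum as an extended file attribute so bit-rot
# can be detected later. Requires a Linux filesystem with xattr support.
import hashlib
import os

def stamp_checksum(path):
    """Compute a file's SHA-256 and record it as an extended attribute."""
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    os.setxattr(path, b"user.sha256", digest.encode())
    return digest

def verify_checksum(path):
    """Recompute the checksum and compare it to the stored attribute."""
    stored = os.getxattr(path, b"user.sha256").decode()
    current = hashlib.sha256(open(path, "rb").read()).hexdigest()
    return stored == current

if __name__ == "__main__":
    # "results.dat" is a placeholder file name.
    stamp_checksum("results.dat")          # at ingest time
    assert verify_checksum("results.dat")  # later, in a background scrub
```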

    Imtiaz Khan, IBM

Aspects of Lifecycle Management
– Utilization of research
– Data Management
– Storage Management

Current Challenges – Research & Publishing
– Volume, velocity, and variety – e.g. real-time analysis is about heavy volume.
– Discrete rights management – at very granular levels.
– Metadata management
– Reusability/Transformation
– Analytics
– Long term preservation.

Content analytics and insight – Watson is a great example. Taking text and using natural language processing to extract meaning and leverage that meaning for other applications.

Smart Archive Strategy – content passes through a rule-based content assessment stage before deciding where to put the content (on premises, in the cloud, etc.).

IBM has a Long Term Digital Preservation system.

    Q&A

Oracle working on strategies for infrastructure, database, platform, and software as services in clouds.

A question is raised about intellectual property rights – e.g. proprietary compression schemes impeding scientific progress. Long term preservation and access is an important consideration.

Curt asks about middleware that can manage workflows to ease the metadata burden. Oracle does that in their enterprise content management offerings. Dell is considering enabling users to add metadata in the existing workflow, e.g. in job submission or file opening.

A question is asked about PASIG and whether the other vendors have community groups working with higher ed. Dell is working with Cambridge and the University of Texas on some of these issues and invites others to participate, but it’s not a formal group. IBM has various community groups (not specified). Ray notes that PASIG is about the community, not marketing.

Are there areas beyond preservation where the vendors are working? Dell has worked with bioinformatics data, making it available. Another example is aircraft data that has to be kept available for the life of the aircraft – and we’re still flying planes from the 1940s. Finding the data is not everything – the ability to visualize and mine the data has to be wrapped up with the data itself; it’s all one. IBM’s Smart Archive strategy is not just about preservation, but also compliance and discovery (the legal perspective is a common use case).

Unstructured data represents about 80% of all data, and it’s growing geometrically. RFID data is a great example of data that needs to be captured and extracted. Data access patterns are random and IOPS-driven, not sequential.

Serge asks about pricing models for long-term storage. Oracle has the ability to charge for unlimited capacity, charging by cores on the servers. Dell honestly admits they don’t have a good answer – what should be charged that accommodates moving data across generations of technology? IBM’s pricing strategy is based on storage size. Jeff notes that the model has to accommodate how many copies are saved and how often they’re checked for integrity.

Vijay concludes by noting that the size of data matters – we may not want to move terabytes of data, and should instead bring the compute to the data. He tosses out a thought experiment: vendors could store the data for free and charge for the compute cycles people execute on the data.

Breakout session reports [ #rdlmw ]

    Secure research data

– Wanted to focus narrowly on where access to restricted datasets is important in research computing. In the social sciences, researchers sometimes have to apply to analyze government data that is not public. Medical data is protected by regulation. Geospatial research can use sensitive data on individuals. People working with industry sometimes have restrictions on data. Intellectual property has to be respected. Recommendations:

1. People who manage research computing environments want to know what federal standards need to be complied with – come up with a national working group on how to comply. There is a federal interagency working group on data which might be a good venue to communicate with.
2. A simple catalog of solutions from institutions on how to enable remote access to secure data. Use the Educause Cyberinfrastructure working group.
3. Catalog items for clinical and translational studies.

    Policy

Recommendations:
1. Develop a set of documentation (elevator speech, exec summary, and extensive report) to describe the need for policies and standards across disciplines as much as possible.
2. Develop workshop for university officers (VP of Research, Provost) to include them in discussions on how institutions can be involved.
3. Catalog issues of data ownership and responsibility. Reduce the mean time to discovery for researchers figuring out how they should deal with their data.
4. Develop workshop for leaders of disciplinary communities.
5. Develop discipline-blind framework – what are the kinds of things a discipline needs to do to develop policies and standards?
6. University librarian is key in this role.
7. It’s time for the researchers to walk into the room with the librarians and say “we’re here”. – Brian Athey.

    Assessment and selection of research data

Is it really a goal to keep all data if possible? Good question.
Good practices with physical materials should be studied for guidance.
The expense of managing data shouldn’t be the primary consideration for what we keep.
Selection process has to be discipline specific.
What’s the cost of getting rid of something? Is reproduction of the data possible, and if so, what does that cost?
It’s easier to throw things away than to try to collect them after the fact. So collect and manage data before deciding to throw it away.
Researchers will have to provide at least core metadata.
Selection process is not yes/no but a continuum from minimal to full.

1. To make decisions easier, develop a framework for making them. The researcher is a full partner in this.
2. Educate key audiences on the importance of curatorial concepts – researchers in all disciplines, and catch grad students now.
3. Encourage policy makers to rethink roles across the institution.

    Funding and operation

Recommendations for action:
1. Repository builders should collaborate – build with knowledge and forethought of others. Too many isolated repositories. Think federation.
2. Make data movable. Funding models will change over time. Should be movable from one caretaker to another.
3. Prepare for the hand-off. Anybody organizing a repository must put enough details in plan and budget to enable hand-off at the end of business cycle.
4. It would be useful to have a study of existing repository models.

    Partnering researchers, IT staff, librarians and archivists

30 people in this breakout!
1. Communication of what’s out there – what models exist? Portal that identifies workable solutions. What practices work for training – resources for cross-training?
2. Institute more training for grad students.
3. Produce a substantial workshop report from here – task NSF with developing a generic framework that allows institutions to implement policies and appropriate procedures.
4. Hold a workshop to define best institutional practices in communicating between researchers and librarians.
5. Survey our campuses on data management practices.

    Standards for provenance, metadata, discoverability

Got into a discussion of “what is metadata” – anything that supports the core user needs for information. The IFLA definition asks: can you find it, can you identify it, can you select among resources, can you retain or reuse it? We want our metadata to be interoperable – able to move across repositories, workspaces, etc. We also want trustworthy and reliable data.
Core needs:
1. A common framework for data. Some are emerging, like METS.
2. Role of ontologies – domains recognizing standardized terminologies. Linked Data (the semantic web) might be worth exploring for this.
3. Instrumented data – if the numeric data is off, the data is useless. How do we know if the data is good? This is a huge gap in current data – we need to work with instrument manufacturers. What captured this data? Right now it is usually entered manually.
4. Metadata needs to be captured at the point of data creation (a minimal sketch of such a record follows below).
5. We need standards for provenance – what was the purpose of creating this data? Relationships between datasets are critical. Most scientists spend a long time exploring dimensions of the same set of problems.
Researchers want to develop their own metadata – treat it like any other data stream. Don’t worry about having to bring it into a structure.
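
A hypothetical example of the kind of record item 4 points at – core descriptive fields plus instrument and provenance details captured when the data is created. The field names are illustrative only, not METS or any domain standard.

```python
# A hypothetical minimal metadata record captured at the point of data
# creation. Field names are illustrative -- not METS, Dublin Core, or any
# domain standard -- but they cover the core needs listed above:
# identification, instrument capture, provenance, and dataset relationships.
import json
from datetime import datetime, timezone

record = {
    "identifier": "doi:10.9999/example.1234",      # placeholder DOI
    "title": "Stream temperature time series, site 7",
    "creator": "A. Researcher",
    "created": datetime.now(timezone.utc).isoformat(),
    "instrument": {                                 # "what captured this data?"
        "model": "ExampleCorp TempLogger 300",      # hypothetical device
        "serial": "TL300-0042",
        "calibration_date": "2011-06-01",
    },
    "provenance": {
        "purpose": "Baseline monitoring for watershed study",
        "derived_from": [],                         # links to related datasets
        "processing": "raw; no QA applied",
    },
}

# Writing the record alongside the data file keeps the metadata with the data.
with open("site7_temps.json", "w") as fh:
    json.dump(record, fh, indent=2)
```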

    Partnering funding agencies, research institutions, communities, and industrial and corporate partnerships

Recommendations:
1. Joint study of the feasibility of the “digital sheepskin”. Is there a model for a digital container that can be sustained through the ages, including metadata? We’ll probably have to invent some of the social context for this.
2. Conduct an aggregated study of TCO models using trusted party (academia) for storage for perpetuity or for ten years.
3. Identify the missing pieces of the research data software stack, and encourage collaborations between academia and industry.
4. A study on criteria for throwing data away, by discipline.
5. Continue to emphasize that data volume is growing much faster than our ability to move data around. Think about where we need to site data.
6. What are the possible models for joint activity with industrial partners?

Lightning Round! [ #rdlmw ]

John Fundine – USGS – makes the case for not keeping everything. They deal with observational records, now in the 5-petabyte range. They end up with orphans as programmatic sponsors change. Budgets are going down in government agencies – the best case for next year is now being level with 2007-2008. What not to keep, and who decides? He advocates a formal appraisal process – as an archivist he owns the process but not the outcomes. They have a 30% disposal rate. Disposal is not the same as destruction – owners can find new homes for data.

Jim Myers – W3C Provenance introduction. The group is in progress; it started from an open provenance group focused on data history tracking. Scope – want to talk about input objects being used to produce output objects. Also trying to add the distinction between document and file. Will be able to talk about how documents were funded, what reports were derived from them, etc. Also talking about physical objects – the web of things.

Herbert J. Bernstein – We’re looking at too easy a problem. Should we be thinking about communicating with the future? We need to get out of the mode of thinking that what we do works – our processes produce errors. The data should be able to stand on its own legs, without people, in a format that will be readable, unlike what we do now. We need to be much more conservative about our frameworks, and stop changing them.

Grace Agnew, Brian Womack – Spent about three years working with scientists on organizing data. Faculty didn’t remember the context of data just months later, which makes it hard to reuse – you need to know what trial it was, who conducted it, etc., the entire provenance. They developed an events-based data model that fits into a METS-based structure. They are training a team of librarians internally who will support research data efforts going forward.

Jen Schopf – Woods Hole – Points from this morning: It’s not true that we don’t have to throw data out. Data are not like books – they’re not vetted or standardized. And it’s not true that large projects can always find money for curation.

Scott Brandt – Purdue – Data Curation Profiles (http://datacurationprofiles.org). They developed a profile template and got a grant to teach librarians how to interact with researchers. There is a toolkit available on the web site.

“DataONE (Observation Network for Earth): Enabling New Science by Supporting the Management of Data Throughout its Life Cycle” – Bill Michener, University Libraries, University of New Mexico [ #rdlmw ]

Defining the problem space –

Grand challenges are difficult. We’re using different languages and standards for how we deal with our domain data. Most scientists complain that they’re spending most of their time doing mundane data management and integration. Only a small part of their time is spent on the analysis.

The data deluge – lots of sensors.
Proliferation of citizen science programs. A whole new way of doing science.
Data silos. Lots of big repositories, tons of small ones, each using their own, non-interoperable data standards. Creates the long tail of orphan data – scattered worldwide.

Data entropy – most scientists are really familiar with their data just prior to publication. They may or may not document the intricacies of the data, and we lose the ability to use the data over time.

DataONE approaches
Community Engagement – DataONE has been funded for 1.5 years by NSF and will be releasing infrastructure in December 2011. They started interacting with the scientific community two years ago via interviews, surveys, etc. What are the challenges scientists are facing?

A recent study found that (in the earth sciences) more than 80% of a broad group of researchers would be willing to share data.

Stakeholder needs – what are data management plans? How do I describe and preserve my data?

Brought an array of people into the room to look at continental bird migration. What do we need to answer this? 31 different data layers, including one from a single researcher in Utah with data in his desk. Data discovery is an issue. It needed lots of compute cycles, which was a shock – an initial 0.5 million hours on TeraGrid, and more later. Also needed visualization tools. One of the datasets used, eBird, is a citizen science data source. This produced the State of the Birds 2011 report.

Cyberinfrastructure support – the goal is to enable new science through universal access to data about life on earth and the environment, plus access to key tools. Three precepts: 1. Build on existing cyberinfrastructure; 2. Create new cyberinfrastructure; 3. Support communities of practice – something we’ve ignored over time.

Member nodes – data repositories that already exist. Coordinating nodes – retain the complete metadata catalog, provide indexing for search and network-wide services, ensure content availability (preservation), and handle replication services (they would like to see data in 3 or more repositories). Investigator toolkit – familiar to scientists, integrated with data resources.

First three member node prototypes – ORNL-DAAC, Dryad, KNB.

There’s beginning to be some evidence that when you share your data, citation rates to your publications go up.

Working with Microsoft Research to make Excel a more powerful tool.

Added (impending release) a Data Management Planning Tool (DMPTool) for building a data management plan – wizard driven.

DataONE includes a powerful data discovery tool.

Education and Training – there’s a lot to do! In DataONE, created DataONEpedia – best practices. Scientists want one pagers, not detailed manuals.

“Taking AIM at Data Lifecycle Management” – Jose-Marie Griffiths, Bryant University [ #rdlmw ]

Representing the point of view of chief research officers for this talk.

Most of the concerns relate to current economic conditions and uncertainties, particularly overhead costs. Also concerned about policies that turn into unfunded mandates, and about roles and liabilities.

Size and scale issues – big universities can do things smaller ones can’t – need to find ways of federating so the smaller institutions can participate.

AIM – Access, Integrity, Mediation

Access – what goes in must be able to come out. Need to focus on users, defining “users” as widely as possible. Need metadata, which requires people. Also need to understand the costs of migrating data as technologies become obsolete.

ICPSR is a good model of an inter-institutional data consortium.

Interoperability – never easy. Referential integrity degrades over time. Decisions tend to get made on the fly.

Increased public access is a trend, supported by the government funding agencies. NSTC report expected next spring.

Integrity – We need to plan for preservation across the entire lifecycle. What are we going to share? Raw data? Processed, analyzed datasets? Instruments? Calibration? Analytical tools?

Mediation – needed at all stages of lifecycle. Where there is high intensity of interaction, it may make sense to have lots of replication and different mediation. Mediation may not always need to be formal, but for repositories and analysis it does need to be more formal. But must make sure that creating new repositories is not a solution in search of a problem.

Players and relationships among them are constantly shifting, vying for funding and attention. Issues about research directories. For data to be discovered, must have a shared overlay of connections. An ecosystem of multiple stakeholders.

Serge points out that large swaths of disciplines don’t have disciplinary repositories. Jose-Marie replies that there is a role for institutional repositories, but there are challenges – we don’t know enough about building a sustainable economic model. We don’t have good metrics about progress in cyberinfrastructure; all we have is the number of high-speed connections to institutions.

Serge Goldstein – DataSpace [ #rdlmw ]

I’ve reported on Serge’s experimental model at Princeton before, at http://blog.orenblog.org/2010/05/12/csg-spring-2010-storing-data-forever/

Funding and operational model for long-term preservation of research data. Piloting at Princeton.

Storing data forever.

What’s “forever”? We don’t usually tell people how long we keep stuff – libraries don’t, for example. We can treat data the same way as books – “indefinitely” – a best effort to keep data around for a long time, which doesn’t have to be precisely defined.

Quotes Cliff Lynch – funding agencies don’t expect data to be kept forever. But Serge is uncomfortable with that.

The reality today is that we’re talking about an indefinite period of a “few years”.

Where do we store data? Your local web site; a disciplinary repository; at another university; in the cloud (Amazon, Google, DuraCloud).

How do we pay for storing data? The institution pays; grants pay – but they don’t go on forever; or – we don’t know (the most popular model). Most mechanisms require ongoing payment. That answers the “what should we store” question – by being willing to store whatever someone’s willing to pay for. DuraCloud is charging $1,800/year/TB, which is not a reasonable charge for long-term preservation.

At Princeton they’re trying a Pay Once, Store Endlessly approach, based on the steadily declining cost of storage (computed on a per-unit-of-storage basis). It turns out you can store research data forever for about twice the original storage cost. At Princeton that works out to about $5 per gigabyte (including tape drives) to store forever.
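
The “about twice the original cost” figure falls out of a geometric series: if the per-unit cost of storage drops by a constant factor at every hardware refresh, the infinite sum of refresh costs converges. A sketch, assuming the cost halves each cycle (the halving rate is my assumption; the ~$5/GB forever price is Serge’s).

```python
# Sketch of the "pay once, store endlessly" arithmetic. If storing one unit
# of data costs C today and the cost falls by a factor r at every hardware
# refresh, the lifetime cost is the geometric series
#     C + C*r + C*r**2 + ... = C / (1 - r)
# With r = 0.5 (cost halves each refresh -- an assumption, roughly in line
# with historical per-GB storage trends), the total is 2*C: "about twice
# the original storage cost."
def lifetime_cost(initial_cost, decline_factor):
    """Closed-form sum of the infinite refresh series (0 <= decline_factor < 1)."""
    return initial_cost / (1.0 - decline_factor)

if __name__ == "__main__":
    c = 2.50   # hypothetical first-copy cost per GB, so lifetime ~ $5/GB
    print(lifetime_cost(c, 0.5))   # -> 5.0
```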

This does not include added services like curation or format translation – it’s just bit storage.

Serge looked at the data management plans for all grants submitted at Princeton since the mandate for a data management plan – 93 grants in total. 27 (about 30%) have no data management plan. The most popular plan is storage on a web site or local disk (20%), followed by DataSpace.

Brian Athey – Big Data 2011 [ #rdlmw ]

Brian Athey is a professor in the Medical School at the University of Michigan.

It’s difficult to incentivize researchers to share data.

Agile data integration is an engine that drives discovery.

Developing personal health system requires combining data extracted from genomics with data extracted from a clinical record of the individual.

There’s a disconnect between classic IT’s “command and control” approach and what actually happens in research labs. We want to achieve a focused collaboration balancing high levels of focus and participation.

Next gen sequencing – turning out around 10 terabytes per day at Michigan, from 1500 users.

In 2006 there was a knee in the curve where it became more economical to generate the genomic data than to store it. We have to make decisions about what we store – we can’t save everything.
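
A toy illustration of the regenerate-vs.-store decision behind that knee in the curve; all of the numbers are hypothetical placeholders, since the talk only notes that the crossover happened around 2006.

```python
# Toy "regenerate vs. store" comparison. All numbers are hypothetical
# placeholders; the talk only says the crossover happened around 2006.
def cheaper_to_store(dataset_tb, storage_cost_per_tb_yr, years,
                     resequencing_cost):
    """Return True if keeping the data beats regenerating it on demand."""
    storage_total = dataset_tb * storage_cost_per_tb_yr * years
    return storage_total < resequencing_cost

# Example: 2 TB of raw reads kept for 5 years at $300/TB/year vs. a $2,000 rerun.
print(cheaper_to_store(2, 300, 5, 2000))   # -> False: regenerating is cheaper
```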

Brian is working on a Federated Enterprise Data Warehouse, that stores both clinical and research data. There’s an “honest broker” that mediates the data accessible to the research side.

PCAST NITRD “Big Data” report from November. Has a list of recommendations.

We are all challenged by having to bring heterogeneous data together. Working with Johnson and Johnson on something called tranSMART – J&J have over 400 pharma research databases.

Clinicians have workflow – researchers don’t.

Discussion items:
IT doesn’t own the problem.
The rise of “architecture”
Data governance – who owns the data? Bring them into the room. But there also have to be top-down convenors.
Privacy, security, confidentiality – the idea of the “honest broker” could be a model.
Cost and value-centered models – if we remain just a cost center we’re cooked.

Question – why can’t we keep all the data? The “Best Buy conundrum”: why do you charge me so much for storage when I can get it cheap elsewhere? It takes money to curate and level out the chaos. Maybe we should let the researchers decide what stays and what goes. The questioner, who deals with crystallography data and works with people handling NASA data, says they’ve learned that getting rid of raw data is a huge mistake. Vijay notes that the cost of hardware is now only 5% of the cost of storage – it’s people and facilities that cost.