[CSG Spring 2010] Storing Data Forever

Serge from Princeton is talking about storing data. There’s a piece by MacKenzie Smith called Managing Research Data 101.

What do we mean by data? What about transcribing obsolete formats? Lot of metadata issues. Lots of issues.

What is “forever”? Serge thinks we’re talking about storing for as long as we possibly can, which can’t be precisely defined.

Why store data forever?
– because we have to – funding agencies want data “sharing” plans – e.g. NIH data sharing policy (2003). NIH says that applicants may request funds for data sharing and archiving.
Science Insider May 5 – Ed Seidel says NSF will require applicants to submit a data management plan. That could include saying “we will not retain data”.

– Because we need to encourage honesty – e.g. did Mendel cheat?
– Like open source help uncover mistakes or bugs.
– Open data and access movement – what about research data?

Michael Pickett asks who owns the data? At Brown, the institution claims to own the data.

Cliff Lynch notes that most of the time the data is not copryightable, so that “ownership” comes down to “possession”

There’s a great deal of variation by branch of science on what the release schedules look like – planetary research scientists get a couple of years to work their data before releasing to others, whereas in genomics the model is to pump out the data almost every night.

Current storage models
– Let someone else do it
– Government agency/lab/bureau e.g. NASA, NOAA
– Professional society
–

Dryad is an interesting model – if you publish in a given model you can deposit your data there. That’s like genbank.

Duraspace wants to promote a cloud storage model based on dspace and fedora.

There are a number of data repositories that are government sponsored that started in universities.

Shel says that researchers will be putting data in the cloud as part of the research process, but where does it migrate to?

Serge’s proposal – Pay once, store endlessly (Terry notes that it’s also called a ponzi scheme).

Total cost of storage =
I = initial cost
D = rate at which storage costs decrease yearl, expressed as a fraction
R = how often, in years, storage is replaced
T = cost to store data forever

T = I + (1-d) to the r *I + (1=d) to the 2r * I + ….

if d=20%, r = 4, T=I * 2

If you charge twice the cost of initial storage, you can store the data forever.

They’re trying to implement this model at Princeton, calling it DataSpace.

People costs (calculated per gigabyte managed) also go down over time.

Cliff – there was a task force funded by NSF, Mellon, and JISC on sustainable models for digital preservation – http://brtf.sdsc.edu

[CSG Spring 2010] Storing Data Forever

One thought on “[CSG Spring 2010] Storing Data Forever”

Leave a comment Cancel reply