[CSG Spring 2010] Storing Data Forever

Serge from Princeton is talking about storing data. There’s a piece by MacKenzie Smith called Managing Research Data 101.

What do we mean by data? What about transcribing obsolete formats? Lots of metadata issues. Lots of issues.

What is “forever”? Serge thinks we’re talking about storing for as long as we possibly can, which can’t be precisely defined.

Why store data forever?
– because we have to – funding agencies want data “sharing” plans – e.g. NIH data sharing policy (2003). NIH says that applicants may request funds for data sharing and archiving.
Science Insider May 5 – Ed Seidel says NSF will require applicants to submit a data management plan. That could include saying “we will not retain data”.

– Because we need to encourage honesty – e.g. did Mendel cheat?
– Like open source, it helps uncover mistakes or bugs.
– Open data and access movement – what about research data?

Michael Pickett asks who owns the data? At Brown, the institution claims to own the data.

Cliff Lynch notes that most of the time the data is not copyrightable, so that “ownership” comes down to “possession”.

There’s a great deal of variation by branch of science on what the release schedules look like – planetary research scientists get a couple of years to work their data before releasing to others, whereas in genomics the model is to pump out the data almost every night.

Current storage models
– Let someone else do it
– Government agency/lab/bureau e.g. NASA, NOAA
– Professional society

Dryad is an interesting model – if you publish in a given journal you can deposit your data there. That’s like GenBank.

DuraSpace wants to promote a cloud storage model based on DSpace and Fedora.

There are a number of government-sponsored data repositories that started in universities.

Shel says that researchers will be putting data in the cloud as part of the research process, but where does it migrate to?

Serge’s proposal – Pay once, store endlessly (Terry notes that it’s also called a ponzi scheme).

Total cost of storage:
I = initial cost
d = rate at which storage costs decrease yearly, expressed as a fraction
r = how often, in years, storage is replaced
T = cost to store data forever

T = I + (1-d)^r * I + (1-d)^(2r) * I + …

If d = 20% and r = 4, T ≈ 2 * I.

If you charge twice the cost of initial storage, you can store the data forever.
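
Here’s a minimal sketch of that arithmetic in Python, assuming the simple geometric-series model above (the function name and example values are mine, not from the talk). The closed form is T = I / (1 - (1-d)^r), which for d = 20% and r = 4 comes to about 1.7 * I – roughly the factor of two quoted above.

```python
# Sketch of the "pay once, store endlessly" model described above.
# Storage is re-bought every r years at a price that has dropped by a
# fraction d each year, so the total endowment is the geometric series
#   T = I + (1-d)^r * I + (1-d)^(2r) * I + ... = I / (1 - (1-d)^r)

def endowment_cost(initial_cost, d, r):
    """Closed form of the series: I / (1 - (1-d)^r)."""
    return initial_cost / (1 - (1 - d) ** r)

print(endowment_cost(1.0, d=0.20, r=4))  # ~1.69, i.e. roughly twice the initial cost
```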

They’re trying to implement this model at Princeton, calling it DataSpace.

People costs (calculated per gigabyte managed) also go down over time.

Cliff – there was a task force funded by NSF, Mellon, and JISC on sustainable models for digital preservation – http://brtf.sdsc.edu

[CSG Spring 2010] Staffing for Research Computing

Greg Anderson from Chicago is talking about funding staff for research computing.

Most people in the room raise their hand when asked if they dedicate staff to research computing on campus.

At Illinois they have 175 people in NCSA, but it doesn’t report to CIO.

Shel notes that employees have gotten stretched into doing lots of other things besides just providing research support. They’re trying to rein that back in through their career classification structure by requiring people to classify themselves. Now there are 300 generalists classified as such.

At Princeton they’ve started a group of scientific sysadmins. The central folks are starting to help with technical supervision, creating some coherence across units. At Berkeley the central organization buys some time from some of the technical groups to make sure that they’re available to work with the central organization. Groups don’t get any design or consultation help unless they agree to put their computers in the data center.

At Columbia they have a central IT employee who works in the new center for (social sciences?) research computing – it’s a new model.

Greg asks how people know what the ratio of staff to research computing support should be and how do they make the case?

Shel asks whether anybody has surveyed grad students and postdocs about the sysadmin work they’re pressed into doing. He thinks that they’re seeing that work as more tangential to their research than they did a few years back.

Dave Lambert is talking about how the skill set for sysadmin has gotten sufficiently complex that the grad student or postdoc can’t hope to be successful at it. He cites the example of finding lots of insecure Oracle databases in research groups.

Klara asks why we always put funding at the start of the discussion of research support. Dave says it’s because of the funding model for research at our institutions. The domain scientists see any investment in this space by NSF as competing directly with research funding. We need to think about how we build the political process to help lead on these issues.

[CSG Spring 2010] Research Computing Funding Issues

Alan Crosswell from Columbia kicks off the workshop on Research Computing Funding Issues. The goals of the session are: what works, what are best practices, what are barriers or enablers for best practices?

Agenda:
– Grants Jargon 101 – Alan
– Funding infrastructure, primarily data centers – Alan
– Funding servers and storage – Curt
– Funding staff – Greg
– Funding storage and archival life cycle – Serge and Raj
– Summary and reports from related initiatives – Raj

Grants Jargon
– A21: Principles for determining costs applicable to grants, contracts and other agreements with educational institutions. What are allowed and unallowed costs.
– you can’t charge people different rates for the same service.
– direct costs – personnel, equipment, supplies, travel consultants, tuition, central computer charges, core facility charges
– indirect costs a/k/a Facilities and Admin (F&A) – overhead costs such as heat, administrative salaries, etc.
– negotiated with federal government. Columbia’s rate is 61%. PIs see this as wasted money.
– modified direct costs – subtractions include equipment, participant support, GRA tuition, alteration or renovation, subcontracts > $25k.
Faculty want to know why everything they need isn’t included in the indirect cost. Faculty want to know why they can buy servers without paying overhead, but if they buy services from central IT they pay the overhead. Shel notes that CPU or storage as a service is the only logical direction, but how do we do that cost effectively under A21? Dave Lambert says that they negotiated a new agreement with HHS for their new data center. Dave Gift says that at Michigan State they let researchers buy nodes in a condo model, but some think that’s inefficient and not a good model for the future.
Alan asks whether other core shared facilities like gene sequencers are subject to indirect costs.
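
For a rough sense of how the F&A arithmetic above plays out, here’s a hedged sketch in Python (the 61% rate is Columbia’s figure from above; the budget line items and the set of exclusions are hypothetical and simplified, not a full A-21 treatment):

```python
# Hypothetical example: F&A (indirect cost) is charged on modified total
# direct costs, i.e. direct costs minus excluded items such as equipment,
# GRA tuition, and the portion of subcontracts over $25k. This is why a
# purchased server avoids overhead while a central IT service does not.

def fa_charge(direct_costs, excluded_items, fa_rate=0.61):
    """Indirect cost = rate * (total direct costs - exclusions)."""
    modified_base = sum(direct_costs.values()) - sum(excluded_items)
    return fa_rate * modified_base

budget = {"personnel": 200_000, "equipment": 50_000,
          "grad_tuition": 40_000, "central_it_services": 30_000}
exclusions = [budget["equipment"], budget["grad_tuition"]]

print(fa_charge(budget, exclusions))  # 0.61 * 230,000 = 140,300.0
```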

Campus Data Center Models
– Institutional core research facility – a number that grew out of former NSF supercomputer centers.
– Departmental closet clusters – sucking up lots of electricity that gets tossed back into the overhead.
– Shared data centers between administration and research – Columbia got some stimulus funding for some renovation around NIH research facilities.
– Multi-institution facilities (e.g. RENCI in North Carolina, recent announcement in Massachusetts)
– Cloud – faculty go out with credit card and buy cycles on Amazon
– Funding spans the gamut from fully institutionally funded to fully grant funded.

Funding pre-workshop survey results
– 19 of 22 have centrally run research data centers, mostly (15 counts) centrally funded (9 counts of charge-back, 3 counts of grant funding)
– 18 of 22 respondents have departmentally run research data centers, mostly (14 counts) departmentally funded (3 counts of using charge back, 4 counts of grant funding)
– 14 have inventoried their research data centers
– 10 have gathered systematic data on research computing needs

Dave Lambert – had to create a cost allocation structure for the data centers for the rest of the institution to match what they charge grants for research use, in order to satisfy A21’s requirement to not charge different rates.

Kitty – as universities start revealing the costs of electricity to faculty, people will be encouraged to join the central facility. Dave notes that security often provides another incentive because of the visibility of incidents. At Georgetown the security office (in IT) now reviews research grants.

Curt Hillegas from Princeton is talking about Server and Short- to Mid-Term Storage Funding – working storage, not long-term archival storage.
– Some funding has to kick-start the process – either from an individual faculty member or from central funds. Gary Chapman notes that there’s an argument for central interim funding to keep resources going between grant cycles.

Bernard says that at Minnesota they’ve done a server inventory and found that servers are located in 225 rooms in 150 different buildings, but only 15% of those are devoted to research. Sally Jackson thinks the same is approximately true at Illinois. At Princeton about 50% of computing is research, and that’s expected to grow.

Stanford is looking at providing their core image as an Amazon Machine Image.

At UC Berkeley they have three supported computational models available and they fund design consulting with PIs before the grant.

Cornell has a fee-for-service model that is starting to work well. At Princeton that has never worked.

Life cycle management – you gotta kill the thing to make room for the new. Terry says we need a “cash for computer clunkers” program. You need to offer transition help for researchers.

[CSG Spring 2008] Cyberinfrastructure Workshop – Funding for CI – Three Success Stories

Steve Cawley – University of Minnesota

The CIO’s job – get the money. What has worked at Minnesota, what doesn’t work? What are we doing to move forward?

Three centrally funded research areas –

The Minnesota Supercomputing Institute. $6.5 million budget to provide high-performance computing to the institution. Moving from the engineering/science college to the VP for Research.

Central IT – the network is a common good. Unified gig to the desk, meshed 10-gig backbone. BOREAS-Net, a 2,100-mile owned optical network connecting to Chicago and Kansas City. Chicago CIC fiber ring and OMNIPOP. NLR and Internet2.

Had good luck funding SAN storage for researchers over the past three years. Received startup funding for server centralization using virtualization. New data center plan: central IT plus research.

University Library – expanding expertise including data collection and curation, data preservation and stewardship. Virtual organization tools. University Digital Conservancy. Rich Media Services.

Problems – limited collaboration between researchers.

Heavy reliance on chargeback is a detriment. Central IT was 80% chargeback, now only 20% is chargeback. Common good services should be centrally funded.

Moving forward – the Research Cyberinfrastructure Alliance. Great exec partnership – VP for Research, VP for Tech, University Librarian. They try to speak with one voice.

Input from interviews with faculty. Large storage needs. Little thought being given to long term data preservation. University does not exist in a vacuum.

Julian Lombardi – Duke

Center for Computational Science, Engineering and Math (CSEM) – Visualization Lab and Duke Shared Cluster Resource – blades donated to the cluster. Those who donated got priority cycles.

Provost discontinued funding to the center. Cluster and Vis Lab were still being supported by OIT and departments.

Needs are – broad and participatory direction setting; support for emerging disciplines and interdisciplinary work.

Bill Clebsch – Stanford – Cyberinfrastructure: A Cultural Change

Religious camps blocking progress.

There were three separate camps – the schools, the faculty, and the administration. Everyone pretending that computing can be managed locally.

Asked the Dean of Research to send out letters to the top fifty researchers. Spent time with each of them to find out what the state of computing is.

Exploding urban myths – when they talked to faculty they found out that the received wisdom wasn’t true.

Myths and Facts #1: Myth: scientific research computing methodology has not fundamentally changed (heard from the Provost). Fact: researchers’ computational needs have changed fundamentally in the last five years; increased computing availability itself directly yields research benefits; researchers have abandoned the notion that computing equipment needs to be down the hall.

Myths and Facts #2: Myth: faculty needs are highly specialized and cannot be met with shared facilities. Fact: Faculty are willing to share resources, clusters, and cycles. Research methodologies are surprisingly similar regardless of discipline (e.g. larger data sets; simulation studies; from shared memory to parallelism). Episodic nature of research computing allows for coordinated resource sharing.

Myths and Facts #3: Myth: distributed facilities can keep pace with demand. Fact: Lessons learned from the Bio-X Program: running out of space, cycles, cooling, and power. Cross-disciplinary facility economies of scale. Multi-disciplinary computing economies of scale.

Myths and Facts #4: Myth: Central computing facilities are bureaucratic and inflexible. Fact: Colocating and sharing models reduce overhead. Modularity in building, power, cooling, and support help create sustainability. Move from control to collaboration empowers faculty and reduces central cost. Faculty own the clusters – OIT will just cycle power on request.

Where is this going? 21st century cyberinfrastructure costs will dwarf 20th century ERP investments. Sustainability will be an economic necessity. Cloud/grid computing will affect investment horizons.