The Fall CSG Meeting is hosted by Princeton.
Mark McCahill from Duke is leading off the Big Data workshop, defining the problem.
Do we even have a problem? Big data advocates would argue that the traditional SQL-centric, ERP-style approach is no longer adequate.
Big data is what doesn’t fit this old-school paradigm.
4Vs – Volume (lots of data), Velocity (changes frequently), Variety (heterogeneous), Variability (polymorphous)
Are my infrastructure and network up to it? Two pieces of storage: active and archival. Analysis tools are key – if the data isn't in a relational database, you have to find something else to do correlations. Publishing data and crowd-sourcing analysis.
Volume – When you start talking about 200 petabyte datasets across 50k nodes, you get into real money (Information Week). Costs drive volume – Cloudera claims that Hadoop systems are way cheaper than traditional SQL, about $1k a terabyte.
Analysis architecture – should you be trying to move the data to the computation, or vice versa? How do you make it so other people can do analysis on the data?
Publishing – http://beamartian.jpl.gov – crowd-sourcing crater counts. Amazon hosts US Census data.
Streaming realtime data – Velocity is important: click streams, user activity streams, sensor networks. You want to be able to act on them very quickly. The commercial sector is into this in a big way. A company is marketing a smart soap dispenser that detects RFID badges near it and whether it dispensed soap.
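To make the velocity point concrete, here's a minimal sketch of acting on a stream as it arrives – a sliding-window counter that flags a badge generating lots of events. This is just illustrative Python, not any particular product; the window size and threshold are made up.

```python
import time
from collections import deque

WINDOW_SECONDS = 60    # made-up window
ALERT_THRESHOLD = 10   # made-up threshold

events = deque()       # (timestamp, badge_id) pairs, newest at the right

def record_event(badge_id, now=None):
    """Add one event and react immediately if the badge is unusually busy."""
    now = now if now is not None else time.time()
    events.append((now, badge_id))
    # Drop everything older than the window.
    while events and events[0][0] < now - WINDOW_SECONDS:
        events.popleft()
    count = sum(1 for _, b in events if b == badge_id)
    if count > ALERT_THRESHOLD:
        print(f"badge {badge_id}: {count} events in the last minute")

record_event("badge-42")
```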
If we’re thinking about big data as stuff that’s hard to handle with traditional tools, how does that fit into our constraints for our institutions?
Jim Leous from Penn State – The classic example of volume is the Large Hadron Collider, or the LSST telescope (data rates of 30 terabytes per night, which amounts to 15 gigabits per second), but the scientists at Penn State will go to the data rather than needing the whole dataset. They’re looking at what changed since the last observation, which is a much smaller set. But they want to be able to quickly order other instruments to look at the parts of the sky that have changed.
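The "only ship what changed" idea is worth spelling out: the raw image stream is enormous, but the difference against the previous observation is tiny. A toy NumPy sketch of difference imaging (not LSST's actual pipeline – the frame size, noise, and threshold are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two fake observations of the same patch of sky (invented numbers).
previous = rng.normal(100.0, 5.0, size=(2048, 2048))
current = previous.copy()
current[512, 1024] += 500.0          # pretend something brightened

# Difference imaging: subtract, threshold, keep only the changed pixels.
delta = current - previous
changed = np.argwhere(np.abs(delta) > 50.0)

# The changed set is tiny compared to the full frame - that's what gets
# shipped around and used to trigger follow-up observations.
print(len(changed), "changed pixels out of", current.size)
```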
Genomics – Democratization of biology. Everyone can afford gene sequencers so they’re all producing massive data.
Big data tools – need to scale out not up. Create lots of copies of data, tolerate failure, bring computation to data, can function anywhere.
Hadoop – A computational environment modeled on what Google and Yahoo did. It has a filesystem (HDFS, based on the Google File System), a database (HBase), a MapReduce framework (reducing the data to manageable chunks), and tools for SQL-like queries (like Hive).
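For the MapReduce piece, here's a toy single-process version of the model Hadoop implements – map each record to key/value pairs, shuffle by key, reduce each group. It's only a sketch of the programming model (classic word count), not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word - the classic word-count mapper.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Collapse all the values for one key down to a single result.
    return key, sum(values)

def mapreduce(records):
    groups = defaultdict(list)
    # "Shuffle": group every mapped value by its key.
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(mapreduce(["big data is big", "data about data"]))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Hadoop runs the same pattern across many machines, with HDFS holding the data blocks and the framework moving the computation to the nodes that hold them – the "bring computation to data" point above.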
MongoDB – schema-less format, which makes it easy to handle sparse data.
These tools are called NoSQL tools – key/value pairs, where the value can be anything: a document, a gene sequence, a video, etc. The SMAQ stack (Storage, MapReduce And Query) – to data what the LAMP stack was to the internet.
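To make the schema-less point concrete, a small pymongo sketch (assumes pymongo 3+ and a mongod running locally; the database, collection, and field names are all invented). Documents in the same collection don't need to share fields, which is what makes sparse, heterogeneous data easy:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
specimens = client["csg_demo"]["specimens"]   # made-up database/collection

# No schema: each document carries whatever fields it happens to have.
specimens.insert_many([
    {"kind": "field_note", "text": "Observed at the north shore", "date": "2012-09-12"},
    {"kind": "genome", "sequence": "ACGTACGTAA", "read_length": 100},
    {"kind": "video", "url": "http://example.edu/clip.mp4", "duration_s": 42},
])

# Query across the heterogeneous collection by whatever field is present.
for doc in specimens.find({"kind": "genome"}):
    print(doc["read_length"])
```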
Internet2/Net+/Microsoft agreement – uses the Azure cloud, waives data transfer fees, offers MS MapReduce and connections to SQL Server.
MongoDB is very good with textual data. Penn State is using it in analysis of political speech, tracking real-time proceedings in the Congressional Record and correlating snippets of speech with speaker, time of year, what was being discussed, etc.
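A hedged guess at what a query over that kind of corpus might look like – this isn't Penn State's actual schema, just a pymongo aggregation over hypothetical speaker/date fields to count snippets per speaker in one month:

```python
from pymongo import MongoClient

# Hypothetical collection of Congressional Record snippets.
snippets = MongoClient()["csg_demo"]["speech_snippets"]

pipeline = [
    {"$match": {"date": {"$gte": "2012-09-01", "$lt": "2012-10-01"}}},
    {"$group": {"_id": "$speaker", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]
for row in snippets.aggregate(pipeline):
    print(row["_id"], row["count"])
```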
Studying little cichlid fish in Lake Malawi, where evolution is happening very quickly. Field notes, videos, measurements, genome sequences, etc. Digitizing the species and inviting people from the outside to identify new genera and species of fish from the data.
A question calls genomics the “anti-LHC” – it has PII and intellectual property issues and has to be protected, considerations which have to be in scope. When publishing data, how do you keep those considerations in mind?
John Board from Duke is talking about the current state on campuses. Improving the campus network is important – it’s good for routine operations but doesn’t work well for large data operations.
VRFs (VPNs) are configured by central IT – would like to give more control to scientists.
IPS/IDS can add latency and complexity
External big data collaborations are the norm.
Embracing investigations into Software Defined Networking – they don’t want to build a whole new network, but to augment it with special services for high-throughput big data users.
SDN-mediated access to the most likely suspects – use cases are genome folks accessing shared clusters over prepositioned routes that don’t go up through the interchange layer, for traffic where it has been decided that screening isn’t needed.
Also want to establish private SDN links between sites that bypass the campus network.
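Roughly, the policy behind that is a short table of pre-approved flows that get to skip the screened path. Here's the decision logic sketched in plain Python – not an OpenFlow controller, and the subnets and ports are made up:

```python
from ipaddress import ip_address, ip_network

# Flows that have been reviewed and pre-approved to take the
# prepositioned, unscreened route (hypothetical subnets and ports).
BYPASS_RULES = [
    (ip_network("10.20.0.0/16"), ip_network("10.30.0.0/16"), 2811),   # GridFTP control
    (ip_network("10.20.0.0/16"), ip_network("10.30.0.0/16"), 50000),  # example data channel
]

def route_for(src, dst, dport):
    """Pick the path for a flow: the SDN bypass for pre-approved
    big-data traffic, the screened campus path for everything else."""
    for src_net, dst_net, port in BYPASS_RULES:
        if ip_address(src) in src_net and ip_address(dst) in dst_net and dport == port:
            return "bypass"
    return "screened"

print(route_for("10.20.1.5", "10.30.9.9", 2811))  # bypass
print(route_for("10.20.1.5", "8.8.8.8", 443))     # screened
```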
One of the big challenges is training the core campus networking staff on these new technologies.
Tim Gleason is talking about what’s happening at Harvard.
Lots of spinning disk – 5 petabytes, 10 petabytes of total data across six data centers, with 20-25k individual disks. Not necessarily a good thing when trying to reduce energy consumption.
Mixed funding model with funded staff and annual subscription model for storage.
Long-term archival storage remains a problem: $200/TB/yr/copy for RAID6 disk.
Moving to a new higher-efficiency green computing center, the MGHPCC, shared with other institutions.
Challenges – Sharing storage across schools and domain areas – largely an identity and access management issue. Data security – access to the data in repositories needs to be vetted with security for appropriate use of data.
Archive storage – maintaining the data after publication is daunting. Who pays after the grant runs out?
Barriers to storage as a service.
Thomas Hauser talks about the University of Colorado at Boulder.
Research Computing is a new organization at Boulder. The lasting significance of data is a challenge – how long does it need to persist? Describing the data is also a challenge.
They try to partner with faculty and researchers, being partners on the ground so they know what’s coming.
They have a physically separate network infrastructure. They’re building a PetaLibrary storage infrastructure.
They have a 13k core cluster for users.
Offering research data services, collaborating with the Libraries. Researchers hand off datasets for archiving and preservation.
Projects – National Snow and Ice Data Center – good expertise on curating long-term data. High Energy Physics runs an Open Science Grid site as part of the LHC experiment. BioFrontiers institute does genomic research.
Scientific Data Movement – large data transfers end to end. Multi-domain, very complex. The separate network grew out of Physics realizing they weren’t getting the data transfer rates they were expecting. It showed that real collaboration is needed among all the parties.
Science DMZ – Architectural split – enterprise vs. science use. Migration of big data off the LAN. It’s a paradigm shift – learn to trust things.
Using the right tools – monitoring (perfSONAR), dynamic allocation of bandwidth – DYNES, OSCARS, OpenFlow.
Science DMZ key successes – using GridFTP and Globus Online.
Storage – a petabyte of Lustre scratch, filling rapidly. NSF-funded PetaLibrary project. Different tiers for different sizes: a TB for free, up to 100 TB subsidized, above that you have to buy. Using HSM to move older files that haven’t been accessed.
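The "move older files that haven't been accessed" piece is just an age-based policy. Here's the selection step sketched in Python – in practice the HSM/policy engine does this, and the cutoff and path here are invented:

```python
import os
import time

CUTOFF_DAYS = 180              # made-up policy threshold
SCRATCH = "/scratch/project"   # hypothetical scratch path

def migration_candidates(root, cutoff_days=CUTOFF_DAYS):
    """Yield files whose last access time is older than the cutoff -
    the set an age-based policy would push to the cheaper tier."""
    cutoff = time.time() - cutoff_days * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except OSError:
                continue   # file vanished or unreadable; skip it

for path in migration_candidates(SCRATCH):
    print(path)
```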
Data Management Support – data.colorado.edu offers basic services with existing staffing. Teaching researchers to develop business plans so that funding for data management services is built into their funding plans.
Raj Bose from Columbia is talking about the newly announced Institute for Data Sciences and Engineering. Led by engineering faculty, but it will bring in strengths from across campus – Journalism, the Medical Center, Earth sciences, etc. 30 faculty in the next five years, more later. Attracting faculty will depend partly on the computing infrastructure that’s available. Concentrating on centrally managed, scalable things.
Serge Goldstein from Princeton – an update on his Pay Once, Store Endlessly idea: not enough time has passed to report yet. But what has happened at Princeton with the availability of DataSpace (the archiving service) and the NSF data management plan mandate? Data disposition for submitted NSF grants: 441 proposals since January, 333 of which have a data management plan. The rest are mostly multi-institutional grants where the data management plan is submitted by another institution, plus a small number that say they have no data to manage (training, outreach, etc.). A majority of grants plan to store data in an existing disciplinary repository (24%) or on the researcher’s web site (30%); he’s hoping over time to convince the latter folks to put data into DataSpace, which provides a URL for them to link from. 17% will publish data as part of an article, 21% will store data in their office, and 13% will store it in DataSpace.
Curt Hillegas from Princeton – the new HPC Research Center. Leasing fiber on two paths to get to the data center, which is 3 miles away. Some researchers don’t want to locate their equipment that far away, due to latency; they’re measuring latency now. Lower-density systems can be cooled with outside air when temperatures are low. 12,500 square feet that will fit 408 racks.
Large storage system based on GPFS. About to replace that with 1.5 petabytes, also GPFS. Scratch is a challenge – it’s consistently full because researchers don’t want to manage the data. Trying to use the management features within GPFS to create a tiered system that will manage the data.
Trying to figure out tiers of security to advise the IRB on what the rules should be for handling very sensitive data.
Mairead Martin from Penn State is talking about Data Curation and Preservation.
It takes a lot of human capital to do data curation and preservation properly. We’ve been doing preservation for systems for a long time, but now we’re talking about bit-level preservation of the data itself.
In this space:
NSF DataNet: DataONE, Data Conservancy. Creating data models, metadata, cyberinfrastructure.
Preservation Networks:
LOCKSS, MetaArchive, Chronopolis, DPN
National Digital Stewardship Alliance
University of California Curation Center (UC3)
Johns Hopkins Sheridan Libraries (DataPub initiative – how do you link data storage with the scholarly publication lifecycle?)
CurateCAMP – an unconference
Digital Curation Centre (UK)
What does this mean for IT? Infrastructure provisioning (storage and archival storage – not just backup; selection and appraisal; repository/DB platforms; access management). Coordinated service offerings – so faculty know who to talk to. Sustainable service models – how do we pull that off? Collaboration with libraries, archives, information managers, and research offices.
Whew! Quite a start to the day – on to the first break!