Data Management Challenge
Distributed Set of Collaborators
Existing Data Sets (sometimes curated, sometimes less)
Taking advantage of commodity data storage
Researchers widely distributed, data widely distributed
Pre-Stage, then Write Back – Read/write workload, widely distributed. Tend to make assumption that researcher is data management expert.
Goal: Enable a scalable number of collaborators and apps to share access to data independent of where it’s stored: minimize operational burden on users; maximize use of commodity infrastructure; maximize aggregate I/O performance
Syndicate Solution – Add a CDN into the mix. Distribute big data in the same way Netflix distributes video, using caching.
Syndicate gateways sit between players and the caching transport (which uses HTTP instead of TCP).
Metadata service on the side which has to scale.
Result is all collaborators share a volume.
Gateways bridge application workflow and HTTP transport, e.g. IRODS, Hadoop drivers. Data gets acquired from existing repositories, or commodity storage (e.g. S3 or Box) which get treated as block storage. Gives ability to spread risk over multiple backing stores.
Metadata store manages data consistency and key distribution. Uses adaptive HTTP streaming. Plays a role of security distributing credentials which are delivered through the same CDN.
Shared volume is mounted just like Dropbox. Can auto mount volumes into multiple VMs like in EC2.
Service composition – Syndicate = CDN + Obect Store + NoSQL DB
CDN gives scalable read bandwidth (Akamai Hyper Cache and RequestRouter). Built a CDN for R&E content on Internet2s footprint.
Object store – gives data durability. (S3, Glacier, DropBox, Box, Swift).
NoSQL DB (Google App Engine) for metadata service
Multi-tier cloud. Commodity cloud is one of the tiers. You could contribute private clouds into the project too. Internet2 Backbone. Regional & Campus (4 servers minimum). -> End Users
Caching hierarchy – some in the private cloud, less at the regional & campus side.
Request Router in the I2 backbone, which tells you which cache to get service from.
Value Proposition: Cloud Ready (Allows users to mount shared voluments into cloud-hosted VMs with minimal operational overhead; Adapt to existing workflows – makes it easy to integrate existing user workflows. There are ways to build modular plugins on user side or data side. Sustainable Design – I got a big data problem, I need to connect commodity disk to my workload.
Will first be used by IPlant community. More info at opencloud.us
Scientific Data with Syndicate – John Hartman, Univ. of Arizona Computer Science
Use Hadoop and Syndicate to support “big data” science – meta-genomics research w/ Bonnie Hurwitz, Agriculture and Biosystems.
Meta-genomics – statistical analysis of DNA rather than sequencing entire genomes. Sequencing produces snippets of DNA (called reads) – requires very pure samples of DNA. Instead, look at samples in the environment, e.g. compare population of reads in one sample with reads in another sample. Tara Oceans Expedition; Bacterial infections; Colon cancer.
Tara is a ship collecting information about the oceans, taking samples to enable comparative analysis. Currently about 9 TB of data from ship.
Looking at bacterial infections with Dr. George Watts. Treatment depends on identifying characteristics of the bacteria, so ideal to perform meta-genomic analysis on an infection to classify and determine treatment.
Analysis Techniques – originally custom HPC applications with manual data staging; now- Hadoop application with manual data staging; future – Hadoop application with data access via Syndicate.
Hadoop: open-source MapReduce; includes Hadoop Distributed File System, so storage nodes for the computation nodes. Tasks are run on local data when possible – hadoop task scheduler knows data location. Data must be manually staged into HDFS, and Hadoop does are managed by central controller.
Trying to allow remote Hadoop data access: Storage in iRODS and HDFS; Transport by Syndicate and HTTP; enable federation between Hadoop clusters.
Storage-side functionality: delivers data sets to Syndicate via SG; Publish/subscribe mechanism keeps datasets up to date via Rabbit MQ. Integration of Syndicate and iRODs authentication mechanisms.
Working on federating Haddop clusters via Syndicate, so clusters can pull data from each other.
Challenges – Identity Management; While-file write (need to write results back to storage; syndicate designed for file reads, block writes) have to provide consistency at dataset level; Performance.
Biologists are thrilled when this works at all.