Ray Clarke, Oracle
SNIA – 100 Year Archive Requirements Study – key concerns and observations:
– Logical and physical migration do not scale cost-effectively
– A never ending, costly cycle of migration across technology generations
Lots of challenges – Oracle (of course) offers solutions across the stack.
The ability to monitor workflow as data moves through a system is important.
There will always be a plethora of different types of media and storage, and it is important to manage that. Shelf life and power/cooling consumption vary across technologies, and tape does better than disk on both, as well as on bit error detection. The cost ratio for a terabyte stored long term on SATA disk vs. LTO-4 tape is about 23:1; for energy cost it is about 290:1. Tapes can now hold 5 TB uncompressed on a single cartridge. There are ways to deploy a tiered architecture of different kinds of media, from flash, through disk, to tape for archival storage.
We need more data classification to understand how best to store it.
Oracle Preservation and Archiving Special Interest Group (PASIG), founded 2007 by Michael Keller at Stanford and Art Pasquinelli at Oracle.
Jeff Layton, Dell
Looking at three aspects of DLM:
Data availability – how do you make your repository accessible to users? A perfect example is iRODS.
Data preservation – the “infamous problem of bit-rot” – make sure that data stays the same. Experiments with extended file attributes, and being able to restore in the background.
Metadata techniques – How, what, when, why of data. The key is getting users to fill it out. How do we help the users make this easy? Should be part of the workflow as you go. Investigating as part of job scheduling.
Dell is acquiring pieces – Ocarina (data compression), EqualLogic (data tiering), Exanet (scalable file system). Ocarina can actually compress even compressed data another 20%.
Dell prototyping/testing on data access/search methods and extended file attributes for metadata and data checksums for fighting bit-rot. The idea is putting the metadata with the data.
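To make the "metadata with the data" idea concrete, here is a minimal sketch of storing a checksum in an extended file attribute and re-checking it later to detect bit-rot. This is not Dell's prototype; the attribute name `user.checksum.sha256` and the function names are assumptions for illustration. It relies on Python's `os.setxattr` / `os.getxattr`, which are only available on Linux and only work on file systems with user extended attributes enabled.

```python
import hashlib
import os

# Hypothetical attribute name; user attributes on Linux live in the "user." namespace.
XATTR_NAME = "user.checksum.sha256"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tag_file(path):
    """Compute the file's checksum and store it in an extended attribute."""
    checksum = sha256_of(path)
    os.setxattr(path, XATTR_NAME, checksum.encode("ascii"))
    return checksum


def verify_file(path):
    """Return True if the stored checksum still matches the file contents."""
    stored = os.getxattr(path, XATTR_NAME).decode("ascii")
    return stored == sha256_of(path)
```

A background job could walk the repository calling `verify_file` and flag any file that returns False for restore from another copy. Note that many copy and archive tools drop extended attributes unless told to keep them.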
Imtiaz Khan, IBM
Aspects of Lifecycle Management
– Utilization of research
– Data Management
– Storage Management
Current Challenges – Research & Publishing
– Volume, velocity, and variety – e.g. real-time analysis is about heavy volume.
– Discrete rights management – at very granular levels.
– Metadata management
– Long term preservation.
Content analytics and insight – Watson is a great example. Taking text and using natural language processing to extract meaning and leverage that meaning for other applications.
Smart Archive Strategy – content passes through a rule-based content assessment stage before deciding where to put the content (on prem, cloud, etc).
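A rule-based assessment stage of this kind can be sketched as an ordered list of rules where the first match decides the destination. This is only an illustration of the general pattern, not IBM's implementation; the `Content` fields, the thresholds, and the destination names are all made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Content:
    name: str
    size_bytes: int
    days_since_access: int
    under_legal_hold: bool = False


@dataclass(frozen=True)
class Rule:
    description: str
    matches: Callable[[Content], bool]
    destination: str


# Hypothetical rules, evaluated in order; the first match wins.
RULES = [
    Rule("legal hold stays on premises",
         lambda c: c.under_legal_hold, "on_prem"),
    Rule("content untouched for a year goes to the archive tier",
         lambda c: c.days_since_access > 365, "archive"),
    Rule("large, rarely used content goes to the cloud",
         lambda c: c.size_bytes > 10**9 and c.days_since_access > 30, "cloud"),
]
DEFAULT_DESTINATION = "on_prem"


def assess(content, rules=RULES):
    """Return the destination named by the first matching rule."""
    for rule in rules:
        if rule.matches(content):
            return rule.destination
    return DEFAULT_DESTINATION
```

For example, `assess(Content("scan.tif", 2 * 10**9, 60))` returns `"cloud"` under these made-up rules. Keeping the rules as data makes the ordering explicit, which matters when compliance rules must override cost rules.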
IBM has a Long Term Digital Preservation system.
Oracle working on strategies for infrastructure, database, platform, and software as services in clouds.
A question is raised about intellectual property rights – e.g. proprietary compression schemes impeding scientific progress. Long term preservation and access is an important consideration.
Curt asks about middleware that can manage workflow and ease the metadata burden. Oracle does that in their enterprise content management offerings. Dell is considering enabling users to add metadata in the existing workflow, e.g. in job submission or file opening.
A question is asked about PASIG and whether the other vendors have community groups working with higher ed. Dell is working with Cambridge and University of Texas on some of these issues and invites others to participate, but it’s not a formal group. IBM has various community groups (unspecified). Ray notes that PASIG is about the community, not marketing.
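One way to picture capturing metadata at job submission is a small wrapper that refuses the job until a few fields are filled in, then writes them to a sidecar file next to the job's output. This is a sketch under assumptions, not any vendor's product: the required fields, the `.meta.json` suffix, and the function name are hypothetical, and no real scheduler is called.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical minimum set of fields a user must supply at submission time.
REQUIRED_FIELDS = ("project", "description")


def record_job_metadata(output_path, user_fields, job_info):
    """Validate user-supplied metadata and write it beside the job output.

    user_fields: what the user typed at submission (the "why" and "what").
    job_info: what the system already knows (the "how" and "when").
    """
    missing = [name for name in REQUIRED_FIELDS if not user_fields.get(name)]
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))

    record = {
        **user_fields,
        **job_info,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = Path(str(output_path) + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

The point of the design is that the system fills in everything it can (command, user, timestamps) so the user is only asked for what cannot be inferred.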
Are there areas beyond preservation where the vendors are working? Dell has worked with bioinformatics data, making it available. Another example is aircraft data that has to be kept available for the life of the aircraft, and we’re still flying planes from the 1940s. Finding the data is not everything – we have to be able to visualize and mine the data, and that capability has to be wrapped up with the data – it’s all one. IBM’s Smart Archive strategy is not just about preservation, but also compliance and discovery (discovery from the legal perspective is a common use case).
Unstructured data represents about 80% of the data, and is growing geometrically. RFID data is a great example of data that needs to be captured and extracted. Data access patterns are random and IOPS-driven, not sequential.
Serge asks about pricing models for long-term storage. Oracle has an ability to charge for unlimited capacity, charging by cores on the servers. Dell honestly admits they don’t have a good answer – what should be charged for that accommodates moving data across generations of technology? IBM’s pricing strategy is based on storage size. Jeffrey notes that the model has to accommodate how many copies are saved and how often they’re checked for integrity.
Vijay concludes by noting that size of data matters, and we may not want to move terabytes of data but rather bring the compute to the data. He tosses out a thought experiment that the vendors could store the data, for free, and charge for the compute cycles people execute on the data.