Raj opens the discussion by asking who is responsible for big data on campus, and whether we even know who makes that decision.
Turning to repositories, he asks whether any campuses are actively working to fill theirs. A few raise their hands, but most don't.
Moving to the cloud for storage – will there still be a need for local storage expertise? John says the complexity and heterogeneity are increasing and won't go away any time soon. Bruce would like to see us positioned to provide expertise and advice on advanced data handling such as Hadoop. Mark notes that we should be using those technologies to mine our own data sets – e.g., mining application and network logs for security purposes.
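The kind of log mining Mark describes follows the MapReduce pattern that Hadoop popularized. A minimal sketch, assuming an illustrative auth-log format (the log lines and field layout here are invented for the example, not from the discussion):

```python
# MapReduce-style counting of failed logins per source IP,
# the shape of job Hadoop would run at scale over real logs.
import re
from collections import Counter

SAMPLE_LOG = """\
sshd: Failed password for root from 10.0.0.5 port 22
sshd: Accepted password for alice from 10.0.0.7 port 22
sshd: Failed password for admin from 10.0.0.5 port 22
sshd: Failed password for root from 10.0.0.9 port 22
"""

FAILED = re.compile(r"Failed password for \S+ from (\d+\.\d+\.\d+\.\d+)")

def map_phase(lines):
    # Mapper: emit (ip, 1) for each failed-login line.
    for line in lines:
        m = FAILED.search(line)
        if m:
            yield m.group(1), 1

def reduce_phase(pairs):
    # Reducer: sum the counts per IP key.
    counts = Counter()
    for ip, n in pairs:
        counts[ip] += n
    return dict(counts)

failures_by_ip = reduce_phase(map_phase(SAMPLE_LOG.splitlines()))
# e.g. {"10.0.0.5": 2, "10.0.0.9": 1}
```

The same map/reduce split is what lets a cluster parallelize the work: mappers scan log shards independently, and reducers aggregate per key.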
In a discussion of data management and the need to engage faculty in planning for it, Kitty notes that she has had some success hiring PhDs who didn't want faculty positions but had a lot of experience with data; they could talk to faculty in ways that neither librarians nor IT people can. Curt adds that last year's data management workshop identified the need for grad students to be trained in data management as part of learning research methods.
John asks whether people are planning central services for systems that store secure data that cannot be made public. Bill notes that Stanford is definitely building that as part of their repository, including ways to store data that cannot be released for a certain number of years. Carnegie Mellon is also planning to do this. Columbia has a pilot project in this area.
Raj notes that on the admin side, Arizona State is doing a lot of mining of their student data and providing recommendations for students on courses, etc.
Mark notes that we don't have a good story to tell about using groups to control access. Michael says we do have a fairly good story on group technology (thanks to Tom's work on Grouper), but we still need to work across boundaries and to develop new standards, such as OAuth and SCIM.
Mark also suggests that the volume of data generated by Massive Open Online Courses would be a really interesting place to think about big data.
There's some general discussion about the future discoverability of data and the ability to understand and use it when its creators may no longer be available. That's where data curation becomes important. Data curation should include standards for metadata and annotation, as well as processes for migrating data forward as formats change. Ken quotes a scientist as saying, "I would rather use someone else's toothbrush than their ontology."
On to lunch.