Closing Plenary – Cliff Lynch
A record-setting meeting – not just in attendees, but the number of session proposals was far in excess of what’s been seen before.
Security and privacy have become pervasive issues. Concerns about security interplay with notions of moving services onto the net and depending on remote organizations and facilities. Security and privacy are separate things, though interrelated. Privacy itself has become a multi-headed thing. It was traditionally privacy from the state or privacy from your neighbors; now there is a vast enterprise interested in data about you. There are people who think we have approached this wrong, that we should punish people for making nasty use of information rather than for failing to keep it secret. The Snowden revelations represent an enormous breach of security in organizations that are supposed to be the best and are well funded. Some of the revelations are about efforts to undermine security in the national and international networking infrastructure. That suggests that we have a lot to do in improving security – hard to believe in selective compromises that can only be open to the good guys.
The Snowden breach is also a high-water mark of a trend that raises issues for archives and libraries – an example of a large and untidy database of material. This is not a leaked memo, or even the Pentagon Papers. Here we have this big dataset cached in various places. The government is still so uncomfortable with this that it cannot be used as reference material in classes taken by government employees, as they would be mishandling classified documents. How are we going to manage these important caches of source documents, and who is going to do it?
There are any number of other security and privacy problems you can read about in the press. The term “data breach” suggests a singular event. There is evidence that many systems are compromised for a long period of time – that’s an important distinction. We’re seeing a spectacular example at present with Sony, where it appears they may have lost control of their corporate IT infrastructure to the point where they may have to tear it down and build it again. The IETF is looking at design factors and algorithms – we need to do this in our community more systematically. There’s been good material coming out of the Internet2 and Educause joint security task force. Some of this is easy – why are we sending things in the clear when we don’t need to? There is an underlying assumption that it’s a benign world out there. The whole infrastructure around the metadata harvesting protocol is open – who would want to inject bad data about repositories? But we’re using these as major sources of inventories of research data – it would be good if they’re reasonably accurate. CNI will convene, probably in February, to start building a shopping list of things relevant to our community.
Two things that are harder and more painful to deal with: One is the sacrifices we need to make to get licenses for certain kinds of material, especially consumer market materials. Look at the compromises public libraries have made to license materials for their patrons – pretty uncomfortable with privacy choices. Need to reflect long and hard across the entire cultural memory sector. The other thing is levels of assurance – how rigorous do you want to be with evidence in trusting someone. Some identities are issued with an email address, others want to see your passport and your mother. It’s easier to do it right the first time, but sometimes if you do it right it can take forever. We’re building a whole new apparatus about factual biography and author identity – no agreement or even discussion about what our expectations are around this. Do we want to trust people’s assertions, or do we want it verifiable? Part of the problem is it’s hard to understand how big the problem is.
In the commercial sphere it is stunning how much we don’t know about how much personal information is passed around and reused.
Another clear trend: Research data management. We are still waiting eagerly (and with growing impatience) for policies and ground rules from the funding agencies about implementing OSTP directives. In the broader context we’re seeing a focus on data, data sharing, big data. Phil Bourne was appointed as the first associate director for data science at the NIH – creation of that role underscores how important they see research data and data management. Seeing this in other agencies and in business. City governments are getting involved in big data and the emergence of centers in urban informatics. SHARE will be a backbone inventory and analytic tool for understanding research data responsibilities – it has a clear idea where it’s heading and is starting to move along.
Still many things we’re not coping with well in this area – data around human subjects. Not sure we have a good conversation between those who are concerned about privacy and those who see what can be accomplished with information. A story that illustrates developments and fault lines: there’s a whole alternative social science emerging out there, with studies that could never have been done within universities but that are fairly respectful of privacy.
Sometime earlier this year the Proceedings of the National Academy of Sciences published a paper jointly authored by researchers at Cornell and Facebook. The topic was emotional contagion – if your circle of friends shares depressing information then you will reflect depressing information back. They came up with the idea to test this on Facebook. Need a lot of people in the experiment. They twiddled the Facebook feed algorithm to bias it towards depressing items and then did sentiment analysis on what the people receiving those items went on to post. Around 60k people. They found there was a little truth in the theory.
People started freaking out in various directions: academics (what IRB allowed this? where were the informed consent forms?); another group said this seemed fairly harmless, can’t really be done with informed consent, and should be viewed as a clever experiment. This hasn’t been resolved. Some people are worried that things that are normally product optimizations could be reframed as human subject experiments. Some people are wondering whether we need a little regulation in this area. There was a conference at MIT on digital experiments. Large enterprises are doing thousands of tests a year, with sophisticated statistics, to tweak their optimizations.
The part of the Facebook thing that was surprising is that there were a lot of unhappy Facebook users offended about the news feed algorithm being messed with – without understanding that there are hundreds of engineers at Facebook messing with it all the time. It put a spotlight on how little people understand how much their interactions are shaped by algorithms in unpredictable ways.
People don’t realize how personalized news has become – we don’t all see the same NY Times pages. What does it mean to try and preserve things in this environment? It’s an intellectually challenging problem that deserves attention as we think about what the important points are to stress in information literacy. What’s appropriate ethically? What about research reproducibility? What evidence can we be collecting to support future research?
There was a CLIR workshop on Sunday about things we have in archives and special collections that need to be restricted in some way or are in ambiguous status, e.g. things collected before 1900 when collecting and research practices were different. That is some of the only research we will ever have on some things and places, and we have to talk about them no matter how awkward.
Software – we often make casual statements about software preservation and sustainability. Time to take a closer look. There is massive confusion about what sustainability means and how it differs from preservation. Time for more nuance around this. Rates of obsolescence and change – in some sense it’s desirable to keep everybody on the current version, but the flip side is that vendors have enormous motivations to put people through painful, frequent cycles of planned obsolescence. There is some evidence that there are better outcomes for backward compatibility with open source software. We need to understand the forces behind obsolescence cycles and what that implies in areas like digital humanities where there’s not a lot of money to rewrite things every year. We’re seeing new tools built on virtualization technologies for software preservation.
Did an executive roundtable on supporting digital humanities at scale. Linked closely to digital scholarship centers – these are important mechanisms for diffusion of information on technology and methodology. One of the striking things is that we got a lot of people who are looking at the issue of, or planning for, scholarship centers. There’s interest in a workshop for people planning such a center. There are lots of things that have the word “center” in them, with widely varying meanings. It might be a real help to summarize the points of disagreement and the different kinds of things parked under these headings.
If you ask the question of how we are doing in terms of preserving and providing stewardship of cultural memory in our society (including but not limited to scholarly activity), nobody can answer. If you ask whether we are doing better this year than last, we have no idea how to answer. How much would it cost to do 50% better than now? Can’t answer that either. There have been some point investigations – like the Keepers activity, and studies from Columbia and Cornell on what proportion of periodicals are archived. Other than copyright deposit at LC we have no mechanism to get recordings into institutions that care about the cultural record. We’re in a slow-motion train wreck with the video and visual memory of the 20th century – it’s a big problem that, until recently, we couldn’t get a handle on. This will require a big infusion of funds in the interest of cultural memory. Indiana University took a systematic inventory of their problem and then was able to win a sizeable down payment from leadership to deal with it. NY Public have done a study and are just sharing results – their numbers are bigger and scarier than Indiana’s. Getting surveys done is getting a bit easier. There are probably horrible things waiting to be discovered in terms of preserving video games. Preserving the news is another area. Part of the difficulty is the massive restructuring in some of these industries. It’s helpful to think about this systematically in order to prioritize and measure our collective work.