Digital Humanities At Scale: HathiTrust Research Center
John Unsworth, Brandeis
Beth Sandore Namachchivaya, UIUC
Public research arm of the HathiTrust – a large corpus that provides opportunities for new forms of computational investigation. Over 10 million total volumes currently. The bigger the data, the harder it is to move to a researcher’s location. Future research will require moving the computation to the data.
Requirements gathering: 2010-11, sponsored by the library school at UIUC. Did a study interviewing all 22 researchers with Google Digital Humanities Research Awards.
Findings: improve OCR quality where possible; enhance scanned image views for OCR reference and correction; metadata should expose the quality of OCR; better granular metadata about languages (human correction preferred); need bibliographic records in usable form.
Goals for HTRC
– Provide a persistent and sustainable structure to enable scholars to ask and answer new questions: leverage data storage and computational infrastructure at Indiana & Illinois; stimulate community development of new functionality and tools; use tools to enable discoveries.
– Enable scholars to fully utilize content while preventing IP misuse under US copyright law.
One of the early research projects was done by Ted Underwood et al. at Illinois: identify all books published in the 18th and 19th centuries in the HathiTrust corpus, and apply topic modeling to create consistent overall subject metadata. They also ran an experiment looking at the ratio between words that entered the English lexicon before 1150 and words that entered after, in three different genres. To do this kind of work you start by doing a lot of data cleaning. The glory is in analyzing the data, which takes just a small amount of the time.
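A minimal sketch of the lexicon-ratio measurement, not the project's actual code. The word list and its years are hypothetical placeholders (assumed here only to mark "before 1150" versus "after"); a real study would need a dated etymological word list.

```python
import re

# Hypothetical mini-lexicon: word -> placeholder year of entry into English.
# The years only separate "before 1150" from "after"; they are NOT
# authoritative etymological dates.
ENTRY_YEAR = {
    "king": 1000, "house": 1000, "water": 1000,
    "government": 1400, "liberty": 1400, "elegant": 1400,
}

def pre_post_ratio(text, lexicon=ENTRY_YEAR, cutoff=1150):
    """Count tokens whose lexicon entry falls before / on-or-after the cutoff."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pre = sum(1 for t in tokens if t in lexicon and lexicon[t] < cutoff)
    post = sum(1 for t in tokens if t in lexicon and lexicon[t] >= cutoff)
    ratio = pre / post if post else float("inf")
    return pre, post, ratio

sample = "The king loved the house; the government praised liberty and water."
print(pre_post_ratio(sample))
```

Words missing from the lexicon are ignored, so the ratio is only as good as the word list's coverage. Comparing genres would mean running this per volume and aggregating by genre label.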
Cleaning the data: 1. clean up the OCR/assess error. 2. Identify parts of a volume (e.g. articles in a serial, poetry/prose). 3. Remove library bookplates, running headers, etc.
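As an illustration of step 3, here is one possible heuristic for running headers: treat a page's first line as a header if, ignoring page numbers, it opens a large share of the pages. This is an assumed approach for the sketch, not a method described in the talk; the function names and the `min_share` threshold are hypothetical.

```python
import re
from collections import Counter

def _normalize(line):
    # Drop digits (page numbers), collapse whitespace, ignore case.
    return re.sub(r"\s+", " ", re.sub(r"\d+", "", line)).strip().lower()

def strip_running_headers(pages, min_share=0.5):
    """pages: list of pages, each a list of text lines.

    Removes the first line of a page when its normalized form is the
    first line of at least `min_share` of all pages.
    """
    counts = Counter(_normalize(p[0]) for p in pages if p)
    threshold = min_share * len(pages)
    headers = {line for line, n in counts.items() if line and n >= threshold}
    cleaned = []
    for page in pages:
        if page and _normalize(page[0]) in headers:
            cleaned.append(page[1:])
        else:
            cleaned.append(list(page))
    return cleaned

pages = [
    ["THE HISTORY OF ROME 12", "text a"],
    ["THE HISTORY OF ROME 13", "text b"],
    ["CHAPTER II", "text c"],
]
print(strip_running_headers(pages))
```

Real volumes often alternate headers on left and right pages, so a lower threshold or separate odd/even passes may be needed.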
Cleaning/enriching the metadata: 1. resolve uncertain dates like “18??” 2. discard duplicate volumes / select early editions? 3. Add metadata you need for interpretation like gender or genre.
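One way to make a date like “18??” usable is to turn it into an explicit year range, so a query for 19th-century books can still match it. A small sketch, assuming four-character date strings where unknown trailing digits are written as `?`; the function name is hypothetical.

```python
import re

def year_range(raw):
    """Return (earliest, latest) for dates like '1850', '185?' or '18??'.

    Returns None for anything else (e.g. 'n.d.'), or when a '?' is
    followed by a digit, which this sketch does not try to interpret.
    """
    s = raw.strip()
    if not re.fullmatch(r"\d[\d?]{3}", s) or re.search(r"\?\d", s):
        return None
    return int(s.replace("?", "0")), int(s.replace("?", "9"))

for value in ["18??", "185?", "1850", "n.d."]:
    print(value, year_range(value))
```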
Things we could share: period lexicons / variant spellings; gazetteers of proper nouns; OCR correction rules for a period; document segmentation and/or cleaned and segmented text; FRBRization; Cleaned / enriched metadata; code to do all the above.
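A shared period lexicon of variant spellings could be as simple as a lookup table applied word by word. The sketch below assumes a tiny hand-made table; the entries and the function name are illustrative, not an actual HTRC resource.

```python
import re

# Illustrative variant -> modern spelling table (hypothetical sample entries).
VARIANTS = {
    "shew": "show",
    "compleat": "complete",
    "publick": "public",
    "musick": "music",
}

def normalize_spelling(text, variants=VARIANTS):
    """Replace known variant spellings, keeping an initial capital."""
    def repl(match):
        word = match.group(0)
        target = variants.get(word.lower())
        if target is None:
            return word
        return target.capitalize() if word[0].isupper() else target
    return re.sub(r"[A-Za-z]+", repl, text)

print(normalize_spelling("Publick musick will shew a compleat art."))
```

Keeping the table as plain data, separate from the code, is what would make it shareable across projects working on the same period.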
Philosophy – computation moves to data; web services architecture and protocols; registry of services and algorithms; Solr full text indexes; noSQL store as volume store; openID authentication; portal front-end, programmatic access; SEASR mining algorithms.
Infrastructure for computational analysis – algorithms must be co-located with data. Analysis of large parts of the corpus requires significant parallel computing resources, which means batch processing.
Can fair use be determined based on categorization of algorithm? Or is all computational use fair use? Even the public domain content was generated by Google and comes with some contractual limitations on use. What kind (and how much) data gets shipped back to the user as part of a result set? Researchers need context as well as the token that matched. Could be a paragraph, a stanza, a page. Need to examine user-contributed code to see if it abides by the rules.
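To make the "how much gets shipped back" question concrete, here is a sketch of a result set that returns each matching token with a bounded window of surrounding text. The `window` and `max_hits` limits are hypothetical policy knobs assumed for illustration; nothing here reflects an actual HTRC rule.

```python
import re

def keyword_in_context(text, term, window=40, max_hits=10):
    """Return up to `max_hits` matches of `term`, each with at most
    `window` characters of context on either side."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        if len(hits) >= max_hits:
            break
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        hits.append({"offset": match.start(), "context": text[start:end]})
    return hits

text = "It was the best of times, it was the worst of times."
for hit in keyword_in_context(text, "worst", window=5):
    print(hit)
```

A character window is the simplest unit to cap. Returning a paragraph, stanza, or page instead would require the document segmentation discussed earlier.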
Phase 2 availability of resource – March 2013. Workshops @ Digital Humanities 2013 and JCDL. Fix the OCR and Metadata Shortage Community Challenge. Job opportunities – postdoc @ Illinois.