– Project partnership with Google publicly announced in December 2004: scanning 7 million print volumes over 4 to 6 years. Direct scanning costs are borne by Google.
UM receives a copy of all digital files, including OCR and metadata, which can be used to build services. UM can share these files, with some restrictions. Each volume page produces 2.01 files on average, so the total will be about 2.2 billion files and 380 TB of data. That works out to a sustained rate of 3.16 MB per second for four years.
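The notes do not say how the sustained rate was derived. As a rough check, the sketch below recomputes it from the quoted totals, assuming "TB" and "MB" are meant as binary units (TiB and MiB) and a 365-day year. Under those assumptions the result matches the quoted 3.16; with decimal units it would come out nearer 3.0. The per-volume and per-file averages are derived here for illustration only and are not figures from the talk.

```python
# Figures quoted in the notes. Reading "TB"/"MB" as binary units is an assumption.
TOTAL_TIB = 380        # total data volume
YEARS = 4              # period used for the sustained-rate figure
TOTAL_FILES = 2.2e9    # total number of files
VOLUMES = 7e6          # print volumes to be scanned

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year


def sustained_rate_mib_per_s(total_tib, years):
    """Average ingest rate in MiB/s needed to move total_tib in the given years."""
    total_mib = total_tib * 1024 * 1024
    return total_mib / (years * SECONDS_PER_YEAR)


def average_file_size_kib(total_tib, total_files):
    """Mean file size in KiB across all files."""
    return total_tib * 1024 ** 3 / total_files


rate = sustained_rate_mib_per_s(TOTAL_TIB, YEARS)
avg_kib = average_file_size_kib(TOTAL_TIB, TOTAL_FILES)
files_per_volume = TOTAL_FILES / VOLUMES

print(f"sustained rate:    {rate:.2f} MiB/s")
print(f"average file size: {avg_kib:.0f} KiB")
print(f"files per volume:  {files_per_volume:.0f}")
```

The average file size blends large page images with small OCR text files, so it says little about either type on its own.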
Data characteristics: well-defined file formats (image files are TIFF or JPEG 2000; OCR files and metadata are UTF-8 text). Indefinite retention. Files are largely static. Much of the material is in copyright, so security practices are required.
Mbooks service: users can search and view books online.
There’s interest in using the OCR data for textual analysis research.