[CSG Winter 2007] MBooks at University of Michigan

The project partnership with Google was publicly announced in December 2004: scanning 7 million print volumes over 4-6 years. Direct scanning costs are borne by Google.

UM receives a copy of all digital files, including OCR and metadata, which can be used to build services. UM can share them, with some restrictions. Each page of a volume produces 2.01 files on average, which will come to about 2.2 billion files and 380 TB of data. That is a sustained rate of 3.16 MB per second for four years.
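As a rough sanity check on those numbers, here is a small sketch of the arithmetic. The notes don't say how the rate was derived, so the 365.25-day year and the binary reading of "TB" and "MB" (2^40 and 2^20 bytes) are my assumptions; with them the quoted rate falls out, while decimal units give a slightly lower figure.

```python
# Back-of-the-envelope check of the figures quoted in the talk.
# Assumptions (not from the talk): a year is 365.25 days, and "TB"/"MB"
# may be binary units (2**40 / 2**20 bytes) rather than decimal ones.

TOTAL_TB = 380          # total data, from the notes
TOTAL_FILES = 2.2e9     # total files, from the notes
FILES_PER_PAGE = 2.01   # average files per page, from the notes
VOLUMES = 7e6           # print volumes to scan, from the notes
YEARS = 4               # period used for the sustained rate

SECONDS = YEARS * 365.25 * 24 * 3600


def rate_per_second(total_bytes: float) -> float:
    """Average bytes per second needed to move total_bytes in YEARS."""
    return total_bytes / SECONDS


# Sustained rate under the two possible readings of the units.
binary_rate = rate_per_second(TOTAL_TB * 2**40) / 2**20    # about 3.16
decimal_rate = rate_per_second(TOTAL_TB * 10**12) / 10**6  # about 3.01

# Figures implied by the totals (derived here, not stated in the talk).
total_pages = TOTAL_FILES / FILES_PER_PAGE
pages_per_volume = total_pages / VOLUMES
avg_file_kb = TOTAL_TB * 10**12 / TOTAL_FILES / 1000

print(f"binary units:  {binary_rate:.2f} MB/s")
print(f"decimal units: {decimal_rate:.2f} MB/s")
print(f"pages: {total_pages:.3g}, per volume: {pages_per_volume:.0f}")
print(f"average file size: {avg_file_kb:.0f} KB")
```

The implied page count and average file size are only what the totals work out to; the talk did not give them.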

Data characteristics: file formats are well defined. Image files are TIFF or JPEG 2000; OCR files and metadata are UTF-8 text. Retention is indefinite. Files are largely static. Much of the material is in copyright, so security practices are required.

MBooks service: users can search and look at books online.

There’s interest in using the OCR data for textual analysis research.
