The Proteus Project

This page describes the Proteus Books project at the UMass Center for Intelligent Information Retrieval, funded by The Andrew W. Mellon Foundation. The project aims to build and evaluate research infrastructure for scanned books.

Several large collections of scanned books exist, such as the Internet Archive, but much of their content is unstructured and not easily used by scholars in the humanities. The Proteus infrastructure will help scholars navigate and use such collections more easily. Components of the infrastructure include automatic identification of a book's language, linking of multiple editions of canonical works, finding quotations from canonical works, and entity detection. A key aim of the project is to perform all these tasks efficiently at large scale.

There are several components of this project:

Language Identification

The project identified the language of 3,628,227 OCRed books at the Internet Archive. You can find the results on our Language Identification page.

Among those 3.6M books, we identified the cases where our language identification differed from the book's metadata at the Internet Archive. We found 31,947 books whose metadata we believe lists the incorrect language. In addition, for 17,272 books the Internet Archive metadata lists a language that we did not check for. You can find the results on our Language Identification Differences page.
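The comparison described above can be sketched as a three-way classification of each book. This is only an illustration, not the project's code: the language codes, the `CHECKED_LANGUAGES` set, and the function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical set of languages the identifier checks for (illustrative only).
CHECKED_LANGUAGES = {"eng", "lat", "fre", "ger"}

def classify_book(predicted, metadata):
    """Compare a predicted language code with the catalogue's language code."""
    if metadata not in CHECKED_LANGUAGES:
        return "unchecked"  # metadata names a language we never test for
    if predicted == metadata:
        return "agree"
    return "differ"  # candidate for an incorrect metadata language

def summarize(records):
    """records: iterable of (book_id, predicted, metadata) tuples."""
    return Counter(classify_book(p, m) for _, p, m in records)

sample = [
    ("book-a", "eng", "eng"),
    ("book-b", "lat", "eng"),
    ("book-c", "eng", "wel"),
]
counts = summarize(sample)
```

In this sketch, books in the "differ" group correspond to the suspected metadata errors, and books in the "unchecked" group correspond to languages outside the identifier's coverage.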

Books listed with an incorrect language at the Internet Archive may benefit from being re-OCRed with the correct language to improve the resulting text.

Duplicate Detection

Canonical texts of English and Latin works were acquired from the Perseus Digital Library: 803 English works and 401 Latin works.

These works were compared with the text of the English or Latin OCRed books from the Internet Archive to find full and partial duplicates[1]. Partial duplicates are books that contain a work together with extra material such as footnotes or other works, for example Hamlet within The Tragedies of Shakespeare. Duplicates are quickly identified by looking at only the unique words within a book.
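A minimal sketch of the idea of comparing books through their unique words follows. It assumes that "unique" means a word occurring exactly once in a text, and it uses a simple set-overlap score; both are assumptions for illustration, and the actual method is the one described in [1]. The function names are hypothetical.

```python
import re
from collections import Counter

def unique_words(text):
    """Return the set of words that occur exactly once in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {word for word, count in counts.items() if count == 1}

def containment(work, book):
    """Fraction of the work's unique words that are also unique words of the book."""
    work_unique = unique_words(work)
    if not work_unique:
        return 0.0
    return len(work_unique & unique_words(book)) / len(work_unique)

work = "to be or not to be that is the question"
book = "preface by the editor to be or not to be that is the question end"
unrelated = "a completely different text"

score_same = containment(work, work)
score_partial = containment(work, book)
score_none = containment(work, unrelated)
```

A high score suggests that the book contains the work even when the book carries extra material; a score near zero suggests unrelated texts. Because only a small set of words per book is compared, a check of this kind stays cheap on large collections.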

Duplicate Alignment

Once duplicates are identified, they are aligned to identify corresponding portions of the works[2].
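To illustrate what an alignment produces, the sketch below aligns two word sequences with Python's standard `difflib.SequenceMatcher`. This is a generic stand-in chosen for brevity, not the fast alignment scheme of [2], and `align_words` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def align_words(reference, ocr):
    """Return (ref_start, ocr_start, length) for each run of matching words."""
    ref_words = reference.split()
    ocr_words = ocr.split()
    matcher = SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel, which we drop.
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size > 0]

reference = "the quick brown fox jumps"
ocr = "the qu1ck brown fox jumps"
blocks = align_words(reference, ocr)
```

Each tuple marks a run of words shared by both texts. The gaps between runs (here the word "quick" against "qu1ck") point to where the OCR text departs from the reference.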

Entity Extraction

This component will develop the first steps of an entity linking system by improving entity detection performance on the diverse genres and topic domains of the Internet Archive (IA) book collection. The task involves identifying mentions of proper names prior to disambiguating them.

Finding Quotations of Canonical Works

While duplicate detection allows us to find matches of complete works, finding matching quotations is finer-grained. By searching for a quotation, for example Rosencrantz's "Take you me for a sponge" (Hamlet, Act IV Scene II), we can find all occurrences of that quotation, even in books that are not copies of Hamlet. This allows us to see which passages attract the most scholarly interest over time.
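A minimal sketch of quotation search is shown below. It assumes exact matching after normalizing case, punctuation, and whitespace; a search over real OCR text would also need to tolerate recognition errors, which this example does not attempt. The book texts and function names are invented for illustration.

```python
import re

def normalize(text):
    """Lowercase the text, drop punctuation, and collapse whitespace."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def find_quotation(quotation, books):
    """Return the ids of the books whose normalized text contains the quotation."""
    needle = f" {normalize(quotation)} "
    # Padding with spaces keeps matches aligned to whole words.
    return [book_id for book_id, text in books.items()
            if needle in f" {normalize(text)} "]

books = {
    "hamlet": "ROSENCRANTZ. Take you me for a sponge, my lord?",
    "essay": 'As Rosencrantz asks, "Take you me for\na sponge?"',
    "other": "Nothing relevant here.",
}
hits = find_quotation("Take you me for a sponge", books)
```

Here the quotation is found both in the play and in the essay that quotes it, even though the line break and punctuation differ between the two.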


[1] Yalniz I. Z., Can E. F., Manmatha R. Partial Duplicate Detection for Large Book Collections. International Conference on Information and Knowledge Management (CIKM '11), Glasgow, UK, pp. 469-474, October 24-28, 2011.
[2] Yalniz I. Z., Manmatha R. A Fast Alignment Scheme for Automatic OCR Evaluation of Books. International Conference on Document Analysis and Recognition (ICDAR '11), Beijing, China, pp. 754-758, September 18-21, 2011.