This page describes the UMass Center for Intelligent Information Retrieval's Proteus Books project that was funded by The Andrew W. Mellon Foundation. The Proteus Books project's aim is to build and evaluate research infrastructure for scanned books.
Several large scanned book collections exist, such as the Internet Archive, but much of their content is unstructured and not easily used by scholars in the humanities. The Proteus infrastructure will help scholars navigate and use such collections more easily. Components of the infrastructure include automatically identifying a book's language, linking multiple editions of canonical works, finding quotations from canonical works, and entity detection. One of the key aims of the project is to do all these tasks efficiently at large scale.
There are several components of this project:
The project identified the language of 3,628,227 OCRed books at the Internet Archive. You can find the results on our Language Identification page.
From those 3.6M books, we identified the cases where our language identification differed from the book's metadata at the Internet Archive. We found 31,947 books whose metadata language we believe to be incorrect. In addition, there were 17,272 books that the Internet Archive labels with a language we did not check for. You can find the results on our Language Identification Differences page.
Books that have the incorrect language at the Internet Archive may benefit from being re-OCRed with the correct language to improve the resulting text.
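The comparison between detected language and catalog metadata can be sketched as follows. This is a minimal illustration, not the project's actual code: the language codes, the `CHECKED_LANGUAGES` set, the `classify` function, and the sample records are all hypothetical.

```python
from collections import Counter

# Hypothetical set of languages the identifier knows how to recognize.
CHECKED_LANGUAGES = {"eng", "lat", "fre", "ger"}


def classify(metadata_lang, detected_lang, checked=CHECKED_LANGUAGES):
    """Return 'unchecked', 'agree', or 'differ' for one book."""
    if metadata_lang not in checked:
        # The catalog names a language the identifier never tests for,
        # so no judgment about correctness is possible.
        return "unchecked"
    if metadata_lang == detected_lang:
        return "agree"
    return "differ"


# Made-up records: (book id, metadata language, detected language).
books = [
    ("book-a", "eng", "eng"),
    ("book-b", "eng", "lat"),
    ("book-c", "wel", "eng"),
]

summary = Counter(classify(meta, det) for _, meta, det in books)
print(summary)
```

Books in the "differ" group would be the candidates for re-OCRing with the detected language.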
The canonical works were compared with the text of the English or Latin OCRed books from the Internet Archive to find full and partial duplicates. Partial duplicates are books that contain the work along with extra material such as footnotes or other works; for example, Hamlet within The Tragedies of Shakespeare. Duplicates are quickly identified by looking at only the unique words within a book.
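One way to picture duplicate detection from unique words is sketched below. It assumes "unique words" means the set of distinct words in a text, which is only one reading; the project may have used a different definition (for example, words that occur exactly once). The function names and the 0.8 threshold are illustrative assumptions.

```python
import re


def unique_words(text):
    """Set of distinct lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def containment(work, book):
    """Fraction of the work's distinct words that also occur in the book."""
    w, b = unique_words(work), unique_words(book)
    return len(w & b) / len(w) if w else 0.0


def jaccard(work, book):
    """Overlap of the two vocabularies relative to their union."""
    w, b = unique_words(work), unique_words(book)
    union = w | b
    return len(w & b) / len(union) if union else 0.0


def classify_pair(work, book, threshold=0.8):
    """Label a book as a 'full' or 'partial' duplicate, or 'different'."""
    if containment(work, book) < threshold:
        return "different"
    # The book covers the work; extra vocabulary suggests added material.
    return "full" if jaccard(work, book) >= threshold else "partial"


work = "to be or not to be that is the question"
anthology = work + " friends romans countrymen lend me your ears"
print(classify_pair(work, work))
print(classify_pair(work, anthology))
```

Comparing word sets rather than full texts keeps each comparison cheap, which matters when millions of books are involved.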
Once duplicates were identified, they were aligned to identify corresponding portions of the works.
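As a rough illustration of alignment, the sketch below uses `difflib.SequenceMatcher` from the Python standard library to find runs of words shared by two texts. This is a stand-in technique chosen for brevity, not the alignment method the project used; the `align` function and its `min_len` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def align(work_words, book_words, min_len=3):
    """Return (work_index, book_index, length) for each shared run of words."""
    # autojunk=False stops difflib from discarding very frequent words.
    matcher = SequenceMatcher(None, work_words, book_words, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # also drops the zero-length terminator block
    ]


work = "to be or not to be that is the question".split()
book = "preface by the editor to be or not to be that is the question end".split()
print(align(work, book))
```

Each tuple says that a run of words starting at one position in the work corresponds to a run starting at another position in the book.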
This component develops the first steps of the entity linking system by improving entity detection performance on the diverse genres and topic domains of the Internet Archive book collection. The task involves identifying mentions of proper names before disambiguating them.
While duplicate detection allows us to find matches of complete works, finding matching quotations is finer grained. By searching for a quotation, for example, Rosencrantz's "Take you me for a sponge" (Hamlet, Act IV Scene II), we can find all occurrences of that quotation, even in books that are not copies of Hamlet. This allows us to see which passages attract the most scholarly interest over time.
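A minimal sketch of quotation search is shown below. It matches the quotation as an exact sequence of normalized words, so it ignores case and punctuation but would miss quotations damaged by OCR errors, which a real system would presumably need to tolerate. The function names and sample texts are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphabetic words, with punctuation dropped."""
    return re.findall(r"[a-z]+", text.lower())


def find_quotation(quote, books):
    """Map each book id to the word positions where the quotation starts."""
    q = tokens(quote)
    if not q:
        return {}
    hits = {}
    for book_id, text in books.items():
        words = tokens(text)
        positions = [
            i
            for i in range(len(words) - len(q) + 1)
            if words[i:i + len(q)] == q
        ]
        if positions:
            hits[book_id] = positions
    return hits


# Made-up snippets standing in for full book texts.
books = {
    "hamlet": "Ros. Take you me for a sponge, my lord?",
    "essay": "As Rosencrantz asks, 'Take you me for a sponge', so might we.",
    "other": "Nothing relevant here.",
}
print(find_quotation("Take you me for a sponge", books))
```

Counting hits per book by publication year would give a rough picture of how interest in a passage changes over time.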