The Center for Intelligent Information Retrieval at the University of Massachusetts, Amherst, has been conducting research, with partial support from the NSF, on document analysis, information extraction and retrieval for digital libraries.

For the DPLA, we propose the development and deployment of Proteus, an infrastructure to enable library patrons to discover and find connections within and across books. Examples include finding how quotations occur over time, discovering which parts of books are republished and in which combinations, recognizing names of people or locations and disambiguating them using Wikipedia, and searching the books’ contents, the entities mentioned in them, and the pictures they contain.

Many current techniques assume that documents are error-free and that the metadata is accurate. For example, current named entity recognizers are mostly trained on clean newswire and do not work as well on scanned books. Other attempts use human annotation to try and create error-free data, a laudable goal, but one that is extremely difficult for the amount of data envisioned in the DPLA.  Instead, Proteus incorporates state-of-the-art techniques that are trained to be more robust to errors in the recognition of scanned books. These methods are efficient, scalable to millions of books, and can optionally be structured to exploit user annotations where available.

We believe that people will use Proteus to explore books in ways not possible today or in ways we cannot even imagine. For example, they can use it to explore how later novelists have repurposed Shakespearean texts or how language has changed over time; they can use it to locate all the places in Grant’s Memoirs or compare annotated editions of Vergil’s Aeneid.

To communicate our vision, we have built an initial demonstration system that automatically annotates, links, indexes, and searches scanned books for text, pictures, and named entities. In what follows, we explain the architecture and provide a walkthrough of the Proteus Book Search system.