We have incorporated data from the Internet Archive's scanned book collection into a Proteus repository that supports several services. Most importantly, full-text search uses UMass's Galago search engine, which supports state-of-the-art retrieval models with complex, structured queries.

To demonstrate handling objects at multiple levels of granularity, we support retrieval of books, pages, images, and entities. Book and page retrieval combine evidence from text content with evidence from metadata fields. (Other structures in documents, such as book chapters or articles in collections, would be handled similarly.) Image retrieval uses evidence from the text surrounding figures extracted from books.
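As a rough illustration of combining text evidence with metadata fields, the sketch below scores a record by a weighted sum of per-field term matches. It is a toy model, not the retrieval model Galago uses: the field names (`text`, `title`, `author`), the weights, and the functions `field_score` and `combined_score` are all assumptions made up for this example.

```python
def field_score(query_terms, field_text):
    """Fraction of query terms that occur in one field (toy evidence measure)."""
    if not query_terms:
        return 0.0
    tokens = set(field_text.lower().split())
    hits = sum(1 for term in query_terms if term in tokens)
    return hits / len(query_terms)


def combined_score(query, record, weights):
    """Weighted sum of field scores; missing fields contribute nothing."""
    terms = query.lower().split()
    return sum(
        weight * field_score(terms, record.get(field, ""))
        for field, weight in weights.items()
    )


# Hypothetical weights: body text counts most, metadata fields less.
WEIGHTS = {"text": 0.7, "title": 0.2, "author": 0.1}

# The same scoring applies at either granularity: a book or a single page.
book = {
    "text": "the whale surfaced near the ship",
    "title": "Moby Dick",
    "author": "Herman Melville",
}
page = {
    "text": "call me ishmael",
    "title": "Moby Dick",
    "author": "Herman Melville",
}

ranked = sorted(
    [("book", book), ("page", page)],
    key=lambda item: combined_score("moby whale", item[1], WEIGHTS),
    reverse=True,
)
```

Here the book ranks above the page for the query "moby whale" because it matches in both the text and the title, while the page matches only through the title it inherits.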

Entity retrieval is more unusual. A search for a "named entity" (an object of reference such as a particular person, place, or organization) returns an automatically constructed "document" that consists of all the passages that refer to that entity, together with a summary and alternate names from Wikipedia.
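One way to picture such an entity "document" is sketched below. It assumes that entity mentions have already been detected and resolved upstream, and that a lookup table of Wikipedia summaries and alternate names is available. The names `EntityDocument` and `build_entity_documents`, and the shape of the inputs, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EntityDocument:
    """Pseudo-document for one entity: reference data plus mentioning passages."""
    name: str
    summary: str = ""
    alternate_names: list = field(default_factory=list)
    passages: list = field(default_factory=list)


def build_entity_documents(mentions, reference):
    """Group passages by the entity they mention.

    mentions:  iterable of (entity_name, passage) pairs from an upstream tagger
    reference: dict mapping entity_name -> (summary, alternate_names)
    """
    docs = {}
    for name, passage in mentions:
        if name not in docs:
            summary, alternates = reference.get(name, ("", []))
            docs[name] = EntityDocument(name, summary, list(alternates))
        docs[name].passages.append(passage)
    return docs


# Made-up sample data for the sketch.
mentions = [
    ("Abraham Lincoln", "Lincoln spoke at Gettysburg in 1863."),
    ("Gettysburg", "Lincoln spoke at Gettysburg in 1863."),
    ("Abraham Lincoln", "Mr. Lincoln was born in Kentucky."),
]
reference = {
    "Abraham Lincoln": ("16th president of the United States.", ["Honest Abe"]),
}

entity_docs = build_entity_documents(mentions, reference)
```

Each resulting object could then be indexed and searched like any other document, with the collected passages serving as its text.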

Browse the sample searches below, or try some of your own.