Comparison


There are a number of digital libraries of books. Most use a commercial or public search engine to search the metadata (bibliographic records) for their own books. Several, such as the Internet Archive (IA) and the Hathi Trust, provide both bibliographic search and full-text search of the optical character recognition (OCR) output using open source code such as Solr/Lucene.

These approaches, however, largely assume that the metadata is accurate and that the OCR output from scanned books is clean. In practice neither assumption holds, which leads to several difficulties.

  1. Without appropriate handling, OCR errors can degrade search accuracy for commonly used short queries. 

  2. Books contain a lot of information that is not directly accessible through straightforward full-text search. For example, one might want to find the list of people and organizations in a book, or to search for specific pictures across books. While existing named entity recognizers can be applied to such tasks, they are almost universally trained on clean newswire and modern text, so they perform poorly on older books and especially on books with OCR errors.

  3. Book metadata is not completely accurate (and probably never will be). This is a problem not just for libraries but also for commercial companies with large electronic catalogs. For example, a book's language is often recorded inaccurately. This matters for automatic processing because knowing the language improves OCR accuracy as well as language tasks such as named entity recognition. As a simple example, the Internet Archive's system tries to recognize the Fraktur font in old German books, even though its OCR is not capable of doing so, because that particular rendering of German is not reflected in the metadata. The OCR output for such books is unusable. The issue is further complicated because some books switch between languages, a situation that is rarely recorded in the metadata and certainly not in enough detail (which pages are in French?) to be used by most tools. Proteus incorporates approaches that analyze the “well-formedness” of a book's language to determine whether the book has been correctly recognized by the OCR engine.

  4. Metadata can cause problems even when it is completely accurate. For example, different versions of Shakespeare's Hamlet may list different authors (e.g., the editor's name may be added alongside Shakespeare's name), and even the title may vary. This inconsistency makes it difficult to find the different versions of Hamlet using the metadata alone.
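
The “well-formedness” analysis mentioned in item 3 could be approximated in many ways. The sketch below is one simple possibility, not the method Proteus actually uses: it scores a passage by the fraction of its alphabetic tokens that appear in a word list for the expected language. The names `ENGLISH_WORDS`, `well_formedness`, and `looks_misrecognized`, the tiny lexicon, and the threshold of 0.2 are all assumptions chosen for illustration.

```python
import re

# Tiny illustrative lexicon (an assumption for this sketch). A real system
# would use a large dictionary or a character n-gram language model.
ENGLISH_WORDS = {
    "the", "of", "and", "to", "in", "a", "is", "that", "was", "it",
    "for", "with", "as", "his", "he", "be", "not", "by", "this", "had",
}


def well_formedness(text, lexicon):
    """Return the fraction of alphabetic tokens that appear in the lexicon."""
    # [^\W\d_]+ matches runs of letters only, including non-ASCII letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def looks_misrecognized(text, lexicon, threshold=0.2):
    """Flag text whose score falls below a hypothetical threshold."""
    return well_formedness(text, lexicon) < threshold


clean = "It was the best of times, it was the worst of times"
garbled = "Tbe qnick hrovvn f0x iumps ovcr tlie 1azy d0g"

print(well_formedness(clean, ENGLISH_WORDS))
print(looks_misrecognized(clean, ENGLISH_WORDS))
print(looks_misrecognized(garbled, ENGLISH_WORDS))
```

A score computed per page rather than per book would also help locate the point at which recognition breaks down, for example where a book switches into a script the OCR engine cannot handle.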

Because of these issues, Proteus leverages all metadata that it can find, but relies heavily on direct processing of the full text of a book. It includes methods that are explicitly trained to be robust in the face of OCR errors and inconsistent or inaccurate metadata. Finally, it includes sanity-checking algorithms that recognize when the metadata and the text appear to be at odds with each other.
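
As an illustration of what such a sanity check might look like, the sketch below guesses the language of each page from stopword counts and reports the pages that disagree with the language recorded in the metadata. This is a minimal sketch under stated assumptions, not a description of the algorithms in Proteus: the `STOPWORDS` table, the three-letter language codes, and the functions `guess_language` and `metadata_conflicts` are hypothetical.

```python
import re

# Small hypothetical stopword lists keyed by three-letter language codes.
STOPWORDS = {
    "eng": {"the", "and", "of", "to", "in", "that", "is", "was"},
    "ger": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fre": {"le", "la", "les", "et", "est", "une", "des", "que"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    # With no stopword evidence at all, decline to guess.
    return best if counts[best] > 0 else None


def metadata_conflicts(metadata_language, pages):
    """Return indexes of pages whose guessed language contradicts the metadata."""
    conflicts = []
    for index, page in enumerate(pages):
        guess = guess_language(page)
        if guess is not None and guess != metadata_language:
            conflicts.append(index)
    return conflicts


pages = [
    "The king is dead and the prince was told.",
    "Le roi est mort et la reine est triste.",
    "1234 ####",
]
print(metadata_conflicts("eng", pages))
```

Working page by page means the same check can answer the "which pages are in French?" question for books that switch languages. Pages with no recognizable stopwords are skipped rather than flagged, so pictures and tables do not raise false alarms.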