Technologies
The Proteus architecture involves several stages of document analysis, indexing, and retrieval. We review some of the component technologies and discuss our research on improving them.
The document processing pipeline currently starts with the OCR output generated by the Internet Archive, or with digital texts from sources such as Project Gutenberg. The OCR output maintains the image coordinates for each character and word, which can be helpful when analyzing page layout and constructing unique identifiers.
The first stage of processing is language identification at the document level. This gives us a distribution over the languages contained in a document, e.g. 90% English, 6% French, 4% unknown. This allows us to load the appropriate NLP models for further processing or to filter out, early in the process, books in languages such as Russian that we don’t yet support. We also use language identification to correct the metadata for books that are erroneously classified and, worse, OCR’d with the wrong models. Finally, a book with a large proportion of “unknown” words may simply be set in a typeface, such as German Fraktur, which the Internet Archive’s OCR does not currently support.
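As a minimal illustration of the kind of per-document language distribution described above, a probabilistic off-the-shelf detector such as the langdetect package can be used; it is not the (faster) component developed for the production pipeline.

```python
# A minimal sketch using the off-the-shelf langdetect package
# (pip install langdetect); not the pipeline's own language-ID component.
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make the probabilistic detector deterministic

def language_profile(text):
    """Return a rough distribution over languages for a stretch of text."""
    return {guess.lang: round(guess.prob, 3) for guess in detect_langs(text)}

# language_profile(book_text) might yield something like {'en': 0.91, 'fr': 0.06, ...};
# text the detector cannot place (e.g., poor OCR of Fraktur) shows up as
# low-confidence or missing probability mass.
```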
We then proceed to layout analysis, which identifies informational structures on the page, such as paragraph breaks, running headers and footers, and page numbers. Paragraphs, sentences, and words broken by page boundaries are reassembled.
Next comes tokenization, which varies depending on the language. In general, this process breaks the stream of characters into more linguistically regular units; for instance, “cheese!” becomes “cheese !”. In addition, words that are hyphenated across line breaks are disambiguated by looking at the set of other tokens in the book, falling back on a background dictionary. In a 20th-century book, when “base- ball” is broken across lines, we should recover the canonical token “baseball”; in a 19th-century book that contains “base-ball” in the middle of various lines, the word would remain hyphenated. After tokenization, we perform sentence splitting to provide input to the remaining steps of linguistic processing.
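A minimal sketch of this dehyphenation decision, with hypothetical function and variable names, might look as follows; the real tokenizer makes this choice together with the layout information described above.

```python
from collections import Counter

def resolve_hyphen(prefix, suffix, book_counts, background_dictionary):
    """Decide how to rejoin a word split across a line break as "prefix- suffix".
    Prefer the form attested elsewhere in the same book; otherwise fall back
    on a background dictionary; otherwise keep the hyphen."""
    joined = prefix + suffix            # "baseball"
    hyphenated = prefix + "-" + suffix  # "base-ball"
    if book_counts[joined] > book_counts[hyphenated]:
        return joined
    if book_counts[hyphenated] > book_counts[joined]:
        return hyphenated
    # No evidence in the book itself: consult the background dictionary.
    if joined in background_dictionary:
        return joined
    return hyphenated  # conservative default: keep the hyphen

# book_counts would be a Counter over the book's tokens that were not
# themselves split across lines, e.g. Counter(unbroken_tokens_of_book).
```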
Then morphological analysis assigns word classes, also known as part-of-speech (POS) tags, to each token. For English, this is a matter of assigning words and punctuation marks to familiar classes of noun, verb, adjective, etc., combined with some morphological information such as “-ed past tense” or “-s plural”. For German, Latin, and other languages with greater degrees of inflection, the tag set is correspondingly larger. These tags provide especially useful features for further processing, such as syntactic parsing and name recognition. For instance, an unknown word, such as “frobnication”, with a familiar ending that marks it as a noun, should behave like similar nouns, for instance in forming entities such as “International Frobnication Corporation”.
Morphological information also allows us to perform lemmatization, which maps inflected forms to standardized citation forms, i.e., the forms under which dictionaries index them. While collapsing instances of “dog” and “dogs”, or “ox” and “oxen”, together provides some small gains for search in English, lemmatization is especially helpful for searches in Latin and other highly inflected languages.
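For English, an off-the-shelf lemmatizer illustrates both the mapping and why the POS tag matters; this is only an illustration, not the component used for Latin and other highly inflected languages.

```python
# pip install nltk; requires a one-time nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dogs"))              # dog
print(lemmatizer.lemmatize("oxen"))              # ox
print(lemmatizer.lemmatize("crossed"))           # crossed (treated as a noun)
print(lemmatizer.lemmatize("crossed", pos="v"))  # cross (with the verb tag)
```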
Syntactic parsing determines the grammatical relations among the words in a sentence. For instance, in “Washington crossed the Delaware”, Washington and Delaware are the subject and object, respectively, of the verb “crossed”. This grammatical structure also allows us to infer constituents, or coherent phrases, as indicated by brackets in “[[[International Frobnication Corporation] ‘s] [initial public offering]] flopped”.
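To make this output concrete, here is how an off-the-shelf dependency parser renders these relations; spaCy is used purely for illustration and is not the parser developed for this project (described below).

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Washington crossed the Delaware.")
for token in doc:
    # Print each word, its grammatical relation, and its syntactic head.
    print(f"{token.text:12s} --{token.dep_:>6s}--> {token.head.text}")

# Typical output includes Washington --nsubj--> crossed and
# Delaware --dobj--> crossed; exact label inventories vary by model.
```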
For a long time, information retrieval systems have used controlled vocabularies to manage the ambiguities of natural language. Similar in intent, though not in execution, is the use of named entities, which refer to particular objects independent of language. Named-entity recognition (NER) is often performed first: it detects the boundaries of names and attempts some classification of their types. NER systems distinguish between the person and place senses of “Washington” but not between the multiple places or people that could be indicated. The next step is entity disambiguation: given an authority list or database of entities, link each of the named-entity mentions, if possible, to one of the entries therein. In the current system, we use the DBpedia compilation of structured data from Wikipedia as an authority list. The metadata about each entity in the database is useful not only for disambiguation but also for providing more context to the user, by displaying a short description or plotting entities with geographic coordinates on maps.
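A toy sketch of the linking step follows, with a hypothetical two-entry stand-in for the DBpedia authority list and a deliberately simple context-overlap score; real disambiguation systems use much richer evidence (entity popularity, coherence with other mentions, and so on).

```python
def disambiguate(mention, context_tokens, authority):
    """Link a mention to the authority entry whose description best
    overlaps the words surrounding the mention."""
    context = {w.lower() for w in context_tokens}
    best_id, best_score = None, 0
    for entity_id, entry in authority.items():
        if mention.lower() not in (n.lower() for n in entry["names"]):
            continue  # not a candidate for this surface form
        score = len(context & set(entry["description"].split()))
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id  # None if no candidate matched

# Hypothetical mini authority list in the spirit of DBpedia entries.
authority = {
    "dbpedia:George_Washington": {
        "names": ["Washington", "George Washington"],
        "description": "first president general of the continental army"},
    "dbpedia:Washington_(state)": {
        "names": ["Washington", "Washington State"],
        "description": "state in the pacific northwest of the united states"},
}
print(disambiguate("Washington",
                   "Washington crossed the Delaware with his army".split(),
                   authority))  # the "army" context favors the person here
```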
Language identification, layout analysis, tagging and lemmatization for non-English languages, syntactic parsing, and named-entity disambiguation were developed in our research group. In many cases, off-the-shelf components were either too slow (as for language identification and parsing) or too inaccurate (as for named-entity recognition). For NER, state-of-the-art systems can achieve over 90% accuracy on newswire text but fall to 50% or below on book data. We developed methods that, using simple context information from books, bring accuracy back up to 62%. More work is needed to adapt these systems, in an unsupervised fashion, for good performance on a wide range of materials. In the area of syntactic dependency parsing, we developed efficient linear-time algorithms that run up to ten times faster than current systems with only a small cost in accuracy. The ability to do fast syntactic parsing allows our named-entity systems to use dependencies to improve their output.
For tokenization and sentence splitting, and for English morphological tagging and lemmatization, we use the open-source Stanford CoreNLP tools.
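The text does not specify how CoreNLP is invoked in the pipeline; one common way to run these annotators from Python is through the CoreNLP server via the stanza client, sketched here under that assumption.

```python
# pip install stanza; assumes a local CoreNLP installation with CORENLP_HOME set.
from stanza.server import CoreNLPClient

text = "Washington crossed the Delaware. The base-ball game was rained out."
with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma"],
                   be_quiet=True) as client:
    annotation = client.annotate(text)
    for sentence in annotation.sentence:
        for token in sentence.token:
            print(token.word, token.pos, token.lemma)
```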
The successive steps in the document-processing pipeline store their data in Text Encoding Initiative (TEI) XML formats. We extend the lexical-token element <w> with attributes for the word’s syntactic head and relation to that head, and for page-coordinate information. Page coordinates provide convenient unique anchors for standoff markup, such as entity span information from conflicting systems (which may not properly nest for inline storage). Coordinates also enable highlighting search results on page images.
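A sketch of what such an extended token element might look like; the attribute names are illustrative, since the text specifies only which information the attributes carry, not their exact names.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names for the extended TEI <w> element: head, deprel,
# and coords stand in for the syntactic head, the relation to that head, and
# the page-image coordinates mentioned above.
w = ET.Element("w", {
    "lemma": "cross",
    "pos": "VBD",
    "head": "#w_41",                # id of the syntactic head token
    "deprel": "root",               # grammatical relation to the head
    "coords": "642,1180,778,1216",  # left, top, right, bottom on the page image
})
w.text = "crossed"
print(ET.tostring(w, encoding="unicode"))
# yields <w lemma="cross" pos="VBD" head="#w_41" deprel="root" coords="...">crossed</w>
```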
❖
So far, we have discussed ways of automatically annotating books and indexing these various representations. But books, and individual sections or passages within them, do not exist in isolation. We are interested in relational metadata that maps the connections among (parts of) documents. In particular, we have built systems for, and are conducting research on:
• partial duplicate detection, where a single work, or a substantial part thereof, may be published in several more or less divergent editions;
• translation detection, finding versions of the same text in different languages; and
• quotation detection, looking for the reuse of smaller passages on the order of ten words.
These problems can all be phrased as retrieval problems with documents as queries: “Find me all duplicates, or translations, or quotations, of book X.” The majority of books, in any case, do not have duplicates or translations, so we are interested in the more general case: “Within this large collection, find pairs of books or passages with a quotation, duplication, or translation relationship to each other.”
Processing individual documents and indexing them requires a fixed number of passes over all books or, at worst, sorting more frequent index terms. When we infer relationships among documents, however, we are dealing with operations that scale with the square of the number of documents, which would be prohibitive with the tens of millions of books or billions of passages for the planned DPLA.
We achieve efficient performance on these tasks with a combination of several strategies:
• Dimensionality reduction: Compressing documents from sequences, or even bags, of words into more compact representations, such as a document’s hapax legomena or, for the translation detection task, strings that overlap between the two languages, greatly reduces the number of comparisons required.
• Recursive sequence alignment: Longest-common-subsequence algorithms over unique (hapax) words reduce the number of comparisons that need to be made when comparing a particular pair of books (sketched after this list).
• Approximate nearest neighbors: Locality-sensitive hashing approaches reduce the problem of finding similar pairs to one of finding collisions when randomly mapping documents into bins. Instead of comparing all pairs of documents or passages, this approach allows us to hash each document exactly once (sketched after this list).
• Feature indexing: To find instances of quotation, or the reuse of parts of documents, we start by indexing higher-order n-grams. This process can be made much more efficient by basing an index of 5-grams on an index of 4-grams, and so on: any five-word subsequence that occurs more than once must contain two four-word subsequences that occur more than once (sketched after this list).
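For the recursive-sequence-alignment strategy, the sketch below (with hypothetical function names and a plain quadratic LCS for brevity) shows how the shared hapax words of two books yield anchor points; the stretches between successive anchors can then be aligned recursively instead of comparing the full texts.

```python
from collections import Counter

def hapax_sequence(tokens):
    """Words that occur exactly once in the book, in document order."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] == 1]

def hapax_anchors(book_a, book_b):
    """Longest common subsequence of the two books' shared hapax words."""
    ha, hb = hapax_sequence(book_a), hapax_sequence(book_b)
    shared = set(ha) & set(hb)
    a = [w for w in ha if w in shared]
    b = [w for w in hb if w in shared]
    n, m = len(a), len(b)
    # Standard LCS dynamic program over the (short) hapax sequences.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            dp[i][j] = (dp[i + 1][j + 1] + 1 if a[i] == b[j]
                        else max(dp[i + 1][j], dp[i][j + 1]))
    # Walk the table to recover the matched words.
    anchors, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            anchors.append(a[i]); i += 1; j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return anchors  # words usable as alignment anchors between the two books
```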
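For the approximate-nearest-neighbor strategy, here is a minimal MinHash-based locality-sensitive-hashing sketch; MinHash is one of several LSH variants, and the text does not specify which is used. Each document is hashed exactly once, and only documents that collide in some band become candidate pairs.

```python
import hashlib
import random
from collections import defaultdict

def shingles(tokens, n=5):
    """Represent a document by its set of word n-grams (shingles);
    assumes each document has at least n tokens."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash_signature(shingle_set, salts):
    """One minimum hash value per salt; similar sets tend to agree."""
    return [min(int(hashlib.md5(f"{salt}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set)
            for salt in salts]

def candidate_pairs(docs, bands=16, rows=4, seed=0):
    """docs maps a document id to its token list; returns colliding id pairs."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(bands * rows)]
    buckets = defaultdict(list)
    for doc_id, tokens in docs.items():
        sig = minhash_signature(shingles(tokens), salts)
        for b in range(bands):  # band the signature; each band is one bucket key
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for members in buckets.values():  # collisions become candidate pairs
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs
```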
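For the feature-indexing strategy, the containment property allows an apriori-style construction: a repeated 5-gram must contain two repeated 4-grams, so each level of the index only needs to examine n-grams whose (n-1)-gram prefix and suffix survived the previous level. A sketch, assuming a hypothetical in-memory corpus:

```python
from collections import Counter

def repeated_ngrams(docs, n, repeated_shorter=None):
    """Return the n-grams occurring more than once across the collection,
    counting only those whose (n-1)-gram prefix and suffix are already
    known to repeat.  docs maps a document id to its token list."""
    counts = Counter()
    for tokens in docs.values():
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if repeated_shorter is None or (gram[:-1] in repeated_shorter
                                            and gram[1:] in repeated_shorter):
                counts[gram] += 1
    return {g for g, c in counts.items() if c > 1}

def repeated_five_grams(corpus):
    """Build up from bigrams to 5-grams, each level filtered by the last."""
    repeated = None
    for n in range(2, 6):
        repeated = repeated_ngrams(corpus, n, repeated)
    return repeated
```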