Duplicate Book Alignment

Once duplicates of the canonical works are found, the OCRed text is aligned with the canonical "Ground Truth" text using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned with unique words in the OCR output. This process is then applied recursively to each text segment between matching unique words until the segments become very small. In the final stage, an edit-distance-based alignment algorithm aligns these short chunks of text to produce the final alignment. See "A Fast Alignment Scheme for Automatic OCR Evaluation of Books" for more details.
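The recursion described above can be sketched in a few lines of Python. This is a minimal illustration and not the RETAS implementation: it assumes that a "unique" word is one occurring exactly once in the segment on both sides, it keeps the anchors in order with a longest increasing subsequence, and the names `align`, `unique_anchors`, `edit_align`, and `max_chunk` are hypothetical. Gaps (shown as null in this section) are represented by `None`.

```python
from bisect import bisect_left
from collections import Counter


def edit_align(a, b):
    """Word-level edit-distance alignment of two short token lists."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def unique_anchors(a, b):
    """Index pairs (i, j) of words occurring once in both lists, in order."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {w: j for j, w in enumerate(b) if count_b[w] == 1}
    cand = [(i, pos_b[w]) for i, w in enumerate(a)
            if count_a[w] == 1 and w in pos_b]
    # Longest increasing subsequence on j keeps only non-crossing anchors.
    tails, tails_idx, prev = [], [], [-1] * len(cand)
    for k, (_, j) in enumerate(cand):
        p = bisect_left(tails, j)
        if p == len(tails):
            tails.append(j)
            tails_idx.append(k)
        else:
            tails[p], tails_idx[p] = j, k
        prev[k] = tails_idx[p - 1] if p > 0 else -1
    chain, k = [], (tails_idx[-1] if tails_idx else -1)
    while k != -1:
        chain.append(cand[k])
        k = prev[k]
    chain.reverse()
    return chain


def align(a, b, max_chunk=8):
    """Recursively align token lists a (OCR) and b (Ground Truth)."""
    if not a or not b or (len(a) <= max_chunk and len(b) <= max_chunk):
        return edit_align(a, b)
    anchors = unique_anchors(a, b)
    if not anchors:
        # No anchors left: fall back to the (quadratic) edit-distance step.
        return edit_align(a, b)
    pairs, pi, pj = [], 0, 0
    for i, j in anchors:
        pairs += align(a[pi:i], b[pj:j], max_chunk)  # segment before anchor
        pairs.append((a[i], b[j]))                   # the anchor itself
        pi, pj = i + 1, j + 1
    pairs += align(a[pi:], b[pj:], max_chunk)        # trailing segment
    return pairs
```

Calling `align(ocr_text.split(), gt_text.split())` returns a list of `(ocr_word, gt_word)` pairs. Where several alignments have the same edit cost, the pairing chosen by this sketch depends on tie-breaking in `edit_align` and may differ from the one RETAS reports.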

For example, compare the canonical text (the "Ground Truth") of the graveyard scene from Hamlet (Act 5, Scene 1) with an OCRed version such as The works of William Shakespeare:

OCR: see Takes the  shiill Alas poor Yorick I knew him Horatio a f    el   low  null   of infinite jest of
GT : see null  null null   Alas poor Yorick I knew him Horatio a null null null fellow of infinite jest of

In this example, the Ground Truth does not contain stage directions, so the OCRed "Takes the skull" is aligned with null. Note also that "skull" was misrecognized and appears as "shiill", and that "fellow" was split by the OCR into three separate words ("f", "el" and "low"), perhaps because the word was hyphenated on the scanned page. The full alignment of these two works can be found here.
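To make the reading of such an alignment concrete, the following sketch tallies the aligned pairs from the Hamlet example. The category names (`match`, `ocr_only`, `gt_only`, `mismatch`) and the helper names are illustrative assumptions, not part of RETAS.

```python
from collections import Counter

# The two aligned rows from the Hamlet example; "null" marks a gap.
OCR_ROW = ("see Takes the shiill Alas poor Yorick I knew him Horatio a "
           "f el low null of infinite jest of")
GT_ROW = ("see null null null Alas poor Yorick I knew him Horatio a "
          "null null null fellow of infinite jest of")


def parse_row(row):
    """Split a row into tokens, mapping the gap marker "null" to None."""
    return [None if tok == "null" else tok for tok in row.split()]


def tally(pairs):
    """Count how many aligned pairs fall into each category."""
    counts = Counter()
    for ocr_word, gt_word in pairs:
        if ocr_word is None:
            counts["gt_only"] += 1    # word present only in the Ground Truth
        elif gt_word is None:
            counts["ocr_only"] += 1   # word present only in the OCR output
        elif ocr_word == gt_word:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1   # both present but different
    return counts


pairs = list(zip(parse_row(OCR_ROW), parse_row(GT_ROW)))
counts = tally(pairs)
```

For the example rows this gives 13 matches, 6 OCR-only tokens (the stage direction plus the fragments "f", "el" and "low"), and 1 Ground Truth-only token ("fellow").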

