Once duplicates of the canonical works are found, the OCRed text is aligned with the canonical "Ground Truth" text using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned with unique words in the OCR output. This process is then applied recursively to each text segment between matching unique words until the segments become very small. In the final stage, an edit-distance-based alignment algorithm aligns these short chunks of text to produce the final alignment. See "A Fast Alignment Scheme for Automatic OCR Evaluation of Books" for more details.
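The recursive idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the RETAS implementation: it assumes that "unique" means a word occurring exactly once in a segment, it keeps the order-preserving anchors with a simple longest-increasing-subsequence pass, and it runs the final edit-distance step on whole words rather than characters. The names `align`, `unique_anchors`, `edit_distance_align` and the `small` threshold are hypothetical.

```python
from collections import Counter

GAP = None  # plays the role of "null" in an alignment pair


def edit_distance_align(a, b):
    """Align two short token lists with word-level Levenshtein DP."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], GAP))
            i -= 1
        else:
            pairs.append((GAP, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def unique_anchors(a, b):
    """Return (i, j) index pairs of words unique in both lists, in consistent order."""
    count_a, count_b = Counter(a), Counter(b)
    pos_b = {w: j for j, w in enumerate(b) if count_b[w] == 1}
    cand = [(i, pos_b[w]) for i, w in enumerate(a) if count_a[w] == 1 and w in pos_b]
    if not cand:
        return []
    # Longest increasing subsequence over the positions in b (quadratic, fine for a sketch).
    best, prev = [1] * len(cand), [-1] * len(cand)
    for k in range(len(cand)):
        for p in range(k):
            if cand[p][1] < cand[k][1] and best[p] + 1 > best[k]:
                best[k], prev[k] = best[p] + 1, p
    k = max(range(len(cand)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(cand[k])
        k = prev[k]
    chain.reverse()
    return chain


def align(a, b, small=20):
    """Recursively align token lists a (ground truth) and b (OCR output)."""
    if len(a) <= small or len(b) <= small:
        return edit_distance_align(a, b)
    anchors = unique_anchors(a, b)
    if not anchors:
        return edit_distance_align(a, b)
    out, ia, ib = [], 0, 0
    for i, j in anchors:
        out += align(a[ia:i], b[ib:j], small)  # segment before this anchor
        out.append((a[i], b[j]))               # the anchor itself
        ia, ib = i + 1, j + 1
    out += align(a[ia:], b[ib:], small)        # trailing segment
    return out


if __name__ == "__main__":
    # Toy token lists made up for illustration; small=4 forces the recursive path.
    truth = "alas poor yorick i knew him horatio a fellow of infinite jest".split()
    ocr = "takes the shiill alas poor yorick i knew him horatio a f el low of infinite jest".split()
    for pair in align(truth, ocr, small=4):
        print(pair)
```

Each returned pair holds a ground-truth word and an OCR word, with `None` on whichever side has no counterpart. Anchoring on unique words keeps the expensive dynamic-programming step confined to short segments, which is what makes this style of alignment practical for book-length texts.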
For example, consider comparing the canonical text (the "Ground Truth") of the graveyard scene from Hamlet (Act 5, Scene 1) with an OCRed version such as The works of William Shakespeare:
In this example, the Ground Truth does not contain stage directions, so "Takes the skull" is aligned with null. Also note that "skull" has not been correctly OCRed and appears as "shiill", and that the OCR interpreted "fellow" as three separate words ("f", "el" and "low"), perhaps because the word was hyphenated on the scanned page. The full alignment of these two works can be found here.