Duplicate Detection

Starting with a collection of 3,628,227 scanned books from the Internet Archive, we identified 1,576,124 books written in English and 76,684 written in Latin. Then, using the canonical texts of 803 English books and 401 Latin books, we identified full and partial duplicates of those texts among the English and Latin scanned books. A partial duplicate is a book that may contain extra material alongside the canonical text, such as footnotes or other works; for example, Hamlet within The Tragedies of Shakespeare. Duplicates are identified quickly by looking at only the unique words within a book. See Partial Duplicate Detection for Large Book Collections for details.
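To give a rough feel for how unique words can flag a partial duplicate, here is a minimal Python sketch. It is not the method from the paper. It assumes that "unique words" means words occurring exactly once in the canonical text, and it scores a candidate book by the fraction of those words found anywhere in its vocabulary. The function names `unique_words` and `containment_score` and the scoring rule are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def unique_words(text):
    """Return the set of words that occur exactly once in the text.

    Assumption: this is one reading of "unique words"; the paper may
    define and use them differently.
    """
    counts = Counter(tokenize(text))
    return {word for word, count in counts.items() if count == 1}


def containment_score(canonical, candidate):
    """Fraction of the canonical text's unique words found in the candidate.

    A score near 1.0 suggests the candidate contains the canonical work,
    possibly with extra material (a partial duplicate). This scoring rule
    is a hypothetical stand-in, not the authors' algorithm.
    """
    signature = unique_words(canonical)
    if not signature:
        return 0.0
    vocabulary = set(tokenize(candidate))
    return len(signature & vocabulary) / len(signature)


canonical = "to be or not to be that is the question"
anthology = "preface by the editor to be or not to be that is the question footnote one"
unrelated = "the quick brown fox jumps over the lazy dog"

print(containment_score(canonical, anthology))
print(containment_score(canonical, unrelated))
```

Comparing against the candidate's full vocabulary, rather than only its own once-occurring words, keeps the score high when the canonical work is embedded in a larger volume, where its words may repeat in the surrounding material. A real system would also need to cope with OCR errors and word order, which this sketch ignores.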

Below, you can select a language and a single canonical work to see the duplicates found within the Internet Archive repository. You can also download the complete list of all English or Latin duplicates in JSON format.

Select a language:

English Latin

Select a canonical book: