Language Identification Differences

This page reports the results of running language identification on 3,628,227 OCRed books from the Internet Archive (IA), and shows where those results differ from the language the IA has listed.

Note: only languages that account for more than 5% of a book are considered. For example, if a book was found to be 96% English and 4% Latin, we would treat it as English only.
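
The cut-off can be illustrated with a minimal sketch. The function name, the dictionary input format, and the strict "greater than" comparison are assumptions made for illustration; they are not taken from the tool itself.

```python
def significant_languages(shares, threshold=5.0):
    """Keep only languages whose share exceeds `threshold` percent.

    `shares` maps a language code to its percentage of the book
    (hypothetical input format). Results are sorted largest first.
    """
    kept = [(lang, pct) for lang, pct in shares.items() if pct > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# The 96% English / 4% Latin book from the note above.
print(significant_languages({"ENG": 96.0, "LAT": 4.0}))  # [('ENG', 96.0)]
```

With these inputs the Latin share falls below the cut-off, so only English remains.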

Books With Incorrect Language(s)

We found 31,947 books for which we believe the language listed by the Internet Archive is incorrect. For example:

IA thinks hamlet00shakgoog is Language: {'ENG'}. We think it's: [('UND', '57.7'), ('GER', '38.4')]

This book is identified as English at the IA; however, our language identification found it to be 57.7% "Unidentified" (probably due to OCR errors) and 38.4% German. Viewing the book at the IA shows that it is in German.

Note that the percentages do not sum to 100% because we only consider languages where the percentage is over 5%. In the above case we also found 1.6% English, 1.2% Spanish, and 1.1% Portuguese.
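
As a quick check of the arithmetic, the sketch below adds the reported shares to the dropped ones. The codes "SPA" and "POR" are assumed labels for Spanish and Portuguese; the text gives only the language names.

```python
# Shares reported for hamlet00shakgoog; the tool prints percentages as strings.
reported = [("UND", "57.7"), ("GER", "38.4")]

# Shares mentioned in the text but dropped by the 5% cut-off.
# "SPA" and "POR" are assumed codes, used here only for illustration.
dropped = {"ENG": 1.6, "SPA": 1.2, "POR": 1.1}

reported_total = sum(float(pct) for _, pct in reported)
dropped_total = sum(dropped.values())

print(f"reported: {reported_total:.1f}%")                  # reported: 96.1%
print(f"dropped:  {dropped_total:.1f}%")                   # dropped:  3.9%
print(f"together: {reported_total + dropped_total:.1f}%")  # together: 100.0%
```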

Books With Unknown Language(s)

There are 17,272 books whose IA language is one we did not check for, so we assume the IA language is correct. On inspection, however, some of these books are misidentified. For example, the results for the book agathoniabycgfg00goregoog are:

Unknown language: {'CZE'} for book: agathoniabycgfg00goregoog. We think it's: [('ENG', '100')] 

This book is identified as Czech at the IA but is actually English.

The book eachtrafhinncumh00finn, by contrast, is correctly identified at the IA as Gaelic, a language our tool does not look for:

Unknown language: {'GAE'} for book: eachtrafhinncumh00finn. We think it's:  [('UND', '87.5'), ('LAT', '6.8')] 
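
One way to picture how a book ends up in each group is the sketch below. It is not the tool's actual logic: the set of checked languages is a hypothetical placeholder, and ignoring "UND" when comparing is an assumption.

```python
# Hypothetical set of codes the identifier can detect (an assumption;
# the real list is not given on this page).
CHECKED = {"ENG", "GER", "LAT", "SPA", "POR"}


def classify(ia_langs, detected, checked=CHECKED):
    """Return 'unknown', 'match', or 'differs' for one book.

    ia_langs: set of codes listed by the IA, e.g. {'ENG'}
    detected: list of (code, percent) pairs above the 5% cut-off
    """
    # If the IA lists any language we never check for, we cannot judge it.
    if not ia_langs <= checked:
        return "unknown"
    # Assumption: "UND" (unidentified text) is ignored in the comparison.
    found = {code for code, _ in detected if code != "UND"}
    return "match" if found == ia_langs else "differs"


print(classify({"ENG"}, [("UND", "57.7"), ("GER", "38.4")]))  # differs
print(classify({"CZE"}, [("ENG", "100")]))                    # unknown
print(classify({"GAE"}, [("UND", "87.5"), ("LAT", "6.8")]))   # unknown
```

Under these assumptions both the Czech and the Gaelic book land in the "unknown" group, even though only the Gaelic label is actually correct, which is why manual inspection was needed.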

Evaluating Corrected Language

We randomly sampled 101 books from a subset of the 31,947 books that we had identified as being written in a different language from the one the Internet Archive had listed for them. We arranged to have the Internet Archive re-run its OCR process on those books with the corrected language specified. Not surprisingly, OCR systems work better when they know the language of the scanned book.

We do not have "truth" OCR output for these books, so we used a surrogate measure to evaluate whether the OCR improved under the new language. For this evaluation we used 42 of the 101 books: those we had identified as English, since, as native English speakers, we could verify them most easily. For each book we counted the number of unique misspelled words in the original OCR file and compared it with the number in the re-OCRed file.
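
The counting step might look like the following minimal sketch. The tokenizer, the tiny word list, and the sample strings are invented for illustration; the page does not say which dictionary or spell checker was used.

```python
import re


def unique_misspellings(text, dictionary):
    """Return the distinct lowercased tokens in `text` missing from `dictionary`."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return tokens - dictionary


# Toy word list; a real run would load a full English dictionary.
dictionary = {"to", "be", "or", "not", "that", "is", "the", "question"}

# Invented OCR samples of the same line before and after re-OCR.
original = "To be or n0t to bc, tbat is the qucstion"
reocr = "To be or not to be, that is the qucstion"

print(len(unique_misspellings(original, dictionary)))  # 4
print(len(unique_misspellings(reocr, dictionary)))     # 1
```

Counting unique words rather than every occurrence keeps one frequently repeated OCR error from dominating the score.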

Using this method of evaluation, we found improvements of up to 84%, with an average improvement of 41%.

If we restrict those 42 books to the 31 in which the percentage of English is greater than the percentage of "Undetermined", the average improvement increases to 50%.
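
The page does not give the exact formula, so the sketch below assumes that percent improvement is the relative drop in unique misspelled words for each book, averaged across books. The counts are made up purely to show the calculation.

```python
from statistics import mean


def percent_improvement(before, after):
    """Relative reduction in unique misspelled words, in percent (assumed formula)."""
    if before == 0:
        raise ValueError("no misspellings in the original OCR; improvement is undefined")
    return 100.0 * (before - after) / before


# Invented (before, after) counts for two books; not real data.
counts = [(200, 100), (40, 30)]

improvements = [percent_improvement(b, a) for b, a in counts]
print(improvements)        # [50.0, 25.0]
print(mean(improvements))  # 37.5
```

Averaging per-book percentages, as assumed here, weights every book equally regardless of its length.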

Based on these results, we have provided the Internet Archive with the list of books that our system estimates have the wrong language. Our understanding is that the Internet Archive will re-OCR those books to make them more accessible through its own search engine.