Language Identification

This page contains the results of the language identification run on 3,628,227 OCRed books from The Internet Archive.

Ten languages are used: English (ENG), French (FRE), German (GER), Spanish (SPA), Italian (ITA), Latin (LAT), Portuguese (POR), Dutch (DUT), Danish (DAN) and Swedish (SWE). In addition, there is an "Undetermined Language" (UND) category, which indicates that some of the text is either written in a language not listed above or contains OCR errors.

Each book is given a language distribution. For example, for the book hamlettragedyinf01shak, the language identification results are:

hamlettragedyinf01shak ENG 36.2 FRE 0 GER 0 SPA 0 ITA 34.8 LAT 0 POR 0 DUT 0 DAN 0 SWE 0 UND 29

This means the book is estimated to be 36.2% English, 34.8% Italian, and 29% Undetermined. The book is Shakespeare's Hamlet in both English and Italian. The "Undetermined" portion is most likely due to OCR errors.
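A result line like the one above can be read programmatically. The following is a minimal Python sketch, assuming each line consists of the book label followed by whitespace-separated code/percentage pairs, exactly as in the example; the function name parse_result_line is hypothetical, and the layout of the downloadable files may differ.

```python
def parse_result_line(line):
    """Split one result line into (label, {language code: percentage}).

    Assumes the layout shown in the example: a label followed by
    whitespace-separated CODE VALUE pairs. This layout is an assumption
    based on the single example line, not a documented file format.
    """
    tokens = line.split()
    if len(tokens) < 3 or len(tokens) % 2 == 0:
        raise ValueError("expected a label followed by CODE VALUE pairs")
    label, rest = tokens[0], tokens[1:]
    # Pair up every code (even positions) with its value (odd positions).
    distribution = {code: float(value) for code, value in zip(rest[0::2], rest[1::2])}
    return label, distribution


example = ("hamlettragedyinf01shak ENG 36.2 FRE 0 GER 0 SPA 0 ITA 34.8 "
           "LAT 0 POR 0 DUT 0 DAN 0 SWE 0 UND 29")
label, distribution = parse_result_line(example)

# Keep only the languages that were actually detected, largest share first.
detected = sorted(
    ((code, pct) for code, pct in distribution.items() if pct > 0),
    key=lambda item: item[1],
    reverse=True,
)
print(label, detected)
```

For the Hamlet example, the detected list contains only ENG, ITA and UND, matching the interpretation given above.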

To view a book on the Internet Archive, take its label (for example, 'hamlettragedyinf01shak' above) and append it to the URL https://archive.org/details/, which gives https://archive.org/details/hamlettragedyinf01shak.
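Building the link for a label is a simple string concatenation. The sketch below is illustrative only; the helper name details_url is hypothetical, and percent-encoding the label is a precaution added here, not something the page requires.

```python
from urllib.parse import quote

DETAILS_BASE = "https://archive.org/details/"


def details_url(label):
    """Return the Internet Archive details page URL for a book label."""
    # quote() leaves plain alphanumeric labels unchanged and escapes
    # any character that would be unsafe in a URL path segment.
    return DETAILS_BASE + quote(label, safe="")


url = details_url("hamlettragedyinf01shak")
print(url)
```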

Download the entire language identification results here (342MB). A version for use in a spreadsheet (172MB), which contains only the percentages and a header row, is also available for download. Below, the results are grouped by the first letter of the book's title to make searching a bit easier.