Language identification is the process of determining which natural language a given piece of content is written in. Traditionally, identification of written language, as practiced for instance in library science, has relied on manually identifying frequent words and letters known to be characteristic of particular languages. More recently, computational approaches have treated language identification as a special case of text categorization, a natural language processing task that relies on statistical methods.
In the field of library science, language identification is important for categorizing materials. As librarians often have to categorize materials which are in languages they are not familiar with, they sometimes rely on tables of frequent words and distinctive letters or characters to help them identify languages. While identifying a single such word or character may not suffice to distinguish a language from another with a similar orthography, identifying several is often highly reliable.
There are several statistical approaches to language identification, which use different techniques to classify the data. One technique is to compare the compressibility of the text to the compressibility of texts in a set of known languages. This approach is known as a mutual-information-based distance measure. The same technique can also be used to empirically construct family trees of languages which closely correspond to the trees constructed using historical methods. The mutual-information-based distance measure is essentially equivalent to more conventional model-based methods and is not generally considered to be either novel or better than simpler techniques. The work of Benedetto et al., which applied this technique, has largely been discredited as relatively naive and inaccurate.
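The compression comparison described above can be sketched roughly as follows. This is only an illustration, not the procedure of any cited paper: it assumes zlib as the compressor, and the function names and the two tiny reference texts are hypothetical. The idea is that a sample costs fewer extra bytes when appended to a reference text in the same language, because the compressor can reuse strings it has already seen.

```python
import zlib

# Hypothetical, very small reference texts; a real system would use far more data.
REFERENCE_TEXTS = {
    "english": "the quick brown fox jumps over the lazy dog and "
               "the cat sat on the mat with the other animals",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "die katze sitzt auf der matte bei den anderen tieren",
}


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: str, sample: str) -> int:
    """Extra compressed bytes needed when the sample is appended to the reference."""
    ref = reference.encode("utf-8")
    both = ref + b" " + sample.encode("utf-8")
    return compressed_size(both) - compressed_size(ref)


def identify_by_compression(sample: str, references: dict) -> str:
    """Return the language whose reference text makes the sample cheapest to encode."""
    return min(references, key=lambda lang: extra_bytes(references[lang], sample))


if __name__ == "__main__":
    print(identify_by_compression("the cat sat on the mat", REFERENCE_TEXTS))
```

With reference texts this small the result is fragile; the sketch is meant to show the shape of the comparison, not to serve as a usable classifier.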
Another technique, as described by Cavnar and Trenkle (1994) and Dunning (1994), is to create a language n-gram model from a "training text" for each of the languages. These models can be based on characters (Cavnar and Trenkle) or encoded bytes (Dunning); in the latter, language identification and character encoding detection are integrated. Then, for any piece of text to be identified, a similar model is built and compared to each stored language model. The most likely language is the one whose model is most similar to the model of the text to be identified. This approach can be problematic when the input text is in a language for which there is no model; in that case, the method may return another, "most similar" language as its result. Also problematic for any approach are pieces of input text that are composed of several languages, as is common on the Web.
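As a rough illustration of the character n-gram approach, the sketch below builds a ranked profile of the most frequent n-grams for each training text and compares profiles with a rank-difference ("out-of-place") distance in the spirit of Cavnar and Trenkle. It is a simplification under stated assumptions: the function names, the n-gram lengths (1 to 3), the profile size of 300, and the miniature training texts are all choices made for this example, not details taken from the papers.

```python
from collections import Counter

# Hypothetical miniature training texts; real profiles need much larger corpora.
TRAINING_TEXTS = {
    "english": "the quick brown fox jumps over the lazy dog and "
               "the cat sat on the mat with the other animals",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "die katze sitzt auf der matte bei den anderen tieren",
}


def ngram_profile(text, max_n=3, top_k=300):
    """Character n-grams (lengths 1..max_n), most frequent first, at most top_k."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(doc_profile, lang_profile, penalty):
    """Sum of rank differences; n-grams absent from the language profile cost `penalty`."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else penalty
        for rank, gram in enumerate(doc_profile)
    )


def identify_by_ngrams(sample, training_texts, top_k=300):
    """Return the language whose profile is closest to the sample's profile."""
    doc = ngram_profile(sample, top_k=top_k)
    profiles = {lang: ngram_profile(text, top_k=top_k)
                for lang, text in training_texts.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang], top_k))


if __name__ == "__main__":
    print(identify_by_ngrams("the cat sat on the mat", TRAINING_TEXTS))
```

Note that `identify_by_ngrams` always returns one of the trained languages, which is exactly the weakness mentioned above: text in an unmodelled language is silently mapped to the nearest available profile.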
For a more recent method, see Řehůřek and Kolkus (2009). This method can detect multiple languages in an unstructured piece of text and works robustly on short texts of only a few words: something that the n-gram approaches struggle with.
An older statistical method by Grefenstette was based on the prevalence of certain function words (e.g., "the" in English).
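A function-word approach of this kind can be sketched as a simple count of known short words per language. The word lists and the function name below are illustrative assumptions chosen for this example; they are not taken from Grefenstette's paper, and a practical system would use much longer, frequency-weighted lists.

```python
import re

# Illustrative (assumed) function-word lists, deliberately tiny.
FUNCTION_WORDS = {
    "english": {"the", "of", "and", "to", "in", "is", "that", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "de", "un", "une"},
}


def identify_by_function_words(text):
    """Return the language with the most function-word hits, or None if there are none."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in FUNCTION_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(identify_by_function_words("The cat is on the roof of the house"))
```

Returning `None` when no listed word occurs is a design choice of this sketch; it makes the method's main limitation visible, since very short inputs often contain no function words at all.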
- Joshua Goodman. Extended Comment on Language Trees and Zipping. arXiv:cond-mat/0202383 [cond-mat.stat-mech]
- Benedetto, D., E. Caglioti and V. Loreto. Language trees and zipping. Physical Review Letters, 88:4 (2002), Complexity theory.
- Cavnar, William B. and John M. Trenkle. "N-Gram-Based Text Categorization". Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval (1994).
- Cilibrasi, Rudi and Paul M.B. Vitanyi. "Clustering by compression". IEEE Transactions on Information Theory 51(4), April 2005, 1523-1545.
- Dunning, T. (1994) "Statistical Identification of Language". Technical Report MCCS 94-273, New Mexico State University, 1994.
- Goodman, Joshua. (2002) Extended comment on "Language Trees and Zipping". Microsoft Research, Feb 21 2002. (This is a criticism of the data compression approach in favor of the naive Bayes method.)
- Grefenstette, Gregory. (1995) Comparing two language identification schemes. Proceedings of the 3rd International Conference on the Statistical Analysis of Textual Data (JADT 1995).
- Poutsma, Arjen. (2001) Applying Monte Carlo techniques to language identification. SmartHaven, Amsterdam. Presented at CLIN 2001.
- The Economist. (2002) "The elements of style: Analysing compressed data leads to impressive results in linguistics"
- Řehůřek, Radim and Milan Kolkus. (2009) "Language Identification on the Web: Extending the Dictionary Method". Computational Linguistics and Intelligent Text Processing.
- Algorithmic information theory
- Artificial grammar learning
- Family name affixes
- Kolmogorov complexity
- Language Analysis for the Determination of Origin
- Machine translation
- S.M. Mohammadzadeh: Language identification/detection related documents (26 February 2011).
- LID - Language Identification in Python: algorithm and code example of an n-gram based LID tool in Python and Scheme by Damir Cavar.
- jExSLI: simple classifier in Java.
- lid Language Identifier: by Lingua-Systems; C/C++ library and Perl Extension (online demo).
- lc4j, a language categorization Java library, by Marco Olivo.
- Microsoft Extended Linguistic Services for Windows 7: including Microsoft Language Detection.
- Windows 7 API Code Pack for .NET: including managed interfaces for the above.
- NTextCat - free Language Identification API for .NET (C#): 280+ languages available out of the box. Recognizes language and encoding (UTF-8, Windows-1252, Big5, etc.) of text. Mono compatible.
- cldr: R library for the Chromium authors' Compact Language Detection code.
- language-detection: open-source language detection library for Java (Apache License 2.0); lang-guess is a fork of this code.
- Language Identification Web Service: language detection API (JSON and XML) that detects 100+ languages in texts, websites and documents.
- Language Detection API: simple language identification API.
- AlchemyAPI: language identification API, available as an SDK and through a RESTful API (web-based demonstration).
- PetaMem Language Identification: provides a choice between ngram, nvect and smart methods.
- Open Xerox LanguageIdentifier, available in web-based form or through API.
- Language Detector: online identification from text or URL, with an API available for developers.
- What Language Is This? Online language identifier: web-based tool written by Henrik Falck.
- Rosette Language Identifier: product by Basis Technology.
- Language Identifier: product by Sematext; exposes Java API and is available through REST/Webservice.
- G2LI (Global Information Infrastructure Laboratory's Language Identifier).
- Rosoka Cloud: by IMT Holdings; provides language ID, entity and relationship extraction as RESTful web services, available through Amazon Web Services Marketplace.
- Semantria: sentiment and text analytics API which features language detection.