Language identification


In natural language processing, language identification or language guessing is the problem of determining which natural language a given piece of content is written in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods.

Overview

There are several statistical approaches to language identification, each using a different technique to classify the data. One technique is to compare the compressibility of the text to the compressibility of texts in a set of known languages. This approach, associated with the work of Benedetto et al., is known as the mutual-information-based distance measure. The same technique can also be used to empirically construct family trees of languages that closely correspond to the trees constructed using historical methods.[citation needed] However, the mutual-information-based distance measure is essentially equivalent to more conventional model-based methods and is not generally considered to be either novel or better than simpler techniques. The work of Benedetto et al. has largely been discredited as relatively naive and inaccurate.
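The compression comparison can be illustrated with a minimal Python sketch. It assumes that "compressibility" is measured as the number of extra bytes zlib needs when the unknown text is appended to a reference text in a known language. The function names and the two toy reference texts are invented for illustration and are not taken from any published implementation.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return the length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: str, text: str) -> int:
    """Bytes added to the compressed reference when `text` is appended to it."""
    ref = reference.encode("utf-8")
    both = ref + b" " + text.encode("utf-8")
    return compressed_size(both) - compressed_size(ref)


def identify_by_compression(text: str, references: dict[str, str]) -> str:
    """Pick the language whose reference text makes `text` cheapest to encode."""
    return min(references, key=lambda lang: extra_cost(references[lang], text))


# Toy reference texts (hypothetical; far too small for real use).
REFERENCES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on "
          "the mat while the rain fell on the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die "
          "katze saß auf der matte während der regen auf das haus fiel",
}

print(identify_by_compression("the cat sat on the mat", REFERENCES))
```

With reference texts this small the result is only reliable when the input shares long substrings with one of them; a realistic setup would need much larger reference corpora.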

Another technique, described by Cavnar and Trenkle (1994) and Dunning (1994), is to build an n-gram model from a "training text" for each of the languages. These models can be based on characters (Cavnar and Trenkle) or encoded bytes (Dunning); in the latter case, language identification and character encoding detection are integrated. For any piece of text to be identified, a model is built in the same way and compared to each stored language model. The most likely language is the one whose model is most similar to the model of the input text. This approach is problematic when the input text is in a language for which there is no model: the method may then return a different, "most similar" language as its result. Input text composed of several languages, as is common on the Web, is problematic for any approach.
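The character-based variant can be sketched roughly as follows. This is a simplified illustration in the spirit of Cavnar and Trenkle's rank-ordered profiles, not a reproduction of their implementation: the n-gram lengths, the profile size, the tie-breaking rule, and the toy training texts are all assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text: str, max_n: int = 3, size: int = 300) -> list[str]:
    """Return up to `size` character n-grams (n = 1..max_n), most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Descending count, then alphabetical order so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def out_of_place(doc_profile: list[str], lang_profile: list[str]) -> int:
    """Sum of rank differences; n-grams missing from the language profile
    receive a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def identify_by_ngrams(text: str, training: dict[str, str]) -> str:
    """Return the language whose profile is closest to the profile of `text`."""
    profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training texts (hypothetical; real profiles need far more data).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on "
          "the mat while the rain fell on the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die "
          "katze saß auf der matte während der regen auf das haus fiel",
}

print(identify_by_ngrams("the cat sat on the mat", TRAINING))
```

Note that `identify_by_ngrams` always returns one of the trained languages, which mirrors the limitation described above: text in an unmodeled language is still assigned to whichever profile happens to be nearest.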

A more recent method is described by Řehůřek and Kolkus (2009). It can detect multiple languages in an unstructured piece of text and works robustly on short texts of only a few words, something that the n-gram approaches struggle with.

An older statistical method by Grefenstette was based on the prevalence of certain function words (e.g., "the" in English).
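The function-word idea can be sketched as a simple counting scheme. The word lists, the scoring rule, and the function name below are assumptions chosen for illustration and do not reproduce Grefenstette's actual word lists or weighting.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny function-word lists.
FUNCTION_WORDS = {
    "en": {"the", "of", "and", "to", "in", "is", "that"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "zu"},
    "fr": {"le", "la", "les", "et", "est", "de", "que"},
}


def identify_by_function_words(
    text: str, word_lists: dict[str, set[str]] = FUNCTION_WORDS
) -> Optional[str]:
    """Return the language whose function words occur most often in `text`,
    or None if no listed function word occurs at all."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in word_lists.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(identify_by_function_words("The cat is in the garden"))
```

Returning `None` when no function word is found is a design choice of this sketch; it makes the weakness of the approach on very short inputs visible instead of forcing a guess.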
