Corpus language


A corpus language is a language that has no living speakers but for which a number of its native speakers' actual productions have been preserved in some way (usually in written records).[1] Examples of corpus languages are Ancient Greek, Latin, the Egyptian language, Old English and Elamite.

Some corpus languages, such as Ancient Greek and Latin, left a very large corpus and can therefore be fully reconstructed, even though some details of the pronunciation may be unclear. Such languages can be used even today, as is the case with Sanskrit and Latin. Others have such a limited corpus that some important words, e.g. some pronouns, are not found in it. Examples of this are Ugaritic and Gothic. Languages that are attested only by a few words, often names, and a few phrases (called Trümmersprachen in German linguistics, literally "rubble languages") can be reconstructed only in a very limited way, and their genetic relationship to other languages often remains unclear. Examples are the Lombardic language and Dadanitic, a Semitic language that may be close to Classical Arabic.

Corpus languages are studied using the methods of corpus linguistics, but corpus linguistics can also be used (and commonly is used) to study the recorded productions of living languages.

Not all extinct languages are "corpus languages", since many languages have disappeared leaving no recorded productions of their speakers, or only very inadequate ones.


  1. Langslow, D. R. (2002). "Approaching bilingualism in corpus languages". In James Noel Adams, Mark Janse, Simon Swain (eds.), Bilingualism in Ancient Society: Language Contact and the Written Text. Oxford: OUP.

See also