List of text corpora

The following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large and unstructured set of texts (nowadays usually electronically stored and processed). Text corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language territory.
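As a minimal sketch of the kind of occurrence checking described above, the following Python snippet counts word frequencies over a tiny in-memory corpus. The three sample sentences and the helper names `tokenize` and `word_frequencies` are hypothetical and serve only as illustration; they are not taken from any corpus listed here, and a real corpus would normally be read from files and tokenized with a language-aware tool.

```python
import re
from collections import Counter

# Hypothetical three-text corpus used only for illustration.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird sang.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


frequencies = word_frequencies(corpus)
total_tokens = sum(frequencies.values())

# Relative frequency of a word: its count divided by the corpus size in tokens.
relative_the = frequencies["the"] / total_tokens
print(frequencies.most_common(3), total_tokens, round(relative_the, 3))
```

The regular-expression tokenizer is a deliberate simplification that handles only unaccented Latin letters; corpora in other scripts would need a different tokenization step.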

English language

European languages

Middle Eastern languages

East Asian languages

Parallel corpora of diverse languages

  • Europarl Corpus - proceedings of the European Parliament from 1996 to 2011
  • EUR-Lex corpus - collection of texts in all official languages of the European Union, created from the EUR-Lex database[12]
  • OPUS - open source parallel corpus in many languages[13]
  • Tatoeba - a parallel corpus containing about 2,288,000 sentences in 122 languages[14]
  • NTU-Multilingual Corpus - a corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie)[15] (legacy repo)
  • SeedLing corpus - a seed corpus for the Human Language Project, with 1000+ languages from various sources[16]
  • GRALIS - parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)

Comparable corpora

See also


  1. ^ Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at
  2. ^ "PhraseFinder".  A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
  3. ^ "Corpus Resource Database (CoRD)". Department of English, University of Helsinki. 
  4. ^ "Under Update". Retrieved 12 January 2014. 
  5. ^
  6. ^ (in Spanish) "Molinolabs - corpus". Retrieved 12 January 2014. 
  7. ^ "CorALit – CorALit - Lietuvių mokslo kalbos tekstynas". Retrieved 12 January 2014. 
  8. ^ "Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage". Retrieved 12 January 2014. 
  9. ^ "Available from CLARIN". 
  10. ^ a b "University of Tehran NLP Lab". Retrieved 12 January 2014. 
  11. ^ "KOTONOHA「現代日本語書き言葉均衡コーパス」 少納言". Retrieved 12 January 2014. 
  12. ^ "EUR-Lex Corpus". Retrieved 27 October 2016. 
  13. ^ "OPUS - an open source parallel corpus". Retrieved 12 January 2014. 
  14. ^ "Tatoeba - Number of sentences per language". Retrieved 13 January 2014. 
  15. ^ Liling Tan and Francis Bond (14 May 2012). "Building and Annotating the Linguistically Diverse NTU-MC (NTU — Multilingual Corpus)" (PDF). International Journal of Asian Language Processing. 22 (4): 161–174. 
  16. ^ Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of The use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
  17. ^ Liling Tan, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. 2014. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of The 7th Workshop on Building and Using Comparable Corpora (BUCC).
  18. ^ Kilgarriff, A. (2012, September). Getting to know your corpus. In International Conference on Text, Speech and Dialogue (pp. 3-15). Springer Berlin Heidelberg.
  19. ^ Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTenTen: a new, vast corpus for Arabic. Proceedings of WACL.
  20. ^ Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia-Social and Behavioral Sciences, 95, 12-19.
  21. ^ Хохлова, М. В. (2016). Обзор больших русскоязычных корпусов текстов. In Материалы научной конференции "Интернет и современное общество" (pp. 74-77).
  22. ^ Khokhlova, M. (2016). Comparison of High-Frequency Nouns from the Perspective of Large Corpora. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9.
  23. ^ Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434).