List of text corpora


The following is a list of text corpora in various languages ("text corpora" is the plural of "text corpus"). A text corpus is a large, structured set of texts, nowadays usually stored and processed electronically. Text corpora are used for statistical analysis and hypothesis testing, such as checking how often a form occurs or validating linguistic rules within a specific language territory.
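As a rough illustration of "checking occurrences", the sketch below counts word frequencies in a tiny made-up corpus. The three sentences, the regex tokenizer, and the helper names `tokenize`, `occurrences`, and `relative_frequency` are all assumptions for the example; real corpora are far larger and usually ship with their own tokenization and annotation.

```python
import re
from collections import Counter

# Hypothetical three-sentence "corpus"; a real corpus would be far larger.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "A cat and a dog met.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


# Flatten all documents into one token list and count each word form.
tokens = [tok for doc in corpus for tok in tokenize(doc)]
freq = Counter(tokens)


def occurrences(word):
    """Number of times the word form appears in the corpus."""
    return freq[word.lower()]


def relative_frequency(word):
    """Occurrences of the word divided by the total number of tokens."""
    return occurrences(word) / len(tokens)


if __name__ == "__main__":
    for word in ("the", "cat", "zebra"):
        print(word, occurrences(word), round(relative_frequency(word), 3))
```

Relative frequencies (rather than raw counts) are what make figures comparable across corpora of different sizes.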

English language

European languages

Middle Eastern Languages

  • Hamshahri Corpus (Persian)
  • Persian in MULTEXT-EAST corpus (Persian)[9]
  • Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
  • TEP: Tehran English-Persian Parallel Corpus[10]
  • TMC: Tehran Monolingual Corpus, Standard corpus for Persian Language Modeling[10]
  • Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 2005, 322 pp. ISBN 964-8699-32-1
  • Kurdish corpus (Sorani dialect), University of Kurdistan, Department of English Language and Linguistics
  • Bijankhan Corpus: a contemporary Persian corpus for NLP research, University of Tehran, 2012
  • Neo-Assyrian Text Corpus Project
  • Quranic Arabic Corpus (Classical Arabic)
  • Electronic Text Corpus of Sumerian Literature
  • Open Richly Annotated Cuneiform Corpus

East Asian Languages

Parallel corpora of diverse languages

  • Europarl Corpus - proceedings of the European Parliament from 1996 to 2011
  • EUR-Lex corpus - collection of texts in all official languages of the European Union, created from the EUR-Lex database[12]
  • OPUS: open-source parallel corpus in many languages[13]
  • Tatoeba: a parallel corpus containing about 2,288,000 sentences in 122 languages[14]
  • NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie)[15] (legacy repo)
  • SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.[16]
  • GRALIS parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)

Comparable Corpora

See also


  1. ^ Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at
  2. ^ "PhraseFinder". A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
  3. ^ "Corpus Resource Database (CoRD)". Department of English, University of Helsinki.
  4. ^ "Under Update". Retrieved 12 January 2014.
  5. ^
  6. ^ (in Spanish) "Molinolabs - corpus". Retrieved 12 January 2014.
  7. ^ "CorALit – CorALit - Lietuvių mokslo kalbos tekstynas". Retrieved 12 January 2014.
  8. ^ "Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage". Retrieved 12 January 2014.
  9. ^ "Available from CLARIN".
  10. ^ a b "University of Tehran NLP Lab". Retrieved 12 January 2014.
  11. ^ "KOTONOHA「現代日本語書き言葉均衡コーパス」 少納言". Retrieved 12 January 2014.
  12. ^ "EUR-Lex Corpus". Retrieved 27 October 2016.
  13. ^ "OPUS - an open source parallel corpus". Retrieved 12 January 2014.
  14. ^ "Tatoeba - Number of sentences per language". Retrieved 13 January 2014.
  15. ^ Liling Tan and Francis Bond (14 May 2012). "Building and Annotating the Linguistically Diverse NTU-MC (NTU — Multilingual Corpus)" (PDF). International Journal of Asian Language Processing. 22 (4): 161–174.
  16. ^ Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
  17. ^ Ralf Steinberger; Bruno Pouliquen; Anna Widiger; Camelia Ignat; Tomaž Erjavec; Dan Tufiş; Dániel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24–26 May 2006.
  18. ^ Liling Tan, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). 2014.
  19. ^ Kilgarriff, A. (2012, September). Getting to know your corpus. In International Conference on Text, Speech and Dialogue (pp. 3-15). Springer Berlin Heidelberg.
  20. ^ Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTenTen: a new, vast corpus for Arabic. Proceedings of WACL.
  21. ^ Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia-Social and Behavioral Sciences, 95, 12-19.
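  22. ^ Хохлова, М. В. (2016). Обзор больших русскоязычных корпусов текстов [A review of large Russian-language text corpora]. In Материалы научной конференции "Интернет и современное общество" [Proceedings of the scientific conference "Internet and Modern Society"] (pp. 74-77).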
  23. ^ Khokhlova, M. (2016). Comparison of High-Frequency Nouns from the Perspective of Large Corpora. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9.
  24. ^ Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434).