List of text corpora


The following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large and structured set of texts, nowadays usually stored and processed electronically. Text corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language territory.
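As a rough illustration of the "checking occurrences" and statistical-analysis uses mentioned above, the following Python sketch counts word frequencies over a tiny in-memory corpus. The three sample sentences, the simple lowercase alphabetic tokenizer, and the function names are assumptions made for illustration only; they are not taken from any corpus in this list, and real corpora typically need language-specific tokenization.

```python
import re
from collections import Counter

# Hypothetical three-document corpus; a real corpus would be read from files.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird sang in the tree.",
]

def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def relative_frequency(counts, word):
    """Return the share of all tokens in the corpus that are `word`."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

freqs = word_frequencies(corpus)
print(freqs.most_common(2))  # [('the', 5), ('cat', 2)]
print(round(relative_frequency(freqs, "the"), 3))  # 0.294
```

Raw counts answer "how often does this word occur?", while relative frequencies make corpora of different sizes comparable.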

English language

European languages

Slavic

East Slavic

South Slavic

West Slavic

German

Middle Eastern languages

  • Hamshahri Corpus (Persian)
  • Persian in MULTEXT-EAST corpus (Persian)[9]
  • Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
  • TEP: Tehran English-Persian Parallel Corpus[10]
  • TMC: Tehran Monolingual Corpus, a standard corpus for Persian language modeling[10]
  • Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 2005, 322 pp. ISBN 964-8699-32-1
  • Kurdish-corpus.uok.ac.ir (Kurdish corpus, Sorani dialect), University of Kurdistan, Department of English Language and Linguistics
  • Bijankhan Corpus: a contemporary Persian corpus for NLP research, University of Tehran, 2012
  • Neo-Assyrian Text Corpus Project
  • Quranic Arabic Corpus (Classical Arabic)
  • Electronic Text Corpus of Sumerian Literature
  • Open Richly Annotated Cuneiform Corpus
  • Asosoft text corpus[11]

East Asian languages

South Asian languages

Parallel corpora of diverse languages

  • Europarl Corpus - proceedings of the European Parliament from 1996 to 2011
  • EUR-Lex corpus - collection of documents in all official languages of the European Union, created from the EUR-Lex database[14]
  • OPUS: open-source parallel corpus in many languages[15]
  • Tatoeba: a parallel corpus containing about 2,288,000 sentences in 122 languages.[16]
  • NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie)[17] (legacy repo)
  • SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.[18]
  • GRALIS parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)


Comparable corpora

See also

References

  1. ^ Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
  2. ^ "PhraseFinder". A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
  3. ^ "Corpus Resource Database (CoRD)". Department of English, University of Helsinki.
  4. ^ (in Spanish) "Molinolabs - corpus". molinolabs.com. Retrieved 12 January 2014.
  5. ^ "CorALit – CorALit - Lietuvių mokslo kalbos tekstynas". coralit.lt. Retrieved 12 January 2014.
  6. ^ "Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage". tnc.org.tr. Retrieved 12 January 2014.
  7. ^ "Under Update". search.dcl.bas.bg. Retrieved 12 January 2014.
  8. ^ "Portál | Český národní korpus".
  9. ^ Zdravkova, Katrina; Tufiş, Dan; Simov, Kiril; Radziszewski, Adam; Qasemizadeh, Behrang; Priest-Dorman, Greg; Petkevič, Vladimír; Oravecz, Csaba; Krstev, Cvetana; Kotsyba, Natalia; Kaalep, Heiki-Jaan; Ide, Nancy; Garabík, Radovan; Dimitrova, Ludmila; Derzhanski, Ivan; Barbu, Ana-Maria; Erjavec, Tomaž (2010-05-14). "Available from CLARIN". http://nl.ijs.si/me/v4/.
  10. ^ a b "University of Tehran NLP Lab". ece.ut.ac.ir. Retrieved 12 January 2014.
  11. ^ Hadi Veisi, Mohammad MohammadAmini, Hawre Hosseini; Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus, Digital Scholarship in the Humanities, fqy074, https://doi.org/10.1093/llc/fqy074
  12. ^ "KOTONOHA「現代日本語書き言葉均衡コーパス」 少納言". kotonoha.gr.jp. Retrieved 12 January 2014.
  13. ^ D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. de Silva, and G. Dias. 2015. Implementing a Corpus for Sinhala Language. In Symposium on Language Technology for South Asia.
  14. ^ "EUR-Lex Corpus". sketchengine.co.uk. Retrieved 27 October 2016.
  15. ^ "OPUS - an open source parallel corpus". opus.lingfil.uu.se. Retrieved 12 January 2014.
  16. ^ "Tatoeba - Number of sentences per language". tatoeba.org. Retrieved 13 January 2014.
  17. ^ Liling Tan and Francis Bond (14 May 2012). "Building and Annotating the Linguistically Diverse NTU-MC (NTU — Multilingual Corpus)" (PDF). International Journal of Asian Language Processing. 22 (4): 161–174.
  18. ^ Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
  19. ^ Ralf Steinberger; Bruno Pouliquen; Anna Widiger; Camelia Ignat; Tomaž Erjavec; Dan Tufiş; Dániel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24–26 May 2006.
  20. ^ Liling Tan, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). 2014.
  21. ^ Kilgarriff, Adam (2012). "Getting to Know Your Corpus". Text, Speech and Dialogue. Lecture Notes in Computer Science. 7499. pp. 3–15. CiteSeerX 10.1.1.452.8074. doi:10.1007/978-3-642-32790-2_1. ISBN 978-3-642-32789-6.
  22. ^ Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTenTen: a new, vast corpus for Arabic. Proceedings of WACL.
  23. ^ Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia-Social and Behavioral Sciences, 95, 12-19.
  24. ^ (in Russian) Хохлова, М. В. (2016). Обзор больших русскоязычных корпусов текстов. In Материалы научной конференции "Интернет и современное общество" (pp. 74-77).
  25. ^ Khokhlova, M. (2016). Comparison of High-Frequency Nouns from the Perspective of Large Corpora. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9.
  26. ^ Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434)