List of text corpora
The following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Text corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a specific language territory.
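As a rough illustration of the kind of statistical analysis mentioned above, the following Python sketch counts word occurrences in a tiny in-memory corpus. The three sample sentences and the helper names (`tokenize`, `word_frequencies`, `relative_frequency`) are hypothetical and chosen only for this example; real corpora are far larger and are usually accessed through their own tools or query interfaces.

```python
from collections import Counter
import re

# Hypothetical three-document corpus; a real corpus would be loaded from files.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird sang.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def relative_frequency(counts, word):
    """Return the share of all tokens accounted for by `word`."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

frequencies = word_frequencies(corpus)
print(frequencies.most_common(3))
print(relative_frequency(frequencies, "cat"))
```

The simple regular-expression tokenizer only handles unaccented Latin letters; corpora in other scripts would need a tokenizer suited to the language.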
- Google Books Ngram Corpus
- American National Corpus
- Bank of English
- British National Corpus
- Corpus Juris Secundum
- Corpus of Contemporary American English (COCA): 425 million words, 1990–2011. Freely searchable online.
- Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.
- International Corpus of English
- Oxford English Corpus
- Scottish Corpus of Texts & Speech
- Corpus Resource Database (CoRD), more than 80 English language corpora.
- Bulgarian National Corpus
- Croatian Language Corpus
- Croatian National Corpus
- Czech National Corpus
- Google Books Ngram Corpus
- Russian National Corpus
- General Internet Corpus of Russian
- Slovenian National Corpus
- Thesaurus Linguae Graecae (Ancient Greek)
- Eastern Armenian National Corpus (EANC): 110 million words. Freely searchable online.
- National Corpus of Polish
- German Reference Corpus (DeReKo): more than 4 billion words of contemporary written German.
- Free corpus of German mistakes from people with dyslexia
- Spanish text corpus by Molino de Ideas, which contains 660 million words.
- CorALit: the Corpus of Academic Lithuanian. Academic texts published in 1999–2009 (approx. 9 million words), compiled at the University of Vilnius, Lithuania.
- Reference Corpus of Contemporary Portuguese (CRPC)
- Turkish National Corpus
Middle Eastern Languages
- Hamshahri Corpus (Persian a.k.a. Farsi)
- Persian in MULTEXT-EAST corpus (Persian a.k.a. Farsi)
- Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
- TEP: Tehran English-Persian Parallel Corpus
- TMC: Tehran Monolingual Corpus, a standard corpus for Persian language modeling
- Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 322 pp. ISBN 964-8699-32-1
- Kurdish-corpus.uok.ac.ir (Kurdish corpus, Sorani dialect), University of Kurdistan, Department of English Language and Linguistics
- Bijankhan Corpus: a contemporary Persian corpus for NLP research, University of Tehran, 2012
- Neo-Assyrian Text Corpus Project
- Quranic Arabic Corpus (Classical Arabic)
East Asian Languages
Parallel corpora of diverse languages
- Europarl Corpus - proceedings of the European Parliament from 1996 to 2011
- EUR-Lex corpus - collection of texts in all official languages of the European Union, created from the EUR-Lex database
- OPUS: open source parallel corpus in many languages
- Tatoeba: a parallel corpus containing about 2,288,000 sentences in 122 languages.
- NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie) (legacy repo)
- SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.
- GRALIS parallel texts for various Slavic languages, compiled by the Institute for Slavic Languages at Graz University (Branko Tošović et al.)
- WaCky - The Web-As-Corpus Kool Yinitiative, web as corpus (eng, fre, deu, ita)
- Disambiguating Similar Language Corpora Collection (DSLCC) (Bosnian, Croatian, Serbian, Indonesian, Malay, Czech, Slovak, Brazilian Portuguese, European Portuguese, Peninsular Spanish, Argentine Spanish)
- Wikipedia Comparable Corpora (41 million aligned Wikipedia articles for 253 language pairs)
- The TenTen Corpus Family – comparable web corpora with a target size of 10 billion words. These corpora are available in the corpus management system Sketch Engine. Currently, TenTen corpora exist for more than 30 languages (such as the English TenTen corpus, Arabic TenTen corpus, Spanish TenTen corpus, Russian TenTen corpus, and others). An overview of existing TenTen corpora can be found at https://www.sketchengine.co.uk/documentation/tenten-corpora/
- Timestamped JSI web corpora – web corpora of news articles crawled from a list of RSS feeds. The newsfeed corpora are being prepared in the framework of a project implemented by the Jožef Stefan Institute, a Slovenian scientific research institute, and are published in Sketch Engine. More information is available on the project website.
- Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
- "PhraseFinder". A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
- "Corpus Resource Database (CoRD)". Department of English, University of Helsinki.