Croatian Language Corpus

The Croatian Language Corpus (CLC) (Croatian: Hrvatski jezični korpus, HJK) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).

Background

From May 2005, the CLC was initially funded by the Ministry of Science, Education, and Sports of the Republic of Croatia (MZOŠ) as a sub-project of the research program Riznica (Croatian Language Repository) (project no. 0212010). In a second development phase, beginning in 2007, the further extension and development of the CLC was embedded within the research program The Croatian Language Repository (CLR), which was granted by the MZOŠ (cf. Ćavar and Brozović Rončević, 2012^[1]). Because the CLR is a research program (PI Dunja Brozović Rončević) that subsumes numerous independent research projects making use of the CLC, the corpus is mainly developed as a by-product of those projects. Currently, Dunja Brozović Rončević and Damir Ćavar are in charge of the corpus development.

Goals

One of the main goals of the CLC project is to create a publicly available Croatian corpus that is annotated on multiple levels: lemmatized, morphologically segmented and morpho-syntactically annotated, phonemically transcribed and syllabified, and syntactically parsed. While the current version of the corpus provides resources from standard Croatian, several corpora covering different stages in the development of Croatian are being created as well, including digitized manuscripts and Croatian dictionaries.

Format and Availability

From the outset, the collected and digitized texts in the CLC were annotated using the Text Encoding Initiative (TEI) P5 XML standard. Currently, approximately 90 million tokens are available in the TEI P5 XML format. The corpus can be accessed online via the Philologic^[2] interface (see The ARTFL Project,^[3] Department of Romance Languages and Literatures, The University of Chicago). It is partitioned into various virtual sub-corpora, and custom sub-corpus definitions can be provided on demand.
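As a rough illustration of what working with TEI P5 XML can look like, the following minimal Python sketch reads word tokens and their lemmas from a small TEI document. The sample text, the use of `<s>`, `<w>` and `<pc>` elements with a `lemma` attribute, and the function name `read_tokens` are assumptions made for this example; the actual markup conventions of the CLC are not described here and may differ.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# TEI P5 documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample for illustration only; it is NOT taken from the CLC.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample</title></titleStmt>
      <publicationStmt><p>Illustration only</p></publicationStmt>
      <sourceDesc><p>Invented example</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <s>
          <w lemma="ovaj">Ovo</w>
          <w lemma="biti">je</w>
          <w lemma="primjer">primjer</w>
          <pc>.</pc>
        </s>
      </p>
    </body>
  </text>
</TEI>"""


def read_tokens(tei_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (word form, lemma) pairs for every <w> element in a TEI document.

    Punctuation (<pc>) is skipped; a missing lemma attribute yields None.
    """
    root = ET.fromstring(tei_xml)
    return [
        ((w.text or "").strip(), w.get("lemma"))
        for w in root.iterfind(".//tei:w", TEI_NS)
    ]


if __name__ == "__main__":
    for form, lemma in read_tokens(SAMPLE):
        print(form, lemma)
```

The sketch relies only on the Python standard library. For files of the size mentioned above, an incremental parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading a whole document into memory.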

Content

The CLC is assembled from selected Croatian texts covering various functional domains and genres. It includes literature and other written sources from the period in which the standardization of Croatian began to take its final shape, i.e. from the second half of the 19th century onward.

The CLC consists of:

fundamental Croatian literature (e.g. novels, short stories, drama, poetry)
non-fiction
scientific publications from various domains and university textbooks
school books
translated literature by outstanding Croatian translators
online journals and newspapers
books from the pre-standardization period of Croatian that have been adapted to present-day standard Croatian

Cooperation

The realization of the CLC was made possible in cooperation with:

Školska knjiga d.d.
Croatian Academy of Sciences and Arts (HAZU)
Stoljeća hrvatske književnosti, Matica hrvatska

References

[1] Ćavar and Brozović Rončević, 2012
[2] Philologic
[3] "The ARTFL Project". Archived from the original on 2009-12-04. Retrieved 2011-05-22.

External links

Croatian Language Corpus (CLC) website and Philologic interface
(in Croatian) Croatian National Corpus, another Croatian corpus by the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb

