Parallel text

A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla (Greek for "sixfold") placed six versions of the Old Testament side by side. The most famous example of a parallel text is the Rosetta Stone.

Large collections of parallel texts are called parallel corpora (see text corpus). Alignment of parallel corpora at the sentence level is a prerequisite for many areas of linguistic research. During translation, the translator may split, merge, delete, insert or reorder sentences, which makes alignment a non-trivial task.
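As an illustration of why splits, merges, deletions and insertions complicate alignment, the following is a minimal sketch of a length-based dynamic-programming aligner. It is not taken from any particular tool: the function names, the set of allowed alignment patterns and the penalty values are all assumptions chosen for the example. It also assumes that sentence order is preserved, so it cannot recover reordered sentences.

```python
from typing import List, Tuple

# Allowed alignment patterns: (source sentences consumed, target sentences consumed).
# The penalty values are illustrative assumptions, not tuned figures.
PATTERNS = {
    (1, 1): 0.0,  # one sentence translated as one sentence
    (1, 0): 3.0,  # source sentence deleted by the translator
    (0, 1): 3.0,  # target sentence inserted by the translator
    (2, 1): 1.0,  # two source sentences merged into one
    (1, 2): 1.0,  # one source sentence split into two
}


def length_cost(src: List[str], tgt: List[str]) -> float:
    """Relative difference in character length between the two sides."""
    a = sum(len(s) for s in src)
    b = sum(len(t) for t in tgt)
    return abs(a - b) / max(a, b, 1)


def align(source: List[str], target: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the cheapest monotone alignment as a list of (source, target) groups."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Fill the table in row-major order; every step moves forward in at
    # least one text, so each cell is final before it is expanded.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in PATTERNS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + penalty + length_cost(source[i:ni], target[j:nj])
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)

    # Follow the back-pointers from the end of both texts to the start.
    groups = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        groups.append((source[i - di:i], target[j - dj:j]))
        i, j = i - di, j - dj
    groups.reverse()
    return groups


english = ["The cat sat.", "It was happy. It purred."]
french = ["Le chat s'est assis.", "Il était content.", "Il ronronnait."]
for src_group, tgt_group in align(english, french):
    print(src_group, "<->", tgt_group)
```

In this toy input the second English sentence corresponds to two French sentences, so a purely one-to-one pairing would leave a French sentence unmatched. Practical aligners typically combine length information with lexical cues and use statistically motivated costs rather than the fixed penalties assumed here.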

Types of parallel corpora

Four main types of corpora can be distinguished.

A noisy parallel corpus contains bilingual sentences that are not perfectly aligned or have poor quality translations. Nevertheless, most of its contents are bilingual translations of a specific document.

A comparable corpus is built from non-sentence-aligned and untranslated bilingual documents, but the documents are topic-aligned.

A quasi-comparable corpus includes very heterogeneous and non-parallel bilingual documents that may or may not be topic-aligned.

The rarest parallel corpora contain translations of the same document into two or more languages, aligned at least at the sentence level.

Noise in corpora

Large corpora used as training sets for machine translation algorithms are usually extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events.

However, extracted fragments may be noisy, with extra elements inserted in each corpus. Extraction techniques can differentiate between bilingual elements represented in both corpora and monolingual elements represented in only one corpus in order to extract cleaner parallel fragments of bilingual elements. Comparable corpora are used to directly obtain knowledge for translation purposes. High-quality parallel data is difficult to obtain, however, especially for under-resourced languages.^[1]
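One simple way to separate likely bilingual fragments from monolingual noise can be sketched as follows. This is an illustrative heuristic only, not the method of the cited work: the bilingual lexicon, the whitespace tokenisation, the function names and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def lexical_overlap(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens that have a lexicon translation in the target."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def filter_pairs(
    pairs: List[Tuple[str, str]],
    lexicon: Dict[str, Set[str]],
    threshold: float = 0.5,  # assumed cut-off, chosen only for illustration
) -> List[Tuple[str, str]]:
    """Keep candidate pairs whose lexical overlap reaches the threshold."""
    return [(s, t) for s, t in pairs if lexical_overlap(s, t, lexicon) >= threshold]


# Hypothetical English-to-French seed lexicon and candidate pairs.
lexicon = {"black": {"noir"}, "cat": {"chat"}, "house": {"maison"}}
candidates = [
    ("black cat", "chat noir"),
    ("black house", "le temps est beau"),
]
clean = filter_pairs(candidates, lexicon)
print(clean)
```

Only the first candidate survives, because both of its source words have a known translation in the target fragment. Real filtering pipelines generally rely on much richer signals than a small seed lexicon.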

Bitext

In the field of translation studies, a bitext is a merged document composed of both source- and target-language versions of a given text.

Bitexts are generated by a piece of software called an alignment tool, or a bitext tool, which automatically aligns the original and translated versions of the same text. The tool generally matches these two texts sentence by sentence. A collection of bitexts is called a bitext database or a bilingual corpus, and can be consulted with a search tool.

Bitexts and translation memories

The concept of the bitext shows certain similarities with that of the translation memory. Generally, the most salient difference is that a translation memory is a database whose segments (matched sentences) are stored in a way that is totally unrelated to their original context, so the original sentence order is lost, whereas a bitext retains the original sentence order. However, some implementations of translation memory, such as Translation Memory eXchange (TMX), a standard XML format for exchanging translation memories between computer-assisted translation (CAT) programs, allow the original order of sentences to be preserved.
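The structural difference can be sketched with two plain data structures. The representation below is an assumption made for illustration and does not reflect the storage format of any real CAT tool: the bitext is modelled as an ordered list of segment pairs, and the translation memory as a lookup table keyed by source segment.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (source sentence, target sentence)

# A bitext keeps every segment in document order, including repeats.
bitext: List[Segment] = [
    ("Good morning.", "Bonjour."),
    ("How are you?", "Comment allez-vous ?"),
    ("Good morning.", "Bonjour."),
]


def build_translation_memory(segments: List[Segment]) -> Dict[str, str]:
    """Index segments by source text; position and repeats are discarded."""
    return {src: tgt for src, tgt in segments}


def context_of(segments: List[Segment], index: int, window: int = 1) -> List[Segment]:
    """Return the segments surrounding position `index` in the bitext."""
    start = max(0, index - window)
    return segments[start:index + window + 1]


tm = build_translation_memory(bitext)
print(len(bitext), len(tm))
print(context_of(bitext, 1))
```

The lookup table answers "how was this sentence translated?" but can no longer say which sentences came before or after it, while the ordered list still supports that kind of contextual consultation.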

Bitexts are designed to be consulted by a human translator, not by a machine. As such, small alignment errors or minor discrepancies that would cause a translation memory to fail are of no importance.

In his original 1988 article, Harris also posited that bitext represents how translators hold their source and target texts together in their mental working memories as they progress. However, this hypothesis has not been followed up.^[2]

External links

Parallel corpora

The JRC-Acquis Multilingual Parallel Corpus of the total body of European Union (EU) law: Acquis Communautaire with 231 language pairs.^[3]
European Parliament Proceedings Parallel Corpus 1996-2011
The Opus project aims at collecting freely available parallel corpora
Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles
COMPARA - Portuguese/English parallel corpora
TERMSEARCH - English/Russian/French parallel corpora (major international treaties, conventions, agreements, etc.)
TradooIT - English/French/Spanish - Free Online tools
Nunavut Hansard - English/Inuktitut parallel corpus
ParaSol - A parallel corpus of Slavic and other languages
Glosbe: Multilanguage parallel corpora with online search interface
InterCorp: A multilingual parallel corpus of 20+ languages aligned with Czech, with online search interface
myCAT - Olanto, concordancer (open source AGPL) with online search on JRC and UNO corpus
TAUS, with online search interface.
linguatools multilingual parallel corpora, online search interface.

References

[1] Wołk, K. (2015). "Noisy-Parallel and Comparable Corpora Filtering Methodology for the Extraction of Bi-Lingual Equivalent Data at Sentence Level". Computer Science (16.2): 169–184.
[2] Harris, B. (1988). "Bi-text, a new concept in translation theory". Language Monthly (UK) 54, pp. 8–10, March 1988.
[3] Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Dániel Varga (2006). "The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages". Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24–26 May 2006.
