Tatoeba

Type of site	Open educational resources (OER)
Available in	56 languages of the interface; content in 420 languages (November 2022)
Owner	Association Tatoeba
Created by	Trang Ho, Allan Simon
URL	tatoeba.org
Commercial	No
Registration	Optional
Launched	2006
Current status	Online; beta
Content license	Creative Commons Attribution 2.0 (some sentences under Creative Commons Zero, audio varies)

Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. Its name comes from the Japanese phrase "tatoeba" (例えば), meaning "for example". It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans. It is hosted by Association Tatoeba, a French non-profit organization funded through donations.

As of November 2022, the Tatoeba Corpus has over 10,800,000 sentences in 420 languages. 55 of these languages have 10,000 or more sentences. About 1 million sentences have audio recordings.^[1]

The sentences are interrelated within a graph, facilitating translations in different languages. As of November 2022, the Tatoeba Graph lists over 21,800,000 links between sentences. 237 language pairs have over 10,000 translated sentences.^[2]

History

In 2006, Trang Ho was frustrated that German bilingual dictionaries, unlike some of their Japanese counterparts, did not offer full-text search of usage examples with translations.^[3] This led her to imagine her ideal dictionary^[4] and to build a prototype hosted on SourceForge under the name "multilangdict."^[5] The main focus was already the crowdsourcing of translated sentences: "A Wikipedia type of thing, except people add sentences, not articles."

Alongside her studies at the University of Technology of Compiègne, Trang Ho gradually improved her website with a few classmates. She rebuilt the project from scratch twice and rebranded it as Tatoeba. In September 2007, about 150,000 English-Japanese sentence pairs from the Tanaka Corpus — a public-domain compilation released in 2001 by Hyogo University professor Yasuhito Tanaka and maintained by Jim Breen and Paul Blay — were imported into the Tatoeba Corpus.^[6] In December 2008, Trang Ho released the first version of the current codebase built around a more flexible data model.^[7] The following month, the website moved to the tatoeba.org domain.^[8]

Over the 2009–2010 academic year, Allan Simon, then a student at SUPINFO, became a core developer of Tatoeba. Together with Trang Ho and other young developers, he made Tatoeba more social by adding sentence lists, user profiles, private messaging, and a Facebook-inspired Wall. The team also introduced significant features such as sentence linking, tagging, and "translation of translation" search. In November 2010, Tatoeba passed the 600,000-sentence mark. Within a year, the number of sentences added daily had increased almost 50-fold.^[9]

Between 2014 and 2016, a new team of developers formed around Trang Ho.^[10] They mentored students at the Google Summer of Code 2014^[11] and added features to improve corpus quality.

Over the 2018-2020 period, support from the Mozilla Foundation as part of the Common Voice project allowed Tatoeba to make its platform more open and user-friendly.^[12]^[13]

Openness

Reading

Users, even those who are not registered, can search for words in any language to retrieve sentences that use them. Each sentence in the Tatoeba Corpus is displayed next to its likely translations in other languages; translations and "translations of translations" are differentiated. Sentences are tagged for content such as subject matter, dialect, or vulgarity. Each sentence also has its own comment thread, where other users can leave feedback, corrections, and cultural notes. Sentences can be browsed by language, tag, and other criteria.

A simplified diagram of Tatoeba's underlying data structure.
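
The distinction between direct translations and "translations of translations" can be illustrated with a small graph. The following is a minimal sketch, not Tatoeba's actual implementation: the sentence IDs, the sample sentences, and the function names are hypothetical, and links are assumed to be symmetric.

```python
from collections import defaultdict

# Hypothetical sample data: sentence id -> (language code, text).
sentences = {
    1: ("eng", "For example."),
    2: ("jpn", "例えば。"),
    3: ("fra", "Par exemple."),
    4: ("deu", "Zum Beispiel."),
}

# Hypothetical links; each pair marks two sentences as translations of each other.
links = [(1, 2), (1, 3), (3, 4)]


def build_graph(pairs):
    """Build an undirected adjacency map from sentence-id pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def translations(graph, sentence_id):
    """Return (direct, indirect) translation ids for one sentence.

    Direct translations are linked to the sentence itself; indirect ones
    ("translations of translations") are linked only to a direct translation.
    """
    direct = set(graph.get(sentence_id, ()))
    indirect = set()
    for neighbour in direct:
        indirect |= graph.get(neighbour, set())
    indirect -= direct | {sentence_id}
    return direct, indirect


graph = build_graph(links)
direct, indirect = translations(graph, 2)
print("direct:", [sentences[i][1] for i in sorted(direct)])
print("indirect:", [sentences[i][1] for i in sorted(indirect)])
```

In this sample, the Japanese sentence is linked only to the English one, so the French sentence appears as an indirect translation, while the German sentence, three links away, is not shown at all.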

Editing

Registered users can add new sentences or translate or proofread existing ones, even if their target language is not their native tongue. However, users are encouraged to add original sentences or translations in their native or strongest language.^[14]

Users can freely edit their sentences, "adopt" and correct sentences without an owner, and comment on others' sentences. Advanced contributors, a rank above ordinary contributors, can tag, link, and unlink sentences. Corpus maintainers, a rank above advanced contributors, can untag and delete sentences. They can also modify owned sentences, though they typically do so only if the owner fails to respond to a request to make the change.
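
The rank system described above can be modelled as a table of minimum ranks. This is an illustrative sketch only: the action names are hypothetical labels, and it assumes that each rank inherits every permission of the ranks below it, which the description implies but does not state outright.

```python
from enum import IntEnum


class Rank(IntEnum):
    """Contributor ranks, ordered from least to most privileged."""
    CONTRIBUTOR = 1
    ADVANCED_CONTRIBUTOR = 2
    CORPUS_MAINTAINER = 3


# Minimum rank for each action (action names are hypothetical labels).
MIN_RANK = {
    "edit_own_sentence": Rank.CONTRIBUTOR,
    "adopt_unowned_sentence": Rank.CONTRIBUTOR,
    "comment": Rank.CONTRIBUTOR,
    "tag": Rank.ADVANCED_CONTRIBUTOR,
    "link": Rank.ADVANCED_CONTRIBUTOR,
    "unlink": Rank.ADVANCED_CONTRIBUTOR,
    "untag": Rank.CORPUS_MAINTAINER,
    "delete": Rank.CORPUS_MAINTAINER,
    "edit_owned_sentence": Rank.CORPUS_MAINTAINER,
}


def can(rank, action):
    """Return True if the given rank is allowed to perform the action."""
    # Assumption: permissions are cumulative across ranks.
    return rank >= MIN_RANK[action]


print(can(Rank.ADVANCED_CONTRIBUTOR, "link"))
print(can(Rank.ADVANCED_CONTRIBUTOR, "delete"))
```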

Operation

Tatoeba received a grant from Mozilla Drumbeat in December 2010.^[15]^[16]

Some work on the Tatoeba infrastructure was sponsored by Google Summer of Code, 2014 edition.^[11]

In May 2018, Tatoeba received a $25,000 grant from the Mozilla Open Source Support (MOSS) program.^[12]

In August 2019, it received a second MOSS grant of $15,000.^[13]

Access to content

Content licensing

By default, the sentences of the Tatoeba Corpus are published under a Creative Commons Attribution 2.0 license,^[17] freeing them for academic and other use. Users can also contribute sentences under Creative Commons Zero, though translations of those sentences currently cannot share the same license.^[18]

Audio recordings of the sentences use the speaker's choice of license, such as CC BY 4.0, BY-SA, BY-NC, or no public license at all.^[19]

Offline use

Visitors can download tab-delimited sentence pairs from the Tatoeba website, ready for import into Anki and similar spaced repetition software.^[2]
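
As a rough illustration of working with such a file, the sketch below reads tab-delimited sentence pairs using Python's standard library. The column layout is an assumption (source sentence, a tab, then its translation); the actual export may contain additional columns such as sentence IDs, and the sample rows are invented.

```python
import csv
import io

# Hypothetical excerpt of a downloaded file: one pair per line,
# the source sentence, a tab, then its translation (assumed layout).
SAMPLE = "For example.\tPar exemple.\nThank you.\tMerci.\n"


def read_pairs(handle):
    """Yield (source, translation) tuples from a tab-delimited stream."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]


pairs = list(read_pairs(io.StringIO(SAMPLE)))
for front, back in pairs:
    print(f"{front} -> {back}")
```

To read a real download, `io.StringIO(SAMPLE)` would be replaced with a file opened in text mode with UTF-8 encoding.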

Related projects

Second-language acquisition

The JMdict Japanese-English dictionary selects its example sentences from the Tatoeba Corpus.^[20] OpenRussian is a free Russian dictionary built primarily from the content of Wiktionary and Tatoeba.^[21] Selected content from Tatoeba in Esperanto is available in the multilingual DVD Esperanto Elektronike published by E@I.^[22]

Regional or minority languages

Some digital language activists contribute to open collaborative projects like Tatoeba, Wikipedia, and Common Voice to promote their minority language in digital spaces.^[23] Regional languages such as Kabyle, Catalan, and Basque each count more than a hundred members on Tatoeba.^[24]

Language technology

From 2008 to 2011, Francis Bond used the Tatoeba Corpus for his research on the Japanese language.^[26]^[27]

Since 2013, Jörg Tiedemann has been spreading Tatoeba parallel corpora more widely in the machine translation community by sharing them on the OPUS repository and organizing the "Tatoeba Translation Challenge".^[28]^[29] With the rise of deep learning, researchers increasingly use Tatoeba's data sets to train and evaluate their massively multilingual models in tasks like machine translation,^[30] language identification,^[31] semantic search,^[32] and speech recognition.^[33]

References

[1] "Number of sentences per language - Tatoeba". tatoeba.org. Retrieved 1 November 2022.
[2] "Download sentences - Tatoeba". tatoeba.org. Retrieved 1 November 2022.
[3] Trang. "The story of Tatoeba". Retrieved 8 November 2022.
[4] "Trang's ideal dictionary.pdf". Google Docs. Retrieved 8 November 2022.
[5] "Trang's dictionary project". sourceforge.net.
[6] "Tanaka Corpus". EDRDG Wiki. Electronic Dictionary Research and Development Group. 3 February 2011. Retrieved 20 March 2011.
[7] Tatoeba Stream #3 - Going back in time. Retrieved 8 November 2022.
[8] Trang. "New address : tatoeba.org". Retrieved 8 November 2022.
[9] Trang. "Some stats". Retrieved 8 November 2022.
[10] AlanF. "Update on development". Retrieved 8 November 2022.
[11] "Google Summer of Code 2014 Organization Association Tatoeba". www.google-melange.com. Retrieved 26 September 2022.
[12] "MOSS award for Tatoeba". Retrieved 26 September 2022.
[13] "A second MOSS award". Retrieved 26 September 2022.
[14] "Quick Start Guide".
[15] Ho, Trang (17 January 2011). "Grant from Mozilla Drumbeat". Tatoeba Project Blog. Retrieved 20 March 2011.
[16] Moltke, Henrik (30 December 2010). "Best Drumbeat Projects: Tatoeba – a free and open database of sentences". Yoyodyne.cc. Archived from the original on 2 January 2011. Retrieved 20 March 2011. ...the Mozilla Foundation wants to encourage and help the Tatoeba project by giving it a USD 2.5K Mozilla Drumbeat Grant.
[17] "Terms of use". Tatoeba.org. Retrieved 20 March 2011.
[18] "How to contribute under CC0". en.wiki.tatoeba.org. Retrieved 25 October 2021.
[19] "All public lists containing "audio" (140) - Tatoeba". tatoeba.org. Retrieved 25 October 2021.
[20] "WWWJDIC - INFORMATION". www.edrdg.org. Retrieved 13 November 2022.
[21] "About OpenRussian". en.openrussian.org. Retrieved 16 November 2022.
[22] "Esperanto Elektronike | E@I". 13 October 2017. Retrieved 1 November 2022.
[23] "Rising Voices - Meet Prasanta Hembram, a Santali language digital activist from India". Rising Voices. 28 June 2022. Retrieved 15 November 2022.
[24] "Languages of members - Tatoeba". tatoeba.org. Retrieved 15 November 2022.
[25] "Google Scholar". scholar.google.com. Retrieved 13 November 2022.
[26] Francis Bond, 栗林孝行 [Takayuki Kuribayashi], 橋本力 [Hashimoto Chikara] (2008) HPSGに基づくフリーな日本語ツリーバンクの構築 [A free Japanese Treebank based on HPSG]. In 14th Annual Meeting of The Association for Natural Language Processing, Tokyo.
[27] Eric Nichols, Francis Bond, Darren Scott Appling and Yuji Matsumoto (2010) Paraphrasing Training Data for Statistical Machine Translation. Journal of Natural Language Processing, 17(3), pages 101–122.
[28] "OPUS - an open source parallel corpus". 30 July 2013. Archived from the original on 30 July 2013. Retrieved 13 November 2022.
[29] Tiedemann, Jörg (13 October 2020). "The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT". arXiv:2010.06354 [cs.CL].
[30] NLLB Team; Costa-jussà, Marta R.; Cross, James; Çelebi, Onur; Elbayad, Maha; Heafield, Kenneth; Heffernan, Kevin; Kalbassi, Elahe; Lam, Janice; Licht, Daniel; Maillard, Jean; Sun, Anna; Wang, Skyler; Wenzek, Guillaume; Youngblood, Al (25 August 2022). "No Language Left Behind: Scaling Human-Centered Machine Translation". arXiv:2207.04672 [cs.CL].
[31] "Language identification · fastText". fasttext.cc. Retrieved 16 November 2022.
[32] Hu, Junjie; Ruder, Sebastian; Siddhant, Aditya; Neubig, Graham; Firat, Orhan; Johnson, Melvin (4 September 2020). "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization". arXiv:2003.11080 [cs.CL].
[33] Wang, Changhan; Pino, Juan; Wu, Anne; Gu, Jiatao (9 June 2020). "CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus". arXiv:2002.01320 [cs.CL].

External links

Media related to Tatoeba at Wikimedia Commons
Official website
Video of Trang Ho introducing Tatoeba at MozFest 2019
Tatoeba's statistics
Tatoeba Translation Challenge

