Tatoeba

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Kevlar67 (talk | contribs) at 17:40, 22 May 2022 (added Category:Translation websites using HotCat). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Tatoeba
Type of site: Open collection of sentences with translations
Available in: 56 languages of the interface; content in 412 languages (April 2022)
Owner: Association Tatoeba
Created by: Trang Ho, Allan Simon
URL: tatoeba.org
Commercial: No
Registration: Optional
Launched: 2006
Current status: Online; beta
Content license: Creative Commons Attribution 2.0 (some sentences under Creative Commons Zero, audio varies)

Tatoeba is a free-content collection of example sentences with translations, geared towards foreign language learners. Its name comes from the Japanese phrase "tatoeba" (例えば), meaning "for example". It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans. It is hosted by Association Tatoeba, a French non-profit organization funded through donations.

Content

As of November 2021, the Tatoeba Corpus has over 10,000,000 sentences in 409 languages. The top 10 languages make up 71% of the corpus. 116 of these languages have 1,000 or more sentences. The top 18 languages have over 100,000 sentences each.[1] As of April 2022, almost 1,000,000 sentences in 38 languages had audio recordings.

Tatoeba is also the current home of the Tanaka Corpus, a public-domain collection of about 150,000 English–Japanese sentence pairs compiled by Hyogo University professor Yasuhito Tanaka and first released in 2001. The corpus is undergoing its latest revisions within Tatoeba.[2][3]

History

The project was initiated by Trang Ho in 2006 and originally hosted on SourceForge under the name "multilangdict".[4]

Interface

Users, even those who are not registered, can search for words in any language to retrieve sentences that use them. Each sentence in the Tatoeba database is displayed next to its likely translations in other languages; direct and indirect translations are differentiated. Sentences are tagged for content such as subject matter, dialect, or vulgarity. Each sentence also has its own comment thread, where other users can leave feedback, corrections, and cultural notes. Sentences can be browsed by language, tag, and other criteria.

Registered users can add new sentences or translate or proofread existing ones, even if their target language is not their native tongue. However, users are encouraged to add original sentences or translations in their native or strongest language.[5]

Translations are linked to the original sentence automatically. Users can freely edit their sentences, "adopt" and correct sentences without an owner, and comment on others' sentences. Advanced contributors, a rank above ordinary contributors, can tag, link, and unlink sentences. Corpus maintainers, a rank above advanced contributors, can untag and delete sentences. They can also modify owned sentences, though they typically do so only if the owner fails to respond to a request to make the change.

Database structure

A simplified diagram of Tatoeba's underlying data structure.

The sentences are interrelated within a graph, facilitating translations in different languages in a many-to-many fashion.[6]
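The many-to-many structure can be pictured with a small in-memory sketch. This is not Tatoeba's actual schema: the class name `SentenceGraph`, its methods, and the sample sentences are hypothetical, and it assumes that links are symmetric and that an "indirect" translation is a translation of a direct translation (two hops away in the graph).

```python
from collections import defaultdict


class SentenceGraph:
    """Hypothetical model: sentences are nodes, translation links are undirected edges."""

    def __init__(self):
        self.sentences = {}             # sentence id -> (language code, text)
        self.links = defaultdict(set)   # sentence id -> ids of directly linked sentences

    def add_sentence(self, sid, lang, text):
        self.sentences[sid] = (lang, text)

    def link(self, a, b):
        # Assumption: a translation link is stored in both directions.
        self.links[a].add(b)
        self.links[b].add(a)

    def direct_translations(self, sid):
        # Sentences exactly one link away.
        return set(self.links[sid])

    def indirect_translations(self, sid):
        # Sentences two links away that are not already direct translations.
        direct = self.links[sid]
        two_hops = set()
        for neighbour in direct:
            two_hops |= self.links[neighbour]
        return two_hops - direct - {sid}


graph = SentenceGraph()
graph.add_sentence(1, "eng", "Thank you.")
graph.add_sentence(2, "fra", "Merci.")
graph.add_sentence(3, "jpn", "ありがとう。")
graph.add_sentence(4, "deu", "Danke.")
graph.link(1, 2)   # English <-> French
graph.link(2, 3)   # French <-> Japanese
graph.link(1, 4)   # English <-> German

direct = graph.direct_translations(1)
indirect = graph.indirect_translations(1)
```

In this example the English sentence has two direct translations (French and German), while the Japanese sentence is reachable only through the French one and so counts as indirect. Because any sentence can link to any number of others, no language acts as a fixed pivot.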

License

The entire Tatoeba database is published under a Creative Commons Attribution 2.0 license,[7] freeing it for academic and other use. Users can also contribute sentences under Creative Commons Zero, though translations of those sentences cannot currently share the same license.[8]

Audio recordings of the sentences use the speaker's choice of license, such as CC BY 4.0, BY-SA, BY-NC, or no public license at all.[9]

Grants

Tatoeba received a grant from Mozilla Drumbeat in December 2010.[10][11]

Some work on the Tatoeba infrastructure was sponsored by Google Summer of Code, 2014 edition.[12]

In May 2018, Tatoeba received a $25,000 grant from the Mozilla Open Source Support (MOSS) program.[13]

In August 2019, it received a second MOSS grant of $15,000.[14]

Usage

Parallel text corpora such as Tatoeba are used for a variety of natural language processing tasks such as machine translation. Tatoeba data has been used for treebanking Japanese[15] and for statistical machine translation,[16] as well as in the WWWJDIC Japanese–English dictionary and in the Bilingual Sentence Pairs and Japanese Reading and Translation Practice on www.ManyThings.org.

Offline edition

Selected content from Tatoeba (83,932 phrases in Esperanto along with all their translations into other languages) appeared in the third edition of the multilingual DVD Esperanto Elektronike ("Electronic Esperanto"), published in 6,000 copies by E@I in July 2011.

Tab-delimited data ready for import into Anki and similar software can be downloaded directly from the Tatoeba website.
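As a rough illustration of how such tab-delimited data might be consumed, the sketch below reads sentence pairs with Python's standard `csv` module. The two-column layout (sentence, then translation), the function name `read_pairs`, and the sample text are assumptions made for the example; the actual column layout of a given download should be checked before use.

```python
import csv
import io


def read_pairs(handle):
    """Yield (sentence, translation) tuples from a tab-delimited text stream.

    Assumes the first two columns hold a sentence and its translation.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield row[0], row[1]


# Hypothetical sample standing in for a downloaded file.
SAMPLE = "Hello.\tBonjour.\nThank you.\tMerci.\n\n"
pairs = list(read_pairs(io.StringIO(SAMPLE)))
```

For a real file, the same function can be given an open file handle, for example `open(path, encoding="utf-8", newline="")`, in place of the `io.StringIO` sample.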

References

  1. "Number of sentences per language - Tatoeba". tatoeba.org. Retrieved 25 October 2021.
  2. "Tanaka Corpus". EDRDG Wiki. Electronic Dictionary Research and Development Group. 3 February 2011. Retrieved 20 March 2011.
  3. Breen, Jim (2 March 2011). "WWWJDIC – Information". WWWJDIC. Monash University. Retrieved 20 March 2011.
  4. "Trang's dictionary project". sourceforge.net.
  5. "Quick Start Guide".
  6. Ho, Trang (23 February 2010). "How to be a good contributor in Tatoeba". Tatoeba Project Blog. Retrieved 20 March 2011.
  7. "Terms of use". Tatoeba.org. Retrieved 20 March 2011.
  8. "How to contribute under CC0". en.wiki.tatoeba.org. Retrieved 25 October 2021.
  9. "All public lists containing 'audio' (140) - Tatoeba". tatoeba.org. Retrieved 25 October 2021.
  10. Ho, Trang (17 January 2011). "Grant from Mozilla Drumbeat". Tatoeba Project Blog. Retrieved 20 March 2011.
  11. Moltke, Henrik (30 December 2010). "Best Drumbeat Projects: Tatoeba – a free and open database of sentences". Yoyodyne.cc. Archived from the original on 2 January 2011. Retrieved 20 March 2011. ...the Mozilla Foundation wants to encourage and help the Tatoeba project by giving it a USD 2.5K Mozilla Drumbeat Grant.
  12. "Google Summer of Code 2014 Organization Association Tatoeba".
  13. "MOSS award for Tatoeba".
  14. "A second MOSS award".
  15. Francis Bond, 栗林 孝行 [Takayuki Kuribayashi], 橋本 力 [Hashimoto Chikara] (2008) HPSGに基づくフリーな日本語ツリーバンクの構築 [A free Japanese Treebank based on HPSG]. In 14th Annual Meeting of The Association for Natural Language Processing, Tokyo.
  16. Eric Nichols, Francis Bond, Darren Scott Appling and Yuji Matsumoto (2010) Paraphrasing Training Data for Statistical Machine Translation. Journal of Natural Language Processing, 17(3), pages 101–122.
