Speech translation


Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language. It differs from phrase translation, in which the system translates only a fixed, finite set of phrases that have been manually entered into it. Speech translation technology enables speakers of different languages to communicate, which makes it valuable for science, cross-cultural exchange and global business.

How it works

A speech translation system typically integrates three software technologies: automatic speech recognition (ASR), machine translation (MT) and speech synthesis (text-to-speech, TTS).
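The three-stage cascade can be sketched as a small pipeline. This is only an illustrative sketch: the `SpeechTranslator` class, the stage signatures and the toy stand-in functions are assumptions made for this example, not the interface of any real ASR, MT or TTS engine.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions for this sketch).
Recognizer = Callable[[bytes], str]    # audio in language A -> text in language A
Translator = Callable[[str], str]      # text in language A  -> text in language B
Synthesizer = Callable[[str], bytes]   # text in language B  -> audio in language B


@dataclass
class SpeechTranslator:
    """Chains ASR, MT and TTS; each stage only sees the previous stage's output."""
    recognize: Recognizer
    translate: Translator
    synthesize: Synthesizer

    def run(self, audio_a: bytes) -> bytes:
        text_a = self.recognize(audio_a)   # automatic speech recognition
        text_b = self.translate(text_a)    # machine translation
        return self.synthesize(text_b)     # speech synthesis


# Toy stand-ins so the sketch runs without any speech libraries.
def toy_asr(audio: bytes) -> str:
    # Pretend the "audio" bytes already contain the transcript.
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    # A one-entry lookup table; a real MT module translates free text.
    table = {"hello": "hola"}
    return table.get(text.lower(), text)


def toy_tts(text: str) -> bytes:
    # Pretend the encoded text is the synthesized waveform.
    return text.encode("utf-8")


pipeline = SpeechTranslator(toy_asr, toy_mt, toy_tts)
output_audio = pipeline.run(b"Hello")
```

Because the stages are chained, an error made by the recognizer is passed on to the translator unchanged, which is one reason the correction of recognition errors matters in such systems.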

The speaker of language A speaks into a microphone, and the speech recognition module recognizes the utterance by comparing the input with a phonological model built from a large corpus of speech data from multiple speakers. The input is then converted into a string of words using a dictionary and grammar of language A, which are in turn based on a massive corpus of text in language A.

The machine translation module then translates this string. Early systems replaced every word with a corresponding word in language B. Current systems do not translate word for word; instead, they take the entire context of the input into account to generate an appropriate translation. The translated text is sent to the speech synthesis module, which estimates the pronunciation and intonation matching the string of words, based on a corpus of speech data in language B. Waveforms matching the text are selected from this database, and the speech synthesis module concatenates and outputs them.[1]
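To illustrate why word-for-word replacement was abandoned, the following minimal sketch substitutes each word independently. The lexicon and the example sentence are made up for this illustration and are not taken from any historical system.

```python
def word_for_word(sentence: str, lexicon: dict[str, str]) -> str:
    """Replace each word independently; unknown words pass through unchanged."""
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())


# Tiny hypothetical English-to-Spanish lexicon with one target word per source word.
LEXICON = {
    "the": "la",
    "white": "blanca",
    "house": "casa",
    "is": "es",
    "big": "grande",
}

literal = word_for_word("The white house is big", LEXICON)
print(literal)  # la blanca casa es grande
```

The output keeps the English word order, whereas idiomatic Spanish places the adjective after the noun ("la casa blanca es grande"). Getting this right requires looking at more than one word at a time, which is what context-aware translation does.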

History

In 1983, NEC Corporation demonstrated speech translation as a concept exhibit at the ITU Telecom World (Telecom '83).[2]

Robert Palmquist is generally credited as the first individual to develop and deploy a commercial speech translation system capable of translating continuous free speech; he released an English-Spanish large-vocabulary system in 1997. The effort was funded in part by the Office of Naval Research.[3][4][5][6]

In 1999, the C-Star-2 consortium demonstrated speech-to-speech translation among five languages: English, Japanese, Italian, Korean, and German.[7][8]

Features

Apart from the problems of text translation, speech-to-speech translation must also deal with problems specific to speech: the incoherence of spoken language, its looser grammatical constraints, unclear word boundaries, the correction of speech recognition errors, and multiple optional inputs. On the other hand, speech-to-speech translation also has advantages over text translation, including the less complex structure and smaller vocabulary of spoken language.[citation needed]

Research and development

Research and development has gradually progressed from relatively simple to more advanced translation. International evaluation workshops were established to support the development of speech translation technology. They allow research institutes to cooperate and compete with each other at the same time. The workshops work as a kind of contest: the organizers provide a common dataset, and the participating research institutes build systems that are evaluated on it. Because every system is measured on the same data, results are directly comparable, which promotes efficient research.
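The contest format can be sketched as scoring several systems against one shared reference set. The source does not say which metric the workshops use; word error rate (word-level edit distance divided by reference length) is used here only as a simple stand-in, and real campaigns typically rely on translation-specific metrics. The system names and sentences are invented for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]                            # deleting i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def rank_systems(references: list[str], outputs: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Average the per-sentence score of each system and sort best (lowest) first."""
    scores = {
        name: sum(word_error_rate(r, h) for r, h in zip(references, hyps)) / len(references)
        for name, hyps in outputs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Shared test set provided by the (hypothetical) organizers.
references = ["where is the station", "i would like a coffee"]
# Outputs submitted by two hypothetical participants.
outputs = {
    "system_a": ["where is the station", "i would like coffee"],
    "system_b": ["where is station", "i like a coffee"],
}

ranking = rank_systems(references, outputs)
```

Here `system_a` ranks first with an average error of 0.1, ahead of `system_b` with 0.225.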

The International Workshop on Spoken Language Translation (IWSLT), organized by C-STAR, an international consortium for research on speech translation, has been held since 2004. "Every year, the number of participating institutes increases, and it has become a key event for speech translation research."[1]

Standards

As more countries begin to research and develop speech translation, it will be necessary to standardize interfaces and data formats to ensure that the systems are mutually compatible. International joint research is being fostered by speech translation consortia (e.g. the C-STAR international consortium for joint research of speech translation and A-STAR for the Asia-Pacific region). They were founded as "international joint-research organization[s] to design formats of bilingual corpora that are essential to advance the research and development of this technology ... and to standardize interfaces and data formats to connect speech translation module internationally".[1]

Applications

Speech translation systems are used throughout the world, for example in medical facilities, schools, police departments, hotels, retail stores, and factories. They are applicable anywhere spoken language is used to communicate. One popular application is Jibbigo, which works offline.

Challenges and future prospects

Speech translation technology is currently available as products that instantly translate free-form, multilingual conversations in continuous speech. One challenge is speaker-dependent variation in speaking style and pronunciation, which has to be dealt with in order to provide high-quality translation for all users. Moreover, in real-world use, speech recognition systems must cope with external factors such as acoustic noise or speech by other speakers.

Because the user of a speech translation system does not understand the target language, a method "must be provided for the user to check whether the translation is correct, by such means as translating it again back into the user's language".[1] To achieve the goal of erasing the language barrier worldwide, multiple languages have to be supported. This requires speech corpora, bilingual corpora and text corpora for each of the estimated 6,000 languages spoken in the world today.
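The back-translation check quoted above can be sketched as a round trip through two translators. Everything here is a toy assumption: the lookup-table translators, the `round_trip` and `token_overlap` helpers, and the example sentences are invented for illustration, and the overlap score is just one simple way a system might flag a suspicious result.

```python
from typing import Callable

Translator = Callable[[str], str]  # hypothetical signature: text in, text out


def make_translator(table: dict[str, str]) -> Translator:
    """Build a toy translator from a lookup table; unknown text passes through."""
    return lambda text: table.get(text, text)


def round_trip(text: str, forward: Translator, backward: Translator) -> tuple[str, str]:
    """Translate into the target language, then back into the user's language."""
    translated = forward(text)
    return translated, backward(translated)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the word sets of two sentences (1.0 means identical sets)."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


to_spanish = make_translator({"i need water": "necesito agua"})
to_english = make_translator({"necesito agua": "i need water",
                              "necesito ayuda": "i need help"})

# A correct forward translation survives the round trip.
translated, back = round_trip("i need water", to_spanish, to_english)

# A faulty forward translator ("ayuda" means "help") shows up in the back-translation.
faulty = lambda text: "necesito ayuda"
_, back_faulty = round_trip("i need water", faulty, to_english)
score = token_overlap("i need water", back_faulty)
```

The user reads `back` (or `back_faulty`) in their own language and can spot the mismatch. A matching round trip is only a hint, not proof, since the two translators can make compensating errors.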

As the collection of corpora is extremely expensive, collecting data from the Web would be an alternative to conventional methods. "Secondary use of news or other media published in multiple languages would be an effective way to improve performance of speech translation." However, "current copyright law does not take secondary uses such as these types of corpora into account" and thus "it will be necessary to revise it so that it is more flexible."[1]

References

  1. Nakamura, Satoshi. "Overcoming the Language Barrier with Speech Translation Technology". Science & Technology Trends - Quarterly Review, No. 31, April 2009.
  2. NEC/021219-1. "NEC Global - Press Release". www.nec.co.jp. Retrieved 2017-09-23.
  3. "In my view as a trained linguist, this is a quantum leap in technology. This product is the crown jewel of machine translation." Memorandum for the Record submitted by LTC Carrol, USMC.
  4. Wired Magazine. https://www.wired.com/gadgets/miscellaneous/news/2003/03/58150
  5. National Public Radio. http://news.minnesota.publicradio.org/programs/allthings/listings/atc20030407.shtml
  6. Super Computing Online. http://search.aol.com/aol/search?q=Supercomputing+Online+SpeechGear&s_it=spelling&v_t=tb50-ff-babylon-chromesbox-en-us
  7. National Public Radio. https://www.npr.org/templates/story/story.php?storyId=1054389
  8. Takezawa, Morimoto, Sagisaka, Campbell, Iida, Sugaya, Yokoo, Yamamoto. "A Japanese-to-English Speech Translation System: ATR-MATRIX". Proceedings of the International Conference on Spoken Language Processing, 1998.