Chinese speech synthesis

From Wikipedia, the free encyclopedia

Chinese speech synthesis is the application of speech synthesis to the Chinese language (usually Standard Chinese). It poses additional difficulties because Chinese characters frequently have different pronunciations in different contexts, because complex prosody is essential to conveying the meaning of words, and because native speakers sometimes disagree about the correct pronunciation of certain phonemes.

Approaches taken


Anhui USTC iFlyTek Co., Ltd (iFlyTek) published a W3C paper in which they adapted Speech Synthesis Markup Language (SSML) to produce Chinese Speech Synthesis Markup Language (CSSML), which can include additional markup to clarify the pronunciation of characters and to add some prosody information.[1] Their synthesiser takes a "corpus-based" approach, which means it can sound very natural in most cases but can err on unusual phrases that cannot be matched against the corpus. iFlyTek does not disclose the amount of data involved, but it can be gauged from the commercial products that license iFlyTek's technology; for example, Bider's SpeechPlus is a 1.3 Gigabyte download, 1.2 Gigabytes of which is highly compressed data for a single Chinese voice. iFlyTek's synthesiser can also synthesise mixed Chinese and English text with the same voice (e.g. Chinese sentences containing some English words); they claim their English synthesis to be "average".

The iFlyTek corpus appears to be heavily dependent on Chinese characters, and it is not possible to synthesize from pinyin alone. It is sometimes possible by means of CSSML to add pinyin to the characters to disambiguate between multiple possible pronunciations, but this does not always work.
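The idea of attaching a pinyin reading to an ambiguous character can be illustrated with a short sketch. The exact CSSML syntax is not given here, so the example below falls back on the standard SSML phoneme element instead; the alphabet name "x-pinyin" and the helper name annotate are assumptions made for illustration, and whether a particular synthesiser honours such markup would need to be checked. The character 行 is read háng in 银行 ("bank") but xíng in 行人 ("pedestrian"), so it is a natural candidate for annotation.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def annotate(segments):
    """Build an SSML string from (text, pinyin_or_None) pairs.

    Segments with a pinyin reading are wrapped in a <phoneme> element.
    The alphabet value "x-pinyin" is a hypothetical vendor alphabet.
    """
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": "zh-CN"},
    )
    last = None  # most recently added <phoneme> element, if any
    for text, pinyin in segments:
        if pinyin is None:
            # Plain text goes before the first child or after the last one.
            if last is None:
                speak.text = (speak.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(
                speak, "phoneme", {"alphabet": "x-pinyin", "ph": pinyin}
            )
            last.text = text
    return ET.tostring(speak, encoding="unicode")

# "He goes to the bank": force the reading hang2 for the final character.
print(annotate([("他去银", None), ("行", "hang2"), ("。", None)]))
```

Only the ambiguous character is annotated; the rest of the sentence is left for the synthesiser's own text analysis.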

A corpus-based approach is also taken by Tsinghua University's SinoSonic, with the Harbin voice data taking 800 Megabytes. As of 2007, and still as of 2011, the download link for SinoSonic had not been activated, which suggests that it may be vapourware.

Concatenation (KeyTip)

A less complex approach is taken by KeyTip Putonghua Reader, which contains 120 Megabytes of sound recordings (GSM-compressed to 40 Megabytes in the evaluation version), comprising 10,000 multi-syllable dictionary words plus single-syllable recordings in 6 different prosodies (4 tones, neutral tone, and an extra third-tone recording for use at the end of a phrase). These recordings can be concatenated in any desired combination, but the joins sound forced (as is usual for simple concatenation-based speech synthesis) and this can severely affect prosody; the synthesizer is also inflexible in terms of speed and expression. However, because this synthesizer does not rely on a corpus, its performance does not noticeably degrade when it is given more unusual or awkward phrases.
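How such a synthesizer might choose which recordings to play can be sketched as follows. This is not KeyTip's actual algorithm, which is not documented here; it is a minimal illustration that assumes a greedy longest-match over the multi-syllable word recordings, with a fallback to single-syllable recordings. The inventory contents, the unit labels, and the "3f" label for the phrase-final third tone are all made up for the example.

```python
# Hypothetical inventory of multi-syllable word recordings (numbered pinyin).
WORD_RECORDINGS = {("ni3", "hao3"), ("zhong1", "guo2")}

# Made-up label for the extra third-tone recording used at the end of a phrase.
FINAL_THIRD = "3f"

def select_units(syllables, max_word_len=4):
    """Pick recordings for one phrase of numbered-pinyin syllables.

    Tone 5 stands for the neutral tone. Returns a list of unit labels.
    """
    units = []
    i = 0
    n = len(syllables)
    while i < n:
        # Prefer the longest pre-recorded word that starts at position i.
        for length in range(min(max_word_len, n - i), 1, -1):
            word = tuple(syllables[i:i + length])
            if word in WORD_RECORDINGS:
                units.append("word:" + "+".join(word))
                i += length
                break
        else:
            # No word matched: fall back to a single-syllable recording.
            base, tone = syllables[i][:-1], syllables[i][-1]
            if tone == "3" and i == n - 1:
                tone = FINAL_THIRD  # phrase-final third-tone variant
            units.append(f"syl:{base}:{tone}")
            i += 1
    return units

print(select_units(["ni3", "hao3", "ma5"]))   # ['word:ni3+hao3', 'syl:ma:5']
print(select_units(["wo3", "hen3", "hao3"]))  # ['syl:wo:3', 'syl:hen:3', 'syl:hao:3f']
```

Because any syllable sequence can be covered by the single-syllable fallback, this kind of selection never fails on unusual phrases, which matches the robustness noted above; the cost is that nothing smooths the prosody across the joins.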


The lightweight open-source speech project eSpeak, which has its own approach to synthesis, has started experimenting with Chinese synthesis. It was used by Google Translate from May 2010[2] until December 2010.[3]


Ekho is another open-source TTS system, which simply concatenates sampled syllables. It currently supports Cantonese, Mandarin, and Korean. Some of the Mandarin syllables have been pitch-normalised in Praat. A modified version of these is used in Gradint's "synthesis from partials".
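Plain concatenation of sampled syllables amounts to writing the audio clips end to end into one stream. The sketch below is not Ekho's code; it is a self-contained illustration using Python's standard wave module, in which sine tones stand in for recorded syllables and all clips are assumed to share one sample rate and format. The function names are hypothetical.

```python
import io
import math
import struct
import wave

RATE = 16000  # samples per second, assumed to be shared by every clip

def fake_syllable(freq_hz, seconds=0.2):
    """Stand-in for a recorded syllable: a sine tone as 16-bit mono PCM bytes."""
    count = int(RATE * seconds)
    samples = (
        int(12000 * math.sin(2 * math.pi * freq_hz * t / RATE))
        for t in range(count)
    )
    return b"".join(struct.pack("<h", s) for s in samples)

def concatenate(clips):
    """Join PCM clips end to end, with no smoothing, and return WAV file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)   # mono
        out.setsampwidth(2)   # 16-bit samples
        out.setframerate(RATE)
        out.writeframes(b"".join(clips))
    return buf.getvalue()

wav_bytes = concatenate([fake_syllable(220), fake_syllable(330)])
```

Nothing is done at the boundaries between clips, so any mismatch in pitch or amplitude from one clip to the next is carried straight into the output.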

Online Demos and Bell Labs

There is an online interactive demonstration for NeoSpeech speech synthesis,[4] which accepts Chinese characters, and also pinyin if it is enclosed in NeoSpeech's proprietary "VTML" markup.[5]
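Text for such a demo could be prepared with a small helper that wraps numbered pinyin in the vtml_phoneme element shown in note 5. The helper name and the validation rule (lowercase letters followed by a tone digit from 1 to 5) are assumptions made for this sketch; the set of pinyin spellings that the synthesizer actually accepts is not specified here.

```python
import re

# Assumed input format: one or more syllables, each lowercase letters plus a tone digit.
_NUMBERED_PINYIN = re.compile(r"(?:[a-z]+[1-5])+")

def vtml_pinyin(pinyin):
    """Wrap numbered pinyin such as "ni3hao3" in a vtml_phoneme element."""
    if not _NUMBERED_PINYIN.fullmatch(pinyin):
        raise ValueError(f"not numbered pinyin: {pinyin!r}")
    return f'<vtml_phoneme alphabet="x-pinyin" ph="{pinyin}"></vtml_phoneme>'

print(vtml_pinyin("ni3hao3"))
# <vtml_phoneme alphabet="x-pinyin" ph="ni3hao3"></vtml_phoneme>
```

Rejecting anything outside the assumed format also keeps quotes and angle brackets out of the attribute value, so no escaping is needed.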

iFlyTek has two demos available online.[6][7]

Bell Labs have an online Mandarin text-to-speech demo[8] dated 1997, but it is now non-functional (the server to which the query is submitted does not exist in the DNS) and the contact email is no longer valid. However, their approach was described in the monograph "Multilingual Text-to-Speech Synthesis: The Bell Labs Approach" (Springer, October 31, 1997, ISBN 978-0-7923-8027-6), and the former employee who was responsible for the project, Chilin Shih (who now works at the University of Illinois), has some notes about her methods on her website.[9]

Mac OS

Mac OS had Chinese speech synthesizers available up to version 9. They were removed in 10.0 and reinstated in 10.7 (Lion).[10]

See also


References

  1. ^
  2. ^
  3. ^
  4. ^
  5. ^ for example <vtml_phoneme alphabet="x-pinyin" ph="ni3hao3"></vtml_phoneme>; see pages 7 and 25-27 of
  6. ^ Anhui USTC iFlyTek Co., Ltd Demo
  7. ^ Anhui USTC iFlyTek Co., Ltd Beta 1.0
  8. ^ Mandarin TTS
  9. ^ Home Page: Chilin Shih
  10. ^ Voice packs are automatically downloaded as needed when selected in System Preferences, Speech Settings, Text to Speech, System Voice, Customize. Three Chinese female voices are available in the system. One each for Mainland China, Hong Kong and Taiwan.

External links