Chinese speech synthesis
Chinese speech synthesis is the application of speech synthesis to the Chinese language (usually Standard Chinese). It poses additional difficulties because Chinese characters frequently have different pronunciations in different contexts, because the complex prosody is essential to conveying the meaning of words, and because native speakers sometimes disagree on the correct pronunciation of certain phonemes.
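The context-dependence of character readings can be illustrated with a small sketch. It assumes a tiny hand-written lexicon and a greedy longest-match lookup; the dictionaries and the function name to_pinyin are hypothetical and are not taken from any synthesiser described in this article. The character 行 is read hang2 in 银行 ("bank") but xing2 in 行走 ("to walk"), so a per-character table alone is not enough.

```python
# Hypothetical mini-lexicon: multi-character words carry their own readings,
# which override the per-character defaults below.
WORD_READINGS = {
    "银行": ["yin2", "hang2"],   # "bank": 行 is read hang2 here
    "行走": ["xing2", "zou3"],   # "to walk": 行 is read xing2 here
    "音乐": ["yin1", "yue4"],    # "music": 乐 is read yue4 here
    "快乐": ["kuai4", "le4"],    # "happy": 乐 is read le4 here
}

# Hypothetical fallback readings for characters not covered by a word entry.
CHAR_DEFAULTS = {"行": "xing2", "乐": "le4", "我": "wo3", "去": "qu4", "很": "hen3"}


def to_pinyin(text, max_word_len=4):
    """Convert text to tone-numbered pinyin using greedy longest-match lookup."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, down to two characters.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            word = text[i:i + length]
            if word in WORD_READINGS:
                result.extend(WORD_READINGS[word])
                i += length
                break
        else:
            # No word matched: fall back to the single-character default.
            result.append(CHAR_DEFAULTS.get(text[i], "?"))
            i += 1
    return result


print(to_pinyin("我去银行"))
print(to_pinyin("很快乐"))
```

A real system needs a far larger lexicon and usually some handling of word-segmentation ambiguity; the sketch only shows why the surrounding characters have to be consulted at all.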
Anhui USTC iFlyTek Co., Ltd (iFlyTek) published a W3C paper in which they adapted Speech Synthesis Markup Language to produce Chinese Speech Synthesis Markup Language (CSSML), which can include additional markup to clarify the pronunciation of characters and to add some prosody information. Their synthesiser takes a "corpus-based" approach, which means it can sound very natural in most cases but can err on unusual phrases that cannot be matched with the corpus. iFlyTek does not disclose the amount of data involved, but it can be gauged from the commercial products that license iFlyTek's technology; for example, Bider's SpeechPlus is a 1.3 gigabyte download, 1.2 gigabytes of which is the highly compressed data for a single Chinese voice. iFlyTek's synthesiser can also synthesise mixed Chinese and English text with the same voice (e.g. Chinese sentences containing some English words); they claim their English synthesis to be "average".
The iFlyTek corpus appears to be heavily dependent on Chinese characters, and it is not possible to synthesize from pinyin alone. It is sometimes possible by means of CSSML to add pinyin to the characters to disambiguate between multiple possible pronunciations, but this does not always work.
A corpus-based approach is also taken by Tsinghua University's SinoSonic, with the Harbin voice data taking 800 megabytes. The download link for SinoSonic had not been activated as of 2007, and this was still the case in 2011.
A less complex approach is taken by cjkware.com's KeyTip Putonghua Reader, which contains 120 megabytes of sound recordings (GSM-compressed to 40 megabytes in the evaluation version), comprising 10,000 multi-syllable dictionary words plus single-syllable recordings in six different prosodies (four tones, neutral tone, and an extra third-tone recording for use at the end of a phrase). These recordings can be concatenated in any desired combination, but the joins sound forced (as is usual for simple concatenation-based speech synthesis) and this can severely affect prosody; the synthesizer is also inflexible in terms of speed and expression. However, because this synthesizer does not rely on a corpus, there is no noticeable degradation in performance when it is given more unusual or awkward phrases.
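The selection step of such a concatenative scheme might look roughly like the sketch below. It is not KeyTip's actual implementation: the file-naming convention, the use of tone digit 5 for the neutral tone, and the function name select_units are all assumptions made for illustration. Whole dictionary words are preferred where available; otherwise one recording per syllable is chosen, with the extra third-tone variant used only at the end of a phrase.

```python
def select_units(syllables, word_files):
    """Pick one recording name per unit for a single phrase.

    syllables:  tone-numbered pinyin, e.g. ["ni3", "hao3", "ma5"]
                (5 marks the neutral tone in this sketch).
    word_files: dict mapping a tuple of syllables to the name of a
                multi-syllable word recording (hypothetical naming).
    """
    units = []
    i = 0
    n = len(syllables)
    while i < n:
        # Prefer the longest multi-syllable dictionary word starting here.
        for length in range(n - i, 1, -1):
            key = tuple(syllables[i:i + length])
            if key in word_files:
                units.append(word_files[key])
                i += length
                break
        else:
            syllable = syllables[i]
            is_phrase_final = (i == n - 1)
            if syllable.endswith("3") and is_phrase_final:
                # Sixth prosody: the extra third-tone recording for phrase ends.
                units.append(f"{syllable}_final.wav")
            else:
                # One of the other five prosodies (tones 1-4 or neutral).
                units.append(f"{syllable}.wav")
            i += 1
    return units


words = {("ni3", "hao3"): "word_ni3hao3.wav"}
print(select_units(["ni3", "hao3", "ma5"], words))
print(select_units(["wo3", "hen3", "hao3"], {}))
```

Because every unit is chosen independently of its neighbours, nothing in this step smooths the joins, which is consistent with the forced-sounding transitions described above.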
The lightweight open-source speech project eSpeak, which has its own approach to synthesis, has started experimenting with Chinese synthesis. eSpeak was used by Google Translate from May 2010 until December 2010.
Ekho is another open-source TTS system, which simply concatenates sampled syllables. It currently supports Cantonese, Mandarin, and Korean. Some of the Mandarin syllables have been pitch-normalised in Praat. A modified version of these is used in Gradint's "synthesis from partials".
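Plain concatenation of sampled syllables can be sketched with Python's standard wave module. This is only an illustration of the general idea, not Ekho's code: the function name concatenate_wavs is hypothetical, and the sketch assumes every syllable is an uncompressed WAV clip with the same channel count, sample width and sample rate.

```python
import io
import wave


def concatenate_wavs(wav_blobs):
    """Join WAV clips (given as bytes) end to end and return one WAV as bytes."""
    if not wav_blobs:
        raise ValueError("at least one clip is required")
    out_buffer = io.BytesIO()
    expected_format = None
    with wave.open(out_buffer, "wb") as out:
        for blob in wav_blobs:
            with wave.open(io.BytesIO(blob), "rb") as src:
                clip_format = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if expected_format is None:
                    # The first clip decides the output format.
                    expected_format = clip_format
                    out.setnchannels(clip_format[0])
                    out.setsampwidth(clip_format[1])
                    out.setframerate(clip_format[2])
                elif clip_format != expected_format:
                    raise ValueError("all clips must share one audio format")
                # Append the raw samples with no cross-fade or pitch adjustment.
                out.writeframes(src.readframes(src.getnframes()))
    return out_buffer.getvalue()
```

The samples are appended unchanged, so any mismatch in pitch or loudness at a boundary remains audible; normalising the recordings beforehand, as described above for some of the Mandarin syllables, is one way to reduce that.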
Online Demos and Bell Labs
Bell Labs have an online Mandarin text-to-speech demo dated 1997, but it is now non-functional (the server to which queries are submitted no longer exists in the DNS) and the contact email is no longer valid. However, their approach was described in the monograph "Multilingual Text-to-Speech Synthesis: The Bell Labs Approach" (Springer, October 31, 1997, ISBN 978-0-7923-8027-6), and the former employee who was responsible for the project, Chilin Shih (who now works at the University of Illinois), has some notes about her methods on her website.
- for example <vtml_phoneme alphabet="x-pinyin" ph="ni3hao3"></vtml_phoneme>; see pages 7 and 25-27 of https://ondemand.neospeech.com/vt_eng-Engine-VTML-v3.9.0-3.pdf
- Anhui USTC iFlyTek Co., Ltd Demo
- Anhui USTC iFlyTek Co., Ltd Beta 1.0
- Mandarin TTS
- Home Page: Chilin Shih
- Voice packs are automatically downloaded as needed when selected in System Preferences, Speech Settings, Text to Speech, System Voice, Customize. Three Chinese female voices are available in the system, one each for Mainland China, Hong Kong and Taiwan.
- Anhui USTC iFlyTek Co., Ltd homepage