
Original author(s): Jonathan Duddington
Developer(s): Reece Dunn
Initial release: February 2006
Stable release: 1.50 / 2 December 2019
Written in: C
Operating system: Linux
Type: Speech synthesizer

eSpeakNG is a compact, open-source, software speech synthesizer for Linux, Windows, and other platforms. It uses a formant synthesis method, providing many languages in a small size. Much of the programming for eSpeakNG's language support is done using rule files with feedback from native speakers.

Because of its small size and many languages, it is included as the default speech synthesizer in the NVDA[1] open source screen reader for Windows, as well as Android,[2] Ubuntu[3] and other Linux distributions. Its predecessor eSpeak was recommended by Microsoft in 2016[4] and was used by Google Translate for 27 languages in 2010;[5] 17 of these were subsequently replaced by commercial voices.[6]

The quality of the language voices varies greatly. In eSpeakNG's predecessor eSpeak, the initial versions of some languages were based on information found on Wikipedia.[7] Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.


In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers supporting British English.[8] On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007.[9] Development on Speak continued until version 1.14, when it was renamed to eSpeak.

Development of eSpeak continued from 1.16 (there was no 1.15 release)[9] with the addition of an eSpeakEdit program for editing and building the eSpeak voice data. These were only available as separate source and binary downloads up to eSpeak 1.24. The 1.24.02 version of eSpeak was the first to be version controlled using Subversion,[10] with separate source and binary downloads made available on SourceForge.[9] From eSpeak 1.27, eSpeak was updated to use the GPLv3 license.[11] The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for macOS.[12] The last development release of eSpeak was 1.48.15 on 16 April 2015.[13]

eSpeak uses the Usenet scheme to represent phonemes with ASCII characters.[14]

eSpeak NG

On 25 June 2010,[15] Reece Dunn started a fork of eSpeak on GitHub using the 1.43.46 release. This started off as an effort to make it easier to build eSpeak on Linux and other POSIX platforms.

On 4 October 2015 (6 months after the 1.48.15 release of eSpeak), this fork started diverging more significantly from the original eSpeak.[16][17]

On 8 December 2015, there were discussions on the eSpeak mailing list about the lack of activity from Jonathan Duddington in the 8 months since the last eSpeak development release. This evolved into discussions about continuing development of eSpeak in Jonathan's absence.[18][19] The result was the creation of the espeak-ng (Next Generation) fork, using the GitHub version of eSpeak as the basis for future development.

On 11 December 2015, the espeak-ng fork was started.[20] The first release of espeak-ng was 1.49.0 on 10 September 2016,[21] containing significant code cleanup, bug fixes, and language updates.


eSpeakNG can be used as a command-line program, or as a shared library.

It supports Speech Synthesis Markup Language (SSML).
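As a rough illustration of SSML input, the sketch below builds a small SSML document and the argument list for a command-line call. It assumes the `-m` option (interpret SSML markup) of the `espeak-ng` program; the helper name `ssml_command` and the chosen `prosody` and `break` values are hypothetical and only meant as an example.

```python
from xml.sax.saxutils import escape


def ssml_command(text, rate="slow", pause_ms=500, voice="en"):
    """Return an espeak-ng argument list that speaks `text` as SSML.

    Assumption: `-m` tells espeak-ng to interpret SSML markup.
    The helper itself is illustrative and not part of eSpeakNG.
    """
    ssml = (
        "<speak>"
        f'<prosody rate="{rate}">{escape(text)}</prosody>'
        f'<break time="{pause_ms}ms"/>'
        "</speak>"
    )
    return ["espeak-ng", "-v", voice, "-m", ssml]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even where espeak-ng is not installed.
    print(ssml_command("Fish & chips"))
```

Escaping the text keeps characters such as `&` from breaking the markup; the list can be passed to `subprocess.run` on a system where `espeak-ng` is available.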

Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant which changes the formants and the pitch range to give a female sound.
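The naming scheme described above (a language code, optionally followed by "+" and a variant name) can be sketched with two small helpers. The function names `split_voice` and `join_voice` are hypothetical; only the "af" and "af+f2" examples come from the text.

```python
from typing import Optional, Tuple


def split_voice(voice: str) -> Tuple[str, Optional[str]]:
    """Split a voice name such as 'af+f2' into (language, variant)."""
    language, plus, variant = voice.partition("+")
    return language, (variant if plus else None)


def join_voice(language: str, variant: Optional[str] = None) -> str:
    """Build a voice name from a language code and an optional variant."""
    return f"{language}+{variant}" if variant else language


if __name__ == "__main__":
    print(split_voice("af+f2"))    # language and variant parts
    print(join_voice("af", "f2"))  # name as passed to the -v option
```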

eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Usenet system.

Phonetic representations can be included in text input by enclosing them in double square brackets. For example, espeak-ng -v en "Hello [[w3:ld]]" will say "Hello world" in English.
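A minimal sketch of driving that command from a script follows. The helpers `with_phonemes`, `build_command` and `speak` are hypothetical names introduced here; the sketch only runs the program if an `espeak-ng` executable is found on the path.

```python
import shutil
import subprocess


def with_phonemes(text: str, phonemes: str) -> str:
    """Append a phonetic representation in double square brackets."""
    return f"{text} [[{phonemes}]]"


def build_command(text: str, voice: str = "en") -> list:
    """Argument list equivalent to: espeak-ng -v <voice> "<text>"."""
    return ["espeak-ng", "-v", voice, text]


def speak(text: str, voice: str = "en") -> bool:
    """Speak the text if espeak-ng is installed; return whether it ran."""
    if shutil.which("espeak-ng") is None:
        return False
    subprocess.run(build_command(text, voice), check=True)
    return True


if __name__ == "__main__":
    speak(with_phonemes("Hello", "w3:ld"))
```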

Synthesis method[edit]


eSpeakNG can be used as a text-to-speech translator in different ways, depending on which text-to-speech translation step the user wants to use.

Step 1: text-to-phoneme translation

There are many languages (notably English) which don't have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation.

  1. Input text is translated into pronunciation phonemes (e.g. the input text xerox is translated into zi@r0ks for pronunciation).
  2. Pronunciation phonemes are synthesized into sound (e.g. zi@r0ks is voiced in a monotone way).
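To illustrate why spelling-to-phoneme translation needs context-dependent rules, here is a toy rule table that reproduces the xerox example. It is a sketch only: the rule format and the function `toy_text_to_phonemes` are invented for this illustration and are not eSpeakNG's actual rule files or algorithm.

```python
# Toy rules: (grapheme, only_at_word_start, phonemes). Earlier rules win.
# These rules are made up for the single word "xerox".
TOY_RULES = [
    ("x", True, "z"),    # word-initial "x" sounds like /z/
    ("x", False, "ks"),  # "x" elsewhere sounds like /ks/
    ("e", False, "i@"),
    ("r", False, "r"),
    ("o", False, "0"),
]


def toy_text_to_phonemes(word: str) -> str:
    """Translate a word by applying the first matching rule at each position."""
    out = []
    i = 0
    while i < len(word):
        for grapheme, at_start, phonemes in TOY_RULES:
            if at_start and i != 0:
                continue  # rule only applies at the start of the word
            if word.startswith(grapheme, i):
                out.append(phonemes)
                i += len(grapheme)
                break
        else:
            raise ValueError(f"no toy rule for {word[i]!r}")
    return "".join(out)


if __name__ == "__main__":
    print(toy_text_to_phonemes("xerox"))  # zi@r0ks
```

The same letter "x" maps to two different phoneme strings depending on its position, which is the kind of distinction a one-to-one letter table cannot express.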

To add intonation to speech, prosody data are necessary (e.g. syllable stress, falling or rising pitch of the fundamental frequency, pauses), along with other information that makes it possible to synthesize more human-sounding, non-monotonous speech. For example, in the eSpeakNG format a stressed syllable is marked with an apostrophe: z'i@r0ks, which produces more natural speech with intonation.

For comparison, two samples with and without prosody data:

  1. [[DIs Iz m0noUntoUn spi:tS]] is spoken in a monotone way
  2. [[DIs Iz 'Int@n,eItI2d sp'i:tS]] is spoken with intonation
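The difference between the two kinds of sample can be shown by removing the stress marks from a phoneme string. This sketch assumes that the apostrophe marks primary stress (as stated above) and that the comma in the second sample marks secondary stress; the helper names are hypothetical.

```python
# Assumption: "'" is primary stress and "," is secondary stress.
STRESS_MARKS = "',"


def strip_stress(phonemes: str) -> str:
    """Remove stress marks, leaving a string that would be spoken flat."""
    return "".join(ch for ch in phonemes if ch not in STRESS_MARKS)


def count_stress(phonemes: str) -> int:
    """Count how many stress marks a phoneme string carries."""
    return sum(phonemes.count(mark) for mark in STRESS_MARKS)


if __name__ == "__main__":
    sample = "DIs Iz 'Int@n,eItI2d sp'i:tS"
    print(count_stress(sample))
    print(strip_stress(sample))
```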

If eSpeakNG is used only to generate prosody data, that data can be used as input for MBROLA diphone voices.

Step 2: sound synthesis from prosody data

eSpeakNG provides two different types of formant speech synthesis: its own eSpeakNG synthesizer and a Klatt synthesizer:[22]

  1. The eSpeakNG synthesizer creates voiced speech sounds such as vowels and sonorant consonants by additive synthesis, adding together sine waves to make the total sound. Unvoiced consonants, e.g. /s/, are made by playing recorded sounds,[23] because they are rich in harmonics, which makes additive synthesis less effective. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded sample of unvoiced sound.
  2. The Klatt synthesizer mostly uses the same formant data as the eSpeakNG synthesizer, but it also produces sounds by subtractive synthesis: it starts with generated noise, which is rich in harmonics, and then applies digital filters and enveloping to obtain the frequency spectrum and sound envelope needed for a particular consonant (s, t, k) or sonorant (l, m, n) sound.
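The two approaches can be contrasted with a short sketch. It is not eSpeakNG's or Klatt's implementation: the sample rate, the bell-shaped formant envelope, the formant values and the single two-pole resonator are all assumptions chosen to keep the example small.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz, illustrative value


def formant_gain(freq, formants):
    """Envelope gain at freq: one bell-shaped peak per (centre, width, amp)."""
    return sum(
        amp * math.exp(-0.5 * ((freq - centre) / width) ** 2)
        for centre, width, amp in formants
    )


def normalise(samples):
    """Scale samples so that the largest absolute value is 1."""
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


def additive_vowel(f0, formants, duration=0.1):
    """Additive synthesis: sum sine-wave harmonics of f0, weighted by the envelope."""
    harmonics = range(1, int((SAMPLE_RATE / 2) // f0) + 1)
    weights = [formant_gain(k * f0, formants) for k in harmonics]
    samples = []
    for n in range(round(SAMPLE_RATE * duration)):
        t = n / SAMPLE_RATE
        samples.append(sum(w * math.sin(2 * math.pi * k * f0 * t)
                           for k, w in zip(harmonics, weights)))
    return normalise(samples)


def subtractive_noise(centre, bandwidth, duration=0.1, seed=0):
    """Subtractive synthesis: white noise shaped by a two-pole resonant filter."""
    rng = random.Random(seed)
    r = math.exp(-math.pi * bandwidth / SAMPLE_RATE)  # pole radius, below 1
    a1 = 2 * r * math.cos(2 * math.pi * centre / SAMPLE_RATE)
    a2 = -r * r
    y1 = y2 = 0.0
    samples = []
    for _ in range(round(SAMPLE_RATE * duration)):
        x = rng.uniform(-1.0, 1.0)     # harmonically rich noise source
        y = x + a1 * y1 + a2 * y2      # resonator emphasises `centre`
        samples.append(y)
        y2, y1 = y1, y
    return normalise(samples)


if __name__ == "__main__":
    # Made-up formant values for a vowel-like sound.
    vowel = additive_vowel(120, [(700, 100, 1.0), (1200, 120, 0.5)])
    hiss = subtractive_noise(4000, 400)
    print(len(vowel), len(hiss))
```

In the additive case the spectrum is built up harmonic by harmonic; in the subtractive case a full-spectrum source is filtered down to the desired shape.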

For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours. It passes these to the MBROLA program in the PHO file format and captures the audio that MBROLA outputs. That audio is then handled by eSpeakNG.
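As a sketch of what such PHO data looks like, the helper below formats phonemes, durations and pitch points as text. It assumes the usual PHO layout of one phoneme per line: the phoneme name, its duration in milliseconds, then optional pairs of position (percent of the duration) and pitch (Hz). The function `to_pho` and the sample values are hypothetical, and real phoneme names depend on the MBROLA voice in use.

```python
from typing import List, Sequence, Tuple

# (phoneme, duration in ms, [(position in % of duration, pitch in Hz), ...])
Segment = Tuple[str, int, Sequence[Tuple[int, int]]]


def to_pho(segments: List[Segment]) -> str:
    """Format segments as PHO text, one phoneme per line."""
    lines = []
    for phoneme, duration_ms, pitch_points in segments:
        fields = [phoneme, str(duration_ms)]
        for position_pct, pitch_hz in pitch_points:
            fields += [str(position_pct), str(pitch_hz)]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "_" is assumed to denote silence; values are made up.
    print(to_pho([
        ("_", 100, []),
        ("a", 120, [(0, 110), (100, 130)]),
        ("_", 100, []),
    ]), end="")
```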


eSpeakNG performs text-to-speech synthesis for the following languages:[24][25]

  1. Abaza
  2. Achinese
  3. Afar
  4. Afrikaans[26]
  5. Albanian[27]
  6. Amharic
  7. Ancient Greek
  8. Arabic [note 1]
  9. Aragonese[28]
  10. Armenian (Eastern Armenian)
  11. Armenian (Western Armenian)
  12. Assamese
  13. Azerbaijani
  14. Bashkir
  15. Basque
  16. Basic English
  17. Belarusian
  18. Bengali
  19. Bhojpuri
  20. Bishnupriya Manipuri
  21. Bosnian
  22. Bulgarian[28]
  23. Breton
  24. Burmese
  25. Cantonese[28]
  26. Catalan[28]
  27. Cebuano
  28. Cherokee
  29. Chichewa
  30. Chinese (Mandarin)
  31. Corsican
  32. Croatian[28]
  33. Czech
  34. Chuvash
  35. Church Slavonic
  36. Danish[28]
  37. Dutch[28]
  38. Dzongkha
  39. English (American)[28]
  40. English (British)
  41. English (Caribbean)
  42. English (Lancastrian)
  43. English (Received Pronunciation)
  44. English (Scottish)
  45. English (West Midlands)
  46. Esperanto[28]
  47. Estonian[28]
  48. Finnish[28]
  49. Filipino
  50. French (Belgian)[28]
  51. French (France)
  52. French (Swiss)
  53. Frisian
  54. Galician
  55. Georgian[28]
  56. German[28]
  57. Greek (Modern)[28]
  58. Greenlandic
  59. Guarani
  60. Gujarati
  61. Hakka Chinese [note 3]
  62. Haitian Creole
  63. Hausa
  64. Hawaiian
  65. Hebrew
  66. High Valyrian
  67. Hindi[28]
  68. Hmong
  69. Hungarian[28]
  70. Icelandic[28]
  71. Igbo
  72. Indonesian[28]
  73. Ido
  74. Interlingua
  75. Interlingue
  76. Irish[28]
  77. Italian[28]
  78. Japanese [note 4][29]
  79. Kannada[28]
  80. Kazakh
  81. Khmer
  82. Klingon
  83. Kʼicheʼ
  84. Kirundi
  85. Kinyarwanda
  86. Konkani[30]
  87. Korean
  88. Kurdish[28]
  89. Kyrgyz
  90. Quechua
  91. Ladakhi
  92. Lao
  93. Latin
  94. Ladino
  95. Latgalian
  96. Latvian[28]
  97. Lang Belta
  98. Lingua Franca Nova
  99. Lepcha
  100. Limbu
  101. Lithuanian
  102. Lojban[28]
  103. Luxembourgish
  104. Macedonian
  105. Maithili
  106. Malagasy
  107. Malay[28]
  108. Malayalam[28]
  109. Maltese
  110. Māori
  111. Marathi[28]
  112. Mongolian
  113. Nahuatl (Classical)
  114. Navajo
  115. Nepali[28]
  116. Norwegian (Bokmål)[28]
  117. Northern Sotho
  118. Nogai
  119. Odia
  120. Oromo
  121. Occitan
  122. Papiamento
  123. Palauan
  124. Pashto
  125. Persian[28]
  126. Persian (Latin alphabet) [note 2]
  127. Polish[28]
  128. Portuguese (Brazilian)[28]
  129. Portuguese (Portugal)
  130. Punjabi[31]
  131. Pyash (a constructed language)
  132. Romanian[28]
  133. Russian[28]
  134. Russian (Latvia)
  135. Samoan
  136. Sanskrit
  137. Scottish Gaelic
  138. Serbian[28]
  139. Shan (Tai Yai)
  140. Sharda
  141. Sesotho
  142. Shona
  143. Sindhi
  144. Sinhala
  145. Slovak[28]
  146. Slovenian
  147. Somali
  148. Spanish (Spain)[28]
  149. Spanish (Latin American)
  150. Swahili[26]
  151. Swedish[28]
  152. Tajik
  153. Tamil[28]
  154. Tatar
  155. Telugu
  156. Tibetan
  157. Tswana
  158. Thai
  159. Turkmen
  160. Turkish[28]
  161. Tatar
  162. Uyghur
  163. Ukrainian
  164. Urdu
  165. Uzbek
  166. Vietnamese (Central Vietnamese)[28]
  167. Vietnamese (Northern Vietnamese)
  168. Vietnamese (Southern Vietnamese)
  169. Volapük
  170. Welsh
  171. Wolof
  172. Xhosa
  173. Yiddish
  174. Yoruba
  175. Zulu

Notes:
  1. Currently, only fully diacritized Arabic is supported.
  2. Persian written using English (Latin) characters.
  3. Currently, only Pha̍k-fa-sṳ is supported.
  4. Currently, only Hiragana and Katakana are supported.

See also


  1. ^ Switch to eSpeak NG in NVDA distribution #5651
  2. ^ eSpeak TTS for Android
  3. ^ espeak-ng package in Ubuntu
  4. ^
  5. ^ Google blog, Giving a voice to more languages on Google Translate, May 2010
  6. ^ Google blog, Listen to us now, December 2010.
  7. ^ eSpeak Speech Synthesizer 3. LANGUAGES
  8. ^
  9. ^
  10. ^ Subversion history (revision 1)
  11. ^ Subversion history (revision 56)
  12. ^
  13. ^
  14. ^ van Leussen, Jan-Wilem; Tromp, Maarten (26 July 2007). "Latin to Speech": 6. CiteSeerX
  15. ^
  16. ^
  17. ^
  18. ^ Taking ownership of the eSpeak project and its future
  19. ^ Vote for new main eSpeak developer
  20. ^ Rebrand the espeak program to espeak-ng.
  21. ^ espeak-ng 1.49.0
  22. ^ Dennis H. Klatt (1979). "Software for a cascade/parallel formant synthesizer" (PDF). J. Acoustical Society of America, 67(3) March 1980.
  23. ^ List of recorded fricatives in eSpeakNG
  24. ^
  25. ^
  26. ^ Butgereit, L., & Botha, A. (2009, May). Hadeda: The noisy way to practice spelling vocabulary using a cell phone. In The IST-Africa 2009 Conference, Kampala, Uganda.
  27. ^ Hamiti, M., & Kastrati, R. (2014). Adapting eSpeak for converting text into speech in Albanian. International Journal of Computer Science Issues (IJCSI), 11(4), 21.
  28. ^ Kayte, S., & Gawali, D. B. (2015). Marathi Speech Synthesis: A review. International Journal on Recent and Innovation Trends in Computing and Communication, 3(6), 3708-3711.
  29. ^ Pronk, R. (2013). Adding Japanese language synthesis support to the eSpeak system. University of Amsterdam.
  30. ^ Mohanan, S., Salkar, S., Naik, G., Dessai, N. F., & Naik, S. (2012). Text Reader for Konkani Language. Automation and Autonomous System, 4(8), 409-414.
  31. ^ Kaur, R., & Sharma, D. (2016). An Improved System for Converting Text into Speech for Punjabi Language using eSpeak. International Research Journal of Engineering and Technology, 3(4), 500-504.

External links