Gnuspeech

Platform: Cross-platform
Type: Text-to-speech
License: GNU General Public License

Gnuspeech is an extensible text-to-speech software package that produces artificial speech output using real-time articulatory speech synthesis by rules. It converts text strings into phonetic descriptions, aided by a pronouncing dictionary, letter-to-sound rules, and rhythm and intonation models; transforms the phonetic descriptions into parameters for a low-level articulatory speech synthesizer; and uses these parameters to drive an articulatory model of the human vocal tract, producing output suitable for the normal sound output devices used by various computer operating systems. For adult speech, it does this at the same rate as, or faster than, the speech is spoken.
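The first stage of this pipeline, dictionary lookup with a letter-to-sound fallback, can be illustrated with a minimal sketch. Everything in it is hypothetical: the dictionary entries, the phone symbols, the fallback rules, and the function names are invented for illustration and are not Gnuspeech's data or API.

```python
# Toy illustration only: these entries and phone symbols are invented
# for this sketch and are not taken from the Gnuspeech dictionary.
PRONOUNCING_DICTIONARY = {
    "hello": ["h", "uh", "l", "oh"],
    "world": ["w", "er", "l", "d"],
}

# Hypothetical fallback rules: a few letters map to a different phone symbol,
# every other letter maps to itself (a gross simplification of real rules).
LETTER_RULES = {"c": "k", "q": "k", "x": "ks"}


def letter_to_sound(word):
    """Fallback for words missing from the dictionary."""
    return [LETTER_RULES.get(ch, ch) for ch in word]


def text_to_phones(text):
    """Convert text to a flat list of phone symbols, preferring the dictionary."""
    phones = []
    for token in text.lower().split():
        word = "".join(ch for ch in token if ch.isalpha())  # drop punctuation
        if not word:
            continue
        entry = PRONOUNCING_DICTIONARY.get(word)
        phones.extend(entry if entry is not None else letter_to_sound(word))
    return phones
```

A real system would then attach rhythm and intonation to this phonetic description before converting it into synthesizer parameters; the sketch stops at the phone list.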

Design

The synthesizer is a tube resonance, or waveguide, model that models the behavior of the real vocal tract directly and reasonably accurately, unlike formant synthesizers, which model the speech spectrum indirectly.[1] The control problem is solved by using René Carré’s Distinctive Region Model,[2] which relates changes in the radii of eight longitudinal divisions of the vocal tract to corresponding changes in the three formant frequencies in the speech spectrum that convey much of the information of speech. The regions are, in turn, based on work by the Stockholm Speech Technology Laboratory[3] of the Royal Institute of Technology (KTH) on "formant sensitivity analysis", that is, how formant frequencies are affected by small changes in the radius of the vocal tract at various places along its length.[4]
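To make the link between region radii and tube acoustics concrete, the sketch below computes the reflection coefficient at each junction between adjacent tube sections, using the textbook Kelly-Lochbaum junction formula. It is not taken from the Gnuspeech source: the function name, the choice of one section per region, and the sign convention are assumptions made for illustration.

```python
import math


def reflection_coefficients(radii_cm):
    """Reflection coefficient at each junction between adjacent tube sections.

    Assumes circular cross-sections and the convention
    k = (A_i - A_next) / (A_i + A_next); other texts use the opposite sign.
    """
    areas = [math.pi * r * r for r in radii_cm]
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]
```

A uniform tube gives zero reflection at every junction, so waves pass through unchanged; narrowing or widening one region changes the neighbouring coefficients and thereby shifts the tube's resonances.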

History

Gnuspeech was originally commercial software produced by the now-defunct Trillium Sound Research for the NeXT computer as various grades of "TextToSpeech" kit. Trillium Sound Research was a technology transfer spin-off company formed at the University of Calgary, Alberta, Canada, based on long-standing research in the computer science department on computer-human interaction using speech; papers and manuals relevant to the system are maintained there.[5] The initial version in 1992 used a formant-based speech synthesizer. When NeXT ceased manufacturing hardware, the synthesizer software was completely rewritten[6] using the waveguide approach to acoustic tube modeling, based on research at the Center for Computer Research in Music and Acoustics (CCRMA) at Stanford University, especially the Music Kit, and was also ported to NSFIP (NextStep For Intel Processors). The synthesis approach is explained in more detail in a paper presented to the American Voice I/O Society in 1995.[7]

The waveguide (also known as the tube model) ran on the onboard 56001 Digital Signal Processor (DSP) of the NeXT computer and, in the NSFIP version, on a Turtle Beach add-on board with the same DSP. Speed limitations meant that the shortest vocal tract length that could be used for speech in real time (that is, generated at the same or faster rate than it was "spoken") was around 15 centimeters, because the sample rate for the waveguide computations increases with decreasing vocal tract length. Faster processor speeds are progressively removing this restriction, an important advance for producing children's speech in real time.
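The dependence of sample rate on tract length can be shown with a short calculation. This is a sketch under stated assumptions, not the Gnuspeech implementation: it assumes the tube is divided into a fixed number of equal sections, each as long as the distance sound travels in one sample period. The section count of 10 and the speed of sound of 350 m/s are illustrative values.

```python
def required_sample_rate(tract_length_cm, sections=10, speed_of_sound_m_s=350.0):
    """Sample rate (Hz) at which each section spans exactly one sample of travel.

    Assumption: section_length = speed_of_sound / sample_rate, so
    sample_rate = sections * speed_of_sound / tract_length.
    """
    length_m = tract_length_cm / 100.0
    return sections * speed_of_sound_m_s / length_m


if __name__ == "__main__":
    # Shorter tracts (for example, a child's) need proportionally higher rates.
    for length_cm in (17.5, 15.0, 10.0):
        print(length_cm, "cm ->", round(required_sample_rate(length_cm)), "Hz")
```

Under these assumptions, halving the tract length doubles the number of samples that must be computed per second, which is why a fixed-speed DSP imposes a minimum tract length for real-time operation.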

Trillium ceased trading in the late 1990s, and the Gnuspeech project was first entered into the GNU Savannah repository under the terms of the GNU General Public License in 2002, as official GNU software.

Portability

Various associated modules used to help develop the original spoken English databases are being ported, and they could be used for other languages. The whole software suite is suitable for psychoacoustic and linguistic research, but is currently complete only for the NeXT. One main module, Monet, is available for Mac OS X. Monet allows the creation and modification of the rules used to form and concatenate the speech sound parameters for different languages, with the exception of the rules used for intonation. However, the rule-based intonation can be manually varied.

References

  1. COOK, P.R. (1989) Synthesis of the singing voice using a physically parameterized model of the human vocal tract. International Computer Music Conference, Columbus, Ohio.
  2. CARRE, R. (1992) Distinctive regions in acoustic tubes. Speech production modelling. Journal d'Acoustique, 5, 141-159.
  3. Now Department for Speech, Music and Hearing.
  4. FANT, G. & PAULI, S. (1974) Spatial characteristics of vocal tract resonance models. Proceedings of the Stockholm Speech Communication Seminar, KTH, Stockholm, Sweden.
  5. Relevant U of Calgary website.
  6. The Tube Resonance Model Speech Synthesizer.
  7. HILL, D.R., MANZARA, L. & TAUBE-SCHOCK, C-R. (1995) Real-time articulatory speech-synthesis-by-rules. Proc. AVIOS '95 14th Annual International Voice Technologies Conf., San Jose, 12-14 September 1995, 27-44.
