Talk:Speech synthesis

Speech synthesis is a former featured article. Please see the links under Article milestones below for its original nomination page (for older articles, check the nomination archive) and why it was removed.
This article appeared on Wikipedia's Main Page as Today's featured article on June 3, 2004.
Article milestones
Date | Process | Result
November 16, 2005 | Featured article candidate | Promoted
November 7, 2006 | Featured article review | Demoted
Current status: Former featured article

Microsoft Sam glitch

Does anyone know why MS Sam makes a really odd airy sound when it reads "soi" or "soy"? There are other errors, but I can't remember them. —Preceding unsigned comment added by (talk) 11:30, 13 September 2008 (UTC)

Older comments

I think the sections on the different synthesis techniques are perhaps long enough to warrant their own pages. Nohat 05:20 19 Jun 2003 (UTC)

Could somebody expand on 'Formant synthesis', both on the technical side and on terminology? Is any system that applies filtering techniques to basic waves + noise considered 'formant' synthesis? Is it a specific technique, or just a general term for synthesising such phenomena?

Trillium Sound Research Inc (now defunct) offered unlimited vocabulary articulatory speech synthesis on the NeXT computer in 1994, so it is not accurate to say that articulatory speech synthesis is only of academic interest and not far enough advanced for commercial application. It was NeXT Computer that failed, not the synthesis, which was rated the best synthesis available at the time. That software is now the basis of the GnuSpeech project -- a port of the original NeXT software to Linux. It is under a GPL. The basis is an acoustic tube model, so it is low level articulatory synthesis with the necessary databases for varying the tube cross-sections, using the Fant/Carre research on formant sensitivity analysis and control regions. Provision is made for adding the higher level parameters such as tongue height, jaw opening, etc, but this extension is still undeveloped, and would rely on deriving relationships between these higher-level parameters and the low-level tube cross-section parameters. Other ports are possible/likely.
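
As a rough illustration of the kind of low-level tube model described above, the following sketch computes the reflection coefficient at each junction between adjacent tube sections from their cross-sectional areas, in the style of a Kelly-Lochbaum model. This is not taken from the GnuSpeech code: the function name and the area values are hypothetical, and the formula uses the pressure-wave convention r = (A_k - A_{k+1}) / (A_k + A_{k+1}), which differs in sign from some textbooks.

```python
def reflection_coefficients(areas):
    """Return the reflection coefficient at each junction of a tube model.

    areas: cross-sectional areas of consecutive tube sections.
    Assumes the pressure-wave convention r = (A_k - A_next) / (A_k + A_next);
    other texts use the opposite sign.
    """
    if any(a <= 0 for a in areas):
        raise ValueError("cross-sectional areas must be positive")
    # One coefficient per junction, so the result has len(areas) - 1 entries.
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


# Hypothetical cross-sections in cm^2, ordered from glottis to lips
# (illustrative values only, not measured vocal tract data).
example_areas = [2.0, 1.0, 1.0, 4.0]
coefficients = reflection_coefficients(example_areas)
```

A junction between two equal areas gives a coefficient of zero (no reflection), which is why varying the cross-sections over time is what shapes the sound in such a model.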

use in Weatheradio

Not sure where to put it, but the National Weather Service in the U.S. uses it on all Weatheradio stations now. The new voice sounds excellent, and I think it uses a hybrid of patched voice and true synthesis. The Weather Channel may also use this for their Vocal Local announcements during the local forecast (but not on their Weatherscan channel). –radiojon 02:47, 2004 Jun 4 (UTC)

The NWS "Tom" and "Donna" AKA "Mara" voices are the SpeechWorks Speechify (now merged with Realspeak) American English voices "Tom" and "Mara" (no longer available), which use a purely concatenative system. [1] Nohat 06:57, 14 Apr 2005 (UTC)

External links

I think this article has too many (16) external links in the section, "Examples of current systems". I don't know much about the different systems we link to, but either A) each system we list is important in the topic of speech synthesis and needs to be mentioned in the article, or B) not all of these systems are important. In case A), we should use internal links: Don't use external links where we'll want Wikipedia links. In case B), we should just pick one or two examples, or link to a page which contains these links (Wikipedia is not a link repository.) — Matt 14:08, 17 Jun 2004 (UTC)

  • I'm not sure there's such a thing as "too many external links". "Wikipedia is not a link repository" applies to articles which consist of nothing but links, which this article is clearly not. I don't see what the point of removing all or some of the links would be. Nohat 18:59, 2004 Jun 23 (UTC)
  • I agree that once an article starts "collecting" external links, it's hard to stop ("If website A is listed, why not website B?"). Nohat, I don't agree with your assessment that the "Wikipedia is not a link repository" statement only applies to a particular type of article. Link spamming and excessive linking are becoming a major problem (WP:WPSPAM). There is a major update and serious discussion underway here (WP:EL), with more clearly defined do's and don'ts. The trend, I believe, will be toward fewer external links. One recommendation now appearing in the guidelines is to eliminate all but a few highly relevant links and place a link to DMOZ that points to a directory of websites related to the article's topic. Speech synthesis in particular had many links to commercial websites promoting products and services, but these were removed per the guidelines. I went ahead and removed several more today because they were promotional (one selling a book, free for a limited time, required registration, etc.). Any discussion on these edits would be welcome. Calltech 15:06, 5 December 2006 (UTC)

Request for references

Hi, I am working to encourage implementation of the goals of the Wikipedia:Verifiability policy. Part of that is to make sure articles cite their sources. This is particularly important for featured articles, since they are a prominent part of Wikipedia. The Fact and Reference Check Project has more information. If some of the external links are reliable sources and were used as references, they can be placed in a References section too. See the cite sources link for how to format them. Thank you, and please leave me a message when a few references have been added to the article. - Taxman 19:43, Apr 22, 2005 (UTC)

Early Voices Described as "Robotic" Seems Circular

Primitive speech synthesis devices sound robotic. A robotic voice is produced by a primitive speech synthesis device. This is circular. The popular idea of what a robot's voice sounds like comes from early attempts at speech synthesis. Film and television makers must have imitated what had been produced by early efforts at synthesis when creating robotic characters. It would be more accurate, I think, to say that the idea of a robotic voice came from efforts to produce speech synthesis. Saying that early speech synthesizers were robotic gets it backwards.

I'm not sure it's quite so simple as that. Interestingly, there has only ever been one speech synthesis system that spoke in a monotone (and not a very popular or often-used one at that), yet the most common feature of "robotic" voices when imitated by humans is a monotone. Clearly this notion of what a robot sounds like was not based on listening to actual synthesized speech. It is more likely that the idea of "robotic" voices came from what people imagined a synthetic voice would sound like, rather than what actual synthetic voices sounded like.
Regardless of all this, for the contemporary reader, describing the voice as "robotic" is probably a fairly safe reference point, if perhaps preposterous in the literal sense, for explaining what old speech synthesis systems sounded like. Nohat 06:50, 25 October 2005 (UTC)

Open source software

Are there any open source speech synthesis projects? It would be great to summarize how the best few are doing or note the lack if there are none. — Hippietrail 17:36, 15 April 2006 (UTC)

Possible copyvio

A possible copyvio concern has arisen in the Featured article review. User:Marskell wrote "I believe the Concatenative Synthesis section may be a text dump from here". This is a serious concern that should be addressed immediately. Joelito (talk) 19:30, 7 November 2006 (UTC)

External links cleanup

The external links section was filling up with links to similar websites. WP is not a directory of links (WP:NOT):

"Wikipedia articles are not mere collections of external links or internet directories. There is nothing wrong with adding one or more useful content-relevant links to an article; however, excessive lists can dwarf articles and detract from the purpose of Wikipedia"

I went ahead and removed most of the external links and added DMOZ category for speech synthesis (per WP recommendation). If you feel that any of the deleted links contribute substantially more than the others, please feel free to leave a comment here and we all can discuss. Thanks! Calltech 18:43, 20 December 2006 (UTC)

Fair use rationale for Image:MS Sam.ogg

Image:MS Sam.ogg is being used on this article. I notice the image page specifies that the image is being used under fair use but there is no explanation or rationale as to why its use in this Wikipedia article constitutes fair use. In addition to the boilerplate fair use template, you must also write out on the image description page a specific explanation or rationale for why using this image in each article is consistent with fair use.

Please go to the image description page and edit it to include a fair use rationale. Using one of the templates at Wikipedia:Fair use rationale guideline is an easy way to ensure that your image is in compliance with Wikipedia policy, but remember that you must complete the template. Do not simply insert a blank template on an image page.

If there is other fair use media, consider checking that you have specified the fair use rationale on the other images used on this page. Note that any fair use images lacking such an explanation can be deleted one week after being tagged, as described on criteria for speedy deletion. If you have any questions please ask them at the Media copyright questions page. Thank you.

BetacommandBot (talk) 13:25, 8 March 2008 (UTC)

How on earth could this be copyrighted? It's a voice saying a sentence. You can't copyright arbitrary audio from a text-to-speech synthesizer. You can only copyright a specific recording. (talk) 03:42, 20 September 2009 (UTC)

Text to speech based on Festival in Unix

Installed on a Unix box. Type any text (English only) and get the output as a downloadable WAV or MP3 file. The voice has a British accent and is kind of croaky, but understandable. It is clearer in the WAV format. —Preceding unsigned comment added by (talk) 21:04, 27 May 2008 (UTC)

Suggest that Heterogeneous Relation Graph (HRG) and Delta should be described here

These comprise an important phase of most modern TTS systems and should be discussed here. I don't currently have the time to add this section, but if no one else gets around to it, I'll come back and write up a few things when I'm less busy. Twikir (talk) 04:08, 15 April 2009 (UTC)Twikir


There's a program online that might count as a speech synthesizer. It's called "CrapTalker". Should it be added to the links?

Sohzq (talk) 13:47, 26 May 2009 (UTC)

Text-to-speech voices

Can a new section and article be made comparing the text-to-speech voices? Besides the conventional Microsoft Sam and Microsoft Anna, do other voices exist?

Also, does a voice like the Monster, Alien, or Amplifier Halloween voice changer exist? These voices were featured in Fun with Dick & Jane (see here and here). They may be somewhat harder to understand, though, but could still be used in some applications. —Preceding unsigned comment added by (talk) 11:55, 16 June 2009 (UTC)

Overview of text processing figure

Shouldn't the first block of the linguistic analysis component be "Phrasing" rather than "Phasing"? Broloks (talk) 16:55, 3 October 2009 (UTC)

A new alternative front end?

The current front end to this technology seems to be solely text processing. But a recently reported study here demonstrated a method of reconstructing words based on the brain waves of patients simply thinking of those words, by monitoring the superior temporal gyrus of their brains. The 2012 study by Pasley et al., reported in the journal PLoS Biology [2], used fMRI to track blood flow in the brains of 15 patients who were undergoing surgery for epilepsy or tumours, while playing audio of a number of different speakers reciting words and sentences. With the aid of a computer model, when patients were presented with words to think about, the team was able to guess which word the participants had chosen. Potential therapeutic implications have been suggested. Thanks. Martinevans123 (talk) 19:03, 14 February 2012 (UTC)

Robotics project attention needed

  • Refs - large amounts of text have no refs
  • Content - are all topics covered?
  • MoS compliance
  • Reassess

Chaosdruid (talk) 11:39, 24 March 2012 (UTC)

Needs more examples

Not to state the obvious, here, but this article needs more examples (i.e., sound files) of what kind of results the different types of speech synthesis can give. Right now the article has only two examples, neither tied to a specific section, and only one of which gives enough information to tell how it was generated. - dcljr (talk) 08:13, 14 January 2013 (UTC)

Kurzweil reading machine

According to a source already cited in the article, Klatt, D. (1987) "Review of Text-to-Speech Conversion for English" Journal of the Acoustical Society of America 82(3):737-93, the Kurzweil Reading Machine was the first commercial text-to-speech synthesis system. That is, on p. 770 Klatt has a diagram with a box for "Kurzweil Reading Machine, 1976", and under it the diagram says "first commercial system". Do other sources disagree? If not, where is the best place to put this in the article? Silas Ropac (talk) 16:38, 8 February 2013 (UTC)

NeoSpeech & Natural Voices

After some quick research into the most natural-sounding speech synthesis voices, I found these two. I appreciate both, but NeoSpeech seems more natural.

--TudorTulok (talk) 14:33, 11 March 2013 (UTC)


> Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there

Is this applied "computational fluid dynamics" by any other name? Are they taking a computer-tomography headshot and a full dental panoramic shot of a person, turning those into a 3D model of the human vocal tract, and then using that model as a virtual wind tunnel in a CFD software suite? If yes, the article could include a link. (talk) 21:48, 21 August 2015 (UTC)

I think that computational fluid dynamics (CFD) is not suitable for articulatory synthesis, because CFD requires huge computational power to simulate the turbulence typically seen in fricatives and plosives. Instead, simpler physical models known as distributed element models, such as the waveguide synthesis once used in the Daisy Bell demo or its variant, the Tube Resonance Model (TRM) used in Gnuspeech, seem to be practical. For details, see the article "Articulatory synthesis". --Clusternote (talk) 08:43, 27 August 2015 (UTC)
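
To give a sense of how little computation a one-dimensional tube model needs compared with CFD, here is a minimal sketch of the simplest possible case: the resonances of a uniform, lossless tube closed at one end and open at the other, F_n = (2n - 1) c / (4 L). The function name, the 17.5 cm tube length, and the 350 m/s speed of sound are illustrative assumptions, not values taken from Gnuspeech or the TRM.

```python
def uniform_tube_resonances(length_m, count=3, speed_of_sound=350.0):
    """Resonance frequencies (Hz) of a uniform tube closed at one end, open at the other.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    Losses, lip radiation, and any variation in cross-section are ignored.
    """
    if length_m <= 0:
        raise ValueError("tube length must be positive")
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]


# Assumed tube length of 17.5 cm, a common textbook figure for an adult vocal tract.
resonances = uniform_tube_resonances(0.175)
```

With these assumed values the first three resonances come out at approximately 500, 1500, and 2500 Hz, close to the formants of a neutral vowel. A real tube model varies the cross-section along the tube, but the cost stays tiny next to a turbulence simulation.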

Ride of the Vocaloids.

> This article says: Speech Synthesis is the artificial production of human speech

Is song synthesis a sub-field of speech synthesis or considered something completely different? If they are considered different, where is the boundary between them? (Cue Richard Wagner's Sprechgesang or the priestly intonations made during the Catholic / Orthodox Christian holy liturgy.) The current article doesn't discuss this dichotomy properly. (talk) 22:16, 21 August 2015 (UTC)

Singing synthesis (ja) is an interdisciplinary field between sound synthesis (focused on harmonic structure) and speech synthesis (focused on the language model and acoustic model). In my eyes, a clear definition of the boundaries between them seems hard to find, for several reasons:
  • The researchers and developers of singing synthesis are very few, and they do not seem to have any conflicts of interest with the neighboring fields, so neither they nor their neighbors may feel the need to clarify the boundaries to protect their respective interests.
  • For the researchers and developers, it is probably obvious that singing synthesis is a customized version of speech synthesis optimized for song (for example, the clearness of the fundamental frequency for playing melody, the harmonicity of the spectral content for playing harmony, and possibly the smoothness of and the delimitation between pronunciations, as seen in Vocaloid).
  • From the viewpoint of signal processing, speech synthesis and sound synthesis are similar in that both deal with audio signals. However, from the viewpoint of cognitive science (the science of how the human brain recognizes various media), the recognition processes for speech and music take different routes in the human brain (see Language processing in the brain, Cognitive neuroscience of music). The recognition process for song, a combination of the two, is probably described by the coexistence of these two routes plus the additional interference between them. This interference, sometimes called synergy, may be a main difficulty in defining the boundaries.
In my opinion, singing synthesis should be described in a new dedicated article, to avoid inappropriately narrowing its potential through a thoughtless definition of boundaries. --Clusternote (talk) 02:45, 27 August 2015 (UTC)
P.S. I found a song performed on the Voder (an early speech synthesizer) circa 1939, which can be heard in a video. Its melody seems to be played with a pitch controller on the Voder. The inventors of the vocoder and Voder considered musical applications as a future plan, and Werner Meyer-Eppler in Germany wrote a paper in 1948 giving his own perspective. The separation of the elemental technologies (speech synthesis, vocoder as musical application, and singing synthesis) may have occurred after that. --Clusternote (talk) 04:52, 29 August 2015 (UTC)


Everybody, User:Wtshymanski, User:1989, please stop removing the "1,234,567,890 times (unintelligible noise)". These are literally heard in the clip. Largoplazo (talk) 16:30, 13 February 2017 (UTC)