This is the talk page for discussing improvements to the Speech recognition article.
- 1 Voice Recognition
- 2 Textbook Link ?
- 3 Old comments
- 4 copyvio
- 5 amend
- 6 speech recognition vs voice recognition
- 7 Open source software
- 8 SimonSays
- 9 non-encyclopedic?
- 10 some clean ups
- 11 Missing history
- 12 Redirected from Speech to text
- 13 Broader Implications of Speech Recognition
- 14 Software?
- 15 Remove external link
- 16 Hello.
- 17 Books Section
- 18 Possible external link
- 19 Vandalism here and else where from 18.104.22.168
- 20 "machine-readable text?"
- 21 ADD LINK TO VOICE DATA COLLECTION?
- 22 Shameful anti-Gates bias!
- 23 Future of
- 24 Merge with Speech-to-Text Reporter
- 25 Proposed addition: LumenVox Speech Engine
- 26 Speech-to-Text vs Text-to-Speech
- 27 Laptop?
- 28 Conflation Error, Corrected?
- 29 Linkfarms
- 30 Request for Review of Potential New Article: LumenVox
- 31 Update Links (Euro Fighter)
- 32 Current Research
- 33 An error in the History section
- 34 Quality & refs
- 35 Stress and fatigue voice characteristics
"Voice Recognition" is analysis of the spectral patterns of one's speech to verify if that voice belongs to a registered individual. Voice recognition is used in authentication systems. "Speech Recognition" is analysis of the speech stream to parse semantic content, frequently used for command and control. Why is even wikipedia conflating these two terms? —Preceding unsigned comment added by 22.214.171.124 (talk) 13:50, 22 April 2009 (UTC)
- Probably because Voice Recognition is often mentioned together with Speech recognition in published literature. A second reason would be that VR techniques come from SR. OrenBochman (talk) 14:25, 28 March 2012 (UTC)
Textbook Link ?
http://www.cs.colorado.edu/%7Emartin/SLP/Updates/ —Preceding unsigned comment added by 126.96.36.199 (talk) 10:26, 7 September 2007 (UTC)
I have no idea how to use Wikipedia, but someone needs to revert this page back to how it was a couple of edits back. —Preceding unsigned comment added by 188.8.131.52 (talk) 09:19, 27 November 2009 (UTC)
Content from Speech Recognition, now a redirect here.
My understanding is that the Entropic HTK toolkit, while available, is copyrighted by Microsoft. I would suggest looking elsewhere for places to start... probably CMU Sphinx, as evil and difficult as it is to use.
NotSoAnonymousCoward 17 Nov, 2005
Phoneme identification begins when the sound of speech arrives at the computer as an analogue waveform. At the end of the process, the identified phonemes are assembled into words, whether the input is continuous or discrete speech. Collecting examples of every single word takes long, hard work, which is carried out on training samples (known as corpora). There are also several problems along the way. The first is recognising the kind of speech, deliberate or continuous; another is the difficulty of handling any speaker, given the variation that individual speech brings. Contamination of the signal by outside noise is a further problem. Finally, the system has to overcome the grammar mistakes caused by differences in accents, dialects and spoken languages.
Charles Matthews 09:20, 6 May 2004 (UTC)
Note on the Technical Issues Section: I recently added (as an anonymous user) the first part of this section (up to SPHINX) and deleted the original content which was not really technical and which was mostly speculative rather than factual. Many parts of the original section are wrong or irrelevant. Someone put back the original part which makes the whole thing look like it was stitched together. Perhaps the original part (after SPHINX) should be called something else, e.g. "challenges in speech recognition". -Dan
- If it's wrong, then correct it. If it's mislabeled, then label it correctly. Don't just delete--that's not how we do things here. Nohat 23:08, 18 Apr 2005 (UTC)
- I restored the above content. It's generally considered bad form to delete content from Talk pages unless it's been archived to an archive page. Nohat 23:56, 18 Apr 2005 (UTC)
Other old content from Speech Recognition, now moved here as it is too vague and too out of place. Perhaps someone can boil down some of the observations in it into a paragraph on the technical difficulty of speech recognition.
Some other key technical problems in speech recognition are:
- Inter-speaker differences and also intra-speaker variations are often large and difficult to account for. It is not clear which characteristics of speech are speaker-independent.
- Speech recognition systems are based on simplified stochastic models that do not match real speech accurately.
- The interpretation of many phonemes, words and phrases is context sensitive. For example, phonemes are often shorter in long words than in short words. Words have different meanings in different sentences, e.g. "Philip lies" could be interpreted either as Philip being a liar, or as Philip lying on a bed.
- Co-articulation of phonemes and words, depending on the input language, can make the task of speech recognition considerably more difficult. Some languages, such as English, have large amounts of co-articulation in conversational speech (consider for example the sentence "what are you going to do?", which can be pronounced as "whatchagonnado?", which has no resemblance to the "correctly" pronounced sentence). Other languages have almost no co-articulation, and are therefore much easier to recognize. Japanese for example is strictly syllable based, and has no co-articulation, which makes it much easier to recognize than English.
- Intonation and speech timbre can completely change the correct interpretation of a word or sentence, e.g. "Go!", "Go?" and "Go." can clearly be recognised by a human, but not so easily by a computer.
- Words and sentences can have several valid interpretations such that the speaker leaves the choice of the correct one to the listener.
- Written language may need punctuation according to strict rules that are not strongly present in speech, and are difficult to infer without knowing the meaning (commas, ending of sentences, quotations).
The "understanding" of the meaning of spoken words is regarded by some as a separate field, that of natural language understanding. However, there are many examples of sentences that sound the same, but can only be disambiguated by an appeal to context: one famous T-shirt worn by Apple Computer researchers stated, I helped Apple wreck a nice beach, which, when spoken, sounds like I helped Apple recognize speech.
A general solution to many of the above problems effectively requires human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer. In particular, statistical language models are often employed for disambiguation and to improve recognition accuracy.
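The role of a statistical language model in disambiguation can be sketched with a toy example. The following is a minimal sketch, not how any particular recognizer works: it assumes a tiny hypothetical corpus, a bigram model with add-one smoothing, and hypothetical function names (`train_bigrams`, `log_prob`, `pick_best`). It scores the two readings of the T-shirt sentence and keeps the more probable one.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real system would train on far more text.
CORPUS = [
    "i helped apple recognize speech",
    "computers recognize speech",
    "we recognize speech with statistical models",
    "they walked on a nice beach",
]


def train_bigrams(sentences):
    """Count history words and word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])               # words used as history
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)


def log_prob(sentence, unigrams, bigrams, vocab_size):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing keeps unseen pairs from getting zero probability.
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def pick_best(hypotheses, model):
    """Return the hypothesis the language model finds most probable."""
    return max(hypotheses, key=lambda h: log_prob(h, *model))


MODEL = train_bigrams(CORPUS)
HYPOTHESES = [
    "i helped apple recognize speech",
    "i helped apple wreck a nice beach",
]
BEST = pick_best(HYPOTHESES, MODEL)
print(BEST)
```

In a real recognizer the language model score would be combined with an acoustic score, and the corpus would be far larger. With this toy corpus the result mostly reflects that "recognize speech" was seen in training while "wreck" was not, and that longer hypotheses accumulate more penalty terms.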
For foreign speakers an unintended side-effect of using speech recognition technology is that they can improve their pronunciation while trying to make the computer understand what they're saying.
-- 18 Apr 2004; moved by Dan
Possible copyright violation from http://www.wombatnation.com/2004/04/speech-recognition/ Arvindn 08:09, 7 May 2005 (UTC)
Yes, a large portion of the content does appear to have been copied directly from a post I made to my blog in April, 2004, albeit with a bit of reorganization and editing. All the material on my blog is licensed under a non-commercial, attribution Creative Commons license. I certainly don't mind having content I've authored show up on Wikipedia, and given the nature of Wikipedia articles, I don't expect attribution. However, it would have been nice to have been notified about it directly. Thanks very much, Arvind, for bringing it to my attention, as a link to this page showed up in my website referer log today. RobertStewart 00:45, 9 May 2005 (UTC)
It looks like someone copied in the material on March 4, 2005. Nohat, I saw your comment, "this section is not written in a very encyclopedic style--it is too breezy and sounds like it's written from a single POV." That's spot on! I thought I was just providing a useful summary on my blog, not writing material for an encyclopedia article. As I stated above, you're welcome to use whatever you want of what I wrote. The edits to date have certainly improved it, but I agree that it (at least the parts derived from my original post) could use a lot more editing to make the content more suitable as an encyclopedia article. I'll try to do some editing myself to update things that have changed over time. RobertStewart 01:05, 9 May 2005 (UTC)
"Speech recognition systems have found use where the speed of text input is required to be extremely fast." It is hard to believe it could outrank the keyboard with a very proficient typist. Even if the speaker trains to speak extremely fast (which is only possible for short amounts of time due to the huge consumption of neural processing) there still wouldn't be a market and thus no software for it.
speech recognition vs voice recognition
I was under the impression that speech recognition differs from voice recognition (which uses voice or voiceprints to identify someone). If so, I don't think a search for voice recognition should redirect someone here to speech recognition.
Am I too off target?
- You're thinking of voice authentication, which as of this writing is an article which does not exist yet. You are correct, voice recognition should be a disambiguation page pointing to speech recognition and voice authentication. Nohat 23:42, 10 April 2006 (UTC)
Open source software
Are there any open source speech recognition projects? It would be great to summarize how the best few are doing or note the lack if there are none. — Hippietrail 17:35, 15 April 2006 (UTC)
There are some, all of the following are under a MIT-like license:
Quoting: Start-ups are also making an impact in speech recognition, most notably SimonSays Voice Technologies. A Toronto-based company, Simonsays has made several breakthroughs in robust server-based speech recognition. Though SimonSays currently possesses a smaller market share, they are certainly a company to watch.
I would rather like to have some references. This sounds too much like an advertisement. SiriusGrey 17:06, 18 April 2006 (UTC)
"In terms of freely available resources, the HTK book (and the accompanying HTK toolkit) is one place to start to both learn about speech recognition and to start experimenting (if you are very brave)."
The last bit about being brave seems kinda POV... sentence should be reworded? Ben Tibbetts 23:12, 10 May 2006 (UTC)
- "In this entry, we will the use of hidden Markov model (HMM) because notably it is very widely used in many systems. (Language modeling has many other applications such as smart keyboard and document classification; to the corresponding entries.)". In the section "Performance of speech recognition systems" is rather unclear. "In this entry"? Remains of the copy from the blog, possibly? Musically ut 15:11, 29 July 2007 (UTC)
some clean ups
I have cleaned up some of the descriptions and mathematics in the previous version, up to the section "HMM-based speech recognition". The previous description was biased towards commercial speech recognition, so it could easily mislead readers on some basic facts.
One thing I will point out is that a dictation engine can have high recognition performance, but the recognition rate varies from speaker to speaker. Notably, when users speak English with different accents, performance will usually not reach 98%. So that figure is generally more a claim than a fact.
Another thing I will point out is the mathematical explanation of the basic principle of speech recognition. The previous authors evidently confused the noisy channel formulation (the P(W|A) thing, if you don't know what I mean) with the HMM. I call this a mistake because, without the max term that appears when finding the best word sequence, it is actually not trivial to remove the term P(A). The theory of speech recognition is actually quite sophisticated, and the explanation in the old version lacks a certain mathematical rigor.
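For reference, the point about the max term can be written out. This is a sketch in the usual textbook notation, assuming W stands for a candidate word sequence and A for the acoustic evidence, as the P(W|A) above suggests; the symbol \hat{W} for the chosen word sequence is introduced here only for illustration.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

P(A) can be dropped in the last step only because it does not depend on W, so it cannot change which W attains the maximum. Without the arg max, obtaining P(W|A) itself would still require P(A), for instance by summing P(A|W) P(W) over all word sequences W.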
This version still falls short of a certain standard. For example, I strongly agree with the comments under "non-encyclopedic?" and "SimonSays": the former passage is really POV and the latter is really just an ad. I also disapprove of using Bill Gates's quote in the speech recognition article. (He is really not a researcher in speech recognition or any speech-related field at all.)
We also need more scholarly articles and references to support the content. Hopefully we can add them in future.
This article jumps into the technical issues without giving any context to the history of speech recognition advances over the years. In the intro there is a very short summary of applications, but there are no dates or names associated with them. I guess patent filings would give a good history of who did what when in this field. --DeweyQ 15:59, 22 July 2006 (UTC)
I agree that the history is very poor for this topic. For some reason, the content seems to imply that speech recognition software is limited to, or at least largely aimed at, medical applications. My awareness of the history of this idea is that it was going to "make keyboards redundant within X years" ... its failure to do so is in my opinion what the history should cover. 184.108.40.206 01:18, 7 November 2007 (UTC)
- "One of the most notable domains for the commercial application of speech recognition in the United States ..." another article which reads like ONLY the US have such technologies ... read i.e. the German version, and learn how Philips (from the Netherlands) took leadership on speech recognition. If I had a wish free at christmas: stop US-propaganda in Wikipedia. --220.127.116.11 (talk) 16:31, 3 November 2008 (UTC)
I also miss a history section on the topic! The German article, for example, covers it, and English WP articles with much less content usually feature a history section! --PutzfetzenORG (talk) 12:34, 11 February 2012 (UTC)
Redirected from Speech to text
Funnily enough, the Speech-to-text article was redirected but the Speech_to_text article wasn't. I have rectified this. rmccue 01:01, 23 July 2006 (UTC)
Broader Implications of Speech Recognition
This article focuses heavily on the "nuts and bolts" technical discussion of speech recognition technology, and gives scant coverage to the actual business uses, which are many. For example, this article could be expanded to include the impact phone-based speech reco systems have had on customer service (positive and negative). There's also an interesting movement to displace outdated and cumbersome touch-tone IVRs with speech-enabled systems.
Nezzo 15:29, 5 September 2006 (UTC)
No mention of the programs that one could use to perform speech recognition (e.g., NaturallySpeaking or ViaVoice)? Or how about cell phones and car navigation systems with voice commands? Personally, I thought this was lacking from this article. RobertM525 03:36, 12 October 2006 (UTC)
Yes, as a user I am more or less stunned that the article doesn't even mention Dragon NaturallySpeaking, which is clearly the best speech recognition program out there. I don't know of anyone who disagrees with that, but I would be happy to hear differently. 18.104.22.168 01:46, 5 December 2006 (UTC) Gene Venable 4 December 2006.
Well, that's a POV. In my experience, Dragon is not the best. Probably better to stay away from naming specific companies, although perhaps a reference list of the most well known is useful. A list of applications would be much more useful (such as the cell phones, or car navigation). NWebb 19:17, 31 January 2007 (UTC)
Removed a spam link (several times) to a website called ivrdictionary. This is a thinly veiled attempt to put advertising on Wikipedia. Links were added by several anonymous users within a tight IP range. Website purports to list ivr terminology, but in reality it prominently displays an advertisement to Angel dot com, which is a commercial company that sells IVR related products. The same links were added to other articles that are related to IVR technology. Calltech 16:56, 17 November 2006 (UTC)
Would anyone like to take a stab at making an overview for the following almost-the-same-thing subjects. I know someone round here is just salivating at the prospect of 4 slightly different articles saying the same things but I personally find the whole thing confusing and unrequired. Please put an overview.
--I'll bring the food 01:15, 26 November 2006 (UTC)
- I think your first three are pretty much all the same thing. Don't forget Automatic Speech Recognition, (ASR), voice recognition, Direct Voice Input, (DVI), voice command, speech interface, natural language processing, etc etc. Martinevans123 (talk) 00:31, 12 December 2007 (UTC)
- The latest edit by Tbutzon to the opening paragraph is very welcome. Am sure similar improvements could be made to the whole of this article. But the statement that ASR....."converts spoken words to text" is not strictly true. I agree in most applications it will, but a visual representation of the output from ASR, including standard graphemes, may or may not be produced, even when the recognition is successful. That's a system design/ HMI question. But I'm not sure how to correct this. Martinevans123 (talk) 09:27, 7 January 2008 (UTC)
I find it odd that this article has a section titled 'Books'. This section currently contains a single book on the subject (despite there being many) and a link to an online bookshop website where the book can be purchased. To me this reads like an advertisement, especially as the book presented seems possibly less specialised on the subject of speech recognition than many other books that are out there but of course are not all listed. Canderra 01:41, 18 January 2007 (UTC)
I think that VoxForge (www.voxforge.org) should be added as an external link (but I am the VoxForge maintainer, so I cannot add the link myself). LDC is listed, and it *sells* Speech Corpora. VoxForge is trying to create a free English speech corpus for use in creating acoustic models for open source speech recognition engines. We are similar to the BAS – Bavarian Archive for Speech Signals site, which provides a free database of spoken German, which also has an external link. Kmaclean 18:42, 10 May 2007 (UTC)
Vandalism here and else where from 22.214.171.124
Please note the "wes is gay ..." comment here. Look then at the history of changes and look at the other things this AC has changed recently. Vandalism pure and simple. —Preceding unsigned comment added by 126.96.36.199 (talk) 18:34, 4 October 2007 (UTC)
I would agree with User: Three-quarter-ten that the task of most ASR is to take an input of human vocal utterances and to deduce from them, by means of phonemes, syllables or word shapes, a series of "words" conforming to an expected syntax or natural language. But the visible (or audible) output is going to be a system design decision, e.g. the output might be where to route a bag in an airport baggage handling depot, might be a grade on a pupil progress chart, or might be a string of words spoken in a different language, i.e. not necessarily "written text" at all. Martinevans123 (talk) 23:12, 15 January 2008 (UTC)
- Very good point. I revised my edit from "converts spoken words to machine-readable text, that is, to a string of character codes" to "converts spoken words to machine-readable input, for example, to a string of character codes". I guess that the irreducible common denominator, regardless of the system, is that the output of speech recognition is a binary string to be used as some sort of input. With a string of character codes to be input into a Word document being a very archetypical example. Thanks! — ¾-10 00:33, 16 January 2008 (UTC)
- Binary code certainly, but I'm not sure where "character codes" come in. That's usually the output of an alphabetic keyboard/ keypad, which the ASR usually circumvents entirely. Martinevans123 (talk) 00:46, 16 January 2008 (UTC)
- I see what you mean. From the point of view of the average user, what they "put in" is speech and what they "get out" is a string of character codes, which is to say, "typed" text in their Word document. I'm going to try another revision: "converts spoken words to machine-readable input (for example, to the binary code for a string of character codes)". That's a little dense for the lay reader, but it's more accurate than what I had before. If anybody has any ideas for yet a better phrasing (accurate but lay-friendly), feel free. — ¾-10 01:17, 16 January 2008 (UTC)
ADD LINK TO VOICE DATA COLLECTION?
It is my first time editing here and I just added new text requesting people to donate their voice for the development of this technology in a number of European languages. Please advise if this is okay or not. If not, please remove and notify. Thanks, Helga
- Solicitations of any kind are not proper for inclusion in Wikipedia articles. OccamzRazor (talk) 00:49, 11 May 2008 (UTC)
Shameful anti-Gates bias!
The article makes no mention of Windows Vista, even though it has some of the most advanced spoken command recognition and speech dictation capabilities among affordable home and office environments. 188.8.131.52 (talk) 14:34, 27 May 2008 (UTC)
- And? Just add it - better than howling around here. By the way, those "capabilities" may work in English, maybe ... maybe ... it is not working in the German Vista version.
There is no mention of the expected future for speech recognition. Any studies on the expansion of markets, acceptability of users, rate of increase of robustness to noise, etc?
It has been stated that speech recognition machines may exceed a human's understanding by the year 2012. Can anyone confirm? —Preceding unsigned comment added by 184.108.40.206 (talk) 05:40, 8 June 2008 (UTC)
Merge with Speech-to-Text Reporter
Speech-to-text currently redirects to this article, however Speech-to-Text Reporter is not even mentioned on this article. Speech-to-text reporting is obviously a subset of speech recognition, but as the corresponding article presents no in-text citations and relatively little information anyway, I recommend a merge rather than an inset summary. Neelix (talk) 23:15, 10 November 2008 (UTC)
Agree. Martinevans123 (talk) 20:00, 11 November 2008 (UTC)
- Completely disagree. Speech to text describes the process whereby people (stenographers or palantypists) create a live syllabic feed that a computer compares to a dictionary to display words on screen. The computer does not recognise directly what the speaker is saying; it is the human operator who does. —Preceding unsigned comment added by 220.127.116.11 (talk) 15:44, 7 April 2009 (UTC)
- You seem to be simply suggesting that because the computer speech recognition is not optimal in this case, the operator (or a second operator) has to check the output word-by-word before it is saved? But isn't that true for many, if not most, applications which have a visual HMI component? Martinevans123 (talk) 16:12, 7 April 2009 (UTC)
- But reading Speech-to-Text Reporter I now see that it is, as you say, basically a real-time audio-typing function, albeit with a special keyboard. It thus seems to have nothing to do with computer speech recognition and I have to change my view to "Completely disagree" also. Apologies for missing the point first time here. It may be a "subset of speech recognition", but only insofar as is any natural human speech recognition. Martinevans123 (talk) 17:48, 16 April 2009 (UTC)
Proposed addition: LumenVox Speech Engine
Full disclosure: I am employed by LumenVox, which sells a commercial automatic speech recognizer. I make this disclosure in compliance with the WP:SCOIC guidelines. While I am a longtime user of Wikipedia, I have no experience as a contributor and would appreciate any help or guidance from editors.
Essentially I would like to suggest that our product, the LumenVox Speech Engine, be added to the list of "Commercial software/middleware" in this page. A description of the product can be found on our Web site at http://www.lumenvox.com/products/speech_engine/
I am not sure precisely what sorts of third-party references are needed to justify inclusion into this list, but if any editors can supply me with the type of references that would be required to justify notability, I can happily provide references. I don't see any references for the other applications in the list. I do believe the product (and the company) meet the notability guidelines.
- It looks like someone added this in the meantime. Is it accurate? -- kenb215 talk 20:18, 14 April 2009 (UTC)
Speech-to-Text vs Text-to-Speech
Aren't those 2 different things? Speech-to-text is like Dragon Naturally Speaking whereas Text-to-Speech would be the email readers or the speech reader for the Kindle 2. Shouldn't that be clarified? Harriska2 (talk) —Preceding undated comment added 16:39, 16 April 2009 (UTC).
Conflation Error, Corrected?
An individual speaker can make a speech with their voice, but isn't always the owner of the speech that they have just spoken; a speech can be written down on paper in words (or binary), and repeated.
A voice is a completely different thing, a voice is what an individual generates in their mind, and owns, even before it leaves their head (even if it's then recorded, and finally repeated elsewhere afterwards, as a speech / waveform).
A voice might be pre-molecular in origin, it might not be, but might also be beyond any systematic explanation, so what, it's still a voice (just like a tree is a tree).
The semantics, the lack of scientific categorisation, and use of terms, or lack of, has completely messed-up this article.
Wiki needs definitive linked definitions to the following:
Voice recognition, as a generic term (with links to the following two types of voice recognition).
Speaker recognition, as a means of voice evaluation / verification, etc..
- Thanks Ronz. For the benefit of anyone feeling that the lists should be restored, the See Also section has wikilinks to List of speech recognition software and Speech recognition in Linux which cover most of the notable ones without turning the main article into a link farm Kiore (talk) 11:44, 13 July 2009 (UTC)
Request for Review of Potential New Article: LumenVox
I am an employee of a company that I believe deserves an article on Wikipedia, but I am reluctant to post the article myself due to my obvious conflict of interest (I believe in the past my company had some employees post articles which were then deleted). It was previously suggested to me that I write a version of it in my user space and ask for it to be reviewed and eventually created by other editors. I have written a draft of the article at Stephen Keller (talk) 00:04, 6 March 2010 (UTC) and would like feedback on whether it is sufficiently NPOV, researched, and if it meets the notability guidelines. Any help is appreciated.
Update Links (Euro Fighter)
Reference 3 is a dead link, isn't it? (Euro Fighter) I don't know the original page, but my suggestion is: http://www.eurofighter.com/capabilities/technology/voice-throttle-stick/direct-voice-input.html —Preceding unsigned comment added by 18.104.22.168 (talk) 14:00, 10 May 2010 (UTC)
The discussion of current research is controversial, provocative and not supported by citation. The implication that funding for speech recognition has been reduced since 2001 is factually false. The DARPA projects EARS and GALE were both very well funded.
It is also misleading to say that performance has plateaued. Large vocabulary speech recognition has always been a very difficult task. Most of the performance improvement over the last 35 years has been due to the steady accumulation of small incremental improvements. The published results from the EARS and GALE research continue this trend. In fact the EARS project was noted for a reduction in error rate on certain tasks of nearly a factor of two, one of the largest single-year improvements in speech recognition performance.
It is true that DARPA has changed focus in that GALE no longer supports recognition of English, but only of Mandarin and Arabic. However, it has always been DARPA practice in speech recognition funding to continually shift focus to harder tasks as sufficient progress is made on the earlier tasks. It is a sign of success, not of failure. Jay Page (talk) 16:42, 24 July 2010 (UTC)
- It is true that DARPA is no longer funding this research. But then other entities should carry on this research. DARPA is always at the forefront of research. According to this article the situation hasn't changed much from 2001 to 2006. http://robertfortner.posterous.com/the-unrecognized-death-of-speech-recognition If there are other results you should write them up with the bibliography. —Preceding unsigned comment added by 22.214.171.124 (talk) 10:54, 4 November 2010 (UTC)
An error in the History section
There seems to be an error in the first paragraph of the History section: it says the IBM Shoebox was first exhibited at New York's World's Fair of 1964, even though the article for the IBM Shoebox states, just as the official IBM site does, that it was shown to the public in 1962 at the Seattle World's Fair.
Excuse me if my way of speaking is confusing, English is not my first language and I haven't practiced it in years. —Preceding unsigned comment added by 126.96.36.199 (talk) 05:06, 12 October 2010 (UTC)
Quality & refs
I am not happy with the references (just 8, 3 of which are irrelevant) or the overall quality of the article. It talks a lot about applications, but does not really deal with the technical challenges etc. I would say a 70% rewrite is needed. Not that I can pay attention to it now, but it should be suggested for someone in the field to rewrite. History2007 (talk) 13:32, 7 February 2011 (UTC)