Speech analytics

From Wikipedia, the free encyclopedia

Speech analytics is the process of analyzing recorded calls to gather information. It brings structure to customer interactions and exposes information buried in customer contact center interactions with an enterprise.[1] Although it often includes elements of automatic speech recognition, where the identities of spoken words or phrases are determined, it may also include analysis of one or more of the following:

  • the topic(s) being discussed
  • the emotional character of the speech
  • the amount and locations of speech versus non-speech (e.g. call hold time or periods of silence)
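
As a rough illustration of the last item, the share of a call spent in speech versus non-speech can be computed once the audio has been segmented. The sketch below assumes a hypothetical upstream step has already labelled each segment; the labels and the `(label, start, end)` tuple layout are illustrative assumptions, not a standard format.

```python
def time_shares(segments):
    """Return the fraction of total call time spent under each label.

    `segments` is a list of (label, start_seconds, end_seconds) tuples,
    assumed to come from a hypothetical voice-activity detection step.
    """
    totals = {}
    for label, start, end in segments:
        totals[label] = totals.get(label, 0.0) + (end - start)
    duration = sum(totals.values())
    if duration == 0:
        return {}
    return {label: seconds / duration for label, seconds in totals.items()}


# Illustrative call: 30 s of talk, 60 s on hold, 30 s of talk.
segments = [("speech", 0.0, 30.0), ("hold", 30.0, 90.0), ("speech", 90.0, 120.0)]
shares = time_shares(segments)
```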

One use of speech analytics applications is to spot spoken keywords or phrases, either as real-time alerts on live audio or as a post-processing step on recorded speech. This technique is also known as audio mining. Other uses include categorization of speech, for example in the contact center environment, to identify calls from unsatisfied customers.
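
Keyword and phrase spotting as a post-processing step can be sketched as follows. This is a minimal example that assumes a recognizer has already produced a time-stamped word stream; the `Word` type and the sample transcript are hypothetical and stand in for whatever a real engine returns.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds from the start of the recording


def spot_phrases(words, phrases):
    """Return (phrase, start_time) for every occurrence of each phrase."""
    tokens = [w.text.lower() for w in words]
    hits = []
    for phrase in phrases:
        target = phrase.lower().split()
        if not target:
            continue  # ignore empty phrases
        n = len(target)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                hits.append((phrase, words[i].start))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical recognizer output for a short utterance.
transcript = [
    Word("i", 0.0), Word("want", 0.4), Word("to", 0.7),
    Word("cancel", 0.9), Word("my", 1.4), Word("account", 1.6),
]
hits = spot_phrases(transcript, ["cancel my account", "supervisor"])
```

A real-time alerting system would apply the same matching to a sliding window of the most recent words rather than to a finished transcript.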

Speech analytics in contact centers can be used to extract critical business intelligence that would otherwise be lost. By analyzing and categorizing recorded phone conversations between companies and their customers, useful information can be discovered relating to strategy, product, process, operational issues and contact center agent performance. This information gives decision-makers insight into what customers really think about their company so that they can quickly react. In addition, speech analytics can automatically identify areas in which contact center agents may need additional training or coaching, and can automatically monitor the customer service provided on calls.


Speech analytics systems use three main approaches "under the hood": the phonetic approach; large-vocabulary continuous speech recognition (LVCSR, more commonly known as speech-to-text, full transcription or automatic speech recognition, ASR); and direct phrase recognition.

Some speech analytics vendors use a third-party recognition "engine", while others have developed their own proprietary engine.


The phonetic approach is the fastest for processing, mostly because the size of the grammar is very small. The basic recognition unit is a phoneme. Most languages have only a few tens of unique phonemes, and the output of this recognition is a stream (text) of phonemes, which can then be searched.
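
Searching a phoneme stream can be illustrated with a simple sliding-window match. This is only a sketch: the ARPAbet-style symbols, the sample pronunciations and the `max_mismatches` tolerance are assumptions made for illustration, and production phonetic engines use more elaborate scoring.

```python
def find_phoneme_matches(stream, query, max_mismatches=0):
    """Return start indices where `query` matches `stream`,
    allowing up to `max_mismatches` substituted phonemes."""
    n = len(query)
    if n == 0:
        return []
    matches = []
    for i in range(len(stream) - n + 1):
        window = stream[i:i + n]
        mismatches = sum(1 for a, b in zip(window, query) if a != b)
        if mismatches <= max_mismatches:
            matches.append(i)
    return matches


# Hypothetical phoneme stream for "I want to cancel".
stream = "AY W AA N T T UW K AE N S AH L".split()

# Exact search for the phonemes of "cancel".
exact = find_phoneme_matches(stream, "K AE N S AH L".split())

# Tolerant search with one vowel recognized differently.
fuzzy = find_phoneme_matches(stream, "K EH N S AH L".split(), max_mismatches=1)
```

Allowing a small number of mismatches trades some precision for robustness against recognition errors in individual phonemes.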

LVCSR (large-vocabulary continuous speech recognition)

LVCSR is much slower to process because the basic unit is a set of words (bi-grams, tri-grams, etc.), and the recognizer needs a vocabulary of hundreds of thousands of words to match the audio against. The output, however, is a stream of words, making it richer to work with. It can surface new business issues, queries are much faster, and the accuracy is higher than with the phonetic approach[citation needed]. Most importantly, because the complete semantic context is in the index, it is possible to find and focus on business issues very rapidly.
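
One reason word-level output supports fast queries is that transcripts can be indexed once and then looked up by word. The following is a minimal sketch of an inverted index over transcripts; the call identifiers and transcript text are invented for illustration, and real systems index far richer data (timestamps, confidence scores, speaker labels).

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each word to the set of call ids whose transcript contains it."""
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in text.lower().split():
            index[word].add(call_id)
    return index


def search(index, query):
    """Return the call ids that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical transcripts keyed by call id.
transcripts = {
    "call-001": "i would like to cancel my account",
    "call-002": "my bill is wrong again",
    "call-003": "please cancel the order",
}
index = build_index(transcripts)
cancel_calls = search(index, "cancel")
cancel_account_calls = search(index, "cancel account")
```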

Direct phrase recognition

Rather than first converting speech into phonemes or text, this approach directly analyzes speech, looking for specific phrases that have been pre-defined as being important to the business. Because no data is lost in conversion using this approach, the results of this method generally provide the highest data reliability[citation needed].

Extended speech emotion recognition and prediction

One proposed approach to speech emotion recognition and prediction uses a set of classifiers based on three main classifiers: kNN, C4.5 and an SVM with an RBF kernel. This set achieves better performance than each basic classifier taken separately. It was compared with two other sets of classifiers: a one-against-all (OAA) multiclass SVM with hybrid kernels, and a set consisting of two basic classifiers, C5.0 and a neural network. The proposed variant achieved better performance than the other two sets of classifiers.[2]


Making a meaningful comparison of the accuracy of different speech analytics systems can be difficult. The output of LVCSR systems can be scored against reference word-level transcriptions to produce a value for the word error rate (WER), but because phonetic systems use phones as the basic recognition unit, rather than words, comparisons using this measure cannot be made.
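
Word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the system output and the reference transcription, divided by the number of reference words. A minimal sketch, assuming whitespace-tokenized text with no further normalization:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")

    # Levenshtein distance over words, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of four reference words.
wer = word_error_rate("cancel my account please", "cancel the account please")
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.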

When speech analytics systems are used to search for spoken words or phrases, what matters to the user is the accuracy of the search results that are returned. Because the impact of individual recognition errors on these search results can vary greatly, measures such as word error rate are not always helpful in determining overall search accuracy from the user perspective.

Measures such as precision and recall, commonly used in the field of information retrieval, are typical ways of quantifying the response of a speech analytics search system.[3] Precision measures the proportion of search results that are relevant to the query. Recall measures the proportion of the total number of relevant items that were returned by the search results. Where a standardised test set has been used, measures such as precision and recall can be used to directly compare the search performance of different speech analytics systems.

These measures of accuracy can be illustrated by the following example. Imagine a user searches a set of audio files for a specific phrase, and the search returns 10 files. If 9 of the 10 search results do in fact contain the search phrase, the precision is 90% (9 out of 10). If the total number of files that actually contain the phrase is 18 then the recall is 50% (9 out of 18).
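
The same arithmetic can be written as a short function. The file names below are placeholders chosen to reproduce the example above (18 files contain the phrase, 10 are returned, 9 of them correctly).

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of search results."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# 18 files actually contain the phrase.
relevant = {f"call-{i:02d}" for i in range(1, 19)}
# The search returns 9 of them plus 1 file that does not contain it.
returned = {f"call-{i:02d}" for i in range(1, 10)} | {"call-99"}

precision, recall = precision_recall(returned, relevant)
```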

Data reliability

According to the US Government Accountability Office,[4] “data reliability refers to the accuracy and completeness of computer-processed data, given the uses they are intended for.” In speech recognition and analytics, “completeness” is measured by the “detection rate”, and usually as accuracy goes up, the detection rate goes down[citation needed].

Business value

Speech analytics gleans intelligence from thousands, or even millions, of customer calls so that managers can take quick action. Contact centers record customer conversations, but the sheer number of recordings can exceed the ability to review and analyze them. Speech analytics solutions can mine recorded customer interactions to surface the intelligence essential for building effective cost containment and customer service strategies. Used in combination with other workforce optimization suite components such as quality monitoring and agent scorecards, speech analytics can pinpoint cost drivers, trends, and opportunities; identify strengths and weaknesses in processes and products; and help an organization understand how the marketplace perceives its offerings.

Speech analytics is designed with the business user in mind. It can provide automated trend analysis to show what is happening in contact centers. The solution can isolate the words and phrases used most frequently within a given time period and indicate whether usage is trending up or down. This information makes it easy for supervisors, analysts, and others in the organization to spot changes in consumer behavior and take action to reduce call volumes and increase customer satisfaction.
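
A simple form of this trend analysis compares word counts between two time periods. The sketch below is illustrative only: the transcripts are invented, words are split on whitespace, and no stop-word filtering or phrase detection is applied, all of which a real product would likely handle differently.

```python
from collections import Counter


def word_trends(previous_period, current_period, top_n=3):
    """Return (word, current_count, direction) for the most frequent
    words in the current period, compared with the previous period."""
    previous = Counter(w for text in previous_period for w in text.lower().split())
    current = Counter(w for text in current_period for w in text.lower().split())
    trends = []
    for word, count in current.most_common(top_n):
        before = previous[word]  # Counter returns 0 for unseen words
        if count > before:
            direction = "up"
        elif count < before:
            direction = "down"
        else:
            direction = "flat"
        trends.append((word, count, direction))
    return trends


# Hypothetical transcripts from two consecutive weeks.
last_week = ["refund please", "billing question"]
this_week = ["refund refund billing", "refund outage"]

trends = word_trends(last_week, this_week)
directions = {word: direction for word, _, direction in trends}
```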

References


  1. "The Why Factor in Speech Analytics About". Destination CRM. Retrieved 2013-10-30.
  2. S.E. Khoruzhnikov; et al. (2014). "Extended speech emotion recognition and prediction". Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 14 (6): 137.
  3. C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Chapter 8.
  4. "Assessing the Reliability of Computer-Processed Data" (PDF). United States General Accounting Office.