Mean opinion score
Mean opinion score (MOS) is a test that has been used for decades in telephony networks to obtain the human user's view of the quality of the network. Historically, as the word opinion in its name implies, MOS was a subjective measurement in which listeners sat in a "quiet room" and scored call quality as they perceived it. ITU-T recommendation P.800 specifies the conditions: "The talker should be seated in a quiet room with volume between 30 and 120 m³ and a reverberation time less than 500 ms (preferably in the range 200-300 ms). The room noise level must be below 30 dBA with no dominant peaks in the spectrum." For Voice over IP (VoIP), MOS is obtained more objectively, as a calculation based on the performance of the IP network over which the call is carried. The calculation is defined in ITU-T recommendation P.862 (PESQ). Like most standards, its implementation is somewhat open to interpretation by the equipment or software manufacturer. Moreover, because of technological progress by phone manufacturers, a calculated MOS of 3.9 in a VoIP network may actually sound better than a formerly subjective score above 4.0.
In multimedia (audio, voice telephony, or video), especially when codecs are used to reduce the bandwidth requirement (for example, of a digitized voice connection from the standard 64 kbit/s PCM), the MOS provides a numerical indication of the quality of the received media, as perceived by users, after compression and/or transmission. The MOS is expressed as a single number in the range 1 to 5, where 1 is the lowest perceived audio quality and 5 is the highest.
The MOS is generated by averaging the results of a set of standard, subjective tests where a number of listeners rate the heard audio quality of test sentences read aloud by both male and female speakers over the communications medium being tested. A listener is required to give each sentence a rating using the following rating scheme:
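|MOS||Quality||Impairment|
|5||Excellent||Imperceptible|
|4||Good||Perceptible but not annoying|
|3||Fair||Slightly annoying|
|2||Poor||Annoying|
|1||Bad||Very annoying|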
The MOS is the arithmetic mean of all the individual scores, and can range from 1 (worst) to 5 (best).
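The averaging step can be sketched in a few lines of Python. This is a minimal illustration, assuming each listener's rating is recorded as a whole number from 1 to 5; the function name `mean_opinion_score` and the sample ratings are hypothetical and are not taken from any standard or test campaign.

```python
from statistics import mean

# Quality labels of the five-point listening scale (1 = worst, 5 = best).
QUALITY_LABELS = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}


def mean_opinion_score(ratings):
    """Return the arithmetic mean of individual opinion scores (each 1..5)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        # Reject anything that is not one of the five scale values.
        if r not in QUALITY_LABELS:
            raise ValueError(f"rating {r!r} is not on the 1 to 5 scale")
    return mean(ratings)


# Hypothetical ratings from eight listeners for one test condition.
sample_ratings = [4, 4, 3, 5, 4, 3, 4, 5]
mos = mean_opinion_score(sample_ratings)
print(f"MOS = {mos:.2f}")  # MOS = 4.00
```

Because the result is a mean, it is generally not a whole number, which is why scores such as 3.9 or 4.1 appear even though each listener chooses from only five values.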
Compressor/decompressor (codec) systems and digital signal processing (DSP) are commonly used in voice communications and can be configured to conserve bandwidth, but there is a trade-off between voice quality and bandwidth conservation. The best codecs conserve a large amount of bandwidth with relatively small degradation of voice quality. Bandwidth can be measured quantitatively, whereas voice quality requires human interpretation, although automatic test systems can estimate it.
A similar process can be used to evaluate subjective video quality.
As an example, the following are mean opinion scores for one implementation of different codecs:
[Table: codec and mean opinion score; rows not preserved]
One consideration when planning a VoIP deployment is the bandwidth usage of a particular codec versus its potential MOS. For example, G.711, with a bit rate of 64 kbit/s, achieves a maximum MOS of 4.1, whereas G.729, with a much lower bit rate of 8 kbit/s, can achieve a MOS of 3.9. G.729 is "compressed eight times smaller than G.711 while sounding almost as good."
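The trade-off can be made concrete with a small calculation that uses only the figures quoted above (64 kbit/s and MOS 4.1 for G.711, 8 kbit/s and MOS 3.9 for G.729). This is a sketch rather than a planning tool: the `Codec` class and `compare` helper are hypothetical names, and packet header overhead, which affects real bandwidth usage, is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    name: str
    bit_rate_kbps: float  # codec bit rate, excluding packet overhead
    mos: float            # MOS figure quoted in the text


def compare(reference: Codec, candidate: Codec):
    """Return (bit-rate ratio, MOS drop) of candidate relative to reference."""
    ratio = reference.bit_rate_kbps / candidate.bit_rate_kbps
    mos_drop = reference.mos - candidate.mos
    return ratio, mos_drop


g711 = Codec("G.711", 64, 4.1)
g729 = Codec("G.729", 8, 3.9)

ratio, mos_drop = compare(g711, g729)
print(f"{g729.name} uses 1/{ratio:.0f} of the bit rate of {g711.name} "
      f"for a MOS drop of about {mos_drop:.1f}")
```

With these inputs the bit-rate ratio is 8, matching the "eight times smaller" description, while the MOS difference is about 0.2 on the five-point scale.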
A drawback of obtaining MOS estimations is that the process can be time-consuming and expensive, because it requires hiring experts to make the estimations. When a voice coding system is under development, or a developer has to test and compare several audio systems, it is very important to be able to make a quick check.
Examples of test sentences used in listening tests:
- You will have to be very quiet.
- There was nothing to be seen.
- They worshipped wooden idols.
- I want a minute with the inspector.
- Did he need any money?
An empirical formula exists for estimating the MOS from the packet loss, in percent, and the voice payload per packet, in milliseconds.
This formula is an empirical fit to tests run on voice over IP over ATM networks, carried out by Yamamoto and Beerends of KPN Research. It applies to other IP transports as well.
- Subjective video quality
- MUSHRA ITU BS.1534 Recommendation
- PSQM Perceptual Speech Quality Measure (ITU-T P.861 - withdrawn and replaced with PESQ ITU-T P.862)
- PESQ Perceptual Evaluation of Speech Quality, a mechanism for the automated assessment of the speech quality experienced by the user of a telephone system. It is standardised as ITU-T recommendation P.862 (02/01).
- POLQA Perceptual Objective Listening Quality Assessment, which is replacing PESQ. It is standardised as ITU-T recommendation P.863.
- PEVQ Perceptual Evaluation of Video Quality, a measurement algorithm for the automated assessment of video quality.
- PEAQ Perceptual Evaluation of Audio Quality, a measurement algorithm for the automated assessment of audio quality.
- Absolute Category Rating
- http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.5576 Impact of network performance parameters on the end-to-end perceived speech quality