MUSHRA stands for MUltiple Stimuli with Hidden Reference and Anchor. It is a methodology for the subjective evaluation of audio quality, typically used to assess the perceived quality of the output of lossy audio compression algorithms. It is defined by ITU-R Recommendation BS.1534-3. The MUSHRA methodology is recommended for assessing "intermediate audio quality". For very small audio impairments, Recommendation ITU-R BS.1116-3 (ABC/HR) is recommended instead.
The main advantage over the Mean Opinion Score (MOS) methodology, which serves a similar purpose, is that MUSHRA requires fewer participants to obtain statistically significant results. This is because all codecs are presented at the same time, on the same samples, so that a paired t-test or a repeated-measures ANOVA can be used for statistical analysis. The 0-100 scale also makes it possible to express very small differences.
In MUSHRA, the listener is presented with the reference (labeled as such), a certain number of test samples, a hidden version of the reference and one or more anchors. The recommendation specifies that a low-range and a mid-range anchor should be included in the test signals. These are typically a 3.5 kHz (low-range) and a 7 kHz (mid-range) low-pass version of the reference. The purpose of the anchors is to bring the scale closer to an "absolute scale", ensuring that minor artifacts are not rated as having very bad quality. This is particularly important when comparing or pooling results from different labs.
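The benefit of rating all codecs on the same samples can be illustrated with a small sketch. The scores below are invented for illustration, and the helper names `paired_t_statistic` and `welch_t_statistic` are hypothetical; the code only computes the test statistics with the Python standard library and is not part of the recommendation.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test.

    scores_a[i] and scores_b[i] must come from the same listener
    rating the same test item under two different codecs.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired data needs equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


def welch_t_statistic(scores_a, scores_b):
    """Unpaired (Welch) t statistic, as if the groups were independent."""
    se = math.sqrt(
        statistics.variance(scores_a) / len(scores_a)
        + statistics.variance(scores_b) / len(scores_b)
    )
    return (statistics.mean(scores_a) - statistics.mean(scores_b)) / se


# Hypothetical MUSHRA scores: six listeners, one item, two codecs.
codec_a = [82, 75, 90, 68, 77, 85]
codec_b = [74, 70, 81, 60, 72, 76]

t_paired, df = paired_t_statistic(codec_a, codec_b)
t_unpaired = welch_t_statistic(codec_a, codec_b)
print(f"paired t = {t_paired:.2f} (df = {df}), unpaired t = {t_unpaired:.2f}")
```

With these made-up numbers the paired statistic is several times larger than the unpaired one, because pairing removes the variation between listeners and leaves only the consistent difference between the codecs. A real analysis would also derive p-values and typically use a statistics package.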
Listeners in MUSHRA-Tests
Both MUSHRA and ITU-R BS.1116 tests call for trained expert listeners who know what typical artifacts sound like and where they are likely to occur. Expert listeners have also internalized the rating scale better, which leads to better retest reliability than with untrained listeners. Thus, fewer listeners are needed to achieve significant results.
It is assumed that preferences are similar for expert listeners and naive listeners, and thus that the results of expert listeners are also predictive for consumers. In agreement with this assumption, Schinkel-Bielefeld et al. found no differences in rank order between expert listeners and untrained listeners when using test signals containing only timbre artifacts and no spatial artifacts. However, Rumsey et al. showed that for signals containing spatial artifacts, expert listeners weight spatial artifacts slightly more heavily than untrained listeners, who primarily focus on timbre artifacts.
In addition, it has been shown that expert listeners make more use of the possibility to listen repeatedly to smaller sections of the signals under test, and that they compare more between the signals under test and the reference. This can be seen as a shift from preference rating to audio quality rating, that is, rating the differences between the signal under test and the uncompressed original, which is the actual goal of a MUSHRA test.
Pre- or Postscreening
The MUSHRA guideline mentions several possibilities to assess the reliability of a listener.
The easiest and most common is to discard listeners who rate the hidden reference below 90 MUSHRA points for more than 15 percent of all test items. Because the hidden reference is identical to the reference, it should be rated with 100 MUSHRA points, so a lower rating is a mistake. While it can happen that the hidden reference and a high-quality signal are confused, a rating lower than 90 should only be given when the listener is certain that the signal is different from the original.
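The hidden-reference rule described above can be sketched in a few lines. The function name `flag_unreliable_listeners`, the data layout and the example ratings are assumptions made for illustration; only the thresholds (below 90 points, more than 15 percent of items) come from the text.

```python
def flag_unreliable_listeners(hidden_ref_scores, min_score=90, max_fraction=0.15):
    """Return the listeners to discard under the hidden-reference rule.

    hidden_ref_scores maps a listener ID to that listener's ratings of
    the hidden reference, one rating per test item (assumed layout).
    A listener is flagged when more than `max_fraction` of those
    ratings fall below `min_score`.
    """
    rejected = []
    for listener, scores in hidden_ref_scores.items():
        if not scores:
            continue  # no data for this listener, nothing to judge
        low_count = sum(1 for score in scores if score < min_score)
        if low_count / len(scores) > max_fraction:
            rejected.append(listener)
    return rejected


# Hypothetical hidden-reference ratings for ten test items per listener.
ratings = {
    "L01": [100, 100, 95, 100, 98, 100, 100, 92, 100, 100],   # no rating below 90
    "L02": [100, 85, 100, 100, 100, 100, 100, 100, 100, 100],  # 1 of 10 below 90
    "L03": [100, 80, 70, 100, 100, 100, 88, 100, 100, 100],    # 3 of 10 below 90
}

print(flag_unreliable_listeners(ratings))
```

In this example only `L03` exceeds the 15 percent limit; `L02` made a single mistake in ten items and is kept.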
The other possibility to assess a listener’s performance is eGauge, a framework based on the analysis of variance. It computes Agreement, Repeatability and Discriminability, though only the latter two are recommended for pre- or postscreening. Agreement analyses how well a listener agrees with the rest of the listeners. Repeatability compares the variance when rating the same test signal again with the variance across the other test signals, and Discriminability analyses whether listeners can distinguish between test signals of different conditions. As eGauge requires listening to every test signal twice, it takes more effort to apply than postscreening listeners based on the hidden references. However, if a listener has proven reliable in eGauge, he or she can also be considered a reliable listener for future listening tests, provided the character of the test does not change. For example, a reliable listener for stereo listening tests is not necessarily equally good at perceiving artifacts in 5.1 or 22.2 format test items.
It is important to choose critical test items, that is, test items which are difficult to encode and are likely to produce artifacts. At the same time, the test items should be ecologically valid, meaning they should be representative of broadcast material and not synthetic signals specially designed to be difficult to encode. A method to choose critical material is presented by Ekeroot et al., who propose a ranking-by-elimination procedure. While this is a good way to choose the most critical test items, it does not ensure that a variety of test items prone to different artifacts is included.
Ideally, the character of a MUSHRA test item should not change too much over the whole duration of that item. Otherwise, it can be difficult for the listener to decide on a rating if some parts of the item display different or stronger artifacts than others. Shorter items often lead to less variability than longer ones, as they are more stationary. However, even when trying to choose stationary items, ecologically valid stimuli will very often have sections that are slightly more critical than the rest of the signal. Thus, listeners who focus on different sections of the signal may evaluate it differently. In this case, more critical listeners seem to be better at identifying the most critical regions of a stimulus than less critical listeners.
Language of Test Items
In ITU-T P.800 tests, which are commonly used to evaluate telephone-quality codecs, the tested speech items should always be in the native language of the listeners. This is not necessary in MUSHRA tests. A study with Mandarin Chinese and German listeners found no significant difference between ratings of foreign-language and native-language test items. However, listeners needed more time and compared more when evaluating the foreign-language items. It therefore seems that listeners compensate for any difficulties they may have in rating foreign-language items. Such compensation is not possible in ITU-T P.800 ACR tests, where items are heard only once and no comparison to the reference is possible. There, foreign-language items are rated as being of lower quality when the listeners’ language proficiency is low.
- ITU-R recommendation BS.1534
- ITU-R BS.1116 (February 2015). "Methods for the subjective assessment of small impairments in audio systems".
- Schinkel-Bielefeld, N.; Lotze, N.; Nagel, F. (May 2013). "Audio quality evaluation by experienced and inexperienced listeners". The Journal of the Acoustical Society of America. 133 (5): 3246.
- Rumsey, Francis; Zielinski, Slawomir; Kassier, Rafael; Bech, Søren (2005-05-31). "Relationships between experienced listener ratings of multichannel audio quality and naïve listener preferences". The Journal of the Acoustical Society of America. 117 (6): 3832–3840. ISSN 0001-4966. doi:10.1121/1.1904305.
- Lorho, Gaëtan; Le Ray, Guillaume; Zacharov, Nick (2010-06-13). "eGauge—A Measure of Assessor Expertise in Audio Quality Evaluations". Proceedings of the Audio Engineering Society. 38th International Conference on Sound Quality Evaluation.
- Ekeroot, Jonas; Berg, Jan; Nykänen, Arne (2014-04-25). "Criticality of Audio Stimuli for Listening Tests – Listening Durations During a Ranking Task". 136th Convention of the Audio Engineering Society.
- Neuendorf, Max; Nagel, Frederik (2011-10-19). "Exploratory Studies on Perceptual Stationarity in Listening Test - Part I: Real World Signals from Custom Listening Tests".
- Nagel, Frederik; Neuendorf, Max (2011-10-19). "Exploratory Studies on Perceptual Stationarity in Listening Test - Part II: Synthetic Signals with Time Varying Artifacts".
- Schinkel-Bielefeld, Nadja (2017-05-11). "Audio Quality Evaluation in MUSHRA Tests–Influences between Loop Setting and a Listeners’ Ratings". 142nd Convention of the Audio Engineering Society.
- ITU-T P.800 (August 1996). "P.800 : Methods for subjective determination of transmission quality".
- Schinkel-Bielefeld, Nadja; Zhang, Jiandong; Qin, Yili; Leschanowsky, Anna Katharina; Fu, Shanshan (2017-05-11). "Is it Harder to Perceive Coding Artifact in Foreign Language Items? – A Study with Mandarin Chinese and German Speaking Listeners".
- Blašková, Lubica; Holub, Jan (2008). "How do Non-native Listeners Perceive Quality of Transmitted Voice?" (PDF). Communications. 10.4: 11–15.
- RateIt: A GUI for performing MUSHRA experiments
- MUSHRAM - A Matlab interface for MUSHRA listening tests
- A Max/MSP interface for MUSHRA listening tests
- A Browser Based Audio Evaluation Tool, for running many different tests including MUSHRA - No coding needed
- mushraJS+Server: based on mushraJS with mochiweb, an Erlang web server