An ABX test is a method of comparing two choices of sensory stimuli to identify detectable differences between them. A subject is presented with two known samples (sample A, the first reference, and sample B, the second reference) followed by one unknown sample X that is randomly selected from either A or B. If X cannot be identified reliably with a low p-value in a predetermined number of trials, then the null hypothesis cannot be rejected and it cannot be proven that there is a perceptible difference between A and B.
ABX tests can easily be performed as double-blind trials, eliminating any possible unconscious influence from the researcher or the test supervisor. Because samples A and B are provided just prior to sample X, the subject does not have to rely on long-term memory or past experience to discern a difference. Thus, the ABX test answers whether or not, under ideal circumstances, a perceptual difference can be found.
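The trial procedure described above can be sketched as a small simulation. This is only an illustration under stated assumptions: `run_abx_session`, `perfect`, and `guesser` are hypothetical names, and the subject is modelled as a callback that receives the hidden label rather than an actual stimulus.

```python
import random

def run_abx_session(n_trials, respond, rng=None):
    """Run n_trials ABX trials and return the number of correct answers.

    respond(x) is a hypothetical stand-in for the subject: it is handed the
    hidden sample and must answer "A" or "B".
    """
    if rng is None:
        rng = random.Random()
    correct = 0
    for _ in range(n_trials):
        # X is drawn at random; in a double-blind setup neither the
        # subject nor the operator sees this assignment.
        x = rng.choice(["A", "B"])
        if respond(x) == x:
            correct += 1
    return correct

def perfect(x):
    """A subject who always hears the difference."""
    return x

_guess_rng = random.Random(1)

def guesser(_x):
    """A subject who hears no difference and answers at random."""
    return _guess_rng.choice(["A", "B"])

print(run_abx_session(16, perfect, random.Random(0)))  # 16
print(run_abx_session(16, guesser, random.Random(0)))  # varies around 8
```

Drawing X from a random source rather than letting the operator choose it is what makes the double-blind variant straightforward to automate.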
ABX tests are commonly used in evaluations of digital audio data compression methods; sample A is typically an uncompressed sample, and sample B is a compressed version of A. Audible compression artifacts that indicate a shortcoming in the compression algorithm can be identified with subsequent testing. ABX tests can also be used to compare the different degrees of fidelity loss between two different audio formats at a given bitrate.
ABX tests can be used to audition input, processing, and output components as well as cabling: virtually any audio product or prototype design.
ABX test equipment utilizing relays to switch between two different hardware paths can help determine if there are perceptual differences in cables and components. Video, audio and digital transmission paths can be compared. If the switching is microprocessor controlled, double-blind tests are possible.
Loudspeaker level and line level audio comparisons could be performed on an ABX test device offered for sale as the ABX Comparator by QSC Audio Products from 1998 to 2004. Other hardware solutions have been fabricated privately by individuals or organizations for internal testing.
If only one ABX trial were performed, random guessing would have a 50% chance of choosing the correct answer, the same as flipping a coin. In order to make a statement with some degree of confidence, many trials must be performed. Increasing the number of trials makes it more likely that a person's genuine ability to distinguish A and B will be demonstrated at a given confidence level. A 95% confidence level is commonly considered statistically significant. QSC, in the ABX Comparator user manual, recommended a minimum of ten listening trials in each round of tests.
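| Number of trials | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Minimum number correct | 9 | 9 | 10 | 10 | 11 | 12 | 12 | 13 | 13 | 14 | 15 | 15 | 16 | 16 | 17 | 18 |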
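The minimum scores in the table can be reproduced with a one-sided binomial test against pure guessing. The following is a minimal sketch, assuming a 95% confidence level means a guessing probability of at most 0.05; the function names `p_value` and `min_correct` are illustrative and not taken from any ABX tool.

```python
from math import comb

def p_value(correct, trials):
    """Probability of scoring at least `correct` out of `trials` by guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def min_correct(trials, alpha=0.05):
    """Smallest score whose guessing probability is at or below alpha."""
    for correct in range(trials + 1):
        if p_value(correct, trials) <= alpha:
            return correct
    return None  # too few trials: even a perfect score is not significant

for n in (10, 16, 25):
    k = min_correct(n)
    print(n, k, round(p_value(k, n), 4))
# 10 9 0.0107
# 16 12 0.0384
# 25 18 0.0216
```

With fewer than five trials the function returns `None`, because even a perfect score could be reached by guessing more than 5% of the time.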
QSC recommended that no more than 25 trials be performed, as listener fatigue can set in, making the test less sensitive (less likely to reveal one's actual ability to discern the difference between A and B). However, a more sensitive test can be obtained by pooling the results of several such tests, either from separate individuals or from the same listener with rest breaks in between. For a large number of total trials N, a significant result (one with 95% confidence) can be claimed if the number of correct responses exceeds N/2 + √N. Important decisions are normally based on a higher level of confidence, since an erroneous "significant result" would be claimed in one of 20 such tests simply by chance.
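For large pooled trial counts, the exact binomial cut-off can be compared with simpler approximations. Under pure guessing the number of correct answers has mean N/2 and standard deviation √N/2, so a margin of √N corresponds to two standard deviations. The sketch below is an illustration only: the function names are hypothetical, and the one-sided normal approximation is an assumption introduced here for comparison.

```python
from math import comb, sqrt
from statistics import NormalDist

def min_correct_exact(trials, alpha=0.05):
    """Smallest score whose exact binomial guessing probability is <= alpha.

    A return value greater than `trials` means no score is significant.
    """
    tail = 0
    # Accumulate the upper tail from a perfect score downwards.
    for correct in range(trials, -1, -1):
        tail += comb(trials, correct)
        if tail / 2 ** trials > alpha:
            return correct + 1
    return 0

def threshold_normal(trials, alpha=0.05):
    """One-sided normal approximation: N/2 plus z standard deviations."""
    z = NormalDist().inv_cdf(1 - alpha)
    return trials / 2 + z * sqrt(trials) / 2

for n in (100, 400, 1000):
    # exact cut-off, normal approximation, two-standard-deviation margin
    print(n, min_correct_exact(n), round(threshold_normal(n), 1), n / 2 + sqrt(n))
```

For 100 trials the exact cut-off is 59 correct, while N/2 + √N gives 60, so the two-standard-deviation margin errs on the conservative side.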
The foobar2000 and Amarok audio players support software-based ABX testing, the latter using a third-party script. aveX is open-source software, developed mainly for Linux, which also provides test monitoring from a remote computer. ABX patcher is an ABX implementation for Max/MSP. More ABX software can be found at the archived PCABX website.
The ABX test can establish whether A is identifiably different from B, but it must be performed correctly to produce a meaningful result. For example, all test results must be counted for the result to be valid. This includes repeated tests and previously failed tests, which might not be made public while the successful ones are. All tests performed should be summed, and the p-value calculated from the sum, not from an individual test. Other problems might arise from the ABX equipment itself, such as a tell from the equipment or poor volume matching in the case of audio tests.
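The effect of counting only the successful sessions can be illustrated with a short sketch. The session scores below are hypothetical data invented for the example, and `pooled_p_value` is an illustrative helper, not part of any ABX software.

```python
from math import comb

def p_value(correct, trials):
    """Probability of scoring at least `correct` out of `trials` by guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def pooled_p_value(sessions):
    """Sum every session's (correct, trials) pair, then test the total."""
    correct = sum(c for c, _ in sessions)
    trials = sum(n for _, n in sessions)
    return p_value(correct, trials)

# Hypothetical scores: one good session and two unremarkable ones.
sessions = [(9, 10), (4, 10), (5, 10)]

best_single = min(p_value(c, n) for c, n in sessions)
print(round(best_single, 4))            # 0.0107, looks significant in isolation
print(pooled_p_value(sessions) > 0.05)  # True: 18 of 30 is not significant
```

Reporting only the 9-out-of-10 session would suggest an audible difference, whereas the pooled total of 18 out of 30 is consistent with guessing.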
Algorithmic Audio Compression Evaluation
Since ABX testing requires human beings for evaluation of lossy audio codecs, it is time-consuming and costly. Therefore, cheaper algorithmic approaches have been developed, e.g. PEAQ, which computes an objective difference grade (ODG).
In MUSHRA, the listener is presented with the reference (labeled as such), a certain number of test samples, a hidden version of the reference and one or more anchors. A 0–100 rating scale makes it possible to rate very small differences.
Alternative general methods are used in discrimination testing, such as paired comparison, duo–trio, and triangle testing. Of these, duo–trio and triangle testing are particularly close to ABX testing. Schematically:
- AXY – one known, two unknowns (one equals A, the other equals B); the task is to decide which unknown matches the known: X = A (and Y = B), or Y = A (and X = B).
- XXY – three unknowns (two are A and one is B, or one is A and two are B); the task is to decide which is the odd one out: the first, second, or third sample.
In this context, ABX testing is also known as "duo–trio" in "balanced reference" mode – both knowns are presented as references, rather than one alone.
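The designs above differ in how often a guess is right: choosing between two candidates (ABX, duo–trio) succeeds by chance half the time, while picking the odd one out of three (triangle) succeeds one time in three. The following sketch, with illustrative names and an assumed 95% confidence level, shows how that changes the score needed for significance.

```python
from math import comb

# Probability that a pure guess is correct in each design.
CHANCE = {"ABX": 1 / 2, "duo-trio": 1 / 2, "triangle": 1 / 3}

def p_value(correct, trials, chance):
    """Probability of at least `correct` right answers when guessing."""
    return sum(
        comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

def min_correct(trials, chance, alpha=0.05):
    """Smallest score whose guessing probability is at or below alpha."""
    for correct in range(trials + 1):
        if p_value(correct, trials, chance) <= alpha:
            return correct
    return None  # too few trials to reach significance

for design, chance in CHANCE.items():
    print(design, min_correct(12, chance))
# ABX 10
# duo-trio 10
# triangle 8
```

Because a correct triangle answer is less likely by chance, fewer correct responses are needed to reach the same confidence level for a given number of trials.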
- David Clark (1982). "High-Resolution Subjective Testing Using a Double-Blind Comparator". AES Journal 30 (5).
- QSC ABX Comparator user manual. (1998) p. 10
- David Carlstrom. "Probability of Experimental Result Being the Same as Random Guesses". ABX Web Page. Retrieved 2011-12-14.
- "ABX Testing". Boston Audio Society. 1990. Retrieved 2012-06-12. "The large relays in this box make a soft clunk that is different for the two sources and is audible in a quiet room; Meyer has identified X 10 out of 10 times without any signal! While the sound is quiet enough to be masked when any music is playing, testing hygiene dictates that the relay box be enclosed or otherwise muffled. Meyer handed out a sheet photocopied from the ABX manual which showed typical level-matching required for reliable detection of differences between sources with 1/3 octave frequency-response aberrations."
- Meilgaard, Morten; Gail Vance Civille, B. Thomas Carr (1999). Sensory evaluation techniques (3 ed.). CRC Press. pp. 68–70. ISBN 0-8493-0276-5.