Repeatability or test–retest reliability is the variation in measurements taken by a single person or instrument on the same item and under the same conditions. A less-than-perfect test–retest reliability causes test–retest variability. Such variability can be caused by, for example, intra-individual variability and intra-observer variability. A measurement may be said to be repeatable when this variation is smaller than some agreed limit.
Test–retest variability is used in practice, for example, in the medical monitoring of conditions. In these situations, there is often a predetermined "critical difference". When the difference between monitored values is smaller than this critical difference, test–retest variability may be considered as the sole cause of the difference, in addition to other possible causes such as changes in the disease or its treatment.
According to the Guidelines for Evaluating and Expressing the Uncertainty of NIST Measurement Results, the following conditions need to be fulfilled in the establishment of repeatability:
- the same measurement procedure
- the same observer
- the same measuring instrument, used under the same conditions
- the same location
- repetition over a short period of time.
Repeatability methods were developed by Bland and Altman (1986).
The repeatability coefficient is a precision measure which represents the value below which the absolute difference between two repeated test results may be expected to lie with a probability of 95%.
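The definition above can be sketched in code. This is a minimal illustration, assuming two repeated measurements per subject, no systematic shift between the two runs, and roughly normally distributed differences. Under those assumptions the coefficient is commonly estimated as 1.96 × √2 × s_w, where s_w is the within-subject standard deviation. The function name and the sample readings are hypothetical.

```python
import math


def repeatability_coefficient(first, second):
    """Estimate the repeatability coefficient from paired repeated measurements.

    Assumes two measurements per subject, no systematic shift between them,
    and roughly normally distributed differences.
    """
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("need two equally long, non-empty sequences")
    # With two replicates per subject, the within-subject variance is
    # estimated by the mean of d**2 / 2, where d is the paired difference.
    within_var = sum((a - b) ** 2 for a, b in zip(first, second)) / (2 * len(first))
    s_w = math.sqrt(within_var)
    # About 95% of absolute differences between two repeated measurements
    # are expected to fall below 1.96 * sqrt(2) * s_w (roughly 2.77 * s_w).
    return 1.96 * math.sqrt(2) * s_w


# Hypothetical repeated readings of the same five items.
first = [10.1, 9.8, 10.4, 10.0, 9.9]
second = [10.3, 9.7, 10.2, 10.1, 10.0]
rc = repeatability_coefficient(first, second)
print(f"repeatability coefficient: {rc:.3f}")
```

A difference between two repeated readings that exceeds `rc` would then be unusual under these assumptions.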
Desirability of repeatability
Test–retest reliability is desirable in measures of constructs that are not expected to change over time. For example, if you use a certain method to measure an adult's height, and then do the same again two years later, you would expect a very high correlation; if the results differed by a great deal, you would suspect that the measure was inaccurate. The same is true for personality traits such as extraversion, which are believed to change only very slowly. In contrast, if you were trying to measure mood, you would expect only moderate test–retest reliability, because people's moods are expected to change from day to day. For such a measure, very high test–retest reliability would be undesirable, because it would suggest that the measure was not picking up on these changes.
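The contrast between a stable and a changeable construct can be illustrated with a test–retest correlation. The sketch below computes a plain Pearson correlation between two administrations; the height and mood scores are invented for illustration, and Pearson's r is only one of several statistics that could be used here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between scores from a test and its retest."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical adult heights (cm), measured two years apart.
height_t1 = [160.0, 172.5, 181.0, 168.0, 175.5]
height_t2 = [160.5, 172.0, 181.5, 168.0, 175.0]

# Hypothetical mood ratings (1-10) from the same people on two different days.
mood_t1 = [3, 7, 5, 8, 4]
mood_t2 = [6, 5, 4, 6, 7]

height_r = pearson_r(height_t1, height_t2)
mood_r = pearson_r(mood_t1, mood_t2)
print(f"height r = {height_r:.3f}, mood r = {mood_r:.3f}")
```

With data like these, the height correlation is close to 1, while the mood correlation is much lower, which is the expected pattern rather than a flaw in the mood measure.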
Attribute Agreement Analysis for Defect Databases
An attribute agreement analysis is designed to simultaneously evaluate the impact of repeatability and reproducibility on accuracy. It allows the analyst to examine the responses from multiple reviewers as they look at several scenarios multiple times. It produces statistics that evaluate the ability of the appraisers to agree with themselves (repeatability), with each other (reproducibility), and with a known master or correct value (overall accuracy) for each characteristic – over and over again.
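The three kinds of agreement described above can be sketched as simple percentages. This is an illustrative simplification, not the output of any particular statistical package: it uses raw percent agreement only and does not compute chance-corrected statistics such as kappa. The data layout, the function name, and the defect labels are all assumptions made for the example.

```python
def attribute_agreement(ratings, standard):
    """Percent agreement within appraisers, between appraisers, and vs. a standard.

    ratings:  {appraiser: {item: [rating on trial 1, rating on trial 2, ...]}}
    standard: {item: known correct rating}
    """
    items = list(standard)
    n = len(items)

    # Repeatability: for each appraiser, the share of items on which
    # all of that appraiser's own trials agree.
    within = {
        name: sum(len(set(by_item[i])) == 1 for i in items) / n
        for name, by_item in ratings.items()
    }

    # Reproducibility: the share of items on which every trial of
    # every appraiser gives the same rating.
    between = sum(
        len({r for by_item in ratings.values() for r in by_item[i]}) == 1
        for i in items
    ) / n

    # Overall accuracy: the share of items on which every trial of
    # every appraiser matches the known standard.
    accuracy = sum(
        all(r == standard[i] for by_item in ratings.values() for r in by_item[i])
        for i in items
    ) / n

    return within, between, accuracy


# Hypothetical defect records, each classified twice by two reviewers.
standard = {"d1": "major", "d2": "minor", "d3": "major", "d4": "minor"}
ratings = {
    "A": {"d1": ["major", "major"], "d2": ["minor", "minor"],
          "d3": ["major", "minor"], "d4": ["minor", "minor"]},
    "B": {"d1": ["major", "major"], "d2": ["major", "major"],
          "d3": ["major", "major"], "d4": ["minor", "minor"]},
}

within, between, accuracy = attribute_agreement(ratings, standard)
print(within, between, accuracy)
```

In this example reviewer B is perfectly repeatable but wrong on one item, which shows why repeatability alone does not establish accuracy.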
Because the same test is administered twice and every test is parallel with itself, differences between scores on the test and scores on the retest should be due solely to measurement error. This sort of argument is quite probably true for many physical measurements. However, this argument is often inappropriate for psychological measurement, because it is often impossible to consider the second administration of a test a parallel measure to the first.
The second administration of a psychological test might yield systematically different scores than the first administration due to the following reasons:
1. The attribute that is being measured may change between the first test and the retest. For example, a reading test that is administered in September to a third grade class may yield different results when retaken in June. We would expect some change in children’s reading ability over that span of time, so a low test–retest correlation might reflect real changes in the attribute itself.
2. The experience of taking the test itself can change a person’s true score. For example, completing an anxiety inventory could serve to increase a person’s level of anxiety.
3. Carryover effects may occur, particularly if the interval between test and retest is short. When retested, people may remember their original answers, which could affect their answers on the second administration.
- "Types of Reliability". The Research Methods Knowledge Base. Last revised: 20 October 2006.
- Fraser, C. G.; Fogarty, Y. (1989). "Interpreting laboratory results". BMJ (Clinical research ed.) 298 (6689): 1659–1660. doi:10.1136/bmj.298.6689.1659. PMC 1836738. PMID 2503170.
- George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference. 11.0 update (4th ed.). Boston: Allyn & Bacon.
- Murphy, Kevin R.; Davidshofer, Charles O. (2005). Psychological testing: principles and applications (6th ed.). Upper Saddle River, N.J.: Pearson/Prentice Hall. ISBN 0-13-189172-3.