Error detection and correction
From Wikipedia, the free encyclopedia
| This article needs additional citations for verification. Please help improve this article by adding reliable references. Unsourced material may be challenged and removed. (August 2008) |
In mathematics, computer science, telecommunication, and information theory, error detection and correction are techniques to ensure that data is transmitted without errors, even across unreliable media or networks.
Contents |
[edit] General definitions of terms
Definitions of error detection and error correction:
- Error detection is the ability to detect the presence of errors caused by noise or other impairments during transmission from the transmitter to the receiver[1].
- Error correction is the additional ability to reconstruct the original, error-free data.
There are two basic ways to design the channel code and protocol for an error correcting system:
- Automatic repeat-request (ARQ): The transmitter sends the data and also an error detection code, which the receiver uses to check for errors, and requests retransmission of erroneous data. In many cases, the request is implicit; the receiver sends an acknowledgement (ACK) of correctly received data, and the transmitter re-sends anything not acknowledged within a reasonable period of time.
- Forward error correction (FEC): The transmitter encodes the data with an error-correcting code (ECC) and sends the coded message. The receiver never sends any messages back to the transmitter. The receiver decodes what it receives into the "most likely" data. The codes are designed so that it would take an "unreasonable" amount of noise to trick the receiver into misinterpreting the data.
It is possible to combine the two, so that minor errors are corrected without retransmission, and major errors are detected and a retransmission requested. The combination is called hybrid automatic repeat-request.
[edit] Error detection schemes
Several schemes exist to achieve error detection. The general idea is to add some redundancy, i.e., some extra data, to a message, that enables detection of any errors in the delivered message. Most such error-detection schemes are systematic: the transmitter sends the original data bits, and attaches a fixed number of check bits, which are derived from the data bits by some deterministic algorithm. The receiver applies the same algorithm to the received data bits and compares its output to the received check bits; if the values do not match, an error has occurred at some point during the transmission. In a system that uses a "non-systematic" code, such as some raptor codes, the original message is transformed into an encoded message that has at least as many bits as the original message.
In general, any hash function may be used to compute the redundancy. However, some functions are of particularly widespread use, due to their simplicity, or their suitability of detecting certain kinds of errors, such as the cyclic redundancy check's performance in detecting burst errors.
Other mechanisms of adding redundancy are repetition schemes and error-correcting codes. Repetition schemes are rather inefficient but very simple to implement. Error-correcting codes can provide strict guarantees on the number of errors that can be detected.
[edit] Repetition schemes
Variations on this scheme exist. Given a stream of data that is to be sent, the data is broken up into blocks of bits, and in sending, each block is sent some predetermined number of times. For example, if we want to send "1011", we may repeat this block three times each.
Suppose we send "1011 1011 1011", and this is received as "1010 1011 1011". As one group is not the same as the other two, we can determine that an error has occurred. This scheme is not very efficient, and can be susceptible to problems if the error occurs in exactly the same place for each group (e.g. "1010 1010 1010" in the example above will be detected as correct in this scheme).
The scheme however is extremely simple, and is in fact used in some transmissions of numbers stations.[citation needed]
[edit] Parity schemes
- Main article: Parity bit
A simple parity bit is an error detection mechanism that can only detect an odd number of errors.
The stream of data is broken up into blocks of bits, and the number of 1 bits is counted. Then, a "parity bit" is set (or cleared) if the number of one bits is odd (or even). (This scheme is called even parity; odd parity can also be used.) If the tested blocks overlap, then the parity bits can be used to isolate the error, and even correct it if the error affects a single bit: this is the principle behind the Hamming code.
There is a limitation to parity schemes. A parity bit is only guaranteed to detect an odd number of bit errors (one, three, five, and so on). If an even number of bits (two, four, six and so on) are flipped, the parity bit appears to be correct, even though the data is corrupt.
Extension and variations on the parity bit mechanism are horizontal redundancy checks, vertical redundancy checks and "double", "dual" or "diagonal" parity (used in RAID-DP).
[edit] Checksums
- Main article: Checksum
A checksum of a message is a modular arithmetic sum of message code words of a fixed word length (e.g., byte values). The sum is often negated by means of a one's-complement prior to transmission as the redundancy information in order to detect errors resulting in all-zero messages.
Checksum schemes include parity bits, check digits, and longitudinal redundancy check. Some checksum schemes, such as the Luhn algorithm and the Verhoeff algorithm, are specifically designed to detect errors commonly introduced by humans in writing down or remembering identification numbers.
[edit] Cyclic redundancy checks
- Main article: Cyclic redundancy check
The cyclic redundancy check (CRC) considers a block of data as the coefficients to a polynomial over a finite field, and then divides by a fixed, predetermined polynomial. The remainder of the division serves as the redundancy for the message.
CRCs have favorable properties in that they are specifically suited for detecting burst errors. They are easily implemented in hardware, and are widely used in various protocols.
Parity bits are trivial CRCs of size 1 bit.
[edit] Cryptographic hash functions
Cryptography requires messages to be protected not only against modification due to transmission errors but also against intentional changes by an adversary. The output of a cryptographic hash function can provide both authenticity as well as integrity guarantees of a message.
[edit] Error-correcting codes
Any error-correcting code can be used for error detection. A code with minimum Hamming distance d can detect up to d-1 errors in a code word. Using error-correcting codes for error detection can be favorable if strict integrity guarantees are desired, and the capacity of the transmission channel can be modeled.
Codes with minimum Hamming distance d=2 are degenerate cases of error-correcting codes, and can be used to detect single errors. The parity bit is an example of a single-error-detecting code.
The Berger code is an early example of a unidirectional error(-correcting) code that can detect any number of errors on an asymmetric channel, provided that only transitions of 0-bits to 1-bits or 1-bits to 0-bits can occur.
[edit] Error correction
[edit] Automatic repeat request
Automatic Repeat-reQuest (ARQ) is an error control method for data transmission which makes use of error detection codes, acknowledgment and/or negative acknowledgement messages and timeouts to achieve reliable data transmission. An acknowledgment is a message sent by the receiver to the transmitter to indicate that it has correctly received a data frame.
Usually, when the transmitter does not receive the acknowledgment before the timeout occurs (i.e. within a reasonable amount of time after sending the data frame), it retransmits the frame until it is either correctly received or the error persists beyond a predetermined number of retransmissions.
A few types of ARQ protocols are Stop-and-wait ARQ, Go-Back-N ARQ and Selective Repeat ARQ.
ARQ is appropriate if the communication channel has varying or unknown capacity, such as is the case on the Internet. However, it requires the availability of a back channel, results in possibly increased latency due to retransmissions, and maintainance of buffers and timers for retransmissions as well as network congestion can put a strain on server and network capacity.[2]
[edit] Error-correcting code
An error-correcting code (ECC) or forward error correction (FEC) code is a system of adding redundant data, or parity data, to a message, such that it can be recovered by a receiver even when a number of errors (up to the capability of the code being used) were introduced, either during the process of transmission, or on storage. Since the receiver does not have to ask the sender for retransmission of the data, a back-channel is not required in forward error correction, and it is therefore suitable for simplex communication such as broadcasting. Error-correcting codes are frequently used in lower-layer communication, as well as for reliable storage in media such as CDs, DVDs, and dynamic RAM.
Error-correcting codes are usually distinguished between convolutional codes and block codes:
- Convolutional codes are processed on a bit-by-bit basis. They are particularly suitable for implementation in hardware, and the Viterbi decoder allows optimal decoding.
- Block codes are processed on a block-by-block basis. Early examples of block codes are repetition codes, Hamming codes and multidimensional parity-check codes. They were followed by a number of efficient codes, of which Reed-Solomon codes are the most notable ones due to their widespread use these days. Turbo codes and low-density parity-check codes (LDPC) are relatively new constructions that can provide almost optimal efficiency.
Shannon's theorem is an important theorem in forward error correction, and describes the minimum amount of redundancy required to enable reliable communication over a channel that has a certain error probability or signal-to-noise ratio (SNR). This minimum required redundancy gives a theoretical upper bound for the attainable spectral efficiency in bit/s/Hz of a channel coding scheme (an error-correcting scheme, typically combined width a digital modulation scheme) for a specific SNR. For example, if the SNR is 0 dB, the spectral efficiency can not be higher than 1 bit/s/Hz, resulting in that the information rate in bit/s (the useful bitrate, excluding redundant ECC) can not be higher than the bandwidth in hertz.
The minimum redundancy required on a channel is expressed in terms of the channel capacity. Any error-correcting code must then have a code rate that does not exceed the capacity of the channel. The actual code rate required depends on the error-correcting code used, and may be lower. This is because Shannon's proof was only of existential nature, and did not show how to construct optimal codes reaching the channel capacity.
[edit] Hybrid schemes
Hybrid ARQ is a combination of ARQ and forward error correction. There are two basic approaches[2]:
- Messages are always transmitted with FEC parity data (and error-detection redundancy). A receiver decodes a message using the parity information, and requests retransmission using ARQ only if the parity data was not sufficient for successful decoding (identified through a failed integrity check).
- Messages are transmitted without parity data (only with error-detection information). If a receiver detects an error, it requests FEC information from the transmitter using ARQ, and uses it to reconstruct the original message.
The latter approach is particularly attractive on the binary erasure channel when using a rateless erasure code.
[edit] Applications
Applications that require low latency (such as telephone conversations) cannot use Automatic Repeat reQuest (ARQ); they must use Forward Error Correction (FEC). By the time an ARQ system discovers an error and re-transmits it, the re-sent data will arrive too late to be any good.
Applications where the transmitter immediately forgets the information as soon as it is sent (such as most television cameras) cannot use ARQ; they must use FEC because when an error occurs, the original data is no longer available. (This is also why FEC is used in data storage systems such as RAID and distributed data store).
Applications that use ARQ must have a return channel. Applications that have no return channel cannot use ARQ.
Applications that require extremely low error rates (such as digital money transfers) must use ARQ.
[edit] The Internet
In a typical TCP/IP stack, error detection is performed at multiple levels:
- Each Ethernet frame carries a CRC-32 checksum. The receiver discards frames if their checksums do not match.
- The IPv4 header contains a header checksum of the contents of the header (excluding the checksum field). Packets with checksums that don't match may be discarded or processed, depending on application.
- The checksum was omitted from the IPv6 header, because most current link layer protocols have error detection.
- UDP has an optional checksum. Packets with wrong checksums may be discarded or retained depending on application.
- TCP has a checksum of the payload, TCP header (excluding the checksum field) and source- and destination addresses of the IP header. Packets found to have incorrect checksums are discarded and eventually get retransmitted when the sender receives a triple-ack or a timeout occurs.
[edit] Deep-space telecommunications
NASA has used many different error correcting codes. For missions between 1969 and 1977 the Mariner spacecraft used a Reed-Muller code. The noise these spacecraft were subject to was well approximated by a "bell-curve" (normal distribution), so the Reed-Muller codes were well suited to the situation.
The Voyager 1 & Voyager 2 spacecraft transmitted color pictures of Jupiter and Saturn in 1979 and 1980.
- Color image transmission required 3 times the amount of data, so the Golay (24,12,8) code was used.[citation needed][3]
- This Golay code is only 3-error correcting, but it could be transmitted at a much higher data rate.
- Voyager 2 went on to Uranus and Neptune and the code was switched to a concatenated Reed-Solomon code-Convolutional code for its substantially more powerful error correcting capabilities.
- Current DSN error correction is done with dedicated hardware.
- For some NASA deep space craft such as those in the Voyager program, Cassini-Huygens (Saturn), New Horizons (Pluto) and Deep Space 1—the use of hardware ECC may not be feasible for the full duration of the mission.
The different kinds of deep space and orbital missions that are conducted suggest that trying to find a "one size fits all" error correction system will be an ongoing problem for some time to come.
- For missions close to the earth the nature of the "noise" is different from that on a spacecraft headed towards the outer planets.
- In particular, if a transmitter on a spacecraft far from earth is operating at a low power, the problem of correcting for noise gets larger with distance from the earth.
[edit] Satellite broadcasting (DVB)
The demand for satellite transponder bandwidth continues to grow, fueled by the desire to deliver television (including new channels and High Definition TV) and IP data. Transponder availability and bandwidth constraints have limited this growth, because transponder capacity is determined by the selected modulation scheme and Forward error correction (FEC) rate.
Overview
- QPSK coupled with traditional Reed Solomon and Viterbi codes have been used for nearly 20 years for the delivery of digital satellite TV.
- Higher order modulation schemes such as 8PSK, 16QAM and 32QAM have enabled the satellite industry to increase transponder efficiency by several orders of magnitude.
- This increase in the information rate in a transponder comes at the expense of an increase in the carrier power to meet the threshold requirement for existing antennas.
- Tests conducted using the latest chipsets demonstrate that the performance achieved by using Turbo Codes may be even lower than the 0.8 dB figure assumed in early designs.
[edit] Data storage
Error detection and correction codes are often used to improve the reliability of data storage media.
A "parity track" was present on the first magnetic tape data storage in 1951. The "Optimal Rectangular Code" used in group code recording tapes not only detects but also corrects single-bit errors.
Some file formats, particularly archive formats, include a checksum (most often CRC32) to detect corruption and truncation and can employ redundancy and/or parity files to recover portions of corrupted data.
Reed Solomon codes are used in compact discs to correct errors caused by scratches.
Modern hard drives use CRC codes to detect and Reed-Solomon codes to correct minor errors in sector reads, and to recover data from sectors that have "gone bad" and store that data in the spare sectors.[4]
RAID systems use a variety of error correction techniques, to correct errors when a hard drive completely fails.
[edit] Error-correcting memory
DRAM memory may provide increased protection against soft errors by relying on error correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for high fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation.
Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy.
Interleaving allows distributing the effect of a single cosmic ray potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and the illusion of an error-free memory system may be maintained.[5]
[edit] See also
- List of checksum algorithms
- List of error-correcting codes
- List of algorithms for error detection and correction
- Reliability (computer networking)
- Link adaptation
[edit] Notes
- ^ See also Encyclopedia of Microcomputers: Volume 6, p. 343.
- ^ a b A. J. McAuley, Reliable Broadband Communication Using a Burst Erasure Correcting Code, ACM SIGCOMM, 1990.
- ^ http://www-math.cudenver.edu/~wcherowi/courses/m6409/mariner9talk.pdf
- ^ My Hard Drive Died | Scott A. Moulton
- ^ "Using StrongArm SA-1110 in the On-Board Computer of Nanosatellite". Tsinghua Space Center, Tsinghua University, Beijing. http://www.apmcsta.org/File/doc/Conferences/6th%20meeting/Chen%20Zhenyu.doc. Retrieved 2009-02-16.
[edit] References
This article incorporates public domain material from the General Services Administration document "Federal Standard 1037C" (in support of MIL-STD-188).
[edit] External links
- The on-line textbook: Information Theory, Inference, and Learning Algorithms, by David MacKay, contains chapters on elementary error-correcting codes; on the theoretical limits of error-correction; and on the latest state-of-the-art error-correcting codes, including low-density parity-check codes, turbo codes, and fountain codes.
- Compute parameters of linear codes - an on-line interface for generating and computing parameters (e.g. minimum distance, covering radius) of linear error-correcting codes.
- ECC Page