Data corruption: Difference between revisions

Revision as of 13:12, 28 July 2011

"I have over 2000 pictures, microsoft word documents etc on my 16GB USB stick and they have all become corrupted! I can't move them or anything! I've lost every single file I've ever owned :( Anyone know why this has happened?" -VeLz

Data corruption refers to errors in computer data that occur during transmission, retrieval, or processing, introducing unintended changes to the original data. Computer storage and transmission systems use a number of measures to provide data integrity, or lack of errors.

In general, when data corruption occurs, the file containing that data may become inaccessible, and the system or the related application will report an error. For example, if a Microsoft Word file is corrupted, attempting to open it in MS Word produces an error message and the file may not open. Some programs offer to repair the file automatically after the error; others cannot. Whether a repair succeeds depends on the extent of the corruption and on the application's built-in error handling. Corruption has various causes.

Transmission

Data corruption during transmission has a variety of causes. Interruption of data transmission causes information loss. Environmental conditions can interfere with data transmission, especially when dealing with wireless transmission methods. Heavy clouds can block satellite transmissions. Wireless networks are susceptible to interference from devices such as microwave ovens.

Storage

Data loss during storage has two broad causes: hardware and software failure. Head crashes and general wear and tear of media fall into the former category, while software failure typically occurs due to bugs in the code.

Error detection and correction may occur in the hardware, in the disk subsystem or adapter, or in software that implements error checking and correction (e.g., RAID software such as mdadm for Linux).

There are two types of data loss:

Undetected: also known as "silent corruption". These problems have been attributed to errors during the write process to disk. They are the most dangerous errors, as there is no indication that the data is incorrect.
Detected: these errors are most often caused by disk drive problems. Errors may be either permanent or temporary; temporary errors can be overcome when the hardware repeats the operation. Errors are normally detected by the hardware: either by the disk drive, which checks the data read from the disk against the ECC/CRC error correcting code stored alongside the data on disk, or, in the case of a RAID array, by comparing the contents of the RAID strips with the ECC checksum or parity of the RAID stripe.

Countermeasures

When data corruption behaves as a Poisson process, where each bit of data has an independently low probability of being changed, data corruption can generally be detected by the use of checksums, and can often be corrected by the use of error correcting codes.
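The two ideas in this paragraph can be illustrated with a small sketch. It is not taken from any particular storage system: it assumes CRC-32 (Python's `zlib.crc32`) as the checksum and a textbook Hamming(7,4) code as the error correcting code, and the helper names `flip_bit`, `is_intact`, `hamming_encode` and `hamming_decode` are made up for the illustration.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (models a single-bit error)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def is_intact(data: bytes, stored_crc: int) -> bool:
    """Detection only: compare a freshly computed CRC-32 with the stored one."""
    return zlib.crc32(data) == stored_crc


def hamming_encode(d):
    """Encode 4 data bits as a 7-bit Hamming codeword [p1, p2, d0, p4, d1, d2, d3]."""
    p1 = d[0] ^ d[1] ^ d[3]  # covers codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers codeword positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

A checksum only reports that something changed, whereas the Hamming code carries enough redundancy to locate and undo a single flipped bit per codeword. Neither helps once more bits are damaged than the code was designed for.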

If an uncorrectable data corruption is detected, procedures such as automatic retransmission or restoration from backups can be applied. Certain levels of RAID disk arrays have the ability to store and evaluate parity bits for data across a set of hard disks and can reconstruct corrupted data upon the failure of a single or multiple disks, depending on the level of RAID implemented.
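As a rough sketch of how single parity allows reconstruction after a known disk failure, the following assumes a stripe of equally sized blocks with one XOR parity block, in the style of RAID 5. The function names are illustrative and do not correspond to any real RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data "disks" and one parity block computed over them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Disk 1 fails; its contents are recomputed from the other two and the parity.
recovered = rebuild_missing([data[0], data[2]], parity)
```

This works only because the array knows which disk is missing. With a second failed disk in the same stripe, one parity block no longer carries enough information.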

Today, many errors are detected and corrected by the disk drive using the ECC/CRC codes^[1] which are stored on disk for each sector. If the disk drive detects multiple read errors on a sector, it may copy the failing sector to another part of the disk, remapping the failed sector to a spare sector without the involvement of the operating system (though this may be delayed until the next write to the sector). This "silent correction" can lead to other problems if disk storage is not managed well: the disk drive will continue to remap sectors until it runs out of spares, at which point temporary, correctable errors can turn into permanent ones as the drive deteriorates. S.M.A.R.T. provides a standardized way of monitoring the health of a disk drive, and tools are available for most operating systems to check the drive automatically for impending failures by watching for deteriorating SMART parameters.

In a RAID setup with single redundancy (e.g. RAID 1 mirroring or RAID 5 with single parity), a single detected error can be corrected by the RAID controller or software. If the error is "undetected", however, a single-parity RAID algorithm cannot determine which drive is at fault. It can only determine that the data does not match its parity/checksum, and will therefore simply rewrite the parity/checksum information.
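The limitation can be made concrete with a small sketch, again assuming one XOR parity block per stripe; `parity_matches` and `candidate_repairs` are hypothetical helpers written for this illustration. When one block is silently altered, the parity check fails, but blaming any one of the drives and rebuilding it yields a stripe that passes the check, so the parity alone cannot say which drive was wrong.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_matches(data_blocks, parity):
    """True when the stored parity agrees with the data blocks."""
    return xor_blocks(data_blocks) == parity


def candidate_repairs(data_blocks, parity):
    """For each drive, the stripe obtained by assuming that drive is the bad one."""
    repairs = []
    for suspect in range(len(data_blocks)):
        others = [blk for k, blk in enumerate(data_blocks) if k != suspect]
        rebuilt = xor_blocks(others + [parity])
        repairs.append(data_blocks[:suspect] + [rebuilt] + data_blocks[suspect + 1:])
    return repairs


original = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(original)

# Drive 1 returns wrong data without reporting any fault.
corrupted = [b"AAAA", b"BxBB", b"CCCC"]
repairs = candidate_repairs(corrupted, parity)
```

Every entry in `repairs` is consistent with the parity, yet only one of them equals the original data.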

For RAID levels with dual parity (two parity blocks per stripe, e.g. RAID 6), a single "undetected" error, where a drive provides incorrect data without flagging a fault, can be corrected. This relies on the RAID system checking the parity, a feature called parity check on read (DDN disk subsystems) or pre-read redundancy check (PRRC on LSI/Engenio disk subsystems). The feature may need to be enabled even if it is available on the disk subsystem. Depending on the disk subsystem, the check may not be performed on normal reads, because it involves reading all of the data from all disks in the RAID stripe. This adds a performance overhead when the requesting program only wants data that resides on a fraction of the stripe: all of the data in the stripe is read, increasing the I/Os to the disk subsystem, and the parity must also be calculated and checked.
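A minimal sketch of why a second parity block makes a silent error locatable is given below. It assumes a P+Q scheme of the kind commonly described for RAID 6, with P as the plain XOR of the data blocks and Q as a weighted sum in GF(2^8) using the reduction polynomial 0x11d and generator 2. It also assumes the error lies in exactly one data block, not in P or Q. The function names are illustrative, not those of any real RAID implementation.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with reduction polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow2(n):
    """Return 2**n in GF(2^8); used as the weight of data block n in Q."""
    x = 1
    for _ in range(n):
        x = gf_mul(x, 2)
    return x


def compute_pq(blocks):
    """P is the XOR of all blocks; Q is the XOR of each block times its weight."""
    p = bytearray(len(blocks[0]))
    q = bytearray(len(blocks[0]))
    for i, block in enumerate(blocks):
        weight = gf_pow2(i)
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(weight, byte)
    return bytes(p), bytes(q)


def locate_and_fix(blocks, p, q):
    """Return (index, repaired_blocks) for one silently corrupted data block."""
    p_now, q_now = compute_pq(blocks)
    p_syn = bytes(a ^ b for a, b in zip(p, p_now))  # equals the error pattern e
    q_syn = bytes(a ^ b for a, b in zip(q, q_now))  # equals weight(z) * e
    if not any(p_syn) and not any(q_syn):
        return None, list(blocks)  # stripe is consistent
    for z in range(len(blocks)):
        weight = gf_pow2(z)
        if all(gf_mul(weight, e) == s for e, s in zip(p_syn, q_syn)):
            repaired = list(blocks)
            repaired[z] = bytes(b ^ e for b, e in zip(blocks[z], p_syn))
            return z, repaired
    raise ValueError("not explainable as a single data-block error")
```

The P syndrome gives the error pattern and the ratio between the Q and P syndromes identifies the block, which is the extra information single parity lacks. Note that every block in the stripe must be read to compute the syndromes, which is the overhead described above.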

"Data scrubbing" is another method to reduce the likelihood of data corruption, as disk errors are caught and recovered from, before multiple errors accumulate and overwhelm the number of parity bits. Instead of parity being checked on each read, the parity is checked during a regular scan of the disk, often done as a low priority background process. Note that the "data scrubbing" operation activates a parity check. If a user simply runs a normal program that reads data from the disk, then the parity would not be checked unless parity-check-on-read was both supported and enabled on the disk subsystem.

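A scrubbing pass might be sketched as follows. This is a simplified model and not the behaviour of any specific disk subsystem: each stripe is a list of data blocks followed by one XOR parity block, an unreadable sector is modelled as `None`, and the `scrub` function and its return format are assumptions made for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub(stripes):
    """Check every stripe once, repairing in place where single parity allows.

    Returns a dict of stripe indices: 'repaired' (one unreadable block rebuilt),
    'unrecoverable' (more unreadable blocks than the parity can cover) and
    'mismatched' (all blocks readable but parity inconsistent).
    """
    report = {"repaired": [], "unrecoverable": [], "mismatched": []}
    for index, stripe in enumerate(stripes):
        bad = [k for k, block in enumerate(stripe) if block is None]
        if len(bad) == 1:
            # XOR of all other blocks (data and parity) equals the missing one.
            stripe[bad[0]] = xor_blocks([b for b in stripe if b is not None])
            report["repaired"].append(index)
        elif len(bad) > 1:
            report["unrecoverable"].append(index)
        elif any(xor_blocks(stripe)):
            # XOR over data plus parity should be all zero bytes.
            report["mismatched"].append(index)
    return report


def make_stripe(data_blocks):
    """Helper for the example: append the XOR parity block to the data blocks."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


stripes = [
    make_stripe([b"AAAA", b"BBBB", b"CCCC"]),  # healthy
    make_stripe([b"DDDD", b"EEEE", b"FFFF"]),  # will lose one block
    make_stripe([b"GGGG", b"HHHH", b"IIII"]),  # will lose two blocks
]
stripes[1][1] = None
stripes[2][0] = None
stripes[2][2] = None
report = scrub(stripes)
```

The third stripe shows what regular scrubbing is meant to prevent: had the first unreadable block been found and rebuilt before the second appeared, the stripe would still be recoverable.
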
If appropriate mechanisms are employed to detect and remedy data corruption, data integrity can be maintained. This is particularly important in commercial applications (e.g. banking), where an undetected error could corrupt a database index or change data in a way that drastically affects an account balance, and in the use of encrypted or compressed data, where a small error can make an extensive dataset unusable.^[2] Although the CERN study has often been cited as showing high levels of data corruption, the disk subsystem examined in that paper was set up with RAID 5 and single parity (and hence could not recover from a single "silent" error), did not use parity-check-on-read (and hence could not detect "silent" errors through parity checking of the RAID stripe), and did not use data scrubbing. The disk storage was also subject to a microcode software bug which caused higher levels of errors than normal.^[3]

References

^ "Read Error Severities and Error Management Logic". Retrieved 24 July 2011.
^ "Data Integrity". CERN, April 2007. Cern.ch
^ Bernd Panzer-Steindel. "Data integrity". "There are some correlations with known problems, like the problem where disks drop out of the RAID5 system on the 3ware controllers. After some long discussions with 3Ware and our hardware vendors this was identified as a problem in the WD disk firmware."


@@ Line 2: / Line 2: @@
 {{Cleanup-rewrite|date=September 2009}}
 [[Image:Data loss of image file.JPG|thumb|Photo data corruption; in this case, a result of a failed data recovery from a hard disk drive]]
+"I have over 2000 pictures, microsoft word documents etc on my 16GB USB stick and they have all become corrupted! I can't move them or anything! I've lost every single file I've ever owned :( Anyone know why this has happened?"
+-VeLz
 '''Data corruption''' refers to errors in [[computer]] [[data]] that occur during transmission, retrieval, or processing, introducing unintended changes to the original data. Computer storage and transmission systems use a number of measures to provide [[data integrity]], or lack of errors.
