Chipkill is IBM's trademark for a form of advanced error checking and correcting (ECC) memory technology that protects a computer's memory system against the failure of any single memory chip, as well as against multi-bit errors from any portion of a single memory chip. One simple scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips, such that the failure of any single memory chip affects only one bit of each ECC word. This allows memory contents to be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes, such as a BCH code, that can correct multiple bits with less overhead. The equivalent system from Sun Microsystems is called Extended ECC. The equivalent systems from HP are called Advanced ECC and Chipspare. A similar system from Intel is called double-device data correction (DDDC).
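The simple bit-scattering scheme can be illustrated with a minimal Python sketch. It is not IBM's implementation: the choice of a Hamming(7,4) code, the seven 4-bit-wide chips, and the function names `store`, `fail_chip` and `load` are all assumptions made for illustration. Each chip holds exactly one bit of every codeword, so a whole-chip failure appears to each codeword as a single-bit error, which the Hamming code can correct.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def store(words):
    """Scatter codewords so that chips[i][w] holds bit i of codeword w."""
    codewords = [hamming74_encode(word) for word in words]
    return [[cw[i] for cw in codewords] for i in range(7)]


def fail_chip(chips, index):
    """Model a dead chip (hypothetical failure model): invert all of its bits,
    so that every stored word sees an error from that chip."""
    damaged = [list(chip) for chip in chips]
    damaged[index] = [bit ^ 1 for bit in damaged[index]]
    return damaged


def load(chips):
    """Gather one bit per chip for each word and decode it."""
    n_words = len(chips[0])
    return [hamming74_decode([chips[i][w] for i in range(7)])
            for w in range(n_words)]


words = [[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 1, 0]]
chips = store(words)
# Whichever single chip fails, every word is still recovered.
recovered = all(load(fail_chip(chips, k)) == words for k in range(7))
```

In this sketch seven chips carry only four data bits per word, which illustrates why practical designs favour codes with lower overhead.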
Chipkill is frequently combined with dynamic bit-steering, so that if a chip fails (or exceeds a threshold of bit errors), a spare memory chip is used to replace it. The concept is similar to that of RAID, which protects against disk failure, except that it is applied to individual memory chips. The technology was developed by IBM in the early and mid-1990s. An important reliability, availability and serviceability (RAS) feature, Chipkill technology is deployed primarily on SSDs, mainframes and midrange servers.
A 2009 paper using data from Google's datacentres provided evidence that, in the observed Google systems, DRAM errors recurred at the same location and that 8% of DIMMs were affected each year. Specifically, "In more than 85% of the cases a correctable error is followed by at least one more correctable error in the same month". A lower fraction of DIMMs with chipkill error correction reported uncorrectable errors than of DIMMs with error correcting codes that can only correct single-bit errors. A 2010 paper from the University of Rochester, using both real-world memory traces and simulations, also showed that Chipkill memory resulted in substantially fewer memory errors.
- ECC memory
- Lockstep (computing)
- Memory ProteXion
- Redundant array of independent memory
- Single-error correction and double-error detection (SECDED)
- "Best Practice Guidelines for ProLiant Servers with the Intel Xeon 5500 processor series Engineering Whitepaper, 1st Edition" (PDF). HP. May 2009. p. 8. Retrieved 2014-09-09.
- Schroeder, Bianca; Pinheiro, Eduardo and Weber, Wolf-Dietrich (2009). "DRAM errors in the wild: a large-scale field study". Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems. SIGMETRICS '09 (ACM): 193–204. doi:10.1145/1555349.1555372. Retrieved 7 September 2011.
- Li, Huang, Shen, Chu (2010). "A Realistic Evaluation of Memory Hardware Errors and Software System Susceptibility". USENIX Annual Technical Conference 2010.
- Timothy J. Dell, A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory, (1997), IBM Microelectronics Division.
- Intel E7500 Chipset MCH Intel x4 Single Device Data Correction (x4 SDDC) Implementation and Validation, Intel Application note AP-726, August 2002.
- DRAM study turns assumptions about errors upside down, Ars Technica October 7, 2009.