Hard disk drive failure
A hard disk drive failure occurs when a hard disk drive malfunctions and the stored information cannot be accessed with a properly configured computer. A disk failure may occur in the course of normal operation, or due to an external factor such as exposure to fire or water or high magnetic fields, or suffering a sharp impact or environmental contamination, which can lead to a head crash.
The most notorious cause of hard-disk failure is a head crash, where the internal read-and-write head of the device, which usually just hovers above the surface, touches a platter or scratches the magnetic data-storage surface. A head crash usually incurs severe data loss, and data recovery attempts may cause further damage if not performed by a specialist with proper equipment. Hard-drive platters are coated with an extremely thin layer of non-electrostatic lubricant, so that the read-and-write head will simply glance off the surface of the platter should a collision occur. However, this head hovers mere nanometers from the platter's surface, which makes a collision an acknowledged risk. Another cause of failure is a faulty air filter. The air filters on today's hard drives equalize the atmospheric pressure and moisture between the hard-drive enclosure and its outside environment. If the filter fails to capture a dust particle, the particle can land on the platter, causing a head crash if the head happens to sweep over it. After a hard-drive crash, each particle from the damaged platter and head media can cause a bad sector. These, in addition to platter damage, will quickly render a hard drive useless. A hard drive also includes controller electronics, which occasionally fail. In such cases, it may be possible to recover all data.
Since hard drives are mechanical devices, they will all eventually fail. While some may not fail prematurely, many hard drives simply fail because of worn-out parts. Many hard-drive manufacturers include a Mean Time Between Failures figure on product packaging or in promotional literature. These figures are calculated by constantly running samples of the drive for a short amount of time, analyzing the resultant wear and tear upon the physical components of the drive, and extrapolating to provide a reasonable estimate of its lifespan. Since this fails to account for phenomena such as the aforementioned head crash, external trauma (dropping or collision), power surges, and so forth, the Mean Time Between Failures number is not generally regarded as an accurate estimate of a drive's lifespan. Hard-drive failures tend to follow the bathtub curve. Hard drives typically fail within a short time if there is a defect present from manufacturing. If a hard drive proves reliable for a period of a few months after installation, it has a significantly greater chance of remaining reliable. Therefore, even if a hard drive is subjected to several years of heavy daily use, it may not show any notable signs of wear unless closely inspected. On the other hand, a hard drive can fail at any time in many different situations.
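The bathtub curve can be illustrated with a small numerical sketch. The model below is an assumption made for illustration only: it adds a decreasing Weibull hazard (manufacturing defects), a constant hazard (random failures), and an increasing Weibull hazard (wear-out). The function names and every parameter value are hypothetical and are not fitted to any real drive population.

```python
def weibull_hazard(t_years: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull lifetime model at age t_years."""
    if t_years <= 0:
        raise ValueError("age must be positive")
    return (shape / scale) * (t_years / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    """Illustrative bathtub-shaped hazard (failures per drive-year).

    All constants below are made-up values chosen only to show the shape.
    """
    infant_mortality = weibull_hazard(t_years, shape=0.5, scale=20.0)  # falls with age
    random_failures = 0.02                                             # constant
    wear_out = weibull_hazard(t_years, shape=5.0, scale=8.0)           # rises with age
    return infant_mortality + random_failures + wear_out


if __name__ == "__main__":
    for age in (0.1, 0.5, 1, 2, 3, 5, 7):
        print(f"age {age:>4} years: hazard {bathtub_hazard(age):.3f} per year")
```

With these made-up parameters the hazard is high for a very young drive, drops to a minimum after a few years, and climbs again as wear-out takes over, which is the qualitative shape the text describes.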
The phenomenon of disk failure is not limited to hard drives. Other media types are prone to failure; in the late 1990s the click of death, so called because affected drives would endlessly click when disks were inserted into them, plagued many users of Iomega's 100 megabyte Zip disks.
CD-ROM and DVD writeable media can fail over time due to degradation of the organic dye layer. Studies done by NIST under harsh conditions of light, temperature and humidity demonstrated sharp increases in bit errors after only 100 hours (with the exception of gold/phthalocyanine technology, which is far more durable). Although gold-layer disks are advertised to last 100–300 years, the NIST report suggests only that they are at least stable for "several tens of years" when stored properly. Drives with ever-increasing read and write speeds rotate CD and DVD media at over 25,000 rpm. Disks have been demonstrated to crack at 30,000 rpm due to centrifugal force.
Signs of hard-disk failure
Hard-drive failure can be catastrophic or gradual. The former typically presents as a drive that can no longer be detected by CMOS setup, or that fails to pass BIOS POST so that the operating system never sees it. Gradual hard-drive failure can be harder to diagnose, because its symptoms, such as corrupted data and a slowing PC (caused by gradually failing areas of the hard drive requiring repeated read attempts before successful access), can also be caused by many other computer issues, such as malware. A rising number of bad sectors can be a sign of a failing hard drive. However, because the hard drive automatically adds them to its own growth defect table, they may not become evident to utilities such as ScanDisk unless the utility catches them before the hard drive's defect management system does, or the spare sectors held in reserve by that system run out. A cyclical, repetitive pattern of seek activity, such as rapid or slower seek-to-end noises (the click of death), can indicate hard drive problems.
Landing zones and load/unload technology
During normal operation, heads in HDDs fly above the data recorded on the disks. Modern HDDs prevent power interruptions or other malfunctions from landing their heads in the data zone by either physically moving (parking) the heads to a special landing zone on the platters that is not used for data storage, or by physically locking the heads in a suspended (unloaded) position raised off the platters. Some early PC HDDs did not park the heads automatically when power was prematurely disconnected, and the heads would land on data. In some other early units the user would run a program to manually park the heads.
A landing zone is an area of the platter usually near its inner diameter (ID), where no data is stored. This area is called the Contact Start/Stop (CSS) zone. Disks are designed such that either a spring or, more recently, rotational inertia in the platters is used to park the heads in the case of unexpected power loss. In this case, the spindle motor temporarily acts as a generator, providing power to the actuator.
Spring tension from the head mounting constantly pushes the heads towards the platter. While the disk is spinning, the heads are supported by an air bearing and experience no physical contact or wear. In CSS drives the sliders carrying the head sensors (often also just called heads) are designed to survive a number of landings and takeoffs from the media surface, though wear and tear on these microscopic components eventually takes its toll. Most manufacturers design the sliders to survive 50,000 contact cycles before the chance of damage on startup rises above 50%. However, the decay rate is not linear: when a disk is younger and has had fewer start-stop cycles, it has a better chance of surviving the next startup than an older, higher-mileage disk (as the head literally drags along the disk's surface until the air bearing is established). For example, the Seagate Barracuda 7200.10 series of desktop hard disks is rated to 50,000 start-stop cycles; in other words, no failures attributed to the head-platter interface were seen before at least 50,000 start-stop cycles during testing.
Around 1995 IBM pioneered a technology where a landing zone on the disk is made by a precision laser process (Laser Zone Texture = LZT) producing an array of smooth nanometer-scale "bumps" in a landing zone, thus vastly improving stiction and wear performance. This technology is still largely in use today, predominantly in desktop and enterprise (3.5 inch) drives. In general, CSS technology can be prone to increased stiction (the tendency for the heads to stick to the platter surface), e.g. as a consequence of increased humidity. Excessive stiction can cause physical damage to the platter and slider or spindle motor.
Load/Unload technology relies on the heads being lifted off the platters into a safe location, thus eliminating the risks of wear and stiction altogether. The first HDD, the RAMAC, and most early disk drives used complex mechanisms to load and unload the heads. Modern HDDs use ramp loading, first introduced by Memorex in 1967, to load/unload onto plastic "ramps" near the outer disk edge.
Addressing shock robustness, IBM also created a technology for their ThinkPad line of laptop computers called the Active Protection System. When a sudden, sharp movement is detected by the built-in accelerometer in the ThinkPad, internal hard disk heads automatically unload themselves to reduce the risk of any potential data loss or scratch defects. Apple later also used this technology, under the name Sudden Motion Sensor, in their PowerBook, iBook, MacBook Pro, and MacBook lines. Sony, HP (with their HP 3D DriveGuard) and Toshiba have released similar technology in their notebook computers.
Modes of failure
Hard drives may fail in a number of ways. Failure may be immediate and total, progressive, or limited. Data may be totally destroyed, or partially or totally recoverable.
Earlier drives tended to develop bad sectors with use and wear, which could be "mapped out" so that they did not affect operation; this was considered normal unless many bad sectors developed in a short period. Later drives map out bad sectors automatically and invisibly to the user; S.M.A.R.T. information logs these problems. A drive with bad sectors may usually continue to be used.
Other failures which may be either progressive or limited are usually considered to be a reason to replace a drive; the value of data potentially at risk usually far outweighs the cost saved by continuing to use a drive which may be failing. Repeated but recoverable read or write errors, unusual noises, excessive and unusual heating, and other abnormalities, are warning signs.
- Head crash: a head may contact the rotating platter due to mechanical shock or other reason. At best this will cause irreversible damage and data loss where contact was made. In the worst case the debris scraped off the damaged area may contaminate all heads and platters, and destroy all data on all platters. If damage is initially only partial, continued rotation of the drive may extend the damage until it is total.
- Bad sectors: some magnetic sectors may become faulty without rendering the whole drive unusable. This may be a limited occurrence or a sign of imminent failure.
- Stiction: after a time the head may not "take off" when started up as it tends to stick to the platter, a phenomenon known as stiction. This is usually due to unsuitable lubrication properties of the platter surface, a design or manufacturing defect rather than wear. This occasionally happened with some designs until the early 1990s.
- Circuit failure: components of the electronic circuitry may fail, making the drive inoperable.
- Bearing and motor failure: electric motors may fail or burn out, and bearings may wear enough to prevent proper operation.
- Miscellaneous mechanical failures: parts, particularly moving parts, of any mechanism can break or fail, preventing normal operation, with possible further damage caused by fragments.
Metrics of failures
Most major hard disk and motherboard vendors now support S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology), which measures drive characteristics such as operating temperature, spin-up time, and data error rates. Certain trends and sudden changes in these parameters are thought to be associated with increased likelihood of drive failure and data loss. However, S.M.A.R.T. parameters alone may not be useful for predicting individual drive failures. While several S.M.A.R.T. parameters have an impact on failure probability, a large fraction of failed drives do not produce predictive S.M.A.R.T. parameters. Unpredictable breakdown may occur at any time in normal use, with potential loss of all data. Recovery of some or even all data from a damaged drive is sometimes, but not always, possible, and is normally costly.
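As a rough illustration of watching for "trends and sudden changes", the sketch below compares two snapshots of S.M.A.R.T. raw counters and reports any watched counter that has grown. The snapshot format (a plain dictionary), the function name, and the choice of watched attributes are assumptions made for this example; how the values are read from a drive is outside its scope. In line with the caveat above, a rising counter is only a warning sign, and a drive with unchanged counters can still fail.

```python
# Attribute names follow a common naming convention for S.M.A.R.T. attributes;
# which ones to watch is an assumption made for this example.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def rising_counters(previous, current, watched=WATCHED):
    """Return {attribute: increase} for watched raw counters that grew.

    `previous` and `current` are hypothetical snapshots mapping attribute
    names to raw integer values, taken at two different times.
    """
    deltas = {}
    for name in watched:
        if name in previous and name in current:
            delta = current[name] - previous[name]
            if delta > 0:
                deltas[name] = delta
    return deltas


if __name__ == "__main__":
    # Made-up readings for illustration only.
    last_week = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0}
    today = {"Reallocated_Sector_Ct": 8, "Current_Pending_Sector": 0}
    print(rising_counters(last_week, today))
```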
A 2007 study published by Google suggested very little correlation between failure rates and either high temperature or activity level. Indeed, the Google study indicated that "lower temperatures are associated with higher failure rates". Hard drives with S.M.A.R.T.-reported average temperatures below 27 °C (81 °F) had higher failure rates than hard drives with the highest reported average temperature of 50 °C (122 °F), and their failure rates were at least twice as high as those in the optimum S.M.A.R.T.-reported temperature range of 36 °C (97 °F) to 47 °C (117 °F). The correlation between manufacturer/model and failure rate was relatively strong. Statistics in this matter are kept highly secret by most entities; Google did not relate manufacturers' names with failure rates, though they have since revealed that they use Hitachi Deskstar drives in some of their servers.
Google's 2007 study found, based on a large field sample of drives, that actual annualized failure rates (AFRs) for individual drives ranged from 1.7% for first-year drives to over 8.6% for three-year-old drives. A similar 2007 study at CMU on enterprise drives showed that measured MTBF was 3–4 times lower than the manufacturer's specification, with an estimated 3% mean AFR over 1–5 years based on replacement logs for a large sample of drives, and that hard drive failures were highly correlated in time.
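Field figures of this kind are typically derived from failure or replacement counts and the total operating time of the population. The sketch below shows one simple way to do this, treating AFR as failures per drive-year and assuming continuous operation (8,760 hours per year). The function names and the input numbers are hypothetical and are not taken from either study.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days, assuming drives are always powered on


def observed_afr(failures: int, drive_hours: float) -> float:
    """Failures per drive-year, as a fraction (0.03 means 3%)."""
    if drive_hours <= 0:
        raise ValueError("drive_hours must be positive")
    drive_years = drive_hours / HOURS_PER_YEAR
    return failures / drive_years


def observed_mtbf_hours(failures: int, drive_hours: float) -> float:
    """Total operating hours divided by the number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return drive_hours / failures


if __name__ == "__main__":
    # Made-up population: 1,000 drive slots kept running for one year,
    # with 30 replacements logged during that time.
    hours = 1000 * HOURS_PER_YEAR
    print(f"AFR:  {observed_afr(30, hours):.1%}")
    print(f"MTBF: {observed_mtbf_hours(30, hours):,.0f} hours")
```

For this made-up population the sketch gives an AFR of 3% and an observed MTBF of 292,000 hours.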
A 2007 study of latent sector errors (as opposed to the above studies of complete disk failures) showed that 3.45% of 1.5 million disks developed latent sector errors over 32 months (3.15% of nearline disks and 1.46% of enterprise class disks developed at least one latent sector error within twelve months of their ship date), with the annual sector error rate increasing between the first and second years. Enterprise drives showed fewer sector errors than consumer drives. Background scrubbing was found to be effective in correcting these errors.
SCSI, SAS, and FC drives are more expensive than consumer-grade SATA drives and are usually used in servers and disk arrays, whereas SATA drives were sold to the home computer, desktop, and near-line storage markets and were perceived to be less reliable. This distinction is now becoming blurred.
The mean time between failures (MTBF) of SATA drives is usually specified to be about 1.2 million hours (some drives, such as the Western Digital Raptor, are rated at 1.4 million hours MTBF), while SAS/FC drives are rated for upwards of 1.6 million hours. However, independent research indicates that MTBF is not a reliable estimate of a drive's longevity (service life). MTBF testing is conducted in laboratory environments in test chambers and is an important metric to determine the quality of a disk drive, but it is designed to measure only the relatively constant failure rate over the service life of the drive (the middle of the "bathtub curve") before the final wear-out phase. A more interpretable, but equivalent, metric to MTBF is annualized failure rate (AFR). AFR is the percentage of drive failures expected per year. Both AFR and MTBF tend to measure reliability only in the initial part of the life of a hard drive, thereby understating the real probability of failure of a used disk drive.
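The equivalence between MTBF and AFR can be made concrete with a short conversion sketch. It assumes a constant failure rate (an exponential lifetime model) and continuous operation of 8,760 hours per year; under those assumptions AFR = 1 − exp(−8760 / MTBF), which for large MTBF values is close to the simpler approximation 8760 / MTBF. The function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # assumes the drive is powered on all year


def mtbf_to_afr(mtbf_hours: float) -> float:
    """Probability of failure within one year, given a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def afr_to_mtbf(afr: float) -> float:
    """Inverse conversion; afr is a fraction strictly between 0 and 1."""
    if not 0.0 < afr < 1.0:
        raise ValueError("AFR must be between 0 and 1 (exclusive)")
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    for mtbf in (1_200_000, 1_400_000, 1_600_000):
        print(f"MTBF {mtbf:>9,} h -> AFR {mtbf_to_afr(mtbf):.2%}")
```

Under these assumptions, a specified MTBF of 1.2 million hours corresponds to an AFR of about 0.73%, and 1.6 million hours to about 0.55%.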
In order to avoid the loss of data due to disk failure, common solutions include:
- Data backup
- Data redundancy
- Active hard-drive protection
- S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) included in hard drives
- Base isolation used under server racks in data centers
Data from a failed drive can sometimes be partially or totally recovered if the platters' magnetic coating is not totally destroyed. Specialised companies carry out data recovery, at significant cost, by opening the drives in a clean room and using appropriate equipment to read data from the platters directly. If the electronics have failed, it is sometimes possible to replace the electronics board, though often drives of nominally exactly the same model manufactured at different times have different, incompatible, circuit boards.
Sometimes operation can be restored for long enough to recover data, perhaps requiring reconstruction techniques such as file carving. Risky techniques are justifiable if the drive is otherwise dead. A drive that starts up once may run for a shorter or longer time but never start again, so as much data as possible should be recovered as soon as the drive starts. A 1990s drive that does not start due to stiction can sometimes be started by tapping it or rotating the body of the drive rapidly by hand. Another technique that sometimes works is to cool the drive, in a waterproof wrapping, in a domestic freezer. This method is widely discussed in blogs and forums, and professionals also resort to it with some success.
- "Barracuda 7200.10 Serial ATA Product Manual" (PDF). Retrieved 26 April 2012.
- IEEE.org, Baumgart, P.; Krajnovich, D.J.; Nguyen, T.A.; Tam, A.G.; IEEE Trans. Magn.
- Pugh et al.; "IBM's 360 and Early 370 Systems"; MIT Press, 1991, pp.270
- "Sony | For Business | VAIO SMB". B2b.sony.com. Retrieved 13 March 2009.
- "HP.com" (PDF). Retrieved 26 April 2012.
- "Toshiba HDD Protection measures." (PDF). Retrieved 26 April 2012.
- "Hard Drives". escotal.com. Retrieved 16 July 2011.
- Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso (February 2007). "Failure Trends in a Large Disk Drive Population". USENIX Conference on File and Storage Technologies. 5th USENIX Conference on File and Storage Technologies (FAST 2007). Retrieved 15 September 2008.
- Shankland, Stephen (1 April 2009). "CNet.com". News.cnet.com. Retrieved 26 April 2012.
- AFR broken down by age groups: Failure Trends in Large Disk Drive Population, p. 4, figure 2 and subsequent figures.
- Bianca Schroeder and Garth A. Gibson. "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?". Proceedings of the 5th USENIX Conference on File and Storage Technologies. 2007.
- L.N. Bairavasundaram, G.R. Goodson, S. Pasupathy, J. Schindler. "An analysis of latent sector errors in disk drives". Proceedings of SIGMETRICS'07, June 12-16, 2007.
- "WD VelociRaptor Drive Specification Sheet" (PDF). Retrieved 26 April 2012.
- Jay White (Sept 2011). "Storage subsystem resiliency guide". NetApp technical report TR-3437, page 5.
- "Everything You Know About Disks Is Wrong". StorageMojo. 20 February 2007. Retrieved 29 August 2007.
- "One aspect of disk failures that single-value metrics such as MTTF and AFR cannot capture is that in real life failure rates are not constant. Failure rates of hardware products typically follow a "bathtub curve" with high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle."(Schroeder et al. 2007)
- David A. Patterson; John L. Hennessy (13 October 2011). Computer Organization and Design, Revised Fourth Edition: The Hardware/Software Interface. Section 6.12. Elsevier. pp. 613–. ISBN 978-0-08-088613-8. - "...disk manufacturers argue that the calculation [of MTBF] corresponds to a user who buys a disk and keeps replacing the disk every five years- the planned lifetime of the disk."
- Decrypting hard-drive failures – MTBF and AFR
- "Detailed description of drive that worked for 20 minutes after freezing". Geeksaresexy.blogspot.com. 19 January 2006. Retrieved 26 April 2012.
- "Failing Hard Drives and the Freezer Technique Revisited". DtiData. 18 March 2011. Retrieved 26 April 2012.