Self-Monitoring, Analysis and Reporting Technology: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 22:35, 4 February 2006

Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T., is a monitoring system for computer hard disks that detects and reports on various indicators of reliability, in the hope of anticipating failures.

Fundamentally, hard drives can suffer one of two classes of failure:

Predictable
Some failure modes, especially mechanical wear and aging, happen gradually over time. A monitoring device can detect these, much as a temperature dial on the dashboard of an automobile can warn a driver — before serious damage occurs — that the engine has started to overheat.
Unpredictable
Other failures may occur suddenly and unpredictably, such as an electronic component burning out.

Monitoring a drive's behavior can predict approximately 60 percent of hard drive failures. The purpose of S.M.A.R.T. is to warn a user or system administrator of impending drive failure while time remains to take preventive action, such as copying the data to a replacement device.

Compaq pioneered S.M.A.R.T., but most major hard-drive and motherboard vendors now support it to some extent. Many motherboards will display a warning message when a disk drive approaches failure. However, S.M.A.R.T. is currently not implemented correctly on many computer platforms because there are no industry-wide software and hardware standards for S.M.A.R.T. data interchange.

From a legal perspective, the term "S.M.A.R.T." refers only to a signaling method between internal disk drive electromechanical sensors and the host computer. A manufacturer could therefore include a sensor for just one physical attribute and then claim the product is S.M.A.R.T. compatible. For example, some drive manufacturers claim to support S.M.A.R.T. but do not include a temperature sensor. For electronic devices, reliability typically decreases as temperature rises, so temperature can be a crucial predictor of failure. During periods of heavy usage (such as "defrag" operations, or on a web server), internal drive temperature can exceed the manufacturer's published specifications. Damage to electronics from excessive temperature is cumulative. A S.M.A.R.T.-compliant temperature sensor can warn the operator before a disk drive is damaged by excessive heat, but this sensor is frequently omitted.

With regard to disk drives in particular, the term "S.M.A.R.T." is therefore a virtually meaningless standard, because many drive manufacturers claim to support it but refuse to disclose which physical characteristics are monitored by onboard sensors. This creates confusion and prevents the consumer from making valid comparisons. Manufacturing companies which claim to support S.M.A.R.T. but withhold specific sensor information on individual products include Seagate, [...]

Some disk controllers can duplicate all write operations on a secondary "backup" drive in real time. This feature is known as "R.A.I.D. mirroring." However, many programs designed to analyze changes in drive behavior and relay S.M.A.R.T. alerts to the operator do not function when a computer system is configured for R.A.I.D. support. Additionally, some S.M.A.R.T.-enabled motherboards and related software may not communicate with certain S.M.A.R.T.-capable drives, depending on the type of interface (e.g. SCSI, Fibre Channel, IDE, SATA, SAS, SSA). It is probable that the computer industry will correctly implement S.M.A.R.T. only when a significant percentage of consumers demand compatibility, standardization, and full disclosure from manufacturers.

Attributes

Each drive manufacturer defines a set of attributes and selects threshold values which the attribute values should not reach under normal operation. Attribute values can range from 1 to 253 (1 representing the worst case and 253 representing the best), so a value that falls to its threshold signals a problem. Depending on the manufacturer, a value of 100 or 200 will often be chosen as the "normal" value. Manufacturers that have supported one or more S.M.A.R.T. parameters in various products include Samsung, Seagate, IBM (Hitachi), Fujitsu, Maxtor, and WD (Western Digital). These manufacturers do not necessarily agree on precise attribute definitions and measurement units; therefore the following list should be regarded as a general reference only.
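The comparison of an attribute value against its threshold can be sketched as follows. This is a minimal illustration, assuming that a normalized value at or below the threshold counts as failing; the `Attribute` class, the `is_failing` helper, and the sample numbers are hypothetical and do not come from any vendor tool or specification.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    attr_id: int    # attribute number, e.g. 5 for Reallocated Sectors Count
    name: str
    value: int      # normalized value, 1 (worst case) to 253 (best)
    threshold: int  # manufacturer-chosen limit (made-up numbers below)


def is_failing(attr: Attribute) -> bool:
    """Return True when the normalized value has dropped to or below the threshold."""
    if not 1 <= attr.value <= 253:
        raise ValueError(f"normalized value out of range: {attr.value}")
    return attr.value <= attr.threshold


# Illustrative readings only, not taken from a real drive.
healthy = Attribute(5, "Reallocated Sectors Count", value=100, threshold=36)
worn = Attribute(5, "Reallocated Sectors Count", value=30, threshold=36)

print(is_failing(healthy))  # False
print(is_failing(worn))     # True
```

Because manufacturers choose different "normal" values (often 100 or 200), the sketch compares each attribute only against its own threshold rather than against a fixed number.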


KNOWN S.M.A.R.T. PARAMETERS

(Parameters marked "CRITICAL" are potential indicators of imminent electromechanical failure)


Read Error Rate (01) *CRITICAL*

Indicates the rate of hardware read errors that occurred when reading data from a disk surface. Lower values indicate a problem with either the disk surface or the read/write heads.

Throughput Performance (02)

Overall (general) throughput performance of a hard disk drive. If the value of this attribute is decreasing, there is a high probability of a problem with the disk.

Spin-Up Time (03)

Average time of spindle spin up (from zero RPM to fully operational).

Start/Stop Count (04)

A tally of spindle start/stop cycles.

Reallocated Sectors Count (05) *CRITICAL*

Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks the sector as "reallocated" and transfers the data to a special reserved area (spare area). This process is also known as remapping, and "reallocated" sectors are called remaps. This is why, on modern hard disks, you cannot see "bad blocks" while testing the surface: all bad blocks are hidden in reallocated sectors. However, the more sectors that are reallocated, the more read/write speed will decrease.

Read Channel Margin (06)

Margin of a channel while reading data. The function of this attribute is not specified.

Seek Error Rate (07)

Rate of seek errors of the magnetic heads. Seek errors arise when there is a failure in the mechanical positioning system, servo damage, or thermal widening of the hard disk. More seek errors indicate a worsening condition of the disk surface and the mechanical subsystem.

Seek Time Performance (08)

Average performance of seek operations of the magnetic heads. If this attribute is decreasing, it is a sign of problems in the mechanical subsystem.

Power-On Hours (09)

Count of hours in the power-on state. The raw value of this attribute shows the total count of hours (or minutes, or seconds, depending on the manufacturer) in the power-on state. A decrease of this attribute value to the critical level (threshold) indicates a decrease of the MTBF (Mean Time Between Failures). However, in reality, even if the MTBF value falls to zero, it does not mean that the MTBF resource is completely exhausted or that the drive will not function normally.

Spin Retry Count (10)

Count of spin-start retry attempts. This attribute stores the total count of spin-start attempts made to reach the fully operational speed (under the condition that the first attempt was unsuccessful). A decrease of this attribute value is a sign of problems in the hard disk mechanical subsystem.

Recalibration Retries (11)

This attribute indicates the number of times recalibration was requested (under the condition that the first attempt was unsuccessful). A decrease of this attribute value is a sign of problems in the hard disk mechanical subsystem.

Device Power Cycle Count (12)

This attribute indicates the count of full hard disk power on/off cycles.

Soft Read Error Rate (13)

Uncorrected read errors reported to the operating system.

G-Sense Error Rate (221)

The number of errors resulting from externally induced shock and vibration.

Power-Off Retract Cycle (228)

The number of times the magnetic armature was retracted automatically as a result of cutting power.

Load/Unload Cycle (193)

Count of load/unload cycles into head landing zone position.

Temperature (194)

Current internal temperature.

Reallocation Event Count (196) *CRITICAL*

Count of remap operations. The raw value of this attribute shows the total number of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted.

Current Pending Sector Count (197) *CRITICAL*

Number of "unstable" sectors (waiting to be remapped). When unstable sectors are read successfully, the value is decreased. If errors occur when reading a sector, the drive will attempt to recover the data, transfer it to the reserved (spare) area and mark the sector as remapped.

Uncorrectable Sector Count (198) *CRITICAL*

The total number of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.

UltraDMA CRC Error Count (199)

The number of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).

Write Error Rate/Multi-Zone Error Rate (200)

The total number of errors when writing a sector.

Disk Shift (220) *CRITICAL*

Distance the disk has shifted relative to the spindle (usually due to shock). Unit of measure is unknown.

Loaded Hours (222)

Time spent operating under data load (movement of the magnetic head armature).

Load/Unload Retry Count (223)

Number of times the head changes position.

Load Friction (224)

Resistance caused by friction in mechanical parts while operating.

Load 'In'-time (226)

Total time of loading on the magnetic head actuator (time not spent in the parking area).

Torque Amplification Count (227)

Number of attempts to compensate for platter speed variations.

GMR Head Amplitude (230)

Amplitude of "thrashing" (distance of repetitive forward/reverse head motion).
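As an illustration of how the *CRITICAL* markings in the list above might be used, the following sketch flags critical attributes whose normalized value has fallen to or below its threshold. The `CRITICAL_IDS` set is copied from the markings in this list; the `critical_warnings` function, the input layout, and the sample readings are assumptions made for this example and are not part of any S.M.A.R.T. tool.

```python
# Attribute numbers marked *CRITICAL* in the list above.
CRITICAL_IDS = {1, 5, 196, 197, 198, 220}


def critical_warnings(readings):
    """Return the critical attribute numbers that have reached their threshold.

    readings maps an attribute number to a (normalized value, threshold) pair.
    This layout is a hypothetical one chosen for the sketch.
    """
    warnings = []
    for attr_id, (value, threshold) in sorted(readings.items()):
        if attr_id in CRITICAL_IDS and value <= threshold:
            warnings.append(attr_id)
    return warnings


# Made-up readings: attribute 5 is below its threshold; attribute 9 is too,
# but it is not marked critical in the list, so it is not reported.
sample = {1: (100, 51), 5: (20, 36), 9: (10, 20), 194: (40, 0), 197: (100, 0)}
print(critical_warnings(sample))  # [5]
```

Since manufacturers do not agree on attribute definitions, a real monitor would likely need a per-vendor table rather than the single set used here.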

References

"S.M.A.R.T. attribute meaning". PalickSoft. February 3.

External links

Software

Various operating-system-specific software can extend the user's ability to monitor disk drive conditions through the S.M.A.R.T. interface and predict when a failure is likely to occur by logging deviations in attribute values. This software may also be able to distinguish between gradual degradation over time (representing normal wear) and a sudden change (which may indicate a more serious problem).
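One way such software could separate gradual wear from a sudden change is to compare successive logged values of an attribute. The sketch below is only an illustration under stated assumptions: the `classify_change` function and its `sudden_drop` cutoff of 10 points are hypothetical, and real monitoring tools may use entirely different criteria.

```python
def classify_change(history, sudden_drop=10):
    """Classify a logged series of normalized attribute values (oldest first).

    sudden_drop is an arbitrary illustrative cutoff, not a standard value.
    """
    if len(history) < 2:
        return "insufficient data"
    # Difference between each pair of consecutive log entries.
    steps = [later - earlier for earlier, later in zip(history, history[1:])]
    if min(steps) <= -sudden_drop:
        return "sudden change"
    if history[-1] < history[0]:
        return "gradual degradation"
    return "stable"


print(classify_change([100, 99, 99, 98, 97]))  # gradual degradation
print(classify_change([100, 100, 82, 81]))     # sudden change
print(classify_change([100, 100, 100]))        # stable
```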