Error recovery control


In computing, error recovery control (ERC) (Western Digital: time-limited error recovery (TLER), Samsung/Hitachi: command completion time limit (CCTL)) is a feature of hard disks which allows a system administrator to configure the amount of time a drive's firmware is allowed to spend recovering from a read or write error. Limiting the recovery time allows for improved error handling in a hardware RAID environment. Without such a limit, the hard drive and the hardware RAID controller can conflict over which of them should handle an error, which leads to drives being marked as unusable and to significant performance degradation that could otherwise have been avoided.

Overview

Modern hard drives feature an ability to recover from some read/write errors by internally remapping sectors and performing other forms of self test and recovery. This process can sometimes take several seconds or (under heavy usage) minutes, during which time the drive is unresponsive. Hardware RAID controllers are designed to recognise a drive which does not respond within a few seconds, and mark it as unreliable, indicating that it should be withdrawn from use and the array rebuilt from parity data. Rebuilding is a long process that degrades performance, and if more drives fail under the resulting additional workload, the outcome may be catastrophic.

If the drive itself is inherently reliable but has some bad sectors, then TLER and similar features prevent a disk from being unnecessarily marked as 'failed' by limiting the time spent on correcting detected errors before advising the array controller of a failed operation. The array controller can then handle the data recovery for the limited amount involved, rather than marking the entire drive as faulty.
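The interaction described above can be modelled in a few lines. The following is a minimal sketch, not any vendor's firmware or controller logic: the function name, its parameters and the 8-second controller timeout used in the usage note are illustrative assumptions.

```python
def outcome(recovery_s, controller_timeout_s, erc_limit_s=None):
    """Classify one error event in a hardware RAID array (illustrative model).

    recovery_s: time the drive firmware would need to recover the sector.
    controller_timeout_s: how long the RAID controller waits for a response.
    erc_limit_s: ERC/TLER limit in seconds, or None when the feature is disabled.
    """
    # With ERC enabled, the drive stops trying after erc_limit_s and reports an error.
    busy_s = recovery_s if erc_limit_s is None else min(recovery_s, erc_limit_s)
    if busy_s >= controller_timeout_s:
        return "drive dropped from array"
    if erc_limit_s is not None and recovery_s > erc_limit_s:
        return "sector error reported; controller repairs from redundancy"
    return "drive recovered on its own"
```

With an assumed controller timeout of 8 seconds, `outcome(30, 8)` models a drive without ERC being dropped, while `outcome(30, 8, erc_limit_s=7)` models the same error being handed to the controller as a single failed operation.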

Desktop Computers and TLER Effect

Effectively, TLER and similar features cut short on-drive error handling so that a hardware RAID controller can deal with a problematic error instead. In a non-hardware-RAID environment, such features are unhelpful, and manufacturers do not recommend their use.

Generally, Western Digital enterprise drives such as Raptor, Caviar RE2 and RE2-GP (RAID Edition) come with TLER Read "Enabled" (7 seconds) and TLER Write "Enabled" (7 seconds) while desktop drives such as Caviar SE, SE16, and GP come with TLER Read and Write Disabled (0 seconds).

Stand-Alone vs RAID Hard Disk Usage Considerations

It is best for TLER to be "Enabled" when in a hardware RAID array to prevent the recovery time from a disk read or write error from exceeding the hardware RAID controller's timeout threshold. If a drive times out, the hard disk will need to be manually re-added to the array, requiring a re-build and re-synchronization of the hard disk. Enabling TLER seeks to prevent this by interrupting error correction before timeout, to report failures only for data segments. The result is increased reliability in a hardware RAID array.

In a stand-alone configuration TLER should be disabled. As the drive is not redundant, reporting segments as failed will only increase manual intervention. Without a hardware RAID controller to drop the disk, normal (no TLER) recovery ability is most stable.

In a software RAID configuration, whether or not TLER is helpful depends on the operating system. For example, in FreeBSD the ATA/CAM stack controls the timeouts and is set to increase them progressively as timeouts occur. Thus, if a desktop disk without TLER starts delaying its response to a sector read, FreeBSD will retry the read with successively longer timeouts to avoid prematurely dropping the disk from the array.
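The idea of retrying with successively longer timeouts can be sketched as follows. This is not FreeBSD's ATA/CAM code: the doubling policy, the default values and the names `read_with_growing_timeout` and `slow_drive` are assumptions made purely for illustration.

```python
def read_with_growing_timeout(try_read, first_timeout_s=1.0, factor=2.0, max_attempts=5):
    """Retry try_read(timeout_s), lengthening the timeout after each failure.

    try_read must return the data, or raise TimeoutError if the drive did not
    answer within timeout_s. Returns (data, timeout_used).
    """
    timeout_s = first_timeout_s
    for _ in range(max_attempts):
        try:
            return try_read(timeout_s), timeout_s
        except TimeoutError:
            timeout_s *= factor  # assumed policy: double the timeout each time
    raise TimeoutError(f"no response after {max_attempts} attempts")


def slow_drive(needs_s):
    """Stand-in for a drive that needs needs_s seconds of internal recovery."""
    def try_read(timeout_s):
        if timeout_s < needs_s:
            raise TimeoutError
        return b"sector data"
    return try_read
```

In this model a drive that needs 5 seconds is eventually read successfully once the timeout has grown past that value, instead of being dropped after the first short timeout.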

Model                          | TLER default (read / write) | Stand-alone recommendation | RAID recommendation
Caviar, SE, SE16, GP, Raptor   | Disabled (0s / 0s)          | Default                    | Enabled (if possible)
Caviar RE2, RE2-GP, Red        | Enabled (7s / 0s)           | Disabled                   | Default

Interaction of TLER with the advanced ZFS filesystem

The ZFS/OpenZFS filesystem, originally developed by Sun Microsystems, was written to immediately write data to a sector that reports as bad or takes an excessively long time to read (as can happen on drives without TLER); on most drives this will usually force an immediate remap of the weak sector.

Western Digital Time Limit Error Recovery Utility - WDTLER.EXE

The WDTLER utility enables or disables the TLER parameter in the hard disk's firmware settings, allowing the user to choose the best setting for their particular usage as either a stand-alone or RAID drive. The utility is written for DOS, so a DOS-bootable disk containing it is required to use it.

The WDTLER utility operates on, and makes changes to, all compatible Western Digital hard drives connected to the computer. It is important to remember that any change will affect all of these drives. To change only specific hard drives, disconnect the other hard drives before using the utility, then reconnect them afterwards.

The WDTLER utility comes with three batch files: TLERSCAN.BAT, which reports the current TLER setting of all the hard drives; TLER-ON.BAT, which enables TLER; and TLER-OFF.BAT, which disables TLER. The included TLER-ON.BAT sets the read and write TLER time to 7 seconds. To use a custom timeout value, run the WDTLER.EXE utility directly with the -r# -w# parameters, which specify the time limit in seconds.

Western Digital now claims that using the WDTLER.EXE tool on newer drives can damage the firmware and make the disk unusable. The WDTLER.EXE tool is no longer available from Western Digital, and new disks will not be able to have the TLER setting changed. RE disks are only suitable for RAID arrays and Caviar disks are only suitable for non-RAID use. The utility still works for older disks.

Below is the WDTLER output for Western Digital Caviar SE16 320 GB and 500 GB hard disks, showing the default TLER configuration before and after TLER has been enabled.

Before - TLER Read & Write: Disabled

WDTLER Version 1.03
Copyright (C) 2004-2006 Western Digital Corporation
Western Digital Time Limit Error Recovery Utility

Model: WDC WD3200KS-00PFB0 Serial Number: WD-WCAPD1234567
   Read TLER is disabled.
   Write TLER is disabled.

Model: WDC WD3200KS-00PFB0 Serial Number: WD-WCAPD1234567
   Read TLER is disabled.
   Write TLER is disabled.

Model: WDC WD5000KS-00MNB0 Serial Number: WD-WMANU1234567
   Read TLER is disabled.
   Write TLER is disabled.

Model: WDC WD5000KS-00MNB0 Serial Number: WD-WMANU1234567
   Read TLER is disabled.
   Write TLER is disabled.

Legend: WD3200KS - Western Digital Caviar SE16 320 GB, WD5000KS - Western Digital Caviar SE16 500 GB


After - TLER Read & Write: 7 seconds

WDTLER Version 1.03
Copyright (C) 2004-2006 Western Digital Corporation
Western Digital Time Limit Error Recovery Utility

Model: WDC WD3200KS-00PFB0 Serial Number: WD-WCAPD1234567
   Read TLER time is 7.000 seconds.
   Write TLER time is 7.000 seconds.

Model: WDC WD3200KS-00PFB0 Serial Number: WD-WCAPD1234567
   Read TLER time is 7.000 seconds.
   Write TLER time is 7.000 seconds.

Model: WDC WD5000KS-00MNB0 Serial Number: WD-WMANU1234567
   Read TLER time is 7.000 seconds.
   Write TLER time is 7.000 seconds.

Model: WDC WD5000KS-00MNB0 Serial Number: WD-WMANU1234567
   Read TLER time is 7.000 seconds.
   Write TLER time is 7.000 seconds.

Legend: WD3200KS - Western Digital Caviar SE16 320 GB, WD5000KS - Western Digital Caviar SE16 500 GB


Note: Western Digital (1.5TB Green Power) WD15EADS-00P8B0 (Nov 2009) drives do not support TLER. WD15EADS-00S2B0 (Feb 2010) models do support TLER.

smartctl utility

On disks that fully implement the ATA-8[1] standard, the smartctl utility (part of the smartmontools package) can be used[2] to control the TLER behavior of many drives by setting the SCT Error Recovery Control (scterc) parameter:

  • Reading the current setting:
smartctl -l scterc /dev/sda
     SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled
  • Changing the setting (values are in units of 100 ms, so 200 means 20.0 seconds):
smartctl -l scterc,200,200 /dev/sda
    SCT Error Recovery Control:
               Read:    200 (20.0 seconds)
              Write:    200 (20.0 seconds)
  • Disabling TLER (allow unlimited time for recovery, for stand-alone drives not in a RAID array):
smartctl -l scterc,0,0 /dev/sda
     SCT Error Recovery Control set to:
           Read: Disabled
          Write: Disabled

These commands may not work on all hard drives: some manufacturers have changed their desktop drives to drop support for the ERC parameter,[3] purportedly to force sales of their more expensive RAID/enterprise models.
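When scripting these commands, the conversion between seconds and the 100 ms units used by the scterc parameter is an easy place to make mistakes. The sketch below builds the argument list and parses output of the form shown above; the helper names `scterc_args` and `parse_scterc` are hypothetical, and the parser assumes the output layout matches the samples in this article.

```python
import re


def scterc_args(read_s, write_s, device):
    """Build the smartctl argument list for ERC limits given in seconds (0 disables)."""
    read_units = round(read_s * 10)    # scterc values are in units of 100 ms
    write_units = round(write_s * 10)
    return ["smartctl", "-l", f"scterc,{read_units},{write_units}", device]


def parse_scterc(output):
    """Return {'read': seconds, 'write': seconds}; None means the limit is disabled."""
    limits = {}
    for line in output.splitlines():
        match = re.match(r"\s*(Read|Write):\s*(Disabled|\d+)", line)
        if match:
            value = match.group(2)
            limits[match.group(1).lower()] = None if value == "Disabled" else int(value) / 10
    return limits
```

The list returned by `scterc_args(7, 7, "/dev/sda")` could be passed to `subprocess.run`, which normally requires administrator privileges and a drive that supports the feature.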

RAID Controllers

The disconnect timeout of hardware RAID controllers varies from vendor to vendor. TLER should trigger before the controller times out the drive. For example:

  • 3ware 9650SE: 20 seconds
  • LSI Logic (for IBM x-series): 10 seconds (see BIOS Raid Config Utility > Advanced Device Properties)
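The rule that TLER should trigger before the controller's timeout can be checked mechanically. The following sketch uses the two timeout values listed above; the function name `erc_is_safe` and the 1-second safety margin are assumptions for illustration, not vendor guidance.

```python
# Controller disconnect timeouts in seconds, taken from the list above.
CONTROLLER_TIMEOUT_S = {
    "3ware 9650SE": 20,
    "LSI Logic (IBM x-series)": 10,
}


def erc_is_safe(erc_s, controller, margin_s=1.0):
    """True if the drive gives up at least margin_s before the controller drops it.

    An erc_s of 0 means ERC is disabled (unlimited recovery time), which is never safe here.
    """
    return erc_s > 0 and erc_s + margin_s <= CONTROLLER_TIMEOUT_S[controller]
```

Under these assumptions the common 7-second setting fits both controllers, whereas a 20-second limit would not be safe on either.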

Software RAID

  • Linux mdadm simply holds and lets the drive complete its recovery. However, the default command timeout for the SCSI disk layer (/sys/block/sd?/device/timeout) is 30 seconds,[4] after which it will attempt to reset the drive and, if that fails, take the drive offline.[5]
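On Linux, the relationship between a drive's ERC limit and the kernel's command timeout can be inspected by reading the sysfs file mentioned above. This is a minimal sketch: the function names are hypothetical, and the `sysfs_root` parameter exists only so the code can be tried against a test directory.

```python
from pathlib import Path


def kernel_timeout_s(device, sysfs_root="/sys/block"):
    """Read the SCSI command timeout in seconds for a block device such as 'sda'."""
    path = Path(sysfs_root) / device / "device" / "timeout"
    return int(path.read_text().strip())


def erc_below_kernel_timeout(erc_s, device, sysfs_root="/sys/block"):
    """True if ERC is enabled (non-zero) and expires before the kernel would reset the drive."""
    return 0 < erc_s < kernel_timeout_s(device, sysfs_root)
```

For a drive with a 7-second ERC limit and the default 30-second timeout, `erc_below_kernel_timeout(7, "sda")` would be expected to return True.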
