Data deduplication

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 128.222.37.53 (talk) at 04:15, 28 August 2009 (→‎References: organized products alphabetically by first word, added EMC Avamar link). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Data deduplication is the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy to be stored, while an index of all the data is retained so that any item can still be retrieved when required. Because only the unique data is stored, deduplication reduces the required storage capacity. For example, a typical email system might contain 100 instances of the same one-megabyte (MB) file attachment. If the email platform is backed up or archived without deduplication, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is simply a reference to the one saved copy. In this example, a 100 MB storage demand could be reduced to only 1 MB.
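The email example can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular product works: the `DedupStore` class, its method names, and the mailbox paths are hypothetical, and SHA-256 is assumed as the fingerprint function.

```python
import hashlib


class DedupStore:
    """Toy single-instance store (hypothetical): each unique blob is kept once."""

    def __init__(self):
        self.blobs = {}  # digest -> the single stored copy of the data
        self.index = {}  # item name -> digest, so every item stays retrievable

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, data)
        # Always record the reference, even for duplicates.
        self.index[name] = digest

    def get(self, name):
        return self.blobs[self.index[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


attachment = b"x" * (1024 * 1024)  # one 1 MB attachment
store = DedupStore()
for i in range(100):
    store.put(f"mailbox-{i}/attachment.bin", attachment)

logical = 100 * len(attachment)
print(logical // 2**20, "MB logical,", store.stored_bytes() // 2**20, "MB stored")
# prints: 100 MB logical, 1 MB stored
```

The index still holds 100 entries, so each mailbox can retrieve its attachment, but the payload itself occupies the space of a single copy.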

Benefits

In general, data deduplication improves data protection, increases the speed of service, and reduces costs.

  • The business benefits of data deduplication range from improved overall data integrity to lower overall data protection costs. Data deduplication can reduce the amount of disk needed for backup by 90 percent or more.
  • With reduced acquisition costs, and reduced power, space, and cooling requirements, disk becomes suitable for first-stage backup and restore and for retention that can easily extend to months.
  • With data on disk, restore service levels are higher, media handling errors are reduced, and more recovery points are available on fast recovery media.
  • Data deduplication also reduces the data that must be sent across a WAN for remote backups, replication, and disaster recovery.
  • Data deduplication is also valuable in virtual environments, where it can deduplicate the VMDK files needed to deploy virtual machines.
  • Data deduplication can also deduplicate snapshot files (for example, VMSN and VMSD files in VMware), giving considerable cost savings compared with a conventional disk backup environment while still providing more recovery points for disaster recovery.
  • It contributes significantly to data center transformation by reducing the carbon footprint through savings in storage space.
  • It reduces the recurring cost of the human resources needed for management and administration.
  • It reduces the amount of hardware that must be recycled.
  • It reduces the budget for data management, backup, and retrieval by lowering fixed and recurring costs.

Drawbacks

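Data deduplication solutions rely on cryptographic hash functions to identify duplicate segments of data. A hash collision, in which two different segments produce the same hash value, would cause one segment to be treated as a duplicate of the other and would therefore result in data loss. Because of this, vendors have devised various ways of tackling the problem.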
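One safeguard that is often described is to compare the actual bytes whenever two segments share a hash, and to treat them as duplicates only if they are identical. The sketch below illustrates that idea under stated assumptions; it is not taken from any vendor's implementation. The `VerifyingStore` class and its methods are hypothetical, and the deliberately weak `weak` hash (the segment length) exists only to force a collision for demonstration.

```python
import hashlib


class VerifyingStore:
    """Toy segment store (hypothetical) that verifies bytes on a hash match."""

    def __init__(self, hash_fn=None):
        # Default to SHA-256; a weaker function can be injected for testing.
        self.hash_fn = hash_fn or (lambda d: hashlib.sha256(d).hexdigest())
        self.buckets = {}  # digest -> list of distinct segments with that digest

    def put(self, segment):
        digest = self.hash_fn(segment)
        bucket = self.buckets.setdefault(digest, [])
        for slot, existing in enumerate(bucket):
            if existing == segment:  # byte-for-byte check: a true duplicate
                return (digest, slot)
        # Same digest but different bytes (or a new digest): keep a separate copy.
        bucket.append(segment)
        return (digest, len(bucket) - 1)

    def get(self, ref):
        digest, slot = ref
        return self.buckets[digest][slot]


weak = lambda d: str(len(d))  # deliberately weak: equal lengths collide
store = VerifyingStore(hash_fn=weak)
a = store.put(b"alpha")
b = store.put(b"omega")  # collides with b"alpha" under the weak hash
c = store.put(b"alpha")  # genuine duplicate of the first segment

print(a == c, a != b, store.get(b))
# prints: True True b'omega'
```

The extra comparison costs a read of the stored segment on every hash match, which is the trade-off for not relying on the hash alone.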

Deduplication ultimately reduces redundancy. If this is not expected and planned for, it may undermine the underlying reliability of the system. (Compare this, for example, to having a quad-redundant network link where all four cables pass through the same physical conduit.)

Major Commercial Players

Data deduplication is an active area with a number of vendors, particularly now that the VTL (virtual tape library) vendors are also getting involved.

Notable names include ExaGrid (patented content-aware, byte-level deduplication), NEC's HYDRAstor (content-aware deduplication technology), IBM's ProtecTIER, Quantum, EMC/Data Domain, Symantec NetBackup PureDisk, EMC Avamar, Sepaton, and FalconStor.

The FalconStor VTL Enterprise software architecture provides concurrent overlap backups with data deduplication.[1]

Quantum was an early leader in this market and holds a patent for variable-length block data deduplication.

According to an OpenSolaris forum posting by Sun Fellow Jeff Bonwick, Sun/Oracle is scheduled to incorporate deduplication features into ZFS sometime in the summer of 2009.[2]

References

  1. ^ "FalconStor offers de-duplication in its next-generation virtual tape library".
  2. ^ "ZFS and deduplication".
  • Symantec NetBackup PureDisk