Oracle ZFS

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Bolthole (talk | contribs) at 17:41, 12 August 2011 (→‎Platforms: Added "Solaris" as section header). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

ZFS
Developer(s): Sun Microsystems
Full name: ZFS
Introduced: November 2005 with OpenSolaris
Structures
Directory contents: Extensible hash table
Limits
Max volume size: 16 EB
Max file size: 16 EB (2^64 bytes)
Max no. of files: 2^48
Max filename length: 255 bytes
Features
Forks: Yes (called Extended Attributes)
Attributes: POSIX
File system permissions: POSIX, NFSv4 ACLs
Transparent compression: Yes
Transparent encryption: Yes[1]
Data deduplication: Yes
Other
Supported operating systems: Solaris, OpenSolaris, FreeBSD, Mac OS X Server 10.5, NetBSD, Linux via ZFS-FUSE or partial native support via 3rd party kernel module[2]

In computing, ZFS is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include data integrity (protection against bit rot, etc.), support for high storage capacities, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z and native NFSv4 ACLs. ZFS is implemented as open-source software, licensed under the Common Development and Distribution License (CDDL). The ZFS name is a trademark of Oracle.[3]

History

ZFS was designed and implemented by a team at Sun led by Jeff Bonwick. It was announced on September 14, 2004.[4] Source code for ZFS was integrated into the main trunk of Solaris development on October 31, 2005[5] and released as part of build 27 of OpenSolaris on November 16, 2005. Sun announced that ZFS was included in the 6/06 update to Solaris 10 in June 2006, one year after the opening of the OpenSolaris community.[6]

The name originally stood for "Zettabyte File System".[7] A ZFS file system can store up to 256 quadrillion zettabytes (ZB), where a zettabyte is 2^70 bytes.

Version numbers

As new features are introduced, the version numbers of the zpool and the ZFS file system are incremented to designate the format and features available.[8][9] Notable ZFS storage pool versions include:

  • 10 - Supported by Solaris 10 U7
  • 14 - Supported by OpenSolaris 2009.06, FreeBSD 8.1
  • 15 - Supported by Solaris 10 10/09 (U8), FreeBSD 8.2
  • 17 - Triple Parity RAID-Z
  • 19 - Supported by Solaris 10 09/10
  • 21 - Deduplication
  • 22 - Solaris 10 9/10 (U9)
  • 28 - FreeBSD 9.0
  • 30 - Encryption support

Features

Data Integrity

One major feature that distinguishes ZFS from other file systems is that ZFS is designed from the ground up with a focus on data integrity: that is, protecting the user's data on disk against silent corruption caused by, for example, bit rot, cosmic radiation, current spikes, bugs in disk firmware, and ghost writes.

Data Integrity is a high priority in ZFS because recent research shows that none of the currently widespread file systems — such as Ext, XFS, JFS, ReiserFS, or NTFS — nor Hardware RAID provide sufficient protection against such problems.[10][11][12][13][14] It is well known that Hardware RAID has some issues. Initial research indicates that ZFS clearly protects data better than earlier solutions.[15]

For ZFS, this is accomplished using a (Fletcher-based) checksum or a (SHA-2) hash throughout the file system tree.[16] Each block of data is checksummed, and the checksum is saved in the pointer to that block, not in the block itself. The block pointer is then checksummed in turn, with that value saved in its own parent pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a Merkle tree.[16] When a block is accessed, regardless of whether it is data or metadata, its checksum is calculated and compared with the stored value of what it "should" be. If the values match, the data is passed up the programming stack to the process that asked for it. If the values do not match, ZFS can heal the data if the storage pool has redundancy via ZFS's own mirroring or RAID.[17] If the storage pool consists of a single disk, such redundancy can be provided by specifying "copies=2" (or "copies=3"), which means that data will be stored twice (or three times) on the disk, reducing the usable capacity of the disk to one half (or one third).[18] If redundancy exists, ZFS fetches the second copy of the data (or recreates it via a RAID recovery mechanism) and recalculates the checksum, ideally reproducing the original value this time. If the data passes the integrity check, the system can then update the first copy with known-good data so that redundancy is restored.
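The read-verify-repair cycle described above can be illustrated with a minimal Python sketch. It is not ZFS code: the `MirroredStore` class, its in-memory "disks", and the `(address, checksum)` pointer layout are all hypothetical, and SHA-256 stands in for whichever checksum is configured. It shows only that the checksum lives in the pointer rather than in the block, and that a bad copy is rewritten from a good one.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """SHA-256 digest, standing in for the checksum kept in a block pointer."""
    return hashlib.sha256(data).digest()


class MirroredStore:
    """Toy mirror (hypothetical): every block is stored once per 'disk'."""

    def __init__(self, copies: int = 2):
        self.disks = [dict() for _ in range(copies)]
        self.next_addr = 0

    def write(self, data: bytes):
        """Store data on every disk and return a pointer (address, checksum)."""
        addr = self.next_addr
        self.next_addr += 1
        for disk in self.disks:
            disk[addr] = data
        return addr, checksum(data)

    def read(self, pointer):
        """Return verified data, rewriting any bad copy from a good one."""
        addr, expected = pointer
        good = None
        for disk in self.disks:
            if checksum(disk[addr]) == expected:
                good = disk[addr]
                break
        if good is None:
            raise IOError("all copies failed the checksum; cannot repair")
        for disk in self.disks:
            if disk[addr] != good:
                disk[addr] = good  # self-healing: restore redundancy
        return good
```

Corrupting one copy (for example, overwriting `store.disks[0][addr]`) and then calling `read` returns the original data and leaves both copies intact again; with a single copy, the same corruption can only be detected, not repaired.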

ZFS cannot fully protect the user's data when using a hardware RAID controller, as it is not able to perform the automatic self-healing unless it controls the redundancy of the disks and data. ZFS prefers direct, exclusive access to the disks, with nothing in between that interferes. If the user insists on using hardware-level RAID, the controller should be configured as JBOD mode (i.e. turn off RAID-functionality) for ZFS to be able to guarantee data integrity. Note that hardware RAID configured as JBOD may still detach disks that do not respond in time; and as such may require TLER/CCTL/ERC-enabled disks to prevent drive dropouts. These limitations do not apply when using a non-RAID controller, which is the preferred method of supplying disks to ZFS. A non-RAID controller is generally called a Host Bus Adapter (HBA) and allows the operating system to control timeout and error control, rather than the RAID controller which generally has very strict timeout control.

A modern hard disk devotes a portion of its capacity to error detection data. Many errors occur during normal usage but are corrected by the disk's internal software, and thus are not visible to the host software. A tiny fraction of errors are not corrected. For example, a modern enterprise SAS disk specification estimates this fraction to be one uncorrected error in every 10^16 bits.[19] A smaller fraction of errors are not even detected by the disk firmware or the host operating system. This is known as "silent corruption". In a recent study, CERN found this issue to be problematic.[20]

These problems were not a serious concern while storage devices remained relatively small and slow. A user very rarely faced silent corruption, so it was not deemed to be a problem that required a solution. With the advent of larger drives and very fast RAID setups, a user is capable of transferring 10^16 bits in a sufficiently short time. In particular, ZFS creator Jeff Bonwick stated that the fast database at Greenplum (a database software company located in San Mateo, California, specializing in enterprise data cloud solutions for large-scale data warehousing and analytics) faces silent corruption every 15 minutes,[21] which is one of the reasons that Greenplum now bases its fast database solution on ZFS. These large and fast RAID setups require new file systems that focus on data integrity. This is one of the design goals of ZFS, as explained by Jeff Bonwick.[21]
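How quickly 10^16 bits can pass through a system depends only on sustained throughput, which the following Python sketch makes concrete. The two throughput figures in the usage note are illustrative assumptions, not measurements from the cited sources.

```python
ERROR_INTERVAL_BITS = 10**16  # one uncorrected error per 10^16 bits (figure from the text)


def seconds_per_expected_error(throughput_bytes_per_s: float) -> float:
    """Time needed to transfer 10^16 bits at a sustained throughput."""
    bits_per_second = throughput_bytes_per_s * 8
    return ERROR_INTERVAL_BITS / bits_per_second


def to_days(seconds: float) -> float:
    """Convert seconds to days for readability."""
    return seconds / 86400
```

Under these assumptions, a single drive sustaining 100 MB/s would need about 145 days to move 10^16 bits, while an array sustaining 10 GB/s would need under a day and a half.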

Storage pools

Unlike traditional file systems, which reside on single devices and thus require a volume manager to use more than one device, ZFS filesystems are built on top of virtual storage pools called zpools. A zpool is constructed of virtual devices (vdevs), which are themselves constructed of block devices: files, hard drive partitions, or entire drives, with the last being the recommended usage.[22] Block devices within a vdev may be configured in different ways, depending on needs and space available: non-redundantly (similar to RAID 0), as a mirror (RAID 1) of two or more devices, as a RAID-Z (similar to RAID-5) group of three or more devices, or as a RAID-Z2 (similar to RAID-6) group of four or more devices.[23] In July 2009, triple-parity RAID-Z3 was added to OpenSolaris.[24][25]

Thus, a zpool (ZFS storage pool) is vaguely similar to a computer's RAM. The total RAM capacity depends on the number of memory modules and the size of each module. Likewise, a zpool consists of one or more vdevs. Each vdev can be viewed as a group of hard disks (or partitions, or files, etc.). Each vdev should have redundancy because if a vdev is lost, then the whole zpool is lost. Thus, each vdev should be configured as RAID-Z1, RAID-Z2, mirror, etc. It is not possible to change the number of drives in an existing vdev (Block Pointer Rewrite will allow this, and also allow defragmentation), but it is always possible to increase storage capacity by adding a new vdev to a zpool. It is possible to swap a drive for a larger drive and resilver (repair) the zpool. If this procedure is repeated for every disk in a vdev, then the zpool will grow in capacity when the last drive is resilvered. Each drive in a vdev contributes only as much space as the smallest drive in the group. For instance, a vdev consisting of three 500 GB drives and one 700 GB drive will have a capacity of 4 × 500 GB.
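The sizing rule in the example above can be written as a one-line calculation. This is a sketch of the arithmetic only: the function name is hypothetical, and like the 4 × 500 GB figure in the text, it does not subtract space used for parity or mirroring.

```python
def vdev_capacity_gb(drive_sizes_gb):
    """Every member contributes only as much space as the smallest drive."""
    if not drive_sizes_gb:
        raise ValueError("a vdev needs at least one drive")
    return len(drive_sizes_gb) * min(drive_sizes_gb)
```

With drives of 500, 500, 500 and 700 GB the function returns 2000. Only after the last 500 GB drive has been swapped for a 700 GB one does the result rise to 2800, which mirrors the observation that the extra space appears when the final drive is resilvered.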

In addition, pools can have hot spares to compensate for failing disks. ZFS also supports both read and write caching, for which special devices can be used. Solid State Devices can be used for the L2ARC, or Level 2 adaptive replacement cache, speeding up read operations, while NVRAM buffered SLC memory can be boosted with supercapacitors to implement a fast, non-volatile write cache, improving synchronous writes. Finally, when mirroring, block devices can be grouped according to physical chassis, so that the filesystem can continue in the case of the failure of an entire chassis.

Storage pool composition is not limited to similar devices but can consist of ad-hoc, heterogeneous collections of devices, which ZFS seamlessly pools together, subsequently doling out space to diverse filesystems as needed. Arbitrary storage device types can be added to existing pools to expand their size at any time.[26]

The storage capacity of all vdevs is available to all of the file system instances in the zpool. A quota can be set to limit the amount of space a file system instance can occupy, and a reservation can be set to guarantee that space will be available to a file system instance.

Capacity

ZFS is a 128-bit file system, so it can address 1.84 × 10^19 times more data than 64-bit systems such as NTFS. The limitations of ZFS are designed to be so large that they would never be encountered. This was assured by surpassing physical rather than theoretical limitations: there simply is not enough usable matter on the planet Earth to support a maximized ZFS filesystem. Some theoretical limits in ZFS are:

  • 2^48: Number of entries in any individual directory[27]
  • 16 exabytes (16×10^18 bytes): Maximum size of a single file
  • 16 exabytes: Maximum size of any attribute
  • 256 zettabytes (2^78 bytes): Maximum size of any zpool
  • 2^56: Number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
  • 2^64: Number of devices in any zpool
  • 2^64: Number of zpools in a system
  • 2^64: Number of file systems in a zpool
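The relationships between these figures can be checked with exact integer arithmetic. The short Python sketch below assumes binary prefixes (a zettabyte of 2^70 bytes and an exabyte of 2^60 bytes); the constant names are chosen here for illustration.

```python
# All values are exact integers, so Python's arbitrary-precision ints suffice.
ADDRESS_RATIO = 2**128 // 2**64     # 128-bit versus 64-bit address space
ZETTABYTE = 2**70                   # binary zettabyte (assumed)
MAX_POOL_BYTES = 256 * ZETTABYTE    # 256 zettabytes
MAX_FILE_BYTES = 16 * 2**60         # 16 exabytes, binary prefix assumed

print(f"{ADDRESS_RATIO:.2e}")       # 1.84e+19
print(MAX_POOL_BYTES == 2**78)      # True
print(MAX_FILE_BYTES == 2**64)      # True
```

The ratio 2^128 / 2^64 equals 2^64, which is the source of the factor of roughly 1.84 × 10^19 quoted above.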

Copy-on-write transactional model

ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum or 256-bit hash (currently a choice between Fletcher-2, Fletcher-4, or SHA-256)[28] of the target block which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required. The blocks are arranged in a tree, as are their checksums (see Merkle signature scheme).
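The following Python sketch illustrates the copy-on-write idea on a small in-memory tree. It is a simplification under stated assumptions: the `Block` class and `cow_update` function are hypothetical, checksums are recomputed on demand rather than stored in block pointers, and transaction groups and the intent log are not modelled.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Block:
    """Immutable block: leaf data or an interior node listing its children."""
    data: bytes = b""
    children: Tuple["Block", ...] = ()

    def checksum(self) -> bytes:
        """Hash of this block's data plus the checksums of its children."""
        h = hashlib.sha256(self.data)
        for child in self.children:
            h.update(child.checksum())
        return h.digest()


def cow_update(root: Block, path: Tuple[int, ...], new_data: bytes) -> Block:
    """Return a new root with the block at `path` replaced.

    No existing block is modified: a new leaf is allocated, then each
    ancestor on the path is re-created to reference the new child.
    """
    if not path:
        return Block(data=new_data, children=root.children)
    index, rest = path[0], path[1:]
    new_child = cow_update(root.children[index], rest, new_data)
    children = root.children[:index] + (new_child,) + root.children[index + 1:]
    return Block(data=root.data, children=children)
```

After an update, the old root still describes the previous state, blocks off the modified path are shared between both roots, and the two root checksums differ because the change propagates upward through every ancestor.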

Snapshots and clones

An advantage of copy-on-write is that when ZFS writes new data, the blocks containing the old data can be retained, allowing a snapshot version of the file system to be maintained. ZFS snapshots are created very quickly, since all the data composing the snapshot is already stored; they are also space efficient, since any unchanged data is shared among the file system and its snapshots.

Writeable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist. This is an implementation of the Copy-on-write principle.
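Block sharing between a file system, its snapshots and its clones can be sketched with a toy model in Python. Everything here is hypothetical: a "file system" is just a map from names to immutable byte strings, and object identity stands in for two maps pointing at the same on-disk block.

```python
class ToyFilesystem:
    """Maps file names to immutable blocks; blocks are never edited in place."""

    def __init__(self, files=None):
        self.files = dict(files or {})

    def write(self, name: str, data: bytes) -> None:
        # Rebinding the name leaves the old block untouched for any
        # snapshot or clone that still references it.
        self.files[name] = data

    def snapshot(self) -> dict:
        """Read-only view: copies only the name-to-block map, not the data."""
        return dict(self.files)

    def clone(self) -> "ToyFilesystem":
        """Writable snapshot that initially shares every block with its origin."""
        return ToyFilesystem(self.files)


def shared_blocks(a: dict, b: dict) -> int:
    """Count names whose block is the very same object in both maps."""
    return sum(1 for name in a if name in b and a[name] is b[name])
```

Taking a snapshot costs one map copy regardless of how much data the file system holds, and a later write to one file reduces the shared-block count by one while the snapshot keeps returning the old contents.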

Dynamic striping

Dynamic striping across all devices to maximize throughput means that as additional devices are added to the zpool, the stripe width automatically expands to include them; thus all disks in a pool are used, which balances the write load across them.

Variable block sizes

ZFS uses variable-sized blocks of up to 128 kilobytes. The currently available code allows the administrator to tune the maximum block size used as certain workloads do not perform well with large blocks. If data compression (LZJB) is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations).
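The size selection described above can be sketched in Python as follows. Several details are assumptions made for illustration: `zlib` is used in place of LZJB (which is not in the Python standard library), block sizes are taken to be powers of two from 512 bytes up to 128 KiB, and the function names are hypothetical.

```python
import zlib

MAX_BLOCK = 128 * 1024  # 128 KiB upper bound mentioned in the text
MIN_BLOCK = 512         # assumed smallest allocation unit


def allocated_size(payload_len: int) -> int:
    """Smallest power-of-two block between MIN_BLOCK and MAX_BLOCK that fits."""
    if payload_len > MAX_BLOCK:
        raise ValueError("payload exceeds the maximum block size")
    size = MIN_BLOCK
    while size < payload_len:
        size *= 2
    return size


def store_block(data: bytes):
    """Return (stored_bytes, allocated_size, compressed_flag).

    The compressed form is kept only if it fits a smaller block size;
    otherwise the CPU cost of decompression would buy nothing.
    """
    compressed = zlib.compress(data)  # stand-in for LZJB
    if allocated_size(len(compressed)) < allocated_size(len(data)):
        return compressed, allocated_size(len(compressed)), True
    return data, allocated_size(len(data)), False
```

A highly repetitive 128 KiB payload ends up in the smallest block, whereas a payload that already fits the smallest block is stored uncompressed.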

Lightweight filesystem creation

In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or resize a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.

Cache management

ZFS also uses the ARC (adaptive replacement cache), a new method for read cache management, instead of the traditional Solaris virtual memory page cache. For write caching, ZFS employs the ZFS Intent Log (ZIL). Both mechanisms can use separate virtual devices to improve total IOPS: the "cache" vdev for read operations and the "log" vdev for write operations.[29]

Adaptive endianness

Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders. The ZFS block pointer format stores filesystem metadata in an endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness does not match the endianness of the system, the metadata is byte-swapped in memory.

This does not affect the stored data itself; as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
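The write-native, swap-on-read approach to metadata can be illustrated with a small Python sketch. The layout is hypothetical: a one-byte flag records the writer's byte order ahead of a single 64-bit field, which is far simpler than the real block pointer format.

```python
import struct
import sys

HOST_IS_BIG_ENDIAN = sys.byteorder == "big"


def write_metadata(value: int, writer_big_endian: bool = HOST_IS_BIG_ENDIAN) -> bytes:
    """Serialize a 64-bit field in the writer's byte order, preceded by an order flag."""
    fmt = ">Q" if writer_big_endian else "<Q"
    flag = bytes([1 if writer_big_endian else 0])
    return flag + struct.pack(fmt, value)


def read_metadata(raw: bytes, host_big_endian: bool = HOST_IS_BIG_ENDIAN) -> int:
    """Decode the field, byte-swapping in memory only when the orders differ."""
    stored_big_endian = raw[0] == 1
    payload = raw[1:9]
    if stored_big_endian != host_big_endian:
        payload = payload[::-1]  # byte-swap in memory
    return int.from_bytes(payload, "big" if host_big_endian else "little")
```

The writer never pays a conversion cost, and a reader with the same byte order reads the field directly; only a reader on the opposite architecture performs the swap.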

Deduplication

Deduplication capability was added to the ZFS source repository at the end of October 2009.[30] The OpenSolaris ZFS development packages have been available since December 3, 2009 (build 128).

Effective use of deduplication requires additional hardware. ZFS designers recommend 2 GiB of RAM for every 1 TiB of storage. Example: at least 32 GiB of memory is recommended for 20 TiB of storage.[31] If RAM is lacking, consider adding an SSD as a cache device, which will automatically handle the large deduplication tables. This can speed up deduplication performance 8x or more. Insufficient physical memory or lack of ZFS cache results in virtual memory thrashing, which lowers performance significantly.

As of Solaris 11 Express, deduplication can cause several problems for administrators who are not aware of its limitations.[32]

Encryption

The encryption capability in ZFS[33] is embedded into the I/O pipeline. During writes a block may be compressed, encrypted, checksummed and then deduplicated in that order. The policy for encryption is set at the dataset level when datasets (file systems or ZVOLs) are created. The wrapping keys provided by the user/administrator can be changed at any time without taking the file system off line. The default behaviour is for the wrapping key to be inherited by any child data sets. The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys.[34] A command to switch to a new data encryption key for the clone or at any time is provided — this does not re-encrypt already existing data.

Additional capabilities

  • Explicit I/O priority with deadline scheduling.
  • Claimed globally optimal I/O sorting and aggregation.
  • Multiple independent prefetch streams with automatic length and stride detection.
  • Parallel, constant-time directory operations.
  • End-to-end checksumming, using a kind of "Data Integrity Field", allowing data corruption detection (and recovery if you have redundancy in the pool).
  • Transparent filesystem compression. Supports LZJB and gzip.[35]
  • Intelligent scrubbing and resilvering (resyncing).[36]
  • Load and space usage sharing among disks in the pool.[37]
  • Ditto blocks: Configurable data replication per filesystem, with zero, one or two extra copies requested per write for user data, and with that same base number of copies plus one or two for metadata (according to metadata importance).[38] If the pool has several devices, ZFS tries to replicate over different devices. Ditto blocks are primarily an additional protection against corrupted sectors, not against total disk failure.[39]
  • ZFS design (copy-on-write + superblocks) is safe when using disks with write cache enabled, if they honor the write barriers. This feature provides safety and a performance boost compared with some other filesystems.
  • When entire disks are added to a ZFS pool, ZFS automatically enables their write cache. This is not done when ZFS only manages discrete slices of the disk, since it does not know if other slices are managed by non-write-cache safe filesystems, like UFS.
  • Per-user and per-group quotas support.[40]
  • Filesystem encryption since Solaris 11 Express[1]
  • Pools can be imported readonly
  • At import time a recovery by rolling back whole transactions is possible.
  • Planned features:
    • The so-called Block Pointer rewrite functionality is due to be added in the same time frame, paving the way for resizing pools, defragmentation, (re-)applying compression on filesystems and so on.[41]

Limitations

  • Capacity expansion is normally achieved by adding groups of disks as a top-level vdev: simple device, RAID-Z, RAID-Z2, RAID-Z3, or mirrored. Newly written data will dynamically start to use all available vdevs. It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself — the heal time will depend on the amount of stored information, not the disk size. The new free space will not be available until all the disks have been swapped.
  • It is currently not possible to reduce the number of top-level vdevs in a pool nor otherwise reduce pool capacity.[42] This functionality was said to be already in development in 2007.[43] It is not available as of Solaris 10 9/10 (also known as update 9).
  • It is not possible to add a disk as a column to a RAID-Z, RAID-Z2, or RAID-Z3 vdev. This feature depends on the block pointer rewrite functionality due to be added soon. One can however create a new RAID-Z vdev and add it to the zpool.
  • Vdevs cannot be nested, so a mirror or RAID-Z top-level vdev can only contain files or disks. Mirrors of mirrors (or other combinations) are not allowed.
  • Reconfiguring the number of devices in a top-level vdev requires copying the data offline, destroying the pool, and recreating the pool with the new top-level vdev configuration. There are two exceptions: extra redundancy can be added to an existing mirror at any time, and if all top-level vdevs are mirrors with sufficient redundancy, the zpool split[44] command can be used to remove a vdev from each top-level vdev in the pool, creating a second pool with identical data.
  • ZFS is not a native cluster, distributed, or parallel file system and cannot provide concurrent access from multiple hosts as ZFS is a local file system. Sun's Lustre distributed filesystem will adapt ZFS as back-end storage for both data and metadata in version 3.0, which is scheduled to be released in 2010.[needs update][45]
  • ZFS expects a disk cache flush command to commit cached data to media. Some virtualization software is configured by default to ignore cache flush commands, and some consumer-grade hardware 'lies' about actually executing the command as well. For example, VirtualBox can be configured to properly respect cache flushes but does not do so by default (the procedure is described in section 11.1.3, Responding to guest IDE flush requests, of the Sun VirtualBox User Manual[46]); consumer-grade USB disk enclosures are said to be particularly vulnerable to this problem. In the event of an outage or fault this can quite possibly lead to damage to the pool; recovery can be attempted by importing the pool as of a few transactions ago (i.e. an older uberblock), losing seconds or minutes of data. A recovery enhancement is expected to be integrated in Q1 2010 (already in the latest development versions of OpenSolaris). A scrub is used to verify the integrity; however, some files may still need to be restored from backups, in the unlikely event they have already been deleted, blocks freed and then overwritten.
  • ZFS has no defragmentation utility. Using copy-on-write with frequently changed files normally leads to high fragmentation, but ZFS has several functions to mitigate this problem. For instance, ZFS does not immediately write out all data to disk; instead it collects all changes during a 30-second time frame, or until the RAM cache is 7/8 full, and flushes the data to disk when either of these events occurs. This is configurable. Thus, file fragmentation is not that big a problem with ZFS. If you really must defragment ZFS, you can copy the data to another server, destroy and recreate the original zpool, and then copy the data back. This eliminates fragmentation. When Block Pointer Rewrite functionality is added, ZFS will be able to defragment.
  • If you use a single disk, by default, ZFS can only detect and report silent data corruption errors (because of the checksums) but not repair the errors. For ZFS to be able to both detect and also repair the data corruption, you must specify "copies=2" http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection which tells ZFS to store data twice on the disk (halving your storage capacity). If a data block gets corrupted, ZFS will repair the data block from another copy. Of course, "copies" does not help against a disk crash. To recover from a disk crash, you need disk redundancy such as raidz1, raidz2 or mirror. This applies to all file systems; no file system can protect your data against a disk crash when you use a single disk, you need two or more disks. Thus, ZFS "copies" is not a limitation, but a great advantage because of ZFS' ability to repair corrupted data even when using only a single disk. To further increase safety, "copies=3" can be used, which stores data thrice on every disk.
  • Resilver (repair) of a crashed disk in a ZFS raid takes a long time. This applies to all types of RAID, in one way or another. This means that future large disks, say 5 TB or 6 TB, can take several days to repair. This means that raidz1 (similar to RAID-5) should be avoided, because repairing a raid puts additional stress on the other disks which might cause them to crash, losing all data in the storage pool if configured as raidz1. Therefore, with large disks one should use raidz2 (allow two disks to crash) or raidz3 (allow three disks to crash). Adam Leventhal explains this problem further http://dtrace.org/blogs/ahl/2009/07/21/triple-parity-raid-z/. It should be noted, however, that ZFS RAID differs from conventional RAID solutions by only reconstructing the data when replacing a disk, not the entirety of the disk, which means that replacing a member disk on a ZFS pool that is half full will take only half the time as compared to conventional RAID.
  • IOPS performance of a ZFS storage pool can suffer if the ZFS RAID is not appropriately configured. This applies to all types of RAID, in one way or another. If the zpool consists of only one group of disks configured as, say, raidz2, then the IOPS performance will be that of a single disk. This means that, to get high IOPS performance, the zpool should consist of several vdevs, because one vdev gives the IOPS of a single disk. However, there are ways to mitigate this IOPS performance problem, for instance adding SSDs as L2ARC cache, which can boost IOPS into the hundreds of thousands http://blogs.sun.com/brendan/entry/a_quarter_million_nfs_iops . In short, a zpool should consist of several vdevs, each consisting of 8-12 disks. It is not recommended to create a zpool with a single large vdev, say 20 disks, because IOPS performance will be that of a single disk, which also means that resilver time will be very long (possibly weeks with future large drives).
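Two of the rules of thumb in the list above, that a vdev delivers roughly the IOPS of one disk and that resilver time scales with the amount of stored data rather than disk size, can be expressed as simple estimates. The Python sketch below is a rough model only: the function names and the per-disk figures used in the usage note are illustrative assumptions, and caching effects such as L2ARC are ignored.

```python
def pool_random_iops(vdev_count: int, iops_per_disk: float) -> float:
    """Rule of thumb from the text: each vdev delivers about one disk's IOPS."""
    if vdev_count < 1:
        raise ValueError("a pool needs at least one vdev")
    return vdev_count * iops_per_disk


def resilver_hours(full_disk_rebuild_hours: float, pool_fill_fraction: float) -> float:
    """Only stored data is reconstructed, so time scales with how full the pool is."""
    if not 0.0 <= pool_fill_fraction <= 1.0:
        raise ValueError("pool_fill_fraction must be between 0 and 1")
    return full_disk_rebuild_hours * pool_fill_fraction
```

Assuming 100 IOPS per disk, twenty disks arranged as one raidz2 vdev give an estimate of about 100 IOPS, while the same disks split into two ten-disk vdevs give about 200. A pool that is half full would, by the second estimate, resilver in half the time a conventional full-disk rebuild would take.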

Platforms

Solaris

ZFS is part of Sun's own Solaris operating system and is thus available on both SPARC and x86-based systems. Since the code for ZFS is open source, a port to other operating systems and platforms can be produced without Sun's involvement.

OpenSolaris

OpenSolaris 2008.05 and 2009.06 use ZFS as their default filesystem. There are over a dozen 3rd party distributions, of which nearly a dozen are mentioned here. (OpenIndiana and Illumos are two new distributions not included on the OpenSolaris distribution reference page.)

FreeBSD

Pawel Jakub Dawidek ported ZFS to FreeBSD, and it has been part of FreeBSD since version 7.0.[47] This includes zfsboot, which allows booting FreeBSD directly from a ZFS volume.[48][49]

FreeBSD's ZFS implementation is fully functional; the only missing features are kernel CIFS server and iSCSI, but at least the latter can be added using externally available packages.[50] A CIFS server can be emulated in user space using Samba.

FreeBSD 7-stable (where updates to the series of versions 7.x are committed to) uses zpool version 6.

FreeBSD version 8 includes a much-updated implementation of ZFS, and zpool version 13 is supported in FreeBSD release 8.0.[51] zpool version 14 support was added to the 8-stable branch on 11 January 2010,[52] and is included in FreeBSD release 8.1. zpool version 15 is supported in release 8.2.[53] The 8-stable branch gained support for zpool version v28 and zfs version 5 in early June 2011[54]. Therefore, v28 will be supported in the 8.x FreeBSD series with the release of FreeBSD 8.3.

The 9-current development branch of FreeBSD uses ZFS Pool version 28.

FreeNAS

FreeNAS, an embedded open source network-attached storage (NAS) distribution based on FreeBSD, has the same ZFS support as FreeBSD.

GNU/kFreeBSD

GNU/kFreeBSD is a special case: because it is based on the FreeBSD kernel, it provides the kernel side of ZFS support (see above). However, whether the necessary userland tools are available depends on the GNU/kFreeBSD distribution. The only distribution of this system to date (Debian GNU/kFreeBSD) provides ZFS utilities in the zfsutils package. Additionally, the Debian installer supports installing the system on a ZFS root file system on the amd64 architecture.

NetBSD

A NetBSD port of ZFS was started as part of the 2007 Google Summer of Code, and in August 2009 the code was merged into NetBSD's source tree.[55]

Mac OS X

The first indication of Apple Inc.'s interest in ZFS was an April 2006 post on the opensolaris.org zfs-discuss mailing list where an Apple employee mentioned being interested in porting ZFS to their Mac OS X operating system.[56]

In the release version of Mac OS X 10.5, ZFS was available in read-only mode from the command line, which lacks the possibility to create zpools or write to them.[57] Before the 10.5 release, Apple released the "ZFS Beta Seed v1.1", which allowed read-write access and the creation of zpools,[58] however the installer for the "ZFS Beta Seed v1.1" has been reported to only work on version 10.5.0, and has not been updated for version 10.5.1 and above.[59]

In August 2007, Apple opened a ZFS project on their Mac OS Forge site. On that site, Apple provided the source code and binaries of their port of ZFS which includes read-write access, but there was no installer available[60] until a third-party developer created one.[61]

In October 2009, Apple announced a shutdown of the ZFS project on Mac OS Forge. No explanation was given, just the following statement: "The ZFS project has been discontinued. The mailing list and repository will also be removed shortly." Versions of the previously released source and binaries, as well as the wiki, have been preserved and development has been adopted by a group of enthusiasts.[62][63]

Complete ZFS support was once advertised as a feature of Snow Leopard Server (Mac OS X Server 10.6). However, all references to this feature have been silently removed; it is no longer listed on the Snow Leopard Server features page.[64] Apple has not commented regarding the omission.

The maczfs project mirrored the public archives before they disappeared,[62] and a community-maintained project currently (as of 5 May 2011) provides basic ZFS software for most recent versions of OS X, including Snow Leopard (10.6).

In March 2011, the company Ten's Complement LLC (founded by Don Brady, a former Apple engineer who was technical lead on the original HFS+ team and worked on Apple's abandoned internal project to port ZFS[65]) announced that it was close to releasing a version of ZFS for Mac OS X called "Z‑410 Storage".[66] Z‑410 Storage would be targeted at prosumers[67] and may be available around the same time as Lion.[65]

Linux

Porting ZFS to Linux is complicated by the fact that the GNU General Public License, which governs the Linux kernel, is incompatible with the Sun CDDL under which ZFS is distributed. According to some developers a single derived work of both projects cannot be legally distributed, as it is not possible to simultaneously meet both licenses' requirements.[68] To include ZFS in the Linux kernel it would have to be cleanly reimplemented, and patents may hamper this.[69]

Linux FUSE

Another solution to this problem was to port ZFS to Linux's FUSE system so the filesystem runs in userspace instead, where it is not considered a derived work of the kernel. A project to do this was sponsored by Google's Summer of Code program in 2006.[70] The original ZFS on FUSE project is available here. Development for ZFS on FUSE/Linux now takes place at zfs-fuse.net.

Native ZFS on Linux

A native port of ZFS for Linux is being worked on. This ZFS on Linux port was produced at the Lawrence Livermore National Laboratory (LLNL) under Contract No. DE-AC52-07NA27344 (Contract 44) between the U.S. Department of Energy (DOE) and Lawrence Livermore National Security, LLC (LLNS) for the operation of LLNL. It has been approved for release under LLNL-CODE-403049. The port is currently in release candidate status for version 0.6.0, which supports mounting filesystems.

Another native port is also being worked on by KQ Infotech.[71][72] This port used the LLNL ZVOL implementation as a starting point, and developers are currently working on a POSIX layer. A GA release supporting zpool v28 was released in January 2011.[73]

Comparisons

Operating systems, distributions and add-ons that support ZFS, with the zpool version each supports and the Solaris build each is based on (if any):

| OS | Zpool version | Build | Comments |
|---|---|---|---|
| Oracle Solaris Express 11 2010.11 | 31 | b151a | licensed for testing only |
| OpenSolaris 2009.06 | 14 | b111b | |
| OpenSolaris (last dev) | 22 | b134 | |
| OpenIndiana | 28 | b147 | |
| Nexenta Core 3.0.1 | 26 | b134+ | GNU userland |
| NexentaStor Community 3.1.0 | 28 | b134+ | GNU userland |
| NexentaStor Community 3.0.1 | 26 | b134+ | up to 18 TB, web admin |
| NexentaStor Enterprise | 26 | b134+ | not free, web admin |
| FreeBSD 8.2-RELEASE | 15 | | no CIFS or iSCSI |
| FreeBSD 8-STABLE / 9-CURRENT | 28 | | no CIFS or iSCSI |
| Linux FUSE 0.7.0 | 23 | | low efficiency |
| Native Linux port (LLNL) | 28 | | no stable POSIX layer, release candidate has basic POSIX layer |
| Native Linux port (KQ Infotech) | 28 | | includes POSIX layer |
| Belenix 0.8b1 | 14 | b111 | |
| Schillix 0.7.2 | 28 | b147 | |
| StormOS "hail" | | | based on Nexenta |
| Jaris | | | Japanese |
| MilaX 0.5 | 20 | b128a | small size |
| Korona 4.5.0 | 22 | b134 | KDE |
| EON NAS | 22 | b130 | embedded NAS |
| Mac OS X 10.6 (kernel extension / module) | 8 | | somewhat stable, with installable packages for those who wish to use and test it; 1 reported crash |

(updated 2011/06/28)
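
The zpool versions in the table above matter mainly for moving pools between systems. The following Python sketch is only an illustration: it assumes the simplified rule that a system can import a pool whose version is less than or equal to the zpool version that system supports, and it copies a handful of rows from the table. The dictionary name and both function names are hypothetical and are not part of any ZFS tool.

```python
# zpool versions copied from a subset of the comparison table (as of 2011/06/28).
SUPPORTED_ZPOOL_VERSION = {
    "Oracle Solaris Express 11 2010.11": 31,
    "OpenSolaris 2009.06": 14,
    "OpenIndiana": 28,
    "FreeBSD 8.2-RELEASE": 15,
    "FreeBSD 8-STABLE / 9-CURRENT": 28,
    "Linux FUSE 0.7.0": 23,
    "Mac OS X 10.6 (kernel extension / module)": 8,
}


def can_import(pool_version: int, system: str) -> bool:
    """Assumed rule: a pool is importable if its version does not exceed
    the zpool version supported by the target system."""
    return pool_version <= SUPPORTED_ZPOOL_VERSION[system]


def systems_able_to_import(pool_version: int) -> list[str]:
    """Return the listed systems (sorted by name) that could import the pool."""
    return sorted(
        name for name in SUPPORTED_ZPOOL_VERSION if can_import(pool_version, name)
    )


if __name__ == "__main__":
    # A pool created at version 28 on OpenIndiana.
    for name in systems_able_to_import(28):
        print(name)
```

Under this assumption, a version 28 pool would be importable on OpenIndiana, FreeBSD 8-STABLE / 9-CURRENT and Oracle Solaris Express 11, but not on FreeBSD 8.2-RELEASE or Linux FUSE 0.7.0.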

Problems

Although ZFS is touted as a filesystem that counters silent corruption and needs no check-and-repair tools, there have been cases, including one at Joyent,[74] showing that even Solaris/SPARC installations in enterprise settings are not as safe as advertised.

References

  1. ^ a b "What's new in Solaris 11 Express 2010.11" (PDF). Oracle. Retrieved 2010-11-17.
  2. ^ "1.1 What about the licensing issue?". Retrieved 2010-11-18.
  3. ^ "Sun Trademarks - ZFS". Sun Microsystems.
  4. ^ "ZFS: the last word in file systems". Sun Microsystems. September 14, 2004. Archived from the original on April 28, 2006. Retrieved 2006-04-30.
  5. ^ Jeff Bonwick (October 31, 2005). "ZFS: The Last Word in Filesystems". Jeff Bonwick's Blog. Retrieved 2006-04-30.
  6. ^ "Sun Celebrates Successful One-Year Anniversary of OpenSolaris". Sun Microsystems. June 20, 2006.
  7. ^ "ZFS FAQ at OpenSolaris.org". Sun Microsystems. Retrieved 2011-05-18. The largest SI prefix we liked was 'zetta' ('yotta' was out of the question)
  8. ^ "Solaris ZFS Administration Guide, Appendix A ZFS Version Descriptions". Oracle Corporation. 2010. Retrieved 2011-02-11.
  9. ^ "Version". Sun Microsystems. Retrieved 2010-08-17.
  10. ^ "Iron File Systems" (PDF).
  11. ^ "Parity Lost and Parity Regained".
  12. ^ "An Analysis of Data Corruption in the Storage Stack" (PDF).
  13. ^ "Impact of Disk Corruption on Open-Source DBMS" (PDF).
  14. ^ Baarf.com
  15. ^ Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. "End-to-end Data Integrity for File Systems: A ZFS Case Study" (PDF). Madison: Computer Sciences Department, University of Wisconsin. p. 14. Retrieved 2010-12-06.
  16. ^ a b Bonwick, Jeff (2005-12-09). "ZFS End-to-End Data Integrity".
  17. ^ Cook, Tim (2009-11-16). "Demonstrating ZFS Self-Healing".
  18. ^ Ranch, Richard (2007-05-04). "ZFS, copies, and data protection".
  19. ^ "Are Fibre Channel and SCSI Drives More Reliable?".
  20. ^ Indico.cern.ch
  21. ^ a b "A Conversation with Jeff Bonwick and Bill Moore". Association for Computing Machinery. 2007-11-15. Retrieved 2010-12-06.
  22. ^ "Solaris ZFS Administration Guide". Oracle Corporation. Retrieved 2011-02-11.
  23. ^ "ZFS Best Practices Guide". Solaris Performance Wiki. Retrieved 2007-10-02.
  24. ^ Leventhal, Adam. "Bug ID: 6854612 triple-parity RAID-Z". Sun Microsystems. Retrieved 2009-07-17.
  25. ^ Leventhal, Adam (2009-07-16). "6854612 triple-parity RAID-Z". zfs-discuss (Mailing list). Retrieved 2009-07-17.
  26. ^ "Solaris ZFS Enables Hybrid Storage Pools—Shatters Economic and Performance Barriers"
  27. ^ "Solaris ZFS Administration Guide". Oracle Corporation. Retrieved 2011-02-11.
  28. ^ "ZFS On-Disk Specification" (PDF). Sun Microsystems, Inc. 2006. See section 2.4.
  29. ^ Unix.com
  30. ^ "ZFS Deduplication".
  31. ^ "Dedup Performance Considerations".
  32. ^ "[zfs-discuss] Summary: Dedup memory and performance (again, again)".
  33. ^ "Encrypting ZFS File Systems".
  34. ^ "Having my secured cake and Cloning it too (aka Encryption + Dedup with ZFS)".
  35. ^ "Solaris ZFS Administration Guide". Chapter 6 Managing ZFS File Systems. Retrieved 2009-03-17.
  36. ^ "Smokin' Mirrors". Jeff Bonwick's Weblog. 2006-05-02. Retrieved 2007-02-23.
  37. ^ "ZFS Block Allocation". Jeff Bonwick's Weblog. 2006-11-04. Retrieved 2007-02-23.
  38. ^ "Ditto Blocks — The Amazing Tape Repellent". Flippin' off bits Weblog. 2006-05-12. Retrieved 2007-03-01.
  39. ^ "Adding new disks and ditto block behaviour". Retrieved 2009-10-19.
  40. ^ "OpenSolaris.org". Sun Microsystems. Retrieved 2009-05-22.
  41. ^ Jeff Bonwick Keynote at Kernel Conference Australia 2009
  42. ^ "Bug ID 4852783: reduce pool capacity". OpenSolaris Project. Retrieved 2009-03-28.
  43. ^ Goebbels, Mario (2007-04-19). "Permanently removing vdevs from a pool". zfs-discuss (Mailing list).
  44. ^ Download.oracle.com
  45. ^ "Lustre Roadmap".
  46. ^ "Sun VirtualBox User Manual version 3.0.4" (PDF).
  47. ^ Dawidek, Pawel (April 6, 2007). "ZFS committed to the FreeBSD base". Retrieved 2007-04-06.
  48. ^ "Revision 192498". May 20, 2009. Retrieved 2009-05-22.
  49. ^ "ZFS v13 in 7-STABLE". May 21, 2009. Retrieved 2009-05-22.
  50. ^ "iSCSI target for FreeBSD". Retrieved 6 August 2011.
  51. ^ "FreeBSD 8.0-RELEASE Release Notes". FreeBSD. Retrieved 2009-11-27.
  52. ^ "FreeBSD 8.0-STABLE Subversion logs". FreeBSD. Retrieved 2010-02-05.
  53. ^ "FreeBSD 8.2-RELEASE Release Notes". FreeBSD. Retrieved 2011-03-09.
  54. ^ "HEADS UP: ZFS v28 merged to 8-STABLE". June 6, 2011. Retrieved 2011-06-11.
  55. ^ "NetBSD Google Summer of Code projects: ZFS".
  56. ^ "Porting ZFS to OSX". zfs-discuss. April 27, 2006. Retrieved 2006-04-30.
  57. ^ "Apple: Leopard offers limited ZFS read-only". MacNN. June 12, 2007. Retrieved 2007-06-23.
  58. ^ "Apple delivers ZFS Read/Write Developer Preview 1.1 for Leopard". Ars Technica. October 7, 2007. Retrieved 2007-10-07.
  59. ^ Ché Kristo (November 18, 2007). "ZFS Beta Seed v1.1 will not install on Leopard.1 (10.5.1)". ideas are free. Retrieved 2007-12-30.
  60. ^ ZFS.macosforge.org
  61. ^ Alblue.blogspot.com
  62. ^ a b Code.google.com
  63. ^ Groups.google.com
  64. ^ "Snow Leopard". June 9, 2009. Retrieved 2008-06-10.
  65. ^ a b "How ZFS is slowly making its way to Mac OS X". March 18, 2011. Retrieved 2011-03-19.
  66. ^ ZDNet.com: ZFS returns to the Mac March 14, 2011
  67. ^ “We think the market for Mac OS X server is in serious decline (...) There's a huge chasm between using Xsan over Fibre Channel and a USB drive with Time Machine," Brady told Ars. "That middle piece is what we're looking at—users that want the convenience of a device like a Drobo, but with more reliability and [easy verifiability].”
  68. ^ Linus on GPLv3 and ZFS
  69. ^ Jeremy Andrews (April 19, 2007). "Linux: ZFS, Licenses and Patents". Retrieved 2007-04-21.
  70. ^ Ricardo Correia (March 16, 2009). "ZFS on FUSE/Linux". Retrieved 2009-03-16.
  71. ^ Darshin (August 24, 2010). "ZFS Port to Linux (all versions)". Retrieved 2010-08-31.
  72. ^ kqi (2011-06-12). "Where can I get the ZFS for Linux source code?".
  73. ^ Phoronix (November 22, 2010). "Running The Native ZFS Linux Kernel Module, Plus Benchmarks". Retrieved 2010-12-07.
  74. ^ Joyent Suffers Major Downtime Due To ZFS Bug

Bibliography

  • Watanabe, Scott (November 23, 2009). "Solaris ZFS Essentials". Prentice Hall. p. 256.