Btrfs
Developer(s) | Oracle Corporation
---|---
Full name | Btrfs
Introduced | Stable: yet to be released; unstable: v0.19, June 2009 (with Linux)
Structures |
Directory contents | B-tree
File allocation | Extents
Limits |
Max volume size | 16 EiB
Max file size | 16 EiB
Max no. of files | 2⁶⁴
Max filename length | 255 bytes
Allowed filename characters | All bytes except NUL ('\0') and '/'
Features |
Dates recorded | Modification (mtime), attribute modification (ctime), access (atime)
Date resolution | Nanosecond
Attributes | POSIX, extended attributes
File system permissions | POSIX, ACL
Transparent compression | Yes (zlib, LZO)
Transparent encryption | No (planned[1])
Data deduplication | No (planned,[1] patches available)
Other |
Supported operating systems | Linux
Btrfs (B-tree file system, variously pronounced "Butter F S", "Better F S",[1] or "B-tree F S"[2]) is a GPL-licensed copy-on-write file system for Linux.
Btrfs is intended to address the lack of pooling, snapshots, checksums and integral multi-device spanning in Linux file systems, these features being crucial as the use of Linux scales upward into larger storage configurations common in the enterprise.[1] Chris Mason, the principal author of the filesystem, has stated its goal was "to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but also means being able to administer and to manage it with a clean interface that lets people see what's being used and makes it more reliable."[3]
Oracle has also begun work on CRFS (Coherent Remote File System), a network filesystem protocol intended to leverage the Btrfs architecture to gain higher performance than existing protocols (such as NFS and CIFS) and to expose Btrfs features such as snapshots to remote clients.[4]
Btrfs 1.0 (with finalized on-disk format) was originally slated for a late 2008 release,[5] but a stable release has not been made as of March 2011. It has, however, been accepted into the mainline kernel for testing as of 2.6.29-rc1.[6] Several Linux distributions have also begun offering Btrfs as an experimental choice of root file system during installation, including openSUSE 11.3, SLES 11 SP1, Ubuntu 10.10, Red Hat Enterprise Linux 6,[7] MeeGo,[8] and Debian.[9]
In 2008 the principal developer of the ext3 and ext4 file systems, Theodore Ts'o, stated that ext4 is a stop-gap and that Btrfs is the way forward,[10] having "a number of the same design ideas that reiser3/4 had".[11] Although Btrfs has been merged into the mainline kernel, opinions on its stability differ. Edward Shishkin, one of the Reiser4 developers now working for Red Hat, has called Btrfs "completely broken", arguing that certain design flaws could lead to large amounts of unusable 'free space'.[12][13][14]
Features
Btrfs, thus far, has implemented:[15][16]
- Online volume growth and shrinking
- Online block device addition and removal
- Online defragmentation
- Online balancing (movement of objects between block devices to balance load)
- Transparent compression (currently zlib and LZO)
- Subvolumes (separately-mountable filesystem roots)
- Snapshots (writeable, copy-on-write copies of subvolumes)
- File cloning (copy-on-write on individual files, or byte ranges thereof)
- Object-level mirroring (RAID1-like) and striping (RAID0-like)
- Checksums on data and metadata (currently CRC-32C[17])
- In-place conversion (with rollback) from ext3/4 to Btrfs[18]
- File system seeding[19] (Btrfs on read-only storage used as a copy-on-write backing for a writeable Btrfs)
- User-defined transactions
- Block discard support (reclaims space on some virtualized setups and improves wear leveling on SSDs by notifying the underlying device that storage is no longer in use)
Planned features include:
- Object-level (RAID5-like and RAID6-like) parity-based striping
- Online and offline filesystem check
- Incremental dumps
- Data deduplication[1][20]
Btrfs, when complete, is expected to offer a feature set comparable to ZFS.[21] Oracle's acquisition of ZFS as part of the Sun Microsystems merger has not changed its plans for developing Btrfs.[22]
Cloning
Btrfs provides a clone operation which atomically creates a copy-on-write snapshot of a file, support for which was added to GNU coreutils 7.5.[23][24] Cloning from byte ranges in one file to another is also supported, allowing large files to be more efficiently manipulated like standard rope data structures.
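As a minimal sketch of how a program might invoke the clone operation directly, the following uses the BTRFS_IOC_CLONE ioctl, which takes the source file descriptor as its argument; the ioctl number is redefined here since, at this time, it lives in the kernel's Btrfs headers rather than a standard userspace header, and error handling is pared down:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* BTRFS_IOC_CLONE, as defined in the kernel's Btrfs ioctl header.
 * The third ioctl argument is the *source* file descriptor. */
#define BTRFS_IOC_CLONE _IOW(0x94, 9, int)

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s SRC DST\n", argv[0]); return 1; }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0 || ioctl(dst, BTRFS_IOC_CLONE, src) < 0) {
        perror("clone");
        return 1;
    }
    return 0;  /* dst now shares src's extents, copy-on-write */
}
```

The same effect is available from the shell via cp --reflink with coreutils 7.5 or later, per the support noted above.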
Subvolumes and snapshots
Subvolumes effectively allow a single instance of Btrfs to have multiple root directories, all using the file system as a pooled store. These roots are labeled and can be mounted separately by label. (There is always a "default" root which is mounted by default.) Subvolumes can also be nested inside other subvolumes, where they appear as subdirectories. Subvolumes are always created empty.
Snapshots are writeable copy-on-write clones of whole subvolumes; snapshotting itself is an atomic operation. Snapshots of individual subdirectories are not possible.
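At the ioctl level, snapshot creation looks roughly as follows; the structure layout and ioctl number follow the kernel's Btrfs ioctl header of this era, and the paths are examples, so treat the details as illustrative:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/* From the kernel's Btrfs ioctl header (not in a standard userspace
 * header at the time); details are illustrative. */
#define BTRFS_PATH_NAME_MAX 4087
struct btrfs_ioctl_vol_args {
    long long fd;                        /* fd of the source subvolume */
    char name[BTRFS_PATH_NAME_MAX + 1];  /* name of the new snapshot */
};
#define BTRFS_IOC_SNAP_CREATE _IOW(0x94, 1, struct btrfs_ioctl_vol_args)

int main(void)
{
    /* Snapshot the subvolume /mnt/vol as /mnt/snap. The ioctl is issued
     * on the directory that will contain the snapshot. */
    int dst_dir = open("/mnt", O_RDONLY);
    struct btrfs_ioctl_vol_args args = { .fd = open("/mnt/vol", O_RDONLY) };
    strcpy(args.name, "snap");
    if (ioctl(dst_dir, BTRFS_IOC_SNAP_CREATE, &args) < 0)
        perror("snapshot");
    return 0;
}
```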
In-place ext3/4 conversion
Btrfs can warp to fit unusual spatial layouts because it has very little metadata anchored in fixed locations. The btrfs-convert tool exploits this property to perform an in-place conversion of any ext3 file system to Btrfs by nesting equivalent Btrfs metadata for all its files in the unallocated space of the ext3 file system. The result is a hybrid file system that, when mounted as a Btrfs, makes the ext3 files accessible in a writeable snapshot. The ext3 filesystem itself appears as a large sparse file which can be mounted as a read-only disk image. Deleting the image file commits the conversion; remounting as ext3 rolls back the conversion.[25]
Transactions
Btrfs supports a very limited form of transaction without ACID semantics: rollback is not possible, only one transaction may run at a time and transactions are not atomic with respect to storage. They are analogous not to transactions in databases, but to the I/O "transactions" in ext3's JBD layer.
An ioctl interface, however, is provided so that user processes can start transactions to make temporary reservations of disk space. Once started, all file system I/O is then bound to the transaction until it closes, at which point the reservation is released and I/O is flushed to storage. To prevent trivial denial-of-service attacks, transactions are only available to root-privileged processes.
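A minimal sketch of this interface follows; the ioctl numbers are taken from the kernel's Btrfs headers and redefined here for illustration, and per the above the calls require root privileges:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Transaction start/end ioctls from the kernel's Btrfs headers. */
#define BTRFS_IOC_TRANS_START _IO(0x94, 6)
#define BTRFS_IOC_TRANS_END   _IO(0x94, 7)

int main(void)
{
    int fd = open("/mnt/btrfs/file", O_RDWR | O_CREAT, 0644); /* example path */
    if (fd < 0 || ioctl(fd, BTRFS_IOC_TRANS_START) < 0) {
        perror("trans start");        /* fails without root privileges */
        return 1;
    }
    write(fd, "data", 4);             /* I/O is bound to the transaction */
    ioctl(fd, BTRFS_IOC_TRANS_END);   /* releases the reservation, flushes I/O */
    close(fd);
    return 0;
}
```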
History
The core data structure of Btrfs—the copy-on-write B-tree—was originally proposed by IBM researcher Ohad Rodeh at a presentation at USENIX 2007. Rodeh suggested adding reference counts and certain relaxations to the balancing algorithms of standard B-trees that would make them suitable for a high-performance object store with copy-on-write snapshots, yet maintain good concurrency.[26]
Chris Mason, an engineer working on ReiserFS for SUSE at the time, joined Oracle later that year and began work on a new file system using such B-trees almost exclusively—not just for metadata and file data, but also recursively to track space allocation of the trees themselves. This allowed all traversal and modifications to be funneled through a single code path, against which features such as copy-on-write, checksumming and mirroring needed to be implemented only once to benefit the entire file system.[16][21]
Design
Btrfs is structured as several layers of trees, all using the same B-tree implementation to store their various data types as generic items sorted on a 136-bit key. The first 64 bits of the key are a unique object id. The middle 8 bits are an item type field, whose use is hardwired into code as an item filter in tree lookups. Objects can have multiple items of multiple types. The remaining right-hand 64 bits are used in type-specific ways. Items for the same object therefore end up adjacent to each other in the tree, ordered by type. By choosing certain right-hand key values, objects can further put items of the same type in a particular order.[21][27]
Interior tree nodes are simply flat lists of key-pointer pairs, where the pointer is the logical block number of a child node. Leaf nodes contain item keys packed into the front of the node and item data packed into the end, with the two growing toward each other as the leaf fills up.[21]
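The key and leaf layouts described above correspond to on-disk structures along these lines; the field names follow the kernel's fs/btrfs/ctree.h, but treat the details as a sketch rather than a normative definition:

```c
#include <linux/types.h>

/* The 136-bit item key: 64-bit object id, 8-bit type, and a 64-bit
 * type-specific "right-hand" value. */
struct btrfs_disk_key {
    __le64 objectid;
    __u8   type;
    __le64 offset;
} __attribute__((packed));

/* Leaf nodes pack these item headers at the front; `offset` and `size`
 * locate the item's data, which is packed at the end of the node, the
 * two regions growing toward each other as the leaf fills up. */
struct btrfs_item {
    struct btrfs_disk_key key;
    __le32 offset;
    __le32 size;
} __attribute__((packed));
```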
Root tree
Every tree appears as an object in the root tree (or tree of tree roots). Some trees, such as file system trees and log trees, have a variable number of instances, each of which is given its own object id. Trees which are singletons (the data relocation, extent and chunk trees) are assigned special, fixed object ids ≤256. The root tree appears in itself as a tree with object id 1.
Trees refer to each other by object id. They may also refer to individual nodes in other trees as a triplet of the tree's object id, the node's level within the tree and its leftmost key value. Such references are independent of where the tree is actually stored.
File system tree
User-visible files and directories all live in a file system tree. File and directory objects all have inode items. Extended attributes and ACL entries are stored alongside in separate items. Files and directories also have a reference item whose right-hand key value is the object id of their parent directory. This allows upward traversal through the directory hierarchy. Hard-linked files have multiple such back-references.[27]
Within each directory, directory entries appear as directory items, whose right-hand key value is a CRC32C hash of their filename. Directory items thus collectively act as an index for path lookups, but are not useful for iteration because they are in hash order: user applications iterating over and opening files in a large directory would thus generate many more seeks between non-adjacent files, a notable performance drain in other file systems with hash-ordered directories such as ReiserFS,[28] ext3 (with Htree-indexes enabled[29]) and ext4, all of which have TEA-hashed filenames. To avoid this, a separate directory index item is created for each new directory entry. The right-hand value of the item is set to a counter that is incremented on every new file. Iteration over these index items thus returns entries in roughly the same order as they are stored on disk.
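As a worked example of the hash involved, here is a self-contained sketch using a plain bitwise CRC-32C; the directory object id, item type names and hash seed are illustrative, as Btrfs's own name-hash helper lives in the kernel sources:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78),
 * sketching the name hash used for directory items. */
static uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = ~0u;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
    }
    return ~crc;
}

int main(void)
{
    /* One entry "foo" in directory object 257 (a hypothetical id) is
     * indexed by two items:
     *   (257, DIR_ITEM,  crc32c("foo"))  -- hash order, for path lookup
     *   (257, DIR_INDEX, n)              -- counter order, for iteration */
    printf("(257, DIR_ITEM, 0x%08x)\n", (unsigned)crc32c("foo", 3));
    return 0;
}
```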
There is one file system tree per subvolume. Subvolumes can nest, in which case they appear as a directory item whose data is a reference to the nested subvolume's file system tree.
Extents
File data are kept outside the tree in extents, which are contiguous runs of disk blocks. Extent blocks default to 4 KiB in size, do not have headers and contain only (possibly compressed) file data. In compressed extents, individual blocks are not compressed separately; rather, the compression stream spans the entire extent.
Each extent is tracked in-tree by an extent data item. The item's right-hand key value is the starting byte offset of the extent. This makes for efficient seeks in large files with many extents, because the correct extent for any given file offset can be computed with just one tree lookup.
Snapshots and cloned files share extents. When a small part of a large such extent is overwritten, the resulting copy-on-write may create three new extents: a small one containing the overwritten data, and two large ones with unmodified data on either side of the overwrite. To avoid having to re-write unmodified data, the copy-on-write may instead create bookend extents, or extents which are simply slices of existing extents. Extent data items allow for this by including an offset into the extent they are tracking: items for bookends are those with non-zero offsets.[27]
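The addressing fields that make bookends possible look roughly like this; the names follow the on-disk struct btrfs_file_extent_item in the kernel sources, with the surrounding fields (generation, compression flags and so on) omitted from this sketch:

```c
#include <linux/types.h>

/* A "bookend" is simply an extent data item whose `offset` is
 * non-zero, exposing only a slice of the shared on-disk extent. */
struct file_extent_slice {
    __le64 disk_bytenr;    /* start of the underlying on-disk extent */
    __le64 disk_num_bytes; /* full length of that extent */
    __le64 offset;         /* where this file's slice begins within it */
    __le64 num_bytes;      /* length of the slice */
};
```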
If the file data is small enough to fit inside a tree node, it is instead pulled in-tree and stored inline in the extent data item. Each tree node is stored in its own tree block—a single uncompressed block with a header.
Extent allocation tree
An extent allocation tree is used to track space usage by extents, which are zoned into block groups. Block groups are variable-sized allocation regions which alternate successively between preferring metadata extents (tree nodes) and data extents (file contents). The default ratio of data to metadata block groups is 1:2. Inode items include a reference to their current block group.[27] Together, these work like the Orlov allocator and block groups in ext3 in allocating related files together and resisting fragmentation by leaving allocation gaps between groups. (ext3 block groups, however, have fixed locations computed from the size of the file system, whereas those in Btrfs are dynamic and are created as needed.)
Items in the extent allocation tree do not have object ids, but instead use their byte offsets as the left-hand 64 bits of the key. A traditional free-space bitmap is not used, since the allocation tree essentially acts as a B-tree version of a BSP tree. (The current implementation of Btrfs, however, does keep an in-memory red-black tree of page-sized bitmaps to speed up allocations.)
Extent items contain a back-reference to the tree node or file occupying that extent. There may be multiple back-references if the extent is shared between snapshots. If there are too many back-references to fit in the item, they spill out into individual extent data reference items. Tree nodes, in turn, have back-references to their containing trees. This makes it possible to find which extents or tree nodes are in any region of space by doing a B-tree range lookup on a pair of offsets bracketing that region, then following the back-references. When relocating data, this allows an efficient upwards traversal from the relocated blocks to quickly find and fix up all downwards references to those blocks, without having to walk the entire file system. This, in turn, allows the file system to efficiently shrink, migrate and defragment its storage online.
The extent tree, as with all other trees in the file system, is copy-on-write. Writes to the file system may thus cause a cascade whereby changed tree nodes and file data result in new extents being allocated, causing the extent tree to itself change. To avoid creating a feedback loop, extent tree nodes which are still in memory but not yet committed to disk may be updated in-place to reflect new copy-on-written extents.
Checksum tree
CRC-32C checksums are computed for both data and metadata and stored as checksum items in a checksum tree. There is one checksum item per contiguous run of allocated blocks, with per-block checksums packed end-to-end into the item data. If there are more checksums than can fit, they spill rightwards over into another checksum item in a new leaf.
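Some back-of-the-envelope arithmetic shows how much data a single full checksum item can cover; the leaf size, per-node overhead and block size below are assumptions for illustration:

```c
#include <stdio.h>

/* Assuming 4 KiB blocks, a 4-byte CRC-32C per block, and a 4 KiB leaf
 * with roughly 128 bytes of header/key overhead (all illustrative). */
int main(void)
{
    unsigned leaf_bytes = 4096, overhead = 128;
    unsigned csum_size = 4, block_size = 4096;
    unsigned csums = (leaf_bytes - overhead) / csum_size;
    printf("one full item: ~%u checksums, covering ~%u MiB of data\n",
           csums, csums * block_size >> 20);   /* ~992 checksums, ~3 MiB */
    return 0;
}
```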
Log tree
An fsync is a request to commit modified data immediately to stable storage. fsync-heavy workloads (such as databases) could potentially generate a great deal of redundant write I/O by forcing the file system to repeatedly copy-on-write and flush frequently-modified parts of trees to storage. To avoid this, a temporary per-subvolume log tree is created to journal fsync-triggered copy-on-writes. Log trees are self-contained, tracking their own extents and keeping their own checksum items. Their items are replayed and deleted at the next full tree commit or (if there was a system crash) at next remount.
Chunk and device trees
Block devices are divided into chunks of 256 MB or more. Chunks may be mirrored or striped across multiple devices. The mirroring/striping arrangement is transparent to the rest of the file system, which simply sees the single, logical address space that chunks are mapped into.
This is all tracked by the chunk tree, where each device is represented as a device item and each mapping from a logical chunk to its underlying physical chunks is stored in a chunk map item. The device tree is the inverse of the chunk tree, and contains device extent items which map byte ranges of block devices back to individual chunks. As in the extent allocation tree, this allows Btrfs to efficiently shrink or remove devices from volumes by locating the chunks they contain (and relocating their contents).
The file system, chunks and devices are all assigned a UUID. The header of every tree node contains both the UUID of its containing chunk and the UUID of the file system. The chunks containing the chunk tree, the root tree, device tree and extent tree are always mirrored—even on single-device volumes. These are all intended to improve the odds of successful data salvage in the event of media errors.
Data relocation tree
The data relocation tree serves as scratch space for extents and their temporary copies during rebalancing or defragmentation. It is exempt from copy-on-write.
Superblock
All the file system's trees—including the chunk tree itself—are stored in chunks, creating a potential chicken-and-egg problem when mounting the file system. To bootstrap into a mount, a list of physical addresses of chunks belonging to the chunk and root trees must be stored in the superblock.[30]
Superblock mirrors are kept at fixed locations:[31] 64 KiB into every block device, with additional copies at 64 MiB, 256 GiB and 1 PiB. When a superblock mirror is updated, its generation number is incremented. At mount time, the copy with the highest generation number is used. All superblock mirrors are updated in tandem, except when SSD mode is enabled, in which case updates will instead alternate between mirrors to provide some wear levelling.
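The fixed mirror offsets quoted above work out as follows; this is a sketch, and the generation-based selection at mount time is only described in the comment:

```c
#include <stdio.h>

/* Fixed superblock mirror locations on every block device. At mount,
 * each candidate is read and the copy with the highest generation
 * number wins. */
static const unsigned long long SB_OFFSETS[] = {
    64ULL << 10,   /* 64 KiB  */
    64ULL << 20,   /* 64 MiB  */
    256ULL << 30,  /* 256 GiB */
    1ULL << 50,    /* 1 PiB   */
};

int main(void)
{
    for (int i = 0; i < 4; i++)
        printf("superblock mirror %d at byte offset %llu\n", i, SB_OFFSETS[i]);
    return 0;
}
```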
References
- ^ a b c d e McPherson, Amanda (2009-06-22). "A Conversation with Chris Mason on BTRfs: the next generation file system for Linux". Linux Foundation.
- ^ Henson, Valerie (2008-01-31). Chunkfs: Fast file system check and repair. Melbourne, Australia. Event occurs at 18m 49s. Retrieved 2008-02-05. "It's called Butter FS or B-tree FS, but all the cool kids say Butter FS."
- ^ Kerner, Sean Michael (2008-10-30). "A Better File System For Linux". InternetNews.com. Retrieved 2008-10-30.
- ^ Edge, Jake (2008-02-06). "CRFS and POHMELFS". LWN.net. Retrieved 2009-06-15.
- ^ "Development timeline". Btrfs wiki. 2008-12-11. Retrieved 2009-06-15.
- ^ Wuelfing, Britta (2009-01-12). "Kernel 2.6.29: Corbet Says Btrfs Next Generation Filesystem". Linux Magazine.
- ^ "Red Hat Enterprise Linux 6 documentation: Technology Previews".
- ^ "MeeGo project chooses Btrfs as standard file system". The H. 2010-05-12.
- ^ "Debian 6.0 "Squeeze" released" (Press release). Debian. 2011-02-06. Retrieved 2011-02-08. "Support has also been added for the ext4 and Btrfs filesystems..."
- ^ Paul, Ryan (2009-04-13). "Panelists ponder the kernel at Linux Collaboration Summit". Ars Technica.
- ^ Ts'o, Theodore (2008-08-01). "Re: reiser4 for 2.6.27-rc1". linux-kernel (Mailing list). Retrieved 2010-12-31.
- ^ Shishkin, Edward (2010-06-18). "Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)". linux-kernel (Mailing list). Retrieved 2010-12-31.
- ^ Shishkin, Edward (2010-06-30). "Re: Balancing leaves when walking from top to down (was Btrfs:...)". linux-btrfs (Mailing list). Retrieved 2011-01-08.
- ^ "btrfs FAQ". 2011-01-08. Retrieved 2011-01-08.
- ^ Corbet, Jonathan (2007-06-19). "btrfs and NILFS". Retrieved 2008-08-09.
- ^ a b Mason, Chris (2007-06-12). "Btrfs: a copy on write, snapshotting FS". linux-kernel (Mailing list). Retrieved 2010-05-29.
- ^ "Wiki FAQ: What checksum function does Btrfs use?". Btrfs wiki. Retrieved 2009-06-15.
- ^ "Conversion from Ext3". Btrfs wiki. 2009-01-25. Retrieved 2009-06-15.
- ^ Mason, Chris (2009-01-12). "Btrfs changelog". Retrieved 2009-01-14.
- ^ "Offline Deduplication for Btrfs". Btrfs mailing list. Retrieved 2011-01-22.
- ^ a b c d Aurora, Valerie (2009-07-22). "A short history of btrfs". LWN.net. Retrieved 2009-08-22.
- ^ Hilzinger, Marcel (2009-04-22). "Future of Btrfs Secured". Linux Magazine.
- ^ Meyering, Jim (2009-08-20). "GNU coreutils NEWS: Noteworthy changes in release 7.5". Retrieved 2009-08-30.
- ^ Scrivano, Giuseppe (2009-08-01). "cp: accept the --reflink option". Retrieved 2009-11-02.
- ^ Mason, Chris (2008-04-01). "Conversion from Ext3". Btrfs wiki. Retrieved 2009-08-22.
- ^ Rodeh, Ohad (2008). "B-trees, Shadowing, and Clones". New York: Association for Computing Machinery.
- ^ a b c d Mason, Chris (2008-12-11). "Btrfs design". Btrfs wiki. Retrieved 2009-08-25.
- ^ Reiser, Hans (2001-12-07). "Re: Ext2 directory index: ALS paper and benchmarks". ReiserFS developers mailing list. Retrieved 2009-08-28.
- ^ Mason, Chris. "Acp".
- ^ Mason, Chris (2008-04-30). "Multiple device support". Btrfs wiki. Retrieved 2009-08-25.
- ^ Bartell, Sean (2010-04-20). "Re: Restoring BTRFS partition". linux-btrfs (Mailing list).