
ext4

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 152.8.99.118 (talk) at 14:31, 13 May 2013 (Criticism: "Criticism" seems like a harsh word to use for T'so just saying that ext4 is a stop gap. He's not identifying flaws in the design, he's just saying another project is a better direction). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

ext4
Developer(s): Mingming Cao, Andreas Dilger, Alex Zhuravlev (Tomas), Dave Kleikamp, Theodore Ts'o, Eric Sandeen, Sam Naghshineh, others
Full name: Fourth extended file system
Introduced: Stable: 21 October 2008 (Linux 2.6.28); Unstable: 10 October 2006 (Linux 2.6.19)
Partition IDs: 0x83 (MBR); EBD0A0A2-B9E5-4433-87C0-68B6B72699C7 (GPT)
Structures
Directory contents: Linked list, hashed B-tree
File allocation: Extents/Bitmap
Bad blocks: Table
Limits
Max volume size: 1 EiB
Max file size: 16 TiB (for 4 KiB block size)
Max no. of files: 4 billion (specified at file system creation time)
Max filename length: 255 bytes (characters)
Allowed filename characters: All bytes except NUL ('\0') and '/'
Features
Dates recorded: modification (mtime), attribute modification (ctime), access (atime), delete (dtime), create (crtime)
Date range: 14 December 1901 – 10 May 2446
Date resolution: Nanosecond
Forks: No
Attributes: acl, bh, bsddf, commit=nrsec, data=journal, data=ordered, data=writeback, delalloc, extents, journal_dev, mballoc, minixdf, noacl, nobh, nodelalloc, noextents, nomballoc, nouser_xattr, oldalloc, orlov, user_xattr
File system permissions: POSIX
Transparent compression: No
Transparent encryption: No
Data deduplication: No
Other
Supported operating systems: Linux, Windows (using Ext2Fsd)
Windows (using ext2fsd)

The ext4 or fourth extended filesystem is a journaling file system for Linux, developed as the successor to ext3.

History

ext4 was born as a series of backward-compatible extensions to ext3, many of them originally developed by Cluster File Systems for the Lustre file system between 2003 and 2006, meant to extend storage limits and add other performance improvements.[1] However, other Linux kernel developers opposed accepting extensions to ext3 for stability reasons,[2] and proposed to fork the source code of ext3, rename it ext4, and perform all the development there, without affecting existing ext3 users. This proposal was accepted, and on 28 June 2006, Theodore Ts'o, the ext3 maintainer, announced the new plan of development for ext4.[3]

A preliminary development version of ext4 was included in version 2.6.19[4] of the Linux kernel. On 11 October 2008, the patches that mark ext4 as stable code were merged into the Linux 2.6.28 source code repositories,[5] denoting the end of the development phase and recommending ext4 adoption. Kernel 2.6.28, containing the ext4 filesystem, was finally released on 25 December 2008.[6] On 15 January 2010, Google announced that it would upgrade its storage infrastructure from ext2 to ext4.[7] On 14 December 2010, Google also announced that it would use ext4, instead of YAFFS, on Android 2.3.[8]

Features

Large file system
The ext4 filesystem can support volumes with sizes up to 1 exbibyte (EiB) and files with sizes up to 16 tebibytes (TiB).[9]
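Both limits can be derived from the widths of the block-number fields. This derivation assumes a 4 KiB ($2^{12}$-byte) block size, 48-bit physical block numbers for the volume, and 32-bit logical block numbers within a single file:

```latex
\begin{aligned}
\text{max volume size} &= 2^{48}\ \text{blocks} \times 2^{12}\ \text{B/block} = 2^{60}\ \text{B} = 1\ \text{EiB} \\
\text{max file size}   &= 2^{32}\ \text{blocks} \times 2^{12}\ \text{B/block} = 2^{44}\ \text{B} = 16\ \text{TiB}
\end{aligned}
```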
Extents
Extents replace the traditional block mapping scheme used by ext2 and ext3. An extent is a range of contiguous physical blocks. Mapping a file with extents improves large-file performance and reduces fragmentation. A single extent in ext4 can map up to 128 MiB of contiguous space with a 4 KiB block size.[1] Up to four extents can be stored directly in the inode. When a file has more than four extents, the rest are indexed in a tree (the extent tree).
Backward compatibility
ext4 is backward compatible with ext3 and ext2, making it possible to mount ext3 and ext2 file systems as ext4. This slightly improves performance, because certain new features of ext4, such as the new block allocation algorithm, can also be used with ext3 and ext2.
ext3 is partially forward compatible with ext4; that is, an ext4 file system can be mounted as ext3 (using "ext3" as the file system type when mounting). However, if the ext4 file system uses extents (a major new feature of ext4), it can no longer be mounted as ext3.
Persistent pre-allocation
ext4 can pre-allocate on-disk space for a file. On most file systems, this would be done by writing zeros to the file when it is created. In ext4 (and some other file systems such as XFS), fallocate(), a new system call in the Linux kernel, can be used instead. The allocated space is guaranteed and likely contiguous. This has applications for media streaming and databases.
Delayed allocation
ext4 uses a performance technique called allocate-on-flush, also known as delayed allocation: ext4 delays block allocation until the data is flushed to disk. (In contrast, some file systems allocate blocks immediately, even when the data goes into a write cache.) Delayed allocation improves performance and reduces fragmentation, because the allocator can use the actual file size when choosing blocks.
Increasing the 32,000 subdirectory limit
In ext3, a directory can have at most 32,000 subdirectories. In ext4, this limit is increased to 64,000, and it can be exceeded by using the "dir_nlink" feature (in which case the parent's link count stops increasing). To allow for larger directories and continued performance, ext4 turns on Htree indexes (a specialized version of a B-tree) by default. This feature is implemented in Linux 2.6.23. In ext3, Htrees can be used by enabling the dir_index feature.
Journal checksumming
ext4 uses checksums in the journal to improve reliability, since the journal is one of the most heavily used files on the disk. This feature has a side benefit: it can safely avoid a disk I/O wait during journaling, improving performance slightly. Journal checksumming was inspired by a research paper from the University of Wisconsin titled IRON File Systems[10] (specifically, section 6, "transaction checksums"), with modifications within the implementation of compound transactions performed by the IRON file system (originally proposed by Sam Naghshineh at the Red Hat Summit).
Faster file system checking
In ext4, unallocated block groups and sections of the inode table are marked as such. This enables e2fsck to skip them entirely, greatly reducing the time it takes to check the file system. Linux 2.6.24 implements this feature.
[Figure: fsck time vs. inode count, ext3 vs. ext4]
Multiblock allocator
When ext3 appends to a file, it calls the block allocator once for each block. Consequently, if there are multiple concurrent writers, files can easily become fragmented on disk. ext4, however, uses delayed allocation, which allows it to buffer data and allocate groups of blocks at once. This lets the multiblock allocator make better choices about allocating files contiguously on disk. The multiblock allocator can also be used when files are opened in O_DIRECT mode. This feature does not affect the disk format.
Improved timestamps
As computers become faster in general and as Linux becomes used more for mission-critical applications, the granularity of second-based timestamps becomes insufficient. To solve this, ext4 provides timestamps measured in nanoseconds. In addition, 2 bits of the expanded timestamp field are used as the most significant bits of the seconds field, deferring the year 2038 problem by an additional 408 years.
ext4 also adds support for date-created timestamps. However, as Theodore Ts'o points out, while it is easy to add an extra creation-date field in the inode (thus technically enabling support for date-created timestamps in ext4), it is more difficult to modify or add the necessary system calls, like stat() (which would probably require a new version), and the various libraries that depend on them (like glibc). These changes would require coordination of many projects. So even if ext4 developers implement initial support for creation-date timestamps, this feature will not be available to user programs for now.[11]

Future direction

In 2008, the principal developer of the ext3 and ext4 file systems, Theodore Ts'o, stated that although ext4 has improved features, it is not a major advance, it uses old technology, and is a stop-gap. Ts'o believes that Btrfs is the better direction because "it offers improvements in scalability, reliability, and ease of management".[12] Btrfs also has "a number of the same design ideas that reiser3/4 had".[13]

Delayed allocation and potential data loss

Because delayed allocation changes the behavior that programmers have been relying on with ext3, the feature poses some additional risk of data loss in cases where the system crashes or loses power before all of the data has been written to disk. Due to this, ext4 in kernel versions 2.6.30 and later automatically handles these cases as ext3 does.

The typical scenario in which this might occur is a program replacing the contents of a file without forcing a write to the disk with fsync(). There are two common ways of replacing the contents of a file on Unix systems:[14]

  • fd = open("file", O_WRONLY | O_TRUNC); write(fd, data, len); close(fd);
In this case, an existing file is truncated at the time of open (due to the O_TRUNC flag), and then new data is written out. Since the write can take some time, there is a window in which the contents can be lost even with ext3, although it is usually very small. However, because ext4 can delay allocating file data for a long time, this window is much larger.
There are several problems with this approach:
  1. If the write does not succeed (which may be due to error conditions in the writing program, or due to external conditions such as a full disk), then both the original version and the new version of the file will be lost, and the file may be corrupted because only a part of it has been written.
  2. If other processes access the file while it is being written, they see a corrupted version.
  3. If other processes have the file open and do not expect its contents to change, those processes may crash. One notable example is a shared library file which is mapped into running programs.
Because of these issues, often the following idiom is preferred over the one above:
  • fd = open("file.new", O_WRONLY | O_CREAT | O_TRUNC, 0644); write(fd, data, len); close(fd); rename("file.new", "file");
A new temporary file ("file.new") is created, which initially contains the new contents. Then the new file is renamed over the old one. POSIX guarantees that replacing a file with rename() is atomic: either the old file remains, or it is replaced by the new one. Because ext3's default "ordered" journaling mode guarantees that file data is written to disk before the associated metadata, this technique guarantees that either the old or the new file contents will persist on disk. ext4's delayed allocation breaks this expectation, because the file write can be delayed for a long time, and the rename is usually carried out before the new file contents reach the disk.

Using fsync more often to reduce the risk on ext4 could lead to performance penalties on ext3 file systems mounted with the data=ordered flag (the default on most Linux distributions). Given that both file systems will be in use for some time, this complicates matters for end-user application developers. In response, ext4 in Linux kernels 2.6.30 and newer detects these common cases (renaming a new file over an existing one, or truncating an existing file to zero length) and forces the file's blocks to be allocated immediately. For a small cost in performance, this provides semantics similar to ext3 ordered mode and increases the chance that either version of the file will survive a crash. This new behavior is enabled by default, but can be disabled with the "noauto_da_alloc" mount option.[14]

The new patches have become part of the mainline kernel 2.6.30, but various distributions chose to backport them to 2.6.28 or 2.6.29. For instance, Ubuntu made them part of the 2.6.28 kernel in version 9.04 ("Jaunty Jackalope").[15]

These patches do not completely prevent potential data loss, and they do not help at all with new files. The only way to be safe is to write and use software that calls fsync when it needs to. Performance problems can be minimized by making the crucial writes that need fsync less frequent.[16]

Compatibility with Windows and Macintosh

ext4 does not yet have as much support on non-Linux operating systems as ext2 and ext3. ext2 and ext3 have stable drivers such as Ext2IFS, which are not yet available for ext4. It is possible to create ext4 file systems usable from Windows by disabling the extents feature, and sometimes by specifying an inode size.[17] Another option for using ext4 in Windows is Ext2Fsd,[18] an open-source driver that, like Ext2IFS, supports writing to ext4 partitions where extents have been disabled. Viewing and copying files from ext4 to Windows, even with extents enabled, is also possible with the Ext2Read software.[19] However, there are no available drivers that provide full read and write compatibility with Windows.

Mac OS X has full ext2/3/4 read/write capability through the commercial Paragon ExtFS[20] software. Free software such as ext4fuse offers only read-only support with limited functionality.


References

  1. Mathur, Avantika; Cao, MingMing; Bhattacharya, Suparna; Dilger, Andreas; Tomas, Alex; Vivier, Laurent (2007). "The new ext4 filesystem: current status and future plans" (PDF). Proceedings of the Linux Symposium. Ottawa, ON, CA: Red Hat. Retrieved 2008-01-15.
  2. Torvalds, Linus (2006-06-09). "extents and 48bit ext3". Linux kernel mailing list.
  3. Ts'o, Theodore (2006-06-28). "Proposal and plan for ext2/3 future development work". Linux kernel mailing list.
  4. Leemhuis, Thorsten (2008-12-23). "Higher and further: The innovations of Linux 2.6.28 (page 2)". Heise Online. Retrieved 2010-01-09.
  5. "ext4: Rename ext4dev to ext4". Linus' kernel tree. Retrieved 2008-10-20.
  6. Leemhuis, Thorsten (2008-12-23). "Higher and further: The innovations of Linux 2.6.28". Heise Online.
  7. Paul, Ryan (2010-01-15). "Google upgrading to Ext4, hires former Linux Foundation CTO". Ars Technica.
  8. "Android 2.3 Gingerbread to use Ext4 file system". The H Open. 2010-12-14.
  9. "Migrating to Ext4". DeveloperWorks. IBM. Retrieved 2008-12-14.
  10. Prabhakaran, Vijayan. "IRON File Systems" (PDF). CS Dept, University of Wisconsin.
  11. Ts'o, Theodore (2006-10-05). "Re: creation time stamps for ext4 ?".
  12. Paul, Ryan (2009-04-13). "Panelists ponder the kernel at Linux Collaboration Summit". Ars Technica.
  13. Ts'o, Theodore (2008-08-01). "Re: reiser4 for 2.6.27-rc1". linux-kernel (Mailing list). Retrieved 2010-12-31.
  14. "ext4 documentation in Linux kernel source". 2009-03-28.
  15. Ubuntu bug #317781. Long discussion between Ubuntu developers and Theodore Ts'o on potential data loss.
  16. Ts'o, Theodore (2009-03-12). Thoughts by Ted blog entry. A blog posting by Theodore Ts'o on the subject.
  17. Description of Ext2Read and information on disabling extents for compatibility with Ext2Fsd.
  18. "Ext2Fsd Project". Ext2fsd.com. Retrieved 2012-01-15.
  19. Description of Ext4 compatibility with Windows 7, 2009-11-01. [dead link]
  20. Paragon ExtFS.