Lempel–Ziv–Markov chain algorithm
|
|
This article needs attention from an expert on the subject. See the talk page for details. WikiProject Computer science or the Computer science Portal may be able to help recruit an expert. (August 2009) |
The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform data compression. It has been under development since 1998[1][2] and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to LZ77 and features a high compression ratio (generally higher than bzip2[3][4]) and a variable compression-dictionary size (up to 4 GB).[5]
Contents |
[edit] Overview
|
|
This section needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (July 2010) |
LZMA uses a dictionary compression algorithm (a variant of LZ77), whose output is then encoded with a range encoder. The dictionary compressor produces a stream of literal symbols and phrase references, which is encoded one bit at a time by the range encoder, using a model to make a probability prediction of each bit. Prior to LZMA, most encoder models were byte-based (i.e. they coded each bit using a cascade of contexts to represent the dependencies on previous bits from the same byte). The main innovation of LZMA is that instead of a generic byte-based model, LZMA's model uses contexts specific to the bitfields in each representation of a literal or phrase. This is nearly as simple as a generic byte-based model, but gives much better compression because it avoids mixing unrelated bits together in the same context.
[edit] Compressed Format
|
|
This section needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (July 2010) |
In LZMA compression, the compressed stream is a stream of bits, encoded using adaptive binary range coder. The stream is divided into packets, each packet describing either a single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modelled with independent contexts, so the probability predictions for each bit are correlated with the values of that bit (and related bits from the same field) in previous packets of the same type.
There are 7 types of packets:
| packed code (bit sequence) | packet description |
|---|---|
| 0 + byteCode | A single byte encoded using an adaptive binary range coder. The range coder uses context based on some number of the most significant bits of the previous byte. Depending on the state machine, this can also be a single byte encoded as a difference from the byte at the last used LZ77 distance. |
| 1+0 + len + dist | A typical LZ77 sequence describing sequence length and distance. |
| 1+1+0+0 | A one-byte LZ77 sequence. Distance is equal to the last used LZ77 distance. |
| 1+1+0+1 + len | An LZ77 sequence. Distance is equal to the last used LZ77 distance. |
| 1+1+1+0 + len | An LZ77 sequence. Distance is equal to the second last used LZ77 distance. |
| 1+1+1+1+0 + len | An LZ77 sequence. Distance is equal to the third last used LZ77 distance. |
| 1+1+1+1+1 + len | An LZ77 sequence. Distance is equal to the fourth last used LZ77 distance. |
The length is encoded as follows:
| Length code (bit sequence) | Description |
|---|---|
| 0+ 3 bits | The length encoded using 3 bits, gives the lengths range from 2 to 9. |
| 1+0+ 3 bits | The length encoded using 3 bits, gives the lengths range from 10 to 17. |
| 1+1+ 8 bits | The length encoded using 8 bits, gives the lengths range from 18 to 273. |
The distance is encoded as follows:
First a distance class is encoded using 6 bits. The 5 other bits of the distance code encode the information about how many direct distance bits need to be extracted from the stream.
[edit] 7-Zip reference implementation
The LZMA implementation extracted from 7-Zip is available as LZMA SDK. It was placed by Igor Pavlov in the public domain on December 2, 2008, with the release of version 4.62.[6] It was originally distributed under the terms of both the GNU LGPL and Common Public License, with a special exception for linked binaries.
LZMA2 compression, which is an improved version of LZMA,[7] has been introduced in version 9.04 beta, of May 30, 2009.[8]
The reference open source LZMA compression library is written in C++ and has the following properties:
- Compression speed: approximately 1 MB per second on a 2 GHz CPU
- Decompression speed: 10-20 MB per second on a 2 GHz CPU
- Support for multi-threading.
In addition to the original C++, the LZMA SDK contains reference implementations of LZMA compression and decompression ported to ANSI C, C#, and Java.[6] Python bindings also exist for the C++ library and ports of LZMA to Pascal and Go.[9][10][11]
The 7-Zip implementation uses several variants of hash chains, binary trees and Patricia tries as the basis for its dictionary search algorithm.
Decompression-only code for LZMA generally compiles to around 5 KB and the amount of RAM required during decompression is principally determined by the size of the sliding window used during compression. Small code size and relatively low memory overhead, particularly with smaller dictionary lengths, and free source code make the LZMA decompression algorithm well-suited to embedded applications.
[edit] See also
[edit] References
- ^ The SDK history file states that it was in development from 1996, and first used in 7-zip 2001-08-30. Aside from some unreferenced comments about 1998, the algorithm appears to have been unpublished before its use in 7-zip.
- ^ Igor Pavlov has asserted multiple times on SourceForge that the algorithm is his own creation.. "One of many forum posts with this claim.". Archived from the original on 25 August 2009. http://www.webcitation.org/5jImhbWcv. Retrieved 25 August 2009.
- ^ Collin, Lasse (2005-05-31). "A Quick Benchmark: Gzip vs. Bzip2 vs. LZMA". The Tukaani Project. http://tukaani.org/lzma/benchmarks.html. Retrieved 2008-09-02.
- ^ Klausmann, Tobias (2008-05-08). "Gzip, Bzip2 and Lzma compared". Blog of an Alpha animal. http://blog.i-no.de//archives/2008/05/08/index.html#e2008-05-08T16_35_13.txt. Retrieved 2008-09-02.
- ^ Overview of the LZMA format 7z Format
- ^ a b http://www.7-zip.org/sdk.html
- ^ from Inno Setup Help : "LZMA2 is a modified version of LZMA that offers a better compression ratio for uncompressible data (random data expands about 0.005%, compared to 1.35% with original LZMA), and optionally can compress multiple parts of large files in parallel, greatly increasing compression speed but with a possible reduction in compression ratio."
- ^ History of 7-zip changes
- ^ http://www.joachim-bauch.de/projects/pylzma/
- ^ http://www.birtles.org.uk/programming/
- ^ http://lzma.googlecode.com/
[edit] External links
- Official home page
- Lzip format specification
- XZ format specification
- LZMA SDK (Software Development Kit)
- LZMA Utils = XZ Utils
- Windows Binaries for XZ Utils
- Data compression, Compressors & Archivers
|
||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||