Talk:Lempel–Ziv–Markov chain algorithm

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search
WikiProject Computing (Rated C-class, Mid-importance)
WikiProject iconThis article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 Mid  This article has been rated as Mid-importance on the project's importance scale.


Maybe information about XZ should also be added here. Maybe there should also be an XZ=>LZMA redirect. Mainly, because a lot of GNU/Linux distributions(Slackware, Gentoo, Zenwalk) and NetBSDs pkgsrc switch from tgz to txz these days. A lot of people will ask there self, what it is all about and since XZ is more or less the successor of LZMA I suggest adding a paragraph about it and do a redirect. Maybe also redirecting txz would be a good idea. I would have done it, but I have no account. -- (talk) 13:46, 11 May 2009 (UTC)

  • XZ is not a successor to LZMA, it is a successor to LZMA-utils, which a software package that uses LZMA. -- (talk) 12:40, 13 August 2009 (UTC)

Algorithm Details[edit]

I can't find anywhere in this article, or in the external links, any details on this algorithm (not the implementation or "7-zip"). Who invented it? When? What is the difference between it and the traditional LZ77, LZ78 and LZW algorithms, or is it a completely new algorithm? What's the acronym "LZMA"? Thanks. Nyh 08:33, 15 Dec 2004 (UTC)

  • Ditto that request. So far as I can tell, the author of 7-zip, Igor Pavlov, developed the algorithm, so the source itself seems to be the only documentation. From what I have been able to find, LZMA is an extension of LZ77 with a more flexible "dictionary" structure. I put dictionary in quotes, since LZ77's dictionary is a history window. Judging by the LZMA compression parameters described on this page, it appears LZMA's secret sauce is in its match-search method that allows it to scale its sliding window dramatically. mr_z 01:02:57, 21 Dec 2004 01:02:57 (UTC)
  • I came here looking for the same answers as well, esp. for this quote "is a data compression algorithm in development until 2001", in development by who and why did it stop in 2001? Does anyone know the history of this algorithm? --ShaunMacPherson 5 July 2005 12:59 (UTC)
  • Igor Pavlov could have discovered and implemented it as he claims. There has been no counter-argument or dispute on his ownership claims so far and the claims are plausable. The theory behind the algorithm is quite simple. Start with the older LZ77 compression. Rather than using a complete dictionary of all tokens it uses Markov chains to generate a very small dictionary of available tokens visible in nearby data. The tokens themselves are range encoded values which further improves the general dictionary size. He also ignored LZ77's arbitrary limitation on data sizes when generating the archive, which allows the huge dictionary to tend toward a more optimal set of tokens. All of these technologies are well-studied and understood, so there is no need for continued research on the format. Based on his many years of comments in the SourceForge forums it started as his own little personal project to see what happens when the technologies were combined. He noticed the results were excellent and then released to a few people for review. It quickly grew in popularity even though there were no formal papers describing the algorithm. Bwagstaff (talk) 23:22, 25 August 2009 (UTC)
  • LZMA's "secret sauce" is the fact it's a literal/match scheme coupled with range encoding, using an adaptive probability table capable of taking things into account like "match offsets tend to be multiples of 4" or "if we're currently on a 4-byte boundary then this byte is very likely to be a non-match". I hadn't actually thought of it as being related to Markov chains but I guess it is. The reason it beats the pants off Deflate is because Deflate uses Huffman coding, which is easier to decode but nearly always suboptimal.
Range encoding is incrementally more efficient than Huffman coding, but not nearly enough so to account for the dramatic performance increase over Deflate. If couple LZMA with a Huffman coder, and LZ77 with a range/arithmetic coder, the gap between them would close somewhat, but LZMA would still generally beat the pants off LZ77. The real "secret sauce" is the huge dictionary size, and the lookup algorithms that let the encoder search a dictionary of that size while staying usefully fast. --Piet Delport 16:23, 23 December 2005 (UTC)
Granted, Deflate's 32K limit doesn't help. So the "secret sauce" is the combination of the bigger window size plus the encoding that allows the bigger window size to be represented economically. See, I learned something! Posting wild speculation until someone corrects you is the true spirit of Wikipedia. -- 18:23, 2 January 2006 (UTC)
If by "encoding" you mean "data structure", then yes, that's it. :) --Piet Delport 23:48, 2 January 2006 (UTC)
  • What's the story with patents on the compression algorithm? The compression format has been gaining some popularity with open-source software and distributions. Maybe 7-zip owns patents but will only enforce them when the format is popular. Data compression is a minefield WRT patents - for example, that's why we have bzip2 instead of the arithmetic-coding variant bzip. OTOH, the Range encoding entry notes that range encoding might be patent-free. -- Richard, 19 Jan 2007

It's amazing that 5 years after I asked the question at the top of this section (how does the LZMA algorithm in fact work), and after lzma is becoming more and more popular, still there is no article, paper, or anything (at least not one that I could find) describing how the algorithm work. This Wikipedia article is just as useless on the subject as it was 5 years ago: The "algorithm" section says nothing about the algorithm (just the file format...), and the phrase "Markov Chain" isn't even mentioned after this title. The comments above are probably the most details I could find on the Internet about how this algorithm works, which is sad. Contrast this with Burrows and Wheeler's paper about the BWT (the algorithm behind bzip2), which is one of the most beautifully written papers I have ever read. Nyh (talk) 15:45, 25 February 2010 (UTC)

Charles Bloom wrote an analysis of the LZMA source code in his blog on Aug. 22, 2010. It's not clear if he means LZMA or LZMA2. Matt Mahoney (talk) 01:50, 13 September 2010 (UTC)

compression speed[edit]

On what benchmark are based the claim about a compression speed of 1MB/sec ? I've done some quick and dirty bench on a P4 2.8 Ghz and don't get something near the 1MB/sec with a 2Ghz processor as stated in the article. The test is done on fr page current (20050903_pages_current.xml.gz)

$ time lzma e pages_current.xml pages_current.xml.lzma
real    42m46.403s
user    41m47.801s
sys     0m2.220s
-rw-r--r--  1 ~~~ users 716524090 2005-09-23 23:39 pages_current.xml
-rw-r--r--  1 ~~~ users 141582811 2005-09-24 00:24 pages_current.xml.lzma
compression 280K/sec

$ time lzma d pages_current.xml.lzma pages_current.xml.1
real    0m42.048s
user    0m34.918s
sys     0m2.052s
decompression 17MB/sec

$ time bzip2 pages_current.xml
real    5m44.382s
user    5m31.425s
sys     0m1.560s
-rw-r--r--  1 ~~~ users 716524090 2005-09-24 00:27 pages_current.xml
-rw-r--r--  1 ~~~ users 164451016 2005-09-23 23:39 pages_current.xml.bz2
compression ~ 2 MB/sec

$ time bunzip2 pages_current.xml.bz2
real    1m50.196s
user    1m46.415s
sys     0m2.800s
decompression 6.5 MB/sec

$ time gzip pages_current.xml
real    1m13.921s
user    1m12.205s
sys     0m1.240s
-rw-r--r--  1 ~~~ users 716524090 2005-09-24 00:27 pages_current.xml
-rw-r--r--  1 ~~~ users 213544652 2005-09-23 23:39 pages_current.xml.gz
compression 9.8 MB/sec

$ time gunzip pages_current.xml.gz
real    0m14.064s
user    0m12.357s
sys     0m1.604s
decompression 51 MB/sec

Given the number above either the used data perform very differently from the benchmarked data or the compression speed is 1Mbits/sec rather than 1MBytes/sec. This test used the lzma sdk 4.27. Note also the memory usage seems pretty higher than gzip/bzip, using default option the compressor use 80MB of RAM. 23:58, 23 September 2005 (UTC)

  • try lzma sdk 4.32 or later with -a0 -D20 -fb64 -mfhc4 options. Should be faster than 1 MB/s, with only a slight lower compression ratio.

It's also unclear whether that 1MB/s refers to 1MB of input consumed per second or 1MB of output produced per second.

For compression, the rate should refer to input consumed, and for decompression, output produced. In other words, it's always in reference to the uncompressed data. (Otherwise, the rating would be largely determined by the compression ratio itself, which is not very useful.) --Piet Delport 00:50, 17 April 2006 (UTC)

Comparision of compression ratio[edit]

I did a quick comparision, default options for gzip,bzip2,lzma (from lzma utils] and rar:

 -rw-r--r--  1 root root 199M 2005-12-04 20:26 linux-2.6.11.tar
 -rw-r--r--  1 root root  36M 2005-12-04 20:28 linux-2.6.11.tar.bz2
 -rw-r--r--  1 root root  45M 2005-12-04 20:29 linux-2.6.11.tar.gz
 -rw-r--r--  1 root root  30M 2005-12-04 21:15 linux-2.6.11.tar.lzma
 -rw-r--r--  1 root root  34M 2005-12-04 21:24 linux-2.6.11.tar.rar

Quick decompression bench:

 # lzma d -so linux-2.6.11.tar.lzma |pv > /dev/null
 199MB 0:00:11 [17.2MB/s]
 # cat linux-2.6.11.tar.bz2| bunzip2 |pv > /dev/null
 199MB 0:00:20 [9.52MB/s]
 # cat linux-2.6.11.tar.gz| gunzip |pv > /dev/null
 199MB 0:00:03 [63.8MB/s]
Previous section talk about compression speed, not decompression speed, any number on this ? phe 12:56, 4 December 2005 (UTC)
  • Had it confirmed by Igor Pavlov that the acronym LZMA does indeed stand for Lempel Ziv Markov Chain, and not Lempel Ziv Modified Algorithm as previous asserted. My apologies.

Dictionary Size[edit]

This page (and the page for 7z) claim variable dictionary size is up to 1GB. However, at 7zip page, claim is up to 4GB. Any reference for (it once being or practically being) 1GB? --Gojomo 19:53, 16 February 2007 (UTC)

The current version of 7zip for Windows x64 (4.65) only supports a dictionary size of up to 1GB. From memory it's been like that always, or perhaps even lower (never paid much attention to such a high size, I admit, I don't have enough RAM to compressing something with such a large dictionary). Given that a dictionary of 4GB would need over 40GB of RAM for compression, I'm frankly not surprised that it's limited. I presume LZMA, and probably the 7z format is capable of supporting such a dictionary size, perhaps the current version is even capable of decompressing that large a dictionary if you have over 4GB of RAM and a file made with some other program. Nil Einne (talk) 18:25, 5 August 2009 (UTC)

Tone cleanup[edit]

I added a tone cleanup tag to the section describing the algorithm, because it sounds a little unprofessional. I don't think it should tell the reader to rely on forums for information, since forums aren't even an acceptable source of citations for Wikipedia. Describing the lack of information, or the that most of the information is only available via the forums, might say the same thing without directly telling the reader to rely on them, but I think it'd be best to either cite the information from a refutable source or simply not include it until it does become available. Maybe even include speculation by a refutable source (so long as it's properly cited). --Sydius (talk) 00:17, 23 July 2008 (UTC)

Forums are acceptable in some circumstances. See WP:SOURCES for details. I believe they meet the guidelines for acceptable sources. There are no known scholarly articles or news sources that provide the information in secondary form. The currently-best available references are those forum posts and other self-published comments by the subject's creator. The author is relavent to the subject, in that he was the undisputed creator of the algorithm and therefore expert about it. The sources are not self-serving or from third parties, as they are direct Q&A by the author to his technical audience. There is no reason to dispute authenticity or accuracy, since the author's identity and expert status is clear. Finally, the article is not based exclusively on the forum posts; the posts are simply the best known source of information. Unless and until there are better sources available, his forum posts seem to be the best available source and allowed according to WP guidelines. Bwagstaff (talk) 07:40, 10 January 2010 (UTC)

Use of LZMA in RPM[edit]

The latest edit relating RPM and LZMA points to a comment on proposed, rather than actual implementation. Unless there's a better source, the edit should be reverted. Tedickey (talk) 11:28, 16 December 2008 (UTC)

Fixed that for you :) (talk) 09:47, 31 December 2008 (UTC)

Expert attention[edit]

The reason I requested an expert's attention is because the article first quotes a potentially non notable but I presume expert, then goes to say he's mistaken (without any source). If the person quoted is wrong, there's little reason to quote him. Instead we should just mention how LZMA works, with sources. Nil Einne (talk) 18:14, 5 August 2009 (UTC)

Algorithm details[edit]

As you may have noticed, I added an attempt to fully explain the algorithm based on studying the C source code, and as far as I can tell this article is now the only decent textual description of LZMA available anywhere.

Obviously, this is worrying from a WP:ORIGINAL perspective; however, I feel it's a good idea to have this material for two reasons:

  1. The idea of WP:ORIGINAL is to make sure articles are verifiable. While this article is not verifiable by non-programmers, programmers can read the source code themselves and perform verification; the cited Linux kernel source code is short and readable enough to verify the correctness of the decoder description in a time short enough to make this realistic
  2. In about 10 years, no description of the algorithm has apparently ever been produced, despite the fact that it is widely used and that several programmers are necessarily very familiar with how it works, since they designed, implemented and ported it; Wikipedia is highly visible, so those experts are more likely to find the article and review or improve it, than if it were placed anywhere else.

The part about the LZMA encoders is the only one that is hard to verify, because the source code is very complex, and written in a hard-to-understand style; in particular, the part about the normal encoder is rally an educated guess mostly based on looking at the variables and loop structure.

However, the idea of using dynamic programming in the encoder is a pretty important feature, so it probably makes sense to keep it, although replacing or augmenting it with citations of an article stating that would be nice, if such an article were found.

The page now references the XZ for Java encoder which has more readable source code (allowing verification), and should be, modulo mistakes, a reasonably accurate high level description of its behavior.

It would be great if Igor Pavlov, Lasse Collin, or other people with direct implementation experience reviewed this, and even better if a paper describing and analyzing LZMA were published on peer-reviewed journals.

Eldray (talk) 19:25, 3 April 2012 (UTC)

There's no source given for most of the details (leaving others to assume that it was perhaps done by reading the source code - not the best way to proceed) TEDickey (talk) 21:24, 6 April 2012 (UTC)
Yes, it was done by reading the source code, which is mentioned in the article along with citations to the source code in question; unfortunately, no citable complete textual description of LZMA seems to exist, so for now there doesn't seem to be a better alternative.
There's a person who said he intends to publish a thesis or paper describing it, so maybe it could be cited once published. I also did cite Solomon's book, but it only covers a very limited part of the algorithm (hash chains and binary trees).
It's possible to add citations to specific source code lines for each paragraph, not sure if it's much more useful than just citing the relevant file.
Eldray (talk) 19:12, 7 April 2012 (UTC)
I am interested in seeing a real spec for LZMA happen. If anyone is reading this and would be interested in contributing to the effort (possibly with some support), please get in touch with me (the best way is through email, which is first name dot last name at gmail dot com). Raph Levien (talk) 18:50, 1 February 2013 (UTC)

"originally" used without source[edit]

The discussion is referring to "originally" in the context of Common Public License, but that license is dated more recently than the beginning of the project. Replacing that by "earlier" would work with the existing sources; it is unsourced as is. TEDickey (talk) 10:45, 5 August 2012 (UTC)

Actually, he's right. The earliest archived version of the LZMA SDK is dual-licensed under the CPL and the LGPL. And that came out 4 years before the CPL was published. Myconix (talk) 00:53, 12 February 2014 (UTC)
The comment states 4.62; your suggested source is 4.23 (and pointing the reader to a download site is less than helpful). The statement should be rewritten, given the information you are providing TEDickey (talk) 01:03, 12 February 2014 (UTC)
He was referring to the first public domain release, not the first release overall. Regardless, such an edit is trivial at best, and the statement has been clarified. If you believe there is a better citation to place there with the same information, feel free to replace it. Myconix (talk) 02:16, 12 February 2014 (UTC)

LZMA2 format?[edit]

Why is the LZMA2 container format described in an article about the LZMA algorithm? I feel that too many unrelated formats and encoding schemes are described in this page (LZMA2 encoder, Upper encoding layers). They should be moved to the 7-zip or xz pages. Ant diaz 14:05, 9 March 2013 (UTC) — Preceding unsigned comment added by Ant diaz (talkcontribs)

Specification of the compressed format[edit]

Here is a complete natural language specification of the compressed format. I hope someone can use it to fix the section Decompression algorithm details in the article. Ant diaz (talk) 16:48, 31 October 2013 (UTC)

Ant diaz: as you seem to know the subject, could you assist by highlighting which paragraphs are presently inaccurate. —Sladen (talk) 17:10, 31 October 2013 (UTC)
Sure. I must say that the article is well written given the sources used and the complexity of the algorithm. BTW, I found (and fixed) a mistake in the lzip manual while preparing this entry.
The second paragraph mentions the "LZMA2 algorithm", but LZMA2 is neither an algorithm nor the subject of the article. See the previous topic.
I find the first paragraph of Range coding of bits a bit confusing where it says "LZMA data is at the lowest level decoded one-bit at a time". "LZMA data" can be understood as the "data in the LZMA stream", which is not decoded one-bit at a time.
The second paragraph includes an implementation detail (that prob is typically implemented using a 16-bit data type) which most probably is no longer true. The size is configurable to 16 or 32 bits in the SDK from Igor Pavlov, and fixed to 'int' (usually 32 bit) in lzip because it is usually faster. BTW, do implementation details belong in an article of Wikipedia?
In the same paragraph, 'prob' is described as "the predicted probability of the bit being 1", but it is really the probability of the bit being 0. The larger the value of 'prob', the more probable that the decoded bit is a 0.
The descriptions following the fifth paragraph state that normalization is done before decoding. This is a speed optimization in the code. Normalization has to be done after decoding or else an extra normalization must be done at the end of the stream, which is not mentioned in the article.
In the last three paragraphs of LZMA coding contexts, just before the last table, I find the description of the coding of literals very confused. There is not a "pseudo-bit-tree", but 3 separate bit tree contexts. Maybe the following text could make it clearer: "The context array 'bm_literal' is special. In principle it acts as a normal bit tree context, the one selected by 'literal_state'. But if the previous decoded byte was not a literal, two other bit tree contexts are used depending on the value of each bit in 'match_byte' (the byte at the latest used distance), until a bit is decoded that is different from its corresponding bit in 'match_byte'. After the first difference is found, the rest of the byte is decoded using the normal bit tree context."
Hope this helps. - Ant diaz (talk) 18:28, 1 November 2013 (UTC)

no algorithm, completely useless article[edit]

the article is completely useless being without the actual algorithm and the differences to LZ77 in general. Where the fuck is the algorithm ? you explain the output format ?? the heading says "..algorithm" and not format...!!!!mannomann.

"Dictionary search data structures" etc in the article: the article is totally misleading - hash tables etc. are part of any LZ family compression algorithm in real implementations - it is presented here as crucial to LZMA, but it is merely known to use it in whatever LZ coding. — Preceding unsigned comment added by (talk) 15:40, 10 December 2014 (UTC)

There is no such thing as a "LZMA algorithm"; it is more like a "LZMA coding scheme". For example, the option '-0' of lzip uses the scheme in almost the simplest way possible; issuing the longest match it can find, or a literal byte if it can't find a match. Inversely, a much more elaborated way of finding coding sequences of minimum price than the one currently used by lzip could be developed, and the resulting sequence could also be coded using the LZMA coding scheme.
BTW, LZMA is a coding scheme with notable features, as shown by the data recovery capabilities of lziprecover. Ant diaz (talk) 18:58, 13 December 2014 (UTC)

LZMA2 vs. LZMA[edit]

This article mentions LZMA2 in several places, but in a very confusion fashion. I would like to clean this up, so I have some questions.

  • LZMA2 is a "file format", not a compression system, right? If so, the statement "LZMA2 compression, which is an improved version of LZMA" really needs to go.
  • "The XZ LZMA2 encoder processes" is referring to the part of the XY system that handles files (streams, the royal we), right? If so, is there no other term we can use than "encoder"? This implies something that it is not, at least in my terminology.
  • "In the 7-zip LZMA file format" and "the .7z file format, which can contain either LZMA or LZMA2 data" If item 1 is true, the statement "LZMA2 data" is meaningless. I suspect "either LZMA" means that the format previously used some other encoding, perhaps a single raw stream, but it does not appear to be mentioned.
  • "The .[xz] format, which can contain LZMA2 data" Can contain LZMA2? Well what other options are there, if any?
  • "the LZMA and LZMA2 algorithm details" This could be innocent wording, but I would like to be clear, does this mean "details of the LZMA2 file format..."?

I think this article would be greatly improved by moving all mentions of file formats out to a separate section.

Maury Markowitz (talk) 21:29, 17 December 2015 (UTC)

(1) LZMA2 is a data format as described in section "LZMA2 format". In practice, for compressible data, LZMA2 is just LZMA with 0.015%-3% more overhead due to chunk headers.
(2) It can be argued that the only encoder in xz is LZMA. The rest are either filters or containers, all unrelated to LZMA IMHO).
(3) The 7-zip program can unpack files in a variety of formats, including lzma-alone (LZMA), which is a single raw stream with a 13-byte header. The .7z file format is a complex container format that can contain a lot of data formats, including LZMA and LZMA2. See for example p7zip_15.09/DOC/Methods.txt
(4) Xz is a container format which curently contains another container format (LZMA2), which in turn contains an arbitrary mix of LZMA data and uncompressed data. That is all. Really. (Filter output does not appear in the xz file).
(5) "details of the LZMA2 data format" could be even clearer. I don't know about any LZMA2 files.
And I think this article would be greatly improved by moving all mentions of file formats to the corresponding pages (7-zip and xz). Best regards. Ant diaz (talk) 20:09, 18 December 2015 (UTC)

External links modified[edit]

Hello fellow Wikipedians,

I have just modified one external link on Lempel–Ziv–Markov chain algorithm. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

As of February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete the "External links modified" sections if they want, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{sourcecheck}} (last update: 15 July 2018).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 13:39, 20 December 2017 (UTC)