Talk:LZ77 and LZ78

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search
WikiProject Computing (Rated C-class, High-importance)
WikiProject iconThis article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.


Just a question for the author of this article... I am not familiar with the term "Patent minefield", used several times in this article. A wikipedia search for it shows that this is apparently the only article that uses that phrase. I don't object to the phrase really, but it might be good to include at least a brief explanation of what is meant by "patent minefield" or what the implications of being a patent minefield are. -- 22:34, 12 Feb 2005 (UTC)

See Software patent#Perceived negative effects. --Damian Yerrick (talk | stalk) 19:01, 9 October 2007 (UTC)

LZ78 Example[edit]

I've added a (hopefully pretty thorough) example of the LZ78 algorithm. I'd be grateful if someone other than me could give it a checking-over as a safeguard - and to make sure I can add in binary :)

I'm less familiar with LZ77, but intend to do a similar example in the near future.

Jenesis 16:53, 23 October 2005 (UTC)

Shouldn't there be a note saying in what circumstances would the newly-created symbol be sent right away by the encoding algorithm In the ABABA example? Maybe it's obvoius but I'm too dumb to see it ;_) --filu 12:18, 10 November 2005 (UTC)
OK, you may disregard my remark. I must had been hit on the head with hammer when I wrote that ;_) --filu 12:30, 11 November 2005 (UTC)

These sentences are difficult to understand:

In the example, so the third triple (0, 0, "c") should be incorporated, what the compression ratio, however, deteriorated significantly. 

It is important to note that more and more a character is shifted, was found to be in accordance to new characters do not have to double encode.

I don't know enough to fix this myself. Poelslager (talk) 04:22, 26 June 2015 (UTC)

Equivalence of LZ77 and LZ78[edit]

Does anyone have the cite that states that LZ77 and LZ78 are equivalent? I have vague memories that they were very different and that the only similarity is that they were devised by the same people. I also recall reading a paper showing they are equivalent. I can't remember which is correct anymore, but if someone knows of the cite, it would help. Rayjapan 10:46, 15 April 2007 (UTC)

I also looked at the page and wanted to put in a "Citation Needed" tag, so instead, I mentioned how LZ78 allows random access to input as long as the entire dictionary is available, while LZ77 does not.--Dwedit (talk) 13:57, 26 March 2009 (UTC)
But if LZ78 supports random access given the whole dictionary at any point, then LZ77 supports random access given the whole history buffer at any point. If LZ78 supports random access after the whole item has been decompressed and its dictionary stored in RAM, then this dictionary will likely be at least as big as the union of all history buffers (that is, the whole decompressed stream). Say you have 1024 bytes available for the history buffer or the dictionary; the rest goes to the operating system, and the decompressed stream is going to a character device without easy random access. At what point is "the whole dictionary" available? If you mean that the dictionary is stored literally in nonvolatile memory (separate from the target archive), that's either metatiling, DTE, or what Witten, Moffat, and Bell in Managing Gigabytes called Huffword, not LZ78. --Damian Yerrick (talk | stalk) 15:01, 27 November 2010 (UTC)

"sliding window"[edit]

what's this? —Preceding unsigned comment added by (talk) 16:53, 27 October 2008 (UTC)

In LZ77-family algorithms, the "sliding window" or "history buffer" is the end of the uncompressed text before any given point, used as a reference for matches. Please re-read the paragraph beginning "The encoder and decoder must both keep track of some amount of the most recent data" in the article. What about it do you not understand? --Damian Yerrick (talk | stalk) 15:14, 27 November 2010 (UTC)

Patent contravercies[edit]

There is a need for a section on patent controversies involving these algorithms. Even though I could find references, I'm not the one to write it, as I have strong opinions on the matter and could not be unbiased.

These patent issues are very important because they were one of the first significant software patent issues on "scope of patent" determined in US courts.

I think such a section should be written from a historical perspective. — Preceding unsigned comment added by RuediiX (talkcontribs) 10:50, 9 January 2011 (UTC)

Ancient history - Stac corporation sued Microsoft for using a particular hash algorithm when recompressing a file in MSDOS 6.20. MSDOS 6.21 removed that feature, and MSDOS 6.22 added it back using a different hash algorithm. Some computer peripheral companies using specific implementations of compression algorithms had to pay royalties for them. Rcgldr (talk) 20:36, 13 May 2011 (UTC)

This article needs some corrections[edit]

LZ77 is not a dictionary coder. Instead a history buffer is used and the output codes are length and buffer offset, or a raw data byte. Both the lengths and offsets may be huffman coded. The article has a reasonble explanation of the algorithm.

LZ78 is a dictionary coder, but does not work on future data. Instead the history for LZ78 is it's dictionary instead of a history (sliding window) buffer.

LZW encoder (based on LZ78) - The algorithm starts by setting the next code to 100 hex, and reading the first byte of the input stream and setting current code equal to that first byte. Then each next byte is read from the input stream, and the dictionary is searched for current code + next byte. If a match is found, the current code is updated to equal the index of the matching dictionary entry and there is no output. If a match is not found, a new dictionary code is created at index next code for current code + next byte, next code is incremented, and the currrent code is output. The next byte becomes the new current code. If the dictionary is filled, then no more entries are added to the dictionary. There may be an optional limit for the longest string allowed for a dictionary entry, and if this limit is reached, the current code is forced to be output, but no dictionary entry is added. Compression ratio may be monitored and used to reset the dictionary as needed. A special reset code will be needed to indicate a reset. The size of the dictionary codes in the compressed stream may be also increased with yet another special control code.

LZW decoder (based on LZ78) - The first current code input will be a raw byte, and will be output. last code and last byte will be set to that first code. Then for each current code input, the decoder will index the dictionary in reverse order, retrieving a previously defined code and raw byte with each dictionary entry retrieval. Since it picks up the raw bytes of a string in reverse order, the raw bytes need to be stored in a temporary buffer while the dictionary entries are traced backwards until dictionary code represents a raw byte (less than 100 hex), which is the first byte of the string for current code. last byte is set to that byte and a dictionary entry is created at next code, with values last code + last byte. last code is set to current code. next code is incremented. The buffered string is output. As an option, the decoder could also maintain string lengths in the dictionary and index into an output buffer while retrieving the raw bytes in reverse order. There is one special case where the decoder receives a code that's not yet in it's dictionary at next code. In this case, last code + last byte will be used to create the new dictionary entry, and the algorithm continues normally. If there is a maximum string length, the decoder has to follow the same rules as the encoder.

Note that old pc-dos compression programs such as pkzip include both LZ77 and LZ78 methods, with the main enhancement of using a huffman coding table for the raw bytes based on a scan ahead of the data to be compressed, and dynamic creation of huffman coding tables for the codes to be output. Rcgldr (talk) 15:06, 12 November 2011 (UTC)

  • From the LZ78 section: At this point it will output the location of the word in the dictionary, if one is available, the match length, and the character that caused a match failure. Since the match length is implied by the dictionary word, match length is not output, and only the dictionary word and raw character are output. For LZW, the character that caused the match failure is assumed to be the first character of the next dictionary code, so only the dictionary word is output. For LZW, the final word or character is output when the input stream is exhausted. Rcgldr (talk) 15:06, 12 November 2011 (UTC)
  • updated LZ78 section Rcgldr (talk) 16:34, 12 November 2011 (UTC)

preference of LZ77 versus LZ78 algorithms[edit]

The original section for LZ78 mentioned one reason for preference of LZ77 for LZ78 was patent issues, but there has been a history of patents issued for algorithms based on both LZ77 and LZ78.

Regarding preference of LZ77 versus LZ78, one reason is that Intel cpu based programs (at least 16 and 32 bit code) generally compress better with LZ77 based algorithms. For most other file types, compression ratios seem about the same. There are other issues. LZ78 stores strings in reverse order, which the decoder has to deal with. Most LZ78 algorithms add some addtional boundary conditions, such as a maximum string length (if exceeded, it's treated as a mis-match), and monitor compression ratio to see if the dictionary should be reset and a new one built, so they end up more complicated than LZ77 algorithms. Rcgldr (talk) 17:45, 12 November 2011 (UTC)

encoding and decoding section should probably be removed[edit]

Someone using the persona Klauria (doesn't exist as a Wiki user) added an encoding and decoding section, along with an example, but it's based on LZW algorighm. It probably should be removed, since it doesn't apply at all to the LZ77 algorithm, and it's an enhancment of the LZ78 algorithm. Rcgldr (talk) 07:47, 22 June 2012 (UTC)

Academic Paper[edit]

Is this paper worth including in the article to reference the actual algorithm? — Preceding unsigned comment added by 2602:306:BC90:CE30:213:72FF:FEE6:1A7C (talk) 08:17, 23 December 2014 (UTC)

The paper is already referenced in the article. 03:19, 27 December 2014 (UTC)

Quality of text[edit]

This article would benefit from a thorough proof-reading as several sections are written in bad English and are, at times, incomprehensible. Ie. LZ77#Example — Preceding unsigned comment added by (talk) 16:48, 8 May 2016 (UTC)

Just wanted to say the same thing. Eg. " In the example, so the third triple (0, 0, "c") should be incorporated, what the compression ratio, however, deteriorated significantly. " But I'm not very sure about the topic yet to correct it. Need help. Hoemaco (talk) 20:57, 2 September 2016 (UTC)
It's not just you. I have a degree in computer science and had difficulty following the text. 2601:602:9802:99B2:3978:D923:D461:AA4D (talk) 00:45, 22 November 2016 (UTC)