Talk:Huffman coding

From Wikipedia, the free encyclopedia
Jump to: navigation, search
          This article is of interest to the following WikiProjects:
WikiProject Computing (Rated C-class, Low-importance)
WikiProject icon This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 Low  This article has been rated as Low-importance on the project's importance scale.
 
WikiProject Computer science (Rated C-class, High-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.
 
WikiProject Mathematics (Rated C-class, Low-importance)
WikiProject Mathematics
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating:
C Class
Low Importance
 Field: Discrete mathematics
One of the 500 most frequently viewed mathematics articles.

Pain in the ass[edit]

It's a real pain in the ass how every mathematics-related article on wikipedia assumes you have a masters degree, at least. I understood Huffman coding as explained in my first undergrad book on data structures and algorithms, and reading this completely wiped out any intuition that I gained previously. — Preceding unsigned comment added by 24.85.228.245 (talk) 05:52, 4 July 2012 (UTC)

I'm in the same boat. I know Huffman coding is simple, and I used to be able to do it, but this article was nowhere near the refresher I had hoped it would be. 192.203.137.242 (talk) 19:25, 7 December 2012 (UTC)
If you can spell out the problems with the material, someone more well-versed in the subject may be able to make improvements to the article.
Sebastian Garth (talk) 19:44, 8 December 2012 (UTC)

basic example image is wrong?[edit]

hi! i think the example in the basic example, the one in the captation of the image is wrong. it shoud be

a3 110
a4 111

and not:

a3 111
a4 110

both the image is wrong and the table. i think one can't mix the way the 0/1 is set. PedroCarvalho (talk) 06:23, 28 November 2007 (UTC)

well, you can mix the way 0/1 is set, the output code will be good anyway, you just have to remember it when you decode it! anyway, you'd better not to change the way 0/1 is set, so I have fixed both image and caption. Alessio Damato 13:59, 4 December 2007 (UTC)
it still mixes the way 0/1 is set. In the first split it gives 0 to the least common (a1: 0.4), and 1 to the most common ({a2,a3,a4}: 0.6), but in the two other splits is gives 0 to the most common and 1 to the least common. -Thomas Nygreen (talk) 09:59, 9 May 2008 (UTC)
Huffman code generators can select 0/1 in any manner they choose (including randomly) and still generate a valid Huffman code. Nevertheless the diagram ought to be consistent with the pseudocode. Dcoetzee 16:30, 9 May 2008 (UTC)
Exactly like it says in the text below it, the hoffman coding should be done from left to right, taking the lower probability each time. This is NOT what the picture shows, so it is clearly wrong. Yomach (talk) 06:27, 26 June 2010 (UTC)
Isn't the frequency for the 'f' character also wrong? Should be 2 not 3? Moulsonp (talk) 10:54, 11 May 2012 (UTC)

older[edit]

The algorithm in "Basic technique" is ambiguous. It's not clear how the two queues are used. It seems that the second queue is never read from, only added to.

[BREAK INTO DISCUSSION: 30 July 2007 - Simply does not compute - I have been trying to make sense of this 'pseudo code' for a couple of hours now. I am no slouch, but I just can't get the algorithm from this description. Somebody - please post something more intelligible, then remove this comment. - TropicalCoder (Google me) EXIT DISCUSSION]

~ I believe I may have better solved the example "Sample-1". I may be wrong, but I found the algorithm ill explained in my notes so I visited wikipedia to better understand it... I tried to solve it in the most logical way to me... building a complete binary tree with apropriate depth and using the percentage costs as guides to choose which boundaries to choose.

What I got was this: There is 1 character with weight 30; 1 with weight 20; 3 * 15; 1 * 10 and 3 * 5 for a total sum of 120. Dividing all by 120 and using the percentage to define how much of the tree to take away you get that 30/120 = 25%... and you go on like that... using this technic I got:

Character Weight Coding
h 30 00
e 20 010
b 15 011
d 15 100
g 15 101
a 10 1100
c 5 1101
f 5 1110
i 5 1111

Total cost is exactly the same, but the coding varies in size between 2 and 4 bits, and in #Basic_technique it says that "It is generally beneficial to minimize the variance of codeword length." I believe that makes my solution better, so if anyone with more than a mere day's knowledge on the matter agrees, then by all means make the correction :) --nunocordeiro 02:02, 11 July 2006 (UTC)

You got the right answer for the wrong reasons. You used Shannon-Fano coding, which in this case happens yield a result identical (in terms of each symbol's weight) to the bottom-merge Huffman code, the Huffman code with the lowest variance and lowest maximum length (among optimal codes). In general, these are not the same, and Shannon-Fano is suboptimal. The example should be replaced by one that either yields only one Huffman code (again, in terms of symbol weights, so {0,1} is the same code as {1,0}) or explain bottom-merge Huffman coding. I'll put this on my to do list, but if someone wants to fix this, go ahead. Calbaer 02:10, 27 July 2006 (UTC)

Removed "*Animation of the Huffman Algorithm: Algorithm Simulation by Simon Mescal" since the link is dead. If it comes back up move back in.


What is meant by "cost" in this article?

  • Yeah, I was wondering too look, what I found [1], now I see. The cost is the cost to send each letter in the target alphabet. For example if we send letters as in Morse with "." & "_" we will spend more on "_". Gnomz007 15:07, 16 Nov 2004 (UTC)

Or to qualify a little more, the letter S in morse is three dots in rapid sucession. Three time periods. The letter 0 is three dashes. Each dash takes up at least two time periods. So, at least six time periods. Therefore, sending O is at least twice as expensive (in time!) as sending an S. You would think that a byte is always 8 characters long and so this has no meaning. However, most US english only uses about 7 characters. If you find that no 8-bit characters are being used in the file -- or if you find a way to mark them as such -- you can compress the file down to 7/8th its original size right there. Consider some of the unicode encodings, where most characters take 8 bits, but if you want an extended character, you can use extra bytes is you flag them. --67.177.235.1 18:55, 4 October 2005 (UTC)

Oh no! This is wrong, wrong, wrong (in the context of the article). For most applications, costs are probabilities, pure and simple. But they need not be, because they are just multiplicative factors (or, as the Basic technique section has it, weights) for the expression to be minimized. Adding to the confusion, the Huffman coding with unequal letter costs subsection uses cost in the way you two and the linked paper are discussing, as output costs! It seems as though an information theorist wrote most the article but a computer scientist wrote the Problem definition section. Anyway, I think most people who read the entire article would come away very confused. Looks like it's time for a rewrite (or at least a find/replace of the word 'cost').... Calbaer 02:22, 27 July 2006 (UTC)

Back-story on the Discovery of Huffman Coding[edit]

Dr. David A. Huffman was the Chairperson of the Information and Computer Science Department at University of California at Santa Cruz (UCSC) in the early 1970's. He taught an undergraduate course, "The Algebraic Foundations of Cybernetics" in 1971-73. During one lecture he taught "Minimum Redundancy Coding". A student noticed, in some textbooks, it was called "Huffman Coding" -- and asked him if that was the same principle.

Dr. Huffman replied by relating an interesting story.

As a student at MIT, David Huffman had been working on an unsolved, challenging problem in Information Theory. Instead of studying for the final, he had decided to devote all his free time to work on this problem. The night before the final, with no solution yet, in a moment of despair, he threw his large stack of notes into the trash. He described the next moment as a flash of insight in which the solution occurred to him.

He went on to explain that he was still working on many difficult, interesting problems that, when solved, could be represented in textbooks for centuries. He did not want such a "simple idea" as minimum reduncancy coding to bear his name. Students wondered and discused among themselves: was this humility, or hubris?

not one-to-one on finite messages[edit]

The Huffman coding is not one-to-one for finite messages. Suppose the message consists completely of the same character (e.g. a file full of 'aaaaaaaaaaaaaaaaaaaa...'); then that character would be assigned the empty bitstring; and the message would encode to the empty string. But such messages of different lengths would all encode to the empty string, hence the information about the length is lost. If you think about decoding such a message, you would also see the problem: you have an empty encoded message, but you also have a symbol whose code is the empty bitstring, so you don't know how many characters to write out.

This problem does not occur for infinite messages, of course; since there is only one infinite message consisting solely of the same character, for any given character.

Perhaps this should be mentioned in the article? --Spoon! (talk) 04:18, 12 January 2008 (UTC)

This is not entirely correct. The weights in the Huffman algorithm represent probabilities, not observed frequencies: so you can only assign the empty bitstring to a character if you know that such character will occur with certainty. Hence this is a degenerate case, where the only information being transmitted is the length of the string. In practice, if all you have is finite sample data, then you need to use something alike to Laplace's rule of succession to assign probabilities. —Preceding unsigned comment added by Classicalecon (talkcontribs) 11:50, 11 February 2008 (UTC)
Another thing to keep in mind is that the empty string is not part of a valid Huffman code. A valid Huffman code has no codeword of length less than one. If the most likely outcome is more than 40% likely, its codeword will only have one bit, but it cannot have less, even if it is 99.9999% likely. If it is 100% likely — or even just much more likely than 50% — then a practical compressor wouldn't use a Huffman code per symbol, but perhaps instead a run-length encoding on the entire sequence. If there are relatively few possible sequences, that run-length code could be a Huffman code. Calbaer (talk) 22:27, 11 February 2008 (UTC)

Length-limited Huffman coding[edit]

The section on length-limited Huffman coding needs to be clarified, since finding the optimal length-limited code given the initial weight distribution is trivial. In what sense is this an unsolved problem? —Preceding unsigned comment added by Classicalecon (talkcontribs) 10:56, 11 February 2008 (UTC)

How is it trivial? If the Huffman code has a length longer than the length limit, we have to throw away this code and find the optimal length-limited code using a different method, e.g., the Package-Merge algorithm. It might be trivial to find a length-limited code in general — a fixed-length code certainly limits length — but to find an optimal length-limited code is another matter. No method is known to solve this problem in linear or logarithmic time. Calbaer (talk) 22:27, 11 February 2008 (UTC)
The article section currently claims that it isn't linearithmic, but I don't see how that's the case. The package-merge algorithm finishes in O(nL) time. But if L is greater than log n in base phi, code length is not a limiting factor. (See Witten, Moffat, and Bell, Managing Gigabytes, ISBN 9780442018634, pp. 345-350, for a proof.) Otherwise, L ≤ log n, and O(nL) ≤ O(n log n). So why isn't that linearithmic? --Damian Yerrick (talk | stalk) 21:21, 24 April 2012 (UTC)
Because Huffman coding is used where the weights/probabilities/frequencies are not all identical. If the probability of the jth item were 2-j, then the optimal (length-unlimited) code would have two codewords of length n-1. If you wanted to limit lengths to n/2, then the Package-Merge method would take O(n2). Calbaer (talk) 20:33, 6 June 2014 (UTC)

Basic technique?[edit]

In the article it describe the basic technique like so:

  1. Start with as many leaves as there are symbols.
  2. Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is in the head of the queue).
  3. While there is more than one node in the queues:
        1. Dequeue the two nodes with the lowest weight.
        2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
        3. Enqueue the new node into the rear of the second queue.
  4. The remaining node is the root node; the tree has now been generated.

However, it never says what the second queue is used for! From what I understand shouldn't this node be inserted back into the first queue and the queue sorted? I'm having real problems working out how to generate the Huffman tree (since it differs from a normal binary tree as the highest weight should only be a single bit). —Preceding unsigned comment added by 88.107.145.7 (talk) 19:30, 6 May 2008 (UTC)

This description is correct but confusing. The step "Dequeue the two nodes with the lowest weight" is implicitly examining the fronts of both queues. A conceptually simpler implementation uses priority queues, but is not linear time. I will attempt to clarify the exposition. Dcoetzee 19:35, 6 May 2008 (UTC)

Alternate Pseudocode[edit]

I think a more programming-focused version of the pseudocode should be added to the article, as the current explanation is a bit difficult for programmers to implement and somewhat ambiguous. Archer1742 (talk) 10:29, 26 March 2009 (UTC)

Removal of links[edit]

Please explain your reason for mass removal of links from this article. Please do not simply refer to WP:LINKFARM. Explain why each link you removed was bad.Biophys (talk) 14:56, 9 April 2009 (UTC)

How does one decode?[edit]

I have a solid background in comp theory, but I can't see from this article, unless I might start writing the math from scratch, how one takes the encoding and reforms a string like the French phrase in the example. The article linked in prefix code [2], also woefully inexplicable. explains me the idea of using the tree plus the encoded string, and so now I understand. But why isn't there an algorithm, pseudocode, and/or illustration of decoding?

Also, some proof of optimality would be nice. SamuelRiv (talk) 00:01, 11 February 2011 (UTC)

Yes, I see what you mean. Well, I can put together at least a brief overview of the topic, I think. It'll probably need some additional expansion, of course... Sebastian Garth (talk) 17:24, 11 February 2011 (UTC)
I just read the section and it is very useful. The thing that was bothering me is that even though I know how the prefix tree works now, the article seems to assume not just familiarity, but working comfort with the prefix tree. As such, the beginning sections completely skip over a basic, diagrammed rundown of prefix tree compression, such that the formulation of the problem makes sense in context, the importance of representing characters with probabilities has actual meaning in terms of prefix code, and the later animation (which also should be reduced in size, btw) can be explainable quickly after the introduction.
Really what was so confusing about the first section is it is representing all characters by probabilities. Again, I have a lot of background, but my immediate question was how the probabilities are useful in any sense of encoding and decoding. Of course, they are (generally) not - they are useful strictly in optimizing the algorithm. Ideally, the prefix code article would be readable as well - by the way, this article introducing prefix coding is what I used to figure this out - but a summary could easily be done on Wikipedia so that one only needs a little comp sci to get through. SamuelRiv (talk) 19:04, 12 February 2011 (UTC)

Ambiguity alert[edit]

Hey there, this is not 100% clear is it? Huffman avoided the major flaw of the suboptimal Shannon-Fano coding by building the tree from the bottom up instead of from the top down. So was the flaw created "by building the tree from the bottom up", or was the "bottom up" way the good method? Heh, if you have no idea about H. C. at first, you might be more confused than you were at the beginning. -andy 217.50.52.128 (talk) 15:54, 27 October 2012 (UTC)

Second paragraph is poorly worded[edit]

Nuf' said. — Preceding unsigned comment added by 99.245.3.15 (talk) 10:49, 25 August 2013 (UTC)

External links[edit]

My removal of the external links, none of which appear to meet our guideline on external links was reverted. Can I ask how the remaining links meet that guideline? Thargor Orlando (talk) 17:05, 21 January 2014 (UTC)

Asymmetric numeral systems (ANS) - combining speed of Huffman with accuracy of arithmetic?[edit]

There is lots of going on about this alternative to arithmetic coding, which have similar complexity to Huffman coding, but handles fractional bits of information - have accuracy like arithmetic coding.

Some implementations: https://github.com/Cyan4973/FiniteStateEntropy https://github.com/rygorous/ryg_rans Blogs: http://fastcompression.blogspot.fr/ http://fgiesen.wordpress.com/2014/02/02/rans-notes/ http://cbloomrants.blogspot.com/

Maybe it should be mentioned somewhere in the article? — Preceding unsigned comment added by 128.211.178.20 (talk) 21:58, 4 February 2014 (UTC)

I have added some very brief subsection, but ANS rather requires a separate article. — Preceding unsigned comment added by 98.222.202.103 (talk) 15:47, 28 February 2014 (UTC)

It had one, but that was deleted due to original research: Wikipedia:Articles_for_deletion/Asymmetric_binary_system. The subsection here seems to have been a way to get that content back in. No idea what's happened since then, but back then one compression expert mentioned that a fatal flaw of the system was the need to decode in reverse order [3]. That means no context-adaptivity (which the inventor of the code acknowledged). Since the primary advantage of arithmetic coding is, in practice, that adaptivity, I don't see the practical application of ANS, let alone its viability as a distinct subsection in the article of a completely different coding technique (without reliable sources, to boot). Calbaer (talk) 20:45, 6 June 2014 (UTC)
Not to say that ANS is useless, just that the evidence of notability has to be more than its inventor extolling its virtues and some folks implementing it in personal libraries. Calbaer (talk) 20:50, 6 June 2014 (UTC)
I removed it. The first comment here is from a Purdue domain, and Purdue is where the ANS inventor was located when this was posted. He has been using Wikipedia to try to promote this for years in violation of WP:SOAP, WP:FORUM, WP:NOR, WP:RS, and WP:notability. Calbaer (talk) 00:02, 7 June 2014 (UTC)
Maybe look at some current benchmarks (you can test yourself) e.g. http://encode.ru/threads/1920-In-memory-open-source-benchmark-of-entropy-coders http://encode.ru/threads/1890-Benchmarking-Entropy-Coders , http://heartofcomp.altervista.org/MOC/MOCATD.htm (zhuff is LZ4 + FSE). — Preceding unsigned comment added by 89.76.32.36 (talk) 05:55, 7 June 2014 (UTC)
Maybe look at the Wikipedia policies and guidelines and follow them. It's been 6.5 years since Wikipedia:Articles_for_deletion/Asymmetric_binary_system, where several people tried to explain to you why this is not an appropriate place for this content. If you really feel it is, read the policies and guidelines, then say how it doesn't violate them. You clearly have either not bothered to read them, or have ignored them, repeatedly polluting several articles with your blatant and annoying self-promotion. This article, for example, had a whole section about your recent-but-arcane algorithm, while other well-established alternative algorithms such as LZW and arithmetic coding, used by most people in industrial societies every day of their lives, merited only a brief mention, and that just so folks could know there were commonly used alternatives to Huffman coding with various pros and cons. Your algorithm isn't among those well-established commonly used ones.
After 6.5 years, the method hasn't been published in a peer-reviewed paper, adopted by a commercial or open-source product (I'm not counting libraries), or otherwise garnered notability; a few excited hobbyists on the web do not make something "notable," nor does any benchmark, no matter how good. Data compression is a very small community, and I know a couple of the people you've worked and been in contact with. Your position in the community is not that of a crank or a hack. But, on Wikipedia, that's precisely what you come off as. Unfortunately, your material being here reminds me of the Wikipedia Seigenthaler biography incident: Both take advantage of a period of time in which no one who saw knew to fix the misplaced content. Under WP:AGF, I'll have to take this as a sign of willful ignorance rather than dishonesty. However, I suggest you no longer write about your own work. If it is as notable as you think, someone else inevitably will. And if they're familiar with Wikipedia, they'll do a better job: Writing for a general audience rather than a technical one, including reliable sources, making the notability of the work obvious. But, even then, it won't be in this article, because your compression method is not a form of Huffman coding. Calbaer (talk) 15:29, 7 June 2014 (UTC)
Also, since you seem to reply to critiques but not read policies and guidelines, I should make it clear that, even if your algorithm were presented in a peer reviewed paper or used in a mobile app, that alone does not merit its inclusion here. I cite the lack of these for reasons why it isn't notable, but notability requires a higher bar than that alone, and not every notable invention merits inclusion on Wikipedia. I have several optimal coding algorithms presented in peer-reviewed papers. They are somewhat relevant to this topic, but mentioning them would hinder, not help, someone who comes here trying to understand Huffman coding. And they wouldn't merit their own pages on Wikipedia, useful though I think they might be. Though it isn't in a policy or guideline that I know of, one thing you should ask yourself before contributing is: Does this help Wikipedia and its readers, or does this just help me? I'm not here to promote myself and my work, and you shouldn't either. Calbaer (talk) 17:11, 7 June 2014 (UTC)
You really don't like this guy. Anyway, don't you think that Wikipedia should somehow mention that there are now methods, like a few month old FSE (or maybe some of your codings you've mentioned?), which are both faster and providing better compression than Huffman - that we could save time/energy/disk space? — Preceding unsigned comment added by 178.180.14.129 (talk) 22:29, 7 June 2014 (UTC)
Hi, I was wondering who would call Matt Mahoney (PAQ) and Bulat Ziganshin (FreeArc) as "a few excited hobbyists", or wouldn't like that, while comparing with FSE, prefix codes look really miserable now. — Preceding unsigned comment added by 89.76.32.36 (talk) 23:22, 7 June 2014 (UTC)
I don't know who you think you're fooling. First you contributed self-promotion through a named Wikipedia account, then, when that article was deleted, you contributed to tangentially related articles as an anon through your Purdue IP. Then you followed up on the talk page via your U.S. home IP. After you moved back to Poland, you posted a few hours later from your Polish home IP. And you followed it up by saying I really don't like "that guy" from your Polish T-Mobile IP. It's all the same guy. You think you're posting anonymously and that people will think the other posts are not you, but the sockpuppeting is painfully obvious (and yet another violation of Wikipedia policy). Perhaps you'll try a VPN next, but by this point I think we can assume that any anonymous follow-up or follow-up from a newly created account is just you again.
Then you top it off by trying to dig up and post irrelevant information about me. I guess I knew going in that you were rude and deceptive, but to be so blatant about it does nonetheless surprise me. It especially surprises me because behavior on such a public forum holds the potential for seriously harming your professional reputation. Your work being unnotable is not a reflection of either a lack of intellect or any deficiency as a human being. But your behavior on Wikipedia is another matter.
Matt Mahoney told you that your code suffers from the fatal flaw of not being context-adaptive. Yet you cite him as a defender, evidence that your work is notable in the context of Wikipedia? Given what he and others have said, your code seems slower than prefix coding (which doesn't require multiplies) and lacks the main advantage of arithmetic coding (context adaptivity). It's fun for toy problems, but not as a replacement for current technologies. Matt is interested in this because he's interested in new methods. But just because another academic shows interest in your work does not make it notable. As you've been repeatedly told, even if your method were the best ever, that alone would not make it notable in this context. Wikipedia is not the forum for advancing research. You must know this, since you've been told this repeatedly for years. This is not an alternative to the peer review process. It is an encyclopedia.
The Huffman coding article needs a lot of work; it suffers from bloat and confuses novices, and stuff like this only adds to these problems. Please stop contributing to Wikipedia until you can be constructive rather than destructive, and can stop willfully violating the policies of the website. Calbaer (talk) 00:36, 8 June 2014 (UTC)