# Talk:Huffman coding

## Pain in the ass

It's a real pain in the ass how every mathematics-related article on wikipedia assumes you have a masters degree, at least. I understood Huffman coding as explained in my first undergrad book on data structures and algorithms, and reading this completely wiped out any intuition that I gained previously. — Preceding unsigned comment added by 24.85.228.245 (talk) 05:52, 4 July 2012 (UTC)

I'm in the same boat. I know Huffman coding is simple, and I used to be able to do it, but this article was nowhere near the refresher I had hoped it would be. 192.203.137.242 (talk) 19:25, 7 December 2012 (UTC)
If you can spell out the problems with the material, someone more well-versed in the subject may be able to make improvements to the article.
Sebastian Garth (talk) 19:44, 8 December 2012 (UTC)

## basic example image is wrong?

hi! i think the example in the basic example, the one in the captation of the image is wrong. it shoud be

 a3 110 a4 111

and not:

 a3 111 a4 110

both the image is wrong and the table. i think one can't mix the way the 0/1 is set. PedroCarvalho (talk) 06:23, 28 November 2007 (UTC)

well, you can mix the way 0/1 is set, the output code will be good anyway, you just have to remember it when you decode it! anyway, you'd better not to change the way 0/1 is set, so I have fixed both image and caption. Alessio Damato 13:59, 4 December 2007 (UTC)
it still mixes the way 0/1 is set. In the first split it gives 0 to the least common (a1: 0.4), and 1 to the most common ({a2,a3,a4}: 0.6), but in the two other splits is gives 0 to the most common and 1 to the least common. -Thomas Nygreen (talk) 09:59, 9 May 2008 (UTC)
Huffman code generators can select 0/1 in any manner they choose (including randomly) and still generate a valid Huffman code. Nevertheless the diagram ought to be consistent with the pseudocode. Dcoetzee 16:30, 9 May 2008 (UTC)
Exactly like it says in the text below it, the hoffman coding should be done from left to right, taking the lower probability each time. This is NOT what the picture shows, so it is clearly wrong. Yomach (talk) 06:27, 26 June 2010 (UTC)
Isn't the frequency for the 'f' character also wrong? Should be 2 not 3? Moulsonp (talk) 10:54, 11 May 2012 (UTC)

## older

The algorithm in "Basic technique" is ambiguous. It's not clear how the two queues are used. It seems that the second queue is never read from, only added to.

[BREAK INTO DISCUSSION: 30 July 2007 - Simply does not compute - I have been trying to make sense of this 'pseudo code' for a couple of hours now. I am no slouch, but I just can't get the algorithm from this description. Somebody - please post something more intelligible, then remove this comment. - TropicalCoder (Google me) EXIT DISCUSSION]

~ I believe I may have better solved the example "Sample-1". I may be wrong, but I found the algorithm ill explained in my notes so I visited wikipedia to better understand it... I tried to solve it in the most logical way to me... building a complete binary tree with apropriate depth and using the percentage costs as guides to choose which boundaries to choose.

What I got was this: There is 1 character with weight 30; 1 with weight 20; 3 * 15; 1 * 10 and 3 * 5 for a total sum of 120. Dividing all by 120 and using the percentage to define how much of the tree to take away you get that 30/120 = 25%... and you go on like that... using this technic I got:

Character Weight Coding
h 30 00
e 20 010
b 15 011
d 15 100
g 15 101
a 10 1100
c 5 1101
f 5 1110
i 5 1111

Total cost is exactly the same, but the coding varies in size between 2 and 4 bits, and in #Basic_technique it says that "It is generally beneficial to minimize the variance of codeword length." I believe that makes my solution better, so if anyone with more than a mere day's knowledge on the matter agrees, then by all means make the correction :) --nunocordeiro 02:02, 11 July 2006 (UTC)

You got the right answer for the wrong reasons. You used Shannon-Fano coding, which in this case happens yield a result identical (in terms of each symbol's weight) to the bottom-merge Huffman code, the Huffman code with the lowest variance and lowest maximum length (among optimal codes). In general, these are not the same, and Shannon-Fano is suboptimal. The example should be replaced by one that either yields only one Huffman code (again, in terms of symbol weights, so {0,1} is the same code as {1,0}) or explain bottom-merge Huffman coding. I'll put this on my to do list, but if someone wants to fix this, go ahead. Calbaer 02:10, 27 July 2006 (UTC)

Removed "*Animation of the Huffman Algorithm: Algorithm Simulation by Simon Mescal" since the link is dead. If it comes back up move back in.

• Yeah, I was wondering too look, what I found [1], now I see. The cost is the cost to send each letter in the target alphabet. For example if we send letters as in Morse with "." & "_" we will spend more on "_". Gnomz007 15:07, 16 Nov 2004 (UTC)

Or to qualify a little more, the letter S in morse is three dots in rapid sucession. Three time periods. The letter 0 is three dashes. Each dash takes up at least two time periods. So, at least six time periods. Therefore, sending O is at least twice as expensive (in time!) as sending an S. You would think that a byte is always 8 characters long and so this has no meaning. However, most US english only uses about 7 characters. If you find that no 8-bit characters are being used in the file -- or if you find a way to mark them as such -- you can compress the file down to 7/8th its original size right there. Consider some of the unicode encodings, where most characters take 8 bits, but if you want an extended character, you can use extra bytes is you flag them. --67.177.235.1 18:55, 4 October 2005 (UTC)

Oh no! This is wrong, wrong, wrong (in the context of the article). For most applications, costs are probabilities, pure and simple. But they need not be, because they are just multiplicative factors (or, as the Basic technique section has it, weights) for the expression to be minimized. Adding to the confusion, the Huffman coding with unequal letter costs subsection uses cost in the way you two and the linked paper are discussing, as output costs! It seems as though an information theorist wrote most the article but a computer scientist wrote the Problem definition section. Anyway, I think most people who read the entire article would come away very confused. Looks like it's time for a rewrite (or at least a find/replace of the word 'cost').... Calbaer 02:22, 27 July 2006 (UTC)

## Back-story on the Discovery of Huffman Coding

Dr. David A. Huffman was the Chairperson of the Information and Computer Science Department at University of California at Santa Cruz (UCSC) in the early 1970's. He taught an undergraduate course, "The Algebraic Foundations of Cybernetics" in 1971-73. During one lecture he taught "Minimum Redundancy Coding". A student noticed, in some textbooks, it was called "Huffman Coding" -- and asked him if that was the same principle.

Dr. Huffman replied by relating an interesting story.

As a student at MIT, David Huffman had been working on an unsolved, challenging problem in Information Theory. Instead of studying for the final, he had decided to devote all his free time to work on this problem. The night before the final, with no solution yet, in a moment of despair, he threw his large stack of notes into the trash. He described the next moment as a flash of insight in which the solution occurred to him.

He went on to explain that he was still working on many difficult, interesting problems that, when solved, could be represented in textbooks for centuries. He did not want such a "simple idea" as minimum reduncancy coding to bear his name. Students wondered and discused among themselves: was this humility, or hubris?

## not one-to-one on finite messages

The Huffman coding is not one-to-one for finite messages. Suppose the message consists completely of the same character (e.g. a file full of 'aaaaaaaaaaaaaaaaaaaa...'); then that character would be assigned the empty bitstring; and the message would encode to the empty string. But such messages of different lengths would all encode to the empty string, hence the information about the length is lost. If you think about decoding such a message, you would also see the problem: you have an empty encoded message, but you also have a symbol whose code is the empty bitstring, so you don't know how many characters to write out.

This problem does not occur for infinite messages, of course; since there is only one infinite message consisting solely of the same character, for any given character.

Perhaps this should be mentioned in the article? --Spoon! (talk) 04:18, 12 January 2008 (UTC)

This is not entirely correct. The weights in the Huffman algorithm represent probabilities, not observed frequencies: so you can only assign the empty bitstring to a character if you know that such character will occur with certainty. Hence this is a degenerate case, where the only information being transmitted is the length of the string. In practice, if all you have is finite sample data, then you need to use something alike to Laplace's rule of succession to assign probabilities. —Preceding unsigned comment added by Classicalecon (talkcontribs) 11:50, 11 February 2008 (UTC)
Another thing to keep in mind is that the empty string is not part of a valid Huffman code. A valid Huffman code has no codeword of length less than one. If the most likely outcome is more than 40% likely, its codeword will only have one bit, but it cannot have less, even if it is 99.9999% likely. If it is 100% likely — or even just much more likely than 50% — then a practical compressor wouldn't use a Huffman code per symbol, but perhaps instead a run-length encoding on the entire sequence. If there are relatively few possible sequences, that run-length code could be a Huffman code. Calbaer (talk) 22:27, 11 February 2008 (UTC)

## Length-limited Huffman coding

The section on length-limited Huffman coding needs to be clarified, since finding the optimal length-limited code given the initial weight distribution is trivial. In what sense is this an unsolved problem? —Preceding unsigned comment added by Classicalecon (talkcontribs) 10:56, 11 February 2008 (UTC)

How is it trivial? If the Huffman code has a length longer than the length limit, we have to throw away this code and find the optimal length-limited code using a different method, e.g., the Package-Merge algorithm. It might be trivial to find a length-limited code in general — a fixed-length code certainly limits length — but to find an optimal length-limited code is another matter. No method is known to solve this problem in linear or logarithmic time. Calbaer (talk) 22:27, 11 February 2008 (UTC)
The article section currently claims that it isn't linearithmic, but I don't see how that's the case. The package-merge algorithm finishes in O(nL) time. But if L is greater than log n in base phi, code length is not a limiting factor. (See Witten, Moffat, and Bell, Managing Gigabytes, ISBN 9780442018634, pp. 345-350, for a proof.) Otherwise, L ≤ log n, and O(nL) ≤ O(n log n). So why isn't that linearithmic? --Damian Yerrick (talk | stalk) 21:21, 24 April 2012 (UTC)
Because Huffman coding is used where the weights/probabilities/frequencies are not all identical. If the probability of the jth item were 2-j, then the optimal (length-unlimited) code would have two codewords of length n-1. If you wanted to limit lengths to n/2, then the Package-Merge method would take O(n2). Calbaer (talk) 20:33, 6 June 2014 (UTC)

## Basic technique?

In the article it describe the basic technique like so:

```  1. Start with as many leaves as there are symbols.
2. Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is in the head of the queue).
3. While there is more than one node in the queues:
1. Dequeue the two nodes with the lowest weight.
2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
3. Enqueue the new node into the rear of the second queue.
4. The remaining node is the root node; the tree has now been generated.
```

However, it never says what the second queue is used for! From what I understand shouldn't this node be inserted back into the first queue and the queue sorted? I'm having real problems working out how to generate the Huffman tree (since it differs from a normal binary tree as the highest weight should only be a single bit). —Preceding unsigned comment added by 88.107.145.7 (talk) 19:30, 6 May 2008 (UTC)

This description is correct but confusing. The step "Dequeue the two nodes with the lowest weight" is implicitly examining the fronts of both queues. A conceptually simpler implementation uses priority queues, but is not linear time. I will attempt to clarify the exposition. Dcoetzee 19:35, 6 May 2008 (UTC)

## Alternate Pseudocode

I think a more programming-focused version of the pseudocode should be added to the article, as the current explanation is a bit difficult for programmers to implement and somewhat ambiguous. Archer1742 (talk) 10:29, 26 March 2009 (UTC)

## How does one decode?

I have a solid background in comp theory, but I can't see from this article, unless I might start writing the math from scratch, how one takes the encoding and reforms a string like the French phrase in the example. The article linked in prefix code [2], also woefully inexplicable. explains me the idea of using the tree plus the encoded string, and so now I understand. But why isn't there an algorithm, pseudocode, and/or illustration of decoding?

Also, some proof of optimality would be nice. SamuelRiv (talk) 00:01, 11 February 2011 (UTC)

Yes, I see what you mean. Well, I can put together at least a brief overview of the topic, I think. It'll probably need some additional expansion, of course... Sebastian Garth (talk) 17:24, 11 February 2011 (UTC)
I just read the section and it is very useful. The thing that was bothering me is that even though I know how the prefix tree works now, the article seems to assume not just familiarity, but working comfort with the prefix tree. As such, the beginning sections completely skip over a basic, diagrammed rundown of prefix tree compression, such that the formulation of the problem makes sense in context, the importance of representing characters with probabilities has actual meaning in terms of prefix code, and the later animation (which also should be reduced in size, btw) can be explainable quickly after the introduction.
Really what was so confusing about the first section is it is representing all characters by probabilities. Again, I have a lot of background, but my immediate question was how the probabilities are useful in any sense of encoding and decoding. Of course, they are (generally) not - they are useful strictly in optimizing the algorithm. Ideally, the prefix code article would be readable as well - by the way, this article introducing prefix coding is what I used to figure this out - but a summary could easily be done on Wikipedia so that one only needs a little comp sci to get through. SamuelRiv (talk) 19:04, 12 February 2011 (UTC)

I think that it would be good to have a section on basic practical implementation issues so that people can see how Huffman coding can be very fast to encode and decode, even for large alphabets---the encoder constructs a flat encoding table from the Huffman tree, mapping input symbols to their encoded lengths and actual codes. The decoder uses an inverse table to map codes back to the input symbols. Once the frequencies are sorted and codes are assigned, the actual encoding or decoding can be very, very fast. I think explaining that would make the abstract stuff more concrete for many readers, especially programmers who aren't fluent in formal theory.

Some useful detail:

For small alphabets with a reasonable longest code length, the decoder can just use a dense table (1-dimensional array) with an entry for every possible bit pattern whose length is the length of the longest code.

Suppose, for example, that your maximum code length is 9, but you have mostly shorter codes like 0x101. To decode the next code, you can just take the next 9 bits (not knowing whether you need them all) and index into a 2^9 = 512-element decoding table. For short codes like Ox101 there are duplicate entries for all possible 9-bit patterns that start with that short pattern (101000001, 101000010, ... 101111111). No matter which of those bit patterns you use to index into the array, you get back the same code length (3) and encoded bit pattern (0x101), in fast constant time.

For large alphabets or other situations where you may have some longer codes, you can get most of the same benefit by using a small table like that and a conditional branch---take the next 9 bits, index into the table as above, and check the code length. If it's 9 or less, proceed as above. If it's not, that's because you're actually looking at the first 9 bits of a longer-than-9-bit code, so you branch to a somewhat slower routine for the longer codes.

This is generally extremely fast, because short codes occur much more frequently than longer codes---that is why they're shorter. In the common case it's just a handful of instructions, hitting a small table that fits in small super-high-speed Level 0 cache memory. (And with only one conditional branch, which is usually not taken and won't cause a pipeline stall on modern processors with branch predictors that notice things like that.) The routine to handle long codes can be pretty fast too, with time bounded by the code length, so that the total time to decode a file is proportinal to the entropy (encoded file size).64.134.159.110 (talk) 17:45, 1 May 2016 (UTC)