Truncated binary encoding
Truncated binary encoding is an entropy encoding typically used for uniform probability distributions with a finite alphabet. It is parameterized by an alphabet with total size of number n. It is a slightly more general form of binary encoding when n is not a power of two.
When n is exactly 2x, truncated binary encoding simply assigns each symbol one of the n codewords of length x. When n is equal instead to 2x+b, truncated binary encoding assigns the first 2x-b symbols the first 2x-b codewords of length x and then assigns the remaining 2b symbols the last 2b codewords of length x+1. Because all the codewords of length x+1 consist of an unassigned codeword of length x with a "0" or "1" appended, the resulting code is a prefix code. For example, for the alphabet "0, 1, 2, 3, 4" n is equal to 5, which is equal to 22+1. Truncated binary encoding assigns the first three symbols (22-1) the codewords 00, 01, and 10, all of length 2, then assigns the last two symbols the codewords 110 and 111, the last two codewords of length 3, each of which is the unused codeword 11 with an extra digit appended.
For example, if n is 5, plain binary encoding and truncated binary encoding allocates these codewords: (RED digits/bits are not transmitted in truncated binary.)
Truncated binary |
Encoding | Standard binary |
---|---|---|
0 | 000 | 0 |
1 | 001 | 1 |
2 | 010 | 2 |
UNUSED | 011 | 3 |
UNUSED | 100 | 4 |
UNUSED | 101 | 5/UNUSED |
3 | 110 | 6/UNUSED |
4 | 111 | 7/UNUSED |
The nearest power of 2 after n=5 is 23=8, so there are u=8-5=3 unused codes in the binary representation on 3 bits.
In numerical terms, if there is an alphabet of size n which starts counting entries from the number 0, with a number of unused entries u to round the alphabet size up to the nearest power of two, and you are encoding the number x in truncated binary: If x is greater than or equal to u, add the value of u to x. Separately, and after that test and possible addition, drop the high bit in the standard binary encoding of the number when encoding x if the high bit is 0(zero).
Another example, encoding an alphabet of size 10 (between 0 and 9), there are 6 unused codes, so input values greater than or equal to 6 are offseted by 6 to the end of the binary space, and the first bit is disacarded from the truncated binary encoding output (here, the unused patterns are not shown in this table):
Input value |
Offseted value |
Standard Binary |
Truncated Binary |
---|---|---|---|
0 | 0 | 0000 | 000 |
1 | 1 | 0001 | 001 |
2 | 2 | 0010 | 010 |
3 | 3 | 0011 | 011 |
4 | 4 | 0100 | 100 |
5 | 5 | 0101 | 101 |
6 | 12 | 1100 | 1100 |
7 | 13 | 1101 | 1101 |
8 | 14 | 1110 | 1110 |
9 | 15 | 1111 | 1111 |
Here is a more extreme case providing less compression, with n=7 ; so the next power of 2 is 8 (we will then use 3 bits or 2 bits if the high bit is discarded) and u=1:
Input value |
Offseted value |
Standard Binary |
Truncated Binary |
---|---|---|---|
0 | 0 | 000 | 00 |
1 | 2 | 010 | 010 |
2 | 3 | 011 | 011 |
3 | 4 | 100 | 100 |
4 | 5 | 101 | 101 |
5 | 6 | 110 | 110 |
6 | 7 | 111 | 111 |
This last example demonstrates that the high bit cannot always be discarded when it is zero. In fact if u is lower than one quarter of the next power of two after n, then the leading high bit must not be discarded if the ununoded input value is greater than or equal to u (so this rule is applicable when n is an exact power of 2, where u is null and the truncated bit encoding is the same as standard binary on the same number of bits).
If we did not apply this second rule, then after truncating the leading null bit from the encoding of offseted values 2 and 3, then it would be impossible to make the distinction of the resulting output bit patterns 10 and 11 that also start the bit patterns 100, 101, 110 and 111 used for offseted value 4 to 7.