SC205 PROJECT

Motivation

• Many digital data streams contain a lot of redundant information. For example, a file may contain more zeros than ones: 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1.

• Uncompressed data can take up a lot of space, which is not ideal when storing data in limited hard-drive space or transferring it over limited bandwidth.

• Example: One minute of uncompressed HD video can be over 1 GB. How can we fit a 2-3 hour HD film on a 1-2 GB disc?

• We want to compress the original data to remove superfluous[1] information.

• There are two points we need to take care of:

  1. A good algorithm to achieve maximum data compression.
  2. The maximum compression must be achieved without losing any data, so that the exact bit sequence can be recovered after decompression.

Approach

Huffman Coding

  • Huffman coding is a lossless data compression algorithm. The main idea is to assign variable-length codes to input characters.
  • Lengths of the assigned codes are based on the frequencies of the corresponding characters. The most frequent character gets the smallest code and the least frequent character gets the largest code.
  • The variable-length codes assigned to input characters are prefix codes, meaning the codes are assigned in such a way that the code assigned to one character is never the prefix of the code assigned to any other character.
  • This makes sure that there is no ambiguity while decompressing (a small sketch follows this list).
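
To illustrate the prefix property, here is a minimal Python sketch. The code table below is hypothetical (it is not taken from the project's figures); it only shows that variable-length prefix codes can be concatenated and still read back without ambiguity.

  # Hypothetical prefix-free code table: frequent characters get short
  # codewords, and no codeword is a prefix of another.
  codes = {"C": "0", "A": "10", "D": "110", "B": "111"}

  def encode(text, codes):
      # Concatenate the variable-length codewords.
      return "".join(codes[ch] for ch in text)

  print(encode("CAB", codes))  # "010111": 1 + 2 + 3 bits, decodable left to right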

Formulate the Mathematics

  1. Calculate the frequency of each character in the code.
  2. Sort the characters in descending order of frequency. These are stored in a priority queue (Q) using a binary heap.
  3. Create an empty node *. Assign the minimum frequency (the last element of Q) to the left child of * and the second minimum frequency (the second-last element of Q) to the right child of *. Set the value of * to the sum of these two minimum frequencies (* denotes an internal node).
  4. Remove these two minimum frequencies from Q and add their sum to the list of frequencies.
  5. Insert node * into Q.
  6. Repeat steps 2 to 5 until only one node remains, covering all the characters.
  7. For each non-leaf node, assign 0 to the left edge and 1 to the right edge (a Python sketch of these steps follows this list).
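
A minimal Python sketch of steps 1-7, using the standard-library heapq module as the priority queue Q. The function name build_codes and the nested-list tree representation are our own choices, and ties between equal frequencies may be broken differently than in the worked example below.

  import heapq
  from collections import Counter

  def build_codes(text):
      # Step 1: frequency of each character.
      freq = Counter(text)
      # Step 2: priority queue (binary heap) keyed on frequency.
      # Each entry is (frequency, tie_breaker, tree); a tree is either a
      # single character (leaf) or a [left, right] pair (internal node *).
      heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
      heapq.heapify(heap)
      next_id = len(heap)
      # Steps 3-6: repeatedly merge the two least frequent nodes.
      while len(heap) > 1:
          f1, _, left = heapq.heappop(heap)    # minimum frequency
          f2, _, right = heapq.heappop(heap)   # second minimum frequency
          heapq.heappush(heap, (f1 + f2, next_id, [left, right]))
          next_id += 1
      root = heap[0][2]
      # Step 7: assign 0 to left edges and 1 to right edges.
      codes = {}
      def walk(node, prefix=""):
          if isinstance(node, str):            # leaf: a single character
              codes[node] = prefix or "0"      # single-symbol input gets "0"
          else:
              walk(node[0], prefix + "0")
              walk(node[1], prefix + "1")
      walk(root)
      return root, codes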

Solve the Mathematics

Let's work through the original code: “ABACCADCDACDACC”

  1. Counting the frequency of each character in the code
  2. Sorting the characters according to their frequency
  3. Creating an empty node for the Huffman tree
  4. Repeating steps 2 to 5 of the algorithm (first merge)
  5. Repeating steps 2 to 5 of the algorithm (second merge)
  6. Assigning Huffman codes to each character (the counts and resulting code lengths are worked out below)
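
For this string the frequencies are C: 6, A: 5, D: 3, B: 1. The two smallest, B (1) and D (3), merge into an internal node of weight 4; that node then merges with A (5) into weight 9, which finally merges with C (6) into the root of weight 15. The codeword lengths are therefore 1 bit for C, 2 bits for A, and 3 bits each for D and B; the exact 0/1 pattern depends on the left/right choices made in the figures. Running the sketch from the previous section gives one possible assignment:

  text = "ABACCADCDACDACC"
  root, codes = build_codes(text)   # sketch from the previous section
  print(codes)                      # one possible table: {'C': '0', 'B': '100', 'D': '101', 'A': '11'}
  encoded = "".join(codes[ch] for ch in text)
  print(len(encoded))               # 28 bits, versus 15 * 2 = 30 fixed-length bits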

Solve the Mathematics: Decoding

To decode, we take the encoded bits and traverse the tree from the root to find each character.

Suppose 101 is to be decoded; we traverse from the root, one bit at a time, as sketched below.
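
A minimal sketch of this traversal, reusing the nested-list tree shape from the earlier sketch. The tree written here is one possible Huffman tree for the example (the tree in the original figures may mirror it), so 101 decoding to D is illustrative of the method rather than the article's exact result.

  def decode(bits, root):
      # Walk the tree from the root: 0 goes left, 1 goes right; emit the
      # character when a leaf is reached, then restart from the root.
      out, node = [], root
      for bit in bits:
          node = node[0] if bit == "0" else node[1]
          if isinstance(node, str):        # leaf reached
              out.append(node)
              node = root
      return "".join(out)

  root = ["C", [["B", "D"], "A"]]          # one possible tree for the example
  print(decode("101", root))               # right, left, right -> "D" with this tree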

Interpretation and significance of the solution for application

After performing Huffman coding on the original code “ABACCADCDACDACC”, we arrive at the encoded result shown below.

Comparing the general (fixed-length) encoded bits with the Huffman encoded bits:


General code   - 000100101000111011001011001010 (30 bits)

Huffman code - 1110011001110101011101011100 (28 bits)

The compression ratio is 28/30 ≈ 0.93, i.e. the Huffman code is about 7% smaller.

The saving appears small here. However, as the size of the input code grows, so does the benefit of Huffman coding.
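
As a quick check of these numbers (a small computation, assuming the 2-bits-per-character fixed-length baseline used above):

  fixed_bits   = 15 * 2                  # 15 characters at 2 bits each (4 symbols)
  huffman_bits = 6*1 + 5*2 + 3*3 + 1*3   # frequency times codeword length
  print(huffman_bits, fixed_bits)        # 28 30
  print(huffman_bits / fixed_bits)       # 0.933..., i.e. roughly a 7% saving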

Bibliography

Huffman coding

https://mathworld.wolfram.com/HuffmanCoding.html

https://www.geeksforgeeks.org/huffman-coding-greedy-algo-3/

https://ocw.mit.edu/courses/mathematics/18-310-principles-of-discrete-applied-mathematics-fall-2013/lecture-notes/MIT18_310F13_Ch19.pdf

Wikipedia

https://youtu.be/eJ1lbXKjDfA

Recorded presentation

https://youtu.be/2m60HT3OMOI

  1. ^ "Superfluous", Wikipedia, 2018-11-01, retrieved 2021-07-02