SC205 PROJECT
Motivation
• Many digital data streams contain a lot of redundant information. For example, a file may contain more zeros than ones: 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1.
• Uncompressed data can take up a lot of space, which is a problem when storing data on a drive of fixed capacity or transferring it over a link of fixed bandwidth.
• Example: One minute of uncompressed HD video can be over 1 GB. How can we fit a 2-3 hour HD film on a 1-2 GB disc?
• We want to compress the original data to remove superfluous[1] information.
• There are two points we need to take care of:
- A good algorithm that achieves maximum data compression.
- The compression must be lossless: the exact original bit sequence can be recovered after decompression.
Approach
Huffman Coding
- It is a lossless data compression algorithm. The main idea is to assign variable-length codes to input characters.
- The lengths of the assigned codes are based on the frequencies of the corresponding characters: the most frequent character gets the shortest code and the least frequent character gets the longest code.
- The variable-length codes assigned to input characters are prefix codes, meaning that no character's code is a prefix of the code assigned to any other character (a short check of this property is sketched after this list).
- This ensures that there is no ambiguity while decompressing.
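As a small illustration of the prefix property, the sketch below checks whether a code table is prefix-free (the example code table and the helper name is_prefix_free are assumptions made for illustration, not part of the original project):

# A prefix-free code: no codeword is the prefix of any other codeword.
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}

def is_prefix_free(codes):
    words = list(codes.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

print(is_prefix_free(codes))                  # True: decoding is unambiguous
print(is_prefix_free({"A": "0", "B": "01"}))  # False: "0" is a prefix of "01"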
Formulate the Mathematics
1. Calculate the frequency of each character in the code.
2. Sort the characters in descending order of frequency. These are stored in a priority queue (Q) using a binary heap.
3. Create an empty node *. Assign the minimum frequency (last element of Q) to the left child of * and the second minimum frequency (second-to-last element of Q) to the right child of *. Set the value of * to the sum of these two minimum frequencies (* denotes an internal node).
4. Remove these two minimum frequencies from Q and add their sum to the list of frequencies.
5. Insert node * into Q.
6. Repeat steps 2 to 5 until only one node (the root of the Huffman tree) remains in Q.
7. For each non-leaf node, assign 0 to the left edge and 1 to the right edge. (A code sketch of the whole procedure follows this list.)
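A minimal Python sketch of these steps, using the heapq module as the priority queue Q (the node layout, tie-breaking, and the function name huffman_codes are illustrative choices rather than part of the algorithm's specification):

import heapq
from collections import Counter

def huffman_codes(text):
    # Steps 1-2: count character frequencies and load them into a min-heap (the priority queue Q).
    freq = Counter(text)
    # Each heap entry: (frequency, tie-breaker, symbol-or-None, left child, right child).
    heap = [(f, i, ch, None, None) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)

    # Steps 3-6: repeatedly merge the two lowest-frequency nodes into an internal node "*"
    # whose value is the sum of the two frequencies, until only the root remains.
    while len(heap) > 1:
        lo = heapq.heappop(heap)   # minimum frequency -> left child
        hi = heapq.heappop(heap)   # second minimum    -> right child
        heapq.heappush(heap, (lo[0] + hi[0], tie, None, lo, hi))
        tie += 1

    # Step 7: walk the tree, assigning 0 to left edges and 1 to right edges.
    codes = {}
    def walk(node, prefix):
        _, _, ch, left, right = node
        if ch is not None:                 # leaf: record the accumulated code
            codes[ch] = prefix or "0"      # single-symbol edge case
        else:
            walk(left, prefix + "0")
            walk(right, prefix + "1")
    walk(heap[0], "")
    return codes

print(huffman_codes("ABACCADCDACDACC"))
# -> {'C': '0', 'B': '100', 'D': '101', 'A': '11'}
# These match the codes used in the worked example below; ties may break
# differently in other implementations, but the code lengths stay the same.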
Solve the Mathematics
Let us take the original code “ABACCADCDACDACC” and walk through the algorithm:
- Counting the frequency of each character in the code
- Sorting the characters according to their frequency
- Creating an empty node for the Huffman tree
- Repeating steps 2 to 5 of the algorithm (first repetition)
- Repeating steps 2 to 5 of the algorithm (second repetition)
- Assigning Huffman codes to each character (the resulting code table is sketched after this list)
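Working through these steps for “ABACCADCDACDACC” gives the frequencies C = 6, A = 5, D = 3, B = 1; one consistent assignment of codes, which reproduces the 28-bit result shown in the next section, is C = 0, A = 11, D = 101, B = 100. A small sketch that encodes the string with this table (the helper name encode is an illustrative assumption):

# Code table derived from the worked example (one valid Huffman assignment for this input).
codes = {"C": "0", "A": "11", "D": "101", "B": "100"}

def encode(text, codes):
    """Concatenate the codeword of each character."""
    return "".join(codes[ch] for ch in text)

encoded = encode("ABACCADCDACDACC", codes)
print(encoded)       # 1110011001110101011101011100
print(len(encoded))  # 28 bits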
Solve the Mathematics: Decoding
To decode, we take the encoded bits and traverse the Huffman tree from the root, following 0 along a left edge and 1 along a right edge until a leaf is reached; the leaf gives the decoded character.
Suppose 101 is to be decoded: we traverse from the root as in the figure below and arrive at the leaf for D.
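The same traversal can be sketched in code. Below, the worked example's tree is written as nested pairs, with 0 selecting the left element and 1 the right (the tree literal and the helper name decode are assumptions made for this illustration):

# Huffman tree for the worked example as nested (left, right) pairs;
# leaves are characters, internal nodes are 2-tuples.
tree = ("C", (("B", "D"), "A"))   # C=0, B=100, D=101, A=11

def decode(bits, tree):
    """Walk the tree: 0 goes left, 1 goes right; reaching a leaf yields a character."""
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):   # reached a leaf
            out.append(node)
            node = tree             # restart from the root for the next character
    return "".join(out)

print(decode("101", tree))                            # -> D
print(decode("1110011001110101011101011100", tree))   # -> ABACCADCDACDACC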
Interpretation and significance of the solution for application
After performing Huffman coding on the original code “ABACCADCDACDACC”, we arrive at the encoded result compared below.
Comparing the fixed-length (general) encoded bits with the Huffman encoded bits:
General code - 000100101000111011001011001010 (30 bits)
Huffman code - 1110011001110101011101011100 (28 bits)
The compression ratio is 28/30 ≈ 0.93, i.e. the Huffman code needs only about 93% of the bits of the fixed-length code, a saving of roughly 7%.
The saving appears small here. However, as the input grows and the character frequencies become more skewed, the savings from Huffman coding typically increase.
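As a quick check of this arithmetic, here is a small sketch that recomputes both bit counts from the character frequencies and the code table of the worked example (variable names are illustrative):

from collections import Counter

text = "ABACCADCDACDACC"
codes = {"C": "0", "A": "11", "D": "101", "B": "100"}   # from the worked example

freq = Counter(text)
fixed_bits   = 2 * len(text)                                  # 2 bits per character for 4 symbols
huffman_bits = sum(freq[ch] * len(codes[ch]) for ch in freq)  # weighted code lengths

print(fixed_bits, huffman_bits)             # 30 28
print(round(huffman_bits / fixed_bits, 2))  # 0.93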
Bibliography
Huffman coding
https://mathworld.wolfram.com/HuffmanCoding.html
https://www.geeksforgeeks.org/huffman-coding-greedy-algo-3/
Wikipedia
https://youtu.be/eJ1lbXKjDfA
Recorded presentation
https://youtu.be/2m60HT3OMOI
- ^ "Superfluous", Wikipedia, 2018-11-01, retrieved 2021-07-02