Huffman Coding

Encoding and decoding of large text files via Huffman coding lossless compression algorithm

Overview

How to run

From the src directory, run make to build the encoder and decoder programs
Command to run encoder: java encoder <input_file_name>
Command to run decoder: java decoder <encoded_file_name> <code_table_file_name>

Encoding

Reads an input file that is to be compressed and generates two output files
- A compressed .bin file version of the input
- A code table that maps characters in the origional file to their binary representation
Can use one of three types of priority queue to facilitate building code table
- Binary heap
- 4-way cache optimized heap
- Pairing heap

Decoding

Builds a decode tree using the code table file as input
Generates a decoded text file from the input encoded.bin file using the decode tree

Program Structure

Encoding

huffEncodeTable(HuffmanTree huff)
- Returns a hash table containing all characters encoded with the Huffman tree and their corresponding binary code Strings
characterCode(HuffBaseNode bNode, char c, String code)
- Returns a String of 1’s and 0’s that corresponds to the character argument
- Traverses tree recursively and adds 1 or 0 based on branch traversed
fileToHash(File inputFile)
- Reads each line of the input file and increments the value corresponding to the read characters' keys in an instance -> frequency hash table.
- If no key with the instance is present, a new entry with key = instance and value = 1 is created
buildTree(File inputFile)
- Returns a Huffman tree corresponding to the input file
- Creates a root node for each value and frequency read in and adds them to a priority queue
- The two trees with the smallest frequency are removed from the heap and combined into a single Huffman tree whose new root has a weight of the sum of its two children
- This repeats until only one tree remains
generateEncodedFile(File inputFile, HuffmanTree huffTree)
- Reads each line of the input file and, utilizing the value to code hash table, translates the values corresponding code to binary that is written to the encoded.bin file
generateCodeTable(HuffmanTree huffTree)
- Writes the key value pairs generated by huffEncodeTree() to the code_table.txt file

Binary Heap

Array implementation of binary heap
- Child indexes for parent at index x are:
- 2x
- 2x+1
On removeMin(), check each child’s weight is less than the current node, if not, swap the positions of the parent node and the least child node

Four-way Heap

Array implementation of four-way heap
- Root is held at index 3
- Child indexes of parent at index x are:
- 4(x-3)+4
- 4(x-3)+5
- 4(x-3)+6
- 4(x-3)+7
Same checking and adjustments as binary heap necessary on removeMin() call, but must check all 4 children

Pairing Heap

Nodes hold child, left sibling, and right sibling pointers and a Huffman tree data item
Multi-pass scheme is used when melding after removeMin() call
- All children of removed root node are added to a queue
- First two heaps are removed from queue and melded, and result is added to queue
- Repeats until one heap exists

Decoding

createCodeTable(File codes)
- Creates a code -> value hash table based on the input file provided
decodeBinFile(File encoded)
- Reads each byte of the binary input file and converts binary to string, checking for matches against the code -> value hash table each bit

Queue Performance Analysis

Average times to build code table with each variation of priorety queue for the same large textfile recorded over multiple runs and reported below.

Binary heap: avg. 7360 ms
- With a delete min operation with complexity O(log2(n)), the binary heap is a decent structure to use in the creation of the Huffman tree, but its lack of optimizations elsewhere prevent it from being the best option.
Four-way heap: avg. 6603 ms
- Although the time complexity of the four-way heap’s removeMin operation is worse than that of the binary heap (4 log4(n) vs. log2(n)), the cache optimization that this approach utilizes makes it the fastest approach for large inputs.
Pairing heap: avg. 7832 ms
- While pairing heaps can be faster than binary or k-way heaps in many applications, here that is not the case. The pairing heap’s largest advantage comes from its constant time merge operation (compared to the alternative heaps linear time merge), however, the merge is not utilized outside of insert and removeMin operations and it performs worse due to it having a greater constant time overhead per operation.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
src		src
.gitignore		.gitignore
README.md		README.md
test_input.txt		test_input.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src

src

.gitignore

.gitignore

README.md

README.md

test_input.txt

test_input.txt

Repository files navigation

Huffman Coding

Overview

How to run

Encoding

Decoding

Program Structure

Encoding

Binary Heap

Four-way Heap

Pairing Heap

Decoding

Queue Performance Analysis

About

Releases

Packages

Languages

tamargoadam/huffman-coding

Folders and files

Latest commit

History

Repository files navigation

Huffman Coding

Overview

How to run

Encoding

Decoding

Program Structure

Encoding

Binary Heap

Four-way Heap

Pairing Heap

Decoding

Queue Performance Analysis

About

Topics

Resources

Stars

Watchers

Forks

Languages