Skip to content

Huffman Coding lossless compression and decompression for large text files

Notifications You must be signed in to change notification settings

tamargoadam/huffman-coding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Huffman Coding

Encoding and decoding of large text files via Huffman coding lossless compression algorithm

Overview

How to run

  • From the src directory, run make to build the encoder and decoder programs
  • Command to run encoder: java encoder <input_file_name>
  • Command to run decoder: java decoder <encoded_file_name> <code_table_file_name>

Encoding

  • Reads an input file that is to be compressed and generates two output files
    • A compressed .bin file version of the input
    • A code table that maps characters in the origional file to their binary representation
  • Can use one of three types of priority queue to facilitate building code table
    • Binary heap
    • 4-way cache optimized heap
    • Pairing heap

Decoding

  • Builds a decode tree using the code table file as input
  • Generates a decoded text file from the input encoded.bin file using the decode tree

Program Structure

Encoding

  • huffEncodeTable(HuffmanTree huff)
    • Returns a hash table containing all characters encoded with the Huffman tree and their corresponding binary code Strings
  • characterCode(HuffBaseNode bNode, char c, String code)
    • Returns a String of 1’s and 0’s that corresponds to the character argument
    • Traverses tree recursively and adds 1 or 0 based on branch traversed
  • fileToHash(File inputFile)
    • Reads each line of the input file and increments the value corresponding to the read characters' keys in an instance -> frequency hash table.
    • If no key with the instance is present, a new entry with key = instance and value = 1 is created
  • buildTree(File inputFile)
    • Returns a Huffman tree corresponding to the input file
    • Creates a root node for each value and frequency read in and adds them to a priority queue
    • The two trees with the smallest frequency are removed from the heap and combined into a single Huffman tree whose new root has a weight of the sum of its two children
    • This repeats until only one tree remains
  • generateEncodedFile(File inputFile, HuffmanTree huffTree)
    • Reads each line of the input file and, utilizing the value to code hash table, translates the values corresponding code to binary that is written to the encoded.bin file
  • generateCodeTable(HuffmanTree huffTree)
    • Writes the key value pairs generated by huffEncodeTree() to the code_table.txt file

Binary Heap

  • Array implementation of binary heap
    • Child indexes for parent at index x are:
    • 2x
    • 2x+1
  • On removeMin(), check each child’s weight is less than the current node, if not, swap the positions of the parent node and the least child node

Four-way Heap

  • Array implementation of four-way heap
    • Root is held at index 3
    • Child indexes of parent at index x are:
    • 4(x-3)+4
    • 4(x-3)+5
    • 4(x-3)+6
    • 4(x-3)+7
  • Same checking and adjustments as binary heap necessary on removeMin() call, but must check all 4 children

Pairing Heap

  • Nodes hold child, left sibling, and right sibling pointers and a Huffman tree data item
  • Multi-pass scheme is used when melding after removeMin() call
    • All children of removed root node are added to a queue
    • First two heaps are removed from queue and melded, and result is added to queue
    • Repeats until one heap exists

Decoding

  • createCodeTable(File codes)
    • Creates a code -> value hash table based on the input file provided
  • decodeBinFile(File encoded)
    • Reads each byte of the binary input file and converts binary to string, checking for matches against the code -> value hash table each bit

Queue Performance Analysis

Average times to build code table with each variation of priorety queue for the same large textfile recorded over multiple runs and reported below.

  • Binary heap: avg. 7360 ms
    • With a delete min operation with complexity O(log2(n)), the binary heap is a decent structure to use in the creation of the Huffman tree, but its lack of optimizations elsewhere prevent it from being the best option.
  • Four-way heap: avg. 6603 ms
    • Although the time complexity of the four-way heap’s removeMin operation is worse than that of the binary heap (4 log4(n) vs. log2(n)), the cache optimization that this approach utilizes makes it the fastest approach for large inputs.
  • Pairing heap: avg. 7832 ms
    • While pairing heaps can be faster than binary or k-way heaps in many applications, here that is not the case. The pairing heap’s largest advantage comes from its constant time merge operation (compared to the alternative heaps linear time merge), however, the merge is not utilized outside of insert and removeMin operations and it performs worse due to it having a greater constant time overhead per operation.

About

Huffman Coding lossless compression and decompression for large text files

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published