Understanding Huffman Coding and Its Implementation in Go
by James Smith (Golang Project Structure Admin)
Huffman coding is a widely used algorithm for lossless data compression. By assigning variable-length codes to an input string’s characters, it ensures that the most frequently seen characters have shorter codes, while less frequently seen characters have longer codes. This technique minimizes the overall size of the data to be transmitted or stored.
In this blog post, we will delve into the principles of Huffman coding and progressively build a complete implementation in Go. We’ll start by introducing a basic overview of the algorithm, then move on to the actual code, explaining each part thoroughly. By the end, you should have a solid grasp of both the concept and its implementation.
Who Created the Algorithm of Huffman Coding?
Huffman coding was developed by David A. Huffman in 1952 as part of a graduate-studies class assignment, when he was studying information theory at the Massachusetts Institute of Technology (MIT).
After completing his studies, Huffman went on to work at various institutions, including the Stanford Research Institute and the California Institute of Technology.
In addition to his work on Huffman coding, David A. Huffman made significant contributions to the field of error detection and correction. He explored various methods for ensuring data integrity during transmission over noisy channels, which is essential for reliable telecommunications and plays a crucial role in preventing data loss and ensuring clear communication in various applications from voice calls to data streaming.
How Does Huffman Coding Work Exactly?
Huffman coding works by following a systematic approach.
The process begins with counting the frequency of each character, with each character being represented as a node in a tree structure.
After determining the frequencies, you use them to build a priority queue (or heap), which helps construct the Huffman tree. Each node in this tree contains a character and its frequency.
To build the tree, you repeatedly remove the two nodes with the lowest frequencies from the queue and create a new internal node that combines these two nodes as its children.
The frequency of this new internal node is the sum of the frequencies of the two child nodes, and this node is then reinserted into the priority queue.
This process continues until only one node remains in the queue, which becomes the root of the Huffman tree.
The final step involves traversing the Huffman tree to generate the binary codes for each character.
Moving left in the tree adds a 0 to the code, while moving right adds a 1.
The result is a unique binary code for each character, with more frequent characters getting shorter codes, thereby achieving data compression.
How Is Huffman Coding Used Today?
After its initial introduction, Huffman coding quickly gained traction and found various applications, particularly in the fields of data compression and transmission.
It was soon integrated into some of the most commonly used modern compression standards, such as JPEG for images and MP3 for audio, highlighting its significance in reducing file sizes without sacrificing quality.
The impact of Huffman coding continues to resonate today, as efficient data storage and transmission remain critical in an era of big data and high-speed internet.
Defining the Data Structures
In order to start writing some code, we first need to define the data structures that will represent our Huffman tree.
We will create an interface named HuffmanTree and two structs: HuffmanLeaf and HuffmanNode. The interface will define the method Freq, which returns the frequency of a node, while the structs will represent the leaf and internal nodes of the tree.
These data structures can be created in Go as below:
package main

type HuffmanTree interface {
	Freq() int
}

type HuffmanLeaf struct {
	freq  int
	value rune
}

type HuffmanNode struct {
	freq        int
	left, right HuffmanTree
}

func (leaf HuffmanLeaf) Freq() int {
	return leaf.freq
}

func (node HuffmanNode) Freq() int {
	return node.freq
}
You can see that the HuffmanTree interface defines a single method, Freq, which is implemented by both the HuffmanLeaf and HuffmanNode structs.
The HuffmanLeaf struct represents a leaf node, which has no children. This leaf node contains a character value and its frequency.
On the other hand, the HuffmanNode struct represents an internal node that has left and right children, as well as a frequency that is the sum of its children’s frequencies.
Implementing the Priority Queue
To manage the nodes based on their frequencies, we will use a min-heap.
This data structure allows us to efficiently retrieve the nodes with the lowest frequencies when constructing the Huffman tree.
We will implement this using a slice of HuffmanTree and the methods required by the heap.Interface defined in the Go standard library.
Below is the code used to implement the priority queue:
package main

type treeHeap []HuffmanTree

func (heap treeHeap) Len() int {
	return len(heap)
}

func (heap treeHeap) Less(i, j int) bool {
	return heap[i].Freq() < heap[j].Freq()
}

func (heap treeHeap) Swap(i, j int) {
	heap[i], heap[j] = heap[j], heap[i]
}

func (heap *treeHeap) Push(element interface{}) {
	*heap = append(*heap, element.(HuffmanTree))
}

func (heap *treeHeap) Pop() (poppedElement interface{}) {
	poppedElement = (*heap)[len(*heap)-1]
	*heap = (*heap)[:len(*heap)-1]
	return
}
In this implementation, the treeHeap type is a slice of HuffmanTree elements. It implements the heap.Interface.
The Len method returns the number of elements in the heap.
The Less method is responsible for ordering the elements based on their frequency, returning true if the frequency of the element at index i is less than that of the element at index j.
The Swap method exchanges two elements in the heap based on their indices.
The Push method appends a new element to the slice, while the Pop method removes and returns the slice’s last element. Note that the heap.Pop function in the standard library first swaps the smallest element into that last position before calling our method, so calling heap.Pop on a treeHeap always yields the node with the lowest frequency.
Building the Huffman Tree
Now we need to implement the function that will actually build the Huffman tree from the character frequencies.
The function will take a map of character frequencies as input and return the root of the Huffman tree.
Here’s the code:
package main

import (
	"container/heap"
)

func buildTree(symFreqs map[rune]int) HuffmanTree {
	var trees treeHeap
	for c, f := range symFreqs {
		trees = append(trees, HuffmanLeaf{f, c})
	}
	heap.Init(&trees)
	for trees.Len() > 1 {
		a := heap.Pop(&trees).(HuffmanTree)
		b := heap.Pop(&trees).(HuffmanTree)
		heap.Push(&trees, HuffmanNode{a.Freq() + b.Freq(), a, b})
	}
	return heap.Pop(&trees).(HuffmanTree)
}
We first initialize a treeHeap slice to hold the nodes of the Huffman tree.
We iterate over the symFreqs frequency map, creating a HuffmanLeaf for each character, and then add each leaf to the trees heap.
This heap is used to ensure that the nodes with the lowest frequencies are always processed first.
After initializing the heap with heap.Init
, we enter a loop where the two nodes with the smallest frequencies are repeatedly popped from the heap.
These nodes are combined into a new internal node, which is then pushed back into the heap.
This process continues until only one node remains, representing the root of the complete Huffman tree.
Generating Huffman Codes
Once we have built the Huffman tree, we need to traverse it to generate the codes for each character.
This can be accomplished through a recursive function that traverses the tree and prints the codes.
Below is the function that achieves this task:
package main

import (
	"fmt"
)

func printCodes(tree HuffmanTree, prefix []byte) {
	switch i := tree.(type) {
	case HuffmanLeaf:
		fmt.Printf("%c\t%d\t%s\n", i.value, i.freq, string(prefix))
	case HuffmanNode:
		// traverse left child
		prefix = append(prefix, '0')
		printCodes(i.left, prefix)
		prefix = prefix[:len(prefix)-1]

		// traverse right child
		prefix = append(prefix, '1')
		printCodes(i.right, prefix)
		prefix = prefix[:len(prefix)-1]
	}
}
In the body of this function, we perform a type switch on the tree parameter to determine whether it is a HuffmanLeaf or a HuffmanNode.
If it’s a HuffmanLeaf, we print the character, its frequency, and the current prefix, which represents the code.
If it’s a HuffmanNode, we first append '0' to the prefix and recursively call printCodes for the left child. After returning from the recursive call, we remove the last element of the prefix to backtrack.
We then append '1' to the prefix and recursively call printCodes for the right child, repeating the same backtracking process.
Bringing Everything Together With the Main Function
Finally, we can use all of the code that we’ve written so far in the main function, where we iterate over the input, count character frequencies, build the tree and, finally, print the Huffman codes.
So here’s the complete main function:
package main

import (
	"fmt"
)

func main() {
	const input = "this is example text for huffman encoding"

	symFreqs := make(map[rune]int)
	for _, c := range input {
		symFreqs[c]++
	}

	tree := buildTree(symFreqs)

	fmt.Println("SYMBOL\tWEIGHT\tHUFFMAN CODE")
	printCodes(tree, []byte{})
}
We create a map called symFreqs to store the frequency of each character, looping through each rune in the string and incrementing its count in the map.
After counting the frequencies, we call the buildTree function that we wrote earlier to construct the Huffman tree.
Finally, we call the printCodes function to display the symbols, their frequencies and the corresponding Huffman codes.
Checking the Output
If you run the main function that we’ve just created, you should get output that’s equivalent to the following:
SYMBOL WEIGHT HUFFMAN CODE
x 2 0000
s 2 0001
e 4 001
u 1 01000
l 1 01001
h 2 0101
d 1 01100
r 1 01101
m 2 0111
n 3 1000
i 3 1001
p 1 10100
g 1 101010
c 1 101011
f 3 1011
6 110
t 3 1110
a 2 11110
o 2 11111
Note that the symbol that has a weight of 6 is the space character.
The symbols will probably be printed in a different order when you run the code, since iteration over maps in Go does not follow a specific ordering.
You can easily see in the output how the symbols with lower weights tend to have longer Huffman codes, while the symbols with higher weights tend to have shorter Huffman codes. As we discussed earlier, this is precisely what makes Huffman coding so efficient.