Understanding Huffman Coding and Its Implementation in Go
by James Smith (Golang Project Structure Admin)
Huffman coding is a widely used algorithm for lossless data compression. By assigning variable-length codes to an input string’s characters, it ensures that the most frequently seen characters have shorter codes, while less frequently seen characters have longer codes. This technique minimizes the overall size of the data to be transmitted or stored.
In this blog post, we will delve into the principles of Huffman coding and progressively build a complete implementation in Go. We’ll start by introducing a basic overview of the algorithm, then move on to the actual code, explaining each part thoroughly. By the end, you should have a solid grasp of both the concept and its implementation.
Who Created the Algorithm of Huffman Coding?
Huffman coding was developed by David A. Huffman in 1952 as part of a graduate-studies class assignment, when he was studying information theory at the Massachusetts Institute of Technology (MIT).
After completing his studies, Huffman went on to work at various institutions, including the Stanford Research Institute and the California Institute of Technology.
In addition to his work on Huffman coding, David A. Huffman made significant contributions to the field of error detection and correction. He explored various methods for ensuring data integrity during transmission over noisy channels, which is essential for reliable telecommunications and plays a crucial role in preventing data loss and ensuring clear communication in various applications from voice calls to data streaming.
How Does Huffman Coding Work Exactly?
Huffman coding works by following a systematic approach.
The process begins with counting the frequency of each character, with each character being represented as a node in a tree structure.
After determining the frequencies, you use them to build a priority queue (or heap), which helps construct the Huffman tree. Each node in this tree contains a character and its frequency.
To build the tree, you repeatedly remove the two nodes with the lowest frequencies from the queue and create a new internal node that combines these two nodes as its children.
The frequency of this new internal node is the sum of the frequencies of the two child nodes, and this node is then reinserted into the priority queue.
This process continues until only one node remains in the queue, which becomes the root of the Huffman tree.
The final step involves traversing the Huffman tree to generate the binary codes for each character.
Moving left in the tree adds a 0 to the code, while moving right adds a 1.
The result is a unique binary code for each character, with more frequent characters getting shorter codes, thereby achieving data compression.
How Is Huffman Coding Used Today?
After its initial introduction, Huffman coding quickly gained traction and found various applications, particularly in the fields of data compression and transmission.
It was soon integrated into some of the most commonly used modern compression standards, such as JPEG for images and MP3 for audio, highlighting its significance in reducing file sizes without sacrificing quality.
The impact of Huffman coding continues to resonate today, as efficient data storage and transmission remain critical in an era of big data and high-speed internet.
Defining the Data Structures
In order to start writing some code, we first need to define the data structures that will represent our Huffman tree.
We will create an interface named HuffmanTree and two structs: HuffmanLeaf and HuffmanNode. The interface will define the method Freq, which returns the frequency of a node, while the structs will represent the leaf and internal nodes of the tree.
These data structures can be created in Go as below:
package main

type HuffmanTree interface {
	Freq() int
}

type HuffmanLeaf struct {
	freq  int
	value rune
}

type HuffmanNode struct {
	freq        int
	left, right HuffmanTree
}

func (leaf HuffmanLeaf) Freq() int {
	return leaf.freq
}

func (node HuffmanNode) Freq() int {
	return node.freq
}
You can see that the HuffmanTree interface defines a single method, Freq, which is implemented by both the HuffmanLeaf and HuffmanNode structs.
The HuffmanLeaf struct represents a leaf node, which has no children. This leaf node contains a character value and its frequency.
On the other hand, the HuffmanNode struct represents an internal node that has left and right children, as well as a frequency that is the sum of its children’s frequencies.
Implementing the Priority Queue
To manage the nodes based on their frequencies, we will use a min-heap.
This data structure allows us to efficiently retrieve the nodes with the lowest frequencies when constructing the Huffman tree.
We will implement this using a slice of HuffmanTree and the methods required by the heap.Interface defined in the Go standard library.
Below is the code used to implement the priority queue:
package main

type treeHeap []HuffmanTree

func (heap treeHeap) Len() int {
	return len(heap)
}

func (heap treeHeap) Less(i, j int) bool {
	return heap[i].Freq() < heap[j].Freq()
}

func (heap treeHeap) Swap(i, j int) {
	heap[i], heap[j] = heap[j], heap[i]
}

func (heap *treeHeap) Push(element interface{}) {
	*heap = append(*heap, element.(HuffmanTree))
}

func (heap *treeHeap) Pop() (poppedElement interface{}) {
	poppedElement = (*heap)[len(*heap)-1]
	*heap = (*heap)[:len(*heap)-1]
	return
}
In this implementation, the treeHeap type is a slice of HuffmanTree elements. It implements the heap.Interface.
The Len method returns the number of elements in the heap.
The Less method is responsible for ordering the elements based on their frequency, returning true if the frequency of the element at index i is less than that of the element at index j.
The Swap method exchanges two elements in the heap based on their indices.
The Push method appends a new element to the slice, while the Pop method removes and returns the slice’s last element. Note that the heap.Pop function in the standard library first swaps the smallest element into that last position before calling our method, so calling heap.Pop on a treeHeap always yields the node with the lowest frequency.
Building the Huffman Tree
Now we need to implement the function that will actually build the Huffman tree from the character frequencies.
The function will take a map of character frequencies as input and return the root of the Huffman tree.
Here’s the code:
package main

import (
	"container/heap"
)

func buildTree(symFreqs map[rune]int) HuffmanTree {
	var trees treeHeap
	for c, f := range symFreqs {
		trees = append(trees, HuffmanLeaf{f, c})
	}
	heap.Init(&trees)
	for trees.Len() > 1 {
		a := heap.Pop(&trees).(HuffmanTree)
		b := heap.Pop(&trees).(HuffmanTree)
		heap.Push(&trees, HuffmanNode{a.Freq() + b.Freq(), a, b})
	}
	return heap.Pop(&trees).(HuffmanTree)
}
We first initialize a treeHeap slice to hold the nodes of the Huffman tree.
We iterate over the symFreqs frequency map, creating a HuffmanLeaf for each character, and then add each leaf to the trees heap.
This heap is used to ensure that the nodes with the lowest frequencies are always processed first.
After initializing the heap with heap.Init
, we enter a loop where the two nodes with the smallest frequencies are repeatedly popped from the heap.
These nodes are combined into a new internal node, which is then pushed back into the heap.
This process continues until only one node remains, representing the root of the complete Huffman tree.
Generating Huffman Codes
Once we have built the Huffman tree, we need to traverse it to generate the codes for each character.
This can be accomplished through a recursive function that traverses the tree and prints the codes.
Below is the function that achieves this task:
package main

import (
	"fmt"
)

func printCodes(tree HuffmanTree, prefix []byte) {
	switch i := tree.(type) {
	case HuffmanLeaf:
		fmt.Printf("%c\t%d\t%s\n", i.value, i.freq, string(prefix))
	case HuffmanNode:
		// traverse left child
		prefix = append(prefix, '0')
		printCodes(i.left, prefix)
		prefix = prefix[:len(prefix)-1]

		// traverse right child
		prefix = append(prefix, '1')
		printCodes(i.right, prefix)
		prefix = prefix[:len(prefix)-1]
	}
}
In the body of this function, we perform a type switch on the tree parameter to determine whether it is a HuffmanLeaf or a HuffmanNode.
If it’s a HuffmanLeaf, we print the character, its frequency, and the current prefix, which represents the code.
If it’s a HuffmanNode, we first append '0' to the prefix and recursively call printCodes for the left child. After returning from the recursive call, we remove the last element of the prefix to backtrack.
We then append '1' to the prefix and recursively call printCodes for the right child, repeating the same backtracking process.
Bringing Everything Together With the Main Function
Finally, we can use all of the code that we’ve written so far in the main function, where we iterate over the input, count character frequencies, build the tree and, finally, print the Huffman codes.
So here’s the complete main function:
package main

import (
	"fmt"
)

func main() {
	const input = "this is example text for huffman encoding"

	symFreqs := make(map[rune]int)
	for _, c := range input {
		symFreqs[c]++
	}

	tree := buildTree(symFreqs)

	fmt.Println("SYMBOL\tWEIGHT\tHUFFMAN CODE")
	printCodes(tree, []byte{})
}
We create a map called symFreqs to store the frequency of each character, looping through each rune in the string and incrementing its count in the map.
After counting the frequencies, we call the buildTree function that we wrote earlier to construct the Huffman tree.
Finally, we call the printCodes function to display the symbols, their frequencies and the corresponding Huffman codes.
Checking the Output
If you run the main function that we’ve just created, you should get output that’s equivalent to the following:
SYMBOL WEIGHT HUFFMAN CODE
x 2 0000
s 2 0001
e 4 001
u 1 01000
l 1 01001
h 2 0101
d 1 01100
r 1 01101
m 2 0111
n 3 1000
i 3 1001
p 1 10100
g 1 101010
c 1 101011
f 3 1011
6 110
t 3 1110
a 2 11110
o 2 11111
Note that the symbol that has a weight of 6 is the space character.
The symbols will probably be printed in a different order when you run the code, since iteration over maps in Go does not follow a specific ordering.
You can easily see in the output how the symbols with lower weights tend to have longer Huffman codes, while the symbols with higher weights tend to have shorter Huffman codes. As we discussed earlier, this is precisely what makes Huffman coding so efficient.