Golang Project Structure

Tips and tricks for writing and structuring Go code

Hash Functions in Go

Language

  • unknown

by

This post will explain what hash functions are and how they’re used. We will then go on to look at some examples of hashing algorithms in Go code. Finally, we will look at importing external packages that export hash functions, including from the standard library.

What Is a Hash Function?

A hash function takes data (in Go, as a byte slice of arbitrary length) and returns a single value of fixed length. The value returned is known as a hash, because just as with the culinary dish, which is like a stew, the hash function mixes and modifies the original ingredients or input.

This shows hash, which is a culinary dish made by chopping and mixing onions, meat and potatoes. Hash functions work on their input to produce output that is also known as a hash.
Hash is a meal that is made by chopping and mixing potatoes, onions and meat. Hash functions perform operations on their input in order to create output that is directly influenced by the original input: this output is also called a hash.

The hash output is intended to represent the original input, while also being of a more manageable size, making it easier to work with and store.

For example, when I enter the string "this is just some example text that I'm going to hash" into a hash function (in this case, using the standard MD5 algorithm), I get the following hash as a result: "e9231cd2b6ab88cbb2254529d22e0110".

The hash that I just gave was displayed as a string of hexadecimal characters (each byte in the hash is represented by two hex characters, so the string is 32-characters long but there are 16 bytes in a MD5 hash). You could display it in other ways, however, such as using a decimal representation of each byte.

You can experiment with this hash function yourself using this online tool, so you can see how it converts different input values into hashes. You will see that the hashes are always of the same length, regardless of how small or large the original input is.

How Are Hash Functions Used in the Map Data Structure?

Maps (also known as hash maps or hash tables or dictionaries) use hash functions to convert keys into integer values: these integers can determine which part of memory (known as a slot or bucket) to search for the stored key–value pair.

Ideally, values will be distributed as evenly as possible between buckets, so that some buckets don’t become too full while others remain empty or underutilized.

A pet dog sitting inside a bright red bucket.
It’s amazing what you can find if you look in the right bucket!

We don’t need to worry about this in Go, since the compiler itself provides a native implementation of the map data structure, which works very well for almost all purposes. However, in some languages we would have to write this code ourselves.

If you click through to this GitHub repository, you can see one relatively low-level implementation of a map that I wrote in the C programming language a few years ago.

How Are Hash Functions Used to Store Passwords in Databases?

If you run your own web server and handle user information, it’s important that you never store plaintext passwords in your database (or anywhere else). If someone manages to hack into the server, they will have access to the passwords that your users set.

This may allow them to log in to your site, but, more importantly, it may also allow them to log in to many other sites across the web, since some users will inevitably use the same password across multiple websites (despite this being against good practice). So if you store passwords insecurely in a database, you’re not only compromising the security of your own server: you’re potentially compromising the security of the everyone who uses your site.

This is why hashed passwords are generally stored instead. The plaintext password is run through a hash function and only the output is saved, not the original password.

But when a user wants to log onto your website, how do you check that they’re using the correct password to log in, since you no longer have access the original plaintext password that they used when registering? Well, you would simply run the password that they give when logging in through the hash function and check that it matches the hashed password that is stored in the database. If it does, you can allow them to log in successfully.

Of course, this will only work if you use the exact same hash function when comparing the two hashed passwords, since two different hashing algorithms will generally produce very different output.

How Are Hash Functions Used to Ensure File Integrity?

A similar procedure can be used to check that two separate versions of a file are equivalent: if they both contain the same binary data, they should produce the same hash when fed through the same hashing algorithm.

So we can compare their hashes, rather than analyzing all of the data, in order to make sure that two or more files contain the same data.

When you want to download a file from the internet, you will sometimes see a checksum shown next to the link to download the file. This checksum is produced by running the file’s data through a specific type of hash function.

The webpage will tell you which algorithm has been used to produce the checksum, and it will be an algorithm that conforms to widely recognized international standards, so that you can use software to apply the checksum algorithm to the downloaded file.

If the generated checksum does not match the expected value, then you know that the data is corrupted — either unintentionally or maliciously — and, for security reasons, you shouldn’t open the file.

The Worst Hash Function in the World

Let’s look at some Go code now, in order to see how hash functions work in practice.

A version of the function in the example below appeared in the first edition of K&R’s classic textbook, The C Programming Language, published in 1978. It is perhaps the easiest hash function to understand that it’s possible to create, but that may be its only advantage.

With characteristic understatement, the authors of the C textbook wrote, “This is not the best possible algorithm, but it has the merit of extreme simplicity.” It would be fairer to call it the worst algorithm imaginable.

Nonetheless, simplicity is a very real merit, and its simplicity is precisely the reason that I’ve introduced this function first:

package main

func badBadHash(data []byte) uint64 {
	var hash uint64
	
	for _, b := range data {
		hash += uint64(b)
	}
	
	return hash
}

You can see that we simply take a byte slice as input and iterate through every byte. We append the value of the byte, converted to a uint64, to a total that we return as the hash value.

This works as a hash function, but I can’t stress enough how bad it is.

For example, if you hash the string "aardvark", the value returned will only be one less than if you’d hashed the string "bardvark", since the rune 'b' has a number value that is greater by one than 'a' and all of the other runes in both strings are the same. This makes it very easy to guess what sort of input was used, if you have access to the output hash.

Likewise, since we’re just adding up the value of all of the bytes, we will get the same hash value if we pass in the same bytes, regardless of what order they’re in. So "GOD" and "DOG" would both produce the same hash, because the strings’ runes are each one-byte long and all three of the bytes are contained in both strings.

When we’ve found two input values that produce the same hash, we say that we’ve found a collision, and it’s generally something that we want to avoid.

If you run the following code with the badBadHash function, you will see how the two integer hash values only differ by around 100 (in fact, by 115, which is the ASCII value of the rune 's'):

package main

import "fmt"

func main() {
	input := "lemon"

	fmt.Printf("Hash [\"%s\"]: %d\n", input, badBadHash([]byte(input)))

	input += "s"

	fmt.Printf("Hash [\"%s\"]: %d\n", input, badBadHash([]byte(input)))
}

This is the result I got when running the above code:

Hash ["lemon"]: 539
Hash ["lemons"]: 654

If we subtract 539 from 654, we get 115, as we had predicted earlier.

Implementing the DJB2 Hash Function

The hash function above wouldn’t actually be used in real-world code, but the DJB2 hash isn’t much more difficult to implement yet functions reasonably well. It was written by the mathematician and computer scientist Daniel J. Bernstein, who sometimes goes by the moniker djb, hence the name of this hash function.

package main

func djb2(data []byte) uint64 {
	hash := uint64(5381)

	for _, b := range data {
		hash += uint64(b) + hash + hash<<5
	}

	return hash
}

We still add the value of each byte to the hash on each iteration, but now there’s a little more going on. We also add the hash to itself, as well as adding a version of the hash that has been left-shifted by five.

The << operator is used to left-shift a number in Go. This moves the number by the corresponding number of places in the binary number system. It is equivalent to multiplying the number by two to the power of n, where n is the number of places that the number is left-shifted by.

Note that we also initialize the hash variable with a magic number, which means that if we pass an empty byte slice to the djb2 function, we won’t get the value zero.

Implementing the SDBM Hash Function

The SDBM hash function was created for a database engine written by Ken Thompson (the same computer scientist who wrote the C-programming textbook mentioned above).

As you can see below, its implementation is not much different than the code we saw for the DJB2 hash function:

package main

func sdbmHash(data []byte) uint64 {
	var hash uint64

	for _, b := range data {
		hash = uint64(b) + (hash << 6) + (hash << 16) - hash
	}

	return hash
}

We also add and subtract the previous value of the hash itself to make the current hash: with the additions of the previous value of the hash, we first left-shift the value either six or sixteen places.

This hash function has a pretty good distribution, which means that a wide variety of inputs will tend to produce output hashes that do not cluster around certain values. Instead, the hashes will be well spread out between the minimum and maximum possible value.

Implementing the Fowler–Noll–Vo (FNV) Hash Function

This hash function is named after three American computer scientists — Glenn Fowler, Landon Curt Noll and Kiem-Phong Vo — working at Bell Laboratories’ research center in Murray Hill, New Jersey. They created the hash function in 1991 when they were working on Unix code.

This hash function is the only one we’ve seen that has been used very widely in real-world applications. It has commonly been used in hash tables, because it’s reasonably effective and fast to compute.

The implementation below is based on the standard for FNV version 1a, which is a slight improvement on the original version:

package main

const (
	uint64Offset uint64 = 0xcbf29ce484222325
	uint64Prime  uint64 = 0x00000100000001b3
)

func fvnHash(data []byte) (hash uint64) {
	hash = uint64Offset

	for _, b := range data {
		hash ^= uint64(b)
		hash *= uint64Prime
	}

	return
}

The FNV hash uses three constants: uint64Offset is the initial value of the hash, which would simply be returned if you passed an empty byte slice as the data argument to the fnvHash function, and uint64Prime is the prime-number value that is multiplied by the hash on each iteration of the loop.

Using the FNV Hash Function in Your Own Go Code

Now that we’ve seen how to implement it, you may like to know that I have created a Go package that exports functions that create FNV hashes of various fixed lengths.

If you run the following example code, you will see hexadecimal representations of some of the different hashes, all created using the same line of text as input:

package main

import (
	"fmt"

	hash "github.com/theTardigrade/golang-hash"
)

func main() {
	const input = "this is a simple test"

	fmt.Printf(
		"%x\n\n%x\n\n%x\n\n%x\n\n%x\n",
		hash.Uint8String(input),
		hash.Int8String(input),
		hash.UintString(input),
		hash.Uint256String(input),
		hash.Int256String(input),
	)
}

The hash.Uint8String function takes a string as input and returns a uint8 value (there is also a hash.Uint8 function that takes a byte slice and returns a uint8).

The package contains functions that return integers from 8-bit all the way up to 1,024-bit. Up to 64-bit integers, the functions return native number types. However, Go doesn’t natively support a 128-bit integer — or even larger integers — so the functions that work with large numbers return a pointer to a type from the math.Big package, which can store arbitrarily large integers.

Finally, it should be noted that there is now a package in the standard library that exports FNV hash functions, however, it does not allow you to create hashes bigger than 128-bit, which is where my package excels.

Looking at Hash Functions in the Go Standard Library

So let’s have a look at how to use the hash functions that are contained in the standard library, since these will often be easiest to integrate into your own code.

The following example shows how to produce a hash using the SHA-256 algorithm:

package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	const input = "This is just a simple test."

	h := sha256.New()

	h.Write([]byte(input))

	fmt.Printf("%x\n", h.Sum(nil))
}

As you can see above, it is as simple as creating a new hash.Hash structure that implements the io.Writer interface. We then pass our input as a byte slice to the Write Method of this structure.

Then we call the Sum method to get the hash. This method gets its name because hashes are commonly used as checksums, as discussed earlier — but you can use the resulting hash for other purposes too.

The following table shows the hash functions that are currently available in the Go standard library:

Hash AlgorithmPackage
Adler-32"hash/adler32"
CRC-32"hash/crc32"
CRC-64"hash/crc64"
FNV"hash/fnv"
MD5"crypto/md5"
SHA-1"crypto/sha1"
SHA-224 and SHA-256"crypto/sha256"
SHA-384 and SHA-512"crypto/sha512"

What is a Cryptographic Hash Function?

SHA-256 is a cryptographic hashing algorithm, which means that it’s a specific type of hashing algorithm that is used for security purposes.

The hash that is produced by a cryptographic hash function is also known as a digest, fingerprint or signature.

When a hash function is used for cryptographic purposes, there are some extra requirements that it needs to meet.

For example, each hash value should be unique: there should be no known collisions. In theory, every function will produce collisions at some point, but they should be so extremely rare that we can rely on not encountering them. This ensures that different inputs will never produce the same output hash.

The speed of the hash function should be fast, so it doesn’t unnecessarily waste computer resources, but it should not be too fast. If we use a cryptographic hash function to store hashed passwords, they could be vulnerable to a brute-force attack if the hash function is very fast.

In other words, hackers can calculate the hash value for every imaginable combination of letters, symbols and numbers, checking which ones match the password hashes that they’ve found.

Rainbow tables can also be used by hackers: these are lists of precomputed hashes for insecure passwords that are commonly used by many different users across the internet. If one of our hashes is listed on that table, then the hackers can break the password straightaway.

Most importantly, a cryptographic hash function should be secure, which means that the slightest change to the function’s input should generate an extremely different hash value. This makes it much harder for hackers to guess what the input might have looked like by looking at its hash.

When we looked at the Go code for the worst possible hash function above, we saw that slight differences to the input caused only slight differences to the output, which would make it a very bad candidate for use as a cryptographic hash function.

Leave a Reply

Your email address will not be published. Required fields are marked *