How to Read a File Line by Line
by James Smith (Golang Project Structure Admin)
Sometimes you want to read a file that contains many lines of text or a list of structured data and you want to process each line individually, rather than working with the entire content of the file at once.
This post will discuss various ways to read a file line by line, looking first at how to read files of any type and then focusing specifically on how to read CSV files.
How to Get a File’s Contents in Go
We’re going to start by doing something very simple in order to refresh our memories about file handling. We’ll just get all of the contents of a file and print it out to the console:
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	const fileName = "test.txt"

	// os.ReadFile loads the whole file into memory as a byte slice.
	fileContents, err := os.ReadFile(fileName)
	if err != nil {
		log.Fatalln(err)
	}

	fmt.Printf("%s\n", fileContents)
}
This is the complete contents of the file that we’re reading (just a single sentence spread over two lines):
test.txt
This is an
example file.
And that is, of course, exactly what gets printed out to the console when we run the code above.
We read the file by using the os.ReadFile function, which loads the entire contents of the file into memory. This can be a problem if we’re working with a huge file that requires large amounts of free memory, but with our small file, it’s okay.
An equivalent function, ioutil.ReadFile, used to be the standard choice, and it still exists in the "io/ioutil" package. However, that package has been deprecated: as of Go 1.16, its functions are provided by the "io" and "os" packages instead.
Note that we use the "%s" verb with the fmt.Printf function in order to specify that fileContents should be printed out as a string. This is necessary because os.ReadFile returns a byte slice, so if we had just printed out fileContents using, for example, the fmt.Println function, we would simply see the numeric values of all the bytes that make up the file’s contents, not a human-readable representation.
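To make that difference concrete, here is a minimal sketch that formats the same byte slice both ways. The formatBoth helper is a name I've made up for illustration; it isn't part of the standard library.

```go
package main

import "fmt"

// formatBoth is a hypothetical helper that returns the default formatting of a
// byte slice alongside its formatting with the %s verb.
func formatBoth(data []byte) (asBytes, asText string) {
	asBytes = fmt.Sprint(data)       // default formatting: the numeric byte values
	asText = fmt.Sprintf("%s", data) // %s treats the bytes as text
	return asBytes, asText
}

func main() {
	asBytes, asText := formatBoth([]byte("Go"))
	fmt.Println(asBytes) // [71 111]
	fmt.Println(asText)  // Go
}
```

Converting with string(fileContents) before printing would work just as well as using the "%s" verb.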
The Simplest Way to Read a File Line by Line
Let’s look now at how to split the contents of a file up into lines, so we can handle them individually:
package main

import (
	"fmt"
	"log"
	"os"
	"regexp"
)

var (
	// lineBreakRegExp matches both Unix ("\n") and Windows ("\r\n") line endings.
	lineBreakRegExp = regexp.MustCompile(`\r?\n`)
)

func main() {
	const fileName = "test.txt"

	fileContents, err := os.ReadFile(fileName)
	if err != nil {
		log.Fatalln(err)
	}

	// A limit of -1 asks Split for every substring.
	fileLines := lineBreakRegExp.Split(string(fileContents), -1)

	for i, line := range fileLines {
		fmt.Printf("%d) \"%s\"\n", i+1, line)
	}
}
We begin by defining a regular expression that matches the end of a line, whether it be Unix-style ("\n") or Windows-style ("\r\n").
We then simply use the Split method on this regex in order to break the contents of our file up into separate strings, one for each line. Since we only have two lines in our entire file, we should only have two strings in the fileLines slice of strings.
We pass -1 as the second argument to the Split method in order to signify that we want to return as many substrings as possible. If we had chosen a positive number, however, the method would have returned at most that many substrings, leaving the rest of the text unsplit in the final one.
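The effect of the second argument is easier to see with a small experiment. The sketch below uses an in-memory string rather than a file, and the splitLines helper and the sample text are my own assumptions for illustration. Notice that when the text ends with a line break, Split returns an extra empty string at the end, which you may want to skip.

```go
package main

import (
	"fmt"
	"regexp"
)

var lineBreakRegExp = regexp.MustCompile(`\r?\n`)

// splitLines is a hypothetical helper that wraps the Split call so that the
// limit can be varied.
func splitLines(text string, limit int) []string {
	return lineBreakRegExp.Split(text, limit)
}

func main() {
	// Assumed sample input with mixed line endings and a trailing line break.
	const text = "one\r\ntwo\nthree\n"

	fmt.Printf("%q\n", splitLines(text, -1)) // ["one" "two" "three" ""]
	fmt.Printf("%q\n", splitLines(text, 2))  // ["one" "two\nthree\n"]
}
```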
Finally, we iterate through each of the lines of the file, printing them out to the screen along with the line number, as seen in the output below:
1) "This is an"
2) "example file."
A Better Way to Read a File Line by Line
In the previous section, we hand-coded a regular expression in order to split the lines, but there’s a better, more idiomatic way to get the lines from the file. Have a look at the example below:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	const fileName = "test.txt"

	file, err := os.Open(fileName)
	if err != nil {
		log.Fatalln(err)
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)

	// Scan advances to the next line and returns false at the end of the file or on error.
	for i := 1; scanner.Scan(); i++ {
		fmt.Printf("%d) \"%s\"\n", i, scanner.Text())
	}

	if err := scanner.Err(); err != nil {
		log.Fatalln(err)
	}
}
We now use the os.Open function to get a handle to our file, rather than loading all of its contents at once. We use the defer keyword to close the file handle automatically when the main function reaches its end, which is best practice.
The real work is done with the help of a bufio.Scanner. When using this, you can set a custom splitting function (of the type bufio.SplitFunc), but by default it will split every time it reaches the end of a line, just as our regex previously did.
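As a quick illustration of swapping the splitting function, the sketch below uses bufio.ScanWords, one of the ready-made split functions in the standard library, to read words instead of lines. The scanWords helper is a hypothetical name, and it reads from a string rather than a file to keep the example self-contained.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// scanWords is a hypothetical helper that returns every whitespace-separated
// word in text.
func scanWords(text string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(text))
	scanner.Split(bufio.ScanWords) // replaces the default, bufio.ScanLines

	var words []string
	for scanner.Scan() {
		words = append(words, scanner.Text())
	}
	return words, scanner.Err()
}

func main() {
	words, err := scanWords("This is an\nexample file.")
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Printf("%q\n", words) // ["This" "is" "an" "example" "file."]
}
```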
We can iterate through the file until scanner.Scan returns false. It does this either when it reaches the end of our file or when it encounters an error. We print each line, as we did before, as we iterate.
The i variable isn’t strictly necessary (we could have used a for loop with only a single condition), but we’re using it here in order to count the line numbers.
If an error has been encountered as the scanner reads the text, then calling scanner.Err will return it. So we finally check to see if that method returns nil or not, exiting with a log statement if there is an error to handle.
Reading a File in Chunks to Save Memory
We still have one problem though, which I touched on earlier. The examples that used os.ReadFile required us to load the entire contents of the file into memory at once. If we’re working on a machine with limited resources (such as a smartphone or embedded system), we may not have enough RAM to do this. The bufio.Scanner avoids the problem by reading through an internal buffer, but it hides the details of how that works.
So we’re now going to look at how to read an entire file while only loading small chunks of its contents into memory at any one time.
For this section, I’ve created another file, which has more lines of text than our previous one. It just contains the English name of the number of each line from one to twenty, as you can see below:
lines.txt
LINE ONE
LINE TWO
LINE THREE
LINE FOUR
LINE FIVE
LINE SIX
LINE SEVEN
LINE EIGHT
LINE NINE
LINE TEN
LINE ELEVEN
LINE TWELVE
LINE THIRTEEN
LINE FOURTEEN
LINE FIFTEEN
LINE SIXTEEN
LINE SEVENTEEN
LINE EIGHTEEN
LINE NINETEEN
LINE TWENTY
This larger file will allow us to take better advantage of only reading partial chunks of the file at any one time, without having to store the file’s whole contents in memory. This would, of course, be an even greater advantage if the file were larger still, containing perhaps many gigabytes or terabytes of textual or binary data.
Look at the code below, which reads small chunks of our file, gets the individual lines and prints them out to the console as it goes:
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// printLines prints each line it receives, then signals that it has finished.
func printLines(lineChannel <-chan string, doneChannel chan<- struct{}) {
	var i int

	for line := range lineChannel {
		i++
		fmt.Printf("%d) \"%s\"\n", i, line)
	}

	doneChannel <- struct{}{}
}

func main() {
	const fileName = "lines.txt"

	file, err := os.Open(fileName)
	if err != nil {
		log.Fatalln(err)
	}
	defer file.Close()

	lineChannel := make(chan string)
	doneChannel := make(chan struct{})

	go printLines(lineChannel, doneChannel)

	reader := bufio.NewReader(file)
	buffer := make([]byte, 4)

	var currentLineBuilder strings.Builder
	var seenCarriageReturn bool

	for {
		bytesRead, err := reader.Read(buffer)
		if err != nil {
			if err == io.EOF {
				break
			}

			log.Fatalln(err)
		}

		// Only the first bytesRead bytes of the buffer hold fresh data.
		chunk := buffer[:bytesRead]

		for _, b := range chunk {
			switch b {
			case '\n':
				seenCarriageReturn = false
				lineChannel <- currentLineBuilder.String()
				currentLineBuilder.Reset()
			case '\r':
				if seenCarriageReturn {
					// The previous '\r' was not part of a line break, so keep it.
					currentLineBuilder.WriteByte('\r')
				}
				seenCarriageReturn = true
			default:
				if seenCarriageReturn {
					currentLineBuilder.WriteByte('\r')
					seenCarriageReturn = false
				}

				currentLineBuilder.WriteByte(b)
			}
		}
	}

	// Send the final line if the file did not end with a line break.
	if currentLineBuilder.Len() > 0 {
		lineChannel <- currentLineBuilder.String()
	}

	close(lineChannel)
	<-doneChannel
}
This code is more complicated than our previous examples, but it’s really not too hard to understand. We use the printLines helper function, which we start in its own goroutine, in order to print out the individual lines of our file.
We use channels to pass data to and from the printLines function. The lineChannel is used to send each line as a string, whereas the doneChannel is used only to send a single signal when all of the lines have been handled (i.e. when lineChannel is closed). We send an empty struct to doneChannel, rather than data of another type, because an empty struct occupies no memory, so it can be used simply as a signal.
When we declare the parameters for the printLines function, we have an arrow that points away from the chan keyword in the case of lineChannel and an arrow that points towards the chan keyword in the case of doneChannel.
The arrow pointing away from the chan keyword (<-chan) indicates that the function can only receive data from the channel (and the compiler will complain if it tries to send data). On the other hand, the arrow pointing towards the chan keyword (chan<-) indicates that the function can only send data to the channel. These arrows aren’t necessary, but they help to make our intentions clear, both to ourselves and to others who may read our code.
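Here is a stripped-down sketch of the two channel directions on their own, without any file handling. The produce and collect function names are my own, chosen purely for illustration.

```go
package main

import "fmt"

// produce can only send on out (chan<-). Closing a send-only channel is
// allowed, and it tells the receiver that no more values are coming.
func produce(out chan<- string, lines []string) {
	for _, line := range lines {
		out <- line
	}
	close(out)
}

// collect can only receive from in (<-chan). Trying to send on in here would
// be a compile-time error.
func collect(in <-chan string) []string {
	var received []string
	for line := range in {
		received = append(received, line)
	}
	return received
}

func main() {
	lines := make(chan string) // bidirectional where it is created

	go produce(lines, []string{"LINE ONE", "LINE TWO"})

	fmt.Println(collect(lines)) // [LINE ONE LINE TWO]
}
```

The channel itself is still bidirectional in main; Go converts it to the restricted type automatically when it is passed to each function.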
In the main function, we open the file in the same way as we saw earlier, but now we create a bufio.Reader object in order to perform our chunked reads.
We also declare a fixed-length byte slice to use as a buffer, storing the file data as it is read. I have chosen to make this four bytes long: in real code, this would be much larger, since it’s inefficient to read such small amounts of data at a time, but I wanted to show that it is possible to read files of any size with very small buffers. If it takes longer than it would with a larger buffer, then you’re sacrificing time in order to save memory, which may be a reasonable tradeoff in certain circumstances.
We use the reader.Read method within an infinite loop to fill the buffer with data. This will work its way through the file on each iteration, until it reaches the end. When it does so, it will return an io.EOF error, which we check for to end the loop.
The reader.Read method also returns the number of bytes that have been read, which is important, since there may be fewer bytes read than the buffer can hold. We declare a chunk variable that is created by reslicing the buffer, getting only the bytes that have actually been read into it.
Then we simply iterate through all of the bytes in chunk. When we see a carriage return, we hold it back rather than writing it straight away. If the next byte is a newline character, we discard the carriage return (since that pair is how line breaks are defined in Windows); otherwise, we write the carriage return to the strings.Builder that we are using to hold each line. We write all of the other bytes as we iterate through them.
When we meet a newline character, however, we send the string created by currentLineBuilder to lineChannel, so that it can be handled by our printLines helper function, where it will be printed out to the console. We also reset currentLineBuilder, so that it can be reused to hold the next line.
When we have reached the end of our file, we check if currentLineBuilder holds any bytes: if it does, we send them as a string to printLines, so that we don’t accidentally miss the last line.
Finally, we close lineChannel in order to show that we are not going to send any more data. This will cause the for loop in printLines to come to an end, triggering an empty struct to be sent to doneChannel. The main function will wait for this and then exit, with our work completed.
We should now have printed the entire contents of the file to the screen, working through it line by line, without having to read the whole file into memory at once, as we did previously.
Reading a CSV File and Formatting its Data
A Comma-Separated-Values (CSV) file is a plaintext file that contains a list of data. You can think of it as the simplest form of spreadsheet: there are cells of data organized into rows and columns. The columns are separated by commas (hence the name of the file format) and the rows are separated by newline characters.
There can also be an optional header, where the first row is used to define names for each of the columns. This isn’t a strictly defined part of the file format, but it is a commonly adopted convention.
I’ve created a CSV file, as you can see below, which contains information about the three major chemical elements that constitute the Earth’s atmosphere:
elements.csv
Element Name,Element Symbol,Atomic Number,Percentage of Earth's Atmosphere
Nitrogen,N,7,78
Oxygen,O,8,21
Argon,Ar,18,1
We can now read this file in Go, if we want either to manipulate the data or simply to display it to the screen. Look at the code below for an example of the latter:
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	const fileName = "elements.csv"

	file, err := os.Open(fileName)
	if err != nil {
		log.Fatalln(err)
	}
	defer file.Close()

	// Wrap the file in a CSV reader and load every record at once.
	csvReader := csv.NewReader(file)

	data, err := csvReader.ReadAll()
	if err != nil {
		log.Fatalln(err)
	}

	const outputWidth = 85

	verticalSeparator := " || "
	horizontalSeparator := strings.Repeat("-", outputWidth)

	for i, record := range data {
		if i > 0 {
			fmt.Println(horizontalSeparator)
		}

		output := strings.Join(record, verticalSeparator)

		// Indent by half of the unused width so that the row is roughly centred.
		outputIndent := (outputWidth - len(output)) / 2
		if outputIndent > 0 {
			fmt.Print(strings.Repeat(" ", outputIndent))
		}

		fmt.Println(output)
	}
}
We open the file using the os.Open function as seen earlier, but now we create a specific type of reader using the csv.NewReader function from the standard library. Calling its ReadAll method will read all of the records (note that a row is sometimes called a record) into memory at once.
The data variable holds a slice of slices of strings. In other words, it’s a two-dimensional slice of strings. The first dimension contains the rows, or records, and the second dimension contains the columns.
So data[2][3] would access the value in the cell at the fourth column and third row (index 2 is the third and index 3 is the fourth, because our slice is zero-indexed), which, in this case, would be the string "21" (the approximate percentage of the Earth’s atmosphere that is made up of oxygen).
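Since every cell is a string, you'd need to convert it before doing any arithmetic. The sketch below parses the same CSV data from an in-memory string and converts one cell with strconv.Atoi. The parse and cellAsInt helpers are hypothetical names of my own, and the bounds check is there because indexing past the end of a slice would otherwise panic.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// sample mirrors the contents of elements.csv.
const sample = `Element Name,Element Symbol,Atomic Number,Percentage of Earth's Atmosphere
Nitrogen,N,7,78
Oxygen,O,8,21
Argon,Ar,18,1
`

// parse is a hypothetical helper that reads every record from CSV text.
func parse(text string) ([][]string, error) {
	return csv.NewReader(strings.NewReader(text)).ReadAll()
}

// cellAsInt is a hypothetical helper that converts one cell to an int,
// reporting an error if the cell doesn't exist or isn't a whole number.
func cellAsInt(data [][]string, row, column int) (int, error) {
	if row < 0 || row >= len(data) || column < 0 || column >= len(data[row]) {
		return 0, fmt.Errorf("no cell at row %d, column %d", row, column)
	}
	return strconv.Atoi(data[row][column])
}

func main() {
	data, err := parse(sample)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}

	percentage, err := cellAsInt(data, 2, 3)
	if err != nil {
		fmt.Println("conversion failed:", err)
		return
	}
	fmt.Println(percentage) // 21
}
```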
The remaining code in the main function is simply my attempt to format the data in an attractive way before printing it out to the console. We assume that the width of our console will be 85 characters, and then we create a horizontal separator of '-' runes that will go all the way across to separate one row from another (we use the strings.Repeat function to create a long row of repeating characters). The vertical separator likewise goes between each of the columns within each row.
Then we calculate an outputIndent in order to make sure that each line will be roughly centred, so long as our console is large enough to print 85 characters, as we initially assumed.
The output of this code can be seen below:
 Element Name || Element Symbol || Atomic Number || Percentage of Earth's Atmosphere
-------------------------------------------------------------------------------------
                              Nitrogen || N || 7 || 78
-------------------------------------------------------------------------------------
                               Oxygen || O || 8 || 21
-------------------------------------------------------------------------------------
                               Argon || Ar || 18 || 1
When the data is displayed like this (with separators between the cells), it perhaps helps to make it clearer how a CSV file can be used to store spreadsheet data.
Indeed, most of the major spreadsheet applications — such as Google Sheets, which is available to use for free online — can import data from CSV files directly.
Reading the Records of a CSV File Line by Line
Finally, we shall look at how to print out each of the records to the screen without having to load the entire contents of the CSV file into memory at once.
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

func main() {
	const fileName = "elements.csv"

	file, err := os.Open(fileName)
	if err != nil {
		log.Fatalln(err)
	}
	defer file.Close()

	csvReader := csv.NewReader(file)

	for {
		// Read returns one record at a time, and io.EOF when there are none left.
		record, err := csvReader.Read()
		if err != nil {
			if err == io.EOF {
				break
			}

			log.Fatalln(err)
		}

		fmt.Printf("%s\n", strings.Join(record, "****"))
	}
}
Instead of using the ReadAll method on our csv.Reader variable, we now use the Read method. This takes no arguments and returns a slice of strings, which contains the data for all the cells in the current row. It also returns an error value, which will be io.EOF once there are no more records to read, or another error if the file cannot be read or a record is malformed.
We check for io.EOF, just as we did when reading the text file with a buffer, in order to know when to stop our loop. This time we are reading whole records (i.e. whole lines) of arbitrary length, and we simply print each record as we iterate.
We are no longer creating a grid-type layout for the output. Instead, we simply separate each string in the record from the others with four asterisks.
You could, of course, modify this code snippet so that each of the records is sent via a channel to an independent goroutine as it’s read, freeing up the main function simply to perform the iteration. This is what we did in a previous example when reading the lines of a plaintext file. However, there would be little advantage in applying such an approach here, since we’re just printing each record out to the console: if we had been performing more processor-intensive manipulation and modification of each record, then such an approach could be more easily justified.
The scanner.Scan() method does not load the entire contents of the file into memory at once. Instead, it reads the file line by line using a scanner. The scanner.Scan() function reads one line at a time, and the scanner.Text() function returns the text of the current line. This allows you to process the file line by line without loading the entire file into memory.
Yes, I tried to imply that in the text, but you’re right to make it clear and point out that this is one of the advantages of using a scanner. Thanks for commenting!