Basic NLP (Natural Language Processing) With Prose

Friday, 11 February 2022, 23:03 PM

by James Smith (Golang Project Structure Admin)

No matter how many programming languages you learn, you will almost certainly never know them as well and as instinctively as your natural language.

Even though it may not always seem to be the case when you’re dealing with a difficult programming problem, the English language is much more complicated than Golang: its vocabulary is much larger and its grammar more intricate.

Yet many of us learned English at our mother’s knee, or on our father’s lap, and we use it as the primary means of communicating the deepest thoughts, feelings, desires and ideas that we have inside us. It’s amazing that it seems so well suited to the task, but natural language has evolved over thousands of years, so it has had a great deal of time to adapt to our needs.

The QWERTY keyboard used to enter words on most modern computers has a history that goes back to old typewriters. This layout of keys was first designed in 1873.

Computer programming languages were developed much more recently, and they tend to be more exact and less enigmatic than the words that come out of our mouths. A machine needs precise instructions if it is going to perform a certain task. There is more scope for ambiguity in the English language; this can cause confusion, but it can also be used constructively to create puns and jokes.

In this post, we’re going to write some code — first in Go alone and then using a library called prose — that will allow us to grapple with the complexity of natural language and learn a little more about the words and their structure. The sections below are designed for complete beginners, so you don’t need any background knowledge, just enthusiasm.

Downloading a Poem From Project Gutenberg

Poetry is perhaps one of the most complicated forms of human language. It is also one of the oldest, which suggests that it may express something fundamental about how we communicate. Many of the oldest literary texts that have been preserved — such as the Babylonian Epic of Gilgamesh, Homer’s Iliad or the Hindu Rig Veda — are written in poetic forms with rhythm, metrical stress and other effects to enhance their appeal and to make them less tricky to memorize.

When we begin doing some basic natural language processing, we will use a collection of poems by the 19th-century American author Emily Dickinson to give us some text to work with. This collection is in the public domain and it is freely accessible from Project Gutenberg.

The code below downloads the relevant text file, picks out a single poem and prints it to the screen as a string:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	res, err := http.Get("https://www.gutenberg.org/cache/epub/2678/pg2678.txt")
	if err != nil {
		panic(err)
	}
	defer res.Body.Close()

	content, err := io.ReadAll(res.Body)
	if err != nil {
		panic(err)
	}

	const poemStartMarkerContent = "\n        XX."
	const poemEndMarkerContent = "\n        XXI."

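	poemStartIndex := bytes.Index(content, []byte(poemStartMarkerContent))
	poemEndIndex := bytes.Index(content, []byte(poemEndMarkerContent))

	// Stop early rather than slicing with an index of -1.
	if poemStartIndex == -1 || poemEndIndex == -1 {
		panic("cannot find poem")
	}

	// Skip past the start marker so that it is not part of the poem.
	poemStartIndex += len(poemStartMarkerContent)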

	poemContent := string(bytes.TrimSpace(content[poemStartIndex:poemEndIndex]))

	fmt.Println(poemContent)
}

You can see how we use two variables, poemStartMarkerContent and poemEndMarkerContent, to find the bounds of our poem: the start content contains the title, in Roman numerals, of the poem that we want to extract, while the end content contains the title of the next poem, which we want to stop at.

Note how we add the length of the poemStartMarkerContent string to the initial index, because we don’t want to include the content of the marker in our final string.

If you run the code above, so long as you have an active internet connection, it should download and print the following poem:

I taste a liquor never brewed,
From tankards scooped in pearl;
Not all the vats upon the Rhine
Yield such an alcohol!

Inebriate of air am I,
And debauchee of dew,
Reeling, through endless summer days,
From inns of molten blue.

When landlords turn the drunken bee
Out of the foxglove's door,
When butterflies renounce their drams,
I shall but drink the more!

Till seraphs swing their snowy hats,
And saints to windows run,
To see the little tippler
Leaning against the sun!

Emily Dickinson lived a reclusive life in Amherst, Massachusetts, and her poems often touch on themes of grief, love, longing, despair and nature, often using symbols of Christian spirituality, as she tried to understand herself and her place in the world.

The **House of the Seven Gables** is an antique building in the state of Massachusetts. It now houses a museum.

You don’t need to enjoy the poem or understand how to interpret and appreciate its literary qualities or historical background, however, as we’re simply going to extract some basic information from it. We could have easily chosen to write our code with the McDonald’s menu or the instruction manual of an inkjet printer — but I think it’s nicer to spend at least a little time with an elegant work of art.
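If you'd like to experiment with the marker technique without an internet connection, here is a minimal sketch that works on an in-memory string. The helper name extractBetween and the sample text are my own inventions for illustration, and I've used the strings package rather than bytes, since the input is already a string. Unlike the program above, it looks for the end marker only after the start marker.

```go
package main

import (
	"fmt"
	"strings"
)

// extractBetween returns the text that lies between the first occurrence of
// startMarker and the next occurrence of endMarker after it. The boolean is
// false if either marker is missing.
func extractBetween(text, startMarker, endMarker string) (string, bool) {
	start := strings.Index(text, startMarker)
	if start == -1 {
		return "", false
	}

	// Skip past the marker itself so that it is not part of the result.
	start += len(startMarker)

	// Search only in the text that follows the start marker.
	end := strings.Index(text[start:], endMarker)
	if end == -1 {
		return "", false
	}

	return strings.TrimSpace(text[start : start+end]), true
}

func main() {
	// Hypothetical stand-in for the downloaded file.
	const sample = "\n        I.\n\nfirst poem\n\n        II.\n\nsecond poem\n\n        III.\n"

	poem, ok := extractBetween(sample, "\n        II.", "\n        III.")
	fmt.Println(poem, ok)
}
```

Returning a boolean alongside the string lets the caller decide what to do when a marker is missing, instead of slicing with an index of -1.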

Counting Letters

To begin, let’s do something really simple by counting how often each of the twenty-six letters in the English alphabet occurs in our poem:

package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"sort"
	"strings"
)

func downloadPoem() (poemContent string, err error) {
	res, err := http.Get("https://www.gutenberg.org/cache/epub/2678/pg2678.txt")
	if err != nil {
		return
	}
	defer res.Body.Close()

	content, err := io.ReadAll(res.Body)
	if err != nil {
		return
	}

	const poemStartMarkerContent = "\n        XX."
	const poemEndMarkerContent = "\n        XXI."

	poemStartIndex := bytes.Index(content, []byte(poemStartMarkerContent))
	poemEndIndex := bytes.Index(content, []byte(poemEndMarkerContent))

	if poemStartIndex == -1 || poemEndIndex == -1 {
		err = errors.New("cannot find poem")
		return
	}

	poemStartIndex += len(poemStartMarkerContent)

	poemContent = string(bytes.TrimSpace(content[poemStartIndex:poemEndIndex]))

	return
}

func main() {
	poemContent, err := downloadPoem()
	if err != nil {
		panic(err)
	}

	poemContent = strings.ToLower(poemContent)

	letterCounts := make(map[rune]int)

	for _, r := range poemContent {
		if r >= 'a' && r <= 'z' {
			letterCounts[r]++
		}
	}

	sortedLetters := make([]rune, 26)

	for i := 0; i < 26; i++ {
		sortedLetters[i] = 'a' + rune(i)
	}

	sort.Slice(sortedLetters, func(i, j int) bool {
		return letterCounts[sortedLetters[i]] > letterCounts[sortedLetters[j]]
	})

	for _, r := range sortedLetters {
		fmt.Printf("%c :: %d\n", r, letterCounts[r])
	}
}

You can see that I’ve created a downloadPoem helper function, which contains the code we previously discussed, so that we can easily reuse it in future.

In the main function, we convert the poem’s content to lower-case, so that we count all cases of the same letter together, then we simply iterate through each of the runes in the text, incrementing the integer value stored in the letterCounts map for each letter between 'a' and 'z'. We don’t need to worry about incrementing a value in the map that has never been set before, because it will automatically equal the zero value for the integer type, which is just 0 — so when incremented, it will become 1.

Next we add all of the rune keys, i.e. the letters from 'a' to 'z', to a slice, which we will then sort in order of how often each letter occurs in the poem. This allows us to use this slice to iterate over the keys in the map and print out the values in our predetermined order, since maps in Go are, by design, unordered data structures.

Running the program, you should get the following output, showing that the letter E is the most commonly occurring in our poem:

e :: 48
n :: 33
t :: 31
r :: 27
o :: 27
s :: 26
a :: 25
i :: 24
l :: 23
h :: 20
d :: 17
u :: 15
m :: 8
f :: 8
w :: 8
b :: 7
p :: 6
g :: 6
c :: 5
k :: 3
y :: 3
v :: 3
q :: 1
x :: 1
j :: 0
z :: 0

This is as we might expect, since many independent analyses of a much larger corpus of English texts have also come to the conclusion that E is the most commonly used letter in the English language. That’s not hard to understand when you just think, for example, of how often it’s used as a silent letter at the end of words like skate or hinge, as well as in many other places.

In our poem, the word debauchee occurs, which contains five vowels, three of which are the letter E. This is a rare word, but it’s easy to understand its meaning, since it just refers to someone who is debauched (immorally or excessively hedonistic), in the same way that the word employee means someone who is employed.
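The counting-and-ranking idea from this section can also be tried on a short string. The sketch below is a variation rather than the exact program above: the function names countLetters and rankLetters are hypothetical, it only ranks letters that actually occur, and it breaks ties alphabetically so that the order is the same on every run.

```go
package main

import (
	"fmt"
	"sort"
)

// countLetters tallies the lower-case ASCII letters in text.
func countLetters(text string) map[rune]int {
	counts := make(map[rune]int)
	for _, r := range text {
		if r >= 'a' && r <= 'z' {
			counts[r]++ // a missing key reads as 0, so this becomes 1
		}
	}
	return counts
}

// rankLetters returns the keys of counts ordered by descending count, with
// ties broken alphabetically.
func rankLetters(counts map[rune]int) []rune {
	letters := make([]rune, 0, len(counts))
	for r := range counts {
		letters = append(letters, r)
	}

	sort.Slice(letters, func(i, j int) bool {
		ci, cj := counts[letters[i]], counts[letters[j]]
		if ci != cj {
			return ci > cj
		}
		return letters[i] < letters[j]
	})

	return letters
}

func main() {
	counts := countLetters("banana bread")
	for _, r := range rankLetters(counts) {
		fmt.Printf("%c :: %d\n", r, counts[r])
	}
}
```

For "banana bread", the letter a comes first with four occurrences, followed by b and n with two each.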

Counting Words

Now that we’ve added up the individual letters that make up our poem, let’s split the text into words and see if any of them recur:

package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"regexp"
	"sort"
	"strings"
)

func downloadPoem() (poemContent string, err error)

var (
	nonalphabeticRunesRegexp = regexp.MustCompile(`[^a-z]`)
)

func main() {
	poemContent, err := downloadPoem()
	if err != nil {
		panic(err)
	}

	poemContent = strings.ToLower(poemContent)
	poemWords := strings.Fields(poemContent)

	wordCounts := make(map[string]int)

	for _, w := range poemWords {
		w = nonalphabeticRunesRegexp.ReplaceAllLiteralString(w, "")

		// Skip tokens that contained no letters at all, such as a lone dash.
		if w == "" {
			continue
		}

		wordCounts[w]++
	}

	sortedWords := make([]string, len(wordCounts))

	var i int
	for w := range wordCounts {
		sortedWords[i] = w
		i++
	}

	sort.Slice(sortedWords, func(i, j int) bool {
		return wordCounts[sortedWords[i]] > wordCounts[sortedWords[j]]
	})

	for _, w := range sortedWords {
		fmt.Printf("%s :: %d\n", w, wordCounts[w])
	}
}

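The code above is very similar to what we wrote before. It also relies on the downloadPoem helper, but I’ve just included the bodiless function declaration this time to save space, so you will need to paste in the full function body from the previous example before the program will compile.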

The major difference is that we now use the strings.Fields function, which splits the input text “around each instance of one or more consecutive white space characters”, as the documentation in the standard library puts it. This separates our poem into individual words.

We also use a regular expression (nonalphabeticRunesRegexp) in order to remove the punctuation from each word, so that, for example, "test:", "test." and "test" would not be considered to be three different words but the same one.
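To see what that regular expression does to individual tokens, here is a small sketch of the clean-up step on its own. The helper name normalizeWord is hypothetical; the pattern is the same one used in the program above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonLetters matches every character that is not a lower-case ASCII letter.
var nonLetters = regexp.MustCompile(`[^a-z]`)

// normalizeWord lower-cases a token and strips every character that is not
// one of the letters a to z.
func normalizeWord(token string) string {
	return nonLetters.ReplaceAllLiteralString(strings.ToLower(token), "")
}

func main() {
	for _, token := range []string{"test:", "Test.", "test", "foxglove's", "--"} {
		fmt.Printf("%q -> %q\n", token, normalizeWord(token))
	}
}
```

Two side effects are worth knowing about: an apostrophe is removed along with other punctuation, so "foxglove's" becomes "foxgloves", and a token with no letters at all becomes an empty string.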

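Finally, we print out each of the words in descending order of frequency, using a very similar sorting technique to that seen in the letter-counting example. Words with the same count may appear in a different order each time, because Go randomizes the iteration order of maps and our sort has no tie-breaker. Here is the output I got when running the compiled program: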

the :: 7
of :: 4
i :: 3
and :: 2
their :: 2
from :: 2
to :: 2
when :: 2
molten :: 1
run :: 1
see :: 1
against :: 1
shall :: 1
brewed :: 1
in :: 1
all :: 1
am :: 1
days :: 1
foxgloves :: 1
drams :: 1
little :: 1
upon :: 1
through :: 1
but :: 1
out :: 1
endless :: 1
taste :: 1
bee :: 1
sun :: 1
never :: 1
blue :: 1
saints :: 1
air :: 1
scooped :: 1
inebriate :: 1
yield :: 1
dew :: 1
reeling :: 1
drunken :: 1
till :: 1
tippler :: 1
tankards :: 1
snowy :: 1
hats :: 1
an :: 1
debauchee :: 1
summer :: 1
drink :: 1
seraphs :: 1
windows :: 1
leaning :: 1
pearl :: 1
vats :: 1
rhine :: 1
landlords :: 1
butterflies :: 1
swing :: 1
more :: 1
not :: 1
turn :: 1
inns :: 1
a :: 1
liquor :: 1
such :: 1
door :: 1
renounce :: 1
alcohol :: 1

The words that recur tend to be, as we might have expected, the small grammatical words of English: the definite article, along with prepositions, pronouns and conjunctions. Each of the nouns, adjectives and verbs in the poem (such as “summer”, “molten” or “bee”) seems to occur only once.

That’s a useful finding, since it tells us something about the style of our text: some other poets would consciously and conspicuously use a greater degree of repetition to produce certain stylistic effects, but Emily Dickinson is more sparing and thoughtful with her words.

Using the Prose Library to Count Nouns

So far we’ve discovered some interesting things about the poem, but we’ve only really been doing the sort of basic string manipulation that would be possible to achieve with a few lines of code in any programming language.

In this next section, however, we’re going to use a Go library called prose, which is able to analyze our text and try to understand its underlying grammatical structure.

The code below attempts to count only the nouns in our poem, not each and every word:

package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"sort"
	"strings"

	prose "github.com/jdkato/prose/v2"
)

func downloadPoem() (poemContent string, err error)

func main() {
	poemContent, err := downloadPoem()
	if err != nil {
		panic(err)
	}

	doc, err := prose.NewDocument(poemContent)
	if err != nil {
		panic(err)
	}

	nounCounts := make(map[string]int)
	nounIsPlural := make(map[string]bool)

	for _, tok := range doc.Tokens() {
		switch tok.Tag {
		case "NN", "NNS", "NNP", "NNPS":
			noun := strings.ToLower(tok.Text)

			nounCounts[noun]++
			nounIsPlural[noun] = tok.Tag[len(tok.Tag)-1] == 'S'
		}
	}

	sortedNouns := make([]string, len(nounCounts))

	var i int
	for w := range nounCounts {
		sortedNouns[i] = w
		i++
	}

	sort.Slice(sortedNouns, func(i, j int) bool {
		nI, nJ := sortedNouns[i], sortedNouns[j]
		cI, cJ := nounCounts[nI], nounCounts[nJ]

		if cI == cJ {
			return nI < nJ
		}

		return cI > cJ
	})

	for _, w := range sortedNouns {
		fmt.Printf("%s :: %d :: %v\n", w, nounCounts[w], nounIsPlural[w])
	}
}

We first interact with the prose library by creating a new prose.Document struct from our poemContent string. This will allow our text to be tokenized, so that we can iterate over each token. (Most tokens contain a single word, but some are used for punctuation.)

We only work with those which have a tok.Tag that we’re interested in. The various values that the tok.Tag string field can hold are listed in the documentation for prose. We only want to include nouns, so we’re only interested in the following four tags:

Tag	Description
`"NN"`	A singular or mass noun, which is not a proper noun.
`"NNS"`	A plural noun, which is not a proper noun.
`"NNP"`	A singular proper noun.
`"NNPS"`	A plural proper noun.

A proper noun — if you’re unaware — is just the name of a place, person or thing, and it usually begins with a capital letter.
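The four tags in the table can be tested with a few small helpers. This is only a sketch that works on plain tag strings, so it doesn't need the prose library at all; the function names are my own. Note that checking for a trailing "S" is only safe once we know the tag is a noun tag, because other Penn Treebank tags, such as "POS" for a possessive ending, also end in that letter.

```go
package main

import (
	"fmt"
	"strings"
)

// isNounTag reports whether tag is one of the four noun tags.
func isNounTag(tag string) bool {
	switch tag {
	case "NN", "NNS", "NNP", "NNPS":
		return true
	}
	return false
}

// isPluralNounTag reports whether tag marks a plural noun ("NNS" or "NNPS").
func isPluralNounTag(tag string) bool {
	return isNounTag(tag) && strings.HasSuffix(tag, "S")
}

// isProperNounTag reports whether tag marks a proper noun ("NNP" or "NNPS").
func isProperNounTag(tag string) bool {
	return isNounTag(tag) && strings.HasPrefix(tag, "NNP")
}

func main() {
	for _, tag := range []string{"NN", "NNS", "NNP", "NNPS", "POS"} {
		fmt.Printf("%-4s noun=%-5v plural=%-5v proper=%v\n",
			tag, isNounTag(tag), isPluralNounTag(tag), isProperNounTag(tag))
	}
}
```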

We sort the nouns by frequency and keep a record of whether each noun is singular or plural. When I ran the code, I got the following list of nouns:

from :: 2 :: false
air :: 1 :: false
alcohol :: 1 :: false
blue :: 1 :: false
butterflies :: 1 :: true
days :: 1 :: true
debauchee :: 1 :: false
dew :: 1 :: false
door :: 1 :: false
drams :: 1 :: true
drunken :: 1 :: false
foxglove :: 1 :: false
hats :: 1 :: true
inebriate :: 1 :: false
inns :: 1 :: false
landlords :: 1 :: true
liquor :: 1 :: false
pearl :: 1 :: false
reeling :: 1 :: false
rhine :: 1 :: false
saints :: 1 :: true
snowy :: 1 :: false
summer :: 1 :: false
sun :: 1 :: false
tankards :: 1 :: true
till :: 1 :: false
tippler :: 1 :: false
vats :: 1 :: true
windows :: 1 :: true
yield :: 1 :: false

That’s impressive, but you can see that the artificial-intelligence algorithm used by prose isn’t perfect. It has tagged some words as nouns incorrectly.

It’s unfortunate that the first word given in the list, “from”, is not a noun. In total, six out of the thirty nouns given are incorrectly tagged. That may seem like a high failure rate, but when you consider that 80% (twenty-four out of thirty) of the nouns were labelled correctly, the success rate is clearly much higher than would be achieved by pure chance alone.

It’s very difficult for computers to “understand” human words and the way they fit together, so that is no mean feat.

Things are made even more complicated by the fact that some words in natural languages like English can be used in one of many different ways: for example, the word “blue” can be used as an adjective or noun (and also, more rarely, as a verb). However, prose correctly identified that it was being used as a noun in our poem.

Likewise, the word “inebriate” can be used as an adjective, noun or verb, but it was correctly tagged by prose as a noun here. This is more impressive, because the verbal form is probably the most common, not the noun.

Even so, I noted the following words which were incorrectly tagged:

Word	Grammatical category	Expected tag
`"from"`	preposition	`"IN"`
`"drunken"`	adjective	`"JJ"`
`"reeling"`	present participle or gerund	`"VBG"`
`"snowy"`	adjective	`"JJ"`
`"till"`	preposition	`"IN"`
`"yield"`	verb (3rd-person plural present)	`"VBP"`

So, yes, we have to take the false positives into account, but what’s more impressive is that I only found two false negatives. In other words, even though six words were incorrectly tagged as nouns, only the words “bee” and “seraphs” (a seraph is a type of heavenly angel) in the poem weren’t tagged as nouns as they should have been.
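Those figures correspond to two standard measures: precision (how many of the tagged words really were nouns) and recall (how many of the real nouns were found). The sketch below just redoes the arithmetic with the counts from this section; the function names are my own.

```go
package main

import "fmt"

// precision is the share of tagged items that were tagged correctly.
func precision(truePositives, falsePositives int) float64 {
	return float64(truePositives) / float64(truePositives+falsePositives)
}

// recall is the share of real items that the tagger managed to find.
func recall(truePositives, falseNegatives int) float64 {
	return float64(truePositives) / float64(truePositives+falseNegatives)
}

func main() {
	const (
		truePositives  = 24 // words correctly tagged as nouns
		falsePositives = 6  // words wrongly tagged as nouns
		falseNegatives = 2  // nouns that the tagger missed
	)

	fmt.Printf("precision: %.1f%%\n", 100*precision(truePositives, falsePositives))
	fmt.Printf("recall:    %.1f%%\n", 100*recall(truePositives, falseNegatives))
}
```

With these counts the precision is 80.0% and the recall is 92.3%, since 24 of the 26 real nouns were found.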

If this also holds true for other texts, we could use prose to find potential nouns and then filter them down manually by removing words that were incorrectly tagged. That could save a lot of time compared to picking out each of the individual nouns by hand.
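As a rough sketch of that workflow, the manual review could be recorded as a list of rejected words, which is then removed from the tagger's counts. The helper removeRejected is hypothetical, and the map below holds only a small sample of the output shown earlier.

```go
package main

import (
	"fmt"
	"sort"
)

// removeRejected deletes every word in rejected from counts and reports how
// many entries were actually removed.
func removeRejected(counts map[string]int, rejected []string) int {
	var removed int
	for _, word := range rejected {
		if _, ok := counts[word]; ok {
			delete(counts, word)
			removed++
		}
	}
	return removed
}

func main() {
	// A small sample of the tagger's output from this section.
	nounCounts := map[string]int{"from": 2, "air": 1, "dew": 1, "till": 1, "yield": 1}

	// Words that a human reviewer decided are not nouns in this poem.
	rejected := []string{"from", "drunken", "reeling", "snowy", "till", "yield"}

	removed := removeRejected(nounCounts, rejected)

	// Sort the surviving nouns so that the output is stable.
	remaining := make([]string, 0, len(nounCounts))
	for noun := range nounCounts {
		remaining = append(remaining, noun)
	}
	sort.Strings(remaining)

	fmt.Println(removed, remaining)
}
```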

Counting Verbs With a Custom Poem

In the example below, I have modified our downloadPoem helper function somewhat, so that it takes two markers as arguments, one to signify the start of our poem in the complete text and another to signify the end, rather than relying on internal constants. We can use this updated function to load a different poem, in order to give ourselves a little variation and also to make sure that there was nothing unusual or specific about the previous poem that could have affected our code in ways that we hadn’t considered.

package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"sort"
	"strings"

	prose "github.com/jdkato/prose/v2"
)

var (
	ErrCannotFindPoem = errors.New("cannot find poem")
)

func downloadPoem(startMarkerContent, endMarkerContent string) (poemContent string, err error) {
	res, err := http.Get("https://www.gutenberg.org/cache/epub/2678/pg2678.txt")
	if err != nil {
		return
	}
	defer res.Body.Close()

	content, err := io.ReadAll(res.Body)
	if err != nil {
		return
	}

	poemStartIndex := bytes.Index(content, []byte(startMarkerContent))
	if poemStartIndex == -1 {
		err = ErrCannotFindPoem
		return
	}

	poemEndIndex := bytes.Index(content[poemStartIndex:], []byte(endMarkerContent))
	if poemEndIndex == -1 {
		err = ErrCannotFindPoem
		return
	}

	poemStartIndex += len(startMarkerContent)
	poemEndIndex += poemStartIndex - len(startMarkerContent)

	poemContent = string(bytes.TrimSpace(content[poemStartIndex:poemEndIndex]))

	return
}

func main() {
	const poemStartMarkerContent = "\n   PURPLE CLOVER."
	const poemEndMarkerContent = "\n        XV."

	poemContent, err := downloadPoem(poemStartMarkerContent, poemEndMarkerContent)
	if err != nil {
		panic(err)
	}

	doc, err := prose.NewDocument(poemContent)
	if err != nil {
		panic(err)
	}

	verbCounts := make(map[string]int)
	verbTypes := make(map[string]string)

	for _, tok := range doc.Tokens() {
		switch tok.Tag {
		case "VB", "VBD", "VBG", "VBN", "VBP", "VBZ":
			verb := strings.ToLower(tok.Text)

			verbCounts[verb]++

			switch tok.Tag {
			case "VB":
				verbTypes[verb] = "base verb"
			case "VBD":
				verbTypes[verb] = "past-tense verb"
			case "VBG":
				verbTypes[verb] = "present participle"
			case "VBN":
				verbTypes[verb] = "past participle"
			case "VBP":
				verbTypes[verb] = "non-3rd-person singular present-tense verb"
			case "VBZ":
				verbTypes[verb] = "3rd-person singular present-tense verb"
			}
		}
	}

	sortedVerbs := make([]string, len(verbCounts))

	var i int
	for w := range verbCounts {
		sortedVerbs[i] = w
		i++
	}

	sort.Slice(sortedVerbs, func(i, j int) bool {
		nI, nJ := sortedVerbs[i], sortedVerbs[j]
		cI, cJ := verbCounts[nI], verbCounts[nJ]

		if cI == cJ {
			return nI < nJ
		}

		return cI > cJ
	})

	for _, w := range sortedVerbs {
		fmt.Printf("%s :: %d :: %s\n", w, verbCounts[w], verbTypes[w])
	}
}

You can see above that I have also improved the error-handling, so that the helper function will return early if the markers cannot be found in the text (perhaps the online edition of the text will get modified at some point, causing our program to stop working).

The poem that we are using in this section is also about the natural world, and it talks about bees and other insects passing by a beautiful flower that lives an elegant and meaningful life but will ultimately be killed by the winter frost. You can read it in full below:

There is a flower that bees prefer,
And butterflies desire;
To gain the purple democrat
The humming-birds aspire.

And whatsoever insect pass,
A honey bears away
Proportioned to his several dearth
And her capacity.

Her face is rounder than the moon,
And ruddier than the gown
Of orchis in the pasture,
Or rhododendron worn.

She doth not wait for June;
Before the world is green
Her sturdy little countenance
Against the wind is seen,

Contending with the grass,
Near kinsman to herself,
For privilege of sod and sun,
Sweet litigants for life.

And when the hills are full,
And newer fashions blow,
Doth not retract a single spice
For pang of jealousy.

Her public is the noon,
Her providence the sun,
Her progress by the bee proclaimed
In sovereign, swerveless tune.

The bravest of the host,
Surrendering the last,
Nor even of defeat aware
When cancelled by the frost.

If you run the code, you will see that it now not only counts the verbs, but also identifies them by subcategory:

is :: 5 :: 3rd-person singular present-tense verb
are :: 1 :: non-3rd-person singular present-tense verb
bears :: 1 :: 3rd-person singular present-tense verb
bees :: 1 :: 3rd-person singular present-tense verb
cancelled :: 1 :: past participle
contending :: 1 :: present participle
democrat :: 1 :: 3rd-person singular present-tense verb
desire :: 1 :: non-3rd-person singular present-tense verb
doth :: 1 :: 3rd-person singular present-tense verb
gain :: 1 :: base verb
herself :: 1 :: base verb
prefer :: 1 :: non-3rd-person singular present-tense verb
proclaimed :: 1 :: past participle
proportioned :: 1 :: past participle
retract :: 1 :: base verb
rhododendron :: 1 :: 3rd-person singular present-tense verb
seen :: 1 :: past participle
surrendering :: 1 :: present participle
wait :: 1 :: past participle
worn :: 1 :: past participle

As with the nouns that we looked at earlier, there are clearly some misidentified words there (for example, “bees” is not a verb, since it doesn’t describe an action but names the animals that the poem discusses). However, so long as you have some method of accounting for these errors, or you don’t need complete accuracy, the tagging functionality provided by prose can be extremely useful.
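As an alternative to the nested switch in the program above, the tag descriptions could live in a lookup table. This is only a sketch: verbTagDescriptions and describeVerbTag are names I've made up, and the labels follow the Penn Treebank definitions of the six verb tags, in which "VBZ" is the 3rd-person singular present and "VBP" covers the other present-tense forms.

```go
package main

import "fmt"

// verbTagDescriptions maps the six Penn Treebank verb tags to short
// human-readable labels. The wording of the labels is illustrative.
var verbTagDescriptions = map[string]string{
	"VB":  "base verb",
	"VBD": "past-tense verb",
	"VBG": "present participle",
	"VBN": "past participle",
	"VBP": "non-3rd-person singular present-tense verb",
	"VBZ": "3rd-person singular present-tense verb",
}

// describeVerbTag returns the label for tag; the boolean is false when tag
// is not one of the verb tags.
func describeVerbTag(tag string) (string, bool) {
	description, ok := verbTagDescriptions[tag]
	return description, ok
}

func main() {
	for _, tag := range []string{"VBZ", "VBP", "NN"} {
		if description, ok := describeVerbTag(tag); ok {
			fmt.Printf("%s :: %s\n", tag, description)
		} else {
			fmt.Printf("%s :: not a verb tag\n", tag)
		}
	}
}
```

A single lookup of this kind would replace both the outer case list and the inner switch, since the boolean tells us whether the token is a verb at all.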

Splitting a Text Down Into Sentences

It may seem easy to split a text down into its constituent sentences, since we probably just need to look for the punctuation marks that tend to signify the end of a sentence and split at that point. However, that either requires iterating through each rune and checking them against a list of runes that tend to come at the end of a sentence or writing a complex regular expression that finds the breakpoints.
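To make the difficulty concrete, here is a deliberately naive sketch of the rune-by-rune approach. The function naiveSentences is my own illustration and is not how prose works internally; it simply ends a sentence at every full stop, exclamation mark or question mark.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSentences splits text after every '.', '!' or '?'. It knows nothing
// about abbreviations, quotations or ellipses.
func naiveSentences(text string) []string {
	var sentences []string
	var current strings.Builder

	for _, r := range text {
		current.WriteRune(r)

		if r == '.' || r == '!' || r == '?' {
			if s := strings.TrimSpace(current.String()); s != "" {
				sentences = append(sentences, s)
			}
			current.Reset()
		}
	}

	// Keep any trailing text that has no closing punctuation.
	if s := strings.TrimSpace(current.String()); s != "" {
		sentences = append(sentences, s)
	}

	return sentences
}

func main() {
	for _, text := range []string{
		"I shall but drink the more! Till seraphs swing their snowy hats.",
		"Mr. Smith left.",
	} {
		fmt.Printf("%q\n", naiveSentences(text))
	}
}
```

The first input is split correctly into two sentences, but the second is also split in two, because the full stop in the abbreviation "Mr." looks just like the end of a sentence.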

Luckily, prose handles this for us, making it extremely simple to extract sentences from a document, as you can see below:

package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"

	prose "github.com/jdkato/prose/v2"
)

func downloadPoem(startMarkerContent, endMarkerContent string) (poemContent string, err error)

func main() {
	const poemStartMarkerContent = "\n   PURPLE CLOVER."
	const poemEndMarkerContent = "\n        XV."

	poemContent, err := downloadPoem(poemStartMarkerContent, poemEndMarkerContent)
	if err != nil {
		panic(err)
	}

	doc, err := prose.NewDocument(poemContent)
	if err != nil {
		panic(err)
	}

	for i, sentence := range doc.Sentences() {
		fmt.Printf("[%d] %s\n", i, sentence.Text)
	}
}

If you run that code, you will see each of the sentences preceded by its index, starting from zero.

It is now possible to use this functionality to calculate, for example, the average length of each sentence. This tells us something interesting about the writer’s style. We can expect that a poet like Emily Dickinson, who’s used to compressing her ideas into short phrases, would probably write shorter sentences than a philosopher writing a complex tract or even a technical writer producing the exhaustive documentation for a programming language.

Let’s add the following code to the end of our main function, so we can see how many characters her average sentence actually contains:

sentences := doc.Sentences()

var totalSentenceLen float64

for _, sentence := range sentences {
	totalSentenceLen += float64(len([]rune(sentence.Text)))
}

averageSentenceLen := totalSentenceLen / float64(len(sentences))

fmt.Printf(
	"\nThe average sentence in this poem contains %.1f characters.\n",
	averageSentenceLen,
)

Notice how we convert each string of text to a slice of runes, before taking its length. This is necessary because the len function on a string returns the number of bytes, whereas we want to know how many runes — in other words, characters — are in each sentence. This isn’t strictly necessary here, since all of the characters in our poem can be encoded using ASCII, so each of our runes will only take up one byte, but it’s still good practice, because if we’d been working on a poem written in, say, classical Chinese, each character would have used more than one byte.

Running the code above prints out the following result:

The average sentence in this poem contains 126.4 characters.

That may be longer than I’d initially suspected, but it’s still a relatively small number, reflecting Emily Dickinson’s economy with language. The one sentence that runs for eight lines, spanning the fourth and fifth stanzas, is clearly pushing the overall average up. In fact, when I modified the code to exclude that sentence, the average came down to 109.8 characters.
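Characters are only one way of measuring a sentence, so it can be useful to count words as well. The sketch below is self-contained and takes plain strings, so it doesn't depend on prose; in the real program you could pass in the Text field of each sentence instead. The name averageLengths is hypothetical, and I've used utf8.RuneCountInString, which counts runes without first building a slice.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// averageLengths returns the mean number of characters (runes) and the mean
// number of whitespace-separated words per sentence.
func averageLengths(sentences []string) (avgRunes, avgWords float64) {
	if len(sentences) == 0 {
		return 0, 0
	}

	var totalRunes, totalWords int
	for _, s := range sentences {
		// RuneCountInString counts characters, not bytes.
		totalRunes += utf8.RuneCountInString(s)
		totalWords += len(strings.Fields(s))
	}

	n := float64(len(sentences))
	return float64(totalRunes) / n, float64(totalWords) / n
}

func main() {
	sentences := []string{
		"There is a flower that bees prefer, And butterflies desire; To gain the purple democrat The humming-birds aspire.",
		"na\u00efve caf\u00e9", // two accented letters, each two bytes in UTF-8
	}

	avgRunes, avgWords := averageLengths(sentences)
	fmt.Printf("%.1f characters and %.1f words per sentence\n", avgRunes, avgWords)

	// The byte length and the rune count differ for the second string.
	fmt.Println(len(sentences[1]), utf8.RuneCountInString(sentences[1]))
}
```

The final line prints 12 and 10, which shows why the byte length is the wrong measure as soon as a text contains characters outside ASCII.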

Quantitative Analysis of the Writing Style and Readability

Finally, we’re going to try and discover how difficult our text is to read by using some algorithms that attempt to quantify the readability of prose.

For this we need to use a package from the first version of prose: I haven’t found out why, but this functionality no longer seems to be included in the second version that we’ve been using in the previous examples.

package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"

	"github.com/jdkato/prose/summarize"
)

func downloadPoem(startMarkerContent, endMarkerContent string) (poemContent string, err error)

func main() {
	const poemStartMarkerContent = "\n   PURPLE CLOVER."
	const poemEndMarkerContent = "\n        XV."

	poemContent, err := downloadPoem(poemStartMarkerContent, poemEndMarkerContent)
	if err != nil {
		panic(err)
	}

	doc := summarize.NewDocument(poemContent)
	assessment := doc.Assess()

	fmt.Print("\nSCORES")
	fmt.Print("\n======\n\n")

	fmt.Printf("Automated readability: %24f\n", assessment.AutomatedReadability)
	fmt.Printf("ColemanLiau: %34f\n", assessment.ColemanLiau)
	fmt.Printf("Dale-Chall: %35f\n", assessment.DaleChall)
	fmt.Printf("Flesch-Kincaid: %33f\n", assessment.FleschKincaid)
	fmt.Printf("Flesch reading-ease: %24f\n", doc.FleschReadingEase())
	fmt.Printf("Grade level (mean): %27f\n", assessment.MeanGradeLevel)
	fmt.Printf("Grade level (standard deviation): %13f\n", assessment.StdDevGradeLevel)
	fmt.Printf("Gunning-Fog: %34f\n", assessment.GunningFog)
	fmt.Printf("LIX: %42f\n", assessment.LIX)
	fmt.Printf("SMOG: %41f\n", assessment.SMOG)

	fmt.Println()
}

Each of those scores uses a slightly different scale, but they’re all measuring the same thing: how easy or hard it is to read the given text (which, in our case, is the Emily Dickinson poem about the purple-flowered plant).

For example, the “grade level” score estimates the level of schooling that a person may be expected to have achieved in the American educational system in order to be able to read the text comfortably.

The Flesch reading-ease score, devised by Rudolf Flesch, is one of the best-known metrics of readability: it is usually reported on a scale from 0 to 100, and, unlike the grade-level scores, a higher score means that the text is easier to read and a lower score means that it’s harder. For example, a reading-ease score of between 90 and 100 means that the text should be readable by a typical 11-year-old child, while a score of between 0 and 10 means that the text can generally only be read by a professional adult with a high level of university education. The related Flesch-Kincaid grade level, named after Rudolf Flesch and J. Peter Kincaid, reports an American school grade instead, so a higher value means a harder text.

Each score is a float64 value, as you can see in the output below. The standard-deviation row there repeats the mean because that run printed assessment.MeanGradeLevel on both lines; printing assessment.StdDevGradeLevel shows the spread of the grade-level estimates instead.

SCORES
======

Automated readability:                10.98
ColemanLiau:                           9.91
Dale-Chall:                           11.06
Flesch-Kincaid:                        9.84
Flesch reading-ease:                  62.70
Grade level (mean):                   11.01
Grade level (standard deviation):     11.01
Gunning-Fog:                          12.30
LIX:                                  40.76
SMOG:                                 12.03

Our text was given a grade level of approximately 11, which means that it should be easily read by the typical American eleventh grader, who is ordinarily a teenager between 16 and 17. That seems like a fair score, since Emily Dickinson’s poetry has traditionally been studied at that age as preparation for the SAT literature exams in order to gain admission to college.
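If you're curious about what lies behind two of those numbers, the published Flesch reading-ease and Flesch-Kincaid grade-level formulas are short enough to write out directly. The sketch below is not the summarize package's implementation: the function names are my own, the word, sentence and syllable counts are hypothetical inputs rather than measurements of the poem, and the library has its own way of counting syllables that I haven't reproduced.

```go
package main

import "fmt"

// fleschReadingEase applies the published formula
// 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
// A higher result means an easier text.
func fleschReadingEase(words, sentences, syllables int) float64 {
	w, s, y := float64(words), float64(sentences), float64(syllables)
	return 206.835 - 1.015*(w/s) - 84.6*(y/w)
}

// fleschKincaidGrade applies the published formula
// 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59.
// The result approximates an American school grade.
func fleschKincaidGrade(words, sentences, syllables int) float64 {
	w, s, y := float64(words), float64(sentences), float64(syllables)
	return 0.39*(w/s) + 11.8*(y/w) - 15.59
}

func main() {
	// Hypothetical counts, chosen only to exercise the formulas.
	const words, sentences, syllables = 100, 5, 150

	fmt.Printf("reading ease: %.2f\n", fleschReadingEase(words, sentences, syllables))
	fmt.Printf("grade level:  %.2f\n", fleschKincaidGrade(words, sentences, syllables))
}
```

Both formulas use the same two ratios, words per sentence and syllables per word, which is why longer sentences and longer words push the reading-ease score down and the grade level up.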
