N-gram Language Model Based Next Word Suggestion — Part 1

Keagan Ladds
5 min read · Sep 19, 2022

Have you ever wondered how, when typing a message on your phone, it’s able to suggest the next word? Or even correct your spelling on the fly, turning your angry WhatsApp message to your boss into a laughable matter because you end up telling them to go and duck themselves? One of the many ways to achieve next-word suggestion is with an n-gram language model. This article briefly overviews what an n-gram language model is and how it’s used to perform next-word suggestion.

I will say from the get-go that this article outlines an oversimplified approach. I hope to provide a basic understanding of n-gram language models and in doing so get you interested enough to want to explore more on your own.

What Are N-Grams Anyway?

To understand how an n-gram language model can be used to suggest the next word, you first need to understand what n-grams are. Fortunately, they are a fairly simple construct. Simply put, n-grams are contiguous n-length sequences of characters or words from a given sample text or corpus. An important property of n-grams is that the order in which the items appear in each sequence is the same order in which they appear in the sample text. For an illustrative example, take the text given below:

“Model Based Next Word Suggestion”

1-grams or unigrams

[('model',), ('based',), ('next',), ('word',), ('suggestion',)]

2-grams or bigrams

[('model', 'based'), ('based', 'next'), ('next', 'word'), ('word', 'suggestion')]

3-grams or trigrams

[('model', 'based', 'next'), ('based', 'next', 'word'), ('next', 'word', 'suggestion')]

NOTE: The order in which the words appear in each tuple is important. However, the order in which each tuple appears in the collection of tuples is not.
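To make this concrete, the tuples above can be generated with a few lines of Python. This is a minimal sketch (the ngrams helper below is my own; libraries such as NLTK provide an equivalent function):

def ngrams(tokens, n):
    """Return all contiguous n-length tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Model Based Next Word Suggestion".lower().split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams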

It can be seen that n-grams are nothing more than the collection of all the possible contiguous n-length sequences from a corpus. Another important thing to notice from the examples is that, given the first n-1 words of an n-gram, it is possible to determine the next word. Though, for obvious reasons, this is not possible with a unigram. In the case of the bigram, given the word “based”, it is easy to determine that the word that should follow is “next”. The same follows in the case of a trigram: given the words “next” and “word”, it is possible to determine that the word that should follow is “suggestion”. Armed with this knowledge, it is now possible to forge ahead in developing a simple n-gram language model.

From Corpus to Language Model

In the case of a small corpus, as presented above, there is only one possible next word for the given previous word(s). What happens when you have a larger corpus, which results in duplicate n-grams and even multiple possible next words for the same previous word(s)? To answer that question, consider the example texts and bigrams below:

“The Tortoise never stopped for a moment, right to the end of the course.”

[('the', 'tortoise'), ('tortoise', 'never'), ('never', 'stopped'), ('stopped', 'for'), ('for', 'a'), ('a', 'moment'), ('moment', 'right'), ('right', 'to'), ('to', 'the'), ('the', 'end'), ('end', 'of'), ('of', 'the'), ('the', 'course')]

As with the first example corpus, most of the words have only one obvious next word. This means that the likelihood or probability of one of these words following another is 1. Now look at the word “the”. The words “tortoise”, “end” and “course” are each equally likely to follow it: out of the 3 occurrences of the word “the”, it is followed by “tortoise”, “end” and “course” once each. That is, the probability of each of these words appearing after the word “the” is 1/3, or roughly 0.33.

When suggesting the next word, most of the words have a single next-word suggestion. The word “the”, however, has 3 possible next-word suggestions.
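As a rough sketch of how these counts could be tallied in Python (the naive tokenisation below, lowercasing and stripping punctuation, is my own assumption):

from collections import Counter, defaultdict

text = "The Tortoise never stopped for a moment, right to the end of the course."
# naive tokenisation: split on whitespace, strip punctuation, lowercase
tokens = [word.strip('.,').lower() for word in text.split()]

# count how often each word follows each preceding word
followers = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    followers[prev][nxt] += 1

print(followers['the'])
# Counter({'tortoise': 1, 'end': 1, 'course': 1}) -> each has probability 1/3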

“The Tortoise never stopped for a moment, right to the end of the course. The Hare ran fast and reached the end.”

[('the', 'tortoise'), ('tortoise', 'never'), ('never', 'stopped'), ('stopped', 'for'), ('for', 'a'), ('a', 'moment'), ('moment', 'right'), ('right', 'to'), ('to', 'the'), ('the', 'end'), ('end', 'of'), ('of', 'the'), ('the', 'course'), ('course', 'the'), ('the', 'hare'), ('hare', 'ran'), ('ran', 'fast'), ('fast', 'and'), ('and', 'reached'), ('reached', 'the'), ('the', 'end')]

This example corpus is very similar, but now the phrase “the end” appears more than once. Out of the 5 occurrences of the word “the”, it is followed by the word “end” twice. This makes the probability of “end” following “the” 2/5, or 0.4, and makes it the most likely word to appear after the word “the”.

The word “the” now has 4 possible next-word suggestions. Notice how the word “end” is suggested first, indicating that it’s the most likely.
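Extending the same sketch to the second corpus, the counts can be turned into probabilities and the candidates sorted so that the most likely word comes first (again a rough illustration, assuming the same naive tokenisation):

from collections import Counter, defaultdict

text = ("The Tortoise never stopped for a moment, right to the end of the course. "
        "The Hare ran fast and reached the end.")
tokens = [word.strip('.,').lower() for word in text.split()]

followers = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    followers[prev][nxt] += 1

# turn the raw counts for 'the' into probabilities, most likely first
counts = followers['the']
total = sum(counts.values())
suggestions = sorted(((word, count / total) for word, count in counts.items()),
                     key=lambda pair: pair[1], reverse=True)
print(suggestions)
# [('end', 0.4), ('tortoise', 0.2), ('course', 0.2), ('hare', 0.2)]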

The basis of an n-gram language model is this extracted knowledge of the probability that a word follows one or more previous words. Using these probabilities, it is possible to statistically estimate the most likely next word given the previous word(s). Perhaps the simplest method for suggesting the next word is to build a bigram-based language model and then return the list of words whose probability of occurring after the given word is greater than 0, ordered from most to least likely. For a small corpus, this method will yield good results. However, as the corpus and the number of unique bigrams grow larger, the list of suggestions will become too long to be useful as a means of providing a suggestion. In this case, more context is needed and higher-order n-grams (trigrams, four-grams or even five-grams) should be used.
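The same idea generalises to higher-order n-grams by keying the counts on the previous n-1 words instead of a single word. A sketch of what that could look like (the function names below are my own, not taken from the repo linked further down):

from collections import Counter, defaultdict

def build_model(tokens, n=2):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for gram in zip(*(tokens[i:] for i in range(n))):
        model[gram[:-1]][gram[-1]] += 1
    return model

def suggest(model, context):
    """Return candidate next words for a context, most probable first."""
    counts = model[tuple(context)]
    total = sum(counts.values()) or 1
    return sorted(((word, count / total) for word, count in counts.items()),
                  key=lambda pair: pair[1], reverse=True)

corpus = ("The Tortoise never stopped for a moment, right to the end of the course. "
          "The Hare ran fast and reached the end.")
tokens = [word.strip('.,').lower() for word in corpus.split()]

trigram_model = build_model(tokens, n=3)
print(suggest(trigram_model, ['of', 'the']))   # [('course', 1.0)]
print(suggest(trigram_model, ['to', 'the']))   # [('end', 1.0)]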

Conclusion

N-gram language models are built on the extracted statistical likelihoods or probabilities that a given word follows one or more previous words. These probabilities can be extracted by transforming an input corpus into n-length contiguous sequences of words called n-grams. This article serves as a very simple introduction to n-gram language models and ignores some of the more complex topics such as smoothing and model evaluation.

Gimme! Gimme! Gimme!

If this article has interested you enough to want to see some code or a demo, feel free to check out the GitHub repo and demo below.

GitHub

Demo

A next-word suggestion demo of the n-gram language model, trained on Emma by Jane Austen, is linked below. Please note that it’s hosted on the Heroku free tier, so it will take a while to spin up if it hasn’t been served in a while.

Like, Share and Subscribe

If you enjoyed this article, I would really appreciate it if you shared it. If you would like to read more of my content in the future, please follow me.

