Skip to Content
Getting Started with Natural Language Processing
Language Models: Bag-of-Words

How can we help a machine make sense of a bunch of word tokens? We can help computers make predictions about language by training a language model on a corpus (a bunch of example text).

Language models are probabilistic computer models of language. We build and use these models to figure out the likelihood that a given sound, letter, word, or phrase will be used. Once a model has been trained, it can be tested out on new texts.

One of the most common language models is the unigram model, a statistical language model commonly known as bag-of-words. As its name suggests, bag-of-words does not have much order to its chaos! What it does have is a tally count of each instance for each word. Consider the following text example:

The squids jumped out of the suitcases.

Provided some initial preprocessing, bag-of-words would result in a mapping like:

{"the": 2, "squid": 1, "jump": 1, "out": 1, "of": 1, "suitcase": 1}

Now look at this sentence and mapping: “Why are your suitcases full of jumping squids?”

{"why": 1, "be": 1, "your": 1, "suitcase": 1, "full": 1, "of": 1, "jump": 1, "squid": 1}

You can see how even with different word order and sentence structures, “jump,” “squid,” and “suitcase” are shared topics between the two examples. Bag-of-words can be an excellent way of looking at language when you want to make predictions concerning topic or sentiment of a text. When grammar and word order are irrelevant, this is probably a good model to use.



We’ve turned a passage from Through the Looking Glass by Lewis Carroll into a list of words (aside from stopwords, which we’ve removed) using nltk preprocessing. Run your code to see the full list.


Now let’s turn this list into a bag-of-words using Counter()!

Comment out the print statement and set bag_of_looking_glass_words equal to a call of Counter() on normalized. Print bag_of_looking_glass_words. What are the most common words?


Try changing text to another string of your choosing and see what happens!

Folder Icon

Sign up to start coding

Already have an account?