“Hope is the thing with feathers That perches in the soul And sings the tune without the words And never stops at all.”
So goes Emily Dickinson’s poem Hope is the thing with feathers. And just as Emily proclaims, your hope and perseverance have taken you to the end of this lesson!
Let’s recount all you have learned:
Let’s practice what you’ve learned about tf-idf using the work of another poet with an inclination for the dreary.
“Once upon a review exercise dreary…” begins Edgar Allan Poe’s The Raven (or so it almost does).
We can use Poe’s classic poem to demonstrate another way of defining documents in a tf-idf analysis. Rather than treating different poems as separate documents, we can treat each stanza of The Raven as its own document and try to gain insight into the meaning of the individual stanzas.
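To make the "one stanza, one document" idea concrete, here is a minimal sketch of splitting a poem into stanza-level documents. It assumes stanzas are separated by blank lines; the `poem` text and the `split_into_stanzas` helper are hypothetical stand-ins and not the contents of raven.py.

```python
# Hypothetical stand-in text: two short stanzas separated by a blank line.
poem = """once upon a midnight dreary
while i pondered weak and weary

ah distinctly i remember
it was in the bleak december"""


def split_into_stanzas(text):
    """Split a poem into stanzas, treating blank lines as stanza breaks."""
    stanzas = []
    for block in text.split("\n\n"):
        # Join the lines of a stanza into one string so it acts as one document.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if lines:
            stanzas.append(" ".join(lines))
    return stanzas


stanzas = split_into_stanzas(poem)
print(len(stanzas))
print(stanzas[0])
```

Each string in `stanzas` can then be treated exactly like a separate poem would be in a multi-poem corpus.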
In raven.py, The Raven is broken down into individual stanzas and stored in the_raven_stanzas, where you can view the first stanza.
In script.py, the stanzas of the_raven_stanzas are preprocessed. Let’s calculate the tf-idf scores for each term-document pair.
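As a reminder of what a tf-idf score for a term-document pair is, here is a rough hand-rolled sketch using the textbook form tf × ln(N / df). The `documents` list and the `tfidf_scores` function are hypothetical illustrations. scikit-learn’s TfidfVectorizer smooths the idf term and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical preprocessed "stanzas" (already lowercased, split on spaces).
documents = [
    "raven nevermore raven",
    "door chamber door",
    "raven door",
]


def tfidf_scores(docs):
    """Return one {term: score} dict per document using tf * ln(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


scores = tfidf_scores(documents)
for i, doc_scores in enumerate(scores):
    print(i, doc_scores)
```

A term that appears often in one stanza but in few others (here, "nevermore") ends up with a higher score than a term spread across many stanzas.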
Begin by creating a TfidfVectorizer object named vectorizer with keyword argument
Fit and transform your vectorizer on the corpus of preprocessed stanzas. Save the result to a variable named
Now you just need to get the vocabulary of unique terms used in The Raven.
Paste the line below into the “get vocabulary of terms” section of script.py to display the tf-idf matrix.
feature_names = vectorizer.get_feature_names()
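The steps above can be sketched end to end as follows. This is only an illustration under a few assumptions: `processed_stanzas` and `tfidf_matrix` are hypothetical names, the corpus is a toy stand-in for the preprocessed stanzas, and the vectorizer is created with default settings rather than the keyword argument the exercise asks for. Newer scikit-learn releases replace get_feature_names() with get_feature_names_out(), so the sketch uses whichever one is available.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the preprocessed stanzas from script.py.
processed_stanzas = [
    "raven nevermore raven",
    "door chamber door",
    "raven door",
]

# Defaults only; the exercise's keyword argument is not reproduced here.
vectorizer = TfidfVectorizer()

# Sparse matrix of scores: one row per stanza, one column per term.
tfidf_matrix = vectorizer.fit_transform(processed_stanzas)

# Newer scikit-learn releases provide get_feature_names_out();
# older ones only have get_feature_names(). Use whichever exists.
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = list(vectorizer.get_feature_names())

print(feature_names)
print(tfidf_matrix.toarray().round(2))
```

The order of `feature_names` matches the column order of the matrix, which is what lets you label each score with its term.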
Which stanzas share similarly high/low tf-idf scores for the same terms?
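One possible way to explore that question is to list, for each term, the stanzas whose score clears some cutoff. The sketch below assumes a small hypothetical score table and an arbitrary threshold of 0.5; neither the numbers nor the `stanzas_sharing_high_terms` helper come from the exercise.

```python
# Hypothetical tf-idf scores: one row per stanza, one column per term.
feature_names = ["chamber", "door", "nevermore", "raven"]
scores = [
    [0.0, 0.2, 0.9, 0.7],
    [0.8, 0.9, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.6],
]


def stanzas_sharing_high_terms(matrix, terms, threshold=0.5):
    """Map each term to the stanza indices whose score meets the threshold,
    keeping only terms shared by at least two stanzas."""
    shared = {}
    for col, term in enumerate(terms):
        hits = [row for row, values in enumerate(matrix) if values[col] >= threshold]
        if len(hits) >= 2:
            shared[term] = hits
    return shared


shared_terms = stanzas_sharing_high_terms(scores, feature_names)
print(shared_terms)
```

Flipping the comparison to look for scores below a cutoff would surface the terms that several stanzas barely use.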