Elegant Python 3 reproduction of “most common words from a story”
I started off trying to reproduce the results of this Medium article by Tirthajyoti Sarkar for Towards Data Science; the idea and dataset for this post are picked up entirely from there. As I was writing the code, I realized the same results could be achieved more briefly and elegantly, using Python libraries most data scientists already rely on to write clean, self-documenting code. This post documents that shorter, easier-to-understand script.
I picked the same text file from Project Gutenberg as the original article. The stopwords file I used is from the Princeton website here.
This is implemented in Python 3.6 and uses features such as f-strings that do not work in Python 3.5 and earlier. Please feel free to replace f-strings with .format wherever required.
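For example, the print statement used later in this script (where to_print is defined further down) could be rewritten for older versions like this:

# Python 3.6+ f-string
print(f"The most common {to_print} words are:")
# equivalent .format call for Python 3.5
print("The most common {} words are:".format(to_print))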
import collections
import re
import matplotlib.pyplot as plt
# %matplotlib inline is a Jupyter magic; remove it if running as a plain script
%matplotlib inline

file = open('PrideAndPrejudice.txt', 'r')
file = file.read()

stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['a', 'i', 'mr', 'ms', 'mrs', 'one', 'two', 'said']))

wordcount = collections.defaultdict(int)

# The next block does all the counting and is the main point of
# difference from the original article. More on this is explained later.
# \W is the regex class for characters that are not alphanumerics;
# re.sub strips all non-alphanumerics from each word.
pattern = r"\W"
for word in file.lower().split():
    word = re.sub(pattern, '', word)
    if word and word not in stopwords:  # skip tokens that were pure punctuation
        wordcount[word] += 1

# printing the most common words
to_print = int(input("How many top words do you wish to print? "))
print(f"The most common {to_print} words are:")

# sort the defaultdict on the values in decreasing order
# and keep the first "to_print" items
mc = sorted(wordcount.items(), key=lambda k_v: k_v[1], reverse=True)[:to_print]
for word, count in mc:
    print(word, ":", count)

# draw the bar chart
mc = dict(mc)
names = list(mc.keys())
values = list(mc.values())
plt.bar(range(len(mc)), values, tick_label=names)
plt.savefig('bar.png')
plt.show()
I believe my stopwords file differs from the original post's, which is why words like ‘much’ and ‘must’ are not in my top 10. The few things I did differently in the code are:
- Using regexes instead of explicit replace statements. Regexes are easy to understand and extremely powerful for tasks that would otherwise take many more lines of code; they do it in one line. Using re.sub, I stripped all non-alphanumeric characters from each word, something that was done in the original article over several lines.
- Instead of using a dict and then Counter from collections, I used a defaultdict from collections. A defaultdict does not raise a KeyError when a key is not present; instead, it creates the item when you first access it. Declaring it as collections.defaultdict(int) assigns a default value of 0 (zero) to the new item. Therefore, the statement
wordcount[word] += 1
can be used without an if statement: every time wordcount encounters a new key, it adds that key to the dictionary with a value of 1, while keys already in wordcount are simply incremented. (A short demonstration follows after this list.)
- I also did not use a pandas DataFrame to create the bar plot. This is a personal preference: while it MIGHT make a difference in the case of very large files, the extra memory does not really matter for our example, and it avoids importing pandas just for this simple task. (A sketch of the pandas route follows below for comparison.)
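To make the defaultdict behaviour concrete, here is a minimal, self-contained demonstration (the key is made up for illustration):

import collections

plain = {}
counts = collections.defaultdict(int)  # int() supplies the default value 0

# plain['hello'] += 1   # a plain dict would raise KeyError: 'hello'
counts['hello'] += 1    # missing key is created as 0, then incremented to 1
counts['hello'] += 1
print(counts['hello'])  # 2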
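And for comparison, a rough sketch of the DataFrame route; this is my own illustration, not the original article's exact code, and it assumes mc is the dict of top words built in the script above:

import matplotlib.pyplot as plt
import pandas as pd

# build a two-column frame from the (word, count) pairs and plot it
df = pd.DataFrame(list(mc.items()), columns=['word', 'count'])
df.plot.bar(x='word', y='count', legend=False)
plt.show()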
Here’s an extra snippet of code to generate a word cloud for the most common words. I used this library.
from wordcloud import WordCloud

wc = WordCloud().generate_from_frequencies(wordcount)
plt.figure()
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
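If you want to save the cloud to disk like the bar chart, the wordcloud library also provides a to_file method, e.g. wc.to_file('wordcloud.png').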
Please feel free to comment if I missed something, or if you think this code can be made better.
Check out a blog post detailing comprehensive Exploratory Data Analysis using Python here:
https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools