For a presentation on the statistical software package Mplus I needed something relevant, simple, colorful and not too boring to put on my opening slide. Meeting all those criteria: a wordcloud of the Mplus user manual.
The web service Wordle gives aesthetically pleasing wordclouds with minimal effort. You can paste text directly into the web interface or enter a word-count summary. As the manual is too large to paste into the web interface (and I wanted to toy with language processing), I wrote a small snippet to make the word-count summary using the Python Natural Language Toolkit (NLTK).
Installing the NLTK is straightforward, assuming you have something like pip or easy_install running.
pip install pyyaml # first, otherwise the next step gives an error
pip install nltk
Using the pdftotext utility, included with most GNU/Linux systems, we can convert the manual from PDF to plain text. At your prompt type:
curl http://www.statmodel.com/download/usersguide/Mplus%20Users%20Guide%20v6.pdf > UG6.pdf
pdftotext UG6.pdf
head UG6.txt
The script wordcount.py will read the file named on the command line, tokenize it, remove non-words and return the word frequencies in the right format for Wordle.
#!/usr/bin/env python
# wordcount.py: parse a text file and print its word frequencies
import sys

import nltk

# read the whole file named on the command line
with open(sys.argv[1]) as f:
    txt = f.read()

# tokenize text (word_tokenize needs NLTK's Punkt tokenizer data)
tokens = nltk.word_tokenize(txt)

clean_tokens = []
for word in tokens:
    word = word.lower()
    if word.isalpha():  # drop all non-words
        clean_tokens.append(word)

# make frequency distribution of words
fd = nltk.FreqDist(clean_tokens)
for token in fd:
    print(token, ':', fd[token])
Yes, this can also be done using built-in Python functions. The NLTK, however, gives you things like stemming and collocations out of the box if you want to process the text further.
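For comparison, here is a rough sketch of what the standard-library route could look like, using only re and collections.Counter. The function name word_counts and the sample sentence are made up for illustration, and the regular expression is just one plausible stand-in for the tokenizer, so the counts may differ slightly from what NLTK produces (for example around apostrophes and hyphens).

```python
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased, purely alphabetic words in a string."""
    # [^\W\d_]+ matches runs of letters only: no digits, no underscores
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


# made-up sample text, standing in for the contents of UG6.txt
sample = "The model command describes the model to be estimated, see page 42."
for word, count in word_counts(sample).most_common():
    print(word, ":", count)
```

To use it on the manual, read UG6.txt into a string and pass that to word_counts instead of the sample.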
Write the output to a plain text file:
python wordcount.py UG6.txt > wordle_input.txt
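Very common words such as ‘a’ and ‘the’ will dominate these counts. One way to drop them before pasting is a small filter like the sketch below. It assumes the "word : count" line format printed by wordcount.py; the STOPWORDS set, the filter_counts name and the sample counts are all hypothetical and only there for illustration.

```python
# Hypothetical stop-word list; extend it to taste.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def filter_counts(lines, stopwords=STOPWORDS):
    """Yield 'word : count' lines whose word is not a stop word."""
    for line in lines:
        word, sep, _count = line.partition(":")
        # lines without a colon are skipped as malformed
        if sep and word.strip().lower() not in stopwords:
            yield line.rstrip("\n")


# made-up counts, standing in for the lines of wordle_input.txt
sample = ["the : 4512", "model : 1830", "a : 2011", "variable : 1422"]
for line in filter_counts(sample):
    print(line)
```

For the real file, open wordle_input.txt and pass the file object to filter_counts in place of the sample list.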
You can manually edit the resulting text file, for instance to remove words like ‘a’ and ‘the’. If everything is OK, paste it into the Wordle advanced interface. For the Mplus User Guide, this results in the following wordcloud: