For a presentation on the statistical software package Mplus I needed something relevant, simple, colorful and not too boring to put on my opening slide. Meeting all those criteria: a wordcloud of the Mplus user manual.
The web service Wordle gives aesthetically pleasing wordclouds with minimal effort. You can paste text directly into the web interface or enter a word count summary. As the manual is too large to paste into the web interface (and I wanted to toy with language processing), I wrote a small snippet to make the word count summary using the Python Natural Language Toolkit (NLTK).
Installing the NLTK is straightforward, assuming you have something like pip:
pip install pyyaml  # first, otherwise the next step gives an error
pip install nltk
Using the pdftotext utility, built into most GNU/Linux systems, we can convert the manual from PDF to plain text. At your prompt type:
curl http://www.statmodel.com/download/usersguide/Mplus%20Users%20Guide%20v6.pdf > UG6.pdf
pdftotext UG6.pdf
head UG6.txt
The following script, wordcount.py, reads the file named on the command line, tokenizes the text, removes non-words and prints the word frequencies in the right format for Wordle.
#! /usr/bin/env python
# wordcount.py: parse and return word frequency
import sys
import nltk

# read the text file named on the command line
with open(sys.argv[1]) as f:
    txt = f.read()

tokens = nltk.word_tokenize(txt)  # tokenize text

clean_tokens = []
for word in tokens:
    word = word.lower()
    if word.isalpha():  # drop all non-words
        clean_tokens.append(word)

# make frequency distribution of words
fd = nltk.FreqDist(clean_tokens)
for token in fd:
    print(token, ':', fd[token])
Yes, this can also be done using built-in Python functions. The NLTK, however, gives you things like stemming and collocations out of the box if you want to process the text further.
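For comparison, here is a minimal sketch of what a standard-library-only version might look like. The function names word_frequencies and wordle_lines are made up for this illustration, and the regular expression is a much cruder tokenizer than nltk.word_tokenize, so the counts will not match the NLTK script exactly.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic words using only the standard library."""
    # Crude tokenizer (an assumption of this sketch): every run of ASCII
    # letters counts as a word, so "model1" contributes "model" instead of
    # being dropped as it would be by the isalpha() check on NLTK tokens.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


def wordle_lines(freqs):
    """Format counts as 'word : count' lines, most frequent first."""
    return ["%s : %d" % (word, count) for word, count in freqs.most_common()]


if __name__ == "__main__":
    sample = "The model fits the data; the MODEL command is required."
    for line in wordle_lines(word_frequencies(sample)):
        print(line)
```

Counter.most_common() sorts by frequency, which makes it easy to eyeball the top of the list before pasting it into Wordle.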
Write the output to a plain text file:
python wordcount.py UG6.txt > wordle_input.txt
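Function words tend to dominate a list like this. One way to drop them automatically is sketched here, assuming the "word : count" line format produced by wordcount.py. The stop list STOP_WORDS and the helper drop_stop_words are hypothetical and only meant as a starting point.

```python
# Hypothetical stop list for this sketch; extend it to taste
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "is", "and", "for"}


def drop_stop_words(lines, stop_words=STOP_WORDS):
    """Keep only 'word : count' lines whose word is not in stop_words."""
    kept = []
    for line in lines:
        # the word is everything before the first colon
        word = line.split(":", 1)[0].strip()
        if word and word not in stop_words:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = ["the : 3", "model : 2", "a : 1", "data : 1"]
    for line in drop_stop_words(sample):
        print(line)
```

To apply this to the real file, the lines of wordle_input.txt could be read with the trailing newlines stripped and passed through drop_stop_words before pasting.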
You can manually edit the resulting text file, for instance to remove words like ‘a’ and ‘the’. If everything is OK, paste it into the Wordle advanced interface. For the Mplus User Guide, this results in the following wordcloud: