From pdf to wordcloud using the Python NLTK

Posted on Sun 15 August 2010 in misc

For a presentation on the statistical software package Mplus, I needed something relevant, simple, colorful and not too boring to put on my opening slide. Meeting all those criteria: a wordcloud of the Mplus user manual.

The web service Wordle gives aesthetically pleasing wordclouds with minimal effort. You can paste text directly into the web interface or enter a word count summary. As the manual is too large to paste into the web interface (and I wanted to toy with language processing), I wrote a small snippet to make the word count summary using the Python Natural Language Toolkit.

Installing the NLTK is straightforward, assuming you have something like pip or easy_install available.

pip install pyyaml # first, otherwise the next step gives an error
pip install nltk

Using the pdftotext utility, available on most GNU/Linux systems, we can convert the manual from PDF to plain text. At your prompt, type:

curl http://www.statmodel.com/download/usersguide/Mplus%20Users%20Guide%20v6.pdf > UG6.pdf
pdftotext UG6.pdf
head UG6.txt

The script wordcount.py reads the file named on the command line, tokenizes it, removes non-words and prints the word frequencies in the right format for Wordle.

#!/usr/bin/env python
# wordcount.py: parse a text file & print its word frequencies
import sys

import nltk

# Read the whole file named on the command line
with open(sys.argv[1]) as f:
    txt = f.read()

# Tokenize the text (requires NLTK's Punkt tokenizer data)
tokens = nltk.word_tokenize(txt)

# Lowercase and drop all non-words (numbers, punctuation)
clean_tokens = [word.lower() for word in tokens if word.isalpha()]

# Make frequency distribution of words, most frequent first
fd = nltk.FreqDist(clean_tokens)
for token, count in fd.most_common():
    print(token, ':', count)

Yes, this can be done using built-in Python functions. The NLTK, however, gives you things like stemming and collocations out of the box if you want to process the text further.
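For comparison, here is a rough sketch of the same count using only the standard library (Python 3). The function name word_frequencies and the sample sentence are made up for illustration, and the regular expression is a much cruder tokenizer than the NLTK one: it only recognises runs of ASCII letters, so a word like "don't" ends up split in two.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic words using only the standard library."""
    # Crude tokenizer: every run of ASCII letters is a word,
    # anything else (digits, punctuation, whitespace) is a separator.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


if __name__ == "__main__":
    # Made-up sample text, not taken from the Mplus manual
    sample = "The model fits the data. The data are simulated."
    for word, count in word_frequencies(sample).most_common():
        print(word, ":", count)
```

For a one-off wordcloud this is probably good enough; the NLTK version mainly pays off once you want more than raw counts.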

Write the output to a plain text file.

python wordcount.py UG6.txt > wordle_input.txt
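Function words such as 'a' and 'the' tend to dominate such a list. As a sketch (Python 3), they could be filtered out of the word : count lines with a small script instead of by hand. The stop list and the name drop_stop_words are assumptions for illustration, not part of wordcount.py, and the counts in the sample are invented.

```python
# Hypothetical stop list for illustration; extend it to taste.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def drop_stop_words(lines, stop_words=STOP_WORDS):
    """Keep only 'word : count' lines whose word is not in stop_words."""
    kept = []
    for line in lines:
        # The word is everything before the first colon
        word = line.split(":", 1)[0].strip()
        if word and word not in stop_words:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Invented counts, only to show the line format
    sample = ["the : 5", "model : 3", "a : 4", "variable : 2"]
    print("\n".join(drop_stop_words(sample)))
```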

You can manually edit the resulting text file, for instance to remove words like 'a' and 'the'. If everything is OK, paste it into the Wordle advanced interface. For the Mplus User Guide, this results in the following wordcloud: