DIY Book Index Generation using Python
by Carson Reynolds
Recently I published Devices Alter Perception 2010 in paperback format. I learned a bit about editing books and technical material when I studied Technical Communication as an undergraduate (although that department has since morphed into Human-Centered Design and Engineering).
Working with Gunnar Green of TheGreenEyl, we took the PDFs supplied by workshop participants and reworked the text and layout using InDesign 5. While InDesign can “generate” an index, it actually needs a set of keywords. InDesign really just renders the index and isn’t smart enough to figure out what words should be included.
At this point I did a bit of research into what is available for index generation. Over on MetaFilter there is an informative thread about the variety of options. You might want to hire a professional indexer or pay for some software. But if you are like me, you might start to view index generation as a very simple pattern recognition problem.
What follows are some building blocks for index generation written in Python and using nltk. I used them to generate an index for the multi-author workshop proceedings book mentioned above. The instructions assume you have Python and nltk installed on your computer.
Get your text into plaintext format with a decent encoding. I’d recommend UTF-8 as it has good Unicode support. For convenience I’ve named this plaintext file “book.txt”.
Start simple by just making a histogram of the words appearing in the text. I wanted to just make a CSV file that I could open in Excel, Numbers, Mathematica or R. After looking over the most frequently occurring words, I was able to manually screen out unwanted parts of speech and punctuation and find the words which were meaningful to my book.
The following code opens up a file book.txt and prints out a comma-separated list consisting of the term frequency and the term itself:
    # Python 3
    from collections import defaultdict
    import operator

    from nltk import word_tokenize

    # word_tokenize needs NLTK's Punkt tokenizer data, installed once via nltk.download()
    with open('book.txt', encoding='utf-8') as f:
        text = f.read()
    words = word_tokenize(text)

    # improvement due to patch by Paul Masurel
    frequencies = defaultdict(int)
    for w in words:
        frequencies[w] += 1

    # sort by frequency (ascending, so the most frequent terms come last) before printing
    for word, freq in sorted(frequencies.items(), key=operator.itemgetter(1)):
        print(str(freq) + "," + word)
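If you would rather automate part of that manual screening, here is a minimal sketch of one way to do it. It is not what I used for the book: the tiny `STOPWORDS` set, the `candidate_terms` helper and its `min_count` parameter are all hypothetical names made up for illustration, and the token list stands in for the output of `word_tokenize`.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list; extend it for a real book.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def candidate_terms(tokens, min_count=2):
    """Count tokens, skipping stopwords, pure punctuation and rare words."""
    counts = Counter(
        t.lower() for t in tokens
        if t.lower() not in STOPWORDS
        and not all(ch in string.punctuation for ch in t)
    )
    # most_common() yields the most frequent terms first
    return [(term, n) for term, n in counts.most_common() if n >= min_count]

# Stand-in for the token list produced by word_tokenize
tokens = ["The", "device", ",", "the", "perception", "of", "a",
          "device", ".", "Perception", "!"]
for term, n in candidate_terms(tokens):
    print(str(n) + "," + term)
```

Note that lowercasing merges "Perception" with "perception", which is usually what you want for an index but also flattens proper nouns, so the result still deserves a look by hand.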
Looking over a typical book, one quickly realizes that a hefty part of a good index consists of multi-word phrases. And so I also built some programs to identify the most frequently occurring pairs of words (also known as bigrams):
    # Python 3
    from nltk import word_tokenize
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    with open('book.txt', encoding='utf-8') as f:
        text = f.read()
    words = word_tokenize(text)

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    # ignore pairs that occur fewer than 3 times
    finder.apply_freq_filter(3)
    # print the 50 pairs with the highest pointwise mutual information (PMI)
    print(finder.nbest(bigram_measures.pmi, 50))
A program to find trigrams is quite similar:
    # Python 3
    from nltk import word_tokenize
    from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

    with open('book.txt', encoding='utf-8') as f:
        text = f.read()
    words = word_tokenize(text)

    trigram_measures = TrigramAssocMeasures()
    finder = TrigramCollocationFinder.from_words(words)
    # ignore triples that occur fewer than 2 times
    finder.apply_freq_filter(2)
    # print the 50 triples with the highest pointwise mutual information (PMI)
    print(finder.nbest(trigram_measures.pmi, 50))
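Both finders rank their candidates by pointwise mutual information (the `pmi` measure), which rewards words that appear together more often than their individual frequencies would predict. The sketch below computes that score by hand for bigrams, using only the standard library. `bigram_pmi` and `min_freq` are hypothetical names for illustration, and while the formula matches NLTK's bigram PMI as far as I understand it, treat it as an approximation of the idea and not as the library's implementation.

```python
import math
from collections import Counter

def bigram_pmi(words, min_freq=1):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    word_counts = Counter(words)
    pair_counts = Counter(zip(words, words[1:]))
    n = len(words)
    scores = {}
    for (w1, w2), pair_n in pair_counts.items():
        if pair_n < min_freq:
            continue  # same role as apply_freq_filter: drop rare pairs
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from raw counts
        scores[(w1, w2)] = math.log2(
            pair_n * n / (word_counts[w1] * word_counts[w2])
        )
    return scores

words = "a b a b c d".split()
for pair, score in sorted(bigram_pmi(words).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy input the pair ("c", "d") occurs once yet outscores ("a", "b"), which occurs twice. That is why a frequency filter matters: without one, PMI tends to float one-off pairs to the top of the list.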
Combining the output of the word frequency and n-gram steps above, I created a list of 1-3 word keywords. Were I not in such a rush to get the book to press I might have paused to make the output format of the bigram and trigram code similar to the CSV file, but I found that I only used a small number of phrases from this analysis, which were easy enough to copy by hand. I then whittled these lists down in my spreadsheet program to a final index list which can be sorted alphabetically. If you’d rather not use a spreadsheet, the following script can alphabetize for you:
    #!/usr/bin/env python3
    import sys

    if len(sys.argv) < 3:
        print("usage: alphabetize.py [input filename] [output filename]")
        sys.exit(1)

    # sort lines case-insensitively and write them to the output file
    with open(sys.argv[1], "r", encoding="utf-8") as inputFile, \
         open(sys.argv[2], "w", encoding="utf-8") as outputFile:
        for line in sorted(inputFile, key=str.lower):
            outputFile.write(line)
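As for getting the bigram and trigram output into the same frequency,term shape as the word histogram, here is a rough sketch of how that might look. It ranks phrases by raw count, not by PMI, and uses only the standard library. `ngram_counts`, `write_csv` and `min_freq` are hypothetical names, and the word list stands in for the tokens from book.txt.

```python
import csv
import sys
from collections import Counter

def ngram_counts(words, n, min_freq=2):
    """Count runs of n adjacent words, keeping those seen at least min_freq times."""
    grams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_freq]

def write_csv(rows, out=sys.stdout):
    """Write frequency,term rows; csv.writer quotes any term containing a comma."""
    writer = csv.writer(out, lineterminator="\n")
    for phrase, count in rows:
        writer.writerow([count, phrase])

# Stand-in for the token list produced from book.txt
words = "devices alter perception and devices alter behaviour".split()
write_csv(ngram_counts(words, 2))  # bigrams; pass 3 for trigrams
```

The rows can then go into the same spreadsheet as the single-word list, or straight through the alphabetize script once the frequency column has been dropped.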
While these tools are a bit primitive, they do give you a bit more insight into how the index itself is built and thus allow you to make your own decisions about what will be indexed. Think of them as helper tools to narrow down the list of candidates you might want to include based on frequency.