What are Significant Phrases?
LingPipe provides a simple way to find statistically significant phrases in a document collection. There are two types of significance of interest.
Collocations
Collocations are phrases whose tokens are seen together more often than you would expect given how frequent each token is on its own. For example, in 600 posts from the rec.sport.hockey newsgroup, the collocations of length two found by LingPipe are:
Collocations in Order of Significance:
Score: 260391.00000000006 with : Evan Pritchard
Score: 260391.00000000006 with : Rachel Holme
Score: 260391.00000000006 with : Durham Wasps
Score: 260391.00000000006 with : Abo Akademi
Score: 260391.00000000006 with : Solar Terresterial
Score: 260391.00000000003 with : Los Angeles
Score: 260391.0 with : Petteri Kortelainen
Score: 260391.0 with : Anna Matyas
Score: 260391.0 with : Keyword Description
Score: 260391.0 with : Milton Keynes
Score: 260391.0 with : Marcus Lindroos
. . .
Score: 156230.99987556748 with : Roman Era
Score: 145516.46723953017 with : Tampa Bay
Score: 142432.13662060775 with : Simon Fraser
Score: 142432.13662060772 with : Capital District
Score: 142028.72720988328 with : Rex Wang
Score: 140207.30760545374 with : Lori Iannamico
Score: 138871.46655196382 with : Tie Breaker
These are two-token phrases ranked by how often the tokens are seen together as opposed to how often each is seen alone. For example, 'Los Angeles' has a higher score than 'Tie Breaker' because we see 'Los' 67 times, 'Angeles' 67 times and 'Los Angeles' 67 times. So 'Los' and 'Angeles' always occur as part of the larger phrase--a high correlation. On the other hand, 'Tie' occurs 15 times, 'Breaker' 8 times and 'Tie Breaker' 8 times, so 'Tie' only occurs with the larger phrase about half the time--less of a correlation.
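The scores above behave like a chi-squared test of independence over the bigram counts. As a rough illustration (this is a sketch, not the demo's code), here is that computation in plain Java using the counts quoted above; the total bigram count N is an assumption for illustration, chosen because the perfectly correlated pairs top out at about 260391, so the 'Tie Breaker' value comes out close to, but not exactly, the score reported above.
// Sketch of a chi-squared independence score for a bigram (x, y).
// Counts for 'Los Angeles' and 'Tie Breaker' come from the text above;
// the total bigram count n is an assumption for illustration only.
public class CollocationScoreSketch {

    static double chiSquared(double cXY, double cX, double cY, double n) {
        double o11 = cXY;               // x followed by y
        double o12 = cX - cXY;          // x followed by something else
        double o21 = cY - cXY;          // something other than x, followed by y
        double o22 = n - cX - cY + cXY; // neither x nor y
        double diff = o11 * o22 - o12 * o21;
        return n * diff * diff
            / ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22));
    }

    public static void main(String[] args) {
        double n = 260391.0;  // assumed total bigram count
        System.out.println("Los Angeles: " + chiSquared(67, 67, 67, n)); // ~260391
        System.out.println("Tie Breaker: " + chiSquared(8, 15, 8, n));   // ~139000
    }
}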
Relatively New Terms
This technique evaluates the significance of phrases in one collection versus another, finding phrases that occur significantly more often in the foreground corpus than the background corpus would lead us to expect. As with Google's "In the News" feature described below, this is a way to tell if there is more news than usual about a phrase--if 'George Bush' counts go way up, there are likely some big stories about him.
In our hockey data set, we see that 400 chronologically later articles mention the following capitalized phrases surprisingly more often than the background articles would predict:
New Terms in Order of Signficance:
Score: 1.942594719053515E28 with : The Logistician
Score: 3.64670037870023E27 with : Von Rospach
Score: 7.573888901798824E24 with : Clint Malarchuk
Score: 1.9306325340669786E24 with : Drozinski Tim
Score: 1.6830864226219717E24 with : Allegheny College
Score: 7.380822220736823E23 with : John Franjione
Score: 9.025704916227892E21 with : Robert Andolina
Score: 5.693776247868884E21 with : Paul Brownlow
Score: 1.4125238917251475E20 with : Steven Kipling
Score: 1.2786191514800556E20 with : Chuq Von
Score: 1.61250626918026547E18 with : Swift Current
Score: 1.25417154269577446E18 with : Scorers Rnk
Score: 4.0378852066884986E17 with : Carol Jarosz
Score: 9.314669849549016E16 with : Ray Borque
Score: 8.7988271207488672E16 with : John Madden
Score: 1.5837886331023705E15 with : Ali Lemer
Score: 2.717240891101768E14 with : J Coyle
'Ali Lemer' was mentioned once in the background model and 7 times in the foreground model, so that is an interesting new phrase to appear.
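A back-of-the-envelope way to formalize "more often than expected" is a binomial z-score: estimate the phrase's probability from the background counts and ask how many standard deviations the observed foreground count lies above its expectation. This is a generic sketch rather than LingPipe's exact scoring; the foreground size and background probability below are assumed values for illustration, apart from the 'Ali Lemer' counts mentioned above.
// Generic "surprisingly frequent" score: how far above its background-model
// expectation is the observed foreground count?  Not LingPipe's exact formula.
public class NewTermScoreSketch {

    static double zScore(double fgCount, double fgTotal, double bgProb) {
        double expected = fgTotal * bgProb;
        double variance = fgTotal * bgProb * (1.0 - bgProb);
        return (fgCount - expected) / Math.sqrt(variance);
    }

    public static void main(String[] args) {
        double fgTotal = 150000.0;       // assumed number of foreground bigrams
        double bgProb = 1.0 / 260391.0;  // 'Ali Lemer' seen once in the background
        System.out.println("Ali Lemer z-score: "
                           + zScore(7, fgTotal, bgProb));  // roughly 8.5
    }
}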
Popular Web Applications
Amazon's "Statistically Improbable Phrases"
Amazon provides a description of their statistically improbable phrase (SIP) extraction on their site. They explain that they look for phrases that appear more frequently in a given book than in the entire collection of books, and list the phrases in that order. Amazon's book pages show examples of SIP extraction for which they do a good job.
Amazon's "Capitalized Phrases"
This looks just like their statistically improbable phrases, only restricted to phrases that are capitalized. Amazon provides an explanation of the feature on their site.
Yahoo's "Buzz Index"
Finding new terms appears to be the basis of:
- Yahoo's Buzz Index
which they explain on their site.
Note particularly their hand-filtering of results. They additionally break out the 'mover' phrases in their indices (those moving up or down in frequency) by category.
Google's "In the News"
Google's "In The News" feature is a sidebar near the top of
Like Yahoo, they seem to restrict attention to capitalized phrases.
Running an example
If you have not already, download and install LingPipe. That done, change directory to demos/tutorial/interestingPhrases. You will also need to untar/unzip the data in demos/rec.sport.hockey.tar.gz. Then type the following on a single line (replacing the colon ":" with a semicolon ";" if using Windows):
java -cp "../../../lingpipe-4.1.2.jar:build/classes" InterestingPhrases
The resulting output will be collocations for the background model and new terms from the foreground model given the background model. See below for more details on how the software was configured.
The Code
First we assemble the background model by visiting a directory of text
files and training up a tokenized language model (a model over tokens
rather than over characters as in our classifier demo). The rubber hits the road in just a few lines of the InterestingPhrases.java
class:
IndoEuropeanTokenizerFactory tokenizerFactory
    = new IndoEuropeanTokenizerFactory();
System.out.println("Training background model");
TokenizedLM backgroundModel
    = buildModel(tokenizerFactory, NGRAM, BACKGROUND_DIR);
backgroundModel.sequenceCounter().prune(3);

System.out.println("\nAssembling collocations in Training");
SortedSet<ScoredObject<String[]>> coll
    = backgroundModel.collocationSet(NGRAM_REPORTING_LENGTH,
                                     MIN_COUNT, MAX_COUNT);

System.out.println("\nCollocations in Order of Significance:");
report(coll);
We need to specify what kind of tokenization we are using in training, and for English a reasonable choice is the supplied IndoEuropeanTokenizerFactory, which provides tokenizers. Recall that this will take a text like
"So many morons...
and produce an array of tokens:
{ "\"", "So", "many", "morons", "..." }
If you need a different tokenization, for instance you don't want punctuation to be a token, then this class is the place to start digging around.
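If you want to check what a factory does before training a model with it, you can run a tokenizer by hand. Below is a minimal sketch (the class name is ours) using the factory from the com.aliasi.tokenizer package:
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

public class TokenizeExample {
    public static void main(String[] args) {
        // the singleton instance; the demo's "new IndoEuropeanTokenizerFactory()" also works
        TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
        char[] cs = "\"So many morons...".toCharArray();
        Tokenizer tokenizer = factory.tokenizer(cs, 0, cs.length);
        for (String token : tokenizer.tokenize())  // should print the token list above
            System.out.println("|" + token + "|");
    }
}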
Next we have the method buildModel
which will take a directory of files and return a tokenized language model. That method is:
private static TokenizedLM buildModel(TokenizerFactory tokenizerFactory,
                                      int ngram,
                                      File directory)
    throws IOException {

    String[] trainingFiles = directory.list();
    TokenizedLM model = new TokenizedLM(tokenizerFactory, ngram);
    System.out.println("Training on " + directory);
    for (int j = 0; j < trainingFiles.length; ++j) {
        File file = new File(directory, trainingFiles[j]);
        String text = Files.readFromFile(file, "ISO-8859-1");
        model.handle(text);
    }
    return model;
}
We create a new TokenizedLM, which requires us to specify the order of the token n-grams it will count. Once that object is created we have a bit of training to do with the handle() method, which takes the text of each file. Pretty simple.
Popping back to where buildModel() is called, the next interesting line is
backgroundModel.sequenceCounter().prune(3);
which makes the collocation calculation run much faster because token sequences with counts below three are pruned from the internal data structures.
Next we get the goodies with the method call collocationSet(int,int,int), which pulls out phrases of the specified token length nGram, with a minimum number of instances minCount, and a cap maxReturned on how many phrases to return in the ranking. Some experimenting is required to see what sort of minCount to use: if it is set too low you get a bunch of noisy single/low-count instances, and if it is set too high you may lose interesting examples. The method returns a sorted set of ScoredObject instances, a convenient class to use when there is a double score associated with an object.
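For reference, NGRAM, NGRAM_REPORTING_LENGTH, MIN_COUNT, MAX_COUNT and the directory constants are defined at the top of InterestingPhrases.java. The values below are illustrative only (including the placeholder paths); check the demo source for the settings actually used:
// Illustrative values only -- see InterestingPhrases.java for the real ones.
static final int NGRAM = 3;                   // order of the token language model
static final int NGRAM_REPORTING_LENGTH = 2;  // report two-token phrases
static final int MIN_COUNT = 5;               // ignore phrases seen fewer than 5 times
static final int MAX_COUNT = 200;             // return at most this many phrases
static final File BACKGROUND_DIR = new File("path/to/background/posts");  // placeholder
static final File FOREGROUND_DIR = new File("path/to/foreground/posts");  // placeholder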
Now that we have collocations--101 of them to be exact, of length 2 given the constant definitions--let's do something with them, like print them out in order. The method report() does just that, as shown below:
static void report(SortedSet<ScoredObject<String[]>> nGrams) {
    for (ScoredObject<String[]> nGram : nGrams) {
        double score = nGram.score();
        String[] toks = nGram.getObject();
        report_filter(score, toks);
    }
}

static void report_filter(double score, String[] toks) {
    String accum = "";
    for (int j = 0; j < toks.length; ++j) {
        if (nonCapWord(toks[j])) return;
        accum += " " + toks[j];
    }
    System.out.println("Score: " + score + " with :" + accum);
}

private static boolean nonCapWord(String tok) {
    if (!Character.isUpperCase(tok.charAt(0)))
        return true;
    for (int i = 1; i < tok.length(); ++i)
        if (!Character.isLowerCase(tok.charAt(i)))
            return true;
    return false;
}
The set of scored n-grams is iterated over in the obvious fashion, and the score and accum variables are built up in report_filter(). We keep only phrases made up entirely of capitalized words, and it all ends in a glorious call to println(). In an actual application you might want to find the phrases in the documents and provide a hyperlink from a web page, compare to yesterday's collocations (a sketch of that idea follows below), or populate a database for a trend-spotting data-mining application.
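As a sketch of the yesterday-comparison idea, the following hypothetical methods could sit next to report() in InterestingPhrases.java. They assume you have built collocation sets for two different days with the code shown above, and they need java.util.Set and java.util.HashSet in addition to the demo's existing imports:
// Hypothetical helper: print today's collocations that were absent yesterday.
static void reportNewToday(SortedSet<ScoredObject<String[]>> yesterday,
                           SortedSet<ScoredObject<String[]>> today) {
    Set<String> yesterdayPhrases = new HashSet<String>();
    for (ScoredObject<String[]> so : yesterday)
        yesterdayPhrases.add(join(so.getObject()));
    for (ScoredObject<String[]> so : today) {
        String phrase = join(so.getObject());
        if (!yesterdayPhrases.contains(phrase))
            System.out.println("New today: " + phrase + " score=" + so.score());
    }
}

// Glue the tokens of a phrase back together with single spaces.
static String join(String[] toks) {
    String accum = "";
    for (int j = 0; j < toks.length; ++j)
        accum += (j == 0 ? "" : " ") + toks[j];
    return accum;
}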
Finding New Terms in Comparison with Another Data Set
Sometimes a more useful definition of interesting phrase is "Am I seeing phrases in one data set more often than I would expect given another data set?" For example, "What is in today's news that is more common than expected given the past week of news?" With very little additional work we can get just such a capability with LingPipe.
All we need to do is train up a foreground model and compare it to the background model we already have from the collocations demo above. Continuing in the source for InterestingPhrases.java we have:
System.out.println("Training foreground model"); TokenizedLM foregroundModel = buildModel(tokenizerFactory, NGRAM, FOREGROUND_DIR); foregroundModel.sequenceCounter().prune(3); System.out.println("\nAssembling New Terms in Test vs. Training"); SortedSet<ScoredObject<String[]>> newTerms = foregroundModel.newTermSet(NGRAM_REPORTING_LENGTH, MIN_COUNT, MAX_COUNT, backgroundModel); System.out.println("\nNew Terms in Order of Signficance:"); report(newTerms);
We simply build a model with some different data--in this case more, and chronologically later, data from rec.sport.hockey--prune it, and apply the method newTermSet() with the appropriate parameters, including the background model to compare against. Then we use the same report() method as with the collocations.
References
For a survey of research on collocations and novel phrase finding, see:
- Christopher D. Manning and Hinrich Schuetze. 1999. Foundations of Statistical Natural Language Processing. MIT Press. Chapter 5: Collocations.
For an early paper applying statistical hypothesis testing to phrase extraction, see:
- Ted Dunning. 1993. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1).