What are Significant Phrases?

LingPipe provides a simple way to find statistically significant phrases in a document collection. There are two types of significance of interest.

Collocations

Collocations are phrases whose tokens appear together more often than you would expect given an estimate of how frequent each token is on its own. For example, in 600 posts from the rec.sport.hockey newsgroup, the collocations of length two found by LingPipe are:

Collocations in Order of Significance:
Score: 260391.00000000006 with : Evan Pritchard
Score: 260391.00000000006 with : Rachel Holme
Score: 260391.00000000006 with : Durham Wasps
Score: 260391.00000000006 with : Abo Akademi
Score: 260391.00000000006 with : Solar Terresterial
Score: 260391.00000000003 with : Los Angeles
Score: 260391.0 with : Petteri Kortelainen
Score: 260391.0 with : Anna Matyas
Score: 260391.0 with : Keyword Description
Score: 260391.0 with : Milton Keynes
Score: 260391.0 with : Marcus Lindroos
.
.
.
Score: 156230.99987556748 with : Roman Era
Score: 145516.46723953017 with : Tampa Bay
Score: 142432.13662060775 with : Simon Fraser
Score: 142432.13662060772 with : Capital District
Score: 142028.72720988328 with : Rex Wang
Score: 140207.30760545374 with : Lori Iannamico
Score: 138871.46655196382 with : Tie Breaker

These are two-token phrases ranked by how often the tokens are seen together as opposed to how often each is seen alone. For example, 'Los Angeles' has a higher score than 'Tie Breaker' because we see 'Los' 67 times, 'Angeles' 67 times and 'Los Angeles' 67 times. So 'Los' and 'Angeles' always occur as part of the larger phrase--a high correlation. On the other hand, 'Tie' occurs 15 times, 'Breaker' 8 times and 'Tie Breaker' 8 times, so 'Tie' only occurs with the larger phrase about half the time--less of a correlation.
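
To make that intuition concrete, here is a small, self-contained sketch that computes a crude association ratio from the counts above. This is only an illustration of the idea; it is not the statistical test LingPipe actually uses to score collocations.

public class RatioExample {
    // Fraction of each token's occurrences that fall inside the phrase,
    // averaged over the two tokens. Purely illustrative; not LingPipe's score.
    static double phraseRatio(int pairCount, int leftCount, int rightCount) {
        double left = (double) pairCount / leftCount;
        double right = (double) pairCount / rightCount;
        return (left + right) / 2.0;
    }

    public static void main(String[] args) {
        // 'Los' 67 times, 'Angeles' 67 times, 'Los Angeles' 67 times
        System.out.println("Los Angeles: " + phraseRatio(67, 67, 67)); // 1.0
        // 'Tie' 15 times, 'Breaker' 8 times, 'Tie Breaker' 8 times
        System.out.println("Tie Breaker: " + phraseRatio(8, 15, 8));   // about 0.77
    }
}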

Relatively New Terms

This technique evaluates the significance of phrases in one collection versus another, finding phrases that occur significantly more often in the foreground corpus than would be expected from the background corpus. The interesting phrases are those we see more often in the foreground model than the background model predicts. In the Google News example discussed below, this is a way to tell if there is more news than usual about a phrase--if counts for 'George Bush' go way up, there are likely some big stories about him.

In our hockey data set, we see that 400 (chronologically later) articles contain surprisingly frequent mentions of the following capitalized phrases:

New Terms in Order of Significance:
Score: 1.942594719053515E28 with : The Logistician
Score: 3.64670037870023E27 with : Von Rospach
Score: 7.573888901798824E24 with : Clint Malarchuk
Score: 1.9306325340669786E24 with : Drozinski Tim
Score: 1.6830864226219717E24 with : Allegheny College
Score: 7.380822220736823E23 with : John Franjione
Score: 9.025704916227892E21 with : Robert Andolina
Score: 5.693776247868884E21 with : Paul Brownlow
Score: 1.4125238917251475E20 with : Steven Kipling
Score: 1.2786191514800556E20 with : Chuq Von
Score: 1.61250626918026547E18 with : Swift Current
Score: 1.25417154269577446E18 with : Scorers Rnk
Score: 4.0378852066884986E17 with : Carol Jarosz
Score: 9.314669849549016E16 with : Ray Borque
Score: 8.7988271207488672E16 with : John Madden
Score: 1.5837886331023705E15 with : Ali Lemer
Score: 2.717240891101768E14 with : J Coyle

'Ali Lemer' was mentioned once in the background model and 7 times in the foreground model, so it is an interesting new phrase to appear.
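
The idea can be illustrated with a back-of-the-envelope calculation: scale the background rate of a phrase up to the size of the foreground corpus and see how far the observed foreground count exceeds that expectation. The corpus sizes below are made-up placeholders, and this is not the statistic newTermSet() computes; it just shows the flavor of the comparison.

public class NewTermExample {
    // How many times more often the phrase occurs in the foreground than
    // its background rate predicts. Illustrative only; not LingPipe's score.
    static double surprise(int fgCount, long fgTotal, int bgCount, long bgTotal) {
        double expected = ((double) bgCount / bgTotal) * fgTotal;
        return fgCount / (expected + 1e-9);  // smoothed to avoid division by zero
    }

    public static void main(String[] args) {
        long bgTotal = 200000L;  // hypothetical background token count
        long fgTotal = 130000L;  // hypothetical foreground token count
        // 'Ali Lemer': 1 mention in the background, 7 in the foreground
        System.out.println("Ali Lemer: " + surprise(7, fgTotal, 1, bgTotal));
    }
}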

Popular Web Applications

Amazon's "Statistically Improbable Phrases"

Amazon provides a description of their statistically improbable phrase extraction on their site.

They explain that they look for phrases that appear more frequently in a given book than in the entire collection of books and list the phrases in this order.

Amazon's individual book pages show examples of SIP extraction for which they do a good job.

Amazon's "Capitalized Phrases"

This looks just like their statistically improbable phrases, only with a restriction to phrases that are capitalized. Amazon provides an explanation of this feature as well.

Yahoo's "Buzz Index"

Another application of this technique is Yahoo's Buzz Index. Finding new terms appears to be the basis of the index, which Yahoo explains in their documentation.

Note particularly their hand-filtering of results. They additionally classify phrases in their indices that are considered "movers" (that is, moving up or down in frequency) by category.

Google's "In the News"

Google's "In The News" feature is a sidebar near the top of

Like Yahoo, they seem to restrict attention to capitalized phrases.

Running an example

If you have not already, download and install LingPipe. That done, change directory to demos/tutorial/interestingPhrases. You will also need to unpack the data in demos/rec.sport.hockey.tar.gz. Then type the following on a single line (replacing the colon ":" with a semicolon ";" if using Windows):

java
-cp "../../../lingpipe-4.1.0.jar:build/classes"
InterestingPhrases

The resulting output will be collocations for the background model and new terms from the foreground model given the background model. See below for more details on how the software was configured.

The Code

First we assemble the background model by visiting a directory of text files and training up a tokenized language model (a model over tokens rather than over characters as in our classifier demo). The rubber hits the road in just a few lines of the InterestingPhrases.java class:

IndoEuropeanTokenizerFactory tokenizerFactory
    = new IndoEuropeanTokenizerFactory();

System.out.println("Training background model");
TokenizedLM backgroundModel = buildModel(tokenizerFactory,
                                         NGRAM,
                                         BACKGROUND_DIR);

backgroundModel.sequenceCounter().prune(3);

System.out.println("\nAssembling collocations in Training");
SortedSet<ScoredObject<String[]>> coll
    = backgroundModel.collocationSet(NGRAM_REPORTING_LENGTH,
                                     MIN_COUNT,MAX_COUNT);

System.out.println("\nCollocations in Order of Significance:");
report(coll);
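
The snippet above refers to several constants defined elsewhere in InterestingPhrases.java. For orientation, here is roughly what such definitions look like; the values and the directory path are illustrative guesses rather than the demo's actual settings, except that a reporting length of 2 matches the two-token phrases shown earlier.

// Illustrative values only; the real definitions are in InterestingPhrases.java.
static final int NGRAM = 3;                    // n-gram length of the token language model
static final int NGRAM_REPORTING_LENGTH = 2;   // report two-token phrases
static final int MIN_COUNT = 5;                // ignore phrases seen fewer times than this
static final int MAX_COUNT = 200;              // maximum number of phrases to report
static final File BACKGROUND_DIR
    = new File("../../rec.sport.hockey/train"); // hypothetical path to the training data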

We need to specify what kind of tokenization we are using in training, and for English a reasonable choice is the supplied IndoEuropeanTokenizerFactory, which produces tokenizers. Recall that this will take a text like

"So many morons...

and produce an array of tokens:

{ "\"", "So", "many", "morons", "..." }

If you need a different tokenization, for instance you don't want punctuation to be a token, then this class is the place to start digging around.
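
As a starting point for that digging, the sketch below tokenizes the example text with the factory used in the demo and then with a regular-expression factory that only emits alphabetic tokens, so punctuation never becomes a token. The class name and the regex are just for illustration.

import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

public class TokenizerSketch {
    public static void main(String[] args) {
        char[] cs = "\"So many morons...".toCharArray();

        // Default Indo-European tokenization: punctuation shows up as tokens.
        TokenizerFactory indo = new IndoEuropeanTokenizerFactory();
        Tokenizer t1 = indo.tokenizer(cs, 0, cs.length);
        for (String tok : t1.tokenize())
            System.out.println("indo-european: " + tok);

        // Regex-based tokenization keeping only alphabetic runs,
        // so punctuation is silently dropped.
        TokenizerFactory wordsOnly = new RegExTokenizerFactory("[a-zA-Z]+");
        Tokenizer t2 = wordsOnly.tokenizer(cs, 0, cs.length);
        for (String tok : t2.tokenize())
            System.out.println("words-only: " + tok);
    }
}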

Next we have the method buildModel which will take a directory of files and return a tokenized language model. That method is:


private static TokenizedLM buildModel(TokenizerFactory tokenizerFactory,
                                      int ngram,
                                      File directory)
    throws IOException {

    String[] trainingFiles = directory.list();
    TokenizedLM model
        = new TokenizedLM(tokenizerFactory,ngram);

    System.out.println("Training on " + directory);

    for (int j = 0; j < trainingFiles.length; ++j) {
        File file = new File(directory,trainingFiles[j]);
        String text = Files.readFromFile(file,"ISO-8859-1");
        model.handle(text);
    }
    return model;
}

We create a new TokenizedLM, which requires us to specify the tokenizer factory and the length of the token n-grams it counts. Once that object is created we have a bit of training to do with the handle() method, which takes the text from each file. Pretty simple.

Popping back to where buildModel() is called, the next interesting line is backgroundModel.sequenceCounter().prune(3); which makes the collocation calculation run much faster because rarely seen sequences are removed from the internal counters.

Next we get the goodies with the method call collocationSet(int,int,int), which pulls phrases of a specified token length nGram, a minimum number of instances minCount, and a maximum number of phrases to return in the ranking maxReturned. Some experimenting is required to see what sort of minCount to use: if it is set too low you get a bunch of noisy single/low-count instances, and if it is set too high you may lose interesting examples. The method returns a sorted set of ScoredObject instances, a convenient class to use when there is a double value associated with an object.
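
One quick way to do that experimenting, assuming the fragment below is dropped into InterestingPhrases.java after the background model has been built and pruned, is to loop over a few candidate thresholds and see how many phrases survive. The candidate values here are arbitrary.

// Try a few minimum-count thresholds and see how many collocations each yields.
int[] candidateMinCounts = { 2, 3, 5, 10 };
for (int minCount : candidateMinCounts) {
    SortedSet<ScoredObject<String[]>> colls
        = backgroundModel.collocationSet(NGRAM_REPORTING_LENGTH,
                                         minCount,
                                         MAX_COUNT);
    System.out.println("minCount=" + minCount
                       + " yields " + colls.size() + " collocations");
}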

Now that we have collocations (101 of them to be exact, of length 2, given the constant definitions), let's do something with them, like print them out in order. The method report() does just that, as shown below:

static void report(SortedSet<ScoredObject<String[]>> nGrams) {
    for (ScoredObject<String[]> nGram : nGrams) {
        double score = nGram.score();
        String[] toks = nGram.getObject();
        report_filter(score,toks);
    }
}
    
static void report_filter(double score, String[] toks) {
    String accum = "";
    for (int j=0; j<toks.length; ++j) {
        if (nonCapWord(toks[j])) return;
        accum += " "+toks[j];
    }
    System.out.println("Score: "+score+" with :"+accum);
}

private static boolean nonCapWord(String tok) {
    if (!Character.isUpperCase(tok.charAt(0)))
        return true;
    for (int i = 1; i < tok.length(); ++i) 
        if (!Character.isLowerCase(tok.charAt(i))) 
            return true;
    return false;
}

The sorted set is iterated over in the obvious fashion, and the score and accum variables are built up in report_filter. We are only looking at phrases whose words start with a capital letter, and it all ends in a glorious call to println. In an actual application you might want to find the phrases in the documents and provide a hyperlink from a web page, compare to yesterday's collocations, or populate a database for a trend-spotting data-mining application.
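
For instance, comparing against yesterday's collocations could be as simple as the hypothetical helper below, which collects the phrases that appear in today's ranking but not yesterday's. It is not part of the demo; it just shows the shape of such a comparison using the same ScoredObject sets returned by collocationSet().

// Hypothetical helper, not part of the demo. Requires java.util.HashSet,
// java.util.Set, java.util.SortedSet and com.aliasi.util.ScoredObject.
static Set<String> newlySeenPhrases(SortedSet<ScoredObject<String[]>> today,
                                    SortedSet<ScoredObject<String[]>> yesterday) {
    Set<String> seenYesterday = new HashSet<String>();
    for (ScoredObject<String[]> so : yesterday)
        seenYesterday.add(String.join(" ", so.getObject()));
    Set<String> fresh = new HashSet<String>();
    for (ScoredObject<String[]> so : today) {
        String phrase = String.join(" ", so.getObject());
        if (!seenYesterday.contains(phrase))
            fresh.add(phrase);
    }
    return fresh;
}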

Finding New Terms in Comparison with Another Data Set

Sometimes a more useful definition of interesting phrase is "Am I seeing phrases in one data set more often than I would expect given another data set?" For example, "What is in today's news that is more common than expected given the past week of news?" With very little additional work we can get just such a capability with LingPipe.

All we need do is train up a foreground model and compare it to the background model we already have from the collocations demo above. Continuing in the source for InterestingPhrases.java we have:

System.out.println("Training foreground model");
TokenizedLM foregroundModel = buildModel(tokenizerFactory,
                                         NGRAM,
                                         FOREGROUND_DIR);
foregroundModel.sequenceCounter().prune(3);

System.out.println("\nAssembling New Terms in Test vs. Training");
SortedSet<ScoredObject<String[]>> newTerms 
    = foregroundModel.newTermSet(NGRAM_REPORTING_LENGTH,
                                 MIN_COUNT,
                                 MAX_COUNT,
                                 backgroundModel);
System.out.println("\nNew Terms in Order of Signficance:");
report(newTerms);

We simply build a model with some different data, in this case more and chronologically later data from rec.sport.hockey, prune it, and apply the method newTermSet() with appropriate parameters. Then we use the same report method as with the collocations.
