What is Text Classification?

Text classification typically involves assigning a document to a category by automated or human means. LingPipe provides a classification facility that takes examples of text classifications--typically generated by a human--and learns to classify further documents using language models built from those examples. There are many other ways to construct classifiers, but language models are particularly good at some versions of this task.

20 Newsgroups Demo

A publicly available data set to work with is the 20 newsgroups data, available from the 20 Newsgroups Home Page.

4 Newsgroups Sample

We have included a sample of 4 newsgroups with the LingPipe distribution in order to allow you to run the tutorial out of the box. You may also download and run over the entire 20 newsgroup dataset. LingPipe's performance over the whole data set is state of the art.

Quick Start

Once you have downloaded and installed LingPipe, change directories to the one containing this read-me:

> cd demos/tutorial/classify

You may then run the demo from the command line (placing all of the code on one line):

On Windows:

java
-cp "../../../lingpipe-4.1.0.jar;
     classifyNews.jar"
ClassifyNews

On Linux, Mac OS X, and other Unix-like operating systems:

java
-cp "../../../lingpipe-4.1.0.jar:
     classifyNews.jar"
ClassifyNews

or through Ant:

ant classifyNews

The demo will then train on the data in demos/fourNewsGroups/4news-train/ and evaluate on demos/fourNewsGroups/4news-test/. The results of scoring are printed to the command line and explained in the rest of this tutorial.

The Code

The entire source for the example is ClassifyNews.java. We will be using the API from Classifier and its subclasses to train the classifier, and Classification to evaluate it. The code should be pretty self-explanatory in terms of how training and evaluation are done. Below I go over the API calls.

Training

We are going to train up a set of character-based language models (one per newsgroup, as named in the static array CATEGORIES) that process data in 6-character sequences, as specified by the NGRAM_SIZE constant.

private static String[] CATEGORIES
    = { "soc.religion.christian",
        "talk.religion.misc",
        "alt.atheism",
        "misc.forsale" };

private static int NGRAM_SIZE = 6;

Generally, the smaller your training data, the smaller the n-gram size should be, but you can play around with different values; reasonable values range from 1 to 16, with 6 being a good general starting place.

The actual classifier involves one language model per category. In this case, we are going to use process language models (LanguageModel.Process). There is a factory method in DynamicLMClassifier to construct the actual models.

DynamicLMClassifier<NGramProcessLM> classifier
  = DynamicLMClassifier
    .createNGramProcess(CATEGORIES,
                        NGRAM_SIZE);

There are two other kinds of language model classifiers that may be constructed, for bounded character language models and tokenized language models.
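
For reference, here is a rough sketch of how those variants are constructed. The boundary factory method takes the same arguments as the process one; the tokenized factory method also needs a tokenizer factory, and its exact signature is an assumption here, so check the DynamicLMClassifier Javadoc:

// Boundary character language models (same arguments as the process variant):
DynamicLMClassifier<NGramBoundaryLM> boundaryClassifier
    = DynamicLMClassifier.createNGramBoundary(CATEGORIES,NGRAM_SIZE);

// Tokenized language models; the parameter order here is assumed, so
// verify it against the Javadoc before relying on it:
DynamicLMClassifier<TokenizedLM> tokenizedClassifier
    = DynamicLMClassifier.createTokenized(CATEGORIES,
                                          IndoEuropeanTokenizerFactory.INSTANCE,
                                          NGRAM_SIZE);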

Training a classifier simply involves providing examples of text of the various categories. This is called through the handle method after first constructing a classification from the category and a classified object from the classification and text:

Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text,classification);
classifier.handle(classified);
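
In ClassifyNews.java these calls sit inside a loop over the per-category training directories. Here is a minimal sketch of that loop, assuming the TRAINING_DIR constant and directory layout used by the cross-validation code later in this tutorial; it is not necessarily line-for-line what the demo source does:

for (int i = 0; i < CATEGORIES.length; ++i) {
    Classification classification
        = new Classification(CATEGORIES[i]);
    File trainCatDir = new File(TRAINING_DIR,CATEGORIES[i]);
    for (File trainingFile : trainCatDir.listFiles()) {
        String text = Files.readFromFile(trainingFile,"ISO-8859-1");
        Classified<CharSequence> classified
            = new Classified<CharSequence>(text,classification);
        classifier.handle(classified);
    }
}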

That's all you need to train up a language model classifier. Now we can see what it can do with some evaluation data.

Classifying News Articles

The DynamicLMClassifier is pretty slow when doing classification, so it is generally worth going through a compile step to produce a more efficient compiled version, which will classify character sequences into joint classification results. A simple way to do that in the code is:

JointClassifier<CharSequence> compiledClassifier
    = (JointClassifier<CharSequence>)
      AbstractExternalizable.compile(classifier);

Now the rubber hits the road and we can see how well the machine learning is doing. The example code both reports classifications to the console and evaluates the performance. The crucial lines of code are:

JointClassification jc = compiledClassifier.classifyJoint(text);
String bestCategory = jc.bestCategory();
String details = jc.toString();

The text is an article that was not trained on, and the JointClassification is the result of evaluating the text against all the language models. Its bestCategory() method returns the name of the highest-scoring category for the text. Just to show that some statistics are involved, the toString() method dumps out all the results, which are presented as:

Testing on soc.religion.christian/21417
Best Cat: soc.religion.christian
Rank Cat Score P(Cat|In) log2 P(Cat,In)
0=soc.religion.christian -1.56 0.45 -1.56
1=talk.religion.misc -2.68 0.20 -2.68
2=alt.atheism -2.70 0.20 -2.70
3=misc.forsale -3.25 0.13 -3.25
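
If you want the ranked results programmatically rather than as a string, the classification exposes rank-based accessors. The sketch below assumes the size(), category(), score(), and conditionalProbability() accessors; verify the exact names against the classification Javadoc before using them:

// Walk the ranked results, printing category, score, and conditional probability.
for (int rank = 0; rank < jc.size(); ++rank) {
    System.out.printf("%d=%s score=%.2f P(cat|in)=%.2f%n",
                      rank,
                      jc.category(rank),
                      jc.score(rank),
                      jc.conditionalProbability(rank));
}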

Scoring Accuracy

The remaining API of note is how the system is scored against a gold standard, in this case our testing data. Since we know which newsgroup each article came from, we can evaluate how well the software is doing with the JointClassifierEvaluator class.

boolean storeInputs = true;
JointClassifierEvaluator<CharSequence> evaluator
    = new JointClassifierEvaluator<CharSequence>(compiledClassifier,
                                                 CATEGORIES,
                                                 storeInputs);

This class wraps the compiledClassifier in an evaluation framework that provides very rich reporting of how well the system is doing. Later in the code it is populated with test cases via the handle() method, after first constructing a classified object just as for training:

Classification classification
    = new Classification(CATEGORIES[i]);
Classified<CharSequence> classified
    = new Classified<CharSequence>(text,classification);
evaluator.handle(classified);

This will get a JointClassification for the text and then keep track of the results for reporting later. After all the data has been run, many methods are available to see how well the software did. In the demo code we just print out the total accuracy via the ConfusionMatrix class, but it is well worth looking at the relevant Javadoc to see what reporting is available.
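
That report boils down to a couple of calls on the evaluator's confusion matrix, along the lines of the cross-validation code shown later:

ConfusionMatrix cm = evaluator.confusionMatrix();
System.out.println("Total Accuracy: " + cm.totalAccuracy());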

Cross-Validation

Running Cross-Validation

There's an ant target crossValidateNews which cross-validates the news classifier over 10 folds. Here's what a run looks like:

> cd $LINGPIPE/demos/tutorial/classify
> ant crossValidateNews

Reading data.
Num instances=250.
Permuting corpus.
 FOLD        ACCU
    0  1.00 +/- 0.00
    1  0.96 +/- 0.08
    2  0.84 +/- 0.14
    3  0.92 +/- 0.11
    4  1.00 +/- 0.00
    5  0.96 +/- 0.08
    6  0.88 +/- 0.13
    7  0.84 +/- 0.14
    8  0.88 +/- 0.13
    9  0.84 +/- 0.14

This reports that there are 250 training examples. With 10 folds, that's 225 training cases and 25 test cases per fold. The accuracy for each fold is reported along with the 95% normal approximation to the binomial confidence interval per run (with no smoothing on the binomial estimate, hence the 0.00 intervals for folds 0 and 4, where accuracy is 1.00). The moral of this story is that small training and test sizes lead to large variance.
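
As a rough check on those intervals (the exact formula used by the demo is an assumption here), the normal approximation to the binomial interval is 1.96 * sqrt(p * (1 - p) / n). For fold 1, with accuracy p = 0.96 over n = 25 test cases, that gives 1.96 * sqrt(0.96 * 0.04 / 25), which is approximately 0.08 and matches the reported value; for fold 2, with p = 0.84, it comes out to roughly 0.14.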

Cross-validation is a means of using a single corpus to train and evaluate without deciding ahead of time how to carve the data into test and training portions. This is often used for evaluation, but more properly should be used only for development.

How Cross-Validation Works

Cross-validation divides a corpus into a number of evenly sized portions called folds. Then for each fold, the data not in the fold is used to train a classifier which is then evaluated on the current fold. The results are then pooled across the folds, which greatly reduces the variance in the evaluation, reflected in narrower confidence intervals.

Implementing a Cross-Validating Corpus

LingPipe supplies a convenient corpus.Corpus class which is meant to be used for generic training and testing applications like cross-validation. The corpus class is typed by the handler type H intended to handle its data. The basis of the corpus class is a pair of methods, visitTrain(H) and visitTest(H), which send every training instance or every testing instance, respectively, to the handler.
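
To make the contract concrete, here is a hypothetical hand-rolled corpus over in-memory lists. The class name is made up for illustration, and the exact generic bound and any declared exceptions on visitTrain/visitTest should be checked against the corpus.Corpus Javadoc:

// Hypothetical corpus that replays stored items to whatever handler it is given.
// Assumes com.aliasi.corpus.Corpus, com.aliasi.corpus.ObjectHandler,
// com.aliasi.classify.Classified, and java.util.List/ArrayList.
public class InMemoryCorpus
    extends Corpus<ObjectHandler<Classified<CharSequence>>> {

    private final List<Classified<CharSequence>> mTrainItems
        = new ArrayList<Classified<CharSequence>>();
    private final List<Classified<CharSequence>> mTestItems
        = new ArrayList<Classified<CharSequence>>();

    public void addTrain(Classified<CharSequence> item) { mTrainItems.add(item); }
    public void addTest(Classified<CharSequence> item) { mTestItems.add(item); }

    @Override
    public void visitTrain(ObjectHandler<Classified<CharSequence>> handler) {
        for (Classified<CharSequence> item : mTrainItems)
            handler.handle(item);
    }

    @Override
    public void visitTest(ObjectHandler<Classified<CharSequence>> handler) {
        for (Classified<CharSequence> item : mTestItems)
            handler.handle(item);
    }
}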

LingPipe implements cross-validation for evaluation with the class corpus.XValidatingObjectCorpus. This corpus implementation just stores the data in parallel lists and uses it to implement the visit-test and visit-train methods of the corpus.

Permuting Inputs

It is critical in evaluating classifiers to pay attention to correlations in the corpus. The 20 newsgroups data is organized by category, so each category's examples lie in one contiguous run; a naive 10% cross-validation fold could therefore remove most or all of a category's training data while testing only on that category.

To solve this problem, the cross-validating corpus implementation includes a method to permute the corpus using a supplied java.util.Random instance.

We implemented the randomizer with a fixed seed so that experiments would be repeatable. Change the seed to get a different set of runs. You should see the variance even more clearly after more runs.

Cross-Validation Implementation

The command-line implementation for cross-validating is in src/CrossValidateNews.java. The code mostly repeats the simple classifier code. First, we create a cross-validating corpus, then store all of the data from both the training and test directories.

XValidatingObjectCorpus<Classified<CharSequence>> corpus
    = new XValidatingObjectCorpus<Classified<CharSequence>>(NUM_FOLDS);

for (String category : CATEGORIES) {

    Classification c = new Classification(category);
    File trainCatDir = new File(TRAINING_DIR,category);
    for (File trainingFile : trainCatDir.listFiles()) {
        String text = Files.readFromFile(trainingFile,"ISO-8859-1");
        Classified<CharSequence> classified
            = new Classified<CharSequence>(text,c);
        corpus.handle(classified);
    }

    File testCatDir = new File(TESTING_DIR,category);
    for (File testFile : testCatDir.listFiles()) {
        String text = Files.readFromFile(testFile,"ISO-8859-1");
        Classified<CharSequence> classified
            = new Classified<CharSequence>(text,c);
        corpus.handle(classified);
    }
}

The corpus is then permuted using a new random number generator, which breaks up any order-related correlations in the data:

long seed = 42L;
corpus.permuteCorpus(new Random(seed));

Note that we have fixed the seed value for the random number generator. Choosing another one would produce a different shuffling of the inputs.

Now that the corpus is created, we loop over the folds, evaluating each one using the methods supplied by the corpus:

for (int fold = 0; fold < NUM_FOLDS; ++fold) {
    corpus.setFold(fold);

    DynamicLMClassifier<NGramProcessLM> classifier
        = DynamicLMClassifier.createNGramProcess(CATEGORIES,NGRAM_SIZE);
    corpus.visitTrain(classifier);

    JointClassifier<CharSequence> compiledClassifier
        = (JointClassifier<CharSequence>)
          AbstractExternalizable.compile(classifier);

    boolean storeInputs = true;
    JointClassifierEvaluator<CharSequence> evaluator
        = new JointClassifierEvaluator<CharSequence>(compiledClassifier,
                                                     CATEGORIES,
                                                     storeInputs);

    corpus.visitTest(evaluator);
    System.out.printf("%5d  %4.2f +/- %4.2f\n", fold,
                      evaluator.confusionMatrix().totalAccuracy(),
                      evaluator.confusionMatrix().confidence95());
}

For each fold, the fold is first set on the corpus. Then a trainable classifier is created and the corpus is used to train it through the visitTrain() method. Then the classifier is compiled and used to construct an evaluator. The evaluator is then run over the test cases by the corpus method visitTest(). Finally, the resulting accuracy and 95% confidence interval are printed.

Leave-One-Out Evaluations

The limit of cross-validation is when each fold consists of a single example. This is called "leave one out" (LOO). This is easily achieved in the general corpus implementation by setting the number of folds equal to the number of data points. The main drawback of doing it this way is efficiency, since a classifier must be retrained for every fold, which is why leave-one-out evaluations are typically done with specialized implementations. Also, in doing leave-one-out, there is no point in compiling the classifier before running it, since each trained model classifies only a single test case.
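
Here is a sketch of what that looks like with the cross-validating corpus from above. The size() accessor and the ability to hand the dynamic (uncompiled) classifier straight to the evaluator are assumptions on my part; check the XValidatingObjectCorpus and DynamicLMClassifier Javadoc:

corpus.setNumFolds(corpus.size());   // one fold per stored instance (size() assumed)
for (int fold = 0; fold < corpus.size(); ++fold) {
    corpus.setFold(fold);

    DynamicLMClassifier<NGramProcessLM> classifier
        = DynamicLMClassifier.createNGramProcess(CATEGORIES,NGRAM_SIZE);
    corpus.visitTrain(classifier);

    // No compile step: evaluate the dynamic classifier directly on the held-out case.
    JointClassifierEvaluator<CharSequence> evaluator
        = new JointClassifierEvaluator<CharSequence>(classifier,CATEGORIES,true);
    corpus.visitTest(evaluator);
}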
