Introduction

What is Logistic Regression?

Logistic regression is a discriminative probabilistic classification model that operates over real-valued vector inputs. The dimensions of the input vectors being classified are called "features" and there is no restriction against them being correlated. Logistic regression is one of the best probabilistic classifiers, measured in both log loss and first-best classification accuracy across a number of tasks.

The logistic regression implementation in LingPipe provides multinomial classification; that is, it allows more than two possible output categories.

The main drawback of logistic regression is that it's relatively slow to train compared to the other LingPipe classifiers. It also requires extensive tuning in the form of feature selection and implementation to achieve state-of-the-art classification performance.

What's in the Tutorial?

This tutorial covers both the vector-based implementation in the statistics package and the use of feature extractors for classifying arbitrary objects in the classification package. The tutorial will cover basic estimation, the effects of different choices and parameterizations of priors, and tuning the estimator's search.

The tutorial will also cover the basics of feature-based classification in LingPipe. Feature extractors convert arbitrary objects into feature vectors, which may then be converted to actual vectors for use in logistic regression.

Also Known As (AKA)

For the sake of terminological clarity (and search engine optimization), here are some aliases for multinomial logistic regression.

Polytomous Logistic Regression

Multinomial logistic regression is also known as polytomous, polychotomous, or multi-class logistic regression, or just multilogit regression.

Maximum Entropy Classifier

Logistic regression estimation obeys the maximum entropy principle, and thus logistic regression is sometimes called "maximum entropy modeling", and the resulting classifier the "maximum entropy classifier".

Neural Network: Classification with a Single Neuron

Binary logistic regression is equivalent to a one-layer, single-output neural network with a logistic activation function trained under log loss. This is sometimes called classification with a single neuron.

LingPipe's stochastic gradient descent is equivalent to a stochastic back-propagation algorithm over the single-output neural network.

Ridge Regression and the Lasso

Maximum a posteriori (MAP) estimation with Gaussian priors is often referred to as "ridge regression"; with Laplace priors MAP estimation is known as the "lasso".

Shrinkage and Regularized Regression

MAP estimation with Gaussian, Laplace or Cauchy priors is known as parameter shrinkage.

Gaussian and Laplace priors are equivalent to regularized regression, with the Gaussian version being regularized with the L2 norm (Euclidean distance, called the Frobenius norm for matrices of parameters) and the Laplace version being regularized with the L1 norm (taxicab distance or Manhattan metric); other Minkowski metrics may be used for shrinkage.

Generalized Linear Model and Softmax

Logistic regression is a generalized linear model with the logit link function. The inverse of the link, which converts linear predictors to probabilities, is sometimes called softmax, and given its use of exponentiation, logistic regression is sometimes called an exponential model.

Logistic Regression Models

Logistic regression models provide multi-category classification in cases where the categories are exhaustive and mutually exclusive. That is, every instance belongs to exactly one category.

Inputs are coded as real-valued vectors of a fixed dimensionality. The dimensions are often called predictors or features. There is no requirement that they be independent, and with regularization, they may even be highly or fully linearly correlated.

The model consists of one parameter vector per category, each of the same dimensionality as the inputs. The last category does not get a parameter vector; or equivalently, it gets a constant 0 parameter vector.

More formally, if the inputs are of dimension d and there are k categories, the model consists of k-1 vectors β[0],...,β[k-2]. Then for a given input vector x of dimensionality d, the conditional probability of a category given the input is proportional to:

p(0 | x)   ∝ exp(β[0] * x)
p(1 | x)   ∝ exp(β[1] * x)
...
p(k-2 | x) ∝ exp(β[k-2] * x)
p(k-1 | x) ∝ exp(0 * x)

Normalizing by the sum of the exponentiated bases yields the probability estimates:

p(0 | x) = exp(β[0]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
p(1 | x) = exp(β[1]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
...
p(k-2 | x) = exp(β[k-2]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
p(k-1 | x) = exp(0*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))

Writing it out in summation notation, for c < k-1:

p(c | x) = exp(β[c] * x) / (1 + Σ_{i < k-1} exp(β[i] * x))

and for c = k-1:

p(k-1 | x) = 1 / (1 + Σ_{i < k-1} exp(β[i] * x))

Example of Logistic Regression

Logistic regression models are estimated from training data consisting of a sequence of vectors and their reference categories. The vectors are arbitrary, with their dimensions representing features of the input objects being classified. The categories are discrete, and should be numbered contiguously from 0 to the number of categories minus one.

The Wallet Problem

The first example we consider is drawn from chapter 5 of (Allison 1999).

The data is based on a survey of 195 undergraduates, and attempts to predict their answer to the question "If you found a wallet on the street, would you...", with the following possible responses:

Wallet Problem Outcomes
Outcome  Description
0        keep both
1        keep the money, return the wallet
2        return both

The input vectors are five dimensional, consisting of the following features, the descriptions of which are directly transcribed from (Allison 1999):

Wallet Problem Predictors
Dimension  Description  Values
0          Intercept    1: always
1          Male         1: male
                        0: female
2          Business     1: enrolled in business school
                        0: not enrolled in business school
3          Punish       Variable describing whether student was physically punished by parents at various ages:
                        1: punished in elementary school, but not in middle or high school
                        2: punished in elementary and middle school, but not in high school
                        3: punished at all three levels
4          Explain      Response to question "When you were punished, did your parents generally explain why what you did was wrong?"
                        1: almost always
                        0: sometimes or never

LingPipe requires an explicit representation of the intercept feature, which is implicit in (Allison 1999). The intercept is treated just like other features, but is assumed to take on value 1.0 in all inputs. Thus it provides an input-independent bias term for estimation. Most problems benefit from the addition of such an intercept feature.

Where do Features Come From?

The predictors in this problem are all discrete, most defining binary variables, with the physical punishment variable taking on three ordinal values. It is also possible to include continuous inputs for regression problems, such as token counts in linguistic examples or features like width of petals in flower species classification problems.

Here are the first few training examples out of the complete set of 195:

Wallet Problem Data (Sample)
Outcome  Intercept  Male  Business  Punish  Explain
1        1.0        0.0   0.0       2.0     0.0
1        1.0        0.0   0.0       2.0     1.0
2        1.0        0.0   0.0       1.0     1.0
2        1.0        0.0   0.0       2.0     0.0
0        1.0        1.0   0.0       1.0     1.0
...

For example, the third training instance represents a survey response for a woman (male=0.0) who is not in business school (business=0.0), who was punished only in elementary school (punish=1.0), had her punishment explained almost always (explain=1.0), and who said she'd return both the wallet and the money (outcome=2). The fifth training example represents a man who's not in business school, was punished only in elementary school, had his punishment explained, and answered that he would keep both the money and wallet.

Our first logistic regression model is estimated from all 195 of these training cases, yielding a classifier that, given the five input feature values (intercept, male, business, punish and explain), assigns probabilities to the three outcomes (keep both, keep the money and return the wallet, return both).

Coding the Problem

The source code for the wallet problem may be found in the file src/WalletProblem.java.

In order to train a logistic regression model, LingPipe requires the inputs to be coded as instances of matrix.Vector and outputs to be coded as integers. These are presented as parallel arrays of vectors and output integers.

To keep things simple, the outputs and inputs are coded directly as static constants. Here are the outputs:

    static final int[] OUTPUTS = new int[] {
        1,
        1,
        2,
        2,
        0,
        ...
    };

The inputs are coded as dense vector instances:

    static final Vector[] INPUTS = new Vector[] {
        new DenseVector(new double[] { 1, 0, 0, 2, 0 }),
        new DenseVector(new double[] { 1, 0, 0, 2, 1 }),
        new DenseVector(new double[] { 1, 0, 0, 1, 1 }),
        new DenseVector(new double[] { 1, 0, 0, 2, 0 }),
        new DenseVector(new double[] { 1, 1, 0, 1, 1 }),
        ...
    };

Note how these two parallel arrays directly encode the sample data as presented in the previous table.

Estimating the Regression Coefficients

Running the code using the ant target wallet prints out the estimated regression coefficients:

> ant wallet

Computing Wallet Problem Logistic Regression
Outcome=0  -3.47   1.27   1.18   1.08  -1.60
Outcome=1  -1.29   1.17   0.42   0.20  -0.80

An estimated model consists of a sequence of weight vectors, one fewer than the number of output categories. We don't need a vector for the last category, because it can be taken to be zero without loss of generality (see Carpenter 2008).

Implicit Coefficients for Final Outcome

The final outcome has all zero coefficients. Filling in for the wallet example, this gives us:

Outcome=2   0.00   0.00   0.00   0.00   0.00

These are not printed as part of the model output.
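
The reason the last vector can be fixed at zero can be sketched with a short derivation. This is a standard identity rather than anything specific to LingPipe, and the subscripted symbols are just the β[i] of the model definition, here assuming an unconstrained model with one weight vector for each of the k categories. Multiplying numerator and denominator by exp(-β[k-1] * x) leaves every probability unchanged:

```latex
\begin{align*}
p(c \mid x)
  &= \frac{\exp(\beta_c \cdot x)}{\sum_{i < k} \exp(\beta_i \cdot x)} \\
  &= \frac{\exp(\beta_c \cdot x)\,\exp(-\beta_{k-1} \cdot x)}
          {\sum_{i < k} \exp(\beta_i \cdot x)\,\exp(-\beta_{k-1} \cdot x)} \\
  &= \frac{\exp\bigl((\beta_c - \beta_{k-1}) \cdot x\bigr)}
          {\sum_{i < k} \exp\bigl((\beta_i - \beta_{k-1}) \cdot x\bigr)}
\end{align*}
```

Subtracting β[k-1] from every weight vector therefore gives an equivalent model whose last vector is all zeros, which is why only k-1 vectors need to be estimated and printed.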

Interpreting the Regression Coefficients

Because logistic regression involves a simple linear predictor, the regression coefficients may be interpreted fairly directly.

Intercept

The values of the intercept parameter are -3.47 for outcome 0 (keep both), -1.29 for outcome 1 (keep money, return wallet), and implicitly 0.0 for outcome 2 (return both). Because the intercept feature is always 1.0, the linear predictors for outcomes 0, 1 and 2 start off on an uneven footing. For the keep-both outcome to become more likely than return-both, the other features must together contribute more than 3.47 to its linear predictor.

Computing Probabilities

The main point of fitting a model is to be able to interpret probabilities for events. For instance, take male business students who were punished at all three levels without explanation. That provides an input vector of (1,1,1,3,0). The unnormalized probabilities work out as:

p(keep-both|1,1,1,3,0)   ∝ exp(-3.47*1 + 1.27*1 + 1.18*1 + 1.08*3 + -1.60*0)
                         = exp(2.22) = 9.21

p(keep-money|1,1,1,3,0)  ∝ exp(-1.29*1 + 1.17*1 + 0.42*1 + 0.20*3 + -0.80*0)
                         = exp(0.90) = 2.46

p(return-both|1,1,1,3,0) ∝ exp(0.0*1 + 0.0*1 + 0.0*1 + 0.0*3 + 0.0*0)
                         = exp(0) = 1

Division by the sum of exponentiated linear predictors yields the probabilities:

p(keep-both|1,1,1,3,0)   = 9.21 / (9.21 + 2.46 + 1) = 0.73
p(keep-money|1,1,1,3,0)  = 2.46 / (9.21 + 2.46 + 1) = 0.19
p(return-both|1,1,1,3,0) = 1.00 / (9.21 + 2.46 + 1) = 0.08

If we repeat this exercise for women (second feature = 0), the linear predictors are 0.95 and -0.27, and we get:

p(keep-both|1,0,1,3,0)   = 2.59 / (2.59 + 0.76 + 1.00) = 0.60
p(keep-money|1,0,1,3,0)  = 0.76 / (2.59 + 0.76 + 1.00) = 0.17
p(return-both|1,0,1,3,0) = 1.00 / (2.59 + 0.76 + 1.00) = 0.23

According to this model, among business students punished at all three levels without explanation, keeping both the money and the wallet is the most likely response for men and women alike, but men are more likely than women to keep both (0.73 versus 0.60), and women are nearly three times as likely as men to return both (0.23 versus 0.08).
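
This kind of calculation can be sketched in a few lines of plain Java, independent of LingPipe. The class and method names are illustrative only, and the coefficients are the rounded values printed by the wallet demo, so results will agree with a fitted model only to about two decimal places. The input here is a female business student punished at all three levels with explanation, (1,0,1,3,1).

```java
final class WalletSoftmax {

    // Rounded coefficients as printed by the wallet demo. The last row is
    // the implicit all-zero vector for the final outcome.
    static final double[][] BETAS = {
        { -3.47, 1.27, 1.18, 1.08, -1.60 },  // outcome 0: keep both
        { -1.29, 1.17, 0.42, 0.20, -0.80 },  // outcome 1: keep money, return wallet
        {  0.00, 0.00, 0.00, 0.00,  0.00 }   // outcome 2: return both
    };

    // Inner product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i)
            sum += a[i] * b[i];
        return sum;
    }

    // Exponentiate each linear predictor, then normalize by the total.
    static double[] probabilities(double[][] betas, double[] x) {
        double[] probs = new double[betas.length];
        double total = 0.0;
        for (int c = 0; c < betas.length; ++c) {
            probs[c] = Math.exp(dot(betas[c], x));
            total += probs[c];
        }
        for (int c = 0; c < probs.length; ++c)
            probs[c] /= total;
        return probs;
    }

    public static void main(String[] args) {
        double[] x = { 1, 0, 1, 3, 1 };
        double[] p = probabilities(BETAS, x);
        for (int c = 0; c < p.length; ++c)
            System.out.printf("p(%d|input)=%4.2f%n", c, p[c]);
    }
}
```

For this input the three probabilities come out at roughly 0.28, 0.18 and 0.54. For inputs with very large linear predictors, a production implementation would typically subtract the maximum predictor before exponentiating to avoid overflow; the sketch omits that step for clarity.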

Code Walk Through

The code is all in the main() method. The estimation is done with the following one-liner, with the imports from the stats package listed:

import com.aliasi.stats.AnnealingSchedule;
import com.aliasi.stats.LogisticRegression;
import com.aliasi.stats.RegressionPrior;

public static void main(String[] args) {
    LogisticRegression regression
        = LogisticRegression.estimate(INPUTS,
                                      OUTPUTS,
                                      RegressionPrior.noninformative(),
                                      AnnealingSchedule.inverse(.05,100),
                                      0.000000001, // min improve
                                      1, // min epochs
                                      10000, // max epochs
                                      null);  // no print feedback

The parameters to the estimate() method involve the inputs and outputs, model prior hyperparameter, parameters to control the search, and a progress monitor parameter.

Input Vectors and Output Categories

The first two arguments to the estimate() method are just the parallel arrays of input vectors and output categories; these were defined in the previous section.

Prior Hyperparameter

The hyperparameter controlling fitting in the model is the prior. The priors are defined in the stats.RegressionPrior class. This example uses a so-called noninformative prior, the upshot of which is that the estimate will be a maximum likelihood estimate (the parameters which assign the highest likelihood to the entire corpus of training data). We consider other priors in the next section.

Search Parameters

The remaining parameters all control the search for the estimate. Logistic regression has no analytic solution, so estimating parameters from data requires numerical optimization. LingPipe employs stochastic gradient descent (SGD), a general, highly-scalable online optimization algorithm. SGD makes several passes through the data, adjusting the parameters a little bit based on examples one at a time.
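
To make the idea concrete, here is a minimal sketch of stochastic gradient descent for the two-category case with no prior. It is not LingPipe's implementation: the class name, the toy data and the decaying learning rate are all invented for illustration. Following the convention above, category 0 gets a weight vector and category 1 is the implicit zero vector.

```java
final class TinySgd {

    // Toy data (hypothetical): column 0 is the intercept, column 1 one predictor.
    static final double[][] XS = {
        { 1, -2 }, { 1, -1 }, { 1, 0 }, { 1, 0.5 }, { 1, -0.5 }, { 1, 1 }, { 1, 2 }
    };
    static final int[] YS = { 0, 0, 0, 0, 1, 1, 1 };

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i)
            sum += a[i] * b[i];
        return sum;
    }

    // p(0 | x) for the binary model; p(1 | x) is one minus this.
    static double probCategory0(double[] beta, double[] x) {
        return 1.0 / (1.0 + Math.exp(-dot(beta, x)));
    }

    // Log likelihood of the whole toy corpus under the given weights.
    static double logLikelihood(double[] beta) {
        double ll = 0.0;
        for (int i = 0; i < XS.length; ++i) {
            double p0 = probCategory0(beta, XS[i]);
            ll += Math.log(YS[i] == 0 ? p0 : 1.0 - p0);
        }
        return ll;
    }

    // One small gradient step per training example, several passes (epochs).
    static double[] train(int epochs, double initialRate) {
        double[] beta = new double[2];
        for (int epoch = 0; epoch < epochs; ++epoch) {
            double rate = initialRate / (1.0 + epoch / 100.0);  // shrink over time
            for (int i = 0; i < XS.length; ++i) {
                double target = (YS[i] == 0) ? 1.0 : 0.0;
                double error = target - probCategory0(beta, XS[i]);
                for (int d = 0; d < beta.length; ++d)
                    beta[d] += rate * error * XS[i][d];
            }
        }
        return beta;
    }

    public static void main(String[] args) {
        System.out.println("log likelihood before: " + logLikelihood(new double[2]));
        System.out.println("log likelihood after:  " + logLikelihood(train(200, 0.1)));
    }
}
```

Each update moves the weights in the direction that raises the probability of the observed category for one example, which is why the corpus log likelihood after training is higher than at the all-zero starting point.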

The first search parameter is the annealing schedule. Annealing the learning rate is a widely used technique in numerical optimization. It involves starting with large learning rates and gradually reducing the activity of the learner over time. The annealing schedule used in this demo is the inverse schedule constructed by AnnealingSchedule.inverse(.05,100) in the code above, in which the learning rate shrinks roughly in inverse proportion to the number of epochs completed. The parameters 0.05 and 100 are the initial learning rate and the annealing rate. There is more information about annealing in the class documentation for stats.AnnealingSchedule.
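
As a rough illustration of how such schedules behave, the sketch below computes learning rates for an exponential and an inverse schedule. The formulas are assumptions made for this example (rate = initial * base^epoch, and rate = initial / (1 + epoch / annealingRate)); the class documentation for stats.AnnealingSchedule is the authority on what LingPipe actually computes. The class and method names are hypothetical.

```java
final class AnnealingSketch {

    // Assumed exponential schedule: rate decays by a constant factor per epoch.
    static double exponentialRate(double initialRate, double base, int epoch) {
        return initialRate * Math.pow(base, epoch);
    }

    // Assumed inverse schedule: rate is halved once epoch equals annealingRate.
    static double inverseRate(double initialRate, double annealingRate, int epoch) {
        return initialRate / (1.0 + epoch / annealingRate);
    }

    public static void main(String[] args) {
        int[] epochs = { 0, 1, 10, 100, 1000 };
        for (int epoch : epochs) {
            System.out.println("epoch=" + epoch
                + " exponential(0.002,0.9975)=" + exponentialRate(0.002, 0.9975, epoch)
                + " inverse(0.05,100)=" + inverseRate(0.05, 100.0, epoch));
        }
    }
}
```

Under these assumptions the exponential schedule with initial rate 0.002 and base 0.9975 gives 0.001995 after one epoch, while the inverse schedule with initial rate 0.05 and annealing rate 100 has dropped to 0.025 by epoch 100.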

The second search parameter, 0.000000001, indicates how tight the estimate must be before stopping the search. This is measured in relative corpus log likelihood. That is, if the corpus log likelihood improves by less than 0.0000001 percent over an epoch (one run through all the input/output pairs), the search is terminated.

The third and fourth search parameters, 1 and 10000, indicate the minimum and maximum number of epochs, that is, the number of times each training example is visited.

Progress Monitor Parameter

The final parameter, null here, is for a java.io.PrintWriter to which feedback about the progress of the search will be printed. A standard value for this would be new PrintWriter(System.out), which would provide progress reports to standard output.

Applying a Trained Model

Once a regression model is trained, it may be used to probabilistically classify new vectors of the same dimensionality as the training data.

The sample code for the wallet problem goes on to classify some randomly generated inputs:

...
Input Vector        Outcome Conditional Probabilities
1.0 0.0 0.0 1.0 1.0  p(0|input)=0.02  p(1|input)=0.13  p(2|input)=0.86
1.0 0.0 1.0 0.0 0.0  p(0|input)=0.07  p(1|input)=0.28  p(2|input)=0.66
1.0 0.0 1.0 3.0 1.0  p(0|input)=0.28  p(1|input)=0.18  p(2|input)=0.54

The third input represents a female business student who was physically punished through high school, with explanation. The model predicts she is 28 percent likely to keep the wallet and money, and only 54 percent likely to return both.

The code to compute the outcome probabilities just feeds the input vectors to the regression model to produce an array of output conditional probabilities (omitting some of the print statements):

    for (Vector testCase : TEST_INPUTS) {
        double[] conditionalProbs = regression.classify(testCase);
        for (int i = 0; i < testCase.numDimensions(); ++i)
            System.out.printf("%3.1f ",testCase.value(i));
        for (int k = 0; k < conditionalProbs.length; ++k)
            System.out.printf(" p(%d|input)=%4.2f ",k,conditionalProbs[k]);
     }

The variable TEST_INPUTS is an array of vector objects, of the same format as the training inputs array. The key method call is regression.classify(testCase), which applies the trained regression model to classify a test case. The rest just goes through the output and prints it out in a readable fashion.
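
When only a single answer is wanted rather than the full distribution, the first-best category is simply the index with the highest conditional probability. A minimal sketch, using a hypothetical helper that is not part of LingPipe and the probabilities printed for the third test input:

```java
final class FirstBest {

    // Returns the index of the largest entry; ties go to the lowest index.
    static int firstBest(double[] conditionalProbs) {
        int best = 0;
        for (int k = 1; k < conditionalProbs.length; ++k)
            if (conditionalProbs[k] > conditionalProbs[best])
                best = k;
        return best;
    }

    public static void main(String[] args) {
        double[] conditionalProbs = { 0.28, 0.18, 0.54 };  // third test input
        System.out.println("first-best outcome=" + firstBest(conditionalProbs));
    }
}
```

For the third test input this selects outcome 2 (return both), even though the model assigns it only a little over half of the probability mass.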

Regularization with Priors

Regression models have a tendency to overfit their training data, so priors are introduced to control the complexity of the fitted model.

The Overfitting Problem

Problems with Maximum Likelihood

Logistic regression models with large numbers of features and limited amounts of training data are highly prone to overfitting under maximum likelihood estimation. A model is overfit if it is a tight match to the training data but does not generalize well to new data. The model is called "overfit" because it is too closely tailored to the training data. The maximum likelihood estimation procedure is at the root of the problem, because it simply fits the training data as tightly as possible.

Linearly Separable Problems

A particularly pathological case of overfitting arises when the data is linearly separable. A simple case is when a feature value (dimension of the input) is positive if and only if the instance has one particular output. For instance, in a study of 195 students, it might have turned out that every male, and only males, kept the wallet and money. In this case, the coefficient for outcome 0 for the male feature will be unbounded; making it larger always increases the likelihood of the training data.
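
The unbounded coefficient can be seen numerically in a small sketch. The data below is invented: a single binary feature that is 1 exactly when the outcome is category 0, with a two-category model and no intercept. The class name is illustrative and nothing here is LingPipe code.

```java
final class SeparableDemo {

    // Hypothetical separable data: the feature fires if and only if outcome is 0.
    static final double[] XS = { 1, 1, 1, 0, 0, 0 };
    static final int[] YS = { 0, 0, 0, 1, 1, 1 };

    // log p(0 | x) = -log(1 + exp(-beta * x)), computed stably with log1p.
    static double logProbCategory0(double beta, double x) {
        return -Math.log1p(Math.exp(-beta * x));
    }

    // log p(1 | x) = -log(1 + exp(beta * x)).
    static double logProbCategory1(double beta, double x) {
        return -Math.log1p(Math.exp(beta * x));
    }

    // Log likelihood of the toy corpus for a single coefficient beta.
    static double logLikelihood(double beta) {
        double ll = 0.0;
        for (int i = 0; i < XS.length; ++i)
            ll += (YS[i] == 0)
                ? logProbCategory0(beta, XS[i])
                : logProbCategory1(beta, XS[i]);
        return ll;
    }

    public static void main(String[] args) {
        double[] betas = { 0.0, 1.0, 10.0, 100.0 };
        for (double beta : betas)
            System.out.println("beta=" + beta + " logLikelihood=" + logLikelihood(beta));
    }
}
```

Every increase in beta raises the log likelihood, so maximum likelihood estimation has no finite optimum on this data; a prior on the coefficient is what pulls the estimate back to a finite value.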

Priors on Coefficients

To compensate for the tendency of regression models to overfit, it is common to establish prior expectations for the values of parameters. These prior densities are designed to favor simple models. Simplicity for regression models means small regression coefficients, so the priors tend to concentrate parameters around zero. With smaller coefficients, the change in probability for a given change in an input dimension is less and thus the overall estimate is less variable.

Varieties of Priors

LingPipe implements three priors for regression: Cauchy (Student-t with one degree of freedom), Gaussian (normal), and Laplace (double exponential). Of the three, the Cauchy has the fattest tails; it is so dispersed, in fact, that it does not have a finite mean or variance. The Laplace distribution is so peaked around its mean that it tends to drive most posterior coefficient estimates to its mean.

Because we wish to push coefficients toward zero, we only consider priors with mean (or median in the case of the Cauchy) zero. The variance (or scale in the case of the Cauchy) determines how widely the distribution is spread, but the weight of the tails relative to the rest of the distribution is determined by the choice of prior family rather than by the variance.

Priors with means of zero exert a shrinkage effect on parameters relative to maximum likelihood estimates. Applying priors is thus sometimes called "shrinkage".
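
The differences among the three families can be explored with their textbook log densities. The sketch below uses the standard formulas for zero-centered Gaussian, Laplace and Cauchy distributions; it is an assumption that these parameterizations (in particular, Laplace variance equal to twice the squared scale) match what stats.RegressionPrior uses internally. The class and method names are hypothetical.

```java
final class PriorSketch {

    // Zero-mean Gaussian log density with the given variance.
    static double logGaussian(double beta, double variance) {
        return -0.5 * Math.log(2.0 * Math.PI * variance)
            - beta * beta / (2.0 * variance);
    }

    // Zero-mean Laplace log density; variance = 2 * scale^2.
    static double logLaplace(double beta, double variance) {
        double scale = Math.sqrt(variance / 2.0);
        return -Math.log(2.0 * scale) - Math.abs(beta) / scale;
    }

    // Zero-median Cauchy log density with the given scale.
    static double logCauchy(double beta, double scale) {
        double z = beta / scale;
        return -Math.log(Math.PI * scale * (1.0 + z * z));
    }

    public static void main(String[] args) {
        double[] betas = { 0.0, 1.0, 5.0 };
        for (double beta : betas)
            System.out.println("beta=" + beta
                + " gaussian=" + logGaussian(beta, 1.0)
                + " laplace=" + logLaplace(beta, 1.0)
                + " cauchy=" + logCauchy(beta, 1.0));
    }
}
```

With variance (or scale) 1, the Laplace has the highest density at zero, while far from zero the Cauchy assigns the most density and the Gaussian the least. A MAP estimate maximizes the data log likelihood plus the sum of these log densities over the coefficients, so a large coefficient is penalized most heavily by the Gaussian and least by the Cauchy.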

Running the Demo

The ant target regularization demonstrates regularization with priors over the wallet data.

> ant regularization

VARIANCE=0.0010

Prior=LaplaceRegressionPrior(Variance=0.0010, noninformativeIntercept=true)
0) -1.62,  0.00,  0.00,  0.00,  0.00,
1) -0.88,  0.00,  0.00,  0.00,  0.00,

Prior=GaussianRegressionPrior(Variance=0.0010, noninformativeIntercept=true)
0) -1.63, -0.00,  0.00,  0.01, -0.02,
1) -0.84,  0.03,  0.01,  0.02,  0.01,

Prior=CauchyRegressionPrior(Scale=0.0010, noninformativeIntercept=true)
0) -3.36,  0.00,  0.00,  1.08, -0.00,
1) -0.88,  0.01,  0.00,  0.01,  0.00,

...

VARIANCE=0.512

Prior=LaplaceRegressionPrior(Variance=0.512, noninformativeIntercept=true)
0) -3.00,  0.63,  0.57,  0.92, -0.91,
1) -1.12,  0.88,  0.03,  0.06, -0.39,

Prior=GaussianRegressionPrior(Variance=0.512, noninformativeIntercept=true)
0) -3.13,  0.75,  0.76,  0.96, -0.98,
1) -1.23,  0.90,  0.28,  0.15, -0.51,

Prior=CauchyRegressionPrior(Scale=0.512, noninformativeIntercept=true)
0) -3.14,  0.80,  0.77,  0.98, -1.12,
1) -1.26,  0.96,  0.23,  0.16, -0.50,

...

VARIANCE=524.288

Prior=LaplaceRegressionPrior(Variance=524.288, noninformativeIntercept=true)
0) -3.46,  1.24,  1.15,  1.08, -1.57,
1) -1.27,  1.17,  0.42,  0.19, -0.78,

Prior=GaussianRegressionPrior(Variance=524.288, noninformativeIntercept=true)
0) -3.48,  1.27,  1.17,  1.09, -1.60,
1) -1.28,  1.18,  0.43,  0.20, -0.79,

Prior=CauchyRegressionPrior(Scale=524.288, noninformativeIntercept=true)
0) -3.48,  1.27,  1.17,  1.09, -1.60,
1) -1.28,  1.18,  0.43,  0.20, -0.79,

With very low prior variance, as shown in the first example with a prior variance of 0.001, the coefficients are driven close to zero in the posterior. As the variance increases, the results get closer and closer to the maximum likelihood estimates, with only the Laplace prior still just barely shrinking a few parameters.

Also note that for a given variance (or scale for the Cauchy), the Cauchy exerts the least push toward zero and the Laplace the most push toward zero. In the natural language problems we consider in the next section, a fairly liberal Laplace prior still drives most posterior parameters to zero.

Code Walk Through

We return to the wallet example in a demo of the effects of regularization in src/RegularizationDemo.java. The main() method just loops over variances trying all the priors:

	for (double variance = 0.001; variance <= 1000; variance *= 2.0) {
	    System.out.println("\n\nVARIANCE=" + variance);
	    evaluate(RegressionPrior.laplace(variance,true));
	    evaluate(RegressionPrior.gaussian(variance,true));
	    evaluate(RegressionPrior.cauchy(variance,true));
	}

The evaluation program just fits a model and prints out the results, just as in the wallet example:

static void evaluate(RegressionPrior prior) {
    LogisticRegression regression
        = LogisticRegression.estimate(WalletProblem.INPUTS,
                                      WalletProblem.OUTPUTS,
                                      prior,
                                      AnnealingSchedule.inverse(.05,100),
                                      0.0000001,
                                      10,
                                      5000,
                                      null);
    Vector[] betas = regression.weightVectors();
    ...    

Feature Extractors and Text Classification

As implemented in the LingPipe stats package, logistic regression operates over input vectors, integer outcomes, and arrays of conditional probabilities. This is the basic material required to implement a classifier that produces conditional probability classifications.

The Logistic Regression Classifier

Several classes are involved in adapting the stats package logistic regression models to implementations of classifiers. First, a feature extractor is used to convert arbitrary objects into mappings from string-based features to values. Second, a symbol table converts these features into dimensions. Together, a feature extractor and symbol table support the conversion of arbitrary objects into vectors. Finally, another symbol table is used to convert the string-based category representations in the classification package into the integer representation required by the statistics package.

The class classify.LogisticRegressionClassifier handles all the details of this adaptation, as shown in the code examples below.
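
The conversion pipeline can be sketched with plain Java collections. This is a stand-in for illustration only: the class and method names are hypothetical, ordinary maps play the roles of LingPipe's feature extractor and symbol table, and the tokenizing pattern is the letter-or-digit-run regular expression used by the text classification demo.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FeatureSketch {

    // Runs of letters or runs of digits, as in the demo's tokenizer pattern.
    static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\d+");

    // Feature extraction: map each token to its count in the text.
    static Map<String,Double> extract(CharSequence text) {
        Map<String,Double> counts = new LinkedHashMap<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find())
            counts.merge(matcher.group(), 1.0, Double::sum);
        return counts;
    }

    // Symbol table growth: assign the next free dimension to each new feature.
    static void addSymbols(Map<String,Double> features, Map<String,Integer> symbols) {
        for (String feature : features.keySet())
            symbols.putIfAbsent(feature, symbols.size());
    }

    // Vector conversion: features missing from the symbol table are dropped.
    static double[] toVector(Map<String,Double> features, Map<String,Integer> symbols) {
        double[] vector = new double[symbols.size()];
        for (Map.Entry<String,Double> entry : features.entrySet()) {
            Integer dimension = symbols.get(entry.getKey());
            if (dimension != null)
                vector[dimension] = entry.getValue();
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String,Integer> symbols = new LinkedHashMap<>();
        addSymbols(extract("PC for sale, PC 300"), symbols);  // training text
        double[] vector = toVector(extract("drive for sale"), symbols);
        System.out.println(symbols + " -> " + java.util.Arrays.toString(vector));
    }
}
```

The symbol table is built from training text only, so a token such as "drive" that never appeared in training has no dimension and contributes nothing to the vector for a new document.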

Running the Demo

There's a simple demo implementation of natural language classification based on the 4-newsgroup data distributed with LingPipe and discussed in the Topic Classification Tutorial. The data consists of the bodies of messages posted to four easily confusable newsgroups, among them alt.atheism, misc.forsale and soc.religion.christian.

The demo is run with the ant target nl-topics:

> ant nl-topics

Reading data.
Num instances=250.
Permuting corpus.

EVALUATING FOLDS

Logistic Regression Progress Report
Number of dimensions=1462
Number of Outcomes=4
Number of Parameters=4386
Prior:
LaplaceRegressionPrior(Variance=0.5, noninformativeIntercept=true)
Annealing Schedule=Exponential(initialLearningRate=0.0020, base=0.9975)

Minimum Epochs=100
Maximum Epochs=1000
Minimum Improvement Per Period=1.0E-7
Has Sparse Inputs=true
Has Informative Prior=true
...

The first part of the output reports back on some of the parameters set in the code. For instance, there are 1462 unique feature dimensions and 4386 parameters (three weight vectors of 1462 dimensions each), the prior is a Laplace prior with variance 0.5 and a noninformative prior on the intercept, the annealing schedule is exponential, and so on.

The demo then provides feedback on the epoch-by-epoch progress of the stochastic gradient descent algorithm used for estimation.

...
epoch=    0 lr=0.002000000 ll=  -392.4239 lp=   -34.4435 llp=  -426.8675 llp*=  -426.8675       :00
epoch=    1 lr=0.001995000 ll=  -342.0040 lp=   -43.1246 llp=  -385.1286 llp*=  -385.1286       :00
epoch=    2 lr=0.001990013 ll=  -294.9343 lp=   -43.7030 llp=  -338.6373 llp*=  -338.6373       :00
epoch=    3 lr=0.001985037 ll=  -249.8116 lp=   -44.6577 llp=  -294.4693 llp*=  -294.4693       :00
...
epoch=  997 lr=0.000164891 ll=   -53.4495 lp=   -54.3246 llp=  -107.7740 llp*=  -107.7740       :15
epoch=  998 lr=0.000164478 ll=   -53.4494 lp=   -54.3239 llp=  -107.7732 llp*=  -107.7732       :15
epoch=  999 lr=0.000164067 ll=   -53.4492 lp=   -54.3232 llp=  -107.7725 llp*=  -107.7725       :15
...

In each epoch, the algorithm visits every training instance and adjusts each coefficient based on the current model and the training instance. The reports indicate the epoch number, the learning rate for that epoch (lr), the log likelihood of the data in the model (ll), the log prior probability of the current set of coefficients (lp), the sum of the two (llp) [note that this is just negative error], the best sum so far (llp*), and finally, the elapsed time, down to the second. In this case, estimation took 15 seconds, resulting in a log likelihood of -53.4, a log prior of -54.3, and a sum of -107.8.
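
The columns of the report can be checked against one another with a little arithmetic. In the sketch below, the numbers are copied from the last two epochs of the report; the definition of relative improvement is an assumption made for illustration and may differ from the one LingPipe applies, and the class and method names are hypothetical.

```java
final class ProgressCheck {

    // llp is reported as the sum of the log likelihood and the log prior.
    static double penalizedLogLikelihood(double ll, double lp) {
        return ll + lp;
    }

    // Assumed definition: gain in llp as a fraction of the previous magnitude.
    static double relativeImprovement(double previous, double current) {
        return (current - previous) / Math.abs(previous);
    }

    public static void main(String[] args) {
        double llpEpoch998 = -107.7732;
        double llpEpoch999 = -107.7725;
        System.out.println("ll + lp at epoch 999 = "
            + penalizedLogLikelihood(-53.4492, -54.3232));
        System.out.println("relative improvement = "
            + relativeImprovement(llpEpoch998, llpEpoch999));
    }
}
```

Under this definition the improvement in the final epoch is still on the order of several parts per million, well above the minimum improvement of 1.0E-7, which suggests the run ended because it reached the maximum of 1000 epochs rather than because it met the convergence threshold.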

After the feedback on estimation, the demo program prints out the features by name and their coefficient weights. In this instance, features are alphabetic or numeric tokens (or the intercept). Here are the top positive and negative coefficients for each category, as well as some zero coefficients from the first category, alt.atheism:

CLASSIFIER & FEATURES

NUMBER OF CATEGORIES=4
NUMBER OF FEATURES=1462

  CATEGORY=alt.atheism
                 Jim        0.542459
                some        0.519327
            atheists        0.454521
                  on        0.346886
                they        0.264611
               model        0.259512
                  at        0.233116
                  is        0.225087
             article        0.154355
                 ICO        0.153443
                 TEK        0.153443
                vice        0.153192
                  of        0.137563
                 The        0.136769
                 mcl        0.134621
            timmbake        0.134621
...
            approach       -0.000000
              causes       -0.000000
             equally       -0.000000
              happen       -0.000000
            ignoring       -0.000000
         immediately       -0.000000
               later       -0.000000
            provided       -0.000000
              regard       -0.000000
            separate       -0.000000
               small       -0.000000
              sounds       -0.000000
               stand       -0.000000
                week       -0.000000
             willing       -0.000000
               women       -0.000000
             America       -0.000000
            cultural       -0.000000
           disciples       -0.000000
            speaking       -0.000000
             implied       -0.000000
              debate       -0.000000
...
                   7       -0.179450
                that       -0.237454
                   2       -0.239784
                  do       -0.269343
              Mormon       -0.291933
                very       -0.293706
           Christian       -0.339846
                  ca       -0.374287
                  in       -0.425005
  *&^INTERCEPT%$^&**       -0.947079

Here the name "Jim" is the most positively indicative of the alt.atheism topic, and the intercept the most negative feature. For some reason the negative features include the words "in", "very" and "do", which seem unlikely features to discriminate atheism from other religious topics. This is a problem with unigram (single token) features: they are often perplexing. Features like "Christian" may show up as negative because the alt.atheism board may refer less to Christians (or Mormons) as a class.

Compare the alt.atheism topic with the misc.forsale topic, where words like "PC" or "sale" are strongly positive and again words like "Mormon" are negative, with some perplexing entries like "that".

  CATEGORY=misc.forsale
                 for        0.888025
                  PC        0.747638
               drive        0.700343
                  or        0.510954
                   2        0.377641
                 edu        0.282761
                 300        0.253986
                  on        0.251271
               would        0.194138
                sale        0.147936
                  00        0.103214
  *&^INTERCEPT%$^&**        0.083822
...
                  ca       -0.000098
                Book       -0.058855
                 the       -0.084477
                  In       -0.108700
                  of       -0.120941
              Mormon       -0.246687
                  to       -0.461844
                that       -0.726345
                  Re       -0.883255

Finally, here are the top positive and negative features for soc.religion.christian:

  CATEGORY=soc.religion.christian
             rutgers        1.068456
                life        0.763778
                 has        0.700812
                 May        0.676530
                Mary        0.565283
               athos        0.530967
                 who        0.354184
            doctrine        0.263468
            Orthodox        0.242234
             Trinity        0.193247
                   s        0.178808
              verses        0.142781
...
                NNTP       -0.026967
                  we       -0.046827
                 edu       -0.047974
                   A       -0.055054
              Mormon       -0.056953
                 The       -0.060776
              Robert       -0.077967
                 the       -0.079859
                were       -0.108695
                  ca       -0.120587
                  it       -0.156520
                 you       -0.157730
        Organization       -0.164869
                   a       -0.219583
           Christian       -0.294424
        Distribution       -0.389873
  *&^INTERCEPT%$^&**       -0.437988
             Posting       -0.470168
                Host       -0.486523

Oddly, this category has junk from the mail headers and signatures and what not, like "rutgers", "NNTP" and "Posting".

Variance of Coefficient Estimates

To get some feeling for the variability of the feature estimates, here are the top positive and negative features for the second fold of a four-way cross-validation of which the above reports the first fold:

CATEGORY=misc.forsale
               for        1.215978
             drive        0.748811
*&^INTERCEPT%$^&**        0.629428
                PC        0.555030
                on        0.525066
              sale        0.515190
                 2        0.479129
              Host        0.266548
           Posting        0.266545
...
               COM       -0.000069
              been       -0.000102
               Sun       -0.000180
               are       -0.034722
                in       -0.070332
               the       -0.114678
                of       -0.272626
                In       -0.276476
                to       -0.372242
                Re       -0.459132
               that       -0.806237

Code Walkthrough

The code for generating this demo is in src/TextClassificationDemo.java. First, it builds up a corpus instance just as in the cross-validation demo in the topic classification tutorial:

public static void main(String[] args) throws Exception {
    ...
    PrintWriter progressWriter = new PrintWriter(System.out,true);
    int numFolds = 4;
    XValidatingClassificationCorpus<CharSequence> corpus
        = ...
    corpus.permuteCorpus(new Random(7117)); // destroys runs of categories    

    TokenizerFactory tokenizerFactory
         = new RegExTokenizerFactory("\\p{L}+|\\d+"); // letter+ | digit+
    FeatureExtractor<CharSequence> featureExtractor
        = new TokenFeatureExtractor(tokenizerFactory);
    int minFeatureCount = 5;
    boolean addInterceptFeature = true;
    boolean noninformativeIntercept = true;
    double priorVariance = 0.5;
    RegressionPrior prior 
        = RegressionPrior.laplace(priorVariance,noninformativeIntercept);
    AnnealingSchedule annealingSchedule
         = AnnealingSchedule.exponential(0.002,0.9975);
    double minImprovement = 0.0000001;
    int minEpochs = 100;
    int maxEpochs = 1000;

    for (int fold = 0; fold < numFolds; ++fold) {
        corpus.setFold(fold);
        LogisticRegressionClassifier<CharSequence> classifier
            = LogisticRegressionClassifier.<CharSequence>train(featureExtractor,
                                                               corpus,
                                                               minFeatureCount,
                                                               addInterceptFeature,
                                                               prior,
                                                               annealingSchedule,
                                                               minImprovement,
                                                               minEpochs,
                                                               maxEpochs,
                                                               progressWriter);
    ...
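
The call to AnnealingSchedule.exponential(0.002,0.9975) above sets how the learning rate shrinks as training proceeds. The following is a minimal sketch of that decay, assuming the schedule multiplies the initial rate by the base once per epoch; the class and method names are hypothetical and are not part of LingPipe:

```java
class AnnealingSketch {

    // Assumed form of an exponential schedule:
    //   rate(epoch) = initialRate * base^epoch
    static double learningRate(double initialRate, double base, int epoch) {
        return initialRate * Math.pow(base, epoch);
    }

    public static void main(String[] args) {
        double initialRate = 0.002; // first argument in the demo
        double base = 0.9975;       // second argument in the demo
        // Print the rate at a few epochs between minEpochs and maxEpochs.
        for (int epoch : new int[] {0, 100, 500, 1000}) {
            System.out.printf("epoch=%4d  rate=%.6f%n",
                              epoch, learningRate(initialRate, base, epoch));
        }
    }
}
```

Under this assumption, a base just below 1 makes the rate fall slowly, so later epochs take smaller steps than earlier ones.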

The training method for the logistic regression classifier is almost identical to the static method for estimating logistic regression models in the stats package. The main difference is that the classifier requires an instance of util.FeatureExtractor. The feature extractor interface defines a single method:

public interface FeatureExtractor<E> {
    public Map<String,? extends Number> features(E in);
}

In the code above, we use a pre-built adapter that converts a tokenizer factory into a feature extractor. The tokenizer.TokenFeatureExtractor is constructed with a tokenizer factory, which it then uses to tokenize the character sequences it receives. The resulting map sends each token to the number of times it occurs in the input.
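
To make the token-count mapping concrete, here is a self-contained sketch that does not depend on LingPipe. It declares a local stand-in for the interface shown above and a hypothetical RegexTokenCounter that counts tokens matched by the same regular expression as the demo; it illustrates the idea and is not the library's implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenCountSketch {

    // Local stand-in mirroring the interface shown above (not the LingPipe type).
    interface FeatureExtractor<E> {
        Map<String,? extends Number> features(E in);
    }

    // Hypothetical extractor: each distinct token becomes a feature whose
    // value is the number of times the token occurs in the input.
    static class RegexTokenCounter implements FeatureExtractor<CharSequence> {
        // Same pattern as the demo: runs of letters or runs of digits.
        private final Pattern tokenPattern = Pattern.compile("\\p{L}+|\\d+");

        @Override
        public Map<String,Integer> features(CharSequence in) {
            Map<String,Integer> counts = new HashMap<>();
            Matcher matcher = tokenPattern.matcher(in);
            while (matcher.find()) {
                counts.merge(matcher.group(), 1, Integer::sum);
            }
            return counts;
        }
    }

    public static void main(String[] args) {
        FeatureExtractor<CharSequence> extractor = new RegexTokenCounter();
        // Tokens are case-sensitive, so "For" and "for" are separate features.
        System.out.println(extractor.features("For sale: 2 drives, for PC"));
    }
}
```

Because matching is case-sensitive, capitalized and lowercase forms of a word are distinct features, which is consistent with "The" and "the" appearing separately in the listings above.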

The classifier's own string representation prints the features for each category, ordered by coefficient:

        ...
        progressWriter.println("\nCLASSIFIER & FEATURES\n");
        progressWriter.println(classifier);
        ...

The evaluation is done in the usual way, by having the corpus walk the evaluator over the test section:

    ...
    progressWriter.println("\nEVALUATION\n");
    ClassifierEvaluator<CharSequence,ConditionalClassification> evaluator 
        = new ClassifierEvaluator<CharSequence,ConditionalClassification>(classifier,CATEGORIES);
    corpus.visitTest(evaluator);
    progressWriter.printf("FOLD=%5d  ACC=%4.2f  +/-%4.2f\n", 
                          fold,
                          evaluator.confusionMatrix().totalAccuracy(),
                          evaluator.confusionMatrix().confidence95());
}
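
The "+/-" figure printed next to the accuracy is a 95% confidence bound. The source does not spell out how it is computed; the sketch below assumes the usual normal approximation to the binomial, in which the half-width of the interval is 1.96 * sqrt(p * (1 - p) / n) for accuracy p over n test cases. The class name and the example numbers are hypothetical:

```java
class AccuracyIntervalSketch {

    // Half-width of a 95% interval for an accuracy estimate, assuming the
    // normal approximation to the binomial distribution.
    static double halfWidth95(double accuracy, int numCases) {
        return 1.96 * Math.sqrt(accuracy * (1.0 - accuracy) / numCases);
    }

    public static void main(String[] args) {
        double accuracy = 0.90; // hypothetical accuracy, not a result from the demo
        int numCases = 400;     // hypothetical number of test cases
        System.out.printf("ACC=%4.2f  +/-%6.4f%n",
                          accuracy, halfWidth95(accuracy, numCases));
    }
}
```

Under this approximation the bound shrinks with the square root of the test set size, so quadrupling the number of test cases halves the reported interval.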