Introduction

What is Logistic Regression?

Logistic regression is a discriminative probabilistic classification model that operates over real-valued vector inputs. The dimensions of the input vectors being classified are called "features" and there is no restriction against them being correlated. Logistic regression is one of the best probabilistic classifiers, measured in both log loss and first-best classification accuracy across a number of tasks.

The logistic regression implementation in LingPipe provides multinomial classification; that is, it allows more than two possible output categories.

The main drawback of logistic regression is that it's relatively slow to train compared to the other LingPipe classifiers. It also requires extensive tuning, in the form of feature selection and parameter adjustment, to achieve state-of-the-art classification performance.

What's in the Tutorial?

This tutorial covers both the vector-based implementation in the statistics package and the use of feature extractors for classifying arbitrary objects in the classification package. The tutorial will cover basic estimation, the effects of different choices and parameterizations of priors, and tuning the estimator's search.

The tutorial will also cover the basics of feature-based classification in LingPipe. Feature extractors convert arbitrary objects into feature vectors, which may then be converted to actual vectors for use in logistic regression.

Also Known As (AKA)

For the sake of terminological clarity (and search engine optimization), here are some aliases for multinomial logistic regression.

Polytomous Logistic Regression

Multinomial logistic regression is also known as polytomous, polychotomous, or multi-class logistic regression, or just multilogit regression.

Maximum Entropy Classifier

Logistic regression estimation obeys the maximum entropy principle, and thus logistic regression is sometimes called "maximum entropy modeling", and the resulting classifier the "maximum entropy classifier".

Neural Network: Classification with a Single Neuron

Binary logistic regression is equivalent to a one-layer, single-output neural network with a logistic activation function trained under log loss. This is sometimes called classification with a single neuron.

LingPipe's stochastic gradient descent is equivalent to a stochastic back-propagation algorithm over the single-output neural network.

Ridge Regression and the Lasso

Maximum a posteriori (MAP) estimation with Gaussian priors is often referred to as "ridge regression"; with Laplace priors MAP estimation is known as the "lasso".

Shrinkage and Regularized Regression

MAP estimation with Gaussian, Laplace or Cauchy priors is known as parameter shrinkage.

Gaussian and Laplace priors are equivalent to regularized regression, with the Gaussian version being regularized with the L2 norm (Euclidean distance, called the Frobenius norm for matrices of parameters) and the Laplace version being regularized with the L1 norm (taxicab distance or Manhattan metric); other Minkowski metrics may be used for shrinkage.

Generalized Linear Model and Softmax

Logistic regression is a generalized linear model with the logit link function. The multinomial logistic (inverse link) function is sometimes called softmax, and given its use of exponentiation to convert linear predictors to probabilities, the model is sometimes called an exponential model.

Logistic Regression Models

Logistic regression models provide multi-category classification in cases where the categories are exhaustive and mutually exclusive. That is, every instance belongs to exactly one category.

Inputs are coded as real-valued vectors of a fixed dimensionality. The dimensions are often called predictors or features. There is no requirement that they be independent, and with regularization, they may even be highly or fully linearly correlated.

The model consists of one parameter vector per category, each of the same dimensionality as the inputs. The last category does not get a parameter vector; or equivalently, it gets a constant zero parameter vector.

More formally, if the inputs are of dimension d and there are k categories, the model consists of k-1 vectors β[0],...,β[k-2], each of dimension d. Then for a given input vector x of dimensionality d, the conditional probability of a category given the input is proportional to the exponentiated linear predictor:

p(0 | x) ∝ exp(β[0] * x)
p(1 | x) ∝ exp(β[1] * x)
...
p(k-2 | x) ∝ exp(β[k-2] * x)
p(k-1 | x) ∝ exp(0 * x)

Normalizing by the sum of the exponentiated linear predictors yields the probability estimates:

p(0 | x) = exp(β[0]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
p(1 | x) = exp(β[1]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
...
p(k-2 | x) = exp(β[k-2]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
p(k-1 | x) = exp(0*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))

Writing it out in summation notation, for c < k-1:

p(c | x) = exp(β[c] * x) / (1 + Σ_{i < k-1} exp(β[i] * x))

and for c = k-1:

p(k-1 | x) = 1 / (1 + Σ_{i < k-1} exp(β[i] * x))
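
As a concrete illustration, here is a small, self-contained Java sketch of this computation. The method and variable names are ours, purely for illustration; this is not part of the LingPipe API.

static double[] conditionalProbs(double[][] beta, double[] x) {
    int numCategories = beta.length + 1;       // last category has an implicit zero vector
    double[] exps = new double[numCategories];
    double sum = 0.0;
    for (int c = 0; c < numCategories; ++c) {
        double linearPredictor = 0.0;          // stays 0 * x for the last category
        if (c < numCategories - 1)
            for (int i = 0; i < x.length; ++i)
                linearPredictor += beta[c][i] * x[i];
        exps[c] = Math.exp(linearPredictor);
        sum += exps[c];
    }
    for (int c = 0; c < numCategories; ++c)
        exps[c] /= sum;                        // normalize to probabilities summing to 1
    return exps;
}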

Example of Logistic Regression

Logistic regression models are estimated from training data consisting of a sequence of vectors and their reference categories. The vectors are arbitrary, with their dimensions representing features of the input objects being classified. The categories are discrete, and should be numbered contiguously from 0 to the number of categories minus one.

The Wallet Problem

The first example we consider is drawn from chapter 5 of (Allison 1999).

The data is based on a survey of 195 undergraduates, and attempts to predict their answer to the question "If you found a wallet on the street, would you...", with the following possible responses:

Wallet Problem Outcomes
    Outcome   Description
    0         keep both
    1         keep the money, return the wallet
    2         return both

The input vectors are five dimensional, consisting of the following features, the descriptions of which are directly transcribed from (Allison 1999):

Wallet Problem Predictors
    Dimension   Description   Values
    0           Intercept     1: always
    1           Male          1: male
                              0: female
    2           Business      1: enrolled in business school
                              0: not enrolled in business school
    3           Punish        Variable describing whether the student was physically punished
                              by parents at various ages:
                              1: punished in elementary school, but not in middle or high school
                              2: punished in elementary and middle school, but not in high school
                              3: punished at all three levels
    4           Explain       Response to the question "When you were punished, did your parents
                              generally explain why what you did was wrong?"
                              1: almost always
                              0: sometimes or never

LingPipe requires an explicit representation of the intercept feature, which is implicit in (Allison 1999). The intercept is treated just like other features, but is assumed to take on value 1.0 in all inputs. Thus it provides an input-independent bias term for estimation. Most problems benefit from the addition of such an intercept feature.

Where do Features Come From?

The predictors in this problem are all discrete, most defining binary variables, with the physical punishment variable taking on three ordinal values. It is also possible to include continuous inputs, such as token counts in linguistic examples or features like petal width in flower species classification problems.

Here are the first few training examples out of the complete set of 195:

Wallet Problem Data (Sample)
    Outcome   Intercept   Male   Business   Punish   Explain
    1         1.0         0.0    0.0        2.0      0.0
    1         1.0         0.0    0.0        2.0      1.0
    2         1.0         0.0    0.0        1.0      1.0
    2         1.0         0.0    0.0        2.0      0.0
    0         1.0         1.0    0.0        1.0      1.0
    ...

For example, the third training instance represents a survey response for a woman (male=0.0) who is not in business school (business=0.0), who was punished only in elementary school (punish=1.0), had her punishment explained almost always (explain=1.0), and who said she'd return both the wallet and the money (outcome=2). The fifth training example represents a man who's not in business school, was punished only in elementary school, had his punishment explained, and answered that he would keep both the money and wallet.

Our first logistic regression model is estimated from these 195 training cases, yielding a classifier that, given the five input feature values (intercept, male, business, punish and explain), assigns probabilities to the three outcomes (keep both, keep the money but return the wallet, return both).

Coding the Problem

The source code for the wallet problem may be found in the file src/WalletProblem.java.

In order to train a logistic regression model, LingPipe requires the inputs to be coded as instances of matrix.Vector and outputs to be coded as integers. These are presented as parallel arrays of vectors and output integers.

To keep things simple, the outputs and inputs are coded directly as static constants. Here are the outputs:

    static final int[] OUTPUTS = new int[] {
        1,
        1,
        2,
        2,
        0,
        ...
    };

The inputs are coded as dense vector instances:

    static final Vector[] INPUTS = new Vector[] {
        new DenseVector(new double[] { 1, 0, 0, 2, 0 }),
        new DenseVector(new double[] { 1, 0, 0, 2, 1 }),
        new DenseVector(new double[] { 1, 0, 0, 1, 1 }),
        new DenseVector(new double[] { 1, 0, 0, 2, 0 }),
        new DenseVector(new double[] { 1, 1, 0, 1, 1 }),
        ...
    };

Note how these two parallel arrays directly encode the sample data as presented in the previous table.

Estimating the Regression Coefficients

Running the code using the ant target wallet prints out the estimated regression coefficients:

> ant wallet

Computing Wallet Problem Logistic Regression
Outcome=0  -3.47   1.27   1.18   1.08  -1.60
Outcome=1  -1.29   1.17   0.42   0.20  -0.80

An estimated model consists of a sequence of weight vectors, one for each output category except the last; that is, k-1 vectors for k categories. We don't need a vector for the last category, because it can be taken to be zero without loss of generality (see Carpenter 2008).

Implicit Coefficients for Final Outcome

The final outcome has all zero coefficients. Filling in for the wallet example, this gives us:

Outcome=2   0.00   0.00   0.00   0.00   0.00

These are not printed as part of the model output.
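
To work with the coefficients programmatically rather than reading them off the printed output, the fitted model exposes its weight vectors. A minimal sketch, assuming the regression object trained in the code walkthrough below:

    Vector[] betas = regression.weightVectors();
    for (int c = 0; c < betas.length; ++c) {
        System.out.printf("Outcome=%d ", c);
        for (int i = 0; i < betas[c].numDimensions(); ++i)
            System.out.printf("%6.2f ", betas[c].value(i));
        System.out.println();
    }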

Interpreting the Regression Coefficients

Because logistic regression involves a simple linear predictor, the regression coefficients may be interpreted fairly directly.

Intercept

The values of the intercept parameter are -3.47 for outcome 0 (keep both), -1.29 for outcome 1 (keep the money, return the wallet), and implicitly 0.0 for outcome 2 (return both). Because the intercept feature dimension is always 1.0, the linear predictors for outcomes 0, 1 and 2 start off on an uneven footing. To make the keep-both outcome most likely, another feature or combination of features will have to contribute more than 3.47 to its linear predictor.

Computing Probabilities

The main point of fitting a model is to be able to compute probabilities for new inputs. For instance, take male business students who were punished at all three levels without explanation. That provides an input vector of (1,1,1,3,0). The unnormalized probabilities work out as:

p(keep-both|1,1,1,3,0)   ∝ exp(-3.47*1 + 1.27*1 + 1.18*1 + 1.08*3 + -1.6*0)
                         = exp(-1.54) = 0.21

p(keep-money|1,1,1,3,0)  ∝ exp(-1.29*1 + 1.17*1 + 0.42*1 + 0.20*3 + -0.8*0)
                         = exp(-0.3) = 0.74

p(return-both|1,1,1,3,0) ∝ exp(0.0*1 + 0.0*1 + 0.0*1 + 0.0*3 + 0.0*0)
                         = exp(0) = 1

Division by the sum of exponentiated linear predictors yields the probabilities:

p(keep-both|1,1,1,3,0)   = 0.21 / (0.21 + 0.74 + 1) = 0.11
p(keep-money|1,1,1,3,0)  = 0.74 / (0.21 + 0.74 + 1) = 0.38
p(return-both|1,1,1,3,0) = 1.00 / (0.21 + 0.74 + 1) = 0.51

If we repeat this exercise for women (second feature = 0), we get:

p(keep-both|1,0,1,3,0)   = 0.30 / (0.30 + 0.51 + 1.00) = 0.16
p(keep-money|1,0,1,3,0)  = 0.51 / (0.30 + 0.51 + 1.00) = 0.28
p(return-both|1,0,1,3,0) = 1.00 / (0.30 + 0.51 + 1.00) = 0.55

According to this model, among business students punished at all three levels without explanation, women are less likely to waffle; they're more likely to keep both the money and the wallet and also more likely to return both than men, who are prone to keep the money and return the wallet.
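
Such probabilities can also be read directly off a trained model rather than computed by hand. A minimal sketch, assuming the regression object trained in the code walkthrough below; the input vector is the male case discussed above:

    Vector input = new DenseVector(new double[] { 1, 1, 1, 3, 0 });
    double[] conditionalProbs = regression.classify(input);
    for (int k = 0; k < conditionalProbs.length; ++k)
        System.out.printf("p(%d|input)=%4.2f%n", k, conditionalProbs[k]);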

Code Walk Through

The code is all in the main() method. The estimation is done with a single call to the static estimate() method, with the imports from the stats package listed first:

import com.aliasi.stats.AnnealingSchedule;
import com.aliasi.stats.LogisticRegression;
import com.aliasi.stats.RegressionPrior;

public static void main(String[] args) {
    LogisticRegression regression
        = LogisticRegression.estimate(INPUTS,
                                      OUTPUTS,
                                      RegressionPrior.noninformative(),
                                      AnnealingSchedule.inverse(.05,100),
                                      null, // null reporter        
                                      0.000000001, // min improve
                                      1, // min epochs
	                              10000); // max epochs

The parameters to the estimate() method comprise the input and output arrays, the model's prior hyperparameter, parameters controlling the search, and a progress reporter parameter.

Input Vectors and Output Categories

The first two arguments to the estimate() method are just the parallel arrays of input vectors and output categories; these were defined in the previous section.

Prior Hyperparameter

The prior is the hyperparameter controlling how tightly the model is fit to the training data. The priors are defined in the stats.RegressionPrior class. This example uses a so-called noninformative prior, the upshot of which is that the estimate will be a maximum likelihood estimate (the parameters that assign the highest likelihood to the entire corpus of training data). We consider other priors in the next section.
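
For reference, the priors used later in this tutorial are all constructed from static factory methods on RegressionPrior; the variance values below are just placeholders, not recommendations:

    RegressionPrior noninf = RegressionPrior.noninformative();  // maximum likelihood

    boolean noninformativeIntercept = true;  // give the intercept a noninformative prior
    RegressionPrior gaussian = RegressionPrior.gaussian(1.0, noninformativeIntercept);
    RegressionPrior laplace  = RegressionPrior.laplace(1.0, noninformativeIntercept);
    RegressionPrior cauchy   = RegressionPrior.cauchy(1.0, noninformativeIntercept);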

Search Parameters

The remaining parameters all control the search for the estimate. Logistic regression has no analytic solution, so estimating parameters from data requires numerical optimization. LingPipe employs stochastic gradient descent (SGD), a general, highly-scalable online optimization algorithm. SGD makes several passes through the data, adjusting the parameters a little bit based on examples one at a time.

The first search parameter is the annealing schedule. Simulated annealing is a widely used technique in numerical optimization. It involves starting with large learning rates and gradually reducing the activity of the learner over time. The annealing schedule used in this demo is an inverse schedule, meaning that the learning rate decays inversely with the number of epochs. The parameters 0.05 and 100 are the initial learning rate and a constant controlling how quickly the rate decays. There is more information about annealing in the class documentation for stats.AnnealingSchedule.
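
Both annealing schedules that appear in this tutorial are built from static factory methods; the following snippet simply restates the constructions used in the demos:

    // Inverse schedule used in the wallet demo: initial learning rate 0.05,
    // with the second argument controlling how quickly the rate decays.
    AnnealingSchedule inverseSchedule = AnnealingSchedule.inverse(0.05, 100);

    // Exponential schedule used in the text classification demo later:
    // the learning rate at epoch n is 0.002 * 0.9975^n.
    AnnealingSchedule exponentialSchedule = AnnealingSchedule.exponential(0.002, 0.9975);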

The second search parameter, 0.000000001, indicates how tight the estimate must be before stopping the search. This is measured in relative corpus log likelihood. That is, if the corpus log likelihood improves by a relative amount of less than 0.000000001 over an epoch (a run through all the input/output pairs), the search is terminated.

The third and fourth search parameters, 1 and 10000, indicate the minimum and maximum number of epochs, that is, the number of times each training example is visited.

Progress Monitor Parameter

The null parameter can optionally be replaced by a com.aliasi.io.Reporter to which feedback about the progress of the search will be written. A standard choice is to create a reporter using com.aliasi.io.Reporters.stdOut() to print to standard output.
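
For example, either of the following may be passed in place of the null (assuming imports of com.aliasi.io.Reporter, Reporters and LogLevel, and java.io.PrintWriter); the debug level shown is the one used in the DCA demo later in this tutorial:

    // Print progress of the search to standard output at debug granularity.
    Reporter reporter = Reporters.stdOut().setLevel(LogLevel.DEBUG);

    // Or direct the report to an existing PrintWriter.
    Reporter writerReporter = Reporters.writer(new PrintWriter(System.out, true));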

Applying a Trained Model

Once a regression model is trained, it may be used to probabilistically classify new vectors of the same dimensionality as the training data.

The sample code for the wallet problem goes on to classify some randomly generated data.

...
Input Vector        Outcome Conditional Probabilities
1.0 0.0 0.0 1.0 1.0  p(0|input)=0.02  p(1|input)=0.13  p(2|input)=0.86
1.0 0.0 1.0 0.0 0.0  p(0|input)=0.07  p(1|input)=0.28  p(2|input)=0.66
1.0 0.0 1.0 3.0 1.0  p(0|input)=0.28  p(1|input)=0.18  p(2|input)=0.54

The third input represents a female business student who was physically punished through high school and had the punishment explained. The model predicts she is 28% likely to keep the wallet and money, and only 54% likely to return both.

The code to compute the outcome probabilities just feeds the input vectors to the regression model to produce an array of output conditional probabilities (omitting some of the print statements):

    for (Vector testCase : TEST_INPUTS) {
        double[] conditionalProbs = regression.classify(testCase);
        for (int i = 0; i < testCase.numDimensions(); ++i)
            System.out.printf("%3.1f ",testCase.value(i));
        for (int k = 0; k < conditionalProbs.length; ++k)
            System.out.printf(" p(%d|input)=%4.2f ",k,conditionalProbs[k]);
     }

The variable TEST_INPUTS is an array of vector objects, of the same format as the training inputs array. The key method call is regression.classify(testCase), which applies the trained regression model to classify a test case. The rest just walks over the output and prints it in a readable fashion.

Regularization with Priors

Regression models have a tendency to overfit their training data, so priors are introduced to control the complexity of the fitted model.

The Overfitting Problem

Problems with Maximum Likelihood

Logistic regression models with large numbers of features and limited amounts of training data are highly prone to overfitting under maximum likelihood estimation. A model is called "overfit" if it is so closely tailored to the training data that it matches it tightly but does not generalize well to new data. The maximum likelihood estimation procedure is at the root of the problem, because it simply fits the training data as tightly as possible.

Linearly Separable Problems

A particularly pathological case of overfitting arises when the data is linearly separable. A simple case is when a feature value (dimension of the input) is positive only for instances of a single output category. For instance, in a study of 195 students, it might have turned out that every male kept the wallet and money. In this case, the coefficient for the male feature for outcome 0 is unbounded; making it larger always increases the likelihood of the training data.

Priors on Coefficients

To compensate for the tendency of regression models to overfit, it is common to establish prior expectations for the values of parameters. These prior densities are designed to favor simple models. Simplicity for regression models means small regression coefficients, so the priors tend to concentrate parameters around zero. With smaller coefficients, the change in probability for a given change in an input dimension is less and thus the overall estimate is less variable.

Varieties of Priors

LingPipe implements three priors for regression: Gaussian (normal), Laplace (double exponential), and Cauchy (Student-t with one degree of freedom). The priors are listed here in order of how fat their tails are, from thinnest to fattest. The Cauchy distribution is so dispersed, in fact, that it does not have a finite mean or variance. The Laplace distribution is so sharply peaked around its mean that it tends to drive most posterior coefficient estimates all the way to that mean.

Because we wish to push coefficients toward zero, we only consider priors with mean (or median in the case of the Cauchy) zero. The variance (or scale in the case of the Cauchy) determines how dispersed the distribution is: the lower the variance, the more strongly the coefficients are pulled toward zero.

Priors with means of zero exert a shrinkage effect on parameters relative to maximum likelihood estimates. Applying priors is thus sometimes called "shrinkage".

Running the Demo

The ant target regularization demonstrates regularization with priors over the wallet data.

> ant regularization

VARIANCE=0.0010

Prior=LaplaceRegressionPrior(Variance=0.0010, noninformativeIntercept=true)
0) -1.62,  0.00,  0.00,  0.00,  0.00,
1) -0.88,  0.00,  0.00,  0.00,  0.00,

Prior=GaussianRegressionPrior(Variance=0.0010, noninformativeIntercept=true)
0) -1.63, -0.00,  0.00,  0.01, -0.02,
1) -0.84,  0.03,  0.01,  0.02,  0.01,

Prior=CauchyRegressionPrior(Scale=0.0010, noninformativeIntercept=true)
0) -3.36,  0.00,  0.00,  1.08, -0.00,
1) -0.88,  0.01,  0.00,  0.01,  0.00,

...

VARIANCE=0.512

Prior=LaplaceRegressionPrior(Variance=0.512, noninformativeIntercept=true)
0) -3.00,  0.63,  0.57,  0.92, -0.91,
1) -1.12,  0.88,  0.03,  0.06, -0.39,

Prior=GaussianRegressionPrior(Variance=0.512, noninformativeIntercept=true)
0) -3.13,  0.75,  0.76,  0.96, -0.98,
1) -1.23,  0.90,  0.28,  0.15, -0.51,

Prior=CauchyRegressionPrior(Scale=0.512, noninformativeIntercept=true)
0) -3.14,  0.80,  0.77,  0.98, -1.12,
1) -1.26,  0.96,  0.23,  0.16, -0.50,

...

VARIANCE=524.288

Prior=LaplaceRegressionPrior(Variance=524.288, noninformativeIntercept=true)
0) -3.46,  1.24,  1.15,  1.08, -1.57,
1) -1.27,  1.17,  0.42,  0.19, -0.78,

Prior=GaussianRegressionPrior(Variance=524.288, noninformativeIntercept=true)
0) -3.48,  1.27,  1.17,  1.09, -1.60,
1) -1.28,  1.18,  0.43,  0.20, -0.79,

Prior=CauchyRegressionPrior(Scale=524.288, noninformativeIntercept=true)
0) -3.48,  1.27,  1.17,  1.09, -1.60,
1) -1.28,  1.18,  0.43,  0.20, -0.79,

With very low prior variance, as shown in the first example with a prior variance of 0.001, the coefficients are driven close to zero in the posterior. As the variance increases, the results get closer and closer to the maximum likelihood estimates, with the Laplace prior only just barely shrinking a few parameters.

Also note that for a given variance (or scale for the Cauchy), the Cauchy exerts the least push toward zero and the Laplace the most push toward zero. In the natural language problems we consider in the next section, a fairly liberal Laplace prior still drives most posterior parameters to zero.

Code Walk Through

We return to the wallet example in a demo of the effects of regularization in src/RegularizationDemo.java. The main() method just loops over variances trying all the priors:

	for (double variance = 0.001; variance <= 1000; variance *= 2.0) {
	    System.out.println("\n\nVARIANCE=" + variance);
	    evaluate(RegressionPrior.laplace(variance,true));
	    evaluate(RegressionPrior.gaussian(variance,true));
	    evaluate(RegressionPrior.cauchy(variance,true));
	}

The evaluate() method just fits a model and prints out the results, just as in the wallet example:

static void evaluate(RegressionPrior prior) {
    LogisticRegression regression
        = LogisticRegression.estimate(WalletProblem.INPUTS,
                                      WalletProblem.OUTPUTS,
                                      prior,
                                      AnnealingSchedule.inverse(.05,100),
                                      null,
                                      0.0000001,
                                      10,
                                      5000);
    Vector[] betas = regression.weightVectors();
    ...    

Feature Extractors and Text Classification

As implemented in the LingPipe stats package, logistic regression operates over input vectors, integer outcomes, and arrays of conditional probabilities. This is the basic material required to implement a classifier that produces conditional probability classifications.

The Logistic Regression Classifier

Several classes are involved in adapting the stats package logistic regression models to implementations of classifiers. First, a feature extractor is used to convert arbitrary objects into mappings from string-based features to values. Second, a symbol table converts these features into dimensions. Together, a feature extractor and symbol table support the conversion of arbitrary objects into vectors. Finally, another symbol table is used to convert the string-based category representations in the classification package into the integer representation required by the statistics package.

The class classify.LogisticRegressionClassifier handles all the details of this adaptation, as shown in the code examples below.

Running the Demo

There's a simple demo implementation of natural language classification based on the 4-newsgroup data distributed with LingPipe and discussed in the Topic Classification Tutorial. The data is the bodies of messages to four easily confusable newsgroups.

The demo is run with the ant target nl-topics:

> ant nl-topics

Reading data.
Num instances=250.
Permuting corpus.

EVALUATING FOLDS

Logistic Regression Progress Report
Number of dimensions=1462
Number of Outcomes=4
Number of Parameters=4386
Prior:
LaplaceRegressionPrior(Variance=0.5, noninformativeIntercept=true)
Annealing Schedule=Exponential(initialLearningRate=0.0020, base=0.9975)

Minimum Epochs=100
Maximum Epochs=1000
Minimum Improvement Per Period=1.0E-7
Has Sparse Inputs=true
Has Informative Prior=true
...

The first part of the output reports back on some of the parameters set in the code. For instance, there are 1462 unique feature dimensions and 4386 parameters (1462 for each of the three non-final categories), the prior is a Laplace prior with variance 0.5 and a noninformative prior on the intercept, the annealing schedule is exponential, and so on.

The output then provides feedback on the epoch-by-epoch progress of the stochastic gradient descent algorithm used for estimation.

...
epoch=    0 lr=0.002000000 ll=  -392.4239 lp=   -34.4435 llp=  -426.8675 llp*=  -426.8675       :00
epoch=    1 lr=0.001995000 ll=  -342.0040 lp=   -43.1246 llp=  -385.1286 llp*=  -385.1286       :00
epoch=    2 lr=0.001990013 ll=  -294.9343 lp=   -43.7030 llp=  -338.6373 llp*=  -338.6373       :00
epoch=    3 lr=0.001985037 ll=  -249.8116 lp=   -44.6577 llp=  -294.4693 llp*=  -294.4693       :00
...
epoch=  997 lr=0.000164891 ll=   -53.4495 lp=   -54.3246 llp=  -107.7740 llp*=  -107.7740       :15
epoch=  998 lr=0.000164478 ll=   -53.4494 lp=   -54.3239 llp=  -107.7732 llp*=  -107.7732       :15
epoch=  999 lr=0.000164067 ll=   -53.4492 lp=   -54.3232 llp=  -107.7725 llp*=  -107.7725       :15
...

In each epoch, the algorithm visits every training instance and adjusts each coefficient based on the current model and the training instance. The reports indicate the epoch number, the learning rate for that epoch (lr), the log likelihood of the data in the model (ll), the log prior of the current set of coefficients (lp), the sum of the log likelihood and log prior (llp) [note that this is just negative error], the best sum so far (llp*), and finally, the elapsed time, down to the second. In this case, estimation took 15 seconds, resulting in a log likelihood of -53.4, a log prior of -54.3, and a combined objective of -107.8.

After the feedback on estimation, the demo program prints out the features by name and their coefficient weights. In this instance, features are alphabetic or numeric tokens (or the intercept). Here are the top positive and negative coefficients for each category, as well as some zero coefficients from the first category, alt.atheism:

CLASSIFIER & FEATURES

NUMBER OF CATEGORIES=4
NUMBER OF FEATURES=1462

  CATEGORY=alt.atheism
                 Jim        0.542459
                some        0.519327
            atheists        0.454521
                  on        0.346886
                they        0.264611
               model        0.259512
                  at        0.233116
                  is        0.225087
             article        0.154355
                 ICO        0.153443
                 TEK        0.153443
                vice        0.153192
                  of        0.137563
                 The        0.136769
                 mcl        0.134621
            timmbake        0.134621
...
            approach       -0.000000
              causes       -0.000000
             equally       -0.000000
              happen       -0.000000
            ignoring       -0.000000
         immediately       -0.000000
               later       -0.000000
            provided       -0.000000
              regard       -0.000000
            separate       -0.000000
               small       -0.000000
              sounds       -0.000000
               stand       -0.000000
                week       -0.000000
             willing       -0.000000
               women       -0.000000
             America       -0.000000
            cultural       -0.000000
           disciples       -0.000000
            speaking       -0.000000
             implied       -0.000000
              debate       -0.000000
...
                   7       -0.179450
                that       -0.237454
                   2       -0.239784
                  do       -0.269343
              Mormon       -0.291933
                very       -0.293706
           Christian       -0.339846
                  ca       -0.374287
                  in       -0.425005
  *&^INTERCEPT%$^&**       -0.947079

Here the name "Jim" is the most positively indicative of the alt.atheism topic, and the intercept the most negative feature. For some reason this includes the word in and very and do, which seem unlikely features to discriminate atheism from other religious topics. This is a problem with unigram (single token) features -- they are often perplexing. Features like "Christian" may show up because the alt.atheism board may refer less to Christians (or Mormons) as a class.

Compare the alt.atheism topic with the misc.forsale topic, where words like "PC" or "sale" are strongly positive and again words like "Mormon" are negative, with some perplexing entries like "that".

  CATEGORY=misc.forsale
                 for        0.888025
                  PC        0.747638
               drive        0.700343
                  or        0.510954
                   2        0.377641
                 edu        0.282761
                 300        0.253986
                  on        0.251271
               would        0.194138
                sale        0.147936
                  00        0.103214
  *&^INTERCEPT%$^&**        0.083822
...
                  ca       -0.000098
                Book       -0.058855
                 the       -0.084477
                  In       -0.108700
                  of       -0.120941
              Mormon       -0.246687
                  to       -0.461844
                that       -0.726345
                  Re       -0.883255

Finally, here are the top positive and negative features for soc.religion.christian:

  CATEGORY=soc.religion.christian
             rutgers        1.068456
                life        0.763778
                 has        0.700812
                 May        0.676530
                Mary        0.565283
               athos        0.530967
                 who        0.354184
            doctrine        0.263468
            Orthodox        0.242234
             Trinity        0.193247
                   s        0.178808
              verses        0.142781
...
                NNTP       -0.026967
                  we       -0.046827
                 edu       -0.047974
                   A       -0.055054
              Mormon       -0.056953
                 The       -0.060776
              Robert       -0.077967
                 the       -0.079859
                were       -0.108695
                  ca       -0.120587
                  it       -0.156520
                 you       -0.157730
        Organization       -0.164869
                   a       -0.219583
           Christian       -0.294424
        Distribution       -0.389873
  *&^INTERCEPT%$^&**       -0.437988
             Posting       -0.470168
                Host       -0.486523

Oddly, this category has junk from the mail headers and signatures and what not, like "rutgers", "NNTP" and "Posting".

Variance of Coefficient Estimates

To get some feeling for the variability of the feature estimates, here are the top positive and negative features for the second fold of a four-way cross-validation, of which the above reports the first fold:

CATEGORY=misc.forsale
               for        1.215978
             drive        0.748811
*&^INTERCEPT%$^&**        0.629428
                PC        0.555030
                on        0.525066
              sale        0.515190
                 2        0.479129
              Host        0.266548
           Posting        0.266545
...
               COM       -0.000069
              been       -0.000102
               Sun       -0.000180
               are       -0.034722
                in       -0.070332
               the       -0.114678
                of       -0.272626
                In       -0.276476
                to       -0.372242
                Re       -0.459132
               that       -0.806237

Code Walkthrough

The code for generating this demo is in src/TextClassificationDemo.java. First, it builds up a corpus instance just as in the cross-validation demo in the topic classification tutorial:

public static void main(String[] args) throws Exception {
    ...
    PrintWriter progressWriter = new PrintWriter(System.out,true);
    int numFolds = 4;
    XValidatingObjectCorpus<Classified<CharSequence>> corpus
        = new XValidatingObjectCorpus<Classified<CharSequence>>(numFolds);

    ...

    corpus.permuteCorpus(new Random(7117)); // destroys runs of categories    

    TokenizerFactory tokenizerFactory
         = new RegExTokenizerFactory("\\p{L}+|\\d+"); // letter+ | digit+
    FeatureExtractor<CharSequence> featureExtractor
        = new TokenFeatureExtractor(tokenizerFactory);
    int minFeatureCount = 5;
    boolean addInterceptFeature = true;
    boolean noninformativeIntercept = true;
    double priorVariance = 0.5;
    RegressionPrior prior 
        = RegressionPrior.laplace(priorVariance,noninformativeIntercept);
    AnnealingSchedule annealingSchedule
         = AnnealingSchedule.exponential(0.002,0.9975);
    double minImprovement = 0.0000001;
    int minEpochs = 100;
    int maxEpochs = 1000;

    for (int fold = 0; fold < numFolds; ++fold) {
        corpus.setFold(fold);
        Reporter reporter = Reporters.writer(progressWriter);
        LogisticRegressionClassifier<CharSequence> classifier
            = LogisticRegressionClassifier.<CharSequence>train(corpus,
                                                               featureExtractor,
                                                               minFeatureCount,
                                                               addInterceptFeature,
                                                               prior,
                                                               annealingSchedule,
                                                               minImprovement,
                                                               minEpochs,
                                                               maxEpochs,
                                                               reporter);
    ...

The training method for the logistic regression classifier is almost identical to the static method for estimating logistic regression models in the stats package. The main difference is that the classifier requires an instance of util.FeatureExtractor. The feature extractor interface defines a single method:

public interface FeatureExtractor<E> {
    public Map<String,? extends Number> features(E in);
}

In the code above, we use a pre-built adapter that converts a tokenizer factory into a feature extractor. The tokenizer.TokenFeatureExtractor is constructed with a tokenizer factory, which it uses to tokenize the character sequences it receives. The resulting mapping simply maps each token to its count in the input.
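
Nothing requires the use of TokenFeatureExtractor; any implementation of the interface will do. The following sketch shows a hypothetical extractor that adds a length feature on top of the token counts; the class name and the extra feature are ours, purely for illustration:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenFeatureExtractor;
import com.aliasi.util.FeatureExtractor;

import java.util.HashMap;
import java.util.Map;

public class LengthAwareFeatureExtractor implements FeatureExtractor<CharSequence> {

    private final TokenFeatureExtractor mTokenExtractor
        = new TokenFeatureExtractor(new RegExTokenizerFactory("\\p{L}+|\\d+"));

    public Map<String,? extends Number> features(CharSequence in) {
        // start from the token counts, then add a hypothetical (log) length feature
        Map<String,Number> feats
            = new HashMap<String,Number>(mTokenExtractor.features(in));
        feats.put("LENGTH", Math.log(1.0 + in.length()));
        return feats;
    }
}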

The features are printed out in order by the classifier itself:

        ...
        progressWriter.println("\nCLASSIFIER & FEATURES\n");
        progressWriter.println(classifier);
        ...

The evaluation is done in the usual way, by having the corpus walk the evaluator over the test section:

    ...
    progressWriter.println("\nEVALUATION\n");
    boolean storeInputs = false;
    ConditionalClassifierEvaluator<CharSequence> evaluator 
        = new ConditionalClassifierEvaluator<CharSequence>(classifier,CATEGORIES,storeInputs);
    corpus.visitTest(evaluator);
    progressWriter.printf("FOLD=%5d  ACC=%4.2f  +/-%4.2f\n", 
                          fold,
                          evaluator.confusionMatrix().totalAccuracy(),
                          evaluator.confusionMatrix().confidence95());
}
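
Once a fold's classifier has been trained, it may also be applied to new text directly; the return value is a conditional classification with a probability for each category. A minimal sketch (the input string is made up, and classify.ConditionalClassification must be imported):

    ConditionalClassification classification
        = classifier.classify("Wanted: PC with 300 MB drive for sale");
    for (int rank = 0; rank < classification.size(); ++rank)
        System.out.printf("%-24s p=%4.2f%n",
                          classification.category(rank),
                          classification.conditionalProbability(rank));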

Discrete Choice Analysis

Discrete choice analysis models one or more agents' selections when presented with one or more alternative options. The alternatives are represented as vectors and the decision is probabilistically normalized using the multi-logit, just as for logistic regression. The vectors representing the alternatives may include information about the agent making the choice; their content is really up to the application.

DCA models are easy to construct directly or to estimate from a set of training data. Each training instance consists of one or more alternatives represented as vectors along with an indication of which choice was made.

Feature-Extraction for DCA

As for logistic regression, it is possible to represent choices among alternatives that are arbitrary objects. A user-supplied feature extractor is used to map the alternative objects to vectors.

DCA Examples

DCA Transportation Example

A common example is transportation, where choices might include walking, bicycling, taking a bus, taking a train, driving, etc. Each choice is represented as a set of features, such as time, cost, safety, and so on. The agent making the choice may be represented by further features such as disposable income and impatience. A model then considers how to weight these features to represent choices.

DCA Query-Specific Document Ranking Example

Another application is to rank documents with respect to a query. If each document and query pair is represented by a feature vector, the choice probabilities may be used for ranking.

DCA Coreference Example

Coreference is essentially a clustering task on entity mentions in text. For instance, in Babe Ruth's Wikipedia entry (as downloaded on 12 March 2010 at 2:59 PM EST) the Yankee slugger is referred to with the names "Babe Ruth", "Ruth", and "the Babe", the pronoun "he" and the common noun phrase "the first true American sports celebrity superstar". On the other hand, in the same article, "Baby Ruth" refers to a candy bar, "The Babe Ruth Story" to a film, and "Babe Ruth Day" to a recurring event.

As a choice problem, coreference is often approached by working through a document in text order, making a linkage decision for each referential noun phrase. The noun phrase may be linked to any of the previously introduced entities or it may be the first mention of an entity. Obviously the first mention is a new entity. The second mention may either refer to the same entity as the first mention, or may introduce a new entity.

These options may be represented as alternative choices, with features such as gender, distance, number of intervening noun phrases, similarity of phrasing such as token overlap, edit distance, etc., syntactic structure, discourse parallelism, semantic knowledge of common nouns, and so on.

Cross-document coreference is often approached the same way, as a sequential choice problem. In this case, you'd want to use different features than for within-document coreference. You can use information such as string overlap, as well as meta information such as document source, date, author, language, etc.

Mathematical Definition

A K-dimensional discrete choice model is parameterized by a single K-dimensional real-valued vector β. A set of N alternatives is represented by N K-dimensional vectors α[1],...,α[N], one for each alternative.

DCA is normalized in the same way as logistic regression, with the multi-logit transform. The multilogit takes a vector of arbitrary real numbers and converts them into probabilities that sum to 1.0. This is done by simply exponentiating and normalizing. Thus the probability of choosing alternative n is proportional to exp(β'α[n]) (where x' is the transpose of an implicit column vector x). In symbols,

p(n|α,β) ∝ exp(β'α[n]).

where normalization is computed by the usual summation,

p(n|α,β) = exp(β'α[n]) / Σ_{m in 1:N} exp(β'α[m]).
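
In code, this is exactly the computation performed by the choiceProbs() method used in the demo below. A minimal sketch, with made-up alternative vectors:

    Vector coeffs = new DenseVector(new double[] { 3.0, -2.0, 1.0 });
    DiscreteChooser chooser = new DiscreteChooser(coeffs);

    Vector[] alternatives = new Vector[] {
        new DenseVector(new double[] {  0.9, 0.2, 0.5 }),
        new DenseVector(new double[] { -0.3, 1.5, 0.0 })
    };
    double[] choiceProbs = chooser.choiceProbs(alternatives);  // probabilities summing to 1.0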

Stochastic Gradient MAP Estimator for DCA

We use the same set of tools to fit DCA models as logistic regression models. Specifically, fitting is carried out via stochastic gradient descent using a specified learning rate annealing schedule. Priors may be placed on the coefficients, resulting in maximum a posteriori (MAP) estimates. Convergence may be monitored to within a certain tolerance or the number of training epochs may be specified directly. A reporter may be configured and passed into the fitter to monitor the fit on an epoch-by-epoch basis.

DCA Simulation and API Demo

A good strategy for testing an estimator is to simulate a data set according to the model and see if the estimator finds the model parameters.

Here, we take a very simple model with β = (3,-2,1). For instance, this might be a very simple model for proper name coreference where dimension 1 represents string similarity, dimension 2 represents distance, and dimension 3 represents discourse parallelism. Thus better choices have higher string similarity, lower distance, and higher parallelism.

Running the Demo

The demo may be run directly from ant:

> cd $LINGPIPE/demos/tutorial/logistic-regression
> ant dca

The first output dumps the training data, consisting of 1000 samples randomly generated according to the model (we'll see the code shortly):

DCA Demo

Sample 0 random choice prob=0.011066323782317866
  0 p=0.0010942559149001196 xs= 0=0.5537208017939629 1=0.26016864313905447 2=-1.7522309679972126
* 1 p=0.9912422418199297 xs= 0=-0.20898887445894693 1=-1.8909983320493842 2=3.0424484541278125
  2 p=6.61481706344221E-7 xs= 0=-2.3236185850552404 1=-0.40191224523950836 2=-1.8554774163187067
  3 p=0.007342638209775382 xs= 0=-0.8585314472003086 1=-2.6506128889054077 2=-1.4334136804432425
  4 p=2.5835112077154026E-4 xs= 0=-1.4926364565738457 1=-0.9874988045707495 2=0.44799562310696617
  5 p=6.185145291665587E-5 xs= 0=-1.3091994188446996 1=-1.2314203046062415 2=-2.0197424935306385

Sample 1 random choice prob=0.5716203055299767
  0 p=0.4774808501121091 xs= 0=-0.34658066612216726 1=-3.7171702891729055 2=2.847613891265687
  1 p=5.3291162894215095E-6 xs= 0=-2.727452048150046 1=-3.929332197507756 2=-1.8371896878683784
  2 p=5.631246738398885E-9 xs= 0=-5.097774786769612 1=-3.261921315683861 2=-0.24400954923978324
* 3 p=0.5194675028142764 xs= 0=2.578318143603154 1=-0.5382777113119868 2=0.5149828170181778
  4 p=0.003046289030927008 xs= 0=-0.6398287520090654 1=-3.5369997185026127 2=-0.9669006257184916
  5 p=2.294939187015502E-8 xs= 0=-1.0199809307787397 1=2.233346753932399 2=-0.08189440303456576
  6 p=3.45760010680944E-10 xs= 0=-2.215343033840159 1=3.724642835294389 2=2.291482275418305

Sample 2 random choice prob=0.6571403489348165
  0 p=0.0033735721977343185 xs= 0=-2.1173544097861843 1=-0.9525961733238733 2=-1.5757085479755961
  1 p=0.0014417439480520734 xs= 0=-2.809800209893541 1=-1.313051420770636 2=-1.069400368972158
* 2 p=0.9951846838542135 xs= 0=-1.6884130661052648 1=-1.43110875271501 2=1.8673984117586233

...

Sample 998 random choice prob=0.7185102111403285
* 0 p=0.9999870403750512 xs= 0=4.94972268738353 1=-0.9018199885770234 2=1.0128061582237957
  1 p=1.2385892859586818E-5 xs= 0=0.7649756068129245 1=-0.95647020022203 2=2.1588075307465373
  2 p=3.807341515916061E-7 xs= 0=1.540848982493464 1=0.019785197111358486 2=-1.69851386624265
  3 p=1.8148517752607114E-7 xs= 0=-0.1960515695662867 1=-1.8262256057851793 2=-0.9207611987335641
  4 p=3.9743816403578186E-12 xs= 0=-2.7740831520265568 1=-0.42268913434325006 2=-1.108653597466911
  5 p=1.150878487520586E-8 xs= 0=-0.4714941289898704 1=-2.440351927961609 2=-4.080749502726419

Sample 999 random choice prob=0.979861378724821
* 0 p=0.9938278009515112 xs= 0=2.1921160621760403 1=-1.8106535867532818 2=0.5622076284200712
  1 p=8.531431715509726E-7 xs= 0=0.3150259443121377 1=0.30802222712621374 2=-3.537317523230222
  2 p=1.9513289313863527E-7 xs= 0=0.568094267556212 1=3.425238707385209 2=0.46266391914347
  3 p=0.006144525436311744 xs= 0=0.9501447084912402 1=-1.1087942598824052 2=0.6058379024682905
  4 p=2.1205752876833113E-5 xs= 0=-2.6827568717213035 1=-2.9647540869667397 2=2.123578704514057
  5 p=4.19065365895493E-7 xs= 0=-0.025821863917245638 1=1.5799428062033578 2=-0.681833406242208
  6 p=5.000517302099065E-6 xs= 0=-0.653631008901344 1=1.243981268705386 2=3.008940687919986
  7 p=5.67485413433939E-13 xs= 0=-2.5063853494611035 1=4.837516266561886 2=-0.23731845331128532

...

The training samples have varying numbers of alternatives. For instance, sample 0 and sample 998 have six alternatives, whereas sample 2 has only three. Each alternative is represented by a vector, following xs=. For instance, the first alternative in sample 0 is roughly (0.55,0.26,-1.75), and the second alternative is represented by the vector (-0.21,-1.89,3.04). Each alternative for a sample is preceded by its probability in the model, computed according to the formula above.

Each training example simulates a choice by generating a random number between 0 and 1, shown after the sample number. For instance, sample 1 generated a random number of roughly 0.572. This resulted in alternative 3 being chosen. But alternative 3 only had a 0.52 chance of being chosen; if the random number were lower than 0.47, alternative 0 would have been selected. These choices make up the training data.

The next piece of output reports the estimation parameters:

...
    :00 estimate()
    :00 # training cases=1000
    :00 regression prior=GaussianRegressionPrior(Variance=4.0, noninformativeIntercept=true)
    :00 annealing schedule=Exponential(initialLearningRate=0.1, base=0.99)
    :00 min improvement=1.0E-5
    :00 min epochs=5
    :00 max epochs=500

We have a fairly low prior variance and use an exponentially decaying learning rate that decays fairly quickly. We have specified the search to stop after relative convergence to within 0.00001 (1.0E-5), or about five decimal places.

Then, like logistic regression, we get an epoch-by-epoch report of the gradient descent,

    :00 epoch=    0 lr=0.100000000 ll=  -229.2331 lp=    -3.7495 llp=  -232.9826 llp*=  -232.9826
    :00 epoch=    1 lr=0.099000000 ll=  -230.3207 lp=    -3.8286 llp=  -234.1493 llp*=  -232.9826
    :00 epoch=    2 lr=0.098010000 ll=  -230.6148 lp=    -3.8482 llp=  -234.4630 llp*=  -232.9826
    :00 epoch=    3 lr=0.097029900 ll=  -230.5728 lp=    -3.8521 llp=  -234.4249 llp*=  -232.9826
    :00 epoch=    4 lr=0.096059601 ll=  -230.4288 lp=    -3.8516 llp=  -234.2804 llp*=  -232.9826
    :00 epoch=    5 lr=0.095099005 ll=  -230.2566 lp=    -3.8498 llp=  -234.1064 llp*=  -232.9826
    :00 epoch=    6 lr=0.094148015 ll=  -230.0782 lp=    -3.8476 llp=  -233.9258 llp*=  -232.9826
    :00 epoch=    7 lr=0.093206535 ll=  -229.9002 lp=    -3.8453 llp=  -233.7455 llp*=  -232.9826
    :00 epoch=    8 lr=0.092274469 ll=  -229.7245 lp=    -3.8430 llp=  -233.5674 llp*=  -232.9826
    :00 epoch=    9 lr=0.091351725 ll=  -229.5515 lp=    -3.8407 llp=  -233.3922 llp*=  -232.9826
    :00 epoch=   10 lr=0.090438208 ll=  -229.3815 lp=    -3.8385 llp=  -233.2200 llp*=  -232.9826
    :00 epoch=   11 lr=0.089533825 ll=  -229.2144 lp=    -3.8362 llp=  -233.0506 llp*=  -232.9826
    :00 epoch=   12 lr=0.088638487 ll=  -229.0501 lp=    -3.8340 llp=  -232.8841 llp*=  -232.8841
    :00 epoch=   13 lr=0.087752102 ll=  -228.8885 lp=    -3.8319 llp=  -232.7204 llp*=  -232.7204
    :00 epoch=   14 lr=0.086874581 ll=  -228.7297 lp=    -3.8297 llp=  -232.5594 llp*=  -232.5594
...
    :03 epoch=  252 lr=0.007944545 ll=  -217.6262 lp=    -3.6695 llp=  -221.2957 llp*=  -221.2957
    :03 epoch=  253 lr=0.007865100 ll=  -217.6224 lp=    -3.6693 llp=  -221.2917 llp*=  -221.2917
    :03 epoch=  254 lr=0.007786449 ll=  -217.6186 lp=    -3.6692 llp=  -221.2878 llp*=  -221.2878
    :03 epoch=  255 lr=0.007708584 ll=  -217.6149 lp=    -3.6691 llp=  -221.2840 llp*=  -221.2840
    :03 epoch=  256 lr=0.007631498 ll=  -217.6113 lp=    -3.6689 llp=  -221.2802 llp*=  -221.2802
    :03 epoch=  257 lr=0.007555183 ll=  -217.6077 lp=    -3.6688 llp=  -221.2765 llp*=  -221.2765
    :03 epoch=  258 lr=0.007479632 ll=  -217.6042 lp=    -3.6687 llp=  -221.2729 llp*=  -221.2729
    :03 Converged with rollingAverageRelativeDiff=9.894823299509656E-6
...

We see that the estimate converged to within our specified bound after 259 epochs (we count from 0). The columns are the same as for logistic regression. First, the time (here only 3 seconds total, including all the output), the learning rate for the epoch (note the steep decline from epoch 0 to epoch 258), the log likelihood of the training data given the model parameters, the log prior value (note that the likelihood dominates the prior here, with only 3 parameters), and the sum of log likelihood and prior, followed by the best such sum found so far.

Finally, we see how well the estimator recovered the model parameters:

ACTUAL coeffs= 0=3.0 1=-2.0 2=1.0
FIT coeffs= 0=3.017291072658265 1=-1.9885202732810667 2=1.0845623307943484

That's very close given 1000 training examples, so we can see that the API is doing the right thing.

DCA Code Walkthrough

The DCA code is in src/Dca.java.

Simulating the Training Data

Our first job is to simulate those training examples. We actually do this by first constructing a discrete chooser with our simulated coefficient vector:

public static void main(String[] args) {

    double[] simCoeffs 
        = new double[] { 3.0, -2.0, 1.0 };
    Vector simCoeffVector
        = new DenseVector(simCoeffs);
    DiscreteChooser simChooser = new DiscreteChooser(simCoeffVector);
...

Next, we generate the alternatives using a fixed random number generator (you can change it and other parameters to get different training data). First, we create a two-dimensional array of vectors to represent the alternative choices and a parallel one-dimensional array of integers to represent the outcomes.

...
    int numDims = simCoeffs.length;
    int numSamples = 1000;

    Random random = new Random(42);
    Vector[][] alternativess = new Vector[numSamples][];
    int[] choices = new int[numSamples];
...
Next, we iterate over the samples, first generating a random number of choices between 1 and 8 and creating an array of vectors to hold the alternatives for each sample:
...
    for (int i = 0; i < numSamples; ++i) {
        int numChoices = 1 + random.nextInt(8);
        alternativess[i] = new Vector[numChoices];
...

Next, we generate a value for each dimension of each alternative by sampling from a Gaussian distribution with mean 0 and standard deviation 2. Finally, we set the alternative to a dense vector with those values:

...
        for (int k = 0; k < numChoices; ++k) {
            double[] xs = new double[numDims];
            for (int d = 0; d < numDims; ++d) {
                xs[d] = 2.0 * random.nextGaussian();
            }
            alternativess[i][k] = new DenseVector(xs);
        }
...

Next, we extract the choice probabilities from the simulated discrete choice model, simChooser. We then sample an outcome in the usual way by generating a random number between 0 and 1 (choiceProb). We take the first outcome where the cumulative probability is above that value.

        double[] choiceProbs = simChooser.choiceProbs(alternativess[i]);
        double choiceProb = random.nextDouble();
        double cumProb = 0.0;
        for (int k = 0; k < numChoices; ++k) {
            cumProb += choiceProbs[k];
            if (choiceProb < cumProb || k == (numChoices - 1)) {
                choices[i] = k;
                break;
            }
        }
        System.out.println("\nSample " + i + " random choice prob=" + choiceProb);
        for (int k = 0; k < numChoices; ++k) {
            System.out.println((choices[i] == k ? "* " : "  ") + k
                               + " p=" + choiceProbs[k]
                               + " xs=" + alternativess[i][k]);
        }
    }

This time we've left in the print statements in order to illustrate the workings of the choices array, the choice probabilities, and the alternatives array.

Estimating a DCA Model from Data

The rest of the demo consists of creating the estimator parameters and then running it over the training data. The parameters are set up just as in logistic regression, with:

...
    double priorVariance = 4.0;
    boolean nonInformativeIntercept = true;
    RegressionPrior prior 
        = RegressionPrior.gaussian(priorVariance,nonInformativeIntercept);
    int priorBlockSize = 100;
        
    double initialLearningRate = 0.1;
    double decayBase = 0.99;
    AnnealingSchedule annealingSchedule
        = AnnealingSchedule.exponential(initialLearningRate,decayBase);

    double minImprovement = 0.00001;
    int minEpochs = 5;
    int maxEpochs = 500;
        
    Reporter reporter = Reporters.stdOut().setLevel(LogLevel.DEBUG);
...

Then we call the static estimate method on the dca.DiscreteChooser class:

...
    DiscreteChooser chooser
        = DiscreteChooser.estimate(alternativess,
                                   choices,
                                   prior,
                                   priorBlockSize,
                                   annealingSchedule,
                                   minImprovement,
                                   minEpochs,
                                   maxEpochs,
                                   reporter);
...

The estimated model is then assigned to the variable chooser. Its parameters are printed alongside the simulated values with:

...
        Vector coeffVector = chooser.coefficients();
        System.out.println("\nACTUAL coeffs=" + simCoeffVector);
        System.out.println("FIT coeffs=" + coeffVector);
    }

Serialization

A discrete chooser may be serialized.
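
For instance, a trained chooser may be written to disk and read back with standard Java serialization, assuming DiscreteChooser follows LingPipe's usual java.io.Serializable convention. A minimal sketch (file name arbitrary, imports from java.io, exception handling omitted):

    // Write the trained chooser to disk ...
    ObjectOutputStream objOut
        = new ObjectOutputStream(new FileOutputStream("chooser.bin"));
    objOut.writeObject(chooser);
    objOut.close();

    // ... and read it back in.
    ObjectInputStream objIn
        = new ObjectInputStream(new FileInputStream("chooser.bin"));
    DiscreteChooser deserializedChooser = (DiscreteChooser) objIn.readObject();
    objIn.close();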

Training with Objects and Feature Extractors

Object-based training data replaces the alternative vectors in the two-dimensional array alternativess with a two-dimensional array of objects and a feature extractor. A symbol table is used to convert the training data into vectors, and then the basic DCA estimator is called, again, just as for logistic regression.