Introduction
What is Logistic Regression?
Logistic regression is a discriminative probabilistic classification model that operates over real-valued vector inputs. The dimensions of the input vectors being classified are called "features" and there is no restriction against them being correlated. Logistic regression is one of the best probabilistic classifiers, measured in both log loss and first-best classification accuracy across a number of tasks.
The logistic regression implementation in LingPipe provides multinomial classification; that is, it allows more than two possible output categories.
The main drawback of logistic regression is that it's relatively slow to train compared to the other LingPipe classifiers. It also requires extensive tuning in the form of feature selection and implementation to achieve state-of-the-art classification performance.
What's in the Tutorial?
This tutorial covers both the vector-based implementation in the statistics package and the use of feature extractors for classifying arbitrary objects in the classification package. The tutorial will cover basic estimation, the effects of different choices and parameterizations of priors, and tuning the estimator's search.
The tutorial will also cover the basics of feature-based classification in LingPipe. Feature extractors convert arbitrary objects into feature vectors, which may then be converted to actual vectors for use in logistic regression.
Also Known As (AKA)
For the sake of terminological clarity (and search engine optimization), here are some aliases for multinomial logistic regression.
Polytomous Logistic Regression
Multinomial logistic regression is also known as polytomous, polychotomous, or multi-class logistic regression, or just multilogit regression.
Maximum Entropy Classifier
Logistic regression estimation obeys the maximum entropy principle, and thus logistic regression is sometimes called "maximum entropy modeling", and the resulting classifier the "maximum entropy classifier".
Neural Network: Classification with a Single Neuron
Binary logistic regression is equivalent to a one-layer, single-output neural network with a logistic activation function trained under log loss. This is sometimes called classification with a single neuron.
LingPipe's stochastic gradient descent is equivalent to a stochastic back-propagation algorithm over the single-output neural network.
Ridge Regression and the Lasso
Maximum a posteriori (MAP) estimation with Gaussian priors is often referred to as "ridge regression"; with Laplace priors MAP estimation is known as the "lasso".
Shrinkage and Regularized Regression
MAP estimation with Gaussian, Laplace or Cauchy priors is known as parameter shrinkage.
Gaussian and Laplace priors are equivalent to regularized regression, with the Gaussian version being regularized with the L2 norm (Euclidean distance, called the Frobenius norm for matrices of parameters) and the Laplace version being regularized with the L1 norm (taxicab distance or Manhattan metric); other Minkowski metrics may be used for shrinkage.
Generalized Linear Model and Softmax
Logistic regression is a generalized linear model with the logit link function. The inverse of the link, which maps linear predictors to probabilities, is sometimes called softmax. Because logistic regression uses exponentiation to convert linear predictors to probabilities, it is also sometimes called an exponential model.
Logistic Regression Models
Logistic regression models provide multi-category classification in cases where the categories are exhaustive and mutually exclusive. That is, every instance belongs to exactly one category.
Inputs are coded as real-valued vectors of a fixed dimensionality. The dimensions are often called predictors or features. There is no requirement that they be independent, and with regularization, they may even be highly or fully linearly correlated.
The model consists of one parameter vector per category, each of the same dimensionality as the inputs. The last category does not get a parameter vector; or equivalently, it gets a constant 0 parameter vector.
More formally, if the inputs are of dimension d and
there are k categories, the model consists of k-1
vectors β[0],...,β[k-2]. Then for a given
input vector x of dimensionality d, the
conditional probability of a category given the input is defined to be:
p(0 | x) ∝ exp(β[0] * x)
p(1 | x) ∝ exp(β[1] * x)
...
p(k-2 | x) ∝ exp(β[k-2] * x)
p(k-1 | x) ∝ exp(0 * x)
Normalizing by the sum of the exponentiated bases yields the probability estimates:
p(0 | x) = exp(β[0]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
p(1 | x) = exp(β[1]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
...
p(k-2 | x) = exp(β[k-2]*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
p(k-1 | x) = exp(0*x) / (exp(β[0]*x) + ... + exp(β[k-2]*x) + exp(0*x))
Writing it out in summation notation, for c < k-1:
p(c | x) = exp(β[c] * x) / (1 + Σ_{i < k-1} exp(β[i] * x))
and for c = k-1:
p(k-1 | x) = 1 / (1 + Σ_{i < k-1} exp(β[i] * x))
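The formulas above can be sketched in a few lines of plain Java. This is only an illustration, not LingPipe code: the class name MultinomialLogitSketch, its methods, and the coefficient values in main() are made up for the example. The last category's vector is never stored; its linear predictor is simply left at zero.

```java
import java.util.Arrays;

/** Sketch of p(c | x) for a multinomial logistic model (not LingPipe code). */
final class MultinomialLogitSketch {

    /** Dot product of two equal-length arrays. */
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i)
            sum += a[i] * b[i];
        return sum;
    }

    /**
     * Returns p(c | x) for c = 0..k-1 given the k-1 explicit vectors
     * betas[0..k-2]; the vector for category k-1 is implicitly all zeros.
     */
    static double[] probabilities(double[][] betas, double[] x) {
        int k = betas.length + 1;
        double[] linear = new double[k];   // linear[k-1] stays 0.0
        double max = 0.0;                  // accounts for the implicit zero predictor
        for (int c = 0; c < k - 1; ++c) {
            linear[c] = dot(betas[c], x);
            max = Math.max(max, linear[c]);
        }
        // Subtracting the max before exponentiating avoids overflow and
        // cancels out in the normalization.
        double[] probs = new double[k];
        double z = 0.0;
        for (int c = 0; c < k; ++c) {
            probs[c] = Math.exp(linear[c] - max);
            z += probs[c];
        }
        for (int c = 0; c < k; ++c)
            probs[c] /= z;
        return probs;
    }

    public static void main(String[] args) {
        // Made-up coefficients for k = 3 categories and d = 2 dimensions.
        double[][] betas = { { 0.5, -1.0 }, { -0.25, 2.0 } };
        double[] x = { 1.0, 0.5 };
        System.out.println(Arrays.toString(probabilities(betas, x)));
    }
}
```

With all coefficients set to zero, every category gets probability 1/k, which is a quick sanity check on any implementation.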
Example of Logistic Regression
Logistic regression models are estimated from training data consisting of a sequence of vectors and their reference categories. The vectors are arbitrary, with their dimensions representing features of the input objects being classified. The categories are discrete, and should be numbered contiguously from 0 to the number of categories minus one.
The Wallet Problem
The first example we consider is drawn from chapter 5 of the following book:
- Allison, Paul David. 1999. Logistic Regression Using the SAS System: Theory and Application. SAS Institute.
The data is based on a survey of 195 undergraduates, and attempts to predict their answer to the question "If you found a wallet on the street, would you...", with the following possible responses:
Wallet Problem Outcomes

| Outcome | Description |
|---|---|
| 0 | keep both |
| 1 | keep the money, return the wallet |
| 2 | return both |
The input vectors are five dimensional, consisting of the following features, the descriptions of which are directly transcribed from (Allison 1999):
Wallet Problem Predictors

| Dimension | Description | Values |
|---|---|---|
| 0 | Intercept | 1: always |
| 1 | Male | 1: male; 0: female |
| 2 | Business | 1: enrolled in business school; 0: not enrolled in business school |
| 3 | Punish | Variable describing whether student was physically punished by parents at various ages. 1: punished in elementary school, but not in middle or high school; 2: punished in elementary and middle school, but not in high school; 3: punished at all three levels |
| 4 | Explain | Response to question "When you were punished, did your parents generally explain why what you did was wrong?" 1: almost always; 0: sometimes or never |
LingPipe requires an explicit representation of the intercept feature, which is implicit in (Allison 1999). The intercept is treated just like other features, but is assumed to take on value 1.0 in all inputs. Thus it provides an input-independent bias term for estimation. Most problems benefit from the addition of such an intercept feature.
Where do Features Come From?
The predictors in this problem are all discrete, most defining binary variables, with the physical punishment variable taking on three ordinal values. It is also possible to include continuous inputs for regression problems, such as token counts in linguistic examples or features like width of petals in flower species classification problems.
Here are the first few training examples out of the complete set of 195:
Wallet Problem Data (Sample)

| Outcome | Intercept | Male | Business | Punish | Explain |
|---|---|---|---|---|---|
| 1 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 2.0 | 1.0 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 2 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| 0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| ... | | | | | |
For example, the third training instance represents a survey response for a woman (male=0.0) who is not in business school (business=0.0), who was punished only in elementary school (punish=1.0), had her punishment explained almost always (explain=1.0), and who said she'd return both the wallet and the money (outcome=2). The fifth training example represents a man who's not in business school, was punished only in elementary school, had his punishment explained, and answered that he would keep both the money and wallet.
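The coding of a survey response into an input row, including the explicit intercept, can be sketched as follows. The class WalletEncodingSketch and its encode() method are hypothetical helpers written for this illustration; they are not part of the demo source.

```java
import java.util.Arrays;

/** Sketch: turning one survey response into a five-dimensional input row. */
final class WalletEncodingSketch {

    /** Order of dimensions: intercept, male, business, punish, explain. */
    static double[] encode(boolean male, boolean business, int punish, boolean explain) {
        return new double[] {
            1.0,                      // intercept: always 1
            male ? 1.0 : 0.0,
            business ? 1.0 : 0.0,
            punish,                   // ordinal punishment level
            explain ? 1.0 : 0.0
        };
    }

    public static void main(String[] args) {
        // Third training instance: a woman, not in business school,
        // punished only in elementary school, with explanation.
        System.out.println(Arrays.toString(encode(false, false, 1, true)));
    }
}
```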
Our first logistic regression model is estimated from all 195 of these training cases, yielding a classifier that, given the five input feature values (intercept, male, business, punish and explain), assigns probabilities to the three outcomes (keep both, keep the money and return the wallet, return both).
Coding the Problem
The source code for the wallet problem may be found in the
file src/WalletProblem.java.
In order to train a logistic regression model, LingPipe requires
the inputs to be coded as instances of matrix.Vector
and outputs to be coded as integers. These are presented as
parallel arrays of vectors and output integers.
To keep things simple, the outputs and inputs are coded directly as static constants. Here are the outputs:
static final int[] OUTPUTS = new int[] {
    1,
    1,
    2,
    2,
    0,
    ...
};
The inputs are coded as dense vector instances:
static final Vector[] INPUTS = new Vector[] {
    new DenseVector(new double[] { 1, 0, 0, 2, 0 }),
    new DenseVector(new double[] { 1, 0, 0, 2, 1 }),
    new DenseVector(new double[] { 1, 0, 0, 1, 1 }),
    new DenseVector(new double[] { 1, 0, 0, 2, 0 }),
    new DenseVector(new double[] { 1, 1, 0, 1, 1 }),
    ...
};
Note how these two parallel arrays directly encode the sample data as presented in the previous table.
Estimating the Regression Coefficients
Running the code using the ant target wallet
prints out the estimated regression coefficients:
> ant wallet

Computing Wallet Problem Logistic Regression
Outcome=0 -3.47 1.27 1.18 1.08 -1.60
Outcome=1 -1.29 1.17 0.42 0.20 -0.80
An estimated model consists of a sequence of weight vectors, one fewer than the number of output categories. We don't need a vector for the last category, because it can be taken to be zero without loss of generality (see Carpenter 2008).
Implicit Coefficients for Final Outcome
The final outcome has all zero coefficients. Filling in for the wallet example, this gives us:
Outcome=2 0.00 0.00 0.00 0.00 0.00
These are not printed as part of the model output.
Interpreting the Regression Coefficients
Because logistic regression involves a simple linear predictor, the regression coefficients may be interpreted fairly directly.
Intercept
The values of the intercept parameter are -3.47 for outcome 0 (keep both), -1.29 for outcome 1 (keep money, return wallet), and 0.0 implicitly for outcome 2 (return both). Because of the definition of probability and the fact that the intercept feature dimension is always 1.0, the linear bases for outcomes 0, 1 and 2 start off on an uneven footing. To make the keep-both outcome most likely, another feature or combination of features will have to contribute more than 3.47 to the linear basis.
Computing Probabilities
The main point of fitting a model is to be able to compute
probabilities for events. For instance, take male business students
who were punished at all three levels without explanation. That
provides an input vector of (1,1,1,3,0). The unnormalized probabilities
work out as:
p(keep-both|1,1,1,3,0) ∝ exp(-3.47*1 + 1.27*1 + 1.18*1 + 1.08*3 + -1.60*0)
= exp(2.22) = 9.207
p(keep-money|1,1,1,3,0) ∝ exp(-1.29*1 + 1.17*1 + 0.42*1 + 0.20*3 + -0.80*0)
= exp(0.90) = 2.460
p(return-both|1,1,1,3,0) ∝ exp(0.0*1 + 0.0*1 + 0.0*1 + 0.0*3 + 0.0*0)
= exp(0) = 1.000
Division by the sum of exponentiated linear predictors yields the probabilities:
p(keep-both|1,1,1,3,0) = 9.207 / (9.207 + 2.460 + 1.000) = 0.73
p(keep-money|1,1,1,3,0) = 2.460 / (9.207 + 2.460 + 1.000) = 0.19
p(return-both|1,1,1,3,0) = 1.000 / (9.207 + 2.460 + 1.000) = 0.08
If we repeat this exercise for women (second feature = 0), the linear predictors are 0.95 and -0.27, with exponentials 2.586 and 0.763, and we get:
p(keep-both|1,0,1,3,0) = 2.586 / (2.586 + 0.763 + 1.000) = 0.59
p(keep-money|1,0,1,3,0) = 0.763 / (2.586 + 0.763 + 1.000) = 0.18
p(return-both|1,0,1,3,0) = 1.000 / (2.586 + 0.763 + 1.000) = 0.23
According to this model, among business students punished at all three levels without explanation, men are more likely than women to keep both the money and the wallet (0.73 versus 0.59), women are almost three times as likely as men to return both (0.23 versus 0.08), and the two groups are about equally likely to keep the money and return the wallet.
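The arithmetic in this section can be checked mechanically. The following sketch hard-codes the coefficients printed by the demo, with an explicit all-zero row for the final outcome, and prints the outcome probabilities for the two inputs discussed above. The class name WalletProbabilities is made up for the illustration, and because the coefficients are rounded to two decimal places the results are approximate.

```java
/** Sketch: outcome probabilities from the printed wallet coefficients. */
final class WalletProbabilities {

    // Rows are outcomes; columns are intercept, male, business, punish, explain.
    static final double[][] BETAS = {
        { -3.47, 1.27, 1.18, 1.08, -1.60 },  // outcome 0: keep both
        { -1.29, 1.17, 0.42, 0.20, -0.80 },  // outcome 1: keep money, return wallet
        {  0.00, 0.00, 0.00, 0.00,  0.00 }   // outcome 2: return both (implicit zeros)
    };

    /** Exponentiates each linear predictor and normalizes by the sum. */
    static double[] probabilities(double[] x) {
        double[] probs = new double[BETAS.length];
        double z = 0.0;
        for (int c = 0; c < BETAS.length; ++c) {
            double linear = 0.0;
            for (int j = 0; j < x.length; ++j)
                linear += BETAS[c][j] * x[j];
            probs[c] = Math.exp(linear);
            z += probs[c];
        }
        for (int c = 0; c < probs.length; ++c)
            probs[c] /= z;
        return probs;
    }

    public static void main(String[] args) {
        double[][] inputs = { { 1, 1, 1, 3, 0 }, { 1, 0, 1, 3, 0 } };
        for (double[] x : inputs) {
            double[] p = probabilities(x);
            System.out.printf("male=%.0f p(0)=%.2f p(1)=%.2f p(2)=%.2f%n",
                              x[1], p[0], p[1], p[2]);
        }
    }
}
```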
Code Walk Through
The code is all in the main() method. The
estimation is done with the following one-liner, with
the imports from the stats package listed:
import com.aliasi.stats.AnnealingSchedule;
import com.aliasi.stats.LogisticRegression;
import com.aliasi.stats.RegressionPrior;

public static void main(String[] args) {
    LogisticRegression regression
        = LogisticRegression.estimate(INPUTS,
                                      OUTPUTS,
                                      RegressionPrior.noninformative(),
                                      AnnealingSchedule.inverse(.05,100),
                                      0.000000001, // min improve
                                      1, // min epochs
                                      10000, // max epochs
                                      null); // no print feedback
The parameters to the estimate() method involve
the inputs and outputs, model prior hyperparameter, parameters
to control the search, and a progress monitor parameter.
Input Vectors and Output Categories
The first two arguments to the estimate() method
are just the parallel arrays of input vectors and output categories;
these were defined in the previous section.
Prior Hyperparameter
The hyperparameter controlling fitting in the model is the prior.
The priors are defined in the stats.RegressionPrior
class. This example uses a so-called noninformative prior, the upshot
of which is that the estimate will be a maximum likelihood estimate
(the parameters which assign the highest likelihood to the entire
corpus of training data). We consider other priors in the next
section.
Search Parameters
The remaining parameters all control the search for the estimate. Logistic regression has no analytic solution, so estimating parameters from data requires numerical optimization. LingPipe employs stochastic gradient descent (SGD), a general, highly-scalable online optimization algorithm. SGD makes several passes through the data, adjusting the parameters a little bit based on examples one at a time.
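To make the idea concrete, here is a minimal sketch of stochastic gradient descent for maximum likelihood estimation, written in plain Java. It is not LingPipe's implementation: it uses a fixed learning rate, no prior, and a fixed number of epochs, and the class name SgdSketch and the toy data in main() are invented for the illustration. Each update moves the coefficients for category c in the direction (1 if the example's category is c, else 0) minus p(c | x), times x.

```java
/** Sketch: plain SGD for multinomial logistic regression (no prior). */
final class SgdSketch {

    /** p(c | x) with the last category's coefficient vector fixed at zero. */
    static double[] probabilities(double[][] betas, double[] x) {
        int k = betas.length + 1;
        double[] probs = new double[k];
        double z = 0.0;
        for (int c = 0; c < k; ++c) {
            double linear = 0.0;
            if (c < k - 1)
                for (int j = 0; j < x.length; ++j)
                    linear += betas[c][j] * x[j];
            probs[c] = Math.exp(linear);
            z += probs[c];
        }
        for (int c = 0; c < k; ++c)
            probs[c] /= z;
        return probs;
    }

    /** Corpus log likelihood: sum of log p(ys[i] | xs[i]). */
    static double logLikelihood(double[][] betas, double[][] xs, int[] ys) {
        double ll = 0.0;
        for (int i = 0; i < xs.length; ++i)
            ll += Math.log(probabilities(betas, xs[i])[ys[i]]);
        return ll;
    }

    /** Visits each example once per epoch, nudging coefficients after each one. */
    static double[][] estimate(double[][] xs, int[] ys, int numCategories,
                               double learningRate, int epochs) {
        int d = xs[0].length;
        double[][] betas = new double[numCategories - 1][d];
        for (int epoch = 0; epoch < epochs; ++epoch) {
            for (int i = 0; i < xs.length; ++i) {
                double[] p = probabilities(betas, xs[i]);
                for (int c = 0; c < numCategories - 1; ++c) {
                    double error = (ys[i] == c ? 1.0 : 0.0) - p[c];
                    for (int j = 0; j < d; ++j)
                        betas[c][j] += learningRate * error * xs[i][j];
                }
            }
        }
        return betas;
    }

    public static void main(String[] args) {
        // Toy data: an intercept plus one feature, three categories.
        double[][] xs = { {1,0}, {1,0}, {1,1}, {1,1}, {1,2}, {1,2}, {1,3}, {1,3} };
        int[] ys = { 0, 0, 1, 0, 1, 2, 2, 2 };
        double[][] betas = estimate(xs, ys, 3, 0.05, 200);
        System.out.println("log likelihood before: " + logLikelihood(new double[2][2], xs, ys));
        System.out.println("log likelihood after:  " + logLikelihood(betas, xs, ys));
    }
}
```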
The first search parameter is the annealing schedule. Annealing the
learning rate is a widely used technique in numerical optimization. It involves
starting with large learning rates and gradually reducing the activity
of the learner over time. The annealing schedule used in this demo is
inverse, meaning that the learning rate decays in inverse proportion to
the number of epochs completed. The parameters 0.05 and 100 are the initial
learning rate and the annealing rate, which controls how quickly the
learning rate falls. There is more information
about annealing in the class documentation for stats.AnnealingSchedule.
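As a rough illustration of how annealing schedules behave, the sketch below computes per-epoch learning rates for two common forms. The exponential form, initial rate times base to the power of the epoch, agrees with the learning rates shown in the progress reports later in this tutorial. The inverse form used here, initial rate divided by (1 + epoch / annealing rate), is an assumption about what an inverse schedule computes; the class documentation is the authority. The class AnnealingSketch and its method names are hypothetical.

```java
/** Sketch: learning rates under two annealing schedules. */
final class AnnealingSketch {

    /** Exponential decay: initialLearningRate * base^epoch. */
    static double exponential(double initialLearningRate, double base, int epoch) {
        return initialLearningRate * Math.pow(base, epoch);
    }

    /** Inverse decay (assumed form): initialLearningRate / (1 + epoch / annealingRate). */
    static double inverse(double initialLearningRate, double annealingRate, int epoch) {
        return initialLearningRate / (1.0 + epoch / annealingRate);
    }

    public static void main(String[] args) {
        for (int epoch : new int[] { 0, 1, 10, 100, 1000 }) {
            System.out.printf("epoch=%4d exponential=%.9f inverse=%.9f%n",
                              epoch,
                              exponential(0.002, 0.9975, epoch),
                              inverse(0.05, 100, epoch));
        }
    }
}
```

The exponential schedule shrinks the rate by a constant factor each epoch, whereas the inverse schedule falls quickly at first and then ever more slowly.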
The second search parameter, 0.000000001, indicates how
tight the estimate must be before stopping the search. This is measured
in relative corpus log likelihood. That is, if the corpus log likelihood
in an epoch (a run through all the input/output pairs) improves
by less than 0.0000001 percent, the search is terminated.
The third and fourth search parameters, 1 and
10000, indicate the minimum and maximum number of epochs, that is, the
minimum and maximum number of times each training example is visited.
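The interaction of the three stopping parameters can be sketched as a single test applied after each epoch. This is a guess at the general shape of such a rule, not LingPipe's code: the relative change is computed here as the absolute difference divided by the magnitude of the previous value, which is an assumption, and the names StoppingSketch, relativeChange() and shouldStop() are hypothetical.

```java
/** Sketch: a stopping rule combining min improvement, min epochs and max epochs. */
final class StoppingSketch {

    /** Relative change between successive log likelihoods (assumed definition). */
    static double relativeChange(double previous, double current) {
        // Corpus log likelihoods are negative in practice, so previous is non-zero.
        return Math.abs(current - previous) / Math.abs(previous);
    }

    /** Decides whether to stop after the given number of completed epochs. */
    static boolean shouldStop(int epochsCompleted, double previousLl, double currentLl,
                              double minImprovement, int minEpochs, int maxEpochs) {
        if (epochsCompleted >= maxEpochs)
            return true;                     // hard cap on the search
        if (epochsCompleted < minEpochs)
            return false;                    // too early to test convergence
        return relativeChange(previousLl, currentLl) < minImprovement;
    }

    public static void main(String[] args) {
        System.out.println(shouldStop(5, -100.0, -99.0, 0.000000001, 1, 10000));
        System.out.println(shouldStop(5, -100.0, -100.0, 0.000000001, 1, 10000));
    }
}
```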
Progress Monitor Parameter
The final parameter, null here, is for
a java.io.PrintWriter to which feedback about
the progress of the search will be printed. A standard
value for this would be new PrintWriter(System.out),
which would provide progress reports to standard output.
Applying a Trained Model
Once a regression model is trained, it may be used to probabilistically classify new vectors of the same dimensionality as the training data.
The sample code for the wallet problem goes on to classify some randomly generated inputs.
...
Input Vector         Outcome Conditional Probabilities
1.0 0.0 0.0 1.0 1.0  p(0|input)=0.02 p(1|input)=0.13 p(2|input)=0.86
1.0 0.0 1.0 0.0 0.0  p(0|input)=0.07 p(1|input)=0.28 p(2|input)=0.66
1.0 0.0 1.0 3.0 1.0  p(0|input)=0.28 p(1|input)=0.18 p(2|input)=0.54
The third input represents a female business student who was physically punished through high school with explanation. The model predicts she is 28 percent likely to keep the wallet and money, and only 54 percent likely to return both.
The code to compute the outcome probabilities given the input just feeds the input vectors to the regression model to produce an array of output conditional probabilities (omitting some of the print statements):
for (Vector testCase : TEST_INPUTS) {
    double[] conditionalProbs = regression.classify(testCase);
    for (int i = 0; i < testCase.numDimensions(); ++i)
        System.out.printf("%3.1f ",testCase.value(i));
    for (int k = 0; k < conditionalProbs.length; ++k)
        System.out.printf(" p(%d|input)=%4.2f ",k,conditionalProbs[k]);
}
The variable TEST_INPUTS is an array of vector objects,
of the same format as the training inputs array. The key method call in the
code is regression.classify(testCase), which applies the trained regression
model to classify a test case. The rest just goes through the output and
prints it out in a readable fashion.
Regularization with Priors
Regression models have a tendency to overfit their training data, so priors are introduced to control the complexity of the fitted model.
The Overfitting Problem
Problems with Maximum Likelihood
Logistic regression models with large numbers of features and limited amounts of training data are highly prone to overfitting under maximum likelihood estimation. A model is overfit if it is a tight match to the training data but does not generalize well to new data. The model is called "overfit" because it is too closely tailored to the training data. The maximum likelihood estimation procedure is at the root of the problem, because it simply fits the training data as tightly as possible.
Linearly Separable Problems
A particularly pathological case of overfitting arises when the data is linearly separable. A simple case is when a feature value (dimension of the input) is positive if and only if the instance has one particular output. For instance, in a study of 195 students, it might have turned out that every male, and no female, kept the wallet and money. In this case, the coefficient for outcome 0 for the male feature will be unbounded; making it larger always increases the probability of the training data.
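The unbounded coefficient is easy to see numerically. The sketch below uses a made-up two-outcome data set with a single feature that is 1 exactly when the outcome is 0, and evaluates the log likelihood for increasing values of the one coefficient. The class SeparableSketch and its data are hypothetical; the point is only that the log likelihood keeps rising, so maximum likelihood has no finite optimum.

```java
/** Sketch: log likelihood keeps growing with the coefficient on separable data. */
final class SeparableSketch {

    /**
     * Two-outcome model with one coefficient and no intercept:
     * p(0 | x) = exp(beta * x) / (1 + exp(beta * x)).
     */
    static double logLikelihood(double beta, double[] xs, int[] ys) {
        double ll = 0.0;
        for (int i = 0; i < xs.length; ++i) {
            double p0 = 1.0 / (1.0 + Math.exp(-beta * xs[i]));
            ll += Math.log(ys[i] == 0 ? p0 : 1.0 - p0);
        }
        return ll;
    }

    public static void main(String[] args) {
        double[] xs = { 1, 1, 0, 0 };   // feature is 1 exactly when outcome is 0
        int[] ys = { 0, 0, 1, 1 };
        for (double beta = 1.0; beta <= 16.0; beta *= 2.0)
            System.out.printf("beta=%5.1f logLikelihood=%.6f%n",
                              beta, logLikelihood(beta, xs, ys));
    }
}
```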
Priors on Coefficients
To compensate for the tendency of regression models to overfit, it is common to establish prior expectations for the values of parameters. These prior densities are designed to favor simple models. Simplicity for regression models means small regression coefficients, so the priors tend to concentrate parameters around zero. With smaller coefficients, the change in probability for a given change in an input dimension is less and thus the overall estimate is less variable.
Varieties of Priors
LingPipe implements three priors for regression: Cauchy (Student-t with one degree of freedom), Laplace (double exponential), and Gaussian (normal). The priors are listed here from fattest to thinnest tails. The Cauchy distribution is so dispersed, in fact, that it does not have a finite mean or variance. The Laplace distribution is so peaked around its mean that it tends to drive most posterior coefficient estimates to its mean.
Because we wish to push coefficients toward zero, we only consider priors with mean (or median in the case of the Cauchy) zero. The variance (or scale in the case of the Cauchy) determines how widely dispersed the distribution is, whereas the weight of the tails relative to the rest of the distribution is determined by the choice of distribution family.
Priors with means of zero exert a shrinkage effect on parameters relative to maximum likelihood estimates. Applying priors is thus sometimes called "shrinkage".
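One way to compare the priors is to look at the log density each assigns to a single coefficient, since that is the amount added to the log likelihood during MAP estimation. The sketch below uses the standard textbook densities for zero-centered Gaussian, Laplace and Cauchy distributions; whether LingPipe parameterizes its priors in exactly this way is an assumption, and the class PriorSketch is hypothetical.

```java
/** Sketch: log densities of zero-centered priors for one coefficient. */
final class PriorSketch {

    /** Gaussian with the given variance. */
    static double logGaussian(double beta, double variance) {
        return -0.5 * Math.log(2.0 * Math.PI * variance)
            - beta * beta / (2.0 * variance);
    }

    /** Laplace with the given variance; its scale b satisfies variance = 2 * b^2. */
    static double logLaplace(double beta, double variance) {
        double scale = Math.sqrt(variance / 2.0);
        return -Math.log(2.0 * scale) - Math.abs(beta) / scale;
    }

    /** Cauchy with the given scale (it has no finite variance). */
    static double logCauchy(double beta, double scale) {
        double z = beta / scale;
        return -Math.log(Math.PI * scale * (1.0 + z * z));
    }

    public static void main(String[] args) {
        for (double beta : new double[] { 0.0, 0.5, 1.0, 5.0 })
            System.out.printf("beta=%.1f gaussian=%.3f laplace=%.3f cauchy=%.3f%n",
                              beta,
                              logGaussian(beta, 1.0),
                              logLaplace(beta, 1.0),
                              logCauchy(beta, 1.0));
    }
}
```

For a large coefficient the Gaussian penalty grows quadratically, the Laplace penalty linearly, and the Cauchy penalty only logarithmically.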
Running the Demo
The ant target regularization demonstrates regularization with priors
over the wallet data.
> ant regularization

VARIANCE=0.0010
Prior=LaplaceRegressionPrior(Variance=0.0010, noninformativeIntercept=true)
0) -1.62, 0.00, 0.00, 0.00, 0.00,
1) -0.88, 0.00, 0.00, 0.00, 0.00,
Prior=GaussianRegressionPrior(Variance=0.0010, noninformativeIntercept=true)
0) -1.63, -0.00, 0.00, 0.01, -0.02,
1) -0.84, 0.03, 0.01, 0.02, 0.01,
Prior=CauchyRegressionPrior(Scale=0.0010, noninformativeIntercept=true)
0) -3.36, 0.00, 0.00, 1.08, -0.00,
1) -0.88, 0.01, 0.00, 0.01, 0.00,
...
VARIANCE=0.512
Prior=LaplaceRegressionPrior(Variance=0.512, noninformativeIntercept=true)
0) -3.00, 0.63, 0.57, 0.92, -0.91,
1) -1.12, 0.88, 0.03, 0.06, -0.39,
Prior=GaussianRegressionPrior(Variance=0.512, noninformativeIntercept=true)
0) -3.13, 0.75, 0.76, 0.96, -0.98,
1) -1.23, 0.90, 0.28, 0.15, -0.51,
Prior=CauchyRegressionPrior(Scale=0.512, noninformativeIntercept=true)
0) -3.14, 0.80, 0.77, 0.98, -1.12,
1) -1.26, 0.96, 0.23, 0.16, -0.50,
...
VARIANCE=524.288
Prior=LaplaceRegressionPrior(Variance=524.288, noninformativeIntercept=true)
0) -3.46, 1.24, 1.15, 1.08, -1.57,
1) -1.27, 1.17, 0.42, 0.19, -0.78,
Prior=GaussianRegressionPrior(Variance=524.288, noninformativeIntercept=true)
0) -3.48, 1.27, 1.17, 1.09, -1.60,
1) -1.28, 1.18, 0.43, 0.20, -0.79,
Prior=CauchyRegressionPrior(Scale=524.288, noninformativeIntercept=true)
0) -3.48, 1.27, 1.17, 1.09, -1.60,
1) -1.28, 1.18, 0.43, 0.20, -0.79,
With very low prior variance, as shown in the first example with a prior variance of 0.001, the coefficients other than the intercept, which has a noninformative prior, are for the most part driven close to zero in the posterior. As the variance increases, the results get closer and closer to the maximum likelihood estimates, with only the Laplace prior still just barely shrinking a few parameters.
Also note that for a given variance (or scale for the Cauchy), the Cauchy exerts the least push toward zero and the Laplace the most push toward zero. In the natural language problems we consider in the next section, a fairly liberal Laplace prior still drives most posterior parameters to zero.
Code Walk Through
We return to the wallet example in a demo of the effects of regularization in
src/RegularizationDemo.java.
The main() method just loops over variances trying all the priors:
for (double variance = 0.001; variance <= 1000; variance *= 2.0) {
    System.out.println("\n\nVARIANCE=" + variance);
    evaluate(RegressionPrior.laplace(variance,true));
    evaluate(RegressionPrior.gaussian(variance,true));
    evaluate(RegressionPrior.cauchy(variance,true));
}
The evaluation program just fits a model and prints out the results, just as in the wallet example:
static void evaluate(RegressionPrior prior) {
    LogisticRegression regression
        = LogisticRegression.estimate(WalletProblem.INPUTS,
                                      WalletProblem.OUTPUTS,
                                      prior,
                                      AnnealingSchedule.inverse(.05,100),
                                      0.0000001,
                                      10,
                                      5000,
                                      null);
    Vector[] betas = regression.weightVectors();
    ...
Feature Extractors and Text Classification
As implemented in the LingPipe stats package,
logistic regression operates over input vectors, integer
outcomes, and arrays of conditional probabilities. This
is the basic material required to implement
a classifier that produces conditional probability classifications.
The Logistic Regression Classifier
Several classes are involved in adapting the stats package logistic regression models to implementations of classifiers. First, a feature extractor is used to convert arbitrary objects into mappings from string-based features to values. Second, a symbol table converts these features into dimensions. Together, a feature extractor and symbol table support the conversion of arbitrary objects into vectors. Finally, another symbol table is used to convert the string-based category representations in the classification package into the integer representation required by the statistics package.
The class
classify.LogisticRegressionClassifier handles all the details of this adaptation, as shown
in the code examples below.
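The conversion from a feature map to a vector can be sketched without any LingPipe classes. In the following illustration, the symbol table is just a map from feature names to dimension numbers, and features that are not in the table are dropped when a vector is built. The class FeatureVectorSketch and its methods are hypothetical and far simpler than the real adapter, which also handles categories, sparse vectors and feature count thresholds.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: feature map plus symbol table yields a vector. */
final class FeatureVectorSketch {

    /** Returns the dimension for a feature name, assigning the next free one if new. */
    static int getOrAddSymbol(Map<String,Integer> symbolTable, String feature) {
        Integer id = symbolTable.get(feature);
        if (id == null) {
            id = symbolTable.size();
            symbolTable.put(feature, id);
        }
        return id;
    }

    /** Converts a feature map to a dense vector; unknown features are dropped. */
    static double[] toVector(Map<String,? extends Number> features,
                             Map<String,Integer> symbolTable) {
        double[] vector = new double[symbolTable.size()];
        for (Map.Entry<String,? extends Number> entry : features.entrySet()) {
            Integer dimension = symbolTable.get(entry.getKey());
            if (dimension != null)
                vector[dimension] = entry.getValue().doubleValue();
        }
        return vector;
    }

    public static void main(String[] args) {
        // Symbol table built from features seen in training.
        Map<String,Integer> symbolTable = new LinkedHashMap<>();
        for (String feature : new String[] { "for", "sale", "PC" })
            getOrAddSymbol(symbolTable, feature);

        // Features extracted from a new object; "Mormon" was not seen in training.
        Map<String,Integer> features = new LinkedHashMap<>();
        features.put("sale", 2);
        features.put("Mormon", 1);
        System.out.println(Arrays.toString(toVector(features, symbolTable)));
    }
}
```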
Running the Demo
There's a simple demo implementation of natural language classification based on the 4-newsgroup data distributed with LingPipe and discussed in the Topic Classification Tutorial. The data is the bodies of messages to four easily confusable newsgroups:
- soc.religion.christian
- talk.religion.misc
- alt.atheism
- misc.forsale
The demo is run with the ant target nl-topics:
> ant nl-topics

Reading data.
Num instances=250.
Permuting corpus.
EVALUATING FOLDS
Logistic Regression Progress Report
Number of dimensions=1462
Number of Outcomes=4
Number of Parameters=4386
Prior: LaplaceRegressionPrior(Variance=0.5, noninformativeIntercept=true)
Annealing Schedule=Exponential(initialLearningRate=0.0020, base=0.9975)
Minimum Epochs=100
Maximum Epochs=1000
Minimum Improvement Per Period=1.0E-7
Has Sparse Inputs=true
Has Informative Prior=true
...
The first part of the output reports back on some of the parameters set in the code. For instance, there are 1462 unique feature dimensions and 4386 parameters (three explicit coefficient vectors of 1462 dimensions each), the prior is a Laplace prior with variance 0.5 and a noninformative prior on the intercept, the annealing schedule is exponential, and so on.
The demo then provides feedback on the epoch-by-epoch progress of the stochastic gradient descent algorithm used for estimation.
...
epoch= 0 lr=0.002000000 ll= -392.4239 lp= -34.4435 llp= -426.8675 llp*= -426.8675 :00
epoch= 1 lr=0.001995000 ll= -342.0040 lp= -43.1246 llp= -385.1286 llp*= -385.1286 :00
epoch= 2 lr=0.001990013 ll= -294.9343 lp= -43.7030 llp= -338.6373 llp*= -338.6373 :00
epoch= 3 lr=0.001985037 ll= -249.8116 lp= -44.6577 llp= -294.4693 llp*= -294.4693 :00
...
epoch= 997 lr=0.000164891 ll= -53.4495 lp= -54.3246 llp= -107.7740 llp*= -107.7740 :15
epoch= 998 lr=0.000164478 ll= -53.4494 lp= -54.3239 llp= -107.7732 llp*= -107.7732 :15
epoch= 999 lr=0.000164067 ll= -53.4492 lp= -54.3232 llp= -107.7725 llp*= -107.7725 :15
...
In each epoch, the algorithm visits every training instance and
adjusts each coefficient based on the current model and the training
instance. The reports indicate the epoch number, the learning rate
for that epoch (lr), the log likelihood of the data in
the model (ll), the log prior of the current set of
coefficients (lp), the sum of the log likelihood and log prior
(llp) [note that this is just negative error], the best
sum so far (llp*), and finally, the elapsed time, down to the
second. In this case, estimation took 15 seconds, resulting in a log
likelihood of -53.4, a log prior of -54.3, and a sum of -107.8.
After the feedback on estimation, the demo program prints out
the features by name and their coefficient weights. In this instance, features
are alphabetic or numeric tokens (or the intercept). Here are
the top positive and negative coefficients for each category, as well as
some zero coefficients from the first category, alt.atheism:
CLASSIFIER & FEATURES
NUMBER OF CATEGORIES=4
NUMBER OF FEATURES=1462
CATEGORY=alt.atheism
Jim 0.542459
some 0.519327
atheists 0.454521
on 0.346886
they 0.264611
model 0.259512
at 0.233116
is 0.225087
article 0.154355
ICO 0.153443
TEK 0.153443
vice 0.153192
of 0.137563
The 0.136769
mcl 0.134621
timmbake 0.134621
...
approach -0.000000
causes -0.000000
equally -0.000000
happen -0.000000
ignoring -0.000000
immediately -0.000000
later -0.000000
provided -0.000000
regard -0.000000
separate -0.000000
small -0.000000
sounds -0.000000
stand -0.000000
week -0.000000
willing -0.000000
women -0.000000
America -0.000000
cultural -0.000000
disciples -0.000000
speaking -0.000000
implied -0.000000
debate -0.000000
...
7 -0.179450
that -0.237454
2 -0.239784
do -0.269343
Mormon -0.291933
very -0.293706
Christian -0.339846
ca -0.374287
in -0.425005
*&^INTERCEPT%$^&** -0.947079
Here the name "Jim" is the most positively indicative of
the alt.atheism topic, and the intercept the most
negative feature. For some reason the negative features include the words
"in", "very" and "do", which seem unlikely features
to discriminate atheism from other religious topics. This is a problem
with unigram (single token) features: they are often perplexing.
Features like "Christian" may show up as negative because the alt.atheism
board may refer less to Christians (or Mormons) as a class.
Compare the alt.atheism topic with the misc.forsale
topic, where words like "PC" or "sale" are strongly
positive and again words like "Mormon" are negative, with some
perplexing entries like "that".
CATEGORY=misc.forsale
for 0.888025
PC 0.747638
drive 0.700343
or 0.510954
2 0.377641
edu 0.282761
300 0.253986
on 0.251271
would 0.194138
sale 0.147936
00 0.103214
*&^INTERCEPT%$^&** 0.083822
...
ca -0.000098
Book -0.058855
the -0.084477
In -0.108700
of -0.120941
Mormon -0.246687
to -0.461844
that -0.726345
Re -0.883255
Finally, here are the top positive and negative features
for soc.religion.christian:
CATEGORY=soc.religion.christian
rutgers 1.068456
life 0.763778
has 0.700812
May 0.676530
Mary 0.565283
athos 0.530967
who 0.354184
doctrine 0.263468
Orthodox 0.242234
Trinity 0.193247
s 0.178808
verses 0.142781
...
NNTP -0.026967
we -0.046827
edu -0.047974
A -0.055054
Mormon -0.056953
The -0.060776
Robert -0.077967
the -0.079859
were -0.108695
ca -0.120587
it -0.156520
you -0.157730
Organization -0.164869
a -0.219583
Christian -0.294424
Distribution -0.389873
*&^INTERCEPT%$^&** -0.437988
Posting -0.470168
Host -0.486523
Oddly, this category has junk from the mail headers and signatures and what not, like "rutgers", "NNTP" and "Posting".
Variance of Coefficient Estimates
To get some feeling for the variability of the feature estimates, here are the top positive and negative features for the second fold of a four-way cross-validation of which the above reports the first fold:
CATEGORY=misc.forsale
for 1.215978
drive 0.748811
*&^INTERCEPT%$^&** 0.629428
PC 0.555030
on 0.525066
sale 0.515190
2 0.479129
Host 0.266548
Posting 0.266545
...
COM -0.000069
been -0.000102
Sun -0.000180
are -0.034722
in -0.070332
the -0.114678
of -0.272626
In -0.276476
to -0.372242
Re -0.459132
that -0.806237
Code Walkthrough
The code for generating this demo is in src/TextClassificationDemo.java.
First, it builds up a corpus instance just as in the cross-validation
demo in the topic classification tutorial:
public static void main(String[] args) throws Exception {
    ...
    PrintWriter progressWriter = new PrintWriter(System.out,true);
    int numFolds = 4;
    XValidatingClassificationCorpus<CharSequence> corpus
        = ...
    corpus.permuteCorpus(new Random(7117)); // destroys runs of categories
    TokenizerFactory tokenizerFactory
        = new RegExTokenizerFactory("\\p{L}+|\\d+"); // letter+ | digit+
    FeatureExtractor<CharSequence> featureExtractor
        = new TokenFeatureExtractor(tokenizerFactory);
    int minFeatureCount = 5;
    boolean addInterceptFeature = true;
    boolean noninformativeIntercept = true;
    double priorVariance = 0.5;
    RegressionPrior prior
        = RegressionPrior.laplace(priorVariance,noninformativeIntercept);
    AnnealingSchedule annealingSchedule
        = AnnealingSchedule.exponential(0.002,0.9975);
    double minImprovement = 0.0000001;
    int minEpochs = 100;
    int maxEpochs = 1000;
    for (int fold = 0; fold < numFolds; ++fold) {
        corpus.setFold(fold);
        LogisticRegressionClassifier<CharSequence> classifier
            = LogisticRegressionClassifier.<CharSequence>train(featureExtractor,
                                                               corpus,
                                                               minFeatureCount,
                                                               addInterceptFeature,
                                                               prior,
                                                               annealingSchedule,
                                                               minImprovement,
                                                               minEpochs,
                                                               maxEpochs,
                                                               progressWriter);
        ...
The training method for the logistic regression classifier is almost
identical to the static method for estimating logistic regression models
in the stats package. The main difference is that the
classifier requires an instance of util.FeatureExtractor. The feature extractor interface defines a single method:
public interface FeatureExtractor<E> {
    public Map<String,? extends Number> features(E in);
}
In the code above, we use a pre-built adapter that converts a
tokenizer factory into a feature extractor. The
tokenizer.TokenFeatureExtractor is constructed with a
tokenizer factory, which it then uses to tokenize character sequences
it receives. The resulting mapping is simply the count of the tokens in the
input.
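A token-count feature extractor of this kind can be approximated in plain Java as follows. This sketch does not use LingPipe's tokenizer classes; it assumes that tokenization amounts to finding successive matches of the same regular expression used in the demo, and the class name TokenCountSketch is made up for the illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: token counts as features, using the demo's token pattern. */
final class TokenCountSketch {

    // Runs of letters or runs of digits, as in the demo's tokenizer factory.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\d+");

    /** Maps each token in the input to the number of times it occurs. */
    static Map<String,Integer> features(CharSequence in) {
        Map<String,Integer> counts = new HashMap<>();
        Matcher matcher = TOKEN.matcher(in);
        while (matcher.find())
            counts.merge(matcher.group(), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(features("PC for sale, PC 300"));
    }
}
```

Note that the counts are case sensitive, which is why features such as "The" and "the" appear separately in the coefficient listings above.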
The features are printed out in order by the classifier itself:
...
progressWriter.println("\nCLASSIFIER & FEATURES\n");
progressWriter.println(classifier);
...
The evaluation is done in the usual way, by having the corpus walk the evaluator over the test section:
...
progressWriter.println("\nEVALUATION\n");
ClassifierEvaluator<CharSequence,ConditionalClassification> evaluator
    = new ClassifierEvaluator<CharSequence,ConditionalClassification>(classifier,CATEGORIES);
corpus.visitTest(evaluator);
progressWriter.printf("FOLD=%5d ACC=%4.2f +/-%4.2f\n",
                      fold,
                      evaluator.confusionMatrix().totalAccuracy(),
                      evaluator.confusionMatrix().confidence95());
}
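For readers who want to see what the per-fold accuracy and its +/- figure might look like numerically, here is a small sketch. Accuracy is the fraction of test cases classified correctly. The interval half-width shown uses the usual normal approximation to the binomial, 1.96 times the square root of p(1-p)/n; that the confusion matrix's confidence95() computes exactly this is an assumption, and the class AccuracySketch is hypothetical.

```java
/** Sketch: accuracy and a normal-approximation 95 percent half-width. */
final class AccuracySketch {

    /** Fraction of test cases classified correctly. */
    static double accuracy(int correct, int total) {
        return (double) correct / total;
    }

    /** Half-width of an approximate 95 percent interval for the accuracy. */
    static double confidence95(int correct, int total) {
        double p = accuracy(correct, total);
        return 1.96 * Math.sqrt(p * (1.0 - p) / total);
    }

    public static void main(String[] args) {
        int correct = 50;   // made-up counts for one test fold
        int total = 62;
        System.out.printf("ACC=%4.2f +/-%4.2f%n",
                          accuracy(correct, total),
                          confidence95(correct, total));
    }
}
```

With only a few dozen test cases per fold, the interval is wide, which is one reason to look at all folds rather than any single one.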