com.aliasi.classify
Class LogisticRegressionClassifier<E>

java.lang.Object
  extended by com.aliasi.classify.LogisticRegressionClassifier<E>
Type Parameters:
E - the type of object being classified
All Implemented Interfaces:
BaseClassifier<E>, Classifier<E,ConditionalClassification>, ConditionalClassifier<E>, RankedClassifier<E>, ScoredClassifier<E>, Compilable, Serializable

public class LogisticRegressionClassifier<E>
extends Object
implements Classifier<E,ConditionalClassification>, ConditionalClassifier<E>, Compilable, Serializable

A LogisticRegressionClassifier provides conditional probability classifications of input objects using an underlying logistic regression model and feature extractor. Logistic regression is a discriminative classifier that operates over arbitrary feature vectors extracted from items. See LogisticRegression for a full definition of logistic regression and its implementation.

Training

Logistic regression classifiers may be trained from a data corpus using the method train(Corpus,FeatureExtractor,int,boolean,RegressionPrior,AnnealingSchedule,double,int,int,Reporter), the last six arguments of which are shared with the logistic regression training method LogisticRegression.estimate(Vector[],int[],RegressionPrior,AnnealingSchedule,Reporter,double,int,int). The first four arguments are required to adapt logistic regression to general classification, and consist of a corpus to train over, a feature extractor, a minimum feature count, and a boolean flag indicating whether or not to add an intercept feature to every input vector.

This class merely acts as an adapter to implement the Classifier interface based on the LogisticRegression class in the statistics package. The basis of the adaptation is a general feature extractor, which is an instance of FeatureExtractor. A feature extractor converts an arbitrary input object (whose type is specified generically in this class) to a mapping from features (represented as strings) to values (represented as instances of Number). The class then uses a symbol table for features to convert the maps from feature names to numbers into sparse vectors, where the dimensions are the identifiers for the features in the symbol table. By convention, if the intercept feature flag is set, it will set dimension 0 of all inputs to 1.0.
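
The conversion described above can be sketched with plain collections. This is a minimal illustration, not the LingPipe implementation: the class FeatureVectorSketch and its methods are hypothetical, a LinkedHashMap stands in for the feature symbol table, a TreeMap stands in for the sparse vector, and dropping features that are missing from the symbol table is an assumption.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative stand-in for the feature-map-to-vector step; not LingPipe code. */
final class FeatureVectorSketch {
    // Intercept feature name as documented for INTERCEPT_FEATURE_NAME.
    static final String INTERCEPT = "*&^INTERCEPT%$^&**";

    private final Map<String,Integer> symbolToId = new LinkedHashMap<>();
    private final boolean addIntercept;

    FeatureVectorSketch(boolean addIntercept) {
        this.addIntercept = addIntercept;
        if (addIntercept) symbolToId.put(INTERCEPT, 0); // reserve dimension 0
    }

    /** Training-time step: gives an unseen feature the next free dimension. */
    int addSymbol(String feature) {
        Integer id = symbolToId.get(feature);
        if (id == null) {
            id = symbolToId.size();
            symbolToId.put(feature, id);
        }
        return id;
    }

    /** Converts a feature map to a sparse vector, keyed by dimension. */
    Map<Integer,Double> toVector(Map<String,? extends Number> features) {
        Map<Integer,Double> vector = new TreeMap<>();
        for (Map.Entry<String,? extends Number> e : features.entrySet()) {
            Integer id = symbolToId.get(e.getKey());
            if (id == null) continue; // assumption: unknown features are dropped
            vector.put(id, e.getValue().doubleValue());
        }
        if (addIntercept) vector.put(0, 1.0); // intercept value is always 1.0
        return vector;
    }

    public static void main(String[] args) {
        FeatureVectorSketch sketch = new FeatureVectorSketch(true);
        sketch.addSymbol("length");       // dimension 1
        sketch.addSymbol("capitalized");  // dimension 2
        Map<String,Integer> features = new HashMap<>();
        features.put("capitalized", 1);
        features.put("never-seen", 4);
        System.out.println(sketch.toVector(features)); // prints {0=1.0, 2=1.0}
    }
}
```

Reserving dimension 0 before any other symbol is added is what lets the intercept flag be applied uniformly to every input.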

Serialization and Compilation

This class implements both Serializable and Compilable, but both do the same thing and simply write the content of the model to the object output. The model read back in will be an instance of LogisticRegressionClassifier with the same components as the model that was serialized or compiled.

Since:
LingPipe3.5
Version:
3.9.2
Author:
Bob Carpenter
See Also:
Serialized Form

Field Summary
static String INTERCEPT_FEATURE_NAME
          The name of the feature used for intercepts, *&^INTERCEPT%$^&**.
 
Method Summary
 boolean addInterceptFeature()
          Returns true if this classifier automatically adds an intercept feature to each feature vector.
 List<String> categorySymbols()
          Returns a copy of the category symbols used by this classifier in the same order as used by the underlying logistic regression model.
 ConditionalClassification classify(E in)
          Returns the conditional classification of the specified object using logistic regression classification.
 ConditionalClassification classifyFeatures(Map<String,? extends Number> featureMap)
          Returns the conditional classification of a feature map using this classifier.
 ConditionalClassification classifyVector(Vector v)
          Returns the classification of the specified vector using the logistic regression model underlying this classifier.
 void compileTo(ObjectOutput objOut)
          Compile this classifier to the specified object output.
 FeatureExtractor<E> featureExtractor()
          Returns an immutable view of the feature extractor for this classifier.
 SymbolTable featureSymbolTable()
          Returns an unmodifiable view of the symbol table used for features in this classifier.
 ObjectToDoubleMap<String> featureValues(String category)
          Returns a mapping from features to their parameter values for the specified category.
 LogisticRegression model()
          Returns the logistic regression model underlying this classifier.
 String toString()
          Returns a string-based representation of this classifier, listing the parameter vectors for each category.
static
<F> LogisticRegressionClassifier<F>
train(Corpus<ObjectHandler<Classified<F>>> corpus, FeatureExtractor<? super F> featureExtractor, int minFeatureCount, boolean addInterceptFeature, RegressionPrior prior, AnnealingSchedule annealingSchedule, double minImprovement, int minEpochs, int maxEpochs, Reporter reporter)
          Returns a trained logistic regression classifier given the specified feature extractor, training corpus, model priors and search parameters.
static
<F> LogisticRegressionClassifier<F>
train(Corpus<ObjectHandler<Classified<F>>> corpus, FeatureExtractor<? super F> featureExtractor, int minFeatureCount, boolean addInterceptFeature, RegressionPrior prior, int priorBlockSize, LogisticRegressionClassifier<F> hotStart, AnnealingSchedule annealingSchedule, double minImprovement, int rollingAverageSize, int minEpochs, int maxEpochs, ObjectHandler<LogisticRegressionClassifier<F>> classifierHandler, Reporter reporter)
          Returns a trained logistic regression classifier given the specified feature extractor, training corpus, model priors and search parameters.
static
<F> LogisticRegressionClassifier<F>
train(FeatureExtractor<? super F> featureExtractor, Corpus<ClassificationHandler<F,Classification>> corpus, int minFeatureCount, boolean addInterceptFeature, RegressionPrior prior, AnnealingSchedule annealingSchedule, double minImprovement, int minEpochs, int maxEpochs, PrintWriter progressWriter)
          Deprecated. Use train(FeatureExtractor,Corpus,int,boolean,RegressionPrior,AnnealingSchedule,Reporter,double,int,int) instead.
static
<F> LogisticRegressionClassifier<F>
train(FeatureExtractor<? super F> featureExtractor, Corpus<ClassificationHandler<F,Classification>> corpus, int minFeatureCount, boolean addInterceptFeature, RegressionPrior prior, AnnealingSchedule annealingSchedule, Reporter reporter, double minImprovement, int minEpochs, int maxEpochs)
          Deprecated. Use train(Corpus,FeatureExtractor,int,boolean,RegressionPrior,AnnealingSchedule,double,int,int,Reporter) instead.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

INTERCEPT_FEATURE_NAME

public static final String INTERCEPT_FEATURE_NAME
The name of the feature used for intercepts, *&^INTERCEPT%$^&**.

See Also:
Constant Field Values
Method Detail

featureSymbolTable

public SymbolTable featureSymbolTable()
Returns an unmodifiable view of the symbol table used for features in this classifier.

Returns:
The feature symbol table for this classifier.

categorySymbols

public List<String> categorySymbols()
Returns a copy of the category symbols used by this classifier in the same order as used by the underlying logistic regression model. Classifications that this class returns will use only these symbols.

Returns:
The category symbols for this classifier.

model

public LogisticRegression model()
Returns the logistic regression model underlying this classifier.

Returns:
A copy of the model underlying this classifier.

addInterceptFeature

public boolean addInterceptFeature()
Returns true if this classifier automatically adds an intercept feature to each feature vector.

Returns:
Whether this classifier adds intercepts to feature vectors.

featureExtractor

public FeatureExtractor<E> featureExtractor()
Returns an immutable view of the feature extractor for this classifier.

Warning: If the feature extractor has side-effects (as, for example, the caching feature extractor does), these will be preserved by the returned result, which merely wraps the contained feature extractor in an anonymous inner feature extractor.

Returns:
The feature extractor for this classifier.
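
The warning above can be illustrated with a small sketch. Everything here is hypothetical: Extractor is a stand-in for the FeatureExtractor interface, Caching is a simplified caching extractor, and view() shows one way an anonymous inner wrapper might delegate. It is not the library's code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class ExtractorViewSketch {
    /** Hypothetical stand-in for a feature extractor interface. */
    interface Extractor<E> {
        Map<String,? extends Number> features(E in);
    }

    /** A caching extractor: filling the cache is a side effect of each call. */
    static final class Caching<E> implements Extractor<E> {
        private final Extractor<E> base;
        final Map<E,Map<String,? extends Number>> cache = new HashMap<>();

        Caching(Extractor<E> base) { this.base = base; }

        public Map<String,? extends Number> features(E in) {
            Map<String,? extends Number> result = cache.get(in);
            if (result == null) {
                result = base.features(in);
                cache.put(in, result);
            }
            return result;
        }
    }

    /** Wraps an extractor so callers see only the features() method. */
    static <E> Extractor<E> view(final Extractor<E> wrapped) {
        return new Extractor<E>() {
            public Map<String,? extends Number> features(E in) {
                return wrapped.features(in); // delegation keeps side effects
            }
        };
    }

    public static void main(String[] args) {
        Caching<String> caching = new Caching<String>(
            s -> Collections.singletonMap("length", s.length()));
        Extractor<String> readOnly = view(caching);
        readOnly.features("abc");
        System.out.println(caching.cache.size()); // prints 1
    }
}
```

The wrapper hides the concrete type, but a call through the view still populates the cache of the wrapped extractor.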

classifyVector

public ConditionalClassification classifyVector(Vector v)
Returns the classification of the specified vector using the logistic regression model underlying this classifier. This bypasses the conversion of an object to a feature map, and the subsequent conversion of a feature map to a vector.

Parameters:
v - Vector to classify.
Returns:
Conditional classification of the vector.
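
As a rough illustration of what classifying a vector involves, the sketch below computes conditional probabilities from per-category weight rows. It assumes the standard multinomial logistic form with the last category as the reference category (all-zero parameters, as described under featureValues). The authoritative definition is in LogisticRegression; the class and method names here are hypothetical and a dense array replaces the Vector type.

```java
import java.util.Arrays;

final class SoftmaxSketch {
    /**
     * Conditional probabilities for a dense input x given K-1 weight rows.
     * The last category has no row; its score is fixed at zero.
     */
    static double[] conditionalProbs(double[][] weights, double[] x) {
        int k = weights.length + 1;
        double[] scores = new double[k]; // scores[k-1] stays 0.0
        double max = 0.0;
        for (int c = 0; c < k - 1; c++) {
            double dot = 0.0;
            for (int i = 0; i < x.length; i++) dot += weights[c][i] * x[i];
            scores[c] = dot;
            max = Math.max(max, dot);
        }
        double z = 0.0;
        double[] probs = new double[k];
        for (int c = 0; c < k; c++) {
            probs[c] = Math.exp(scores[c] - max); // subtract max for stability
            z += probs[c];
        }
        for (int c = 0; c < k; c++) probs[c] /= z; // normalize to sum to 1
        return probs;
    }

    public static void main(String[] args) {
        double[][] weights = { { 0.5, -1.0 }, { 0.0, 2.0 } }; // 3 categories
        double[] x = { 1.0, 0.25 };
        System.out.println(Arrays.toString(conditionalProbs(weights, x)));
    }
}
```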

classifyFeatures

public ConditionalClassification classifyFeatures(Map<String,? extends Number> featureMap)
Returns the conditional classification of a feature map using this classifier. This method bypasses the feature extraction step performed by classify(Object), which converts an object to a feature map. The feature map is still converted to a vector using the feature symbol table featureSymbolTable() and the flag addInterceptFeature().

Parameters:
featureMap - The feature map to classify.
Returns:
The conditional classification of the feature map.

classify

public ConditionalClassification classify(E in)
Returns the conditional classification of the specified object using logistic regression classification. The result includes a conditional probability for every category.

Specified by:
classify in interface BaseClassifier<E>
Specified by:
classify in interface Classifier<E,ConditionalClassification>
Specified by:
classify in interface ConditionalClassifier<E>
Specified by:
classify in interface RankedClassifier<E>
Specified by:
classify in interface ScoredClassifier<E>
Parameters:
in - Input object to classify.
Returns:
The conditional classification of the object.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Compile this classifier to the specified object output. This method is only for storage convenience; the classifier read back in from the serialized object will be equivalent to this one (but not in the Object.equals() sense).

Serializing this class produces exactly the same output.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this classifier is written.
Throws:
IOException - If there is an underlying I/O error writing the model to the stream.
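
The round trip described above can be sketched with standard Java serialization. TinyModel is a hypothetical stand-in for a trained classifier, and roundTrip() is an illustrative helper. Neither is part of LingPipe. The sketch shows why the object read back is equivalent in content but not equal in the Object.equals() sense.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

final class RoundTripSketch {
    /** Hypothetical stand-in for a trained classifier. */
    static final class TinyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String[] categories;
        final double[][] weights;

        TinyModel(String[] categories, double[][] weights) {
            this.categories = categories;
            this.weights = weights;
        }
    }

    /** Writes the model to an in-memory object output and reads it back. */
    static Object roundTrip(Serializable model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(model);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        TinyModel model = new TinyModel(
            new String[] { "spam", "ham" }, new double[][] { { 0.5, -1.0 } });
        TinyModel copy = (TinyModel) roundTrip(model);
        System.out.println(copy.equals(model));                           // prints false
        System.out.println(Arrays.deepEquals(copy.weights, model.weights)); // prints true
    }
}
```

TinyModel does not override equals(), so the comparison falls back to identity, which mirrors the caveat in the method description.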

featureValues

public ObjectToDoubleMap<String> featureValues(String category)
Returns a mapping from features to their parameter values for the specified category. If the category is the last category, which implicitly has zero values for all parameters, the map returned by this method will also have zero values for all features.

Parameters:
category - Classification category.
Returns:
The map from features to their parameter values for the specified category.
Throws:
IllegalArgumentException - If the category is unknown.
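
The behavior described for featureValues can be sketched as follows. This is an illustration under assumptions, not the library's implementation: plain lists stand in for the category and feature symbol tables, a java.util.Map replaces ObjectToDoubleMap, and the weight matrix is assumed to hold one row for every category except the last.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FeatureValuesSketch {
    /** Maps each feature to its parameter value for the given category. */
    static Map<String,Double> featureValues(String category,
                                            List<String> categories,
                                            List<String> features,
                                            double[][] weights) {
        int c = categories.indexOf(category);
        if (c < 0)
            throw new IllegalArgumentException("Unknown category: " + category);
        boolean isLast = (c == categories.size() - 1);
        Map<String,Double> values = new LinkedHashMap<>();
        for (int i = 0; i < features.size(); i++) {
            // the last category has no stored row; its parameters are all zero
            values.put(features.get(i), isLast ? 0.0 : weights[c][i]);
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> categories = Arrays.asList("spam", "ham");
        List<String> features = Arrays.asList("free", "meeting");
        double[][] weights = { { 1.5, -0.7 } }; // one row: "spam" only
        System.out.println(featureValues("spam", categories, features, weights));
        System.out.println(featureValues("ham", categories, features, weights));
    }
}
```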

toString

public String toString()
Returns a string-based representation of this classifier, listing the parameter vectors for each category.

Overrides:
toString in class Object
Returns:
A string-based representation of this classifier.

train

@Deprecated
public static <F> LogisticRegressionClassifier<F> train(FeatureExtractor<? super F> featureExtractor,
                                                        Corpus<ClassificationHandler<F,Classification>> corpus,
                                                        int minFeatureCount,
                                                        boolean addInterceptFeature,
                                                        RegressionPrior prior,
                                                        AnnealingSchedule annealingSchedule,
                                                        double minImprovement,
                                                        int minEpochs,
                                                        int maxEpochs,
                                                        PrintWriter progressWriter)
                                             throws IOException
Deprecated. Use train(FeatureExtractor,Corpus,int,boolean,RegressionPrior,AnnealingSchedule,Reporter,double,int,int) instead.

Returns a trained logistic regression classifier given the specified feature extractor, corpus, model priors and search parameters.

Only the training section of the specified corpus is used for training.

See the class documentation above and the class documentation for LogisticRegression for more information on the parameters.

Type Parameters:
F - the type of object to be classified
Parameters:
featureExtractor - Converter from objects to feature maps.
corpus - Corpus of training data.
minFeatureCount - Minimum count for features in corpus to keep feature as part of model.
addInterceptFeature - A flag set to true if an intercept feature should be added to each input vector.
prior - The prior for regularization of the regression.
annealingSchedule - Class to compute learning rate for each epoch.
minImprovement - Relative improvement in error during an epoch below which the search stops.
minEpochs - Minimum number of search epochs.
maxEpochs - Maximum number of epochs.
progressWriter - Writer to which progress reports are written.
Throws:
IOException - If there is an underlying I/O exception reading the data from the corpus.

train

@Deprecated
public static <F> LogisticRegressionClassifier<F> train(FeatureExtractor<? super F> featureExtractor,
                                                        Corpus<ClassificationHandler<F,Classification>> corpus,
                                                        int minFeatureCount,
                                                        boolean addInterceptFeature,
                                                        RegressionPrior prior,
                                                        AnnealingSchedule annealingSchedule,
                                                        Reporter reporter,
                                                        double minImprovement,
                                                        int minEpochs,
                                                        int maxEpochs)
                                             throws IOException
Deprecated. Use train(Corpus,FeatureExtractor,int,boolean,RegressionPrior,AnnealingSchedule,double,int,int,Reporter) instead.

Returns a trained logistic regression classifier given the specified feature extractor, corpus, model priors and search parameters.

Only the training section of the specified corpus is used for training.

See the class documentation above and the class documentation for LogisticRegression for more information on the parameters.

Type Parameters:
F - the type of object to be classified
Parameters:
featureExtractor - Converter from objects to feature maps.
corpus - Corpus of training data.
minFeatureCount - Minimum count for features in corpus to keep feature as part of model.
addInterceptFeature - A flag set to true if an intercept feature should be added to each input vector.
prior - The prior for regularization of the regression.
annealingSchedule - Class to compute learning rate for each epoch.
minImprovement - Relative improvement in error during an epoch below which the search stops.
minEpochs - Minimum number of search epochs.
maxEpochs - Maximum number of epochs.
reporter - Reporter to which progress reports are written, or null for no reporting.
Throws:
IOException - If there is an underlying I/O exception reading the data from the corpus.

train

public static <F> LogisticRegressionClassifier<F> train(Corpus<ObjectHandler<Classified<F>>> corpus,
                                                        FeatureExtractor<? super F> featureExtractor,
                                                        int minFeatureCount,
                                                        boolean addInterceptFeature,
                                                        RegressionPrior prior,
                                                        AnnealingSchedule annealingSchedule,
                                                        double minImprovement,
                                                        int minEpochs,
                                                        int maxEpochs,
                                                        Reporter reporter)
                                             throws IOException
Returns a trained logistic regression classifier given the specified feature extractor, training corpus, model priors and search parameters.

Only the training section of the specified corpus is used for training.

See the class documentation above and the class documentation for LogisticRegression for more information on the parameters.

The prior block size defaults to the corpus training size divided by 50.

Type Parameters:
F - the type of object to be classified
Parameters:
corpus - Corpus of training data.
featureExtractor - Converter from objects to feature maps.
minFeatureCount - Minimum count for features in corpus to keep feature as part of model.
addInterceptFeature - A flag set to true if an intercept feature should be added to each input vector.
prior - The prior for regularization of the regression.
annealingSchedule - Class to compute learning rate for each epoch.
minImprovement - Relative improvement in error during an epoch below which the search stops.
minEpochs - Minimum number of search epochs.
maxEpochs - Maximum number of epochs.
reporter - Reporter to which progress reports are written, or null for no reporting.
Throws:
IOException - If there is an underlying I/O exception reading the data from the corpus.
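
The effect of the minFeatureCount parameter can be sketched with plain collections. This is a minimal illustration, not the training code: the class name is hypothetical, feature maps are given directly instead of being produced by a feature extractor, and counting a feature once per training item in which it appears is an assumption about how the count is taken.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class MinCountSketch {
    /** Keeps features that appear in at least minFeatureCount items. */
    static Set<String> keptFeatures(
            List<? extends Map<String,? extends Number>> items,
            int minFeatureCount) {
        Map<String,Integer> counts = new HashMap<>();
        for (Map<String,? extends Number> item : items)
            for (String feature : item.keySet())
                counts.merge(feature, 1, Integer::sum); // one count per item
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String,Integer> e : counts.entrySet())
            if (e.getValue() >= minFeatureCount) kept.add(e.getKey());
        return kept;
    }

    public static void main(String[] args) {
        Map<String,Integer> first = new HashMap<>();
        first.put("free", 2);
        first.put("offer", 1);
        Map<String,Integer> second = Collections.singletonMap("free", 1);
        List<Map<String,Integer>> items = Arrays.asList(first, second);
        System.out.println(keptFeatures(items, 2)); // prints [free]
    }
}
```

Only the features that survive this filter would receive a dimension in the feature symbol table, which keeps rare features from enlarging the model.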

train

public static <F> LogisticRegressionClassifier<F> train(Corpus<ObjectHandler<Classified<F>>> corpus,
                                                        FeatureExtractor<? super F> featureExtractor,
                                                        int minFeatureCount,
                                                        boolean addInterceptFeature,
                                                        RegressionPrior prior,
                                                        int priorBlockSize,
                                                        LogisticRegressionClassifier<F> hotStart,
                                                        AnnealingSchedule annealingSchedule,
                                                        double minImprovement,
                                                        int rollingAverageSize,
                                                        int minEpochs,
                                                        int maxEpochs,
                                                        ObjectHandler<LogisticRegressionClassifier<F>> classifierHandler,
                                                        Reporter reporter)
                                             throws IOException
Returns a trained logistic regression classifier given the specified feature extractor, training corpus, model priors and search parameters.

Only the training section of the specified corpus is used for training.

See the class documentation above and the class documentation for LogisticRegression for more information on the parameters.

Type Parameters:
F - the type of object to be classified
Parameters:
corpus - Corpus of training data.
featureExtractor - Converter from objects to feature maps.
minFeatureCount - Minimum count for features in corpus to keep feature as part of model.
addInterceptFeature - A flag set to true if an intercept feature should be added to each input vector.
prior - The prior for regularization of the regression.
priorBlockSize - Number of examples whose gradient is updated before the prior gradient is updated.
hotStart - Logistic regression classifier to use as initial coefficient values for training.
annealingSchedule - Class to compute learning rate for each epoch.
minImprovement - Relative improvement in error during an epoch below which the search stops.
rollingAverageSize - Number of epochs over which to average objective improvement for monitoring convergence.
minEpochs - Minimum number of search epochs.
maxEpochs - Maximum number of epochs.
classifierHandler - Handler for classifiers produced at each epoch.
reporter - Reporter to which progress reports are written, or null for no reporting.
Throws:
IOException - If there is an underlying I/O exception reading the data from the corpus.
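
One way the parameters minImprovement, rollingAverageSize, minEpochs and maxEpochs might interact is sketched below. This is a guess at a plausible stopping rule, not the rule used by LogisticRegression: the class is hypothetical, and the definition of relative improvement as (previous - current) / |previous| is an assumption. Consult the LogisticRegression documentation for the actual convergence test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class StoppingRuleSketch {
    private final double minImprovement;
    private final int rollingAverageSize;
    private final int minEpochs;
    private final int maxEpochs;
    private final Deque<Double> recent = new ArrayDeque<>();
    private double lastError = Double.NaN;
    private int epochs = 0;

    StoppingRuleSketch(double minImprovement, int rollingAverageSize,
                       int minEpochs, int maxEpochs) {
        this.minImprovement = minImprovement;
        this.rollingAverageSize = rollingAverageSize;
        this.minEpochs = minEpochs;
        this.maxEpochs = maxEpochs;
    }

    /** Records one epoch's error and reports whether training should stop. */
    boolean epochDone(double error) {
        epochs++;
        if (!Double.isNaN(lastError)) {
            // assumed definition of relative improvement; guards against /0
            double scale = Math.max(Math.abs(lastError), Double.MIN_NORMAL);
            recent.addLast((lastError - error) / scale);
            if (recent.size() > rollingAverageSize) recent.removeFirst();
        }
        lastError = error;
        if (epochs >= maxEpochs) return true;           // hard cap
        if (epochs < minEpochs) return false;           // too early to stop
        if (recent.size() < rollingAverageSize) return false; // window not full
        double sum = 0.0;
        for (double r : recent) sum += r;
        return sum / recent.size() < minImprovement;    // converged
    }

    public static void main(String[] args) {
        StoppingRuleSketch rule = new StoppingRuleSketch(0.01, 2, 1, 100);
        double[] errors = { 100.0, 50.0, 49.9, 49.85 };
        for (double error : errors)
            System.out.println(error + " -> stop? " + rule.epochDone(error));
    }
}
```

Averaging over several epochs makes the test less sensitive to a single noisy epoch than comparing only the two most recent errors.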