com.aliasi.classify
Class BinaryLMClassifier

java.lang.Object
  extended by com.aliasi.classify.LMClassifier<L,MultivariateEstimator>
      extended by com.aliasi.classify.DynamicLMClassifier<LanguageModel.Dynamic>
          extended by com.aliasi.classify.BinaryLMClassifier
All Implemented Interfaces:
BaseClassifier<CharSequence>, Classifier<CharSequence,JointClassification>, ConditionalClassifier<CharSequence>, JointClassifier<CharSequence>, RankedClassifier<CharSequence>, ScoredClassifier<CharSequence>, ClassificationHandler<CharSequence,Classification>, Handler, ObjectHandler<Classified<CharSequence>>, Compilable

public class BinaryLMClassifier
extends DynamicLMClassifier<LanguageModel.Dynamic>

A BinaryLMClassifier is a boolean dynamic language model classifier for use when there are two categories, but training data is only available for one of the categories.

A binary LM classifier is based on a single language model and a cross-entropy threshold. It defines two categories, accept and reject, with acceptance determined by comparing the input's sample cross-entropy rate under the language model with the threshold, which acts as the maximum for acceptance. Viewed as a language model classifier, the multivariate category estimator is uniform, the accepting language model is dynamic, and the rejecting language model is constant.
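The accept/reject decision described above can be sketched in plain Java. This is a minimal illustration, not LingPipe's implementation: the class CrossEntropyThresholdSketch, its add-one-smoothed character unigram model, and the choice to treat the threshold as an inclusive maximum are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an accepting language model plus threshold test.
// Not part of LingPipe; a toy unigram model replaces the real n-gram models.
class CrossEntropyThresholdSketch {
    private final Map<Character, Integer> counts = new HashMap<>();
    private final int alphabetSize;
    private long total = 0;

    CrossEntropyThresholdSketch(int alphabetSize) {
        this.alphabetSize = alphabetSize;
    }

    // Train the (single) accepting model on a character sequence.
    void train(CharSequence cs) {
        for (int i = 0; i < cs.length(); i++) {
            counts.merge(cs.charAt(i), 1, Integer::sum);
            total++;
        }
    }

    // Add-one smoothed log (base 2) probability of one character.
    double log2Prob(char c) {
        int count = counts.getOrDefault(c, 0);
        return Math.log((count + 1.0) / (total + alphabetSize)) / Math.log(2.0);
    }

    // Sample cross-entropy rate: negative average log2 probability per character.
    // Assumption: an empty sequence is given a rate of 0.0.
    double crossEntropyRate(CharSequence cs) {
        if (cs.length() == 0) return 0.0;
        double sum = 0.0;
        for (int i = 0; i < cs.length(); i++) {
            sum += log2Prob(cs.charAt(i));
        }
        return -sum / cs.length();
    }

    // Accept when the rate is at or below the threshold (inclusive by assumption).
    boolean accepts(CharSequence cs, double crossEntropyThreshold) {
        return crossEntropyRate(cs) <= crossEntropyThreshold;
    }

    public static void main(String[] args) {
        CrossEntropyThresholdSketch model = new CrossEntropyThresholdSketch(256);
        model.train("aaaaaaaa");
        System.out.println(model.accepts("aaa", 6.0));
        System.out.println(model.accepts("zzz", 6.0));
    }
}
```

With eight training characters and an assumed alphabet of 256, "aaa" has a rate of about 4.87 bits per character and "zzz" about 8.04, so a threshold of 6.0 accepts the first and rejects the second.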

As an instance of a language model classifier, this class provides scores that are adjusted per-character average log probabilities, which are roughly negative sample cross-entropy rates (see LMClassifier). The accepting language model behaves in the usual way. The rejecting language model provides a constant per-character log estimate. The uniform rejecting model is defined to be a boundary uniform language model if the specified model is a sequence language model and a process uniform language model otherwise.

Training events may be supplied in the same way as for the superclass DynamicLMClassifier, with two caveats. First, the multivariate category model remains uniform and thus does not contribute to classification. Second, training events for the rejection category are ignored. Thus only the language model for the accepting category is trained. The broader interface is implemented without exceptions in order to allow binary classifiers to be plugged in for ones with explicit rejection models.
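The training rules above can be sketched as follows. The class AcceptOnlyTrainingSketch and its character counter are hypothetical and only mirror the documented behavior: accept-category events train the model, reject-category events are silently ignored, and (per the method documentation) unknown categories raise an IllegalArgumentException.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of accept-only training; not LingPipe code.
class AcceptOnlyTrainingSketch {
    private final String acceptCategory;
    private final String rejectCategory;
    // Stand-in for the state of the accepting language model.
    private final Map<Character, Integer> acceptCounts = new HashMap<>();
    private int trainedChars = 0;

    AcceptOnlyTrainingSketch(String acceptCategory, String rejectCategory) {
        this.acceptCategory = acceptCategory;
        this.rejectCategory = rejectCategory;
    }

    void train(String category, CharSequence cSeq) {
        if (category.equals(acceptCategory)) {
            // Only the accepting model is updated.
            for (int i = 0; i < cSeq.length(); i++) {
                acceptCounts.merge(cSeq.charAt(i), 1, Integer::sum);
                trainedChars++;
            }
        } else if (category.equals(rejectCategory)) {
            // Reject events are accepted by the interface but ignored.
        } else {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
    }

    // Number of characters that actually reached the accepting model.
    int trainedChars() {
        return trainedChars;
    }
}
```

Accepting reject-category events without error is what lets a caller feed the same labeled stream to this kind of classifier and to one with an explicit rejection model.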

Instances of this class are compilable as instances of their superclass. The resulting object read back in will be an instance of LMClassifier, not of this class, but its classification behavior will be identical.
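The effect of reading back a different class with the same behavior can be illustrated with standard Java serialization. This is only an analogy under stated assumptions: LingPipe compiles through compileTo, whereas the sketch below uses writeReplace, and the classes Trainable and Frozen are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration; not LingPipe's compilation mechanism.
class CompileRoundTripSketch {

    // Read-only form: the class that comes back after deserialization.
    static class Frozen implements Serializable {
        private static final long serialVersionUID = 1L;
        final double threshold;

        Frozen(double threshold) {
            this.threshold = threshold;
        }

        boolean accepts(double crossEntropyRate) {
            return crossEntropyRate <= threshold;
        }
    }

    // Trainable form: writes itself out as a Frozen instance.
    static class Trainable implements Serializable {
        private static final long serialVersionUID = 1L;
        final double threshold;

        Trainable(double threshold) {
            this.threshold = threshold;
        }

        boolean accepts(double crossEntropyRate) {
            return crossEntropyRate <= threshold;
        }

        // Standard serialization hook: substitute another object on write.
        private Object writeReplace() {
            return new Frozen(threshold);
        }
    }

    // Serialize to a byte array and read the object back.
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The object returned by roundTrip is a Frozen, not a Trainable, yet it makes the same accept/reject decisions, which parallels the relationship between this class and the LMClassifier read back after compilation.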

Resetting category language models is not allowed for binary language model classifiers, because they only contain one model and all else is constant.

Binary language model classifiers are concurrent-read and single-write thread safe. The only write operation is training the accepting category. Classification and compilation are reads. If the language model underlying this classifier is not thread safe, then reads may not be called concurrently.
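One common way for client code to enforce a concurrent-read, single-write discipline is a read-write lock from java.util.concurrent. The wrapper below is a hypothetical sketch: the class ReadWriteGuardSketch and its StringBuilder state stand in for a classifier and are not part of LingPipe.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical wrapper showing write-locked training and read-locked queries.
class ReadWriteGuardSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    // Stand-in for the mutable state of the accepting language model.
    private final StringBuilder trainingText = new StringBuilder();

    // Write operation: exclusive access while the model is updated.
    void train(CharSequence cs) {
        lock.writeLock().lock();
        try {
            trainingText.append(cs);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Read operation: many threads may hold the read lock at once.
    int trainedLength() {
        lock.readLock().lock();
        try {
            return trainingText.length();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

In a real application the read-locked method would wrap calls such as classification or compilation, and the write-locked method would wrap training.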

Since:
LingPipe 2.0
Version:
3.9.1
Author:
Bob Carpenter

Field Summary
static String DEFAULT_ACCEPT_CATEGORY
          The default value of the category for accepting input, "true".
static String DEFAULT_REJECT_CATEGORY
          The default value of the category for rejecting input, "false".
 
Constructor Summary
BinaryLMClassifier(LanguageModel.Dynamic acceptingLM, double crossEntropyThreshold)
          Construct a binary character sequence classifier that accepts or rejects inputs based on their cross-entropy being above or below a fixed cross-entropy threshold.
BinaryLMClassifier(LanguageModel.Dynamic acceptingLM, double crossEntropyThreshold, String acceptCategory, String rejectCategory)
          Construct a binary character sequence classifier that accepts or rejects inputs based on their cross-entropy being above or below a fixed cross-entropy threshold.
 
Method Summary
 String acceptCategory()
          Returns the category assigned to matching/accepted cases.
 void handle(Classified<CharSequence> classified)
          Train this classifier using the character sequence from the specified classified object if the best category of the classification is the accept category for this binary classifier.
 String rejectCategory()
          Returns the category assigned to non-matching/rejected cases.
 void resetCategory(String category, LanguageModel.Dynamic lm, int newCount)
          Throws an UnsupportedOperationException.
 void train(String category, char[] cs, int start, int end)
          Deprecated. Use handle(Classified) instead.
 void train(String category, CharSequence cSeq)
          Deprecated. Use handle(Classified) instead.
 
Methods inherited from class com.aliasi.classify.DynamicLMClassifier
categoryEstimator, compileTo, createNGramBoundary, createNGramProcess, createTokenized, handle, lmForCategory, train
 
Methods inherited from class com.aliasi.classify.LMClassifier
categories, categoryDistribution, classify, classifyJoint, languageModel
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_ACCEPT_CATEGORY

public static final String DEFAULT_ACCEPT_CATEGORY
The default value of the category for accepting input, "true".


DEFAULT_REJECT_CATEGORY

public static final String DEFAULT_REJECT_CATEGORY
The default value of the category for rejecting input, "false".

Constructor Detail

BinaryLMClassifier

public BinaryLMClassifier(LanguageModel.Dynamic acceptingLM,
                          double crossEntropyThreshold)
Construct a binary character sequence classifier that accepts or rejects inputs based on their cross-entropy being above or below a fixed cross-entropy threshold. If an input is accepted the best category will be DEFAULT_ACCEPT_CATEGORY, otherwise it will be DEFAULT_REJECT_CATEGORY. The labels of the categories can be reversed in order to build a rejector or changed altogether with the four-argument constructor. See the class documentation for more information on training, classification and compilation.

Parameters:
acceptingLM - The language model that determines whether an input is accepted or rejected.
crossEntropyThreshold - Maximum cross-entropy against a model to accept the input.

BinaryLMClassifier

public BinaryLMClassifier(LanguageModel.Dynamic acceptingLM,
                          double crossEntropyThreshold,
                          String acceptCategory,
                          String rejectCategory)
Construct a binary character sequence classifier that accepts or rejects inputs based on their cross-entropy being above or below a fixed cross-entropy threshold. If an input is accepted the best category will be the specified accept category, otherwise it will be the specified reject category. See the class documentation for more information on training, classification and compilation.

Parameters:
acceptingLM - The language model that determines whether an input is accepted or rejected.
crossEntropyThreshold - Maximum cross-entropy against a model to accept the input.
acceptCategory - Category label for matching input.
rejectCategory - Category label for rejecting input.
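The effect of the category labels, including the reversal used to build a rejector, can be sketched as a pure function. The class CategoryLabelSketch and its bestCategory method are hypothetical, and the inclusive comparison is an assumption; only the roles of the labels come from the documentation above.

```java
// Hypothetical illustration of how labels attach to the threshold decision.
class CategoryLabelSketch {

    // Returns the accept label when the rate is within the threshold,
    // otherwise the reject label.
    static String bestCategory(double crossEntropyRate,
                               double crossEntropyThreshold,
                               String acceptCategory,
                               String rejectCategory) {
        return crossEntropyRate <= crossEntropyThreshold
            ? acceptCategory
            : rejectCategory;
    }

    public static void main(String[] args) {
        // Default-style labels: a well-matched input is labeled "true".
        System.out.println(bestCategory(4.9, 6.0, "true", "false"));
        // Reversed labels: the same input is now labeled "false".
        System.out.println(bestCategory(4.9, 6.0, "false", "true"));
    }
}
```

Swapping the two labels leaves the underlying decision unchanged and only changes which name is reported for inputs that match the model.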
Method Detail

acceptCategory

public String acceptCategory()
Returns the category assigned to matching/accepted cases.

Returns:
The acceptance category.

rejectCategory

public String rejectCategory()
Returns the category assigned to non-matching/rejected cases.

Returns:
The rejection category.

train

@Deprecated
public void train(String category,
                  char[] cs,
                  int start,
                  int end)
Deprecated. Use handle(Classified) instead.

If the specified category is the accept category, train the underlying language model. If the category is the reject category, ignore the call. Either way, the multivariate category estimate is not updated.

Overrides:
train in class DynamicLMClassifier<LanguageModel.Dynamic>
Parameters:
category - Category of this training sequence.
cs - Characters used for training.
start - Index of first character to use for training.
end - Index of one past the last character to use for training.
Throws:
IllegalArgumentException - If the category is unknown.

train

@Deprecated
public void train(String category,
                  CharSequence cSeq)
Deprecated. Use handle(Classified) instead.

If the specified category is the accept category, train the underlying language model. If the category is the reject category, ignore the call. Either way, the multivariate category estimate is not updated.

Overrides:
train in class DynamicLMClassifier<LanguageModel.Dynamic>
Parameters:
category - Category of this training sample.
cSeq - Char sequence for this training sample.
Throws:
IllegalArgumentException - If the category is unknown.

handle

public void handle(Classified<CharSequence> classified)
Train this classifier using the character sequence from the specified classified object if the best category of the classification is the accept category for this binary classifier. If the category is neither the accept nor the reject category, this method throws an IllegalArgumentException.

Specified by:
handle in interface ObjectHandler<Classified<CharSequence>>
Overrides:
handle in class DynamicLMClassifier<LanguageModel.Dynamic>
Parameters:
classified - Classified character sequence.
Throws:
IllegalArgumentException - If the best category in the classification of the classified object is neither the accept nor the reject category for this binary classifier.

resetCategory

public void resetCategory(String category,
                          LanguageModel.Dynamic lm,
                          int newCount)
Throws an UnsupportedOperationException.

Overrides:
resetCategory in class DynamicLMClassifier<LanguageModel.Dynamic>
Parameters:
category - Ignored.
lm - Ignored.
newCount - Ignored.
Throws:
UnsupportedOperationException - Always.