com.aliasi.classify
Class DynamicLMClassifier<L extends LanguageModel.Dynamic>

java.lang.Object
  extended by com.aliasi.classify.LMClassifier<L,MultivariateEstimator>
      extended by com.aliasi.classify.DynamicLMClassifier<L>
Type Parameters:
L - the type of dynamic language model for this classifier
All Implemented Interfaces:
BaseClassifier<CharSequence>, Classifier<CharSequence,JointClassification>, ConditionalClassifier<CharSequence>, JointClassifier<CharSequence>, RankedClassifier<CharSequence>, ScoredClassifier<CharSequence>, ClassificationHandler<CharSequence,Classification>, Handler, ObjectHandler<Classified<CharSequence>>, Compilable
Direct Known Subclasses:
BinaryLMClassifier, NaiveBayesClassifier

public class DynamicLMClassifier<L extends LanguageModel.Dynamic>
extends LMClassifier<L,MultivariateEstimator>
implements ClassificationHandler<CharSequence,Classification>, ObjectHandler<Classified<CharSequence>>, Compilable

A DynamicLMClassifier is a language model classifier that accepts training events of categorized character sequences. Training is based on a multivariate estimator for the category distribution and dynamic language models for the per-category character sequence estimators. These models also form the basis of the superclass's implementation of classification.

Because this class implements training and classification, it may be used in tag-a-little, learn-a-little supervised learning without retraining epochs. This makes it ideal for active learning applications, for instance.
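
As a rough illustration of the tag-a-little, learn-a-little pattern, the following sketch interleaves training and classification on a single object. It does not use LingPipe itself: ToyDynamicClassifier, its character-unigram model, and the assumed 65536-symbol alphabet are hypothetical stand-ins chosen only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a dynamic language model classifier; not LingPipe code.
class ToyDynamicClassifier {
    private final String[] categories;
    private final Map<String, Integer> categoryCounts = new HashMap<>();
    private final Map<String, Map<Character, Integer>> charCounts = new HashMap<>();
    private final Map<String, Integer> charTotals = new HashMap<>();

    ToyDynamicClassifier(String... categories) {
        this.categories = categories.clone();
        for (String category : categories) {
            categoryCounts.put(category, 1); // one initial count per category
            charCounts.put(category, new HashMap<>());
            charTotals.put(category, 0);
        }
    }

    // Training updates both the category count and that category's character model.
    void train(String category, CharSequence text) {
        if (!categoryCounts.containsKey(category)) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        categoryCounts.merge(category, 1, Integer::sum);
        Map<Character, Integer> counts = charCounts.get(category);
        for (int i = 0; i < text.length(); i++) {
            counts.merge(text.charAt(i), 1, Integer::sum);
        }
        charTotals.merge(category, text.length(), Integer::sum);
    }

    // Joint log score: log P(category) + sum of log P(char | category),
    // with add-one smoothing over an assumed 65536-symbol alphabet.
    double logJoint(String category, CharSequence text) {
        int totalEvents = 0;
        for (int n : categoryCounts.values()) {
            totalEvents += n;
        }
        double score = Math.log(categoryCounts.get(category) / (double) totalEvents);
        Map<Character, Integer> counts = charCounts.get(category);
        double denominator = charTotals.get(category) + 65536.0;
        for (int i = 0; i < text.length(); i++) {
            score += Math.log((counts.getOrDefault(text.charAt(i), 0) + 1) / denominator);
        }
        return score;
    }

    // Returns the category with the highest joint log score (first one wins ties).
    String bestCategory(CharSequence text) {
        String best = categories[0];
        for (String category : categories) {
            if (logJoint(category, text) > logJoint(best, text)) {
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ToyDynamicClassifier classifier = new ToyDynamicClassifier("vowels", "consonants");
        classifier.train("vowels", "aeiou aeiou");
        classifier.train("consonants", "bcdfg bcdfg");
        System.out.println(classifier.bestCategory("aei"));  // vowels
        // Learn a little more, then classify again with no retraining pass.
        classifier.train("consonants", "xyz");
        System.out.println(classifier.bestCategory("zzxy")); // consonants
    }
}
```

Each training call takes effect immediately, so the next classification already reflects it.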

At any point after adding training events, the classifier may be compiled to an object output. The classifier read back in will be a non-dynamic instance of LMClassifier. It will be based on the compiled version of the multivariate estimator and the compiled version of the dynamic language models for the categories.

Instances of this class allow concurrent read operations but require writes to run exclusively. Reads in this context are either calculating estimates or compiling; writes are training. Extensions to LingPipe's classes may impose tighter restrictions. For instance, a subclass of MultivariateEstimator might be used that does not allow concurrent estimates; in that case, its restrictions are passed on to this classifier. The same goes for the language models and, in the case of token language models, the tokenizer factories.
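
One way a caller might enforce the concurrent-read, exclusive-write discipline described above is with a standard JDK read-write lock. The sketch below is an assumption about client code, not something LingPipe provides: ReadWriteGuard is a hypothetical helper, and the int array stands in for classifier state.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hypothetical helper, not part of LingPipe: routes operations through a read-write lock.
class ReadWriteGuard {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Reads (estimating, compiling) may overlap with one another.
    <T> T read(Supplier<T> operation) {
        lock.readLock().lock();
        try {
            return operation.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writes (training) run exclusively: no other read or write overlaps.
    void write(Runnable operation) {
        lock.writeLock().lock();
        try {
            operation.run();
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ReadWriteGuard guard = new ReadWriteGuard();
        int[] trainingEvents = {0}; // stand-in for classifier state
        Thread[] writers = new Thread[4];
        for (int i = 0; i < writers.length; i++) {
            writers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    guard.write(() -> trainingEvents[0]++);
                }
            });
            writers[i].start();
        }
        for (Thread writer : writers) {
            writer.join();
        }
        System.out.println(guard.read(() -> trainingEvents[0])); // 4000
    }
}
```

In real use, the body of write would be a training call and the body of read a classification or compilation call on the shared classifier.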

Since:
LingPipe 2.0
Version:
3.9.1
Author:
Bob Carpenter

Constructor Summary
DynamicLMClassifier(String[] categories, L[] languageModels)
          Construct a dynamic language model classifier over the specified categories with specified language models per category and an overall category estimator.
 
Method Summary
 MultivariateEstimator categoryEstimator()
          Deprecated. As of 3.0, use general method LMClassifier.categoryDistribution().
 void compileTo(ObjectOutput objOut)
          Writes a compiled version of this classifier to the specified object output.
static DynamicLMClassifier<NGramBoundaryLM> createNGramBoundary(String[] categories, int maxCharNGram)
          Construct a dynamic classifier over the specified categories, using boundary character n-gram models of the specified order.
static DynamicLMClassifier<NGramProcessLM> createNGramProcess(String[] categories, int maxCharNGram)
          Construct a dynamic classifier over the specified categories, using process character n-gram models of the specified order.
static DynamicLMClassifier<TokenizedLM> createTokenized(String[] categories, TokenizerFactory tokenizerFactory, int maxTokenNGram)
          Construct a dynamic language model classifier over the specified categories using token n-gram language models of the specified order and the specified tokenizer factory for tokenization.
 void handle(CharSequence charSequence, Classification classification)
          Deprecated. Use handle(Classified) instead.
 void handle(Classified<CharSequence> classified)
          Provides a training instance for the specified character sequence using the best category from the specified classification.
 L lmForCategory(String category)
          Deprecated. As of 3.0, use general LMClassifier.languageModel(String).
 void resetCategory(String category, L lm, int newCount)
          Resets the specified category to the specified language model.
 void train(String category, char[] cs, int start, int end)
          Deprecated. Use handle(Classified) instead.
 void train(String category, CharSequence sampleCSeq)
          Deprecated. Use handle(Classified) instead.
 void train(String category, CharSequence sampleCSeq, int count)
          Provide a training instance for the specified category consisting of the specified sample character sequence with the specified count.
 
Methods inherited from class com.aliasi.classify.LMClassifier
categories, categoryDistribution, classify, classifyJoint, languageModel
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DynamicLMClassifier

public DynamicLMClassifier(String[] categories,
                           L[] languageModels)
Construct a dynamic language model classifier over the specified categories with specified language models per category and an overall category estimator.

The multivariate estimator over categories is initialized with one count for each category. Technically, initializing counts involves a uniform Dirichlet prior with α=1, which is often called Laplace smoothing.

Parameters:
categories - Categories used for classification.
languageModels - Dynamic language models for categories.
Throws:
IllegalArgumentException - If there are not at least two categories, or if the length of the category and language model arrays is not the same.
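
The effect of starting every category with one count can be sketched in a few lines. CategoryCounts below is a hypothetical illustration of add-one initialization, not LingPipe's MultivariateEstimator: with K categories and n observed events in total, a category seen c times gets probability (c + 1) / (n + K).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of add-one initialization; not LingPipe's MultivariateEstimator.
class CategoryCounts {
    private final Map<String, Long> counts = new LinkedHashMap<>();
    private long total = 0;

    CategoryCounts(String... categories) {
        for (String category : categories) {
            // The single prior count per category; duplicates are counted once.
            if (counts.put(category, 1L) == null) {
                total++;
            }
        }
    }

    void train(String category, long n) {
        if (!counts.containsKey(category)) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        counts.merge(category, n, Long::sum);
        total += n;
    }

    // Stored counts already include the prior count, so this equals
    // (observed + 1) / (totalObserved + numCategories).
    double probability(String category) {
        if (!counts.containsKey(category)) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        return counts.get(category) / (double) total;
    }

    public static void main(String[] args) {
        CategoryCounts estimator = new CategoryCounts("spam", "ham");
        System.out.println(estimator.probability("spam")); // 0.5 before any training
        estimator.train("spam", 3);
        System.out.println(estimator.probability("spam")); // 4 / 5 = 0.8
        System.out.println(estimator.probability("ham"));  // 1 / 5 = 0.2, never zero
    }
}
```

The prior count keeps an untrained category from receiving zero probability, which would otherwise rule it out regardless of the language model score.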
Method Detail

train

@Deprecated
public void train(String category,
                             char[] cs,
                             int start,
                             int end)
Deprecated. Use handle(Classified) instead.

Provide a training instance for the specified category consisting of the sequence of characters in the specified character slice. A call to this method increments the count of the category in the maximum likelihood estimator and also trains the language model for the specified category. Thus the balance of categories across training calls should match the balance of categories expected in the test set.

No modeling of the begin or end of the sequence is carried out. If such a behavior is desired, it should be reflected in the training instances supplied to this method.

The component models for this classifier may be accessed and trained independently using LMClassifier.categoryDistribution() and LMClassifier.languageModel(String).

Parameters:
category - Category of this training sequence.
cs - Characters used for training.
start - Index of first character to use for training.
end - Index of one past the last character to use for training.
Throws:
IllegalArgumentException - If the category is not known.
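
The slice convention used by the parameters above (start inclusive, end exclusive) and the note about boundaries can be illustrated as follows. CharSliceDemo is a hypothetical helper, and padding with a single space on each side is only one assumed way a caller might mark sequence boundaries before training.

```java
// Hypothetical helper illustrating the slice convention; not part of LingPipe.
class CharSliceDemo {
    // Returns the characters cs[start] through cs[end - 1] as a String.
    static String slice(char[] cs, int start, int end) {
        if (start < 0 || end > cs.length || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + " end=" + end);
        }
        return new String(cs, start, end - start);
    }

    // Assumed convention: mark both sequence boundaries explicitly with a space.
    static String withBoundaries(CharSequence text) {
        return " " + text + " ";
    }

    public static void main(String[] args) {
        char[] buffer = "the quick brown fox".toCharArray();
        String sample = slice(buffer, 4, 9);
        System.out.println(sample);                             // quick
        System.out.println("[" + withBoundaries(sample) + "]"); // [ quick ]
    }
}
```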

train

@Deprecated
public void train(String category,
                             CharSequence sampleCSeq)
Deprecated. Use handle(Classified) instead.

Provide a training instance for the specified category consisting of the specified sample character sequence. Training behavior is as described in train(String,char[],int,int).

Parameters:
category - Category of this training sequence.
sampleCSeq - Character sequence for training.
Throws:
IllegalArgumentException - If the category is not known.

train

public void train(String category,
                  CharSequence sampleCSeq,
                  int count)
Provide a training instance for the specified category consisting of the specified sample character sequence with the specified count. Training behavior is as described in train(String,char[],int,int).

Counts of zero are ignored, whereas counts less than zero raise an exception.

Parameters:
category - Category of this training sequence.
sampleCSeq - Character sequence for training.
count - Number of training instances.
Throws:
IllegalArgumentException - If the category is not known or if the count is negative.
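
The argument handling described above (zero counts ignored, negative counts and unknown categories rejected) might look like the following. CountedTrainer is a hypothetical sketch that tracks only category counts; it is not LingPipe's implementation, and a real classifier would also train the category's language model with the same count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of count-weighted training; not LingPipe's implementation.
class CountedTrainer {
    private final Map<String, Integer> eventsPerCategory = new HashMap<>();

    CountedTrainer(String... categories) {
        for (String category : categories) {
            eventsPerCategory.put(category, 0);
        }
    }

    void train(String category, CharSequence sample, int count) {
        if (!eventsPerCategory.containsKey(category)) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        if (count < 0) {
            throw new IllegalArgumentException("Count must be non-negative: " + count);
        }
        if (count == 0) {
            return; // zero counts are ignored
        }
        // A real model would also add the sample to the language model here,
        // weighted by count; this sketch records only the category events.
        eventsPerCategory.merge(category, count, Integer::sum);
    }

    int events(String category) {
        return eventsPerCategory.get(category);
    }

    public static void main(String[] args) {
        CountedTrainer trainer = new CountedTrainer("en", "fr");
        trainer.train("en", "hello", 3);
        trainer.train("en", "ignored", 0);
        System.out.println(trainer.events("en")); // 3
    }
}
```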

handle

@Deprecated
public void handle(CharSequence charSequence,
                              Classification classification)
Deprecated. Use handle(Classified) instead.

Provides a training instance for the specified character sequence using the best category from the specified classification. Only the first-best category from the classification is used. The object is cast to CharSequence, and the result passed along with the first-best category to train(String,CharSequence).

Specified by:
handle in interface ClassificationHandler<CharSequence,Classification>
Parameters:
charSequence - Character sequence for training.
classification - Classification to use for training.
Throws:
ClassCastException - If the specified object does not implement CharSequence.

handle

public void handle(Classified<CharSequence> classified)
Provides a training instance for the specified character sequence using the best category from the specified classification. Only the first-best category from the classification is used.

Specified by:
handle in interface ObjectHandler<Classified<CharSequence>>
Parameters:
classified - Classified character sequence to treat as training data.

categoryEstimator

@Deprecated
public MultivariateEstimator categoryEstimator()
Deprecated. As of 3.0, use general method LMClassifier.categoryDistribution().

Returns the maximum likelihood estimator for categories in this classifier. Changes to the returned model will be reflected in this classifier; thus it may be used to train the category estimator without affecting the language models for any category.

Returns:
The maximum likelihood estimator for categories in this classifier.

lmForCategory

@Deprecated
public L lmForCategory(String category)
Deprecated. As of 3.0, use general LMClassifier.languageModel(String).

Returns the language model for the specified category. Changes to the returned model will be reflected in this classifier; thus it may be used to train a language model without affecting the category estimates.

Returns:
The language model for the specified category.
Throws:
IllegalArgumentException - If the category is not known.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Writes a compiled version of this classifier to the specified object output. The object read back in will be an instance of LMClassifier.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this classifier is written.
Throws:
IOException - If there is an I/O exception writing to the output stream.
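
The idea of compiling a trainable object into a read-only one can be sketched with plain JDK serialization. Everything in CompileSketch is a hypothetical stand-in: Trainable plays the role of the dynamic classifier, Frozen the role of the compiled result, and the in-memory byte stream replaces a file. It shows only the shape of the round trip, not how LingPipe encodes its models.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

// Hypothetical sketch of a compile-then-read-back round trip; not LingPipe code.
class CompileSketch {
    // Immutable snapshot: the stand-in for a compiled, non-dynamic classifier.
    static final class Frozen implements Serializable {
        private static final long serialVersionUID = 1L;
        private final double[] probabilities;

        Frozen(double[] probabilities) {
            this.probabilities = probabilities.clone();
        }

        double probability(int category) {
            return probabilities[category];
        }
    }

    // Mutable counts: the stand-in for the dynamic, trainable classifier.
    static final class Trainable {
        private final long[] counts;

        Trainable(int numCategories) {
            counts = new long[numCategories];
            Arrays.fill(counts, 1L); // one initial count per category
        }

        void train(int category) {
            counts[category]++;
        }

        // Writes a frozen snapshot, not this mutable object, to the output.
        void compileTo(ObjectOutput out) throws IOException {
            long total = 0;
            for (long count : counts) {
                total += count;
            }
            double[] probabilities = new double[counts.length];
            for (int i = 0; i < counts.length; i++) {
                probabilities[i] = counts[i] / (double) total;
            }
            out.writeObject(new Frozen(probabilities));
        }
    }

    // Compiles to an in-memory buffer and reads the snapshot back.
    static Frozen roundTrip(Trainable model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                model.compileTo(out);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Frozen) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Trainable model = new Trainable(2);
        model.train(0);
        model.train(0);
        Frozen compiled = roundTrip(model);
        System.out.println(compiled.probability(0)); // 3 / 4 = 0.75
        model.train(1); // later training does not change the snapshot
        System.out.println(compiled.probability(0)); // still 0.75
    }
}
```

The snapshot has no training methods, mirroring the way the object read back in is no longer dynamic.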

resetCategory

public void resetCategory(String category,
                          L lm,
                          int newCount)
Resets the specified category to the specified language model. This also resets the count for the category in the multivariate estimator of categories, replacing it with the specified new count.

Parameters:
category - Category to reset.
lm - New dynamic language model for category.
newCount - New count for category.
Throws:
IllegalArgumentException - If the category is not known.

createNGramProcess

public static DynamicLMClassifier<NGramProcessLM> createNGramProcess(String[] categories,
                                                                     int maxCharNGram)
Construct a dynamic classifier over the specified categories, using process character n-gram models of the specified order.

See the documentation for the constructor DynamicLMClassifier(String[], LanguageModel.Dynamic[]) for information on the category multivariate estimate for priors.

Parameters:
categories - Categories used for classification.
maxCharNGram - Maximum length of character sequence counted in model.
Throws:
IllegalArgumentException - If there are not at least two categories.

createNGramBoundary

public static DynamicLMClassifier<NGramBoundaryLM> createNGramBoundary(String[] categories,
                                                                       int maxCharNGram)
Construct a dynamic classifier over the specified categories, using boundary character n-gram models of the specified order.

See the documentation for the constructor DynamicLMClassifier(String[], LanguageModel.Dynamic[]) for information on the category multivariate estimate for priors.

Parameters:
categories - Categories used for classification.
maxCharNGram - Maximum length of character sequence counted in model.
Throws:
IllegalArgumentException - If there are not at least two categories.

createTokenized

public static DynamicLMClassifier<TokenizedLM> createTokenized(String[] categories,
                                                               TokenizerFactory tokenizerFactory,
                                                               int maxTokenNGram)
Construct a dynamic language model classifier over the specified categories using token n-gram language models of the specified order and the specified tokenizer factory for tokenization.

The multivariate estimator over categories is initialized with one count for each category.

The unknown token and whitespace models are uniform sequence models.

Parameters:
categories - Categories used for classification.
tokenizerFactory - Tokenizer factory for tokenization.
maxTokenNGram - Maximum length of token n-grams used.
Throws:
IllegalArgumentException - If there are not at least two categories.