com.aliasi.classify
Class LMClassifier<L extends LanguageModel,M extends MultivariateDistribution>

java.lang.Object
  extended by com.aliasi.classify.LMClassifier<L,M>
Type Parameters:
L - the type of language model used to generate text from categories
M - the multivariate distribution over categories
All Implemented Interfaces:
BaseClassifier<CharSequence>, Classifier<CharSequence,JointClassification>, ConditionalClassifier<CharSequence>, JointClassifier<CharSequence>, RankedClassifier<CharSequence>, ScoredClassifier<CharSequence>
Direct Known Subclasses:
DynamicLMClassifier

public class LMClassifier<L extends LanguageModel,M extends MultivariateDistribution>
extends Object
implements Classifier<CharSequence,JointClassification>, JointClassifier<CharSequence>

An LMClassifier performs joint probability-based classification of character sequences into non-overlapping categories based on language models for each category and a multivariate distribution over categories. Thus the subclass of Classification returned by the classify method is JointClassification. In addition to providing joint and conditional probabilities of categories given the input, the returned joint classification uses the character plus category sample entropy rate as its score.

A language-model classifier is constructed from a fixed, finite set of categories which are assumed to have disjoint (non-overlapping) sets of members. The categories are represented as simple strings. Each category is assigned a language model. Furthermore, a multivariate distribution over the set of categories assigns marginal category probabilities.

Joint log probabilities are determined in the usual way:

log2 P(cs,cat) = log2 P(cs|cat) + log2 P(cat)
where P(cs|cat) is the probability of the character sequence cs in the language model for category cat and where P(cat) is the probability assigned by the multivariate distribution over categories. Scores are defined to be adjusted sample cross-entropy rates:
score(cs,cat)
  = (log2 P(cs,cat)) / (cs.length() + 2)
  = (log2 P(cs|cat) + log2 P(cat)) / (cs.length() + 2)
Note that the contribution of the category probability to the score approaches zero as the sample size grows and the data overwhelms the pre-data expectation. Also note that each category has its estimate divided by the same amount, so the probabilistic ordering is preserved. If the language models are process models, the cross-entropy rate is just (log2 P(cs|cat))/cs.length(); for sequence models, add one to the denominator to account for figuratively generating the end-of-character-sequence symbol.
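As a rough illustration of the score definition above, the following sketch computes the adjusted rate from two log2 estimates. The class ScoreSketch and its methods are hypothetical and not part of LingPipe; the log probabilities are supplied by the caller instead of coming from real language models.

```java
// Hypothetical helper, not part of LingPipe: illustrates the score formula only.
final class ScoreSketch {

    // log2 P(cs,cat) = log2 P(cs|cat) + log2 P(cat)
    static double jointLog2(double log2CondProb, double log2CatProb) {
        return log2CondProb + log2CatProb;
    }

    // score(cs,cat) = log2 P(cs,cat) / (cs.length() + 2)
    static double score(double log2CondProb, double log2CatProb, int length) {
        return jointLog2(log2CondProb, log2CatProb) / (length + 2);
    }

    public static void main(String[] args) {
        // Assumed values: a model spending 2 bits per character and a
        // category probability of 1/4 (log2 = -2).
        for (int n : new int[] { 10, 100, 1000 }) {
            double withCategory = score(-2.0 * n, -2.0, n);
            double withoutCategory = score(-2.0 * n, 0.0, n);
            // The difference is 2 / (n + 2), which shrinks as n grows.
            System.out.println(n + ": " + (withoutCategory - withCategory));
        }
    }
}
```

Because every category is divided by the same denominator for a given input, ranking by score gives the same order as ranking by the joint log estimate.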

Note that maximizing joint probabilities is the same as maximizing conditional probabilities because the character sequence cs is constant:

ARGMAXcat P(cat|cs)
= ARGMAXcat P(cs,cat) / P(cs)
= ARGMAXcat P(cs,cat)
A computation of conditional estimates P(cat|cs) given the joint estimates is defined in JointClassification.
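The relationship between joint and conditional estimates can be sketched as follows. This is a minimal illustration under the assumption that joint estimates are given as log2 values; it is not necessarily how JointClassification carries out the computation, and the class ConditionalSketch is hypothetical.

```java
// Hypothetical helper, not part of LingPipe.
final class ConditionalSketch {

    // Index of the largest joint log2 estimate. Since P(cs) is the same for
    // every category, this is also the argmax of P(cat|cs).
    static int argMax(double[] jointLog2) {
        int best = 0;
        for (int i = 1; i < jointLog2.length; ++i)
            if (jointLog2[i] > jointLog2[best])
                best = i;
        return best;
    }

    // P(cat|cs) = P(cs,cat) / SUM_cat' P(cs,cat').
    // The maximum is subtracted before exponentiating to avoid underflow.
    static double[] conditional(double[] jointLog2) {
        double max = jointLog2[argMax(jointLog2)];
        double[] result = new double[jointLog2.length];
        double sum = 0.0;
        for (int i = 0; i < result.length; ++i) {
            result[i] = Math.pow(2.0, jointLog2[i] - max);
            sum += result[i];
        }
        for (int i = 0; i < result.length; ++i)
            result[i] /= sum;
        return result;
    }
}
```

For joint log2 estimates of -10, -11 and -12, the conditional estimates work out to 4/7, 2/7 and 1/7, and the first category is the argmax under both the joint and the conditional view.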

To ensure consistent estimates, all of the language models should either be process language models or sequence language models over the same set of characters, depending on whether probability normalization is over fixed length sequences or over all strings. On the other hand, the models themselves may be a mixture of n-gram lengths and smoothing parameters, or even, in the case of sequence models, a mixture of tokenized models and character sequence models.

Boolean classifiers for membership can be constructed with this class by means of a positive language model and a negative model. A character sequence is considered an instance of the category if it is more likely in the positive model than in the negative model. There are several strategies for constructing anti-models. The most common methodology is to build an anti-model from an unbiased sample of negative cases, but this requires supervision for negative cases and tends to bias toward the model with more training data, given the way language model cross-entropy tends to go down with more training data in general. Another approach is to build a weaker model from the same training data as the positive model, for instance by using lower order n-grams for the negative model. The simplest approach is to use a uniform negative model, which amounts to a cross-entropy rejection threshold; this is the basis of BinaryLMClassifier.
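The uniform negative model can be sketched as a threshold test. The code below assumes a uniform process model that assigns probability 1/numChars to each character and assumes equal category probabilities, so they cancel out. It is an illustration only, not the BinaryLMClassifier implementation, and the class UniformRejectSketch is hypothetical.

```java
// Hypothetical helper, not part of LingPipe.
final class UniformRejectSketch {

    // log2 P(cs) under a uniform model giving 1/numChars to each character:
    // log2 (1/numChars)^length = -length * log2(numChars)
    static double uniformLog2(int length, int numChars) {
        return -length * (Math.log(numChars) / Math.log(2.0));
    }

    // Accept when the positive model's estimate beats the uniform estimate.
    // Equivalently, the positive model's cross-entropy rate is below
    // log2(numChars) bits per character.
    static boolean accept(double log2PosProb, int length, int numChars) {
        return log2PosProb > uniformLog2(length, numChars);
    }
}
```

With 256 possible characters the threshold is about 8 bits per character, so a 10-character input is accepted only if the positive model assigns it a log2 estimate above roughly -80.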

Language model classifiers may be trained with DynamicLMClassifier, which uses a trainable multivariate estimator and dynamic language models.

Since:
LingPipe2.0
Version:
3.9.1
Author:
Bob Carpenter

Constructor Summary
LMClassifier(String[] categories, L[] languageModels, M categoryDistribution)
          Construct a joint classifier for character sequences classifying over a specified set of categories, with a multivariate distribution over those categories and a language model per category.
 
Method Summary
 String[] categories()
          Returns the array of categories for this classifier.
 M categoryDistribution()
          Returns a multivariate distribution over categories for this classifier.
 JointClassification classify(CharSequence cSeq)
          Returns the joint classification of the specified character sequence.
 JointClassification classifyJoint(char[] cs, int start, int end)
          A convenience method returning a joint classification over a character array slice.
 L languageModel(String category)
          Returns the language model for the specified category.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LMClassifier

public LMClassifier(String[] categories,
                    L[] languageModels,
                    M categoryDistribution)
Construct a joint classifier for character sequences classifying over a specified set of categories, with a multivariate distribution over those categories and a language model per category. The language models are supplied in a parallel array to the categories.

The category distribution is the marginal over categories, and each language model provides a conditional estimate given its category. Categorization is as described in the class documentation.

Parameters:
categories - Array of categories for classification.
languageModels - A parallel array of language models for the categories.
categoryDistribution - The marginal distribution over the categories for classification.
Throws:
IllegalArgumentException - If there are not at least two categories, or if the category and language model arrays are not the same length.
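The documented argument checks on the parallel arrays might look roughly like the sketch below. The class ParallelArraysSketch, the Object[] parameter type and the message wording are assumptions for illustration; the actual constructor may perform its checks differently.

```java
// Hypothetical helper, not part of LingPipe.
final class ParallelArraysSketch {

    // Mirrors the two documented failure conditions for the constructor.
    static void validate(String[] categories, Object[] languageModels) {
        if (categories.length < 2)
            throw new IllegalArgumentException(
                "Require at least two categories. Found " + categories.length);
        if (categories.length != languageModels.length)
            throw new IllegalArgumentException(
                "Categories and language models must be the same length."
                + " Found " + categories.length
                + " and " + languageModels.length);
    }
}
```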
Method Detail

categories

public String[] categories()
Returns the array of categories for this classifier.

This method copies the array and thus changes to it do not affect the categories for this classifier.

Returns:
The array of categories for this classifier.
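The copying behavior described for this method can be pictured with the following sketch. CategoryHolderSketch is a hypothetical stand-in for the classifier that shows why edits to the returned array leave the stored categories unchanged.

```java
// Hypothetical stand-in for the classifier, not part of LingPipe.
final class CategoryHolderSketch {

    private final String[] mCategories;

    CategoryHolderSketch(String[] categories) {
        mCategories = categories.clone();
    }

    // Returns a fresh copy on every call, so callers cannot modify
    // the stored array through the returned reference.
    String[] categories() {
        return mCategories.clone();
    }
}
```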

languageModel

public L languageModel(String category)
Returns the language model for the specified category. The model for a specified category is used to provide estimates of P(cSeq|category), the conditional probability of a character sequence given the specified category as described in the class documentation above.

Changes to the returned model affect this classifier's behavior.

Parameters:
category - The specified category.
Returns:
The language model for the specified category.
Throws:
IllegalArgumentException - If the category is not known.

categoryDistribution

public M categoryDistribution()
Returns a multivariate distribution over categories for this classifier. This method returns P(category), the marginal distribution over categories used during classification as described in the class documentation.

Changes to the returned distribution affect this classifier's behavior.

Returns:
The distribution over categories.

classify

public JointClassification classify(CharSequence cSeq)
Returns the joint classification of the specified character sequence.

Specified by:
classify in interface BaseClassifier<CharSequence>
Specified by:
classify in interface Classifier<CharSequence,JointClassification>
Specified by:
classify in interface ConditionalClassifier<CharSequence>
Specified by:
classify in interface JointClassifier<CharSequence>
Specified by:
classify in interface RankedClassifier<CharSequence>
Specified by:
classify in interface ScoredClassifier<CharSequence>
Parameters:
cSeq - Character sequence being classified.
Returns:
Joint classification of the specified character sequence.
Throws:
IllegalArgumentException - If the specified object is not a character sequence.

classifyJoint

public JointClassification classifyJoint(char[] cs,
                                         int start,
                                         int end)
A convenience method returning a joint classification over a character array slice. Note that classifyJoint(cs,start,end) returns the same result as classify(new String(cs,start,end-start)).

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus the index of the last character in the slice.
Throws:
IllegalArgumentException - If the start index is less than zero or greater than the end index or if the end index is not within bounds.
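The slice convention above (inclusive start, exclusive end) and the documented bounds conditions can be sketched with plain Java. SliceSketch is a hypothetical helper that only builds the string for the slice; it does not perform any classification.

```java
// Hypothetical helper, not part of LingPipe.
final class SliceSketch {

    // Converts the slice cs[start..end) to a String after checking the
    // conditions listed under Throws.
    static String slice(char[] cs, int start, int end) {
        if (start < 0 || start > end || end > cs.length)
            throw new IllegalArgumentException(
                "Illegal slice. start=" + start
                + " end=" + end
                + " length=" + cs.length);
        // The String constructor takes an offset and a count, hence end - start.
        return new String(cs, start, end - start);
    }
}
```

For the array "hello world".toCharArray(), the slice with start 6 and end 11 is "world", and a slice with equal start and end is the empty string.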