

java.lang.Object
  com.aliasi.classify.LMClassifier<L,M>
Type Parameters:
L - the type of language model used to generate text from categories
M - the multivariate distribution over categories
public class LMClassifier<L extends LanguageModel,M extends MultivariateDistribution>
An LMClassifier performs joint probability-based classification of
character sequences into non-overlapping categories based on language
models for each category and a multivariate distribution over
categories. Thus the subclass of Classification returned by the
classify method is JointClassification. In addition to joint and
conditional probabilities of categories given the input, the score of
the returned joint classification is the character plus category
sample entropy rate.
A language-model classifier is constructed from a fixed, finite set of categories which are assumed to have disjoint (non-overlapping) sets of members. The categories are represented as simple strings. Each category is assigned a language model. Furthermore, a multivariate distribution over the set of categories assigns marginal category probabilities.
Joint log probabilities are determined in the usual way:
  log_{2} P(cs,cat)
    = log_{2} P(cs|cat)
    + log_{2} P(cat)
where P(cs|cat) is the probability of the character sequence cs in
the language model for category cat and where P(cat) is the
probability assigned by the multivariate distribution over
categories. Scores are defined to be adjusted sample cross-entropy
rates:
  score(cs,cat)
    = (log_{2} P(cs,cat)) / (cs.length() + 2)
    = (log_{2} P(cs|cat) + log_{2} P(cat)) / (cs.length() + 2)
Note that the contribution of the category probability to the score
approaches zero as the sample size grows and the data overwhelms
the pre-data expectation. Also note that each category has its
estimate divided by the same amount, so the probabilistic ordering
is preserved. If the language models are process models, the
cross-entropy rate is just
(log_{2} P(cs|cat))/cs.length(); for
sequence models, add one to the denominator to account for
figuratively generating the end-of-character-sequence symbol.
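The score arithmetic described above can be sketched in a few lines of Java. This is an illustration only, not LingPipe code: the class and method names (ScoreSketch, jointLog2, score) and the input values are hypothetical, and the log probabilities are assumed to have been computed already by a language model and a category distribution.

```java
// Minimal sketch of the score arithmetic; names and inputs are hypothetical.
final class ScoreSketch {

    // Joint log (base 2) probability:
    // log2 P(cs,cat) = log2 P(cs|cat) + log2 P(cat).
    static double jointLog2(double log2CondProb, double log2CatProb) {
        return log2CondProb + log2CatProb;
    }

    // Adjusted sample cross-entropy rate:
    // score(cs,cat) = log2 P(cs,cat) / (cs.length() + 2).
    static double score(double log2CondProb, double log2CatProb, int length) {
        return jointLog2(log2CondProb, log2CatProb) / (length + 2);
    }

    public static void main(String[] args) {
        // Assumed inputs: a 6-character sequence with log2 P(cs|cat) = -15
        // and P(cat) = 0.5, so log2 P(cat) = -1.
        System.out.println(score(-15.0, -1.0, 6)); // (-15 + -1) / (6 + 2) = -2.0
    }
}
```

Because every category's joint estimate is divided by the same denominator for a given input, ranking categories by score gives the same order as ranking them by joint probability.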
Note that maximizing joint probabilities is the same as
maximizing conditional probabilities because the character sequence
cs is constant:
  ARGMAX_{cat} P(cat|cs)
    = ARGMAX_{cat} P(cs,cat) / P(cs)
    = ARGMAX_{cat} P(cs,cat)
A computation of conditional estimates P(cat|cs) given
the joint estimates is defined in JointClassification.
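One way to recover conditional estimates from joint log estimates is to normalize over the categories, since P(cs) is the sum of P(cs,cat) over all categories. The sketch below shows that normalization and checks that the best category is the same either way. It is an assumption about how such a computation could be written, not a description of what JointClassification actually does; the names ConditionalSketch, conditionals, and argMax are hypothetical.

```java
// Minimal sketch: joint log2 estimates to conditional probabilities.
// Not LingPipe code; all names here are hypothetical.
final class ConditionalSketch {

    // Input: log2 P(cs,cat) for each category. Output: P(cat|cs) for each category.
    static double[] conditionals(double[] jointLog2) {
        // Subtract the maximum before exponentiating to avoid underflow
        // when the log probabilities are very negative.
        double max = Double.NEGATIVE_INFINITY;
        for (double x : jointLog2) {
            max = Math.max(max, x);
        }
        double[] result = new double[jointLog2.length];
        double sum = 0.0;
        for (int i = 0; i < jointLog2.length; ++i) {
            result[i] = Math.pow(2.0, jointLog2[i] - max);
            sum += result[i];
        }
        // Dividing by the sum plays the role of dividing by P(cs).
        for (int i = 0; i < result.length; ++i) {
            result[i] /= sum;
        }
        return result;
    }

    // Index of the largest value; ties go to the first occurrence.
    static int argMax(double[] xs) {
        int best = 0;
        for (int i = 1; i < xs.length; ++i) {
            if (xs[i] > xs[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] joint = { -10.0, -12.0 }; // assumed joint log2 estimates
        double[] cond = conditionals(joint);
        System.out.println(cond[0] + " " + cond[1]);
        System.out.println(argMax(joint) == argMax(cond));
    }
}
```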
To ensure consistent estimates, all of the language models
should either be process language models or sequence language
models over the same set of characters, depending on whether
probability normalization is over fixed-length sequences or over
all strings. On the other hand, the models themselves may be a
mixture of n-gram lengths and smoothing parameters, or even, in the
case of sequence models, tokenized models and sequence character
models.
Boolean classifiers for membership can be constructed with this
class by means of a positive language model and a negative model.
A character sequence is considered an instance of the category if
it is more likely in the positive model than the negative model.
There are several strategies for constructing anti-models. The
most common methodology is to build an anti-model from an unbiased
sample of negative cases, but this requires supervision for
negative cases and tends to bias toward the model with more
training data given the way language model cross-entropy tends to
go down with more training data in general. Another approach is to
build a weaker model from the same training data as the positive
model, for instance by using lower-order n-grams for the negative
model. The simplest approach is to use a uniform negative model,
which amounts to a cross-entropy rejection threshold; this is the
basis of BinaryLMClassifier.
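To see why a uniform negative model amounts to a cross-entropy threshold, consider a negative model that assigns every character the same probability 1/N. The sequence is accepted when log_{2} P(cs|positive) exceeds -cs.length() * log_{2} N, which is the same as requiring the positive model's per-character cross-entropy to be below log_{2} N. The sketch below illustrates this under two stated assumptions: equal category priors, and a hypothetical alphabet size parameter. It is not the implementation of BinaryLMClassifier, and the names are invented for illustration.

```java
// Minimal sketch of acceptance against a uniform negative model.
// Not LingPipe code; names and the alphabetSize parameter are hypothetical.
final class UniformRejectionSketch {

    // log2 probability of a sequence of the given length under a model that
    // assigns each character probability 1/alphabetSize.
    static double uniformLog2(int length, int alphabetSize) {
        return -length * (Math.log(alphabetSize) / Math.log(2.0));
    }

    // Accept when the positive model is more likely than the uniform model.
    // Equal category priors are assumed, so they cancel out of the comparison.
    static boolean accept(double positiveLog2, int length, int alphabetSize) {
        return positiveLog2 > uniformLog2(length, alphabetSize);
    }

    public static void main(String[] args) {
        // Assumed inputs: 10 characters, 256 possible characters, so the
        // uniform model assigns roughly -80 bits to any such sequence.
        System.out.println(accept(-50.0, 10, 256));  // true: 5 bits/char is below 8
        System.out.println(accept(-100.0, 10, 256)); // false: 10 bits/char is above 8
    }
}
```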
Language model classifiers may be trained using DynamicLMClassifier
with a trainable multivariate estimator and
dynamic language models.
 Since:
 LingPipe2.0
 Version:
 3.9.1
 Author:
 Bob Carpenter
Constructor Summary
LMClassifier(String[] categories,
L[] languageModels,
M categoryDistribution)
Construct a joint classifier for character sequences
classifying over a specified set of categories, with a
multivariate distribution over those categories and a language
model per category.
Method Summary
String[]
categories()
Returns the array of categories for this classifier.
M
categoryDistribution()
Returns a multivariate distribution over categories for this
classifier.
JointClassification
classify(CharSequence cSeq)
Returns the joint classification of the specified character sequence.
JointClassification
classifyJoint(char[] cs,
int start,
int end)
A convenience method returning a joint classification over a
character array slice.
L
languageModel(String category)
Returns the language model for the specified category.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
LMClassifier
public LMClassifier(String[] categories,
L[] languageModels,
M categoryDistribution)
 Construct a joint classifier for character sequences
classifying over a specified set of categories, with a
multivariate distribution over those categories and a language
model per category. The language models are supplied in a
parallel array to the categories.
The category distribution is the marginal over categories,
and each language model provides a conditional estimate given
its category. Categorization is as described in the class
documentation.
 Parameters:
categories - Array of categories for classification.
languageModels - A parallel array of language models for
the categories.
categoryDistribution - The marginal distribution over the
categories for classification.
 Throws:
IllegalArgumentException
 If there are not at least two
categories, or if the category and language model arrays are not
the same length.
Method Detail
categories
public String[] categories()
 Returns the array of categories for this classifier.
This method copies the array and thus changes to
it do not affect the categories for this classifier.
 Returns:
 The array of categories for this classifier.
languageModel
public L languageModel(String category)
 Returns the language model for the specified category. The
model for a specified category is used to provide estimates of
P(cSeq|category), the conditional probability of a
character sequence given the specified category as described in
the class documentation above.
Changes to the returned model affect this classifier's
behavior.
 Parameters:
category
 The specified category.
 Returns:
 The language model for the specified category.
 Throws:
IllegalArgumentException
 If the category is not known.
categoryDistribution
public M categoryDistribution()
 Returns a multivariate distribution over categories for this
classifier. This method returns P(category),
the marginal distribution over categories used during
classification as described in the class documentation.
Changes to the returned distribution affect this classifier's
behavior.
 Returns:
 The distribution over categories.
classify
public JointClassification classify(CharSequence cSeq)
 Returns the joint classification of the specified character sequence.
 Specified by:
classify
in interface BaseClassifier<CharSequence>
 Specified by:
classify
in interface Classifier<CharSequence,JointClassification>
 Specified by:
classify
in interface ConditionalClassifier<CharSequence>
 Specified by:
classify
in interface JointClassifier<CharSequence>
 Specified by:
classify
in interface RankedClassifier<CharSequence>
 Specified by:
classify
in interface ScoredClassifier<CharSequence>
 Parameters:
cSeq
 Character sequence being classified.
 Returns:
 Joint classification of the specified character
sequence.
 Throws:
IllegalArgumentException
 If the specified object is not
a character sequence.
classifyJoint
public JointClassification classifyJoint(char[] cs,
int start,
int end)
 A convenience method returning a joint classification over a
character array slice. Note that
classifyJoint(cs,start,end) returns the same result
as classify(new String(cs,start,end-start)).
 Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus the index of the last character in the slice.
 Throws:
IllegalArgumentException
 If the start index is less than zero
or greater than the end index or if the end index is not within bounds.
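The slice convention and the bounds conditions above can be illustrated with plain Java. The String(char[], int, int) constructor takes an offset and a count rather than an end index, so a slice from start (inclusive) to end (exclusive) has length end - start. The helper below is a hypothetical sketch of that conversion with the same argument checks; it is not taken from the LingPipe source.

```java
// Minimal sketch of slice-to-String conversion with the documented bounds checks.
// Not LingPipe code; the class and method names are hypothetical.
final class SliceSketch {

    // Returns the characters cs[start] through cs[end - 1] as a String.
    static String slice(char[] cs, int start, int end) {
        if (start < 0 || start > end || end > cs.length) {
            throw new IllegalArgumentException(
                "Bad slice: start=" + start + " end=" + end
                + " length=" + cs.length);
        }
        // The third constructor argument is a character count, not an end index.
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] cs = "hello world".toCharArray();
        System.out.println(slice(cs, 6, 11)); // prints "world"
    }
}
```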