

java.lang.Object
  com.aliasi.classify.LMClassifier&lt;L,MultivariateEstimator&gt;
    com.aliasi.classify.DynamicLMClassifier&lt;TokenizedLM&gt;
      com.aliasi.classify.NaiveBayesClassifier
public class NaiveBayesClassifier extends DynamicLMClassifier&lt;TokenizedLM&gt;
A NaiveBayesClassifier provides a trainable naive Bayes text classifier, with tokens as features. A classifier is constructed from a set of categories and a tokenizer factory. The token estimator is a unigram token language model with a uniform whitespace model and an optional n-gram character language model for smoothing unknown tokens.
Naive Bayes applied to tokenized text results in a so-called "bag of words" model where the tokens (words) are assumed to be independent of one another:

P(tokens|cat)
  = Π_{i &lt; tokens.length} P(tokens[i]|cat)
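The independence assumption above can be sketched directly (this is an illustration, not LingPipe's implementation): the bag-of-words probability is a product of per-token estimates, computed here in log space to avoid numerical underflow.

```java
import java.util.Map;

public class BagOfWords {

    // Log probability of a token sequence under the naive Bayes
    // assumption: the sum of independent per-token log estimates.
    public static double logProb(String[] tokens,
                                 Map<String,Double> tokenProbs) {
        double logP = 0.0;
        for (String token : tokens)
            logP += Math.log(tokenProbs.get(token));
        return logP;
    }
}
```

The map of token probabilities is a stand-in for the per-category unigram token language model this class uses.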
This class implements this assumption by plugging unigram token
language models into a dynamic language model classifier. The
unigram token language model makes the naive Bayes assumption by
virtue of having no tokens of context.
The unigram model smooths maximum likelihood token estimates with a character-level model. Unfolding the general definition of that class to the unigram case yields the model:

P(token|cat)
  = P_{tokenLM(cat)}(token)
  = λ * count(token,cat) / totalCount(cat)
    + (1 − λ) * P_{charLM(cat)}(token)
where tokenLM(cat) is the token language model defined for the specified category and charLM(cat) is the character-level language model it uses for smoothing. The unigram token model is based on counts count(token,cat) of a token in the category and an overall count totalCount(cat) of tokens in the category. The interpolation factor λ is computed as per the Witten-Bell model C with hyperparameter one:

λ = totalCount(cat) / (totalCount(cat) + numTokens(cat))

Roughly, the probability mass smoothed away from the token model matches the proportion of first sightings of tokens in the training data.
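The interpolation can be illustrated with a small sketch (the helper names here are hypothetical, not part of this class's API):

```java
public class WittenBellUnigram {

    // Witten-Bell model C interpolation weight with hyperparameter one:
    // lambda = totalCount / (totalCount + numTokens)
    public static double lambda(long totalCount, long numTokens) {
        return totalCount / (double) (totalCount + numTokens);
    }

    // Smoothed unigram estimate: interpolates the maximum likelihood
    // token estimate with a character-level backoff probability.
    public static double estimate(long tokenCount, long totalCount,
                                  long numTokens, double charLmProb) {
        double lam = lambda(totalCount, numTokens);
        double ml = totalCount == 0 ? 0.0 : tokenCount / (double) totalCount;
        return lam * ml + (1.0 - lam) * charLmProb;
    }
}
```

For example, with 90 training tokens of 10 distinct types, λ = 90 / (90 + 10) = 0.9, so one tenth of the probability mass is reserved for the character-level backoff.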
If this character smoothing model is uniform, there are two extremes that need to be balanced, especially in cases where there is not very much training data per category. If it is initialized with the true number of characters, it will return a proper uniform character estimate. In practice, this will probably underestimate unknown tokens, and thus categories in which they are unknown will pay a high penalty. If the character smoothing model is instead initialized with one as the max number of characters, the per-character estimates will all be one, so the token backoff contributes nothing to the classification scores. This overestimates unknown tokens for classification, with probabilities summing to more than one; in practice, it will probably not penalize unknown words in categories enough. If the per-character cost is greater than zero, the penalty will be linear in the length of the unknown token.
Another way to smooth unknown tokens is to provide each model at least one instance of each token known to every other model, so there are no tokens known to one model and not another. But this adds an additional smoothing bias to the maximum likelihood character estimates which may or may not be helpful.
The unigram model is constructed with a whitespace model that returns a constant zero estimate, UniformBoundaryLM.ZERO_LM, and thus contributes no probability mass to estimates.
As with the other language model classifiers, the conditional category probability ratios are determined with a category distribution and inversion:

ARGMAX_{cat} P(cat|tokens)
  = ARGMAX_{cat} P(cat,tokens) / P(tokens)
  = ARGMAX_{cat} P(cat,tokens)
  = ARGMAX_{cat} P(tokens|cat) * P(cat)
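The inversion can be sketched as follows, with hypothetical probability maps standing in for the trained per-category models (the term P(tokens) is dropped because it does not affect the argmax):

```java
import java.util.Map;

public class JointArgmax {

    // Returns the category maximizing log P(tokens|cat) + log P(cat).
    public static String best(String[] tokens,
                              Map<String,Map<String,Double>> tokenProbsByCat,
                              Map<String,Double> catProbs) {
        String argmax = null;
        double bestLogJoint = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String,Double> e : catProbs.entrySet()) {
            double logJoint = Math.log(e.getValue());  // log P(cat)
            Map<String,Double> tokenProbs = tokenProbsByCat.get(e.getKey());
            for (String token : tokens)                // + log P(token|cat)
                logJoint += Math.log(tokenProbs.get(token));
            if (logJoint > bestLogJoint) {
                bestLogJoint = logJoint;
                argmax = e.getKey();
            }
        }
        return argmax;
    }
}
```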
The category probability model P(cat) is taken to be a multivariate estimator with an initial count of one for each category.
For this class, the tokens are produced by a tokenizer factory. This tokenizer factory may normalize tokens to stems, to lower case, remove stop words, etc. An extreme example would be to trim the bag to a small set of salient words, as picked out by TF/IDF with categories as documents.
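As a minimal illustration of such normalization, the following hypothetical tokenizer (not one of LingPipe's TokenizerFactory implementations) splits on whitespace and lower-cases each token:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LowerCaseTokenizer {

    // Splits on runs of whitespace and lower-cases each token;
    // a real factory might also stem tokens or drop stop words.
    public static List<String> tokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                     .map(String::toLowerCase)
                     .collect(Collectors.toList());
    }
}
```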
Instances of this class may be compiled and read back into memory in the same way as other instances of DynamicLMClassifier. It also inherits its concurrent-read/single-write concurrency restrictions from that class (training is a write; compiling and estimating are reads).
Constructor Summary  

NaiveBayesClassifier(String[] categories,
TokenizerFactory tokenizerFactory)
Construct a naive Bayes classifier with the specified categories and tokenizer factory. 

NaiveBayesClassifier(String[] categories,
TokenizerFactory tokenizerFactory,
int charSmoothingNGram)
Construct a naive Bayes classifier with the specified categories, tokenizer factory and level of character n-gram for smoothing token estimates. 

NaiveBayesClassifier(String[] categories,
TokenizerFactory tokenizerFactory,
int charSmoothingNGram,
int maxObservedChars)
Construct a naive Bayes classifier with the specified categories, tokenizer factory and level of character n-gram for smoothing token estimates, along with a specification of the total number of characters in test and training instances. 
Method Summary 

Methods inherited from class com.aliasi.classify.DynamicLMClassifier 

categoryEstimator, compileTo, createNGramBoundary, createNGramProcess, createTokenized, handle, handle, lmForCategory, resetCategory, train, train, train 
Methods inherited from class com.aliasi.classify.LMClassifier 

categories, categoryDistribution, classify, classifyJoint, languageModel 
Methods inherited from class java.lang.Object 

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait 
Constructor Detail 

public NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory)
The character backoff models are assumed to be uniform and there is no limit on the number of observed characters other than Character.MAX_VALUE.
Parameters:
categories - Categories into which to classify text.
tokenizerFactory - Text tokenizer.
Throws:
IllegalArgumentException - If there are not at least two categories.

public NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory, int charSmoothingNGram)
There is no limit on the number of observed characters other than Character.MAX_VALUE.
Parameters:
categories - Categories into which to classify text.
tokenizerFactory - Text tokenizer.
charSmoothingNGram - Order of character n-gram used to smooth token estimates.
Throws:
IllegalArgumentException - If there are not at least two categories.

public NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory, int charSmoothingNGram, int maxObservedChars)
As noted in the class documentation above, setting the max observed characters parameter to one effectively eliminates estimates of the string of an unknown token.
Parameters:
categories - Categories into which to classify text.
tokenizerFactory - Text tokenizer.
charSmoothingNGram - Order of character n-gram used to smooth token estimates.
maxObservedChars - The maximum number of characters found in the text of training and test sets.
Throws:
IllegalArgumentException - If there are not at least two categories, or if the number of observed characters is less than 1 or more than the total number of characters.

