com.aliasi.classify
Class NaiveBayesClassifier

java.lang.Object
  extended by com.aliasi.classify.LMClassifier<TokenizedLM,MultivariateEstimator>
      extended by com.aliasi.classify.DynamicLMClassifier<TokenizedLM>
          extended by com.aliasi.classify.NaiveBayesClassifier
All Implemented Interfaces:
BaseClassifier<CharSequence>, Classifier<CharSequence,JointClassification>, ConditionalClassifier<CharSequence>, JointClassifier<CharSequence>, RankedClassifier<CharSequence>, ScoredClassifier<CharSequence>, ClassificationHandler<CharSequence,Classification>, Handler, ObjectHandler<Classified<CharSequence>>, Compilable

public class NaiveBayesClassifier
extends DynamicLMClassifier<TokenizedLM>

A NaiveBayesClassifier provides a trainable naive Bayes text classifier, with tokens as features. A classifier is constructed from a set of categories and a tokenizer factory. The token estimator is a unigram token language model with a uniform whitespace model and an optional n-gram character language model for smoothing unknown tokens.

Naive Bayes applied to tokenized text results in a so-called "bag of words" model where the tokens (words) are assumed to be independent of one another:

P(tokens|cat) = Π_{i < tokens.length} P(tokens[i]|cat)
This class implements this assumption by plugging unigram token language models into a dynamic language model classifier. The unigram token language model makes the naive Bayes assumption by virtue of having no tokens of context.
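
For orientation, here is a minimal training-and-classification sketch. It assumes the two-argument train(String,CharSequence) method inherited from DynamicLMClassifier, the no-argument IndoEuropeanTokenizerFactory constructor, and the classify(CharSequence) method inherited from LMClassifier; the category names and training texts are illustrative only.

import com.aliasi.classify.JointClassification;
import com.aliasi.classify.NaiveBayesClassifier;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

public class NaiveBayesDemo {
    public static void main(String[] args) {
        String[] categories = { "sports", "politics" };
        TokenizerFactory tokenizerFactory = new IndoEuropeanTokenizerFactory();
        NaiveBayesClassifier classifier
            = new NaiveBayesClassifier(categories, tokenizerFactory);

        // Training increments token counts in the per-category unigram models.
        classifier.train("sports", "the striker scored a late goal");
        classifier.train("politics", "the senate passed the budget bill");

        // Classification scores each category and returns a joint classification.
        JointClassification c = classifier.classify("a goal in the final minute");
        System.out.println("best category=" + c.bestCategory());
    }
}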

The unigram model smooths maximum likelihood token estimates with a character-level model. Unfolding the general definition of that class to the unigram case yields the model:

P(token|cat)
= P_{tokenLM(cat)}(token)
= λ * count(token,cat) / totalCount(cat)
  + (1 - λ) * P_{charLM(cat)}(token)

where tokenLM(cat) is the token language model defined for the specified category and charLM(cat) is the character-level language model it uses for smoothing. The unigram token model is based on the count count(token,cat) of the token in the category and the overall count totalCount(cat) of tokens in the category. The interpolation factor λ is computed as per the Witten-Bell model C with hyperparameter one:

λ = totalCount(cat) / (totalCount(cat) + numTokens(cat))

where numTokens(cat) is the number of distinct tokens observed in the category. Roughly, the probability mass smoothed away from the token model corresponds to the proportion of first sightings of tokens in the training data.
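
As a worked illustration of this interpolation, the following sketch computes the smoothed estimate from raw counts; the method and parameter names are hypothetical and not part of the LingPipe API.

// Interpolated unigram estimate, as in the formula above (illustrative only).
static double tokenEstimate(int tokenCount,          // count(token,cat)
                            long totalCount,         // totalCount(cat)
                            long numTokenTypes,      // numTokens(cat): distinct tokens seen
                            double charLmEstimate) { // P_{charLM(cat)}(token)
    double lambda = (double) totalCount / (double) (totalCount + numTokenTypes);
    double maxLikelihood = (double) tokenCount / (double) totalCount;
    return lambda * maxLikelihood + (1.0 - lambda) * charLmEstimate;
}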

If this character smoothing model is uniform, there are two extremes that need to be balanced, especially in cases where there is not very much training data per category. If it is initialized with the true number of characters, it will return a proper uniform character estimate. In practice, this will probably underestimate unknown tokens, and categories in which those tokens are unknown will pay a high penalty. If instead it is initialized with one as the maximum number of observed characters, the uniform character estimate is always one (a log estimate of zero), so the character backoff contributes nothing to the classification scores. This overestimates unknown tokens for classification, with probabilities summing to more than one, and in practice it will probably not penalize unknown words in categories enough. Whenever the per-character cost is greater than zero, the total cost of an unknown token is linear in its length.

Another way to smooth unknown tokens is to provide each model at least one instance of each token known to every other model, so there are no tokens known to one model and not another. But this adds an additional smoothing bias to the maximum likelihood character estimates which may or may not be helpful.
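
A sketch of this seeding strategy follows. It assumes a vocabulary set collected from all the training data and the inherited train(String,CharSequence) method; the helper name is hypothetical.

import java.util.Set;

// Give every category model one observation of every token known to
// any model, so no token is unknown to one category but known to another.
static void seedSharedVocabulary(NaiveBayesClassifier classifier,
                                 String[] categories,
                                 Set<String> vocabulary) {
    for (String category : categories)
        for (String token : vocabulary)
            classifier.train(category, token);
}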

The unigram model is constructed with a whitespace model that returns a constant zero estimate, UniformBoundaryLM.ZERO_LM, and thus contributes no probability mass to estimates.

As with the other language model classifiers, the conditional category probability ratios are determined with a category distribution and inversion:

ARGMAX_cat P(cat|tokens)
= ARGMAX_cat P(cat,tokens) / P(tokens)
= ARGMAX_cat P(cat,tokens)
= ARGMAX_cat P(tokens|cat) * P(cat)

The divisor P(tokens) may be dropped in the second step because it is constant across categories.
The category probability model P(cat) is taken to be a multivariate estimator with an initial count of one for each category.
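
In log space this inversion is just a sum followed by a maximization; the following sketch is purely illustrative and not part of the class's API.

// Choose the category maximizing log2 P(tokens|cat) + log2 P(cat).
static String argmaxCategory(String[] categories,
                             double[] log2TokensGivenCat, // log2 P(tokens|cat)
                             double[] log2CatPrior) {     // log2 P(cat)
    String bestCategory = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < categories.length; ++i) {
        double score = log2TokensGivenCat[i] + log2CatPrior[i];
        if (score > bestScore) {
            bestScore = score;
            bestCategory = categories[i];
        }
    }
    return bestCategory;
}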

For this class, the tokens are produced by a tokenizer factory. This tokenizer factory may normalize tokens to stems, lower case them, remove stop words, and so on; a sketch of a case-normalizing factory appears below. An extreme example would be to trim the bag to a small set of salient words, as picked out by TF/IDF with categories as documents.
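
For example, a factory might fold case before delegating to the standard Indo-European tokenizer, so that differently cased variants of a token share counts. This sketch assumes the TokenizerFactory interface's single tokenizer(char[],int,int) method; note that compiling the classifier typically requires the factory itself to be serializable or compilable.

import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

// Lower-cases the input before delegating to the Indo-European tokenizer.
public class LowerCasingTokenizerFactory implements TokenizerFactory {
    private final TokenizerFactory mBaseFactory
        = new IndoEuropeanTokenizerFactory();
    public Tokenizer tokenizer(char[] cs, int start, int length) {
        char[] lowered
            = new String(cs, start, length).toLowerCase().toCharArray();
        return mBaseFactory.tokenizer(lowered, 0, lowered.length);
    }
}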

Instances of this class may be compiled and read back into memory in the same way as other instances of DynamicLMClassifier. It also inherits its concurrent-read/single-write concurrency restrictions from that class (training is write; compiling and estimating are reads).
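
A minimal compile-and-read-back sketch, assuming the AbstractExternalizable.compile utility for Compilable objects; the cast target reflects the interfaces listed above.

import com.aliasi.classify.JointClassifier;
import com.aliasi.util.AbstractExternalizable;

// Compile the trained classifier into its static form and read it back.
@SuppressWarnings("unchecked")
static JointClassifier<CharSequence> compileStatic(NaiveBayesClassifier classifier)
        throws java.io.IOException, ClassNotFoundException {
    return (JointClassifier<CharSequence>)
        AbstractExternalizable.compile(classifier);
}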

Since:
LingPipe2.0
Version:
3.0
Author:
Bob Carpenter

Constructor Summary
NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory)
          Construct a naive Bayes classifier with the specified categories and tokenizer factory.
NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory, int charSmoothingNGram)
          Construct a naive Bayes classifier with the specified categories, tokenizer factory and level of character n-gram for smoothing token estimates.
NaiveBayesClassifier(String[] categories, TokenizerFactory tokenizerFactory, int charSmoothingNGram, int maxObservedChars)
          Construct a naive Bayes classifier with the specified categories, tokenizer factory and level of character n-gram for smoothing token estimates, along with a specification of the total number of characters in test and training instances.
 
Method Summary
 
Methods inherited from class com.aliasi.classify.DynamicLMClassifier
categoryEstimator, compileTo, createNGramBoundary, createNGramProcess, createTokenized, handle, handle, lmForCategory, resetCategory, train, train, train
 
Methods inherited from class com.aliasi.classify.LMClassifier
categories, categoryDistribution, classify, classifyJoint, languageModel
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

NaiveBayesClassifier

public NaiveBayesClassifier(String[] categories,
                            TokenizerFactory tokenizerFactory)
Construct a naive Bayes classifier with the specified categories and tokenizer factory.

The character backoff models are assumed to be uniform and there is no limit on the number of observed characters other than Character.MAX_VALUE.

Parameters:
categories - Categories into which to classify text.
tokenizerFactory - Text tokenizer.
Throws:
IllegalArgumentException - If there are not at least two categories.

NaiveBayesClassifier

public NaiveBayesClassifier(String[] categories,
                            TokenizerFactory tokenizerFactory,
                            int charSmoothingNGram)
Construct a naive Bayes classifier with the specified categories, tokenizer factory and level of character n-gram for smoothing token estimates. If the character n-gram is less than one, a uniform model will be used.

There is no limit on the number of observed characters other than Character.MAX_VALUE.

Parameters:
categories - Categories into which to classify text.
tokenizerFactory - Text tokenizer.
charSmoothingNGram - Order of character n-gram used to smooth token estimates.
Throws:
IllegalArgumentException - If there are not at least two categories.

NaiveBayesClassifier

public NaiveBayesClassifier(String[] categories,
                            TokenizerFactory tokenizerFactory,
                            int charSmoothingNGram,
                            int maxObservedChars)
Construct a naive Bayes classifier with the specified categories, tokenizer factory and level of character n-gram for smoothing token estimates, along with a specification of the total number of characters in test and training instances. If the character n-gram is less than one, a uniform model will be used.

As noted in the class documentation above, setting the max observed characters parameter to one effectively eliminates estimates of the string of an unknown token.

Parameters:
categories - Categories into which to classify text.
tokenizerFactory - Text tokenizer.
charSmoothingNGram - Order of character n-gram used to smooth token estimates.
maxObservedChars - The maximum number of characters found in the text of training and test sets.
Throws:
IllegalArgumentException - If there are not at least two categories or if the number of observed characters is less than 1 or more than the total number of characters.
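
To illustrate the three constructors together, here is a minimal sketch; the category names, the 5-gram order, and the 128-character limit are illustrative values only.

import com.aliasi.classify.NaiveBayesClassifier;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

public class ConstructorExamples {
    public static void main(String[] args) {
        String[] categories = { "spam", "ham" };
        TokenizerFactory tokenizerFactory = new IndoEuropeanTokenizerFactory();

        // Uniform character smoothing; no limit on observed characters.
        NaiveBayesClassifier basic
            = new NaiveBayesClassifier(categories, tokenizerFactory);

        // 5-gram character language model for smoothing unknown tokens.
        NaiveBayesClassifier charSmoothed
            = new NaiveBayesClassifier(categories, tokenizerFactory, 5);

        // Uniform smoothing over at most 128 distinct characters (ASCII-only text).
        NaiveBayesClassifier asciiOnly
            = new NaiveBayesClassifier(categories, tokenizerFactory, 0, 128);
    }
}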