com.aliasi.classify
Class TfIdfClassifierTrainer<E>

java.lang.Object
  extended by com.aliasi.classify.TfIdfClassifierTrainer<E>
Type Parameters:
E - the type of object being classified
All Implemented Interfaces:
ClassificationHandler<E,Classification>, Handler, ObjectHandler<Classified<E>>, Compilable, Serializable

public class TfIdfClassifierTrainer<E>
extends Object
implements ClassificationHandler<E,Classification>, ObjectHandler<Classified<E>>, Compilable, Serializable

A TfIdfClassifierTrainer provides a framework for training discriminative classifiers based on term-frequency (TF) and inverse document frequency (IDF) weighting of features.

Construction

A TfIdfClassifierTrainer is constructed from a feature extractor of a specified type, producing an instance that may be trained through the handle(Classified) method. If the instance is to be compiled, the feature extractor must be either serializable or compilable.

Training

Categories may be added dynamically. The initial classifier will be empty and not defined for any categories.

A TF/IDF classifier trainer is trained through the ObjectHandler interface with a classified object. Specifically, the method handle(Classified) is called, the generic object being the training instance and the classification being a simple first-best classification.

For multiple training examples of the same category, their feature vectors are added together to produce the raw category vectors.
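The accumulation step can be sketched in plain Java as follows. This is a minimal illustration rather than LingPipe's implementation: the class RawCategoryVectors, its methods, and the use of Map<String, Double> for feature vectors are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the trainer's bookkeeping; not LingPipe code.
class RawCategoryVectors {
    // category -> (feature -> summed raw value)
    private final Map<String, Map<String, Double>> vectors = new HashMap<>();

    // Add one training instance's feature vector to the raw vector of its category.
    void train(String category, Map<String, Double> features) {
        Map<String, Double> raw = vectors.computeIfAbsent(category, c -> new HashMap<>());
        for (Map.Entry<String, Double> e : features.entrySet()) {
            raw.merge(e.getKey(), e.getValue(), Double::sum);
        }
    }

    // Raw vector for a category, or an empty map if the category is unseen.
    Map<String, Double> rawVector(String category) {
        return vectors.getOrDefault(category, Map.of());
    }

    public static void main(String[] args) {
        RawCategoryVectors t = new RawCategoryVectors();
        t.train("sports", Map.of("ball", 2.0, "goal", 1.0));
        t.train("sports", Map.of("ball", 1.0, "team", 1.0));
        System.out.println(t.rawVector("sports").get("ball")); // prints 3.0
    }
}
```

Because a category's entry is created on first use, categories are added dynamically, matching the behavior described above.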

Classification

The compiled models perform scored classification. That is, they implement the method classify(E) to return a ScoredClassification. The scores assigned to the different categories are normalized dot products after term frequency and inverse document frequency weighting.

Suppose training supplied n training categories cat[0], ..., cat[n-1], with associated raw feature vectors v[0], ..., v[n-1]. The dimensions of these vectors are the features, so that if f is a feature, v[i][f] is the raw score for the feature f in category cat[i].

First, the inverse document frequency weighting of each term is defined:

     idf(f) = ln (n / df(f))
where df(f) is the document frequency of feature f, defined to be the number of distinct categories in which feature f is defined. This has the effect of upweighting the scores of features that occur in few categories and downweighting the scores of features that occur in many categories. A feature that occurs in every category has df(f) = n and therefore idf(f) = 0.
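A small Java sketch of the document-frequency count and the resulting weight is given below. It takes idf(f) = ln(n / df(f)), the orientation that upweights features occurring in few categories. The class IdfSketch and its method names are hypothetical, and raw category vectors are assumed to be stored as maps from feature to value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of IDF weighting over category vectors; not LingPipe code.
class IdfSketch {
    // idf = ln(n / df): large for rare features, 0 for features present in every category.
    static double idf(int docFrequency, int numCategories) {
        return Math.log((double) numCategories / docFrequency);
    }

    // df(f): the number of categories whose raw vector defines feature f.
    static Map<String, Integer> documentFrequencies(Map<String, Map<String, Double>> categoryVectors) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Double> vector : categoryVectors.values()) {
            for (String feature : vector.keySet()) {
                df.merge(feature, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> cats = new HashMap<>();
        cats.put("sports", Map.of("ball", 3.0, "goal", 1.0));
        cats.put("finance", Map.of("bank", 2.0, "goal", 1.0));
        Map<String, Integer> df = documentFrequencies(cats);
        int n = cats.size();
        System.out.println(idf(df.get("goal"), n)); // goal is in both categories: prints 0.0
        System.out.println(idf(df.get("ball"), n)); // ball is in one category: ln 2, about 0.693
    }
}
```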

Term frequency normalization dampens the term frequencies using square roots:

     tf(x) = sqrt(x)
This produces a linear relation in pairwise growth rather than the usual quadratic one derived from a simple cross-product.

The weighted feature vectors are as follows:

     v'[i][f] = tf(v[i][f]) * idf(f)
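The weighting of a single raw vector might be sketched like this. The class TfIdfWeighting is hypothetical, the IDF values are assumed to have been computed already and passed in as a map, and skipping features that have no IDF value (features never seen in training) is an assumption of this sketch rather than documented behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of v'[f] = tf(v[f]) * idf(f); not LingPipe code.
class TfIdfWeighting {
    // Square-root dampening of raw term frequencies.
    static double tf(double rawCount) {
        return Math.sqrt(rawCount);
    }

    // Apply tf and idf to every feature of a raw vector.
    static Map<String, Double> weight(Map<String, Double> raw, Map<String, Double> idf) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            Double featureIdf = idf.get(e.getKey());
            if (featureIdf == null) {
                continue; // assumption: features without an IDF value are ignored
            }
            weighted.put(e.getKey(), tf(e.getValue()) * featureIdf);
        }
        return weighted;
    }

    public static void main(String[] args) {
        Map<String, Double> raw = Map.of("ball", 4.0, "goal", 9.0);
        Map<String, Double> idf = Map.of("ball", 0.5, "goal", 0.0);
        Map<String, Double> weighted = weight(raw, idf);
        System.out.println(weighted.get("ball")); // sqrt(4) * 0.5: prints 1.0
        System.out.println(weighted.get("goal")); // sqrt(9) * 0.0: prints 0.0
    }
}
```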

Given an instance to classify, first the feature extractor is used to produce a raw feature vector x. This is then normalized in the same way as the document vectors v[i], namely:

     x'[f] = tf(x[f]) * idf(f)
The resulting query vector x' is then compared against each normalized document vector v'[i] using vector cosine, which defines its classification score:
     score(v'[i],x')
     = cos(v'[i],x')
     = v'[i] * x' / ( length(v'[i]) * length(x') )
where v'[i] * x' is the vector dot product:
     Σf v'[i][f] * x'[f]
and where the length of a vector is defined to be the square root of its dot product with itself:
     length(y) = sqrt(y * y)
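
The score definition can be written out over sparse vectors as in the following sketch. The class CosineScore is hypothetical, vectors are assumed to be maps from feature to weighted value with missing features treated as 0, and returning 0 when either vector has zero length is a convention chosen for this example only.

```java
import java.util.Map;

// Hypothetical illustration of the cosine score over sparse vectors; not LingPipe code.
class CosineScore {
    // Dot product: sum over features of a[f] * b[f]; features missing from b contribute 0.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // length(y) = sqrt(y * y)
    static double length(Map<String, Double> y) {
        return Math.sqrt(dot(y, y));
    }

    // cos(a, b) = a * b / (length(a) * length(b))
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double denominator = length(a) * length(b);
        return denominator == 0.0 ? 0.0 : dot(a, b) / denominator; // zero-length case: assumption
    }

    public static void main(String[] args) {
        Map<String, Double> category = Map.of("x", 3.0, "y", 4.0);
        System.out.println(cosine(category, Map.of("x", 3.0, "y", 4.0))); // same direction: prints 1.0
        System.out.println(cosine(Map.of("x", 1.0), Map.of("y", 1.0)));   // orthogonal: prints 0.0
    }
}
```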

Cosine scores will vary between -1 and 1. The cosine is 1 only if the two vectors point in the same direction; that is, one is a positive scalar multiple of the other. The cosine is -1 only if the two vectors point in opposite directions; that is, one is a negative scalar multiple of the other. The cosine is 0 for two vectors that are orthogonal, that is, at right angles to each other. If all the values in all of the category vectors and the query vector are non-negative, cosine will run between 0 and 1.

Warning: Because of floating-point arithmetic rounding, these results about signs and bounds are not strictly guaranteed to hold; instances may return cosines slightly below -1 or above 1, or not return exactly 0 for orthogonal vectors.
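
Client code that depends on the exact bounds can guard against this rounding itself. The sketch below shows one way to do so; the class CosineGuards, its method names, and the tolerance passed in main are hypothetical choices for illustration and are not part of this class's API.

```java
// Hypothetical client-side guards for cosine scores; not LingPipe code.
class CosineGuards {
    // Force a score into [-1, 1] before using it where the bounds matter (e.g. Math.acos).
    static double clamp(double cosine) {
        return Math.max(-1.0, Math.min(1.0, cosine));
    }

    // Treat scores within the given tolerance of 0 as orthogonal.
    static boolean isEffectivelyZero(double cosine, double tolerance) {
        return Math.abs(cosine) <= tolerance;
    }

    public static void main(String[] args) {
        double slightlyHigh = 1.0000000000000002; // the smallest double above 1.0
        System.out.println(clamp(slightlyHigh));  // prints 1.0
        System.out.println(isEffectivelyZero(1e-17, 1e-12)); // prints true
    }
}
```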

Serialization

A TF/IDF classifier trainer may be serialized at any point. The object read back in will be an instance of the same class with the same parametric type for the objects being classified. During serialization, the feature extractor will be serialized if it's serializable, or compiled if it's compilable but not serializable. If the feature extractor is neither serializable nor compilable, serialization will throw an error.

Compilation

At any point, a TF/IDF classifier trainer may be compiled to an object output stream. The object read back in will be an instance of Classifier<E,ScoredClassification>. During compilation, the feature extractor will be compiled if it's compilable, or serialized if it's serializable but not compilable. If the feature extractor is neither compilable nor serializable, compilation will throw an UnsupportedOperationException.
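
The order of preference during compilation (compile first, serialize as a fallback, otherwise fail) can be illustrated with the following sketch. It is not the library's code: CompilableStandIn is a locally declared, hypothetical stand-in for LingPipe's Compilable interface so that the example needs no LingPipe jar, and writeExtractor is a hypothetical helper.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration of the compile-else-serialize fallback; not LingPipe code.
class CompileFallbackSketch {
    // Stand-in for LingPipe's Compilable interface (assumption for this sketch).
    interface CompilableStandIn {
        void compileTo(ObjectOutput out) throws IOException;
    }

    // Write the extractor, preferring compilation; returns which path was taken.
    static String writeExtractor(Object extractor, ObjectOutput out) throws IOException {
        if (extractor instanceof CompilableStandIn) {
            ((CompilableStandIn) extractor).compileTo(out);
            return "compiled";
        }
        if (extractor instanceof Serializable) {
            out.writeObject(extractor);
            return "serialized";
        }
        throw new UnsupportedOperationException(
            "feature extractor is neither compilable nor serializable");
    }

    public static void main(String[] args) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            CompilableStandIn compilable = o -> o.writeUTF("compiled form");
            System.out.println(writeExtractor(compilable, out));               // prints compiled
            System.out.println(writeExtractor("serializable stand-in", out));  // prints serialized
            try {
                writeExtractor(new Object(), out);
            } catch (UnsupportedOperationException e) {
                System.out.println("unsupported"); // prints unsupported
            }
        }
    }
}
```

Serialization of the trainer itself applies the opposite preference, as described in the Serialization section above.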

Reverse Indexing

The TF/IDF classifier indexes instances by means of their feature values.

Since:
LingPipe 3.1
Version:
3.9.1
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
TfIdfClassifierTrainer(FeatureExtractor<? super E> featureExtractor)
          Construct a TF/IDF classifier trainer based on the specified feature extractor.
 
Method Summary
 Set<String> categories()
          Return a copy of the set of categories for which at least one training instance has been seen.
 void compileTo(ObjectOutput out)
          Compile this trainer to the specified object output.
 FeatureExtractor<? super E> featureExtractor()
          Return the feature extractor for this classifier.
 void handle(Classified<E> classified)
          Handle the specified classified object as training data.
 void handle(E input, Classification classification)
          Deprecated. Use handle(Classified) instead.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TfIdfClassifierTrainer

public TfIdfClassifierTrainer(FeatureExtractor<? super E> featureExtractor)
Construct a TF/IDF classifier trainer based on the specified feature extractor. This feature extractor must be either serializable or compilable if the resulting trainer is to be compilable.

Parameters:
featureExtractor - Feature extractor for examples.
Method Detail

featureExtractor

public FeatureExtractor<? super E> featureExtractor()
Return the feature extractor for this classifier.

Returns:
The feature extractor for this classifier.

categories

public Set<String> categories()
Return a copy of the set of categories for which at least one training instance has been seen.

Returns:
The set of categories for this trainer.

handle

@Deprecated
public void handle(E input,
                   Classification classification)
Deprecated. Use handle(Classified) instead.

Train the classifier on the specified object with the specified classification.

Specified by:
handle in interface ClassificationHandler<E,Classification>
Parameters:
input - Classified object.
classification - Classification of the object.

handle

public void handle(Classified<E> classified)
Handle the specified classified object as training data.

Specified by:
handle in interface ObjectHandler<Classified<E>>
Parameters:
classified - Classified object for training.

compileTo

public void compileTo(ObjectOutput out)
               throws IOException
Compile this trainer to the specified object output.

Specified by:
compileTo in interface Compilable
Parameters:
out - Stream to which a compiled classifier is written.
Throws:
UnsupportedOperationException - If the underlying feature extractor is neither compilable nor serializable.
IOException - If there is an I/O error compiling the object.