

java.lang.Object
  com.aliasi.classify.TfIdfClassifierTrainer<E>
Type Parameters:
E - the type of object being classified
public class TfIdfClassifierTrainer<E>
A TfIdfClassifierTrainer provides a framework for
training discriminative classifiers based on term frequency (TF)
and inverse document frequency (IDF) weighting of features.
A TfIdfClassifierTrainer is constructed from a
feature extractor of a specified type, producing an instance
that may then be trained. If the instance is to
be compiled, the feature extractor must be either serializable
or compilable.
Categories may be added dynamically. The initial classifier will be empty and not defined for any categories.
A TF/IDF classifier trainer is trained through the ObjectHandler
interface with a classified object. Specifically,
the method handle(Classified<E>) is called, the
generic object being the training instance and the classification
being a simple first-best classification.
For multiple training examples of the same category, their feature vectors are added together to produce the raw category vectors.
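The accumulation of raw category vectors described above can be sketched as follows. This is a minimal illustration in plain Java, not LingPipe's implementation: the class RawCategoryVectors and its methods add and get are hypothetical names, and features are assumed to be string-keyed numeric values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and representation are assumptions,
// not part of the LingPipe API.
class RawCategoryVectors {
    // category -> (feature -> summed raw value)
    private final Map<String, Map<String, Double>> vectors = new HashMap<>();

    // Add one training instance's feature vector to its category's raw vector.
    // A category is created the first time it is seen.
    void add(String category, Map<String, Double> features) {
        Map<String, Double> raw =
            vectors.computeIfAbsent(category, c -> new HashMap<>());
        for (Map.Entry<String, Double> e : features.entrySet()) {
            raw.merge(e.getKey(), e.getValue(), Double::sum);
        }
    }

    // Raw value v[category][feature], or 0.0 if the category or feature is undefined.
    double get(String category, String feature) {
        Map<String, Double> raw = vectors.get(category);
        return raw == null ? 0.0 : raw.getOrDefault(feature, 0.0);
    }
}
```

Two instances of the same category that both contain a feature contribute the sum of their values to that category's raw score for the feature.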
The compiled models perform scored classification. That is,
they implement the method classify(E)
to return a
ScoredClassification
. The scores assigned to the
different categories are normalized dot products after term
frequency and inverse document frequency weighting.
Suppose training supplied n training
categories cat[0], ..., cat[n-1], with
associated raw feature vectors v[0], ..., v[n-1].
The dimensions of these vectors are the features, so that
if f is a feature, v[i][f] is
the raw score for the feature f in
category cat[i].
First, the inverse document frequency weighting of each term is defined:
idf(f) = ln (n / df(f))
where df(f) is the document frequency of
feature f, defined to be the number of
distinct categories in which feature f is
defined. This has the effect of upweighting the scores of
features that occur in few categories and downweighting
the scores of features that occur in many categories.
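As an illustration, document frequency and IDF might be computed as in the sketch below. It assumes the conventional form idf(f) = ln(n / df(f)), which gives larger positive weights to features defined in fewer categories. The class IdfSketch and its method names are hypothetical and are not taken from LingPipe.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; assumes idf(f) = ln(n / df(f)).
class IdfSketch {
    // df(f): the number of categories whose raw vector defines feature f.
    static Map<String, Integer> documentFrequencies(
            Map<String, Map<String, Double>> rawVectors) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Double> vector : rawVectors.values()) {
            for (String feature : vector.keySet()) {
                df.merge(feature, 1, Integer::sum);
            }
        }
        return df;
    }

    // idf for a feature defined in df of the numCategories categories.
    // Callers are expected to pass df >= 1.
    static double idf(int df, int numCategories) {
        return Math.log((double) numCategories / df);
    }
}
```

Under this definition a feature defined in every category gets weight ln(1) = 0, so it contributes nothing to the weighted vectors.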
Term frequency normalization dampens the term frequencies using square roots:
tf(x) = sqrt(x)
This produces a linear relation in pairwise growth rather than the usual quadratic one derived from a simple cross-product.
The weighted feature vectors are as follows:
v'[i][f] = tf(v[i][f]) * idf(f)
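The weighting step can be sketched as below, using a hypothetical class WeightedVectorSketch with IDF values supplied as a precomputed map. Dropping features that have no IDF entry (that is, features never seen in training) is an assumption of this sketch, not documented behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of v'[f] = tf(v[f]) * idf(f); not LingPipe's code.
class WeightedVectorSketch {
    // Square-root dampening of a raw term frequency.
    static double tf(double x) {
        return Math.sqrt(x);
    }

    // Apply TF and IDF weighting to every feature of a raw vector.
    static Map<String, Double> weight(Map<String, Double> raw,
                                      Map<String, Double> idf) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Double> e : raw.entrySet()) {
            Double w = idf.get(e.getKey());
            if (w == null) {
                continue; // assumption: features without an IDF value are skipped
            }
            weighted.put(e.getKey(), tf(e.getValue()) * w);
        }
        return weighted;
    }
}
```

With square-root dampening, a feature with raw value k in both vectors contributes sqrt(k) * sqrt(k) = k to their dot product before IDF weighting, rather than k * k.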
Given an instance to classify, first the feature
extractor is used to produce a raw feature vector x.
This is then normalized in the same
way as the document vectors v[i], namely:
x'[f] = tf(x[f]) * idf(f)
The resulting query vector x' is then compared
against each normalized document vector v'[i]
using vector cosine, which defines its classification score:
score(v'[i],x') = cos(v'[i],x') = v'[i] * x' / ( length(v'[i]) * length(x') )
where v'[i] * x' is the vector dot product:
Σ_{f} v'[i][f] * x'[f]
and where the length of a vector is defined to be the square root of its dot product with itself:
length(y) = sqrt(y * y)
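The cosine score defined above can be sketched over sparse vectors as follows. The class CosineSketch is a hypothetical illustration that stores vectors as maps from feature to weight. It does not guard against zero-length vectors, for which the result is NaN.

```java
import java.util.Map;

// Hypothetical sketch of the cosine score over sparse feature vectors.
class CosineSketch {
    // Dot product: sum over shared features of a[f] * b[f].
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // length(y) = sqrt(y * y)
    static double length(Map<String, Double> y) {
        return Math.sqrt(dot(y, y));
    }

    // cos(a, b) = a * b / (length(a) * length(b))
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        return dot(a, b) / (length(a) * length(b));
    }
}
```

Only features present in both vectors contribute to the dot product, so vectors with no features in common score 0.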
Cosine scores will vary between -1 and 1.
The cosine is 1 between two
vectors only if they point in the same direction; that is, one is a
positive scalar multiple of the other. The cosine is
-1 between two vectors only if they point in opposite
directions; that is, one is a negative scalar multiple of the other.
The cosine is 0 for two vectors that are orthogonal,
that is, at right angles to each other. If all the values
in all of the category vectors and the query vector are
positive, cosine will run between 0 and 1.
Warning: Because of floating-point arithmetic rounding,
these results about signs and bounds are not strictly guaranteed to
hold; instances may return cosines slightly below -1
or above 1, or not return exactly 0 for
orthogonal vectors.
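Code that depends on the exact bounds can guard against this rounding on the caller's side. The sketch below shows one possible approach. The class CosineBounds, its methods, and the choice of tolerance are assumptions for illustration and are not part of LingPipe.

```java
// Hypothetical caller-side guards for cosine scores affected by rounding.
class CosineBounds {
    // Clamp a score into [-1, 1]. NaN is passed through unchanged,
    // because Math.min and Math.max return NaN when given NaN.
    static double clamp(double cosine) {
        return Math.max(-1.0, Math.min(1.0, cosine));
    }

    // Treat scores within eps of zero as orthogonal instead of testing == 0.
    static boolean isApproximatelyOrthogonal(double cosine, double eps) {
        return Math.abs(cosine) <= eps;
    }
}
```

Clamping matters in particular before calling Math.acos on a score, since Math.acos returns NaN for arguments outside [-1, 1].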
A TF/IDF classifier trainer may be serialized at any point. The object read back in will be an instance of the same class with the same parametric type for the objects being classified. During serialization, the feature extractor will be serialized if it's serializable, or compiled if it's compilable but not serializable. If the feature extractor is neither serializable nor compilable, serialization will throw an error.
At any point, a TF/IDF classifier trainer may be compiled to an object
output stream. The object read back in will be an instance of
Classifier<E,ScoredClassification>. During
compilation, the feature extractor will be compiled if it's
compilable, or serialized if it's serializable but not compilable.
If the feature extractor is neither compilable nor serializable,
compilation will throw an error.
The TF/IDF classifier indexes instances by means of their feature values.
Constructor Summary  

TfIdfClassifierTrainer(FeatureExtractor<? super E> featureExtractor)
Construct a TF/IDF classifier trainer based on the specified feature extractor. 
Method Summary  

Set<String> categories()
    Return a copy of the set of categories for which at least one training instance has been seen.
void compileTo(ObjectOutput out)
    Compile this trainer to the specified object output.
FeatureExtractor<? super E> featureExtractor()
    Return the feature extractor for this classifier.
void handle(Classified<E> classified)
    Handle the specified classified object as training data.
void handle(E input, Classification classification)
    Deprecated. Use handle(Classified) instead.
Methods inherited from class java.lang.Object 

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait 
Constructor Detail 

public TfIdfClassifierTrainer(FeatureExtractor<? super E> featureExtractor)
Parameters:
featureExtractor - Feature extractor for examples.
Method Detail

public FeatureExtractor<? super E> featureExtractor()
public Set<String> categories()
@Deprecated public void handle(E input, Classification classification)
Deprecated. Use handle(Classified) instead.
Specified by:
handle in interface ClassificationHandler<E,Classification>
Parameters:
input - Classified object.
classification - Classification of the object.
public void handle(Classified<E> classified)
Specified by:
handle in interface ObjectHandler<Classified<E>>
Parameters:
classified - Classified object for training.
public void compileTo(ObjectOutput out) throws IOException
Specified by:
compileTo in interface Compilable
Parameters:
out - Stream to which a compiled classifier is written.
Throws:
UnsupportedOperationException - If the underlying feature extractor is neither compilable nor serializable.
IOException - If there is an I/O error compiling the object.

