com.aliasi.classify
Class KnnClassifier<E>

java.lang.Object
  extended by com.aliasi.classify.KnnClassifier<E>
Type Parameters:
E - the type of objects being classified
All Implemented Interfaces:
BaseClassifier<E>, Classifier<E,ScoredClassification>, RankedClassifier<E>, ScoredClassifier<E>, ClassificationHandler<E,Classification>, Handler, ObjectHandler<Classified<E>>, Compilable, Serializable

public class KnnClassifier<E>
extends Object
implements ClassificationHandler<E,Classification>, Classifier<E,ScoredClassification>, ScoredClassifier<E>, ObjectHandler<Classified<E>>, Compilable, Serializable

A KnnClassifier implements k-nearest-neighor classification based on feature extraction and a vector proximity or distance. K-nearest-neighbor classification is a kind of memory-based learning in which every training instance is stored along with its category. To classify an object, the k nearest training examples to the object being classified are found. Each of the k nearest neighbors votes for its training category. The resulting classification scores are the result of voting. It is possible to weight the votes by proximity.

K-nearest-neighbor classifiers are particularly effective for highly irregular classification boundaries where linear classifiers like the perceptron have a hard time discriminiting instances. For instance, it's possible to learn checkerboard patterns in 2D space using k-nearest-neighbor classification.

Construction

A k-nearest neighbor classifier is constructed using a feature extractor, the number of neighbors k to consider, a vector distance or proximity and a boolean indicator of whether to weight results by proximity or treat the nearest neighbors equally.

Training

Training simply involves storing the feature vector for each training instance along with its category. The vectors are stored as instances of SparseFloatVector for space efficiency. They are constructed using a symbol table for features based on the specified feature extractor for this classifier.

Appropriate Distance and Proximity Functions

Nearness is defined by the proximity or distance functions over vectors supplied at construction time. As objects move nearer to one another, their distance decreases and their proximity increases. Distance measures are converted into proximities behind the scenes by inversion, as it leaves all proximities positive and finite:

     proximity(v1,v2) = 1 / (1 + distance(v1,v2))
This will scale distance functions that return results between 0 and positive infinity to return proximities between 0 and 1.

Warning: Distance functions used for k-nearest-neighbors classification should not return negative values; any zero or negative values will be converted to Double.POSITIVE_INFINITY.

Classification

Classification involves finding the k nearest neighbors to a query point. Both training instances and instances to be classified are converted to feature mappings using the specified feature extractor, and then encoded as sparse vectors using an implicitly managed feature symbol table.

The first step in classification is simply collecting the k nearest neighbors. That is, the training examples that have the greatest proximity to the example being classified. Given the set of k training examples that are closest to the test point, one of two strategies is used for determing scores. In the simple, unweighted case, the score of a category is simply the number of vectors in the k nearest neighbors with that category. In the weighted case, each vector in the k nearest neighbors contributes its proximity, and the final score is the sum of all proximities.

Choosing the Number of Neighbors k

In most cases, it makes sense to try to optimize for the number of neighbors k using cross-validation or held-out data.

In the weighted case, it sometimes makes sense to take the maximum number of neighbors k to be very large, potentially even Integer.MAX_VALUE. This is because the examples are weighted by proximity, and those far away may have vanishingly small proximities.

Serialization and Compilation

There is no compilation required for k-nearest-neighbors classification, so serialization and compilation produce the same result. The object read back in after serialization or compilation should be identical to the one serialized or compiled.

Implementation Notes

This is a brute force implementation of k-nearest neighbors in the sense that every training example is multiplied by every object being classified. Some k-nearest-neighbor implementations attempt to efficiently index the training examples, using techniques such as KD trees, so that search for nearest neighbors can be more efficient.

References

K-nearest-neighbor classification is widely used, and well described in several texts:

Since:
LingPipe3.1
Version:
3.9.1
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k)
          Construct a k-nearest-neighbor classifier based on the specified feature extractor and using the specified number of neighbors.
KnnClassifier(FeatureExtractor<? super E> featureExtractor, int k, Distance<Vector> distance)
          Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, and specified distance function.
KnnClassifier(FeatureExtractor<? super E> extractor, int k, Proximity<Vector> proximity, boolean weightByProximity)
          Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, specified proximity function, and boolean flag indicating whether or not to weight nearest neighbors by proximity during classification.
 
Method Summary
 List<String> categories()
          Returns a copy of he current list of categories for this classifier.
 ScoredClassification classify(E in)
          Return the k-nearest-neighbor classification result for the specified input object.
 void compileTo(ObjectOutput out)
          Compiles this k-nearest-neighbor classifier to the specified object output stream.
 FeatureExtractor<? super E> featureExtractor()
          Returns the feature extractor for ths KNN classifier.
 void handle(Classified<E> classifiedObject)
          Handle the specified classified object as a training instance.
 void handle(E trainingInstance, Classification classification)
          Deprecated. Use handle(Classified) instead.
 int k()
          Returns the number K of neighbors used by this K-nearest neighbors classifier.
 Proximity<Vector> proximity()
          Returns the proximity measure for this KNN classifier.
 boolean weightByProximity()
          Returns true if resposes are weighted by proximity.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KnnClassifier

public KnnClassifier(FeatureExtractor<? super E> featureExtractor,
                     int k)
Construct a k-nearest-neighbor classifier based on the specified feature extractor and using the specified number of neighbors. The distance measure will be taken to be EuclideanDistance. The nearest neighbors will not be weighted by proximity, and thus all have equal votes.

Parameters:
featureExtractor - Feature extractor for training and classification instances.
k - Maximum number of neighbors to use during classification.

KnnClassifier

public KnnClassifier(FeatureExtractor<? super E> featureExtractor,
                     int k,
                     Distance<Vector> distance)
Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, and specified distance function. The nearest neighbors will not be weighted by proximity, and thus all have equal votes.

Parameters:
featureExtractor - Feature extractor for training and classification instances.
k - Maximum number of neighbors to use during classification.
distance - Distance function to use to compare examples.

KnnClassifier

public KnnClassifier(FeatureExtractor<? super E> extractor,
                     int k,
                     Proximity<Vector> proximity,
                     boolean weightByProximity)
Construct a k-nearest-neighbor classifier based on the specified feature extractor, specified maximum number of neighbors, specified proximity function, and boolean flag indicating whether or not to weight nearest neighbors by proximity during classification.

Parameters:
extractor - Feature extractor for training and classification instances.
k - Maximum number of neighbors to use during classification.
proximity - Proximity function to compare examples.
weightByProximity - Flag indicating whether to weight neighbors by proximity during classification.
Method Detail

featureExtractor

public FeatureExtractor<? super E> featureExtractor()
Returns the feature extractor for ths KNN classifier.

Returns:
The feature extractor.

proximity

public Proximity<Vector> proximity()
Returns the proximity measure for this KNN classifier.

Returns:
The proximity.

categories

public List<String> categories()
Returns a copy of he current list of categories for this classifier.

Returns:
The categories for this classifier.

weightByProximity

public boolean weightByProximity()
Returns true if resposes are weighted by proximity.

Returns:
A flag indicating if this classifier is weighted by proximity.

k

public int k()
Returns the number K of neighbors used by this K-nearest neighbors classifier.

Returns:
The K for this KNN classifier.

handle

@Deprecated
public void handle(E trainingInstance,
                              Classification classification)
Deprecated. Use handle(Classified) instead.

Handle the specified classified training instance. The training instance is converted to a feature vector using the feature extractor, and then stored as a sparse vector relative to a feature symbol table.

Specified by:
handle in interface ClassificationHandler<E,Classification>
Parameters:
trainingInstance - Object being classified during training.
classification - Classification for specified object.

handle

public void handle(Classified<E> classifiedObject)
Handle the specified classified object as a training instance. The training instance is converted to a feature vector using the feature extractor, and then stored as a sparse vector relative to a feature symbol table.

Specified by:
handle in interface ObjectHandler<Classified<E>>
Parameters:
classifiedObject - Classified Object to use for training.

classify

public ScoredClassification classify(E in)
Return the k-nearest-neighbor classification result for the specified input object. The resulting classification will have all of the categories defined, though those with no support in the nearest neighbors will have scores of zero.

If this classifier does not weight by proximity, the resulting score for a category will be the number of nearest neighbors of the specified category. That is, it will be a straight vote.

If the classifier does weight by proximity, the resulting score for a category will be the sum of the proximity scores for the nearest neighbors of a given category. Instances with no near neighbors will be scored zero (0). Thus proximities should be configured to return positive values.

Specified by:
classify in interface BaseClassifier<E>
Specified by:
classify in interface Classifier<E,ScoredClassification>
Specified by:
classify in interface RankedClassifier<E>
Specified by:
classify in interface ScoredClassifier<E>
Parameters:
in - Object to classify.
Returns:
Scored classification for the specified object.

compileTo

public void compileTo(ObjectOutput out)
               throws IOException
Compiles this k-nearest-neighbor classifier to the specified object output stream.

This is only a convenience method. It provides exactly the same function as standard serialization.

Specified by:
compileTo in interface Compilable
Parameters:
out - Output stream to which this classifier is written.
Throws:
IOException - If there is an underlying I/O exception during compilation.