com.aliasi.spell
Class JaccardDistance

java.lang.Object
  extended by com.aliasi.spell.TokenizedDistance
      extended by com.aliasi.spell.JaccardDistance
All Implemented Interfaces:
Distance<CharSequence>, Proximity<CharSequence>

public class JaccardDistance
extends TokenizedDistance

The JaccardDistance class implements a notion of distance based on token overlap. The tokens are generated from the character sequences being compared by a tokenizer factory that is supplied at construction time. A distance of zero (0) is a perfect match; a distance of one (1) is a perfect mismatch.

Suppose termSet(cs) is the set of tokens extracted from the character sequence cs. Given these term sets, the proximity underlying Jaccard distance is defined as the fraction of distinct tokens that appear in both character sequences, that is, the size of the intersection of the two term sets divided by the size of their union:

 proximity(cs1,cs2)
   = size(termSet(cs1) INTERSECT termSet(cs2))
     / size(termSet(cs1) UNION termSet(cs2))
Proximities run between 0 and 1 inclusive. A proximity of 0 means the character sequences share no terms, and a proximity of 1 means the character sequences have identical term sets.

Distance is then defined in terms of proximity by subtraction.

 distance(cs1,cs2) = 1 - proximity(cs1,cs2)
 
Distances also run between 0 and 1. A distance of 0 means the character sequences share all of their terms, whereas a distance of 1 means they have no terms in common.
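
The two definitions above can be illustrated with a minimal, self-contained sketch. It is not LingPipe's implementation: the class name JaccardSketch is hypothetical, a plain whitespace split stands in for the tokenizer factory, and returning a proximity of 1 when both term sets are empty is an assumption made only so the sketch avoids dividing by zero.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of LingPipe.
class JaccardSketch {

    // Stand-in for a tokenizer factory (assumption): split on whitespace
    // and collect the distinct tokens into a set.
    static Set<String> termSet(CharSequence cs) {
        Set<String> terms = new HashSet<>();
        for (String token : cs.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // proximity = size(intersection) / size(union)
    static double proximity(CharSequence cs1, CharSequence cs2) {
        Set<String> terms1 = termSet(cs1);
        Set<String> terms2 = termSet(cs2);

        Set<String> union = new HashSet<>(terms1);
        union.addAll(terms2);
        if (union.isEmpty()) {
            // Assumption of this sketch only: two empty term sets
            // are treated as a perfect match.
            return 1.0;
        }

        Set<String> intersection = new HashSet<>(terms1);
        intersection.retainAll(terms2);

        return (double) intersection.size() / union.size();
    }

    // distance = 1 - proximity
    static double distance(CharSequence cs1, CharSequence cs2) {
        return 1.0 - proximity(cs1, cs2);
    }

    public static void main(String[] args) {
        // Term sets {a, b, c} and {b, c, d}: 2 shared terms, 4 terms in the union.
        System.out.println(proximity("a b c", "b c d")); // prints 0.5
        System.out.println(distance("a b c", "b c d"));  // prints 0.5
    }
}
```

Because the computation works on sets, repeated tokens within a sequence do not change the result in this sketch: "a a b" and "a b" yield the same term set.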

Since:
LingPipe2.4
Version:
3.8
Author:
Bob Carpenter

Constructor Summary
JaccardDistance(TokenizerFactory factory)
          Construct an instance of Jaccard string distance using the specified tokenizer factory.
 
Method Summary
 double distance(CharSequence cSeq1, CharSequence cSeq2)
          Returns the Jaccard distance between the specified character sequences.
 double proximity(CharSequence cSeq1, CharSequence cSeq2)
          Returns the proximity between the specified character sequences.
 
Methods inherited from class com.aliasi.spell.TokenizedDistance
termFrequencyVector, tokenizerFactory, tokenSet, tokenSet
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

JaccardDistance

public JaccardDistance(TokenizerFactory factory)
Construct an instance of Jaccard string distance using the specified tokenizer factory.

Parameters:
factory - Tokenizer factory used to extract tokens from the character sequences being compared.
Method Detail

distance

public double distance(CharSequence cSeq1,
                       CharSequence cSeq2)
Returns the Jaccard distance between the specified character sequences. See the class documentation above for the definition.

Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.
Returns:
Jaccard distance between the sequences.

proximity

public double proximity(CharSequence cSeq1,
                        CharSequence cSeq2)
Returns the Jaccard proximity between the specified character sequences. See the class documentation above for the definition.

Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.
Returns:
Jaccard proximity between the sequences.