com.aliasi.spell
Class JaccardDistance
java.lang.Object
com.aliasi.spell.TokenizedDistance
com.aliasi.spell.JaccardDistance
- All Implemented Interfaces:
- Distance<CharSequence>, Proximity<CharSequence>
public class JaccardDistance
- extends TokenizedDistance
The JaccardDistance class implements a notion of
distance based on token overlap. The tokens are generated
from the character sequences being compared by a tokenizer
factory that is supplied at construction time. A distance of
zero (0) is a perfect match; a distance of
one (1) is a perfect mismatch.
Suppose termSet(cs) is the set of tokens extracted from
the character sequence cs. Given these term sets,
the proximity underlying Jaccard distance is defined
as the fraction of the distinct tokens found in either
character sequence that appear in both:
proximity(cs1,cs2)
= size(termSet(cs1) INTERSECT termSet(cs2))
/ size(termSet(cs1) UNION termSet(cs2))
Proximities run between 0 and 1. A proximity of 0 means the
character sequences share no terms, and a proximity of 1
means the character sequences have identical term sets.
Distance is then defined in terms of proximity by subtraction from 1:
distance(cs1,cs2) = 1 - proximity(cs1,cs2)
Distances also run between 0 and 1. A distance of 0 means the
character sequences share all of their terms, whereas a distance of
1 means they have no terms in common.
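The definitions above can be illustrated with a minimal, self-contained sketch. It is not the LingPipe implementation: the class name JaccardSketch is hypothetical, whitespace splitting stands in for a real tokenizer factory, and treating two empty term sets as a perfect match is an assumption of this sketch only.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the formulas above; not part of LingPipe.
class JaccardSketch {

    // Stand-in for a tokenizer factory: split on runs of whitespace.
    static Set<String> termSet(CharSequence cs) {
        Set<String> terms = new HashSet<>();
        for (String token : cs.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // size(intersection) / size(union) of the two term sets.
    static double proximity(CharSequence cs1, CharSequence cs2) {
        Set<String> terms1 = termSet(cs1);
        Set<String> terms2 = termSet(cs2);

        Set<String> union = new HashSet<>(terms1);
        union.addAll(terms2);
        if (union.isEmpty()) {
            // Assumption of this sketch: two empty term sets match perfectly.
            return 1.0;
        }

        Set<String> intersection = new HashSet<>(terms1);
        intersection.retainAll(terms2);
        return (double) intersection.size() / union.size();
    }

    // distance = 1 - proximity
    static double distance(CharSequence cs1, CharSequence cs2) {
        return 1.0 - proximity(cs1, cs2);
    }

    public static void main(String[] args) {
        // Term sets {a, b, c} and {b, c, d}: 2 shared terms out of 4 distinct.
        System.out.println(proximity("a b c", "b c d")); // 0.5
        System.out.println(distance("a b c", "b c d"));  // 0.5
    }
}
```

Because the computation works on sets, repeated tokens and token order do not affect the result: in this sketch "a a b" and "b a" are at distance 0.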
- Since:
- LingPipe2.4
- Version:
- 3.8
- Author:
- Bob Carpenter
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
JaccardDistance
public JaccardDistance(TokenizerFactory factory)
- Construct an instance of Jaccard string distance using
the specified tokenizer factory.
- Parameters:
factory - Tokenizer factory for distance.
distance
public double distance(CharSequence cSeq1,
CharSequence cSeq2)
- Returns the Jaccard distance between the specified character
sequences. See the class documentation above for the definition.
- Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.
- Returns:
- Jaccard distance between the sequences.
proximity
public double proximity(CharSequence cSeq1,
CharSequence cSeq2)
- Returns the proximity between the specified character
sequences.
- Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.
- Returns:
- Jaccard proximity between the sequences.