java.lang.Object
  com.aliasi.spell.TokenizedDistance
    com.aliasi.spell.JaccardDistance
public class JaccardDistance
The JaccardDistance class implements a notion of
distance based on token overlap. The tokens are generated
from the character sequences being compared by a tokenizer
factory that is supplied at construction time. A distance of
zero (0) is a perfect match; a distance of
one (1) is a perfect mismatch.
Suppose termSet(cs) is the set of tokens extracted from
the character sequence cs. With these terms,
the proximity underlying Jaccard distance is defined
as the fraction of distinct tokens, out of all tokens found in
either character sequence, that appear in both sequences:
proximity(cs1,cs2)
= size(termSet(cs1) INTERSECT termSet(cs2))
/ size(termSet(cs1) UNION termSet(cs2))
Proximities run between 0 and 1. A proximity of 0 means the
character sequences share no terms, and a proximity of 1
means the character sequences have identical term sets.
Distance is then defined in terms of proximity by subtraction:
distance(cs1,cs2) = 1 - proximity(cs1,cs2)
Distances also run between 0 and 1. A distance of 0 means the character sequences share all of their terms, whereas a distance of 1 means they have no terms in common.
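
The definitions of proximity and distance can be illustrated with a minimal, self-contained sketch. It does not use this class or a TokenizerFactory: the class name JaccardSketch and its whitespace-splitting termSet helper are hypothetical stand-ins, and the value returned when both term sets are empty is an assumption, since the definition leaves 0/0 unspecified.

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {

    // Hypothetical stand-in for a tokenizer factory: splits on whitespace
    // and collects the distinct tokens.
    static Set<String> termSet(CharSequence cs) {
        Set<String> terms = new HashSet<>();
        for (String token : cs.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // size(intersection) / size(union), as in the proximity definition.
    static double proximity(CharSequence cs1, CharSequence cs2) {
        Set<String> terms1 = termSet(cs1);
        Set<String> terms2 = termSet(cs2);

        Set<String> intersection = new HashSet<>(terms1);
        intersection.retainAll(terms2);

        Set<String> union = new HashSet<>(terms1);
        union.addAll(terms2);

        if (union.isEmpty()) {
            // Assumption: two sequences with no tokens are treated as a
            // perfect match. The definition does not cover this case.
            return 1.0;
        }
        return (double) intersection.size() / union.size();
    }

    // distance = 1 - proximity
    static double distance(CharSequence cs1, CharSequence cs2) {
        return 1.0 - proximity(cs1, cs2);
    }

    public static void main(String[] args) {
        // Term sets {a, b, c} and {b, c, d}: 2 shared terms out of 4 total.
        System.out.println(proximity("a b c", "b c d")); // prints 0.5
        System.out.println(distance("a b c", "b c d"));  // prints 0.5
    }
}
```

Because term sets ignore repetition and order, "a a b" and "b a" have proximity 1 and distance 0 in this sketch.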
| Constructor Summary | |
|---|---|
| JaccardDistance(TokenizerFactory factory) | Construct an instance of Jaccard string distance using the specified tokenizer factory. |

| Method Summary | | |
|---|---|---|
| double | distance(CharSequence cSeq1, CharSequence cSeq2) | Returns the Jaccard distance between the specified character sequences. |
| double | proximity(CharSequence cSeq1, CharSequence cSeq2) | Returns the proximity between the specified character sequences. |

| Methods inherited from class com.aliasi.spell.TokenizedDistance |
|---|
| termFrequencyVector, tokenizerFactory, tokenSet, tokenSet |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |

| Constructor Detail |
|---|

public JaccardDistance(TokenizerFactory factory)
Parameters:
factory - Tokenizer factory for distance.

| Method Detail |
|---|

public double distance(CharSequence cSeq1,
CharSequence cSeq2)
Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.

public double proximity(CharSequence cSeq1,
CharSequence cSeq2)
Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.