

java.lang.Object
  com.aliasi.spell.TokenizedDistance
    com.aliasi.spell.TfIdfDistance
public class TfIdfDistance
The TfIdfDistance class provides a string distance based on term frequency (TF) and inverse document frequency (IDF). The method distance(CharSequence,CharSequence) returns results in the range from 0 (perfect match) to 1 (no match), inclusive; the method proximity(CharSequence,CharSequence) runs in the opposite direction, returning 0 for no match and 1 for a perfect match. Full details are provided below.
Terms are produced from the character sequences being compared by a tokenizer factory fixed at construction time. These terms form the dimensions of vectors whose values are the counts for the terms in the strings being compared.
The raw term frequencies are adjusted in scale and by inverse document frequency. The resulting term vectors are then compared by one minus their cosine. Because the term vectors contain only non-negative values, the cosine lies between 0 and 1, so the result is a distance ranging from zero (0), for character-by-character identical strings, to one (1), for completely dissimilar strings.
The inverse document frequencies are defined over a collection of documents. The collection of documents must be provided to this class one at a time through the method handle(CharSequence); the older methods handle(char[],int,int) and trainIdf(CharSequence) are deprecated.
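The one-document-at-a-time training described above can be sketched as follows. This is a minimal illustration, not LingPipe code: the class name DocFrequencySketch is hypothetical, and whitespace splitting is an assumption standing in for the tokenizer factory.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the training side of a TF/IDF distance. */
class DocFrequencySketch {
    private final Map<String, Integer> docFrequency = new HashMap<>();
    private int numDocuments = 0;

    /** Adds one training document; each distinct token counts once per document. */
    void handle(CharSequence doc) {
        // Assumption: whitespace splitting replaces the real tokenizer factory.
        Set<String> distinct = new HashSet<>();
        for (String token : doc.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                distinct.add(token);
            }
        }
        for (String token : distinct) {
            docFrequency.merge(token, 1, Integer::sum);
        }
        numDocuments++;
    }

    /** Number of training documents containing the term (0 if unseen). */
    int docFrequency(String term) {
        return docFrequency.getOrDefault(term, 0);
    }

    int numDocuments() {
        return numDocuments;
    }

    /** Number of distinct terms seen during training. */
    int numTerms() {
        return docFrequency.size();
    }
}
```

Note that a token repeated within a single document raises its document frequency by one only, since document frequency counts documents rather than occurrences.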
Note that there is a range of different distances called "TF/IDF" distance. The one in this class is defined to be symmetric, unlike typical TF/IDF distances defined for information retrieval. It scales inverse-document frequencies by logs, and both inverse-document frequencies and term frequencies by square roots. Because each term shared by the two strings contributes a product of two such square roots to the comparison, this causes the influence of IDF to grow logarithmically, and term frequency comparison to grow linearly.
Suppose we have a collection docs of n strings, which we will call documents in keeping with tradition. Further let df(t,docs) be the document frequency of token t, that is, the number of documents in which the token t appears. Then the inverse document frequency (IDF) of t is defined by:
idf(t,docs) = sqrt(log(n/df(t,docs)))
If the document frequency df(t,docs) of a term is zero, then idf(t,docs) is set to zero. As a result, only terms that appeared in at least one training document are used during comparison.
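As a small illustration of the IDF definition and its zero-frequency rule, here is a sketch under stated assumptions: the class name IdfSketch is hypothetical, and the natural logarithm is assumed because the text does not state a base. The base does not affect the final cosine, since changing it rescales every IDF value by the same constant.

```java
/** Hypothetical helper illustrating idf(t,docs) = sqrt(log(n/df(t,docs))). */
class IdfSketch {
    /**
     * Returns the IDF for a term with the given document frequency.
     * Assumes 0 <= docFrequency <= numDocuments and uses the natural log.
     */
    static double idf(int docFrequency, int numDocuments) {
        if (docFrequency <= 0) {
            return 0.0; // unseen terms are given zero weight
        }
        return Math.sqrt(Math.log((double) numDocuments / docFrequency));
    }
}
```

With n = 4 documents, a term found in one document gets an IDF of about 1.177, while a term found in all four gets sqrt(log(1)) = 0 and so carries no weight in the comparison.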
The term vector for a string is then defined by its term frequencies. If count(t,cs) is the count of term t in character sequence cs, then the term frequency (TF) is defined by:
tf(t,cs) = sqrt(count(t,cs))
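The term frequency definition can be sketched as below. The class name TfSketch is hypothetical, and whitespace splitting is an assumption in place of the configured tokenizer factory.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper illustrating tf(t,cs) = sqrt(count(t,cs)). */
class TfSketch {
    /** Maps each token of cs to the square root of its count. */
    static Map<String, Double> termFrequencies(CharSequence cs) {
        // Assumption: whitespace splitting replaces the real tokenizer factory.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : cs.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        Map<String, Double> tf = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tf.put(e.getKey(), Math.sqrt(e.getValue()));
        }
        return tf;
    }
}
```

For the input "a a a a b", the token a occurs four times and gets a term frequency of 2, while b occurs once and gets 1.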
The term-frequency/inverse-document frequency (TF/IDF) vector tfIdf(cs,docs) for a character sequence cs over a collection of documents docs has a value tfIdf(cs,docs)(t) for term t defined by:
tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs)
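To make the earlier remark about logarithmic and linear growth concrete, the following worked step multiplies the TF/IDF values of two character sequences cs1 and cs2 for a single term t with df(t,docs) > 0. This product is what each shared term contributes to the numerator of the cosine used in the comparison. It is a derivation from the definitions above, written in LaTeX notation.

```latex
\begin{aligned}
\mathrm{tfIdf}(cs_1,docs)(t)\cdot\mathrm{tfIdf}(cs_2,docs)(t)
  &= \sqrt{\mathrm{count}(t,cs_1)}\,\sqrt{\log\frac{n}{\mathrm{df}(t,docs)}}
     \cdot \sqrt{\mathrm{count}(t,cs_2)}\,\sqrt{\log\frac{n}{\mathrm{df}(t,docs)}} \\
  &= \sqrt{\mathrm{count}(t,cs_1)\,\mathrm{count}(t,cs_2)}\;\log\frac{n}{\mathrm{df}(t,docs)}
\end{aligned}
```

When both sequences contain t the same number of times c, the contribution is c * log(n/df(t,docs)): linear in the count and logarithmic in the inverse document frequency.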
The proximity between character sequences cs1 and cs2 is defined as the cosine of their TF/IDF vectors:
proximity(cs1,cs2) = cosine(tfIdf(cs1,docs),tfIdf(cs2,docs))
Recall that the cosine of two vectors is the dot product of the vectors divided by the product of their lengths:
cosine(x,y) = x · y / (|x| * |y|)
where dot products are defined by:
x · y = Σ_i x[i] * y[i]
and length is defined by:
|x| = sqrt(x · x)
Distance is then just 1 minus the proximity value:
distance(cs1,cs2) = 1 - proximity(cs1,cs2)
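Putting the definitions together, the whole computation can be sketched in a self-contained class. This is an illustration of the formulas, not the LingPipe implementation. The class name TfIdfCosineSketch is hypothetical, whitespace splitting stands in for the tokenizer factory, the natural log is assumed, and returning a proximity of 0 when either vector has zero length is an assumption, since the text does not say how that case is handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical end-to-end sketch of the TF/IDF cosine proximity and distance. */
class TfIdfCosineSketch {
    private final Map<String, Integer> df = new HashMap<>();
    private int n = 0;

    // Assumption: whitespace splitting replaces the real tokenizer factory.
    private static List<String> tokens(CharSequence cs) {
        List<String> out = new ArrayList<>();
        for (String t : cs.toString().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Adds one training document to the document frequency counts. */
    void handle(CharSequence doc) {
        for (String t : new HashSet<>(tokens(doc))) {
            df.merge(t, 1, Integer::sum);
        }
        n++;
    }

    /** Builds the sparse vector tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs). */
    private Map<String, Double> tfIdf(CharSequence cs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens(cs)) {
            counts.merge(t, 1, Integer::sum);
        }
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int d = df.getOrDefault(e.getKey(), 0);
            double idf = d == 0 ? 0.0 : Math.sqrt(Math.log((double) n / d));
            v.put(e.getKey(), Math.sqrt(e.getValue()) * idf);
        }
        return v;
    }

    private static double dot(Map<String, Double> x, Map<String, Double> y) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : x.entrySet()) {
            sum += e.getValue() * y.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }

    /** Cosine of the two TF/IDF vectors. */
    double proximity(CharSequence cs1, CharSequence cs2) {
        Map<String, Double> x = tfIdf(cs1);
        Map<String, Double> y = tfIdf(cs2);
        double lengths = Math.sqrt(dot(x, x)) * Math.sqrt(dot(y, y));
        // Assumption: a zero-length vector yields proximity 0 to avoid 0/0.
        return lengths == 0.0 ? 0.0 : dot(x, y) / lengths;
    }

    /** One minus the proximity. */
    double distance(CharSequence cs1, CharSequence cs2) {
        return 1.0 - proximity(cs1, cs2);
    }
}
```

After training on "apple pie", "apple tart" and "cherry pie", the strings "apple tart" and "cherry pie" share no terms and are at distance 1, while "apple pie" compared with itself has a proximity of 1 up to floating-point rounding.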
See Also: org.apache.lucene.search.Similarity class documentation.
Field Summary 

Fields inherited from class com.aliasi.spell.TokenizedDistance 

mTokenizerFactory 
Constructor Summary  

TfIdfDistance(TokenizerFactory tokenizerFactory)
Construct an instance of TF/IDF string distance based on the specified tokenizer factory. 
Method Summary  

double distance(CharSequence cSeq1, CharSequence cSeq2)
    Return the TF/IDF distance between the specified character sequences.
int docFrequency(String term)
    Returns the number of training documents that contained the specified term.
void handle(char[] cs, int start, int length)
    Deprecated. Use handle(CharSequence) instead.
void handle(CharSequence cSeq)
    Add the specified character sequence as a document for training.
double idf(String term)
    Return the inverse-document frequency for the specified term.
int numDocuments()
    Returns the total number of training documents.
int numTerms()
    Returns the number of terms that have been seen during training.
double proximity(CharSequence cSeq1, CharSequence cSeq2)
    Returns the TF/IDF proximity between the specified character sequences.
Set<String> termSet()
    Returns the set of known terms for this distance.
void trainIdf(CharSequence doc)
    Deprecated. Use handle(CharSequence) instead.
Methods inherited from class com.aliasi.spell.TokenizedDistance 

termFrequencyVector, tokenizerFactory, tokenSet, tokenSet 
Methods inherited from class java.lang.Object 

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait 
Constructor Detail 

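public TfIdfDistance(TokenizerFactory tokenizerFactory)
Construct an instance of TF/IDF string distance based on the specified tokenizer factory.
Parameters:
tokenizerFactory - Tokenizer factory for this distance.

Method Detail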

@Deprecated public void trainIdf(CharSequence doc)
Deprecated. Use handle(CharSequence) instead.
Parameters:
doc - Character sequence to add to training set.

@Deprecated public void handle(char[] cs, int start, int length)
Deprecated. Use handle(CharSequence) instead.
Implements the TextHandler interface based on the method trainIdf(CharSequence). See trainIdf(CharSequence) for more information.
Specified by:
handle in interface TextHandler
Parameters:
cs - Underlying character array.
start - Index of first character of document.
length - Number of characters in the document.
Throws:
IndexOutOfBoundsException - If the start index is not within the array bounds, or if the start index plus the length minus one is not within the array bounds.

public void handle(CharSequence cSeq)
Add the specified character sequence as a document for training. See trainIdf(CharSequence) for more information.
Parameters:
cSeq - Characters to train.

public double distance(CharSequence cSeq1, CharSequence cSeq2)
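Return the TF/IDF distance between the specified character sequences.
Specified by:
distance in interface Distance<CharSequence>
Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.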
public double proximity(CharSequence cSeq1, CharSequence cSeq2)
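Returns the TF/IDF proximity between the specified character sequences.
Specified by:
proximity in interface Proximity<CharSequence>
Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.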
public int docFrequency(String term)
Returns the number of training documents that contained the specified term.
Parameters:
term - Term to test.
public double idf(String term)
Return the inverse-document frequency for the specified term.
Parameters:
term - The term whose IDF is returned.
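public int numDocuments()
Returns the total number of training documents.

public int numTerms()
Returns the number of terms that have been seen during training.

public Set<String> termSet()
Returns the set of known terms for this distance.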

