com.aliasi.spell
Class TfIdfDistance

java.lang.Object
  extended by com.aliasi.spell.TokenizedDistance
      extended by com.aliasi.spell.TfIdfDistance
All Implemented Interfaces:
Handler, TextHandler, Distance<CharSequence>, Proximity<CharSequence>

public class TfIdfDistance
extends TokenizedDistance
implements TextHandler

The TfIdfDistance class provides a string distance based on term frequency (TF) and inverse document frequency (IDF). The method distance(CharSequence,CharSequence) will return results in the range between 0 (perfect match) and 1 (no match) inclusive; the method proximity(CharSequence,CharSequence) runs in the opposite direction, returning 0 for no match and 1 for a perfect match. Full details are provided below.

Terms are produced from the character sequences being compared by a tokenizer factory fixed at construction time. These terms form the dimensions of vectors whose values are the counts for the terms in the strings being compared.

The raw term frequencies are adjusted in scale and by inverse document frequency. The resulting term vectors are then compared by one minus their cosine. Because the term vectors contain only positive values, the result is a distance between zero (0), for completely dissimilar strings, to one (1), for character-by-character identical strings.

The inverse document frequencies are defined over a collection of documents. The collection of documents must be provided to this class one at a time through either the generic text handler method handle(char[],int,int).

Formal Definition of TF/IDF Distance

Note that there are a range of different distances called "TF/IDF" distance. The one in this class is defined to be symmetric, unlike typical TF/IDF distances defined for information retrieval. It scales inverse-document frequencies by logs, and both inverse-document frequencies and term frequencies by square roots. This causes the influence of IDF to grow logarithmically, and term frequency comparison to grow linearly.

Suppose we have a collection docs of n strings, which we will call documents in keeping with tradition. Further let df(t,docs) be the document frequency of token t, that is, the number of documents in which the token t appears. Then the inverse document frequency (IDF) of t is defined by:

idf(t,docs) = sqrt(log(n/df(t,docs)))

If the document frequency df(t,docs) of a term is zero, then idf(t,docs) is set to zero. As a result, only terms that appeared in at least one training document are used during comparison.

The term vector for a string is then defined by its term frequencies. If count(t,cs) is the count of term t in character sequence cs, then the term frequency (TF) is defined by:

tf(t,cs) = sqrt(count(t,cs))

The term-frequency/inverse-document frequency (TF/IDF) vector tfIdf(cs,docs) for a character sequence cs over a collection of documents ds has a value tfIdf(cs,docs)(t) for term t defined by:

tfIdf(cs,docs)(t) = tf(t,cs) * idf(t,docs)

The proximity between character sequences cs1 and cs2 is defined as the cosine of their TF/IDF vectors:

dist(cs1,cs2) = 1 - cosine(tfIdf(cs1,docs),tfIdf(cs2,docs))

Recall that the cosine of two vectors is the dot product of the vectors divided by their lengths:

cos(x,y) = x . y / ( |x| * |y| )
where dot products are defined by:
x . y = Σi x[i] * y[i]
and length is defined by:
|x| = sqrt(x . x)

Distance is then just 1 minus the proximity value.

 distance(cs1,cs2) = 1 - proximity(cs1,cs2)
 

References

Since:
LingPipe2.4
Version:
3.9.1
Author:
Bob Carpenter

Field Summary
 
Fields inherited from class com.aliasi.spell.TokenizedDistance
mTokenizerFactory
 
Constructor Summary
TfIdfDistance(TokenizerFactory tokenizerFactory)
          Construct an instance of TF/IDF string distance based on the specified tokenizer factory.
 
Method Summary
 double distance(CharSequence cSeq1, CharSequence cSeq2)
          Return the TF/IDF distance between the specified character sequences.
 int docFrequency(String term)
          Returns the number of training documents that contained the specified term.
 void handle(char[] cs, int start, int length)
          Deprecated. Use handle(CharSequence) instead.
 void handle(CharSequence cSeq)
          Add the specified character sequence as a document for training.
 double idf(String term)
          Return the inverse-document frequency for the specified term.
 int numDocuments()
          Returns the total number of training documents.
 int numTerms()
          Returns the number of terms that have been seen during training.
 double proximity(CharSequence cSeq1, CharSequence cSeq2)
          Returns the TF/IDF proximity between the specified character sequences.
 Set<String> termSet()
          Returns the set of known terms for this distance.
 void trainIdf(CharSequence doc)
          Deprecated. Use handle(CharSequence) instead.
 
Methods inherited from class com.aliasi.spell.TokenizedDistance
termFrequencyVector, tokenizerFactory, tokenSet, tokenSet
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TfIdfDistance

public TfIdfDistance(TokenizerFactory tokenizerFactory)
Construct an instance of TF/IDF string distance based on the specified tokenizer factory.

Parameters:
tokenizerFactory - Tokenizer factory for this distance.
Method Detail

trainIdf

@Deprecated
public void trainIdf(CharSequence doc)
Deprecated. Use handle(CharSequence) instead.

Add the specified character sequence as a document for training. The training documents are only used for computing inverse-document frequencies. The more documents in which a document appears, the less it will be weighted during comparison.

Parameters:
doc - Character sequence to add to training set.

handle

@Deprecated
public void handle(char[] cs,
                              int start,
                              int length)
Deprecated. Use handle(CharSequence) instead.

Add the specified character slice as a document for training. This method provides an implementation of the TextHandler interface based on the method trainIdf(CharSequence). See trainIdf(CharSequence) for more information.

Specified by:
handle in interface TextHandler
Parameters:
cs - Underlying character array.
start - Index of first character of document.
length - Number of characters in the document.
Throws:
IndexOutOfBoundsException - If the start index is not within the array bounds, or if the start index plus the length minus one is not within the array bounds.

handle

public void handle(CharSequence cSeq)
Add the specified character sequence as a document for training. See trainIdf(CharSequence) for more information.

Parameters:
cSeq - Characters to trai.

distance

public double distance(CharSequence cSeq1,
                       CharSequence cSeq2)
Return the TF/IDF distance between the specified character sequences. This distance depends on the training instances supplied. See the class documentation above for more information.

Specified by:
distance in interface Distance<CharSequence>
Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.
Returns:
The TF/IDF distance between the two sequences.

proximity

public double proximity(CharSequence cSeq1,
                        CharSequence cSeq2)
Returns the TF/IDF proximity between the specified character sequences. The proximity depends on training instances and the tokenizer factory; see the class documentation above for details.

Specified by:
proximity in interface Proximity<CharSequence>
Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.
Returns:
The TF/IDF proximity between the two sequences.

docFrequency

public int docFrequency(String term)
Returns the number of training documents that contained the specified term.

Parameters:
term - Term to test.
Returns:
The number of training documents that contained the specified term.

idf

public double idf(String term)
Return the inverse-document frequency for the specified term. See the class doducmentation above for a formal definition.

Parameters:
term - The term whose IDF is returned.
Returns:
The IDF of the specified term.

numDocuments

public int numDocuments()
Returns the total number of training documents.

Returns:
The total number of training documents.

numTerms

public int numTerms()
Returns the number of terms that have been seen during training.

Returns:
The number of terms for this distance.

termSet

public Set<String> termSet()
Returns the set of known terms for this distance. The set will contain every token that was derived from a training instance. Only terms in the returned term set will contribute to similarity. All other terms have an inverse-document frequency of zero, so will not be matched, though they will contribute to length during the cosine calculation.

Returns:
The set of terms for this distance measure.