com.aliasi.spell
Class TokenizedDistance

java.lang.Object
  extended by com.aliasi.spell.TokenizedDistance
All Implemented Interfaces:
Distance<CharSequence>, Proximity<CharSequence>
Direct Known Subclasses:
JaccardDistance, TfIdfDistance

public abstract class TokenizedDistance
extends Object
implements Distance<CharSequence>, Proximity<CharSequence>

The TokenizedDistance class provides an underlying implementation of string distance based on comparing sets of tokens. It holds a tokenizer factory and provides convenience methods for extracting tokens from the input.
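To make "distance based on comparing sets of tokens" concrete, the sketch below computes the standard Jaccard measure over two token sets. It is a minimal illustration in plain Java and is not taken from the LingPipe source. The class JaccardSketch, its whitespace tokenization, and its treatment of two empty inputs are assumptions made for this example only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not LingPipe code: whitespace splitting stands in
// for the tokenizer factory held by a real TokenizedDistance.
class JaccardSketch {

    // Distinct whitespace-separated tokens of the input.
    static Set<String> tokenSet(CharSequence cSeq) {
        Set<String> tokens = new HashSet<String>();
        String s = cSeq.toString().trim();
        if (!s.isEmpty()) {
            tokens.addAll(Arrays.asList(s.split("\\s+")));
        }
        return tokens;
    }

    // Jaccard proximity: size of intersection divided by size of union.
    static double proximity(CharSequence a, CharSequence b) {
        Set<String> tokensA = tokenSet(a);
        Set<String> tokensB = tokenSet(b);
        if (tokensA.isEmpty() && tokensB.isEmpty()) {
            return 1.0; // convention chosen for this sketch only
        }
        Set<String> intersection = new HashSet<String>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<String>(tokensA);
        union.addAll(tokensB);
        return (double) intersection.size() / union.size();
    }

    // Distance defined here as one minus proximity.
    static double distance(CharSequence a, CharSequence b) {
        return 1.0 - proximity(a, b);
    }

    public static void main(String[] args) {
        // Shared tokens: {the, cat}; union: {the, cat, sat, ran}.
        System.out.println(distance("the cat sat", "the cat ran")); // prints 0.5
    }
}
```

Because both inputs are reduced to sets, repeated tokens and token order do not affect the result in this sketch.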

The method tokenSet(CharSequence) provides the set of tokens derived by tokenizing the specified character sequence. The method termFrequencyVector(CharSequence) provides a mapping from tokens extracted by a tokenizer to integer counts.
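The difference between the two views can be sketched with plain Java collections. This is an illustration only: the class TokenViewsSketch is hypothetical, a whitespace split replaces the tokenizer factory, and a java.util.Map replaces ObjectToCounterMap, so it shows the shape of the results rather than LingPipe's actual behavior.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in, not part of LingPipe.
class TokenViewsSketch {

    // Splits on runs of whitespace; an assumption made for illustration.
    static String[] tokenize(CharSequence cSeq) {
        String s = cSeq.toString().trim();
        return s.isEmpty() ? new String[0] : s.split("\\s+");
    }

    // Analogue of tokenSet(CharSequence): distinct tokens, duplicates dropped.
    static Set<String> tokenSet(CharSequence cSeq) {
        Set<String> tokens = new LinkedHashSet<String>();
        for (String token : tokenize(cSeq)) {
            tokens.add(token);
        }
        return tokens;
    }

    // Analogue of termFrequencyVector(CharSequence): token -> occurrence count.
    static Map<String, Integer> termFrequencyVector(CharSequence cSeq) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String token : tokenize(cSeq)) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "the cat saw the dog";
        System.out.println(tokenSet(text));            // prints [the, cat, saw, dog]
        System.out.println(termFrequencyVector(text)); // prints {the=2, cat=1, saw=1, dog=1}
    }
}
```

The set view discards the fact that "the" occurs twice; the count view keeps it.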

Since:
LingPipe 2.4.0
Version:
3.0
Author:
Bob Carpenter

Field Summary
protected  TokenizerFactory mTokenizerFactory
          Deprecated. Use tokenizerFactory() instead.
 
Constructor Summary
TokenizedDistance(TokenizerFactory tokenizerFactory)
          Construct a tokenized distance from the specified tokenizer factory.
 
Method Summary
 ObjectToCounterMap<String> termFrequencyVector(CharSequence cSeq)
          Return the mapping from terms to their counts derived from the specified character sequence using the tokenizer factory in this class.
 TokenizerFactory tokenizerFactory()
          Return the tokenizer factory for this tokenized distance.
 Set<String> tokenSet(char[] cs, int start, int length)
          Return the set of tokens produced by the specified character slice using the tokenizer for this distance measure.
 Set<String> tokenSet(CharSequence cSeq)
          Return the set of tokens produced by the specified character sequence using the tokenizer for this distance measure.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface com.aliasi.util.Distance
distance
 
Methods inherited from interface com.aliasi.util.Proximity
proximity
 

Field Detail

mTokenizerFactory

@Deprecated
protected final TokenizerFactory mTokenizerFactory
Deprecated. Use tokenizerFactory() instead.
The underlying tokenizer factory, which is fixed at construction time.

Constructor Detail

TokenizedDistance

public TokenizedDistance(TokenizerFactory tokenizerFactory)
Construct a tokenized distance from the specified tokenizer factory.

Parameters:
tokenizerFactory - Tokenizer factory for this distance.
Method Detail

tokenizerFactory

public TokenizerFactory tokenizerFactory()
Return the tokenizer factory for this tokenized distance.

Returns:
This distance's tokenizer factory.

tokenSet

public Set<String> tokenSet(CharSequence cSeq)
Return the set of tokens produced by the specified character sequence using the tokenizer for this distance measure.

Parameters:
cSeq - Character sequence to tokenize.
Returns:
The token set for the character sequence.

tokenSet

public Set<String> tokenSet(char[] cs,
                            int start,
                            int length)
Return the set of tokens produced by the specified character slice using the tokenizer for this distance measure.

Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
length - Length of slice.
Returns:
The token set for the character slice.
Throws:
IndexOutOfBoundsException - If the start index is not within the underlying array, or if the start index plus the length minus one is not within the underlying array.
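The slice contract can be illustrated with a small stand-alone sketch. It assumes whitespace tokenization and relies on the JDK constructor new String(char[], int, int) for bounds checking, which throws StringIndexOutOfBoundsException, a subclass of IndexOutOfBoundsException. The class SliceTokenSketch is hypothetical, and LingPipe's handling of edge cases such as a zero-length slice may differ.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in, not part of LingPipe.
class SliceTokenSketch {

    // Tokenizes only the characters cs[start] .. cs[start + length - 1].
    static Set<String> tokenSet(char[] cs, int start, int length) {
        // The String constructor rejects slices outside the array with a
        // StringIndexOutOfBoundsException (an IndexOutOfBoundsException).
        String slice = new String(cs, start, length).trim();
        Set<String> tokens = new LinkedHashSet<String>();
        if (!slice.isEmpty()) {
            for (String token : slice.split("\\s+")) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] cs = "the cat saw the dog".toCharArray();
        // Characters 4 through 10 spell "cat saw".
        System.out.println(tokenSet(cs, 4, 7)); // prints [cat, saw]
        try {
            tokenSet(cs, 15, 10); // slice runs past the end of the array
        } catch (IndexOutOfBoundsException e) {
            System.out.println("slice out of bounds");
        }
    }
}
```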

termFrequencyVector

public ObjectToCounterMap<String> termFrequencyVector(CharSequence cSeq)
Return the mapping from terms to their counts derived from the specified character sequence using the tokenizer factory in this class.

Parameters:
cSeq - Character sequence to tokenize.
Returns:
Counts of tokens in character sequence.