com.aliasi.dict
Class ApproxDictionaryChunker

java.lang.Object
  extended by com.aliasi.dict.ApproxDictionaryChunker
All Implemented Interfaces:
Chunker, Serializable

public class ApproxDictionaryChunker
extends Object
implements Chunker, Serializable

An ApproxDictionaryChunker implements a chunker that produces chunks based on weighted edit distance of strings from dictionary entries. This is an approximate or "fuzzy" dictionary matching strategy.

The underlying dictionary is required to be an instance of TrieDictionary in order to support efficient search for matches. Other dictionaries can be easily converted to trie dictionaries by adding their entries to a fresh trie dictionary.

Entries are matched by weighted edit distance, as supplied by an implementation of WeightedEditDistance. All substrings within the maximum distance specified at construction time are returned as part of the chunking. Keep in mind that weights for weighted edit distance are specified as proximities, that is, as negative distances.

No Transposition

Transposition is not implemented in the approximate dictionary chunker, so no matches are possible through transposition. Specifically, the transpose weight method is never called on the underlying weighted edit distance.

Token Sensitivity

The tokenizer factory supplied at construction time is only used to constrain search by enforcing boundary conditions. Chunks are only returned if they start on the first character of a token and end on the last character of a token.

Using an instance of CharacterTokenizerFactory effectively removes token sensitivity by treating every non-whitespace character as a token and thus rendering every non-whitespace position a possible chunk boundary.

Serialization

An approximate dictionary is serializable if its tokenizer factory and edit distance are serializable. The reconstituted object will be an instance of this class, ApproxDictionaryChunker.

References

The approach implemented here is very similar to that described in the following paper:

The best general reference for approximate string matching is:

Since:
LingPipe2.1
Version:
3.9.1
Author:
Bob Carpenter
See Also:
Serialized Form

Field Summary
static WeightedEditDistance TT_DISTANCE
          This is a weighted edit distance defined by Tsuruoka and Tsujii for matching protein names in biomedical texts.
 
Constructor Summary
ApproxDictionaryChunker(TrieDictionary<String> dictionary, TokenizerFactory tokenizerFactory, WeightedEditDistance editDistance, double distanceThreshold)
          Construct an approximate dictionary chunker from the specified dictionary, tokenizer factory, weighted edit distance and distance bound.
 
Method Summary
 Chunking chunk(char[] cs, int start, int end)
          Return the approximate dictionary-based chunking for the specified character sequence.
 Chunking chunk(CharSequence cSeq)
          Return the approximate dictionary-based chunking for the specified character sequence.
 TrieDictionary<String> dictionary()
          Returns the trie dictionary underlying this chunker.
 double distanceThreshold()
          Returns the maximum edit distance a string can be from a dictionary entry in order to be returned by this chunker.
 WeightedEditDistance editDistance()
          Returns the weighted edit distance for matching with this chunker.
 void setMaxDistance(double distanceThreshold)
          Set the max distance a string can be from a dictionary entry in order to be returned as a chunk by this chunker.
 TokenizerFactory tokenizerFactory()
          Returns the tokenizer factory for matching with this chunker.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TT_DISTANCE

public static final WeightedEditDistance TT_DISTANCE
This is a weighted edit distance defined by Tsuruoka and Tsujii for matching protein names in biomedical texts. Reproducing table 1 from their paper provides the weighting function (converting slightly to our terminology and scale):
Operation Character Cost
Insertion space or hyphen -10
other characters -100
Deletion space or hyphen -10
other characters -100
Substitution space for hyphen -10
digit for other digit -10
capital for lowercase -10
other characters -50
Match any character 0
Transposition any characters Double.NEGATIVE_INFINITY
Tsuruoka and Tsujii's Weighted Edit Distance
Tsuruoka and Tsujii's paper is available online:
Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2003. Boosting precision and recall of dictionary-based protein name recognition In Proceedings of the 2003 ACL workshop on NLP in Biomedicine.

Constructor Detail

ApproxDictionaryChunker

public ApproxDictionaryChunker(TrieDictionary<String> dictionary,
                               TokenizerFactory tokenizerFactory,
                               WeightedEditDistance editDistance,
                               double distanceThreshold)
Construct an approximate dictionary chunker from the specified dictionary, tokenizer factory, weighted edit distance and distance bound. The dictionary is used for the candidate matches. The tokenizer factory is used for determining possible boundaries of matches, which must start on the first character of a token and end on the last character of a token. The edit distance is used for measuring substrings against dictionary entries. The distance threshold specifies the maximum distance at which matches are returned.

Parameters:
dictionary - Dictionary to use for matching.
tokenizerFactory - Tokenizer factory for boundary determination.
editDistance - Matching distance measure.
distanceThreshold - Distance threshold for matching.
Method Detail

dictionary

public TrieDictionary<String> dictionary()
Returns the trie dictionary underlying this chunker. This is the actual dictionary used by the chunker, so changes to it will affect this chunker.

Returns:
The trie dictionary underlying this chunker.

editDistance

public WeightedEditDistance editDistance()
Returns the weighted edit distance for matching with this chunker. This is the actual edit distance used by the chunker, so changes to it will affect this chunker.

Returns:
The weighted edit distance for this chunker.

tokenizerFactory

public TokenizerFactory tokenizerFactory()
Returns the tokenizer factory for matching with this chunker. This is the actual tokenizer factory used by this chunker, so changes to it will affect the behavior of this class.

Returns:
The tokenizer factory for this chunker.

distanceThreshold

public double distanceThreshold()
Returns the maximum edit distance a string can be from a dictionary entry in order to be returned by this chunker. This value is set using setMaxDistance(double).

Returns:
The maximum edit distance for this chunker.

setMaxDistance

public void setMaxDistance(double distanceThreshold)
Set the max distance a string can be from a dictionary entry in order to be returned as a chunk by this chunker.


chunk

public Chunking chunk(CharSequence cSeq)
Return the approximate dictionary-based chunking for the specified character sequence.

Specified by:
chunk in interface Chunker
Parameters:
cSeq - Character sequence to chunk.
Returns:
Chunking of the specified character sequence.

chunk

public Chunking chunk(char[] cs,
                      int start,
                      int end)
Return the approximate dictionary-based chunking for the specified character sequence.

Specified by:
chunk in interface Chunker
Parameters:
cs - Underlying characters.
start - Index of first character in the array.
end - Index of one past the last character in the array.
Returns:
Chunking of the specified character sequence.
Throws:
IllegalArgumentException - If the indices are out of bounds in the character sequence.