java.lang.Object
  com.aliasi.dict.ApproxDictionaryChunker
public class ApproxDictionaryChunker
An ApproxDictionaryChunker implements a chunker that
produces chunks based on weighted edit distance of strings from
dictionary entries. This is an approximate or "fuzzy"
dictionary matching strategy.
The underlying dictionary is required to be an instance of
TrieDictionary in order to support efficient search for
matches. Other dictionaries can be easily converted to
trie dictionaries by adding their entries to a fresh trie
dictionary.
Entries are matched by weighted edit distance, as supplied by an
implementation of WeightedEditDistance. All substrings
within the maximum distance specified at construction time are
returned as part of the chunking. Keep in mind that weights for
weighted edit distance are specified as proximities, that is, as
negative distances.
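The relation between proximities and the distance threshold can be illustrated with a small standalone sketch. It does not use or reproduce the library's classes: the class name ProximitySketch, the uniform weights, and the inclusive threshold comparison are all assumptions made for illustration. It computes the best (largest) proximity between two strings using insert, delete, substitute and match operations only, then negates it to obtain a distance.

```java
// Standalone illustration only; not the library's implementation.
final class ProximitySketch {
    // Hypothetical uniform weights, expressed as proximities (all <= 0).
    static final double MATCH = 0.0;
    static final double INSERT = -1.0;
    static final double DELETE = -1.0;
    static final double SUBSTITUTE = -1.0;

    /** Best (largest) proximity for editing a into b, without transposition. */
    static double proximity(CharSequence a, CharSequence b) {
        double[][] best = new double[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); ++i)
            best[i][0] = best[i - 1][0] + DELETE;   // delete every character of a
        for (int j = 1; j <= b.length(); ++j)
            best[0][j] = best[0][j - 1] + INSERT;   // insert every character of b
        for (int i = 1; i <= a.length(); ++i) {
            for (int j = 1; j <= b.length(); ++j) {
                boolean same = a.charAt(i - 1) == b.charAt(j - 1);
                double diag = best[i - 1][j - 1] + (same ? MATCH : SUBSTITUTE);
                double del = best[i - 1][j] + DELETE;
                double ins = best[i][j - 1] + INSERT;
                best[i][j] = Math.max(diag, Math.max(del, ins));
            }
        }
        return best[a.length()][b.length()];
    }

    /** Distance is the negated proximity, so it is never negative. */
    static double distance(CharSequence a, CharSequence b) {
        return 0.0 - proximity(a, b);
    }

    /** Assumption: the threshold is treated as inclusive. */
    static boolean withinThreshold(CharSequence a, CharSequence b, double maxDistance) {
        return distance(a, b) <= maxDistance;
    }
}
```

With these weights, swapping two adjacent characters costs two edits rather than one, because the sketch has no transposition operation.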
Transposition is not implemented in the approximate dictionary chunker, so no matches are possible through transposition. Specifically, the transpose weight method is never called on the underlying weighted edit distance.
The tokenizer factory supplied at construction time is only used to constrain search by enforcing boundary conditions. Chunks are only returned if they start on the first character of a token and end on the last character of a token.
Using an instance of CharacterTokenizerFactory effectively removes
token sensitivity by treating every non-whitespace character as a
token and thus rendering every non-whitespace position a possible
chunk boundary.
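The boundary condition described above can be sketched without the library. The class TokenBoundarySketch and its methods are hypothetical, and whitespace-delimited tokens are assumed as a stand-in for whatever a real tokenizer factory would produce. The first check accepts a span only if it starts on the first character of a token and ends on the last character of a token; the second mimics character-level tokenization, where any non-whitespace position may begin or end a chunk.

```java
// Standalone illustration only; whitespace tokenization is an assumption.
final class TokenBoundarySketch {
    /** True if index i is the first character of a whitespace-delimited token. */
    static boolean startsToken(CharSequence text, int i) {
        return i >= 0 && i < text.length()
            && !Character.isWhitespace(text.charAt(i))
            && (i == 0 || Character.isWhitespace(text.charAt(i - 1)));
    }

    /** True if end (exclusive) is one past the last character of a token. */
    static boolean endsToken(CharSequence text, int end) {
        return end > 0 && end <= text.length()
            && !Character.isWhitespace(text.charAt(end - 1))
            && (end == text.length() || Character.isWhitespace(text.charAt(end)));
    }

    /** Token-sensitive check: the span [start, end) must align with token edges. */
    static boolean isTokenAligned(CharSequence text, int start, int end) {
        return start < end && startsToken(text, start) && endsToken(text, end);
    }

    /** Character-level check: both end characters only need to be non-whitespace. */
    static boolean isCharAligned(CharSequence text, int start, int end) {
        return start >= 0 && start < end && end <= text.length()
            && !Character.isWhitespace(text.charAt(start))
            && !Character.isWhitespace(text.charAt(end - 1));
    }
}
```

For the text "p53 protein", the span covering "p53" passes both checks, while the span covering "53" passes only the character-level check.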
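The approach implemented here is very similar to that described in the following paper:
Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2003. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the 2003 ACL Workshop on NLP in Biomedicine.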
| Field Summary | |
|---|---|
| static WeightedEditDistance | TT_DISTANCE: This is a weighted edit distance defined by Tsuruoka and Tsujii for matching protein names in biomedical texts. |
| Constructor Summary | |
|---|---|
| ApproxDictionaryChunker(TrieDictionary<String> dictionary, TokenizerFactory tokenizerFactory, WeightedEditDistance editDistance, double distanceThreshold) | Construct an approximate dictionary chunker from the specified dictionary, tokenizer factory, weighted edit distance and distance bound. |
| Method Summary | |
|---|---|
| Chunking | chunk(char[] cs, int start, int end): Return the approximate dictionary-based chunking for the specified character slice. |
| Chunking | chunk(CharSequence cSeq): Return the approximate dictionary-based chunking for the specified character sequence. |
| TrieDictionary<String> | dictionary(): Returns the trie dictionary underlying this chunker. |
| double | distanceThreshold(): Returns the maximum edit distance a string can be from a dictionary entry in order to be returned by this chunker. |
| WeightedEditDistance | editDistance(): Returns the weighted edit distance for matching with this chunker. |
| void | setMaxDistance(double distanceThreshold): Set the maximum distance a string can be from a dictionary entry in order to be returned as a chunk by this chunker. |
| TokenizerFactory | tokenizerFactory(): Returns the tokenizer factory for matching with this chunker. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final WeightedEditDistance TT_DISTANCE
Tsuruoka and Tsujii's Weighted Edit Distance
| Operation | Character | Cost |
|---|---|---|
| Insertion | space or hyphen | -10 |
| Insertion | other characters | -100 |
| Deletion | space or hyphen | -10 |
| Deletion | other characters | -100 |
| Substitution | space for hyphen | -10 |
| Substitution | digit for other digit | -10 |
| Substitution | capital for lowercase | -10 |
| Substitution | other characters | -50 |
| Match | any character | 0 |
| Transposition | any characters | Double.NEGATIVE_INFINITY |
Tsuruoka and Tsujii's paper is available online:
Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2003. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the 2003 ACL Workshop on NLP in Biomedicine.
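The cost table for TT_DISTANCE can be read as a set of per-character weight functions. The following is a minimal standalone sketch of that reading, not the library's source: the class name TtWeightsSketch is hypothetical, and it assumes the space/hyphen and capital/lowercase substitutions apply in both directions, which the table does not state explicitly.

```java
// Standalone illustration of the cost table; not the library's implementation.
final class TtWeightsSketch {
    static boolean isSpaceOrHyphen(char c) {
        return c == ' ' || c == '-';
    }

    static double insertWeight(char inserted) {
        return isSpaceOrHyphen(inserted) ? -10.0 : -100.0;
    }

    static double deleteWeight(char deleted) {
        return isSpaceOrHyphen(deleted) ? -10.0 : -100.0;
    }

    static double matchWeight(char matched) {
        return 0.0;
    }

    /** Intended for distinct characters; identical characters use matchWeight. */
    static double substituteWeight(char from, char to) {
        // Assumption: space <-> hyphen is cheap in both directions.
        if (isSpaceOrHyphen(from) && isSpaceOrHyphen(to)) return -10.0;
        if (Character.isDigit(from) && Character.isDigit(to)) return -10.0;
        // Assumption: case changes are cheap in both directions.
        if (Character.toLowerCase(from) == Character.toLowerCase(to)) return -10.0;
        return -50.0;
    }

    /** Negative infinity rules out any match through transposition. */
    static double transposeWeight(char first, char second) {
        return Double.NEGATIVE_INFINITY;
    }
}
```

Because every weight is a proximity, the cheapest edits (those involving spaces, hyphens, digits and case) are the ones closest to zero.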
| Constructor Detail |
|---|
public ApproxDictionaryChunker(TrieDictionary<String> dictionary,
TokenizerFactory tokenizerFactory,
WeightedEditDistance editDistance,
double distanceThreshold)
Parameters:
dictionary - Dictionary to use for matching.
tokenizerFactory - Tokenizer factory for boundary determination.
editDistance - Matching distance measure.
distanceThreshold - Distance threshold for matching.
| Method Detail |
|---|
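public TrieDictionary<String> dictionary()
Returns the trie dictionary underlying this chunker.
public WeightedEditDistance editDistance()
Returns the weighted edit distance for matching with this chunker.
public TokenizerFactory tokenizerFactory()
Returns the tokenizer factory for matching with this chunker.
public double distanceThreshold()
Returns the maximum edit distance a string can be from a dictionary entry in order to be returned by this chunker. The threshold can be changed with setMaxDistance(double).
public void setMaxDistance(double distanceThreshold)
Set the maximum distance a string can be from a dictionary entry in order to be returned as a chunk by this chunker.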
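public Chunking chunk(CharSequence cSeq)
Return the approximate dictionary-based chunking for the specified character sequence.
Specified by:
chunk in interface Chunker
Parameters:
cSeq - Character sequence to chunk.
public Chunking chunk(char[] cs,
int start,
int end)
Return the approximate dictionary-based chunking for the specified character slice.
Specified by:
chunk in interface Chunker
Parameters:
cs - Underlying characters.
start - Index of first character in the array.
end - Index of one past the last character in the array.
Throws:
IllegalArgumentException - If the indices are out of bounds in the character sequence.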