java.lang.Object
  com.aliasi.dict.ExactDictionaryChunker
public class ExactDictionaryChunker
An exact dictionary chunker extracts chunks based on exact matches of tokenized dictionary entries.
All dictionary entry categories are converted to strings from
generic objects using Object.toString().
An exact dictionary chunker may be configured either to
extract all matching chunks, or to restrict the results to a
consistent set of non-overlapping chunks. These non-overlapping
chunks are taken to be the left-most, longest-matching,
highest-scoring, or alphabetically preceding in type according
to the following definitions. A chunk with span
(start1,end1) overlaps a chunk with span
(start2,end2) if and only if either endpoint
of the second chunk lies within the first chunk:
start1 <= start2 < end1, or
start1 < end2 <= end1.
For example, (0,1) and (1,3) do
not overlap, but
(0,1) overlaps (0,2),
(1,2) overlaps (0,2), and
(1,7) overlaps (2,3).
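The overlap test above can be written directly as a predicate. The following is a minimal sketch that transcribes the two inequalities as stated; the class and method names are hypothetical and are not part of the LingPipe API.

```java
// Hypothetical helper for illustration; not part of the LingPipe API.
class SpanOverlap {

    /**
     * Literal transcription of the definition: true if either endpoint
     * of the second span lies within the first span.
     */
    static boolean overlaps(int start1, int end1, int start2, int end2) {
        return (start1 <= start2 && start2 < end1)
            || (start1 < end2 && end2 <= end1);
    }

    public static void main(String[] args) {
        System.out.println(overlaps(0, 1, 1, 3)); // false
        System.out.println(overlaps(0, 1, 0, 2)); // true
        System.out.println(overlaps(1, 2, 0, 2)); // true
        System.out.println(overlaps(1, 7, 2, 3)); // true
    }
}
```

Read literally, the test is directional: `overlaps(2, 3, 1, 7)` is false because neither endpoint of (1,7) falls inside (2,3). A caller that wants a symmetric check can evaluate the predicate in both argument orders.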
A chunk chunk1=(start1,end1):type1@score1 dominates
another chunk chunk2=(start2,end2):type2@score2 if and
only if the chunks overlap and:
start1 < start2 (leftmost), or
start1 == start2
and end1 > end2 (longest), or
start1 == start2, end1 == end2
and score1 > score2 (highest scoring), or
start1 == start2, end1 == end2,
score1 == score2 and
type1 < type2 (alphabetical).
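The dominance ordering can be sketched as a cascade of comparisons. This is an illustration of the definition only, not LingPipe's implementation: the `Span` class is a hypothetical stand-in for a chunk, and "alphabetical" is assumed here to mean `String.compareTo` order.

```java
// Hypothetical sketch of the dominance relation; not LingPipe's code.
class ChunkDominance {

    /** Minimal stand-in for a chunk: a span plus a type and a score. */
    static final class Span {
        final int start;
        final int end;
        final String type;
        final double score;

        Span(int start, int end, String type, double score) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.score = score;
        }
    }

    /** Overlap as defined above: an endpoint of b lies within a. */
    static boolean overlaps(Span a, Span b) {
        return (a.start <= b.start && b.start < a.end)
            || (a.start < b.end && b.end <= a.end);
    }

    /** True if c1 dominates c2 under the four-step ordering. */
    static boolean dominates(Span c1, Span c2) {
        if (!overlaps(c1, c2)) return false;
        if (c1.start != c2.start) return c1.start < c2.start; // leftmost
        if (c1.end != c2.end) return c1.end > c2.end;         // longest
        if (c1.score != c2.score) return c1.score > c2.score; // highest scoring
        // Assumption: alphabetical means String.compareTo order on the type.
        return c1.type.compareTo(c2.type) < 0;
    }
}
```

Each test is reached only when all earlier fields are equal, which mirrors the way each clause of the definition repeats the equalities of the clauses before it.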
If the chunker is specified to be case sensitive, the exact
dictionary entries must match. If it is not case sensitive, all
matching will be done after applying string normalization using
String.toLowerCase().
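To see why the locale used for lowercasing matters, the sketch below compares English and Turkish lowercasing with the standard `String.toLowerCase(Locale)` method. The constructor documentation for this class states that case sensitivity is defined using `Locale.ENGLISH`; the `normalize` helper is a hypothetical name, and exactly how the chunker applies normalization internally is an assumption.

```java
import java.util.Locale;

// Hypothetical illustration of locale-dependent case normalization.
class CaseNormalization {

    /** Lowercases using English rules, as the constructor documentation describes. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // English rules map 'I' to 'i'.
        System.out.println(normalize("TITLE"));
        // Turkish rules map 'I' to dotless i (U+0131), so the result differs.
        System.out.println("TITLE".toLowerCase(Locale.forLanguageTag("tr")));
    }
}
```

For text in a language whose casing rules differ from English, one option is to normalize both the dictionary entries and the input externally and then use a case-sensitive chunker.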
Matching ignores whitespace as defined by the specified
tokenizer factory. The tokenizer factory should have
character-for-character aligned tokens with the input. That is, it
should not do stemming, stopword removal, etc., or this chunker
will not be able to calculate string positions. Safe tokenizer
factories include IndoEuropeanTokenizerFactory, RegExTokenizerFactory, and CharacterTokenizerFactory; unsafe ones
include the NGramTokenizerFactory and
anything user-defined constructed with a filter tokenizer,
including NormalizeWhiteSpaceFilterTokenizer, StopFilterTokenizer or a PorterStemmerFilterTokenizer.
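The alignment requirement can be illustrated with a small whitespace tokenizer that records character offsets. This is a self-contained sketch with hypothetical names, not one of the LingPipe tokenizer factories; it only shows the property that each token is an exact substring of the input at a known position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a character-aligned tokenizer; not a LingPipe class.
class AlignedTokens {

    /** Returns {start, end} character offsets (end exclusive) of maximal non-whitespace runs. */
    static List<int[]> tokenSpans(CharSequence text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++; // whitespace is skipped but still counted in the offsets
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            spans.add(new int[] {start, i});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "the  running dogs";
        for (int[] span : tokenSpans(text)) {
            // Every token is exactly text.substring(start, end).
            System.out.println(span[0] + "-" + span[1] + " " + text.substring(span[0], span[1]));
        }
    }
}
```

A stemming filter that turned "running" into "run" would yield a token that is no longer the substring at its recorded offsets, which is the kind of misalignment the paragraph above warns about.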
Chunking is thread safe, and may be run concurrently. The
return-all-matches flag should not be changed with setReturnAllMatches(boolean) while chunking
is running, as doing so may affect whether the running chunking
returns all matches. Once
constructed, the tokenizer's behavior should not change.
Implementation Note: This class is implemented using the Aho-Corasick algorithm, a generalization of the Knuth-Morris-Pratt string-matching algorithm to sets of strings. Aho-Corasick is linear in the number of tokens in the input plus the number of output chunks. Memory requirements are limited to an array of integers as long as the longest phrase (a circular queue holding the start points of potential chunks) plus the memory required by the chunking implementation for the result, which may be as large as quadratic in the size of the input, or very small if there are few matches. Compilation of the Aho-Corasick tree is done in the constructor and is linear in the number of dictionary entries, with a constant factor as high as the maximum phrase length; this could be improved to a constant factor using suffix-tree-like speedups, but that did not seem worth the complexity here, given that dictionaries are expected to be long-lived.
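For readers unfamiliar with Aho-Corasick over token sequences, the following is a minimal, self-contained sketch of the general technique. It is not LingPipe's implementation: all names are hypothetical, it stores phrase lengths on trie nodes rather than using the circular queue of start points mentioned above, and it assumes every phrase contains at least one token.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical token-level Aho-Corasick sketch; not LingPipe's code.
class TokenAhoCorasick {

    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        Node fail; // node for the longest proper suffix that is also a trie path
        final List<Integer> phraseLengths = new ArrayList<>(); // phrases ending here
    }

    private final Node root = new Node();

    TokenAhoCorasick(List<List<String>> phrases) {
        // Build the trie of tokenized phrases.
        for (List<String> phrase : phrases) {
            Node node = root;
            for (String token : phrase) {
                node = node.next.computeIfAbsent(token, t -> new Node());
            }
            node.phraseLengths.add(phrase.size());
        }
        // Compute failure links breadth-first so shallower nodes are done first.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<String, Node> entry : node.next.entrySet()) {
                Node child = entry.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(entry.getKey())) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(entry.getKey());
                // Inherit matches that end at the failure target.
                child.phraseLengths.addAll(child.fail.phraseLengths);
                queue.add(child);
            }
        }
    }

    /** Returns {start, end} token spans (end exclusive) of every phrase match. */
    List<int[]> findAll(List<String> tokens) {
        List<int[]> spans = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            while (node != root && !node.next.containsKey(token)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(token, root);
            for (int length : node.phraseLengths) {
                spans.add(new int[] {i + 1 - length, i + 1});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        TokenAhoCorasick matcher = new TokenAhoCorasick(Arrays.asList(
            Arrays.asList("new", "york"),
            Arrays.asList("new", "york", "city"),
            Arrays.asList("york", "city")));
        for (int[] span : matcher.findAll(Arrays.asList("in", "new", "york", "city"))) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

The input is scanned once and the failure links avoid re-reading tokens, so the work beyond emitting matches grows with the number of tokens, which matches the linear bound described in the note. The spans here are token indexes; mapping them back to character positions relies on the tokenizer alignment discussed earlier.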
| Constructor Summary | |
|---|---|
| ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory) | Construct an exact dictionary chunker from the specified dictionary and tokenizer factory which is case sensitive and returns all matches. |
| ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory, boolean returnAllMatches, boolean caseSensitive) | Construct an exact dictionary chunker from the specified dictionary and tokenizer factory, returning all matches or not as specified. |
| Method Summary | |
|---|---|
| boolean | caseSensitive(): Returns true if this dictionary chunker is case sensitive. |
| Chunking | chunk(char[] cs, int start, int end): Returns the chunking for the specified character slice. |
| Chunking | chunk(CharSequence cSeq): Returns the chunking for the specified character sequence. |
| boolean | returnAllMatches(): Returns true if this chunker returns all matches. |
| void | setReturnAllMatches(boolean returnAllMatches): Set whether to return all matches to the specified condition. |
| TokenizerFactory | tokenizerFactory(): Returns the tokenizer factory underlying this chunker. |
| String | toString(): Returns a string-based representation of this chunker. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public ExactDictionaryChunker(Dictionary<String> dict,
TokenizerFactory factory)
After construction, this class does not use the dictionary and will not be sensitive to changes in the underlying dictionary.
dict - Dictionary forming the basis of the chunker.
factory - Tokenizer factory underlying chunker.
public ExactDictionaryChunker(Dictionary<String> dict,
TokenizerFactory factory,
boolean returnAllMatches,
boolean caseSensitive)
After construction, this class does not use the dictionary and will not be sensitive to changes in the underlying dictionary.
Case sensitivity is defined using Locale.ENGLISH. For other languages, underlying case
sensitivity must be defined externally by passing in
case-normalized text.
dict - Dictionary forming the basis of the chunker.
factory - Tokenizer factory underlying chunker.
returnAllMatches - true if chunker should return all matches.
caseSensitive - true if chunker is case sensitive.
| Method Detail |
|---|
public TokenizerFactory tokenizerFactory()
Returns the tokenizer factory underlying this chunker.
public boolean caseSensitive()
Returns true if this dictionary chunker is
case sensitive. Case sensitivity must be defined at
construction time and may not be reset.
public boolean returnAllMatches()
Returns true if this chunker returns all matches.
public void setReturnAllMatches(boolean returnAllMatches)
Note that setting this while running a chunking in another thread may affect that chunking.
returnAllMatches - true if all matches should be returned.
public Chunking chunk(CharSequence cSeq)
chunk in interface Chunker
cSeq - Character sequence to chunk.
public Chunking chunk(char[] cs,
int start,
int end)
chunk in interface Chunker
cs - Underlying array of characters.
start - Index of first character in slice.
end - One past the index of the last character in the slice.
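The `start` and `end` parameters describe a half-open slice of the array. The sketch below uses only the standard `String(char[], int, int)` constructor to show which characters such a slice covers; the class and method names are hypothetical and no LingPipe call is involved.

```java
// Hypothetical illustration of half-open character slices.
class CharSliceDemo {

    /** Returns the characters from start (inclusive) to end (exclusive). */
    static String slice(char[] cs, int start, int end) {
        // The String constructor takes an offset and a count, so count = end - start.
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] cs = "John lives in New York".toCharArray();
        System.out.println(slice(cs, 14, 22)); // New York
    }
}
```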
public String toString()
toString in class Object