com.aliasi.dict
Class ExactDictionaryChunker

java.lang.Object
  extended by com.aliasi.dict.ExactDictionaryChunker
All Implemented Interfaces:
Chunker

public class ExactDictionaryChunker
extends Object
implements Chunker

An exact dictionary chunker extracts chunks based on exact matches of tokenized dictionary entries.

All dictionary entry categories are converted to strings from generic objects using Object.toString().

An exact dictionary chunker may be configured either to extract all matching chunks, or to restrict the results to a consistent set of non-overlapping chunks. These non-overlapping chunks are taken to be the left-most, longest-matching, highest-scoring, or alphabetically preceding in type according to the following definitions. A chunk with span (start1,end1) overlaps a chunk with span (start2,end2) if and only if either end point of the second chunk lies within the first chunk:

start1 <= start2 < end1, or
start1 < end2 <= end1.

For instance, (0,1) and (1,3) do not overlap, but (0,1) overlaps (0,2), (1,2) overlaps (0,2), and (1,7) overlaps (2,3).
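
The overlap test can be sketched in a few lines of plain Java. This is only an illustration, not LingPipe code: the class and method names are hypothetical, and it assumes half-open spans in which "lies within the first chunk" means start1 <= start2 < end1 for a start point and start1 < end2 <= end1 for an end point, a reading that agrees with the four examples above.

```java
/** Hypothetical helper, not part of LingPipe. Spans are half-open: [start, end). */
final class SpanOverlap {
    private SpanOverlap() { }

    /** True if either end point of the second span lies within the first span. */
    static boolean overlaps(int start1, int end1, int start2, int end2) {
        // Assumed reading: a start point may coincide with start1 but not end1.
        boolean startInside = start1 <= start2 && start2 < end1;
        // Assumed reading: an end point may coincide with end1 but not start1.
        boolean endInside = start1 < end2 && end2 <= end1;
        return startInside || endInside;
    }

    public static void main(String[] args) {
        System.out.println(overlaps(0, 1, 1, 3)); // false: spans only touch at 1
        System.out.println(overlaps(0, 1, 0, 2)); // true: shared start point
        System.out.println(overlaps(1, 2, 0, 2)); // true: shared end point
        System.out.println(overlaps(1, 7, 2, 3)); // true: second span is nested
    }
}
```

As written, the check is directional: overlaps(2, 3, 1, 7) is false because neither end point of (1,7) falls inside (2,3). A caller wanting a symmetric test could check both argument orders.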

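A chunk chunk1=(start1,end1):type1@score1 dominates another chunk chunk2=(start2,end2):type2@score2 if and only if the chunks overlap and:

start1 < start2 (leftmost), or
start1 == start2 and end1 > end2 (longest), or
start1 == start2, end1 == end2, and score1 > score2 (highest scoring), or
start1 == start2, end1 == end2, score1 == score2, and type1 < type2 (alphabetical).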

To construct a non-overlapping result, all dominated chunks are removed.
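
A minimal sketch of this filtering step follows, assuming the tie-break order named earlier (left-most, then longest, then highest-scoring, then alphabetically first type) and the same half-open overlap reading as the examples above. The Span class and all method names are hypothetical stand-ins, not LingPipe types, and the class's actual filtering code may be organized differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the dominance rules; not LingPipe code. */
final class DominanceSketch {

    /** Minimal stand-in for a chunk: half-open span [start, end), type, score. */
    static final class Span {
        final int start;
        final int end;
        final String type;
        final double score;

        Span(int start, int end, String type, double score) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.score = score;
        }
    }

    /** Assumed overlap test: an end point of b lies within a. */
    static boolean overlaps(Span a, Span b) {
        return (a.start <= b.start && b.start < a.end)
            || (a.start < b.end && b.end <= a.end);
    }

    /** True if a dominates b: leftmost, then longest, then score, then type. */
    static boolean dominates(Span a, Span b) {
        if (!overlaps(a, b)) return false;
        if (a.start != b.start) return a.start < b.start;
        if (a.end != b.end) return a.end > b.end;
        if (a.score != b.score) return a.score > b.score;
        return a.type.compareTo(b.type) < 0;
    }

    /** Keeps only the spans that no other span in the list dominates. */
    static List<Span> removeDominated(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (Span candidate : spans) {
            boolean dominated = false;
            for (Span other : spans) {
                if (other != candidate && dominates(other, candidate)) {
                    dominated = true;
                    break;
                }
            }
            if (!dominated) kept.add(candidate);
        }
        return kept;
    }
}
```

The sketch follows the sentence above literally and uses a simple quadratic pass. Read this way, a chain such as (0,2), (1,3), (2,4) loses both later chunks, since each is dominated by its left neighbor; whether the class itself behaves this way on such chains is not stated here.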

If the chunker is case sensitive, dictionary entries must match the input exactly. If it is not case sensitive, all matching is done after normalizing strings with String.toLowerCase().
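
The effect of the case-sensitivity setting can be illustrated with a small standalone helper. The names below are hypothetical and this is not the class's internal code; the sketch assumes the English locale that the constructor documentation mentions.

```java
import java.util.Locale;

/** Hypothetical helper showing the normalization step; not LingPipe code. */
final class CaseNormalizer {
    private CaseNormalizer() { }

    /** Returns the token unchanged if case sensitive, else lowercased for English. */
    static String normalize(String token, boolean caseSensitive) {
        return caseSensitive ? token : token.toLowerCase(Locale.ENGLISH);
    }

    /** Compares a dictionary token with an input token under the given setting. */
    static boolean matches(String dictToken, String inputToken, boolean caseSensitive) {
        return normalize(dictToken, caseSensitive)
            .equals(normalize(inputToken, caseSensitive));
    }
}
```

With this helper, matches("York", "YORK", false) holds while matches("York", "YORK", true) does not. Fixing the locale matters because lowercasing rules differ between languages, which is why text in other languages is best case-normalized before it reaches the chunker.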

Matching ignores whitespace as defined by the specified tokenizer factory. The tokenizer factory should produce tokens that align character-for-character with the input. That is, it should not do stemming, stopword removal, etc., or this chunker will not be able to calculate string positions. Safe tokenizer factories include IndoEuropeanTokenizerFactory, RegExTokenizerFactory, and CharacterTokenizerFactory; unsafe ones include NGramTokenizerFactory and any user-defined factory built on a filter tokenizer, such as NormalizeWhiteSpaceFilterTokenizer, StopFilterTokenizer, or PorterStemmerFilterTokenizer.
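
The alignment requirement can be made concrete with a standalone sketch. The tokenizer below is a hypothetical whitespace splitter, not a LingPipe TokenizerFactory; it only shows what it means for each token to appear verbatim at a known offset, and how a rewritten token (for example a stem) breaks that property.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of character-aligned tokens; not LingPipe code. */
final class AlignedTokens {

    /** A token plus the offset of its first character in the input. */
    static final class Token {
        final String text;
        final int start;

        Token(String text, int start) {
            this.text = text;
            this.start = start;
        }
    }

    /** Splits on whitespace, recording where each token starts. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            // Skip the whitespace between tokens.
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) ++i;
            int start = i;
            // Consume one run of non-whitespace characters.
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) ++i;
            if (i > start) tokens.add(new Token(input.substring(start, i), start));
        }
        return tokens;
    }

    /** True if every token's text appears verbatim at its recorded offset. */
    static boolean isAligned(String input, List<Token> tokens) {
        for (Token t : tokens) {
            if (!input.startsWith(t.text, t.start)) return false;
        }
        return true;
    }
}
```

Tokens produced by tokenize always pass isAligned for the same input. Replacing a token such as "ran" with a stem "run" at the same offset makes the check fail, which is the situation the paragraph above warns about.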

Chunking is thread safe, and may be run concurrently. The return-all-matches flag should not be changed with setReturnAllMatches(boolean) while chunking is running, as the change may affect whether a chunking already in progress returns all matches. Once constructed, the tokenizer's behavior should not change.

Implementation Note: This class is implemented using the Aho-Corasick algorithm, a generalization of the Knuth-Morris-Pratt string-matching algorithm to sets of strings. Aho-Corasick is linear in the number of tokens in the input plus the number of output chunks. Memory requirements are only an array of integers as long as the longest phrase (a circular queue for holding start points of potential chunks) and the memory required by the chunking implementation for the result (which may be as large as quadratic in the size of the input, or may be very small if there are not many matches). Compilation of the Aho-Corasick tree is done in the constructor and is linear in the number of dictionary entries, with a constant factor as high as the maximum phrase length. This factor could be reduced to a constant using suffix-tree-like speedups, but that didn't seem worth the complexity here, given that the dictionaries are expected to be long-lived.
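
For readers unfamiliar with the algorithm, the following is a generic, textbook-style sketch of Aho-Corasick matching over tokens rather than characters. It is not LingPipe's implementation: the class name and structure are hypothetical, it reports token positions instead of character offsets, and it omits categories, scores, and the circular queue of start points described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical token-level Aho-Corasick matcher; not LingPipe's implementation. */
final class TokenAhoCorasick {

    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        Node fail;         // state for the longest proper suffix that is a trie path
        Node output;       // nearest state on the fail chain that ends a phrase
        int phraseLength;  // number of tokens in the phrase ending here, 0 if none
    }

    private final Node root = new Node();

    TokenAhoCorasick(List<String[]> phrases) {
        // Build the trie of tokenized phrases.
        for (String[] phrase : phrases) {
            Node node = root;
            for (String token : phrase) {
                node = node.next.computeIfAbsent(token, k -> new Node());
            }
            node.phraseLength = phrase.length;
        }
        // Breadth-first pass to set failure and output links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<String, Node> entry : node.next.entrySet()) {
                Node child = entry.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(entry.getKey())) f = f.fail;
                Node target = f.next.get(entry.getKey());
                child.fail = (target != null) ? target : root;
                child.output = (child.fail.phraseLength > 0) ? child.fail : child.fail.output;
                queue.add(child);
            }
        }
    }

    /** Returns {startToken, endToken} pairs, end exclusive, for every phrase occurrence. */
    List<int[]> findAll(String[] tokens) {
        List<int[]> matches = new ArrayList<>();
        Node state = root;
        for (int i = 0; i < tokens.length; ++i) {
            // Follow failure links until the current token can be consumed.
            while (state != root && !state.next.containsKey(tokens[i])) state = state.fail;
            Node next = state.next.get(tokens[i]);
            state = (next != null) ? next : root;
            // Report every phrase that ends at this token.
            for (Node n = state; n != null; n = n.output) {
                if (n.phraseLength > 0) {
                    matches.add(new int[] { i + 1 - n.phraseLength, i + 1 });
                }
            }
        }
        return matches;
    }
}
```

Each input token is consumed once and failure links only ever shorten the current match, so scanning takes time proportional to the number of tokens plus the number of reported matches, consistent with the bound stated above.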

Since:
LingPipe 2.3.1
Version:
3.8.1
Author:
Bob Carpenter

Constructor Summary
ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory)
          Construct an exact dictionary chunker from the specified dictionary and tokenizer factory; the resulting chunker is case sensitive and returns all matches.
ExactDictionaryChunker(Dictionary<String> dict, TokenizerFactory factory, boolean returnAllMatches, boolean caseSensitive)
          Construct an exact dictionary chunker from the specified dictionary and tokenizer factory, with return-all-matches and case-sensitivity behavior as specified.
 
Method Summary
 boolean caseSensitive()
          Returns true if this dictionary chunker is case sensitive.
 Chunking chunk(char[] cs, int start, int end)
          Returns the chunking for the specified character slice.
 Chunking chunk(CharSequence cSeq)
          Returns the chunking for the specified character sequence.
 boolean returnAllMatches()
          Returns true if this chunker returns all matches.
 void setReturnAllMatches(boolean returnAllMatches)
          Sets whether this chunker returns all matches, according to the specified flag.
 TokenizerFactory tokenizerFactory()
          Returns the tokenizer factory underlying this chunker.
 String toString()
          Returns a string-based representation of this chunker.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

ExactDictionaryChunker

public ExactDictionaryChunker(Dictionary<String> dict,
                              TokenizerFactory factory)
Construct an exact dictionary chunker from the specified dictionary and tokenizer factory. The resulting chunker is case sensitive and returns all matches. See the class documentation above for more information on chunking and requirements for tokenizer factories.

After construction, this class does not use the dictionary and will not be sensitive to changes in the underlying dictionary.

Parameters:
dict - Dictionary forming the basis of the chunker.
factory - Tokenizer factory underlying chunker.

ExactDictionaryChunker

public ExactDictionaryChunker(Dictionary<String> dict,
                              TokenizerFactory factory,
                              boolean returnAllMatches,
                              boolean caseSensitive)
Construct an exact dictionary chunker from the specified dictionary and tokenizer factory, with return-all-matches and case-sensitivity behavior as specified. See the class documentation above for more information on chunking.

After construction, this class does not use the dictionary and will not be sensitive to changes in the underlying dictionary.

Case sensitivity is defined using Locale.ENGLISH. For other languages, underlying case sensitivity must be defined externally by passing in case-normalized text.

Parameters:
dict - Dictionary forming the basis of the chunker.
factory - Tokenizer factory underlying chunker.
returnAllMatches - true if chunker should return all matches.
caseSensitive - true if chunker is case sensitive.

Method Detail

tokenizerFactory

public TokenizerFactory tokenizerFactory()
Returns the tokenizer factory underlying this chunker. Once set in the constructor, the tokenizer factory may not be changed. If the tokenizer factory allows dynamic reconfiguration, it should not be reconfigured or inconsistent results may be returned.

Returns:
The tokenizer factory for this chunker.

caseSensitive

public boolean caseSensitive()
Returns true if this dictionary chunker is case sensitive. Case sensitivity must be defined at construction time and may not be reset.

Returns:
Whether this chunker is case sensitive.

returnAllMatches

public boolean returnAllMatches()
Returns true if this chunker returns all matches.

Returns:
Whether this chunker returns all matches.

setReturnAllMatches

public void setReturnAllMatches(boolean returnAllMatches)
Sets whether this chunker returns all matches, according to the specified flag.

Note that setting this while running a chunking in another thread may affect that chunking.

Parameters:
returnAllMatches - true if all matches should be returned.

chunk

public Chunking chunk(CharSequence cSeq)
Returns the chunking for the specified character sequence. Whether all matching chunks are returned depends on whether this chunker is configured to return all matches or not. See the class documentation above for more information.

Specified by:
chunk in interface Chunker
Parameters:
cSeq - Character sequence to chunk.
Returns:
The chunking for the specified character sequence.

chunk

public Chunking chunk(char[] cs,
                      int start,
                      int end)
Returns the chunking for the specified character slice. Whether all matching chunks are returned depends on whether this chunker is configured to return all matches or not. See the class documentation above for more information.

Specified by:
chunk in interface Chunker
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One past the index of the last character in the slice.
Returns:
The chunking for the specified character slice.
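
The slice convention (start inclusive, end exclusive) can be illustrated with a standalone helper. The name below is hypothetical and the sketch only shows which characters a given (cs, start, end) triple covers; it says nothing about how this method reports chunk positions.

```java
/** Hypothetical helper illustrating the slice convention; not LingPipe code. */
final class CharSlice {
    private CharSlice() { }

    /** Copies the characters cs[start] through cs[end - 1] into a string. */
    static String sliceToString(char[] cs, int start, int end) {
        // The String constructor takes an offset and a count, so the
        // count is derived from the exclusive end index.
        return new String(cs, start, end - start);
    }
}
```

For example, with cs holding "exact match", the slice from 6 to 11 covers the five characters of "match".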

toString

public String toString()
Returns a string-based representation of this chunker. The string includes the tokenizer factory's class name, whether or not it returns all matches, whether or not it is case sensitive, and also includes the entire trie underlying the matcher, which is quite large for large dictionaries (multiple lines per dictionary entry).

Overrides:
toString in class Object
Returns:
String-based representation of this chunker.