com.aliasi.chunk
Class HmmChunker

java.lang.Object
  extended by com.aliasi.chunk.HmmChunker
All Implemented Interfaces:
Chunker, ConfidenceChunker, NBestChunker
Direct Known Subclasses:
CharLmHmmChunker

public class HmmChunker
extends Object
implements NBestChunker, ConfidenceChunker

An HmmChunker uses a hidden Markov model to perform chunking over tokenized character sequences. Instances contain a hidden Markov model, a decoder for the model, and a tokenizer factory.

Chunking results are available through three related methods. The method chunk(CharSequence) and its sister method chunk(char[],int,int) implement the Chunker interface by returning the first-best chunking for the argument character sequence or slice. The method nBest(char[],int,int,int) returns an iterator over complete chunkings and their joint probability estimates in descending order of probability, with the last argument supplying an upper bound on the number of chunkings returned. Finally, the method nBestChunks(char[],int,int,int) returns an iterator over the chunks themselves, this time in descending order of each chunk's conditional probability given the input (i.e. descending confidence), with the final argument providing an upper bound on the number of such chunks returned.
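As an illustration, the three access methods might be exercised as follows. This is a sketch only: the chunker construction is elided, and the text and n-best bounds are arbitrary.

```java
// Sketch: assumes a trained HmmChunker is in hand.
HmmChunker chunker = ...;  // e.g. deserialized from a trained model

String text = "John ran to Washington.";
char[] cs = text.toCharArray();

// First-best chunking via the Chunker interface.
Chunking chunking = chunker.chunk(text);
for (Chunk chunk : chunking.chunkSet())
    System.out.println(chunk.type() + ": "
                       + text.substring(chunk.start(), chunk.end()));

// Up to 5 whole chunkings, scored by joint probability, best first.
Iterator<ScoredObject<Chunking>> nBestIt = chunker.nBest(cs, 0, cs.length, 5);
while (nBestIt.hasNext()) {
    ScoredObject<Chunking> scored = nBestIt.next();
    System.out.println(scored.score() + " " + scored.getObject());
}

// Up to 10 individual chunks in descending order of confidence.
Iterator<Chunk> chunkIt = chunker.nBestChunks(cs, 0, cs.length, 10);
while (chunkIt.hasNext())
    System.out.println(chunkIt.next());
```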

Internal Token/Tag Encoding

The chunker requires a hidden Markov model whose states conform to a token-by-token encoding of a chunking. This class assumes the following encoding:

Tag     | Description of Tokens to which it is Assigned                                        | May Follow               | May Precede
--------+--------------------------------------------------------------------------------------+--------------------------+-------------------------
B_X     | Initial (begin) token of a chunk of type X                                           | E_Y, W_Y, EE_O_X, WW_O_X | M_X, E_X
M_X     | Interior (middle) token of a chunk of type X                                         | B_X, M_X                 | M_X, E_X
E_X     | Final (end) token of a chunk of type X                                               | B_X, M_X                 | B_Y, W_Y, BB_O_X, WW_O_Y
W_X     | Token by itself comprising a (whole) chunk of type X                                 | E_Y, W_Y, EE_O_X, WW_O_X | B_Y, W_Y, BB_O_X, WW_O_Y
BB_O_X  | Token not in a chunk; previous token ends a chunk of type X                          | E_X, W_X                 | MM_O, EE_O_Y
MM_O    | Token, previous token, and following token all not in a chunk                        | BB_O_Y, MM_O             | MM_O, EE_O_Y
EE_O_X  | Token and previous token not in a chunk; following token begins a chunk of type X    | BB_O_Y, MM_O             | B_X, W_X
WW_O_X  | Token not in a chunk; previous token ends a chunk, following token begins a chunk of type X | E_Y, W_Y           | B_X, W_X
The intention here is that the X tags in the last two columns (legal followers and preceders) match the tag in the first column, whereas the Y tags vary freely. Note that this produces the following number of states:
 numTags = (7 * numTypes) + 1
Not all transitions between states are legal; the ones ruled out in the table above must receive zero probability estimates. The number of legal transitions is given by:
 numTransitions = 5*numTypes^2 + 13*numTypes + 1
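For example, with three chunk types (say PER, LOC, and ORG), these formulas yield 7*3 + 1 = 22 states and 5*9 + 13*3 + 1 = 85 legal transitions. The arithmetic can be checked directly; the class name StateCounts below is ours for illustration, not part of LingPipe:

```java
public class StateCounts {

    // numTags = (7 * numTypes) + 1: seven position tags per chunk type
    // (B, M, E, W, BB_O, EE_O, WW_O) plus the single shared MM_O tag.
    static int numTags(int numTypes) {
        return 7 * numTypes + 1;
    }

    // numTransitions = 5*numTypes^2 + 13*numTypes + 1,
    // per the formula in the class documentation.
    static int numTransitions(int numTypes) {
        return 5 * numTypes * numTypes + 13 * numTypes + 1;
    }

    public static void main(String[] args) {
        System.out.println(numTags(3));        // prints 22
        System.out.println(numTransitions(3)); // prints 85
    }
}
```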

By encoding the position of a token within a chunk, the HMM can model tokens that start and end chunks, as well as those that fall in the middle of chunks or make up chunks on their own. It also models tokens that precede or follow chunks of a given type. For instance, consider the following tokenization and tagging, with an implicit W_OOS tag for the out-of-sentence (OOS) boundary:

                 (W_OOS)
 Yesterday       BB_O_OOS
 afternoon       MM_O
 ,               EE_O_PER
 John            B_PER
 J               M_PER
 .               M_PER
 Smith           E_PER
 traveled        BB_O_PER
 to              EE_O_LOC
 Washington      W_LOC
 .               WW_O_OOS
                 (W_OOS)
First note that the person chunk John J. Smith consists of four tokens: John with a begin-person tag, J and . with interior (middle) person tags, and Smith with an end-person tag. In contrast, the token Washington makes up a location chunk all by itself.

There are several flavors of tags assigned to tokens that are not part of chunks, based on the status of the surrounding tokens. First, BB_O_OOS is the tag assigned to Yesterday, because it is an out token that follows the (implicit) OOS tag; that is, it is the first out token following the out-of-sentence boundary. This allows the tag to capture the capitalization pattern of sentence-initial tokens that are not part of chunks. The interior token afternoon is simply assigned MM_O; its context does not allow it to see any surrounding chunks. At the other end of the sentence, the final period token is assigned the tag WW_O_OOS, because it precedes the (implicit) OOS (out-of-sentence) chunk. This allows some discrimination between sentence-final punctuation and other punctuation.

Next note that the token traveled is assigned to the category of first out tokens following a person chunk, whereas to is assigned to the category of final out tokens preceding a location chunk. Finally, note the tag MM_O assigned to the token afternoon, which appears between two other tokens that are not part of chunks.
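To make the encoding concrete, the following stand-alone sketch (not part of LingPipe) recovers token-level chunk spans from a tag sequence like the one above. It relies on the fact that the in-chunk tags are exactly those beginning with a single letter followed by an underscore, while all out tags begin with a doubled letter:

```java
import java.util.ArrayList;
import java.util.List;

public class TagsToChunks {

    // Converts a token-level tag sequence in the encoding above into
    // "start-end:type" chunk descriptors over token positions.
    // Assumes a well-formed tag sequence (every B_X eventually hits an E_X).
    static List<String> chunks(String[] tags) {
        List<String> result = new ArrayList<>();
        int start = -1;
        String type = null;
        for (int i = 0; i < tags.length; ++i) {
            String tag = tags[i];
            if (tag.startsWith("W_")) {          // whole chunk in one token
                result.add(i + "-" + (i + 1) + ":" + tag.substring(2));
            } else if (tag.startsWith("B_")) {   // chunk begins
                start = i;
                type = tag.substring(2);
            } else if (tag.startsWith("E_")) {   // chunk ends
                result.add(start + "-" + (i + 1) + ":" + type);
            }
            // M_X continues a chunk; BB_O_*, MM_O, EE_O_*, WW_O_* are out tags
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tags = {
            "BB_O_OOS", "MM_O", "EE_O_PER",             // Yesterday afternoon ,
            "B_PER", "M_PER", "M_PER", "E_PER",         // John J . Smith
            "BB_O_PER", "EE_O_LOC", "W_LOC", "WW_O_OOS" // traveled to Washington .
        };
        System.out.println(chunks(tags)); // prints [3-7:PER, 9-10:LOC]
    }
}
```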

If taggings of this sort are required rather than chunkings, the HMM decoder may be retrieved via getDecoder() and used along with the tokenizer factory retrieved through getTokenizerFactory() to produce taggings.

Training

The class CharLmHmmChunker may be used to train a chunker using an HMM estimator such as HmmCharLmEstimator to estimate the HMM. This estimator uses bounded character language models to estimate emission probabilities.
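A training setup might look like the following sketch. The parameter values and the trainingData collection are illustrative only; see CharLmHmmChunker and HmmCharLmEstimator for the exact constructors and handler interface.

```java
// Sketch: train a character-LM HMM chunker from reference chunkings.
TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
HmmCharLmEstimator estimator
    = new HmmCharLmEstimator(5, 256, 5.0);  // n-gram length, num chars, interpolation
CharLmHmmChunker chunker = new CharLmHmmChunker(factory, estimator);
for (Chunking chunking : trainingData)      // trainingData: reference corpus
    chunker.handle(chunking);
```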

Caching

The efficiency of the chunker may be improved (at the expense of increased memory usage) if caching is turned on for the HMM decoder. The first-best and n-best results use only log probabilities, and hence require caching only of the log probabilities. The confidence-based estimator uses only the linear estimates, and hence requires caching only of the linear probabilities.

Since:
LingPipe2.2
Version:
3.9.1
Author:
Bob Carpenter

Constructor Summary
HmmChunker(TokenizerFactory tokenizerFactory, HmmDecoder decoder)
          Construct a chunker from the specified tokenizer factory and hidden Markov model decoder.
 
Method Summary
 Chunking chunk(char[] cs, int start, int end)
          Returns a chunking of the specified character slice.
 Chunking chunk(CharSequence cSeq)
          Returns a chunking of the specified character sequence.
 HmmDecoder getDecoder()
          Returns the underlying hidden Markov model decoder for this chunker.
 TokenizerFactory getTokenizerFactory()
          Returns the underlying tokenizer factory for this chunker.
 Iterator<ScoredObject<Chunking>> nBest(char[] cs, int start, int end, int maxNBest)
          Returns a size-bounded iterator over scored objects with joint probability estimates of tags and tokens as scores and chunkings as objects.
 Iterator<Chunk> nBestChunks(char[] cs, int start, int end, int maxNBest)
          Returns an iterator over scored objects with conditional probability estimates for scores and chunks as objects.
 Iterator<ScoredObject<Chunking>> nBestConditional(char[] cs, int start, int end, int maxNBest)
          Returns a size-bounded iterator over scored objects with conditional probability estimates of tags and tokens as scores and chunkings as objects.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HmmChunker

public HmmChunker(TokenizerFactory tokenizerFactory,
                  HmmDecoder decoder)
Construct a chunker from the specified tokenizer factory and hidden Markov model decoder. The decoder should operate over a tag set as specified in the class documentation above. Once constructed, the tokenizer factory and decoder may not be changed.

See the note in the class documentation concerning caching in the decoder. A typical application will configure the cache of the decoder before creating an HMM chunker. See the class documentation for HmmDecoder, as well as the method documentation for HmmDecoder.setEmissionCache(Map) and HmmDecoder.setEmissionLog2Cache(Map) for more information.
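A typical configuration might look like the following sketch. The FastCache capacity is illustrative, and any thread-safe Map implementation accepted by the setters will do; the hmm and tokenizerFactory variables are assumed to be in scope.

```java
// Sketch: configure emission caches, then construct the chunker.
HmmDecoder decoder = new HmmDecoder(hmm);  // hmm: a trained HiddenMarkovModel
decoder.setEmissionCache(new FastCache<String,double[]>(1000000));
decoder.setEmissionLog2Cache(new FastCache<String,double[]>(1000000));
HmmChunker chunker = new HmmChunker(tokenizerFactory, decoder);
```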

Parameters:
tokenizerFactory - Tokenizer factory for tokenization.
decoder - Hidden Markov model decoder.
Method Detail

getDecoder

public HmmDecoder getDecoder()
Returns the underlying hidden Markov model decoder for this chunker. This is the actual decoder used by this chunker, so any changes to it will affect this chunker.

The decoder provides access to the underlying hidden Markov model for this chunker.

Returns:
The decoder for this chunker.

getTokenizerFactory

public TokenizerFactory getTokenizerFactory()
Returns the underlying tokenizer factory for this chunker.

Returns:
The tokenizer factory for this chunker.

chunk

public Chunking chunk(char[] cs,
                      int start,
                      int end)
Returns a chunking of the specified character slice. This is the chunking determined by the first-best hidden Markov model tags given the tokenization performed by the underlying factory. More information about the underlying HMM is provided in the class documentation above.

Specified by:
chunk in interface Chunker
Parameters:
cs - Array of characters.
start - Index of first character.
end - Index of one past last character.
Returns:
First-best chunking of the specified character slice.
Throws:
IndexOutOfBoundsException - If the specified indices are out of bounds of the specified character array.

chunk

public Chunking chunk(CharSequence cSeq)
Returns a chunking of the specified character sequence. This is the chunking determined by the first-best hidden Markov model tags given the tokenization performed by the underlying factory. More information about the underlying HMM is provided in the class documentation above.

Specified by:
chunk in interface Chunker
Parameters:
cSeq - Character sequence to chunk.
Returns:
First-best chunking of the specified character sequence.

nBest

public Iterator<ScoredObject<Chunking>> nBest(char[] cs,
                                              int start,
                                              int end,
                                              int maxNBest)
Returns a size-bounded iterator over scored objects with joint probability estimates of tags and tokens as scores and chunkings as objects. The maximum number of chunkings returned is specified by the last argument. This limit should be set as low as possible to reduce the memory requirements of n-best search.

Specified by:
nBest in interface NBestChunker
Parameters:
cs - Array of characters.
start - Index of first character.
end - Index of one past last character.
maxNBest - Maximum number of results to return.
Returns:
Iterator over scored objects containing chunkings and joint probability estimates.
Throws:
IndexOutOfBoundsException - If the specified indices are out of bounds of the specified character array.
IllegalArgumentException - If the maximum n-best value is not greater than zero.

nBestConditional

public Iterator<ScoredObject<Chunking>> nBestConditional(char[] cs,
                                                         int start,
                                                         int end,
                                                         int maxNBest)
Returns a size-bounded iterator over scored objects with conditional probability estimates of tags and tokens as scores and chunkings as objects. The maximum number of chunkings returned is specified by the last argument. This limit should be set as low as possible to reduce the memory requirements of n-best search.

Parameters:
cs - Array of characters.
start - Index of first character.
end - Index of one past last character.
maxNBest - Maximum number of results to return.
Returns:
Iterator over scored objects containing chunkings and conditional probability estimates.
Throws:
IndexOutOfBoundsException - If the specified indices are out of bounds of the specified character array.
IllegalArgumentException - If the maximum n-best value is not greater than zero.

nBestChunks

public Iterator<Chunk> nBestChunks(char[] cs,
                                   int start,
                                   int end,
                                   int maxNBest)
Returns an iterator over scored objects with conditional probability estimates for scores and chunks as objects. This returns the possible chunks in confidence-ranked order, with scores being the conditional probability of the chunk given the input. After the results are exhausted, or after the specified maximum number of chunks has been returned, the iterator's hasNext() method returns false.

Specified by:
nBestChunks in interface ConfidenceChunker
Parameters:
cs - Array of characters.
start - Index of first character.
end - Index of one past last character.
maxNBest - Maximum number of chunks returned.
Returns:
Iterator over chunks with conditional probability estimates as scores.
Throws:
IndexOutOfBoundsException - If the specified indices are out of bounds of the specified character array.