com.aliasi.chunk
Class CharLmHmmChunker

java.lang.Object
  extended by com.aliasi.chunk.HmmChunker
      extended by com.aliasi.chunk.CharLmHmmChunker
All Implemented Interfaces:
Chunker, ConfidenceChunker, NBestChunker, Handler, ObjectHandler<Chunking>, TagHandler, Compilable

public class CharLmHmmChunker
extends HmmChunker
implements TagHandler, ObjectHandler<Chunking>, Compilable

A CharLmHmmChunker uses a hidden Markov model estimator and a tokenizer factory to learn a chunker. The estimator used is an instance of AbstractHmmEstimator, which performs the underlying HMM estimation. The tokenizer factory breaks chunks down into sequences of tokens and tags.

Training

This class implements the ObjectHandler<Chunking> and TagHandler interfaces, either of which may be used to supply training instances. Every training event is used to train the underlying HMM. Training instances are supplied through the chunk handler in the usual way.
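As a minimal sketch of training through the chunking handler (the estimator parameters, sentences, and chunk types below are illustrative choices, not prescribed by this class):

```java
import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.ChunkingImpl;
import com.aliasi.hmm.AbstractHmmEstimator;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

public class TrainChunkerSketch {
    public static void main(String[] args) {
        TokenizerFactory tokFactory = IndoEuropeanTokenizerFactory.INSTANCE;
        // Illustrative character LM parameters: n-gram length, number of
        // characters, and interpolation parameter.
        AbstractHmmEstimator est = new HmmCharLmEstimator(5, 256, 5.0);
        CharLmHmmChunker chunker = new CharLmHmmChunker(tokFactory, est, true);

        // One training chunking; character offsets delimit the chunks.
        ChunkingImpl chunking = new ChunkingImpl("John flew to Washington.");
        chunking.add(ChunkFactory.createChunk(0, 4, "PER"));    // "John"
        chunking.add(ChunkFactory.createChunk(13, 23, "LOC"));  // "Washington"
        chunker.handle(chunking);

        // A chunk-free sentence trains the out (O) tags in context.
        chunker.handle(new ChunkingImpl("The meeting ended early."));
    }
}
```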

Training instances for the tag handler require the standard BIO tagging scheme in which the first token in a chunk of type X is tagged B-X ("begin"), with all subsequent tokens in the same chunk tagged I-X ("in"). All tokens not in chunks are tagged O. For example, the tags required for training are:

 Yesterday       O
 afternoon       O
 ,               O
 John            B-PER
 J               I-PER
 .               I-PER
 Smith           I-PER
 traveled        O
 to              O
 Washington      O
 .               O
This is the same tagging scheme supplied in several corpora (Penn BioIE, CoNLL, etc.). Note that this is not the same tag scheme used for the underlying HMM. The simpler tag scheme shown above is first converted to the more fine-grained tag scheme described in the class documentation for HmmChunker.
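The chunk-internal part of that conversion can be illustrated with a small, self-contained sketch (a hypothetical helper, not part of LingPipe): BIO runs map to begin/mid/end tags, with a separate whole-chunk tag for single-token chunks. The real HmmChunker scheme is finer-grained still, also splitting the out tag O by context, which this sketch omits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the chunk-internal BIO-to-fine-tag mapping:
// a chunk's first token gets B_X, internal tokens get M_X, the last token
// gets E_X, and a single-token chunk gets W_X; out tokens stay O.
public class BioToFineTags {
    public static List<String> convert(List<String> bioTags) {
        List<String> fine = new ArrayList<>();
        for (int i = 0; i < bioTags.size(); ++i) {
            String tag = bioTags.get(i);
            if (tag.equals("O")) {
                fine.add("O");
                continue;
            }
            String type = tag.substring(2);
            boolean first = tag.startsWith("B-");
            String next = (i + 1 < bioTags.size()) ? bioTags.get(i + 1) : "O";
            boolean last = !next.equals("I-" + type);
            if (first && last)
                fine.add("W_" + type);   // single-token chunk
            else if (first)
                fine.add("B_" + type);   // chunk start
            else if (last)
                fine.add("E_" + type);   // chunk end
            else
                fine.add("M_" + type);   // chunk-internal
        }
        return fine;
    }
}
```

On the example above, the run B-PER, I-PER, I-PER, I-PER over "John", "J", ".", "Smith" becomes B_PER, M_PER, M_PER, E_PER.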

Training with a Dictionary

This chunker may be trained with dictionary entries through the method trainDictionary(CharSequence cSeq, String type). Calling this method trains the emission probabilities for the relevant tags determined by tokenizing the specified character sequence (after conversion to the underlying tag scheme defined in HmmChunker).

Warning: It is not enough to train with a dictionary alone. Dictionaries do not train the contexts in which elements show up. Ordinary training data must also be supplied, and that data must contain some elements which are not part of chunks in order to train the out tags. If only a dictionary is used for training, null pointer exceptions will show up at run time.

For example, calling

 charLmHmmChunker.trainDictionary("Washington", "LOCATION");
would provide the token "Washington" as a training case for emission from the tag W_LOCATION; the 'W_' prefix appears because trainDictionary uses the richer tag set of HmmChunker. Alternatively, calling:
 charLmHmmChunker.trainDictionary("John J. Smith", "PERSON");
would train the tag B_PERSON with the token "John", the tag M_PERSON with the tokens "J" and ".", and the tag E_PERSON with the token "Smith". Furthermore, in this case, the transition probabilities receive training instances for the three transitions: B_PERSON to M_PERSON, M_PERSON to M_PERSON, and finally, M_PERSON to E_PERSON.

Note that there is no method to train non-chunk tokens, because the categories assigned to them are context-specific, being determined by the surrounding tokens. An effective way to train out categories in general is to supply them as part of entire sentences that have no chunks in them. Note that this only trains the begin-sentence, end-sentence and internal tags for non-chunked tokens.

To be useful, the dictionary entries must match the chunks that should be found. For instance, the MUC training data contains many instances of USAir, the name of a United States airline. It might be thought that stock listings would help the extraction of company names, but in fact the company is "officially" known as USAirways Group.

It is also important that training with dictionaries not be done with huge, diffuse dictionaries that wind up smoothing the language models too much. For example, training just locations with a 2 million location gazetteer, once per entry, will leave obscure locations with estimates close to those of New York or Beijing.

Tag Smoothing

The constructor CharLmHmmChunker(TokenizerFactory,AbstractHmmEstimator,boolean) accepts a flag that determines whether to smooth tag transition probabilities. If the flag is set to true in the constructor, every time a new symbol is seen in the training data, all of its relevant underlying tags are added to the symbol table and all legal transitions among them and all other tags are incremented by one.

If smoothing is turned off, only tag-tag transitions seen in the training data are allowed.

The begin-sentence and end-sentence tags are automatically added in the constructor, so that if no training data is provided, a chunking with no chunks is returned. This smoothing may not be turned off. Thus there will always be a non-zero probability in the underlying HMM of starting with the tag BB_O_BOS or WW_O_BOS, and of ending with the tag EE_O_BOS or WW_O_BOS. There will also always be a non-zero probability of transitioning from BB_O_BOS to MM_O and to EE_O_BOS, and from MM_O to MM_O and to EE_O_BOS.

Compilation

This class implements the Compilable interface. To compile a static model from the current state of training, call the method compileTo(ObjectOutput). The result of reading an object from the corresponding object input stream will produce a compiled HMM chunker of class HmmChunker, with the same estimates as the current state of the chunker being compiled.
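As a sketch, the utility method com.aliasi.util.AbstractExternalizable.compile(Compilable) performs the write/read round trip in memory; assuming chunker is a trained CharLmHmmChunker whose tokenizer factory is compilable:

```java
import com.aliasi.chunk.HmmChunker;
import com.aliasi.util.AbstractExternalizable;

// compile(...) serializes the Compilable to an in-memory stream and reads
// it back, returning the static compiled model as a plain HmmChunker.
HmmChunker compiledChunker
    = (HmmChunker) AbstractExternalizable.compile(chunker);
```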

Caching

Caching is turned off by default on the HMM decoder for this class. If caching is turned on for an instance of this class (through the decoder returned by HmmChunker.getDecoder(), inherited from HmmChunker), then subsequent training instances will not be reflected in cached estimates, and the results may be inconsistent and may lead to exceptions. Caching may be turned on once no further training instances will be supplied, but in that case it is almost always more efficient to compile the model and turn on caching for the compiled version.

After compilation, the returned chunker will have caching turned off by default. To turn on caching for the compiled model, which is highly recommended for efficiency, retrieve the HMM decoder and set its cache. For instance, to set up caching for both log estimates and linear estimates, use the code:

 ObjectInput objIn = ...;
 HmmChunker chunker = (HmmChunker) objIn.readObject();
 HmmDecoder decoder = chunker.getDecoder();
 decoder.setEmissionCache(new FastCache(1000000));      // linear estimates
 decoder.setEmissionLog2Cache(new FastCache(1000000));  // log (base 2) estimates
 

Reserved Tag

The tag BOS is reserved for use by the system for encoding document start/end positions. See HmmChunker for more information.

Since:
LingPipe2.2
Version:
3.9.1
Author:
Bob Carpenter

Constructor Summary
CharLmHmmChunker(TokenizerFactory tokenizerFactory, AbstractHmmEstimator hmmEstimator)
          Construct a CharLmHmmChunker from the specified tokenizer factory and hidden Markov model estimator.
CharLmHmmChunker(TokenizerFactory tokenizerFactory, AbstractHmmEstimator hmmEstimator, boolean smoothTags)
          Construct a CharLmHmmChunker from the specified tokenizer factory, HMM estimator and tag-smoothing flag.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Compiles this model to the specified object output stream.
 AbstractHmmEstimator getHmmEstimator()
          Returns the underlying hidden Markov model estimator for this chunker estimator.
 TokenizerFactory getTokenizerFactory()
          Return the tokenizer factory for this chunker.
 void handle(Chunking chunking)
          Handle the specified chunking by tokenizing it, assigning tags and training the underlying hidden Markov model.
 void handle(String[] tokens, String[] whitespaces, String[] tags)
          Deprecated. Use handle(Chunking) instead.
 String toString()
          Returns a string representation of the complete topology of the underlying HMM with log2 transition probabilities.
 void trainDictionary(CharSequence cSeq, String type)
          Train the underlying hidden Markov model based on the specified character sequence being of the specified type.
 
Methods inherited from class com.aliasi.chunk.HmmChunker
chunk, chunk, getDecoder, nBest, nBestChunks, nBestConditional
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

CharLmHmmChunker

public CharLmHmmChunker(TokenizerFactory tokenizerFactory,
                        AbstractHmmEstimator hmmEstimator)
Construct a CharLmHmmChunker from the specified tokenizer factory and hidden Markov model estimator. Smoothing is turned off by default. See CharLmHmmChunker(TokenizerFactory,AbstractHmmEstimator,boolean) for more information.

Parameters:
tokenizerFactory - Tokenizer factory to tokenize chunks.
hmmEstimator - Underlying HMM estimator.

CharLmHmmChunker

public CharLmHmmChunker(TokenizerFactory tokenizerFactory,
                        AbstractHmmEstimator hmmEstimator,
                        boolean smoothTags)
Construct a CharLmHmmChunker from the specified tokenizer factory, HMM estimator and tag-smoothing flag.

If smoothing is turned on, then every time a new entity type is seen in the training data, all possible underlying tags involving that type are added to the symbol table, and every legal transition among these tags and all other tags is incremented by a count of one.

The tokenizer factory must be compilable in order for the model to be compiled. If it is not compilable, then attempting to compile the model will raise an exception.

Parameters:
tokenizerFactory - Tokenizer factory to tokenize chunks.
hmmEstimator - Underlying HMM estimator.
smoothTags - Set to true for tag smoothing.
Method Detail

getHmmEstimator

public AbstractHmmEstimator getHmmEstimator()
Returns the underlying hidden Markov model estimator for this chunker estimator. This is the actual estimator used by this class, so changes to it will affect this class's chunk estimates.

Returns:
The underlying HMM estimator.

getTokenizerFactory

public TokenizerFactory getTokenizerFactory()
Return the tokenizer factory for this chunker.

Overrides:
getTokenizerFactory in class HmmChunker
Returns:
The tokenizer factory for this chunker.

trainDictionary

public void trainDictionary(CharSequence cSeq,
                            String type)
Train the underlying hidden Markov model based on the specified character sequence being of the specified type. As described in the class documentation above, this only trains the emission probabilities and internal transitions for the character sequence, based on the underlying tokenizer factory.

Warning: Chunkers cannot be trained with a dictionary alone. They require regular training data in order to train the contexts in which dictionary items show up. Attempting to train with only a dictionary will lead to null pointer exceptions at decoding time.

Parameters:
cSeq - Character sequence on which to train.
type - Type of chunk.

handle

public void handle(Chunking chunking)
Handle the specified chunking by tokenizing it, assigning tags and training the underlying hidden Markov model. For a description of how chunkings are broken down into taggings, see the parent class documentation in HmmChunker.

Specified by:
handle in interface ObjectHandler<Chunking>
Parameters:
chunking - Chunking to use for training.

handle

@Deprecated
public void handle(String[] tokens,
                   String[] whitespaces,
                   String[] tags)
Deprecated. Use handle(Chunking) instead.

Handle the specified tokens, whitespaces and tags by using them (after conversion) to train the underlying hidden Markov model. The description of tag format is given in the class documentation above; this format is converted into the underlying format used by the underlying HMM as described in HmmChunker.

Specified by:
handle in interface TagHandler
Parameters:
tokens - Array of tokens.
whitespaces - Array of whitespaces; unused and may be null.
tags - Array of tags in format described in class documentation.
Throws:
IllegalArgumentException - If the token and tag arrays are not the same length, or if the whitespaces array is non-null and not one longer than the array of tokens.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Compiles this model to the specified object output stream. The model may then be read back in using ObjectInput.readObject(); the resulting object will be an instance of HmmChunker. See the class documentation above for information on setting the cache for a compiled model.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this object is compiled.
Throws:
IOException - If there is an I/O error during the write.
IllegalArgumentException - If the tokenizer factory supplied to the constructor of this class is not compilable.

toString

public String toString()
Returns a string representation of the complete topology of the underlying HMM with log2 transition probabilities. Note that this output does not represent the emission probabilities per category.

Overrides:
toString in class Object
Returns:
String-based representation of this chunker.