com.aliasi.chunk
Class CharLmRescoringChunker

java.lang.Object
  extended by com.aliasi.chunk.RescoringChunker<CharLmHmmChunker>
      extended by com.aliasi.chunk.AbstractCharLmRescoringChunker<CharLmHmmChunker,NGramProcessLM,NGramBoundaryLM>
          extended by com.aliasi.chunk.CharLmRescoringChunker
All Implemented Interfaces:
Chunker, ConfidenceChunker, NBestChunker, Handler, ObjectHandler<Chunking>, TagHandler, Compilable

public class CharLmRescoringChunker
extends AbstractCharLmRescoringChunker<CharLmHmmChunker,NGramProcessLM,NGramBoundaryLM>
implements TagHandler, ObjectHandler<Chunking>, Compilable

A CharLmRescoringChunker provides a long-distance character language model-based chunker that operates by rescoring the output of a contained character language model HMM chunker.

The Underlying Chunker

This model performs rescoring over an underlying chunker. The underlying chunker is an instance of CharLmHmmChunker, configured with the tokenizer factory, n-gram length, number of characters, and interpolation ratio provided in the constructor. The underlying chunker may be configured after retrieving it through the superclass method RescoringChunker.baseChunker(); the typical use of this is to configure caching.
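
For example, the following sketch retrieves the base chunker and installs emission caches on its decoder, assuming the FastCache class and the decoder cache setters described in HmmChunker's class documentation; the cache sizes are illustrative:

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.chunk.CharLmRescoringChunker;
import com.aliasi.hmm.HmmDecoder;
import com.aliasi.util.FastCache;

// Configure emission caching on the underlying chunker's decoder.
static void configureCaching(CharLmRescoringChunker chunker) {
    CharLmHmmChunker base = chunker.baseChunker();
    HmmDecoder decoder = base.getDecoder();
    decoder.setEmissionCache(new FastCache<String,double[]>(100000));
    decoder.setEmissionLog2Cache(new FastCache<String,double[]>(100000));
}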

The Rescoring Model

The rescoring model used by this chunker is based on a bounded character language model per chunk type with an additional process character language model for text not in chunks. The remaining details are described in the class documentation for the superclass AbstractCharLmRescoringChunker.

Training and Compilation

This chunker is trained in the usual way through calls to the appropriate handle() method. The method handle(Chunking) implements the ObjectHandler<Chunking> interface and allows training through chunking examples. The method handle(String[],String[],String[]) implements the TagHandler interface, allowing training through BIO-encoded chunk taggings. A model is compiled by calling the Compilable interface method compileTo(ObjectOutput). The compiled model is an instance of AbstractCharLmRescoringChunker, from which the underlying chunker may be recovered.
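
The following sketch shows the overall train-and-compile life cycle; the corpus variable is a hypothetical source of chunkings, and the static utility AbstractExternalizable.compile(Compilable) is used as a convenience to round-trip the model in memory:

import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;
import java.io.IOException;

// Train on a corpus of chunkings, then compile to a run-time chunker.
static Chunker trainAndCompile(CharLmRescoringChunker chunker,
                               Iterable<Chunking> corpus)
        throws IOException, ClassNotFoundException {
    for (Chunking chunking : corpus)
        chunker.handle(chunking);  // ObjectHandler<Chunking> training
    // compile() writes via compileTo(ObjectOutput) and reads the model back
    return (Chunker) AbstractExternalizable.compile(chunker);
}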

Runtime Configuration

The underlying chunker is recoverable as a character language model HMM chunker through RescoringChunker.baseChunker(). The non-chunk process n-gram character language model is returned by AbstractCharLmRescoringChunker.outLM(), whereas the chunk models are returned by AbstractCharLmRescoringChunker.chunkLM(String).

The components of a character LM rescoring chunker are accessible in their training format through methods on this class, as described above.

The compiled models are instances of RescoringChunker, which allow their underlying chunker to be retrieved through RescoringChunker.baseChunker() and then configured. The other run-time models may be retrieved through the superclass methods AbstractCharLmRescoringChunker.outLM() and AbstractCharLmRescoringChunker.chunkLM(String).
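
For example, assuming a chunk type "PERSON" was seen during training, the run-time language models may be inspected as follows:

import com.aliasi.lm.NGramBoundaryLM;
import com.aliasi.lm.NGramProcessLM;

// Inspect the run-time language models on a trainable chunker.
static void inspectModels(CharLmRescoringChunker chunker) {
    NGramProcessLM outLM = chunker.outLM();               // non-chunk text model
    NGramBoundaryLM personLM = chunker.chunkLM("PERSON"); // per-type chunk model
}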

Reserved Tag

The tag BOS is reserved for use by the system for encoding document start/end positions. See HmmChunker for more information.

Since:
LingPipe 2.3
Version:
3.9
Author:
Bob Carpenter

Constructor Summary
CharLmRescoringChunker(TokenizerFactory tokenizerFactory, int numChunkingsRescored, int nGram, int numChars, double interpolationRatio)
          Construct a character language model rescoring chunker based on the specified components.
CharLmRescoringChunker(TokenizerFactory tokenizerFactory, int numChunkingsRescored, int nGram, int numChars, double interpolationRatio, boolean smoothTags)
          Construct a character language model rescoring chunker based on the specified components.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Compiles this model to the specified object output stream.
 void handle(Chunking chunking)
          Trains this chunker with the specified chunking.
 void handle(String[] toks, String[] whitespaces, String[] tags)
          Deprecated. Use handle(Chunking) instead.
 void trainDictionary(CharSequence cSeq, String type)
          Provides the specified character sequence data as training data for the language model of the specified type.
 void trainOut(CharSequence cSeq)
          Trains the language model for non-entities using the specified character sequence.
 
Methods inherited from class com.aliasi.chunk.AbstractCharLmRescoringChunker
chunkLM, outLM, rescore, typeToChar
 
Methods inherited from class com.aliasi.chunk.RescoringChunker
baseChunker, chunk, chunk, nBest, nBestChunks, numChunkingsRescored, setNumChunkingsRescored
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CharLmRescoringChunker

public CharLmRescoringChunker(TokenizerFactory tokenizerFactory,
                              int numChunkingsRescored,
                              int nGram,
                              int numChars,
                              double interpolationRatio)
Construct a character language model rescoring chunker based on the specified components. Tags in the underlying model are not smoothed by default (see the full constructor's documentation: CharLmRescoringChunker(TokenizerFactory,int,int,int,double,boolean)).

Parameters:
tokenizerFactory - Tokenizer factory for boundaries.
numChunkingsRescored - Number of underlying chunkings rescored.
nGram - N-gram length for all models.
numChars - Number of characters in the training and run-time character sets.
interpolationRatio - Interpolation ratio for the underlying language models.

CharLmRescoringChunker

public CharLmRescoringChunker(TokenizerFactory tokenizerFactory,
                              int numChunkingsRescored,
                              int nGram,
                              int numChars,
                              double interpolationRatio,
                              boolean smoothTags)
Construct a character language model rescoring chunker based on the specified components.

Whether tags are smoothed in the underlying model is determined by the smoothTags flag. See the class documentation for CharLmHmmChunker for more information on the effects of smoothing.

Parameters:
tokenizerFactory - Tokenizer factory for boundaries.
numChunkingsRescored - Number of underlying chunkings rescored.
nGram - N-gram length for all models.
numChars - Number of characters in the training and run-time character sets.
interpolationRatio - Interpolation ratio for the underlying language models.
smoothTags - Set to true to smooth tags in underlying chunker.
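
A minimal construction sketch; all parameter values are illustrative, and IndoEuropeanTokenizerFactory is used here only as an example of a compilable tokenizer factory:

import com.aliasi.chunk.CharLmRescoringChunker;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

static CharLmRescoringChunker newChunker() {
    TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;
    return new CharLmRescoringChunker(tf,
                                      64,    // numChunkingsRescored
                                      5,     // nGram
                                      128,   // numChars
                                      5.0,   // interpolationRatio
                                      true); // smoothTags
}
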
Method Detail

handle

public void handle(Chunking chunking)
Trains this chunker with the specified chunking.

Specified by:
handle in interface ObjectHandler<Chunking>
Parameters:
chunking - Training data.
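
For example, a training chunking may be assembled with ChunkingImpl and ChunkFactory; the text and the entity type "PERSON" are illustrative:

import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.ChunkingImpl;
import com.aliasi.chunk.CharLmRescoringChunker;

static void trainOne(CharLmRescoringChunker chunker) {
    // "John Smith" spans character offsets 0 to 10 of the text.
    ChunkingImpl chunking = new ChunkingImpl("John Smith ran.");
    chunking.add(ChunkFactory.createChunk(0, 10, "PERSON"));
    chunker.handle(chunking);
}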

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Compiles this model to the specified object output stream. The model may then be read back in using ObjectInput.readObject(); the resulting object will be an instance of AbstractCharLmRescoringChunker.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this object is compiled.
Throws:
IOException - If there is an I/O error during the write.
IllegalArgumentException - If the tokenizer factory supplied to the constructor of this class is not compilable.
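
A compile-and-reload sketch using standard object streams; the file is hypothetical:

import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.CharLmRescoringChunker;
import java.io.*;

static Chunker writeAndRead(CharLmRescoringChunker chunker, File file)
        throws IOException, ClassNotFoundException {
    ObjectOutputStream objOut =
        new ObjectOutputStream(new FileOutputStream(file));
    chunker.compileTo(objOut);
    objOut.close();
    ObjectInputStream objIn =
        new ObjectInputStream(new FileInputStream(file));
    Chunker compiled = (Chunker) objIn.readObject();
    objIn.close();
    return compiled;
}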

handle

@Deprecated
public void handle(String[] toks,
                   String[] whitespaces,
                   String[] tags)
Deprecated. Use handle(Chunking) instead.

Trains this chunker with the specified BIO-encoded chunk tagging. For information on the external BIO format, as well as the internal tagging format, see HmmChunker.

Specified by:
handle in interface TagHandler
Parameters:
toks - Tokens of training data.
whitespaces - Whitespaces in training data.
tags - Tags for training data.
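
A sketch of the parallel-array form, assuming the conventional BIO tag labels and the whitespace-array convention (one more element than tokens) described in HmmChunker; consult that class's documentation for the exact formats:

import com.aliasi.chunk.CharLmRescoringChunker;

static void trainTagging(CharLmRescoringChunker chunker) {
    String[] toks = { "John", "Smith", "ran", "." };
    String[] whitespaces = { "", " ", " ", "", "" }; // assumed n+1 entries
    String[] tags = { "B-PERSON", "I-PERSON", "O", "O" };
    chunker.handle(toks, whitespaces, tags);
}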

trainDictionary

public void trainDictionary(CharSequence cSeq,
                            String type)
Provides the specified character sequence data as training data for the language model of the specified type. This method calls the method of the same signature on the trainable base chunker. The language model for the specified type will be created if it has not been seen previously.

Warning: It is not sufficient to train a model using only this method. Annotated data with a representative balance of entities and non-entity text is required to train the overall likelihood of entities and the contexts in which they occur. Use of this method will not bias the likelihood of entities occurring, but it may cause the common entities in the training data to be overwhelmed if a large dictionary is used. One possibility is to train the basic data multiple times relative to the dictionary (or vice-versa).

Parameters:
cSeq - Character sequence for training.
type - Type of character sequence.
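
For example, to seed the model for a hypothetical "PERSON" type with dictionary entries (which, per the warning above, should supplement rather than replace annotated chunkings):

import com.aliasi.chunk.CharLmRescoringChunker;

static void seedDictionary(CharLmRescoringChunker chunker) {
    String[] names = { "John Smith", "Jane Doe" };  // illustrative entries
    for (String name : names)
        chunker.trainDictionary(name, "PERSON");
}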

trainOut

public void trainOut(CharSequence cSeq)
Trains the language model for non-entities using the specified character sequence.

Warning: Training using this method biases the likelihood of entities downward, because it does not train the likelihood of a non-entity character sequence ending and being followed by an entity of a specified type. Thus this method is best used to seed a dictionary of common words that are relatively few in number relative to the entity-annotated training data.

Parameters:
cSeq - Data to train the non-entity (out) model.
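
A sketch of seeding the non-entity model with a few common words, per the warning above:

import com.aliasi.chunk.CharLmRescoringChunker;

static void seedOutModel(CharLmRescoringChunker chunker) {
    String[] commonWords = { "the", "of", "and" };  // illustrative entries
    for (String word : commonWords)
        chunker.trainOut(word);
}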