com.aliasi.chunk
Class AbstractCharLmRescoringChunker<B extends NBestChunker,O extends LanguageModel.Process,C extends LanguageModel.Sequence>

java.lang.Object
  extended by com.aliasi.chunk.RescoringChunker<B>
      extended by com.aliasi.chunk.AbstractCharLmRescoringChunker<B,O,C>
Type Parameters:
B - the type of the underlying n-best chunker being rescored
O - the type of the process language model for non-entities
C - the type of the sequence language model for entities
All Implemented Interfaces:
Chunker, ConfidenceChunker, NBestChunker
Direct Known Subclasses:
CharLmRescoringChunker

public class AbstractCharLmRescoringChunker<B extends NBestChunker,O extends LanguageModel.Process,C extends LanguageModel.Sequence>
extends RescoringChunker<B>

An AbstractCharLmRescoringChunker provides the basic character language-model rescoring model used by the trainable CharLmRescoringChunker and its compiled version.

Rescoring Model

The per-type language models simply model expressions of that type, both within and across tokens. The non-chunk model is responsible not only for modeling the text outside of chunks, but also for predicting the type of the next chunk given the non-chunk text that precedes it.

The exact model used is most easily described through an example. Consider the sentence John J. Smith lives in Washington. with John J. Smith as a person-type chunk and Washington as a location-type chunk. The probability of this analysis is the product of factors for alternating non-chunk and chunk spans, starting and ending with non-chunk spans:

 P_OUT(C_PER | C_BOS)
 * P_PER(John J. Smith)
 * P_OUT( lives in C_LOC | C_PER)
 * P_LOC(Washington)
 * P_OUT(. C_EOS | C_LOC)
 
Note that the chunk models P_PER and P_LOC are bounded models: they predict the first letter given that it is the first letter, and they also encode an end-of-string probability to model the end. See NGramBoundaryLM for more information on bounded models.

The non-chunk P_OUT model is a process language model, but uses distinguished characters in much the same way as the bounded models do internally. In particular, there is a distinguished character for each type (e.g. C_PER), and for the begin-of-sentence and end-of-sentence markers (e.g. C_BOS). These must be chosen so as not to conflict with any input characters in training or decoding. With this encoding, the non-chunk model bears the brunt of the burden in predicting types. To start, it conditions the text it generates on the previous type, encoded as a character. To end, it generates the next chunk type, also encoded as a character. This allows the models to be sensitive to the fact that phrases like lives in (including the spaces on either side) are conditioned on following a person. The following chunk type, location, is generated conditional on following C_PER lives in. The only constraints on the length of these dependencies are the length of the n-gram models and the size of the chunk/non-chunk spans.
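The requirement that the distinguished characters never collide with input characters can be checked mechanically. The following is a minimal sketch, not part of LingPipe: the class name ReservedChars, the method isSafe, and the choice of Unicode private-use code points in the usage note are all hypothetical.

```java
// Illustrative helper only; not part of the LingPipe API.
final class ReservedChars {

    private ReservedChars() { }

    /**
     * Returns true if none of the reserved characters occurs in the text,
     * so that the text can be encoded without ambiguity.
     */
    public static boolean isSafe(CharSequence text, char... reserved) {
        for (int i = 0; i < text.length(); ++i) {
            char c = text.charAt(i);
            for (char r : reserved) {
                if (c == r) {
                    return false; // input would be confused with a type or boundary marker
                }
            }
        }
        return true;
    }
}
```

For instance, if the markers were drawn from the private-use area starting at U+E000 (an assumption made here for illustration), a caller could reject or clean any training or decoding text for which isSafe returns false.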

The resulting model generates a properly normalized probability distribution over chunkings.
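The factorization in the example can be sketched in code. The class below is a self-contained illustration, not LingPipe's implementation: RescoringFactors, Span, Factor, and the marker characters are hypothetical names and values, and chunks are assumed to be sorted and non-overlapping. It only lists which model generates which string; scoring each factor with an actual language model is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not part of the LingPipe API.
final class RescoringFactors {

    // Hypothetical boundary markers taken from the Unicode private-use area.
    public static final char C_BOS = '\uE000';
    public static final char C_EOS = '\uE001';

    /** A chunk covering text positions [start, end) with a type and its encoding character. */
    public static final class Span {
        public final int start;
        public final int end;
        public final String type;
        public final char typeChar;
        public Span(int start, int end, String type, char typeChar) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.typeChar = typeChar;
        }
    }

    /** One factor of the product: the model name, the string it generates, and its context. */
    public static final class Factor {
        public final String model;
        public final String generated;
        public final String context; // empty for the bounded chunk models
        public Factor(String model, String generated, String context) {
            this.model = model;
            this.generated = generated;
            this.context = context;
        }
    }

    private RescoringFactors() { }

    /** Lists the alternating non-chunk/chunk factors for a chunking of the text. */
    public static List<Factor> factors(String text, List<Span> chunks) {
        List<Factor> result = new ArrayList<>();
        char previous = C_BOS;
        int pos = 0;
        for (Span chunk : chunks) {
            // Non-chunk model: text before the chunk plus the next type's character,
            // conditioned on the character for the previous type (or begin marker).
            result.add(new Factor("OUT",
                                  text.substring(pos, chunk.start) + chunk.typeChar,
                                  String.valueOf(previous)));
            // Per-type bounded model: the chunk text itself.
            result.add(new Factor(chunk.type, text.substring(chunk.start, chunk.end), ""));
            previous = chunk.typeChar;
            pos = chunk.end;
        }
        // Final non-chunk span, ending with the end marker.
        result.add(new Factor("OUT", text.substring(pos) + C_EOS, String.valueOf(previous)));
        return result;
    }
}
```

Applied to the example sentence with a person chunk and a location chunk, this yields five factors in the same order as the product shown above. Summing the log estimates of those factors under the corresponding models would give the log probability of the chunking.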

Reserved Tag

The tag BOS is reserved for use by the system for encoding document start/end positions. See HmmChunker for more information.

Since:
LingPipe 2.3
Version:
3.0
Author:
Bob Carpenter

Constructor Summary
AbstractCharLmRescoringChunker(B baseNBestChunker, int numChunkingsRescored, O outLM, Map<String,Character> typeToChar, Map<String,C> typeToLM)
          Construct a rescoring chunker based on the specified underlying chunker, with the specified number of underlying chunkings rescored, based on the models and type encodings provided in the last three arguments.
 
Method Summary
 C chunkLM(String chunkType)
          Returns the sequence language model for chunks of the specified type.
 O outLM()
          Returns the process language model for non-chunks.
 double rescore(Chunking chunking)
          Performs rescoring of the base chunking output using character language models.
 char typeToChar(String chunkType)
          Returns the character that encodes the specified chunk type.
 
Methods inherited from class com.aliasi.chunk.RescoringChunker
baseChunker, chunk, chunk, nBest, nBestChunks, numChunkingsRescored, setNumChunkingsRescored
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AbstractCharLmRescoringChunker

public AbstractCharLmRescoringChunker(B baseNBestChunker,
                                      int numChunkingsRescored,
                                      O outLM,
                                      Map<String,Character> typeToChar,
                                      Map<String,C> typeToLM)
Construct a rescoring chunker based on the specified underlying chunker, with the specified number of underlying chunkings rescored, based on the models and type encodings provided in the last three arguments. See the class documentation for more information on the role of these parameters.

Parameters:
baseNBestChunker - Underlying chunker to rescore.
numChunkingsRescored - Number of underlying chunkings rescored by this chunker.
outLM - The process language model for non-chunks.
typeToChar - A mapping from chunk types to the characters that encode them.
typeToLM - A mapping from chunk types to the language models used to model them.
Method Detail

typeToChar

public char typeToChar(String chunkType)
Returns the character that encodes the specified chunk type. Throws an illegal argument exception if the chunk type is not found.

Parameters:
chunkType - Type of chunk.
Returns:
The character encoding the specified chunk type.
Throws:
IllegalArgumentException - If the specified chunk type does not exist.
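The documented contract (a character for a known type, an exception for an unknown one) can be illustrated with a small stand-in. This is a sketch of the contract only, under the assumption that the lookup is backed by the Map<String,Character> passed to the constructor; the class TypeCharLookup is hypothetical and is not LingPipe's source.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the lookup contract; not part of the LingPipe API.
final class TypeCharLookup {

    private final Map<String,Character> typeToChar;

    public TypeCharLookup(Map<String,Character> typeToChar) {
        // Defensive copy so later changes to the caller's map are not seen here.
        this.typeToChar = new HashMap<>(typeToChar);
    }

    /** Returns the character for the type, or throws if the type is unknown. */
    public char typeToChar(String chunkType) {
        Character result = typeToChar.get(chunkType);
        if (result == null) {
            throw new IllegalArgumentException("Unknown chunk type: " + chunkType);
        }
        return result; // unboxed to char
    }
}
```

A caller that cannot guarantee the type was registered would wrap the call in a try/catch for IllegalArgumentException rather than expect a sentinel return value.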

outLM

public O outLM()
Returns the process language model for non-chunks. This is the actual language model used, so changes to it affect this chunker.

Returns:
The process language model for non-chunks.

chunkLM

public C chunkLM(String chunkType)
Returns the sequence language model for chunks of the specified type.

Parameters:
chunkType - Type of chunk.
Returns:
Language model for the specified chunk type.

rescore

public double rescore(Chunking chunking)
Performs rescoring of the base chunking output using character language models. See the class documentation above for more information.

Specified by:
rescore in class RescoringChunker<B extends NBestChunker>
Parameters:
chunking - Chunking being rescored.
Returns:
New score for the chunking.