com.aliasi.crf
Class ChainCrfChunker

java.lang.Object
  extended by com.aliasi.crf.ChainCrfChunker
All Implemented Interfaces:
Chunker, ConfidenceChunker, NBestChunker, Serializable

public class ChainCrfChunker
extends Object
implements Chunker, ConfidenceChunker, NBestChunker, Serializable

A ChainCrfChunker implements chunking based on a chain CRF over string sequences, a tokenizer factory, and a tag-to-chunk coder/decoder.

The tokenizer factory is used to turn an input sequence into a list of tokens. The codec is used to convert taggings into chunkings and vice-versa.

Codec-Based Features

For chunking, feature extraction is over the same two implicit data structures as for chain CRFs, nodes and edges. For chunkers, the labels are coded and decoded by an instance of TagChunkCodec, such as the BIO-based codec. In order to generate token-based representations on which to hang tags, an instance of TokenizerFactory is supplied in the chunker constructor.
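
To make the tag/chunk coding concrete, here is a minimal sketch of a BIO-style scheme over token positions. It is not LingPipe's TagChunkCodec: the class BioCodingSketch, its Span type, and the tag strings "B_", "I_", and "O" are assumptions made for illustration, and the real codec works from character offsets and a tokenizer rather than token indices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not LingPipe's TagChunkCodec.
final class BioCodingSketch {

    // A chunk covering tokens [startToken, endToken) with a type such as "PER".
    static final class Span {
        final int startToken;
        final int endToken;
        final String type;

        Span(int startToken, int endToken, String type) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.type = type;
        }
    }

    // Encode non-overlapping spans as one tag per token: B_type, I_type, or O.
    static String[] encode(int numTokens, List<Span> spans) {
        String[] tags = new String[numTokens];
        Arrays.fill(tags, "O");
        for (Span s : spans) {
            tags[s.startToken] = "B_" + s.type;
            for (int i = s.startToken + 1; i < s.endToken; ++i)
                tags[i] = "I_" + s.type;
        }
        return tags;
    }

    // Decode tags back to spans: a B_ tag opens a chunk and the
    // following I_ tags of the same type extend it.
    static List<Span> decode(String[] tags) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < tags.length) {
            if (tags[i].startsWith("B_")) {
                String type = tags[i].substring(2);
                int end = i + 1;
                while (end < tags.length && tags[end].equals("I_" + type))
                    ++end;
                spans.add(new Span(i, end, type));
                i = end;
            } else {
                ++i;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Tokens: John Smith lives in Washington
        List<Span> spans = Arrays.asList(new Span(0, 2, "PER"), new Span(4, 5, "LOC"));
        System.out.println(Arrays.toString(encode(5, spans)));
    }
}
```

The tags produced by a scheme like this are the labels that the node and edge feature extraction sees, which is why the choice of codec affects the features.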

Training

The static estimate() method is used to train a chain CRF-based chunker. The training data is provided as a corpus of chunkings. The tag-chunk codec and tokenizer factory are then used to convert the chunkings to taggings, and the resulting tag corpus is passed to the chain CRF estimator method. The feature extractor is the same as for a chain CRF, covering both nodes and edges. The tags passed in to the feature extractor will be determined by the tag-chunk codec. The remaining inputs are identical to those for chain CRFs; see the method documentation for more information.

Decoding

A chain CRF chunker implements all three chunker interfaces, so it can return first-best chunkings, return n-best chunkings (with or without normalization of scores to conditional probabilities), and iterate over the n-best chunks in decreasing order of probability.

Serialization

Chain CRF chunkers are serializable if their contained tokenizer factories and codecs are serializable. The chunker read back in will be of this class, ChainCrfChunker, with components derived from serialization and deserialization.
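
The dependence on component serializability follows from standard Java object serialization, and can be sketched without LingPipe. In the example, Holder is a hypothetical stand-in for a chunker that contains a component; it says nothing about how ChainCrfChunker itself implements serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration only; Holder stands in for a chunker with a component.
final class SerializationSketch {

    static final class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        final Object component;

        Holder(Object component) {
            this.component = component;
        }
    }

    // Write the object to a byte array and read it back.
    static Object roundTrip(Object in) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(in);
        }
        try (ObjectInputStream objIn =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return objIn.readObject();
        }
    }

    // Returns true if the object survives a round trip, and false if it
    // (or something it contains) is not serializable.
    static boolean isRoundTrippable(Object in) {
        try {
            roundTrip(in);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A String component is serializable; a bare Object is not.
        System.out.println(isRoundTrippable(new Holder("component")));
        System.out.println(isRoundTrippable(new Holder(new Object())));
    }
}
```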

Thread Safety

The chain CRF chunker class is thread safe if the tokenizer factory and tag/chunk coder/decoder are thread safe.

Since:
LingPipe 3.9
Version:
3.9
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
ChainCrfChunker(ChainCrf<String> crf, TokenizerFactory tokenizerFactory, TagChunkCodec codec)
          Construct a chunker based on the specified chain conditional random field, tokenizer factory and tag-chunk coder/decoder.
 
Method Summary
 Chunking chunk(char[] cs, int start, int end)
          Return the chunking of the specified character slice.
 Chunking chunk(CharSequence cSeq)
          Return the chunking of the specified character sequence.
 TagChunkCodec codec()
          Returns the tag/chunk coder/decoder for this chunker.
 ChainCrf<String> crf()
          Returns the underlying CRF for this chunker.
static ChainCrfChunker estimate(Corpus<ObjectHandler<Chunking>> chunkingCorpus, TagChunkCodec codec, TokenizerFactory tokenizerFactory, ChainCrfFeatureExtractor<String> featureExtractor, boolean addInterceptFeature, int minFeatureCount, boolean cacheFeatureVectors, RegressionPrior prior, int priorBlockSize, AnnealingSchedule annealingSchedule, double minImprovement, int minEpochs, int maxEpochs, Reporter reporter)
          Return the chain CRF-based chunker estimated from the specified corpus, which is converted to a tagging corpus using the specified coder/decoder and tokenizer factory, then passed to the chain CRF estimate method along with the rest of the arguments.
 Iterator<ScoredObject<Chunking>> nBest(char[] cs, int start, int end, int maxResults)
          Return the scored chunkings of the specified character slice as an iterator in decreasing order of score.
 Iterator<Chunk> nBestChunks(char[] cs, int start, int end, int maxNBest)
          Returns the n-best chunks in decreasing order of probability estimates.
 Iterator<ScoredObject<Chunking>> nBestConditional(char[] cs, int start, int end, int maxResults)
          Returns an iterator over n-best chunkings with scores normalized to conditional probabilities of the output given the input string slice.
 TokenizerFactory tokenizerFactory()
          Return the tokenizer factory for this chunker.
 String toString()
          Return a string-based representation of this CRF chunker.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

ChainCrfChunker

public ChainCrfChunker(ChainCrf<String> crf,
                       TokenizerFactory tokenizerFactory,
                       TagChunkCodec codec)
Construct a chunker based on the specified chain conditional random field, tokenizer factory and tag-chunk coder/decoder. If the codec requires a tokenizer factory, it should be the same one as supplied to this chunker constructor.

Parameters:
crf - Underlying conditional random field.
tokenizerFactory - Tokenizer factory for converting chunkings to token sequences.
codec - Coder/decoder for converting taggings to chunkings and vice-versa.
Method Detail

crf

public ChainCrf<String> crf()
Returns the underlying CRF for this chunker.

Returns:
CRF for this chunker.

codec

public TagChunkCodec codec()
Returns the tag/chunk coder/decoder for this chunker.

Returns:
The tag chunk codec for this chunker.

tokenizerFactory

public TokenizerFactory tokenizerFactory()
Return the tokenizer factory for this chunker.

Returns:
The tokenizer factory for this chunker.

toString

public String toString()
Return a string-based representation of this CRF chunker.

Overrides:
toString in class Object
Returns:
String representation of this chunker.

chunk

public Chunking chunk(CharSequence cSeq)
Description copied from interface: Chunker
Return the chunking of the specified character sequence.

Specified by:
chunk in interface Chunker
Parameters:
cSeq - Character sequence to chunk.
Returns:
A chunking of the character sequence.

chunk

public Chunking chunk(char[] cs,
                      int start,
                      int end)
Description copied from interface: Chunker
Return the chunking of the specified character slice.

Specified by:
chunk in interface Chunker
Parameters:
cs - Underlying character sequence.
start - Index of first character in slice.
end - Index of one past the last character in the slice.
Returns:
The chunking over the specified character slice.
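
The slice convention used by this and the other char[] methods (start inclusive, end exclusive) can be sketched as follows. The class SliceSketch and its slice helper are hypothetical and only show which characters a given (cs, start, end) triple selects.

```java
// Hypothetical illustration only; shows the [start, end) slice convention.
final class SliceSketch {

    // Returns the characters from index start up to, but not including, index end.
    static String slice(char[] cs, int start, int end) {
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] cs = "John lives in NY".toCharArray();
        // Index 5 is 'l' and index 10 is the space after "lives".
        System.out.println(slice(cs, 5, 10));
    }
}
```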

nBest

public Iterator<ScoredObject<Chunking>> nBest(char[] cs,
                                              int start,
                                              int end,
                                              int maxResults)
Description copied from interface: NBestChunker
Return the scored chunkings of the specified character slice as an iterator in decreasing order of score. The return result is an iterator over scored objects consisting of chunkings and scores. The maximum number of returned chunkings is also specified; for many n-best chunkers, a smaller maximum n-best size leads to faster results.

Specified by:
nBest in interface NBestChunker
Parameters:
cs - Underlying character array.
start - Index of first character to analyze.
end - Index of one past the last character to analyze.
maxResults - The maximum number of results to return.

nBestConditional

public Iterator<ScoredObject<Chunking>> nBestConditional(char[] cs,
                                                         int start,
                                                         int end,
                                                         int maxResults)
Returns an iterator over n-best chunkings with scores normalized to conditional probabilities of the output given the input string slice. The same chunkings will be returned in the same order as for the unnormalized method, nBest(char[],int,int,int). Like that method, the maximum number of results parameter should be set as low as practical, as it cuts down on the memory required for outputs that will never be returned.

Conditional probability normalization requires an additional forward-backward pass to derive the normalizing factor, but the benefit is that results become comparable across input strings.

Parameters:
cs - Underlying characters.
start - First character in slice.
end - One past the last character in the slice.
maxResults - Maximum number of results to return.
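
The normalization described for nBestConditional can be sketched on a toy linear chain. This is not LingPipe's implementation: the class ForwardNormalizerSketch, the arrays of node and edge scores, and the use of natural-log scores are all assumptions of the example. The sketch computes the log normalizer with the forward half of the pass, subtracts it from a tagging's unnormalized score, and checks by brute force that the resulting probabilities sum to one.

```java
// Hypothetical illustration only; a toy linear chain with natural-log scores.
final class ForwardNormalizerSketch {

    // Numerically stable log(sum(exp(x))).
    static double logSumExp(double[] xs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : xs)
            max = Math.max(max, x);
        if (max == Double.NEGATIVE_INFINITY)
            return max;
        double sum = 0.0;
        for (double x : xs)
            sum += Math.exp(x - max);
        return max + Math.log(sum);
    }

    // Unnormalized log score of one tagging: node scores plus edge scores.
    static double score(double[][] node, double[][] edge, int[] tags) {
        double s = node[0][tags[0]];
        for (int n = 1; n < tags.length; ++n)
            s += edge[tags[n - 1]][tags[n]] + node[n][tags[n]];
        return s;
    }

    // Forward pass: log of the sum of exp(score) over all taggings.
    static double logZ(double[][] node, double[][] edge) {
        int numTags = edge.length;
        double[] alpha = node[0].clone();
        double[] terms = new double[numTags];
        for (int n = 1; n < node.length; ++n) {
            double[] next = new double[numTags];
            for (int k = 0; k < numTags; ++k) {
                for (int j = 0; j < numTags; ++j)
                    terms[j] = alpha[j] + edge[j][k];
                next[k] = node[n][k] + logSumExp(terms);
            }
            alpha = next;
        }
        return logSumExp(alpha);
    }

    // Brute-force check: sum of conditional probabilities over every tagging.
    static double totalProbability(double[][] node, double[][] edge) {
        int numTokens = node.length;
        int numTags = edge.length;
        double z = logZ(node, edge);
        int numTaggings = 1;
        for (int n = 0; n < numTokens; ++n)
            numTaggings *= numTags;
        double total = 0.0;
        int[] tags = new int[numTokens];
        for (int code = 0; code < numTaggings; ++code) {
            int rest = code;
            for (int n = 0; n < numTokens; ++n) {
                tags[n] = rest % numTags;
                rest /= numTags;
            }
            total += Math.exp(score(node, edge, tags) - z);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] node = {{0.5, -0.2}, {0.1, 0.3}, {-0.4, 0.7}}; // [token][tag]
        double[][] edge = {{0.2, -0.1}, {-0.3, 0.4}};             // [previous tag][tag]
        double p = Math.exp(score(node, edge, new int[] {0, 1, 1}) - logZ(node, edge));
        System.out.println("conditional probability of tagging 0,1,1: " + p);
        System.out.println("sum over all taggings: " + totalProbability(node, edge));
    }
}
```

Because the normalizer depends on the input, unnormalized scores from different inputs are on different scales, whereas the normalized values are probabilities and can be compared directly.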

nBestChunks

public Iterator<Chunk> nBestChunks(char[] cs,
                                   int start,
                                   int end,
                                   int maxNBest)
Description copied from interface: ConfidenceChunker
Returns the n-best chunks in decreasing order of probability estimates. The return results implement the Chunk interface, and their scores are conditional probability estimates of the chunk given the input character slice.

Specified by:
nBestChunks in interface ConfidenceChunker
Parameters:
cs - Underlying character array.
start - Index of first character to analyze.
end - Index of one past the last character to analyze.
maxNBest - The maximum number of chunks to return.

estimate

public static ChainCrfChunker estimate(Corpus<ObjectHandler<Chunking>> chunkingCorpus,
                                       TagChunkCodec codec,
                                       TokenizerFactory tokenizerFactory,
                                       ChainCrfFeatureExtractor<String> featureExtractor,
                                       boolean addInterceptFeature,
                                       int minFeatureCount,
                                       boolean cacheFeatureVectors,
                                       RegressionPrior prior,
                                       int priorBlockSize,
                                       AnnealingSchedule annealingSchedule,
                                       double minImprovement,
                                       int minEpochs,
                                       int maxEpochs,
                                       Reporter reporter)
                                throws IOException
Return the chain CRF-based chunker estimated from the specified corpus, which is converted to a tagging corpus using the specified coder/decoder and tokenizer factory, then passed to the chain CRF estimate method along with the rest of the arguments.

Estimation is based on regularized stochastic gradient descent. See ChainCrf.estimate(Corpus,ChainCrfFeatureExtractor,boolean,int,boolean,boolean,RegressionPrior,int,AnnealingSchedule,double,int,int,Reporter) for more information.

Parameters:
chunkingCorpus - Training corpus of chunkings.
codec - Coder/decoder for translating chunkings to taggings and vice-versa.
tokenizerFactory - Tokenizer factory for converting inputs to token sequences for the underlying chain CRF.
featureExtractor - Feature extractor for the underlying chain CRF.
addInterceptFeature - Set to true to automatically add an intercept feature with constant value 1.0 in position 0.
minFeatureCount - Minimum number of times a feature must show up in the tagging corpus given the feature extractors to be retained for training.
cacheFeatureVectors - Flag indicating whether or not to cache extracted features.
prior - Prior to use to regularize the underlying chain CRF estimates.
priorBlockSize - Number of instances to update by gradient for every prior update.
annealingSchedule - Annealing schedule to determine learning rates for stochastic gradient descent training.
minImprovement - Minimum improvement per epoch required to continue training (computed with a rolling average); training terminates once improvement falls below this value.
minEpochs - Minimum number of epochs for which to train.
maxEpochs - Maximum number of epochs for which to train.
reporter - Reporter to which reports of training are sent, or null for silent operation.
Throws:
IOException - If there is an underlying I/O exception reading the corpus.
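
The effect of the minFeatureCount parameter can be sketched as a simple count-and-filter step. The class FeaturePruningSketch, the list-of-lists data layout, and the feature names are hypothetical, and the sketch assumes every occurrence of a feature counts toward its total; it is not the library's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not LingPipe's feature pruning code.
final class FeaturePruningSketch {

    // Each inner list holds the feature names extracted from one training
    // instance. Features seen fewer than minFeatureCount times are dropped.
    static Set<String> retained(List<List<String>> extracted, int minFeatureCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> features : extracted)
            for (String feature : features)
                counts.merge(feature, 1, Integer::sum);
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet())
            if (entry.getValue() >= minFeatureCount)
                kept.add(entry.getKey());
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> extracted = Arrays.asList(
            Arrays.asList("TOK_John", "CAP"),
            Arrays.asList("TOK_lives"),
            Arrays.asList("TOK_Mary", "CAP"));
        // Only CAP occurs at least twice in this toy data.
        System.out.println(retained(extracted, 2));
    }
}
```

Raising the threshold shrinks the feature set, which reduces model size at the cost of discarding rare but possibly informative features.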