com.aliasi.chunk
Class TrainTokenShapeChunker

java.lang.Object
  extended by com.aliasi.chunk.TrainTokenShapeChunker
All Implemented Interfaces:
Handler, ObjectHandler<Chunking>, TagHandler, Compilable

public class TrainTokenShapeChunker
extends Object
implements TagHandler, ObjectHandler<Chunking>, Compilable

A TrainTokenShapeChunker is used to train a token and shape-based chunker.

Estimation is based on a joint model of tags T1,...,TN and tokens W1,...,WN, which is approximated with a limited history and smoothed using linear interpolation.

By the chain rule:
   P(W1,...,WN,T1,...TN)
       = P(W1,T1) * P(W2,T2|W1,T1) * P(W3,T3|W1,W2,T1,T2)
         * ... * P(WN,TN|W1,...,WN-1,T1,...,TN-1)
 
The longer contexts are approximated with the two previous tokens and one previous tag.
   P(WN,TN|W1,...,WN-1,T1,...,TN-1)
       ~ P(WN,TN|WN-2,WN-1,TN-1)
 
The shorter contexts are padded with tags and tokens for the beginning of a stream, and an additional end-of-stream symbol is trained after the last symbol in the input. The joint model is further decomposed into a conditional tag model and a conditional token model by the chain rule:
    P(WN,TN|WN-2,WN-1,TN-1)
        = P(TN|WN-2,WN-1,TN-1)
          * P(WN|WN-2,WN-1,TN-1,TN)
 
The token model is further approximated as:
   P(WN|WN-2,WN-1,TN-1,TN)
       ~ P(WN|WN-1,interior(TN-1),TN)
 
where interior(TN-1) is the interior version of a tag; for instance:
   interior("ST_PERSON").equals("PERSON")
   interior("PERSON").equals("PERSON")
 
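As a rough sketch (a hypothetical helper, not part of the LingPipe API), the mapping can be pictured as stripping the start-of-chunk prefix from a tag:

   // Hypothetical illustration of interior(); the "ST_" prefix convention is
   // taken from the example above, not from an exposed method.
   static String interior(String tag) {
       return tag.startsWith("ST_") ? tag.substring("ST_".length()) : tag;
   }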
This performs what is known as "model tying", and it amounts to sharing the models for the two contexts. The tag model is also approximated by tying start and interior tag histories:
   P(TN|WN-2,WN-1,TN-1)
       ~ P(TN|WN-2,WN-1,interior(TN-1))
 
The tag and token models are themselves simple linear interpolation models, with smoothing parameters defined by the Witten-Bell method. The order of contexts for the token model is:
   P(WN|TN,interior(TN-1),WN-1)
   ~ lambda(TN,interior(TN-1),WN-1) * P_ml(WN|TN,interior(TN-1),WN-1)
     + (1-lambda(")) * P(WN|TN,interior(TN-1))

   P(WN|TN,interior(TN-1))
   ~ lambda(TN,interior(TN-1)) * P_ml(WN|TN,interior(TN-1))
     + (1-lambda(")) * P(WN|TN)

   P(WN|TN)  ~  lambda(TN) * P_ml(WN|TN)
                 + (1-lambda(")) * UNIFORM_ESTIMATE
 
The last step is degenerate in that SUM_W P(W|T) = INFINITY, because there are infinitely many possible tokens and each is assigned the uniform estimate. Fixing this would require a model of character sequences ensuring that SUM_W P(W|T) = 1.0. (The final uniform estimation step is handled by the compiled estimator.)
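For concreteness, a minimal sketch of Witten-Bell-style linear interpolation follows; the lambda formula, the parameter names, and the use of an interpolation-ratio hyperparameter are illustrative assumptions, not LingPipe's internal estimator:

   // Illustrative Witten-Bell-style interpolation of a maximum-likelihood
   // estimate with a backoff estimate from a shorter context.
   static double interpolate(double mlEstimate,         // P_ml(w | context)
                             double backoffEstimate,    // P(w | shorter context)
                             long contextCount,         // times the context was seen
                             long numOutcomes,          // distinct outcomes seen after it
                             double interpolationRatio) {
       // Contexts with many distinct outcomes relative to their count get
       // less weight, pushing mass toward the backoff estimate.
       double lambda = contextCount
           / (contextCount + interpolationRatio * numOutcomes);
       return lambda * mlEstimate + (1.0 - lambda) * backoffEstimate;
   }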

The tag estimator is smoothed by:
   P(TN|interior(TN-1),WN-1,WN-2)
       ~ lambda(interior(TN-1),WN-1,WN-2) * P_ml(TN|interior(TN-1),WN-1,WN-2)
       + (1-lambda(")) * P(TN|interior(TN-1),WN-1)

  P(TN|interior(TN-1),WN-1)
      ~ lambda(interior(TN-1),WN-1) * P_ml(TN|interior(TN-1),WN-1)
      + (1-lambda("))               * P_ml(TN|interior(TN-1))
 
Note that smoothing stops at estimating a tag in terms of the previous tag. This guarantees that only tag bigrams seen in the training data receive non-zero probability under the estimator.

Sequences of training pairs are added via the handle(Chunking) method.
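A minimal end-to-end training sketch follows; the IndoEuropeanTokenizerFactory and IndoEuropeanTokenCategorizer singletons from com.aliasi.tokenizer and the ChunkingImpl/ChunkFactory classes from this package are assumed to be available, and the single hand-labeled sentence stands in for a real training corpus:

   import java.io.FileOutputStream;
   import java.io.ObjectOutputStream;

   import com.aliasi.chunk.ChunkFactory;
   import com.aliasi.chunk.ChunkingImpl;
   import com.aliasi.chunk.TrainTokenShapeChunker;
   import com.aliasi.tokenizer.IndoEuropeanTokenCategorizer;
   import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
   import com.aliasi.tokenizer.TokenizerFactory;

   public class TrainChunkerSketch {
       public static void main(String[] args) throws Exception {
           // Tokenizer factory and token categorizer (assumed singletons).
           TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
           TrainTokenShapeChunker trainer
               = new TrainTokenShapeChunker(IndoEuropeanTokenCategorizer.CATEGORIZER,
                                            factory);

           // One hand-labeled chunking; a real system would iterate over a corpus.
           ChunkingImpl chunking = new ChunkingImpl("John Smith works in Seattle.");
           chunking.add(ChunkFactory.createChunk(0, 10, "PERSON"));    // "John Smith"
           chunking.add(ChunkFactory.createChunk(20, 27, "LOCATION")); // "Seattle"
           trainer.handle(chunking);

           // Compile the trained model to disk for later use.
           ObjectOutputStream objOut
               = new ObjectOutputStream(new FileOutputStream("chunker.model"));
           trainer.compileTo(objOut);
           objOut.close();
       }
   }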

Since:
LingPipe1.0
Version:
3.9.1
Author:
Bob Carpenter

Constructor Summary
TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory)
          Construct a trainer for a token/shape chunker based on the specified token categorizer and tokenizer factory.
TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory, int knownMinTokenCount, int minTokenCount, int minTagCount)
          Construct a trainer for a token/shape chunker based on the specified token categorizer, tokenizer factory and numerical parameters.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Compiles a chunker based on the training data received by this trainer to the specified object output.
 void handle(Chunking chunking)
          Add the specified chunking as a training event.
 void handle(String[] tokens, String[] whitespaces, String[] tags)
          Deprecated. Use handle(Chunking) instead.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TrainTokenShapeChunker

public TrainTokenShapeChunker(TokenCategorizer categorizer,
                              TokenizerFactory factory)
Construct a trainer for a token/shape chunker based on the specified token categorizer and tokenizer factory. The other parameters receive default values: the interpolation ratio is set to 4.0, the number of tokens to 3,000,000, the known minimum token count to 8, and the minimum tag and token counts for pruning to 1.

Parameters:
categorizer - Token categorizer for unknown tokens.
factory - Tokenizer factory for creating tokenizers.

TrainTokenShapeChunker

public TrainTokenShapeChunker(TokenCategorizer categorizer,
                              TokenizerFactory factory,
                              int knownMinTokenCount,
                              int minTokenCount,
                              int minTagCount)
Construct a trainer for a token/shape chunker based on the specified token categorizer, tokenizer factory and numerical parameters. The parameters are described in detail in the class documentation above.

Parameters:
categorizer - Token categorizer for unknown tokens.
factory - Tokenizer factory for tokenizing data.
knownMinTokenCount - Number of instances required for a token to count as known for unknown training.
minTokenCount - Minimum token count for token contexts to survive after pruning.
minTagCount - Minimum count for tag contexts to survive after pruning.
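For example (the threshold values below are illustrative only, and categorizer and factory are assumed to be constructed as in the class-level sketch):

   TrainTokenShapeChunker trainer
       = new TrainTokenShapeChunker(categorizer, factory,
                                    8,   // knownMinTokenCount
                                    2,   // minTokenCount
                                    2);  // minTagCount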
Method Detail

handle

@Deprecated
public void handle(String[] tokens,
                   String[] whitespaces,
                   String[] tags)
Deprecated. Use handle(Chunking) instead.

Trains the underlying estimator on the specified BIO-encoded chunk tagging.

Specified by:
handle in interface TagHandler
Parameters:
tokens - Sequence of tokens to train.
whitespaces - Sequence of whitespaces (ignored).
tags - Sequence of tags to train.
Throws:
IllegalArgumentException - If the tags and tokens are different lengths.
NullPointerException - If any of the tags or tokens are null.

handle

public void handle(Chunking chunking)
Add the specified chunking as a training event.

Specified by:
handle in interface ObjectHandler<Chunking>
Parameters:
chunking - Chunking for training.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Compiles a chunker based on the training data received by this trainer to the specified object output.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which the chunker is written.
Throws:
IOException - If there is an underlying I/O error.
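A deserialization sketch for the compiled output; it assumes the model was written to chunker.model as in the class-level example and that the compiled object can be cast to com.aliasi.chunk.Chunker:

   import java.io.FileInputStream;
   import java.io.ObjectInputStream;

   import com.aliasi.chunk.Chunker;
   import com.aliasi.chunk.Chunking;

   public class ReadChunkerSketch {
       public static void main(String[] args) throws Exception {
           // Read the compiled model back in; the cast assumes the compiled
           // object implements the Chunker interface.
           ObjectInputStream objIn
               = new ObjectInputStream(new FileInputStream("chunker.model"));
           Chunker chunker = (Chunker) objIn.readObject();
           objIn.close();

           // Apply the compiled chunker to new text.
           Chunking chunking = chunker.chunk("Jane Doe visited Boston.");
           System.out.println(chunking.chunkSet());
       }
   }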