java.lang.Object
  com.aliasi.chunk.TrainTokenShapeChunker
public class TrainTokenShapeChunker
A TrainTokenShapeChunker is used to train a token and
shape-based chunker.
Estimation is based on a joint model of tags
T1,...,TN and tokens W1,...,WN, which is
approximated with a limited history and smoothed using linear
interpolation.
By the chain rule:
P(W1,...,WN,T1,...,TN)
= P(W1,T1) * P(W2,T2|W1,T1) * P(W3,T3|W1,W2,T1,T2)
* ... * P(WN,TN|W1,...,WN-1,T1,...,TN-1)
The longer contexts are approximated with the two previous
tokens and one previous tag.
P(WN,TN|W1,...,WN-1,T1,...,TN-1)
~ P(WN,TN|WN-2,WN-1,TN-1)
The shorter contexts are padded with tags and tokens for the
beginning of a stream, and an additional end-of-stream symbol is
trained after the last symbol in the input.
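The limited-history approximation and the padding can be illustrated with a minimal sketch that enumerates the training events for one tagged token sequence. The padding symbols (`<S>`, `<S_TAG>`, `</S>`, `</S_TAG>`) and the class and method names are hypothetical, chosen only for illustration; the symbols used internally by the trainer are not documented here.

```java
import java.util.ArrayList;
import java.util.List;

final class PaddedContexts {
    // Hypothetical padding symbols (assumptions, not the trainer's actual symbols).
    static final String START_TOKEN = "<S>";
    static final String START_TAG = "<S_TAG>";
    static final String END_TOKEN = "</S>";
    static final String END_TAG = "</S_TAG>";

    // Renders one event: outcome (W[n],T[n]) given context (W[n-2],W[n-1],T[n-1]).
    static String event(String w, String t, String w2, String w1, String t1) {
        return "P(" + w + "," + t + "|" + w2 + "," + w1 + "," + t1 + ")";
    }

    // Lists the events for a sequence, padding the short contexts at the start
    // and adding one end-of-stream event after the last token.
    static List<String> events(String[] tokens, String[] tags) {
        if (tokens.length != tags.length)
            throw new IllegalArgumentException("tokens and tags must have the same length");
        List<String> out = new ArrayList<>();
        String w2 = START_TOKEN;
        String w1 = START_TOKEN;
        String t1 = START_TAG;
        for (int i = 0; i < tokens.length; i++) {
            out.add(event(tokens[i], tags[i], w2, w1, t1));
            w2 = w1;
            w1 = tokens[i];
            t1 = tags[i];
        }
        out.add(event(END_TOKEN, END_TAG, w2, w1, t1));
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = {"John", "ran"};
        String[] tags = {"ST_PERSON", "O"};
        for (String e : events(tokens, tags))
            System.out.println(e);
    }
}
```

A sequence of N tokens yields N+1 events under this scheme, the last one being the end-of-stream event.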
The joint model is further decomposed into a conditional tag model
and a conditional token model by the chain rule:
P(WN,TN|WN-2,WN-1,TN-1)
= P(TN|WN-2,WN-1,TN-1)
* P(WN|WN-2,WN-1,TN-1,TN)
The token model is further approximated as:
P(WN|WN-2,WN-1,TN-1,TN)
~ P(WN|WN-1,interior(TN-1),TN)
where interior(TN-1) is the interior
version of a tag; for instance:
interior("ST_PERSON").equals("PERSON")
interior("PERSON").equals("PERSON")
This performs what is known as "model tying", and it
amounts to sharing the models for the two contexts.
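A minimal sketch of the tying function, assuming that start tags are marked by the prefix `ST_` as in the example above. The class name `TagTying` is hypothetical, and the tag encoding used by the compiled chunker may differ.

```java
final class TagTying {
    // Assumption: a start tag is the interior tag with this prefix, e.g. "ST_PERSON".
    static final String START_PREFIX = "ST_";

    // Maps a start tag to its interior version; other tags are returned unchanged.
    static String interior(String tag) {
        return tag.startsWith(START_PREFIX)
            ? tag.substring(START_PREFIX.length())
            : tag;
    }

    public static void main(String[] args) {
        System.out.println(interior("ST_PERSON").equals("PERSON")); // true
        System.out.println(interior("PERSON").equals("PERSON"));    // true
    }
}
```

Because both `ST_PERSON` and `PERSON` map to the same history symbol, counts collected after either tag contribute to a single shared context.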
The tag model is also approximated by tying start
and interior tag histories:
P(TN|WN-2,WN-1,TN-1)
~ P(TN|WN-2,WN-1,interior(TN-1))
The tag and token models are themselves simple
linear interpolation models, with smoothing parameters defined
by the Witten-Bell method. The order
of contexts for the token model is:
P(WN|TN,interior(TN-1),WN-1)
~ lambda(TN,interior(TN-1),WN-1) * P_ml(WN|TN,interior(TN-1),WN-1)
+ (1-lambda(")) * P(WN|TN,interior(TN-1))
P(WN|TN,interior(TN-1))
~ lambda(TN,interior(TN-1)) * P_ml(WN|TN,interior(TN-1))
+ (1-lambda(")) * P(WN|TN)
P(WN|TN) ~ lambda(TN) * P_ml(WN|TN)
+ (1-lambda(")) * UNIFORM_ESTIMATE
The last step is degenerate in that SUM_W P(W|T) =
INFINITY, because there are infinitely many possible tokens,
and each is assigned the uniform estimate. To fix this, a model
would be needed of character sequences that ensured SUM_W
P(W|T) = 1.0. (The steps to do the final uniform estimate
are handled by the compiled estimator.)
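The token back-off chain can be sketched as follows. This is an illustration under stated assumptions rather than the trainer's implementation: it uses the textbook Witten-Bell weight, lambda(ctx) = count(ctx) / (count(ctx) + distinctOutcomes(ctx)), and it skips contexts that were never seen; the estimator itself may scale the weight differently. The class name and the string-keyed count tables are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class WittenBellTokenSketch {
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> eventCounts = new HashMap<>();
    private final Map<String, Set<String>> outcomes = new HashMap<>();
    private final double uniformEstimate;

    WittenBellTokenSketch(double uniformEstimate) {
        this.uniformEstimate = uniformEstimate;
    }

    // Contexts from longest to shortest, in the back-off order given in the text.
    // Tab is used as a separator on the assumption that it never occurs in a token or tag.
    private static String[] contexts(String tag, String prevInteriorTag, String prevToken) {
        return new String[] {
            tag + "\t" + prevInteriorTag + "\t" + prevToken,
            tag + "\t" + prevInteriorTag,
            tag
        };
    }

    // Counts one token occurrence in all three contexts.
    void train(String token, String tag, String prevInteriorTag, String prevToken) {
        for (String ctx : contexts(tag, prevInteriorTag, prevToken)) {
            contextCounts.merge(ctx, 1, Integer::sum);
            eventCounts.merge(ctx + "\n" + token, 1, Integer::sum);
            outcomes.computeIfAbsent(ctx, k -> new HashSet<>()).add(token);
        }
    }

    // Interpolates from the uniform estimate up to the longest context.
    double estimate(String token, String tag, String prevInteriorTag, String prevToken) {
        String[] ctxs = contexts(tag, prevInteriorTag, prevToken);
        double p = uniformEstimate;
        for (int i = ctxs.length - 1; i >= 0; i--) {
            int count = contextCounts.getOrDefault(ctxs[i], 0);
            if (count == 0) continue; // unseen context: keep the lower-order estimate
            int distinct = outcomes.get(ctxs[i]).size();
            double lambda = (double) count / (count + distinct);
            double pMl = eventCounts.getOrDefault(ctxs[i] + "\n" + token, 0) / (double) count;
            p = lambda * pMl + (1.0 - lambda) * p;
        }
        return p;
    }

    public static void main(String[] args) {
        WittenBellTokenSketch model = new WittenBellTokenSketch(1.0e-3);
        model.train("Smith", "PERSON", "PERSON", "John");
        System.out.println(model.estimate("Smith", "PERSON", "PERSON", "John"));
        System.out.println(model.estimate("Jones", "PERSON", "PERSON", "John"));
    }
}
```

Every token, seen or unseen, receives at least a share of the uniform estimate, which is why the estimates do not sum to one over the unbounded set of possible tokens.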
The tag estimator is smoothed by:
P(TN|interior(TN-1),WN-1,WN-2)
~ lambda(interior(TN-1),WN-1,WN-2) * P_ml(TN|interior(TN-1),WN-1,WN-2)
+ (1-lambda(")) * P(TN|interior(TN-1),WN-1)
P(TN|interior(TN-1),WN-1)
~ lambda(interior(TN-1),WN-1) * P_ml(TN|interior(TN-1),WN-1)
+ (1-lambda(")) * P_ml(TN|interior(TN-1))
Note that the smoothing stops at estimating a tag in terms
of the previous tag. This guarantees that only bigram tag
sequences seen in the training data get non-zero probability
under the estimator.
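The effect of ending the tag back-off at a maximum likelihood estimate can be seen in a small sketch. As an assumption, it uses the textbook Witten-Bell weight count(ctx) / (count(ctx) + distinctOutcomes(ctx)); the class name and count tables are hypothetical and do not reflect the trainer's internal data structures.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TagBackoffSketch {
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> eventCounts = new HashMap<>();
    private final Map<String, Set<String>> outcomes = new HashMap<>();

    // Contexts from longest to shortest; the shortest is the previous interior tag alone.
    private static String[] contexts(String prevInteriorTag, String prevToken, String prevPrevToken) {
        return new String[] {
            prevInteriorTag + "\t" + prevToken + "\t" + prevPrevToken,
            prevInteriorTag + "\t" + prevToken,
            prevInteriorTag
        };
    }

    // Counts one tag occurrence in all three contexts.
    void train(String tag, String prevInteriorTag, String prevToken, String prevPrevToken) {
        for (String ctx : contexts(prevInteriorTag, prevToken, prevPrevToken)) {
            contextCounts.merge(ctx, 1, Integer::sum);
            eventCounts.merge(ctx + "\n" + tag, 1, Integer::sum);
            outcomes.computeIfAbsent(ctx, k -> new HashSet<>()).add(tag);
        }
    }

    private double maxLikelihood(String ctx, String tag) {
        int count = contextCounts.getOrDefault(ctx, 0);
        return count == 0 ? 0.0 : eventCounts.getOrDefault(ctx + "\n" + tag, 0) / (double) count;
    }

    double estimate(String tag, String prevInteriorTag, String prevToken, String prevPrevToken) {
        String[] ctxs = contexts(prevInteriorTag, prevToken, prevPrevToken);
        // Base case: unsmoothed tag bigram estimate, with no uniform term beneath it.
        double p = maxLikelihood(ctxs[2], tag);
        for (int i = 1; i >= 0; i--) {
            int count = contextCounts.getOrDefault(ctxs[i], 0);
            if (count == 0) continue; // unseen token context: keep the lower-order estimate
            double lambda = (double) count / (count + outcomes.get(ctxs[i]).size());
            p = lambda * maxLikelihood(ctxs[i], tag) + (1.0 - lambda) * p;
        }
        return p;
    }

    public static void main(String[] args) {
        TagBackoffSketch model = new TagBackoffSketch();
        model.train("ST_PERSON", "O", "said", "he");
        model.train("O", "O", "said", "he");
        System.out.println(model.estimate("ST_PERSON", "O", "said", "he"));
        System.out.println(model.estimate("PERSON", "O", "said", "he"));
    }
}
```

A tag bigram that never occurred in training has a zero maximum likelihood estimate at every order, so the interpolated estimate is zero as well; in this example, `PERSON` directly after `O` is ruled out, while unseen token contexts still fall back to the tag bigram estimate.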
Sequences of training pairs are added via the handle(String[],String[],String[]) or the handle(Chunking) methods.
| Constructor Summary | |
|---|---|
| TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory) | Construct a trainer for a token/shape chunker based on the specified token categorizer and tokenizer factory. |
| TrainTokenShapeChunker(TokenCategorizer categorizer, TokenizerFactory factory, int knownMinTokenCount, int minTokenCount, int minTagCount) | Construct a trainer for a token/shape chunker based on the specified token categorizer, tokenizer factory and numerical parameters. |

| Method Summary | |
|---|---|
| void | compileTo(ObjectOutput objOut): Compiles a chunker based on the training data received by this trainer to the specified object output. |
| void | handle(Chunking chunking): Add the specified chunking as a training event. |
| void | handle(String[] tokens, String[] whitespaces, String[] tags): Trains the underlying estimator on the specified BIO-encoded chunk tagging. |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |

| Constructor Detail |
|---|
public TrainTokenShapeChunker(TokenCategorizer categorizer,
TokenizerFactory factory)

Construct a trainer for a token/shape chunker based on the specified
token categorizer and tokenizer factory. The other parameters receive
default values: the interpolation ratio is set to 4.0, the
number of tokens to 3,000,000, the
known minimum token count to 8, and the minimum tag and
token counts for pruning to 1.

Parameters:
categorizer - Token categorizer for unknown tokens.
factory - Tokenizer factory for creating tokenizers.

public TrainTokenShapeChunker(TokenCategorizer categorizer,
TokenizerFactory factory,
int knownMinTokenCount,
int minTokenCount,
int minTagCount)

Parameters:
categorizer - Token categorizer for unknown tokens.
factory - Tokenizer factory for tokenizing data.
knownMinTokenCount - Number of instances required for a token to count as known for unknown training.
minTokenCount - Minimum token count for token contexts to survive after pruning.
minTagCount - Minimum count for tag contexts to survive after pruning.

| Method Detail |
|---|
public void handle(String[] tokens,
String[] whitespaces,
String[] tags)

Specified by:
handle in interface TagHandler
Parameters:
tokens - Sequence of tokens to train.
whitespaces - Sequence of whitespaces (ignored).
tags - Sequence of tags to train.
Throws:
IllegalArgumentException - If the tags and tokens are different lengths.
NullPointerException - If any of the tags or tokens are null.

public void handle(Chunking chunking)

Specified by:
handle in interface ObjectHandler<Chunking>
Parameters:
chunking - Chunking for training.

public void compileTo(ObjectOutput objOut)
throws IOException

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which the chunker is written.
Throws:
IOException - If there is an underlying I/O error.