java.lang.Object
com.aliasi.hmm.AbstractHmm
com.aliasi.hmm.AbstractHmmEstimator
com.aliasi.hmm.HmmCharLmEstimator
public class HmmCharLmEstimator
extends AbstractHmmEstimator
An HmmCharLmEstimator employs a maximum a posteriori
transition estimator and a bounded character language model
emission estimator.
The emission language models are instances of NGramBoundaryLM. As such, they explicitly model start-of-token
(prefix) and end-of-token (suffix) and basic token-shape features.
The language model parameters are the usual ones: n-gram length,
interpolation ratio (controls amount of smoothing), and number of
characters (controls final smoothing).
The initial state and final state estimators are multinomial distributions, as is the conditional estimator of the next state given a previous state. The default behavior is to use maximum likelihood estimates with no smoothing for initial state, final state, and transition likelihoods in the model. That is, the estimated likelihood of a state being an initial state is proportional to its training data frequency, with the actual likelihood being the training data frequency divided by the total training data frequency across tags.
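The unsmoothed estimate described above can be sketched in a few lines of standalone Java. This is only an illustration of the arithmetic: the class StartStateMle and its fields are hypothetical and are not part of LingPipe, and the real estimator may organize its counts differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of LingPipe: maximum likelihood
// estimate of a multinomial over start states from raw counts.
class StartStateMle {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    // Record one training case that begins in the given state.
    void trainStart(String state) {
        counts.merge(state, 1, Integer::sum);
        total++;
    }

    // Count for the state divided by the total count across states.
    // Unseen states get 0.0 because no smoothing is applied.
    double startProb(String state) {
        if (total == 0) {
            return 0.0;
        }
        return counts.getOrDefault(state, 0) / (double) total;
    }
}
```

With three training cases starting in one state and one starting in another, the first state receives three quarters of the probability mass, and any state never seen as a start state receives zero.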
With the constructor HmmCharLmEstimator(int,int,double,boolean), a flag may be
specified to use smoothing for states. The smoothing used is
add-one smoothing, also called Laplace smoothing. For each state,
it adds one to the count for that state being an initial state and
for that state being a final state. For each pair of states, it
adds one to the count of the transitions (including the self
transition, which is only counted once). This smoothing is
equivalent to putting an alpha=1 uniform Dirichlet
prior on the initial state, final state, and conditional next
state estimators, with the resulting estimates being the maximum a
posteriori estimates.
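As a rough sketch of add-one smoothing for the next-state estimator, the standalone Java below adds one to every transition count before normalizing, giving (count(source, target) + 1) / (count(source) + number of states). The class AddOneTransitions is hypothetical and is not LingPipe's implementation; it also assumes, for simplicity, that end-of-sequence events are estimated separately and that both states passed to transitProb have been seen in training.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not LingPipe's implementation: add-one (Laplace)
// smoothing of the conditional next-state estimator.
class AddOneTransitions {
    private final Set<String> states = new LinkedHashSet<>();
    private final Map<String, Integer> pairCounts = new HashMap<>();
    private final Map<String, Integer> sourceCounts = new HashMap<>();

    // Record one observed transition and register both states.
    void trainTransit(String source, String target) {
        states.add(source);
        states.add(target);
        pairCounts.merge(source + "\t" + target, 1, Integer::sum);
        sourceCounts.merge(source, 1, Integer::sum);
    }

    // (count(source, target) + 1) / (count(source) + number of states).
    // Returns 0.0 if nothing has been trained yet.
    double transitProb(String source, String target) {
        if (states.isEmpty()) {
            return 0.0;
        }
        int pair = pairCounts.getOrDefault(source + "\t" + target, 0);
        int fromSource = sourceCounts.getOrDefault(source, 0);
        return (pair + 1.0) / (fromSource + states.size());
    }
}
```

Because every pair of known states receives one extra count, a transition that never occurred in training still gets a small nonzero estimate, and the estimates out of each source state still sum to one.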
In the real world, corpora are noisy or incomplete. As of
version 3.4.0, this estimator accepts taggings with
null categories. If a category is null,
its emission is not trained, nor are the transitions to it,
transitions from it, start states involving it, or end states
involving it.
The estimator will also accept inputs with null emissions. In the case of a null emission or null category, the emission model will not be trained for that particular token/category pair.
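The rules for null categories and null emissions can be made concrete with a small standalone sketch that lists which training events a tagging would generate. The class NullAwareTraining and its event strings are hypothetical illustrations of the behavior described above, not LingPipe code, and the order of events is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not LingPipe's code: which training events a
// tagging with null entries would generate under the rules above.
class NullAwareTraining {
    static List<String> events(String[] tokens, String[] tags) {
        List<String> out = new ArrayList<>();
        int n = tags.length;
        if (n == 0) {
            return out;
        }
        // A start state is only trained if the first category is known.
        if (tags[0] != null) {
            out.add("start(" + tags[0] + ")");
        }
        for (int i = 0; i < n; i++) {
            // An emission needs both a category and an emission.
            if (tags[i] != null && tokens[i] != null) {
                out.add("emit(" + tags[i] + "," + tokens[i] + ")");
            }
            // A transition needs both of its endpoints to be non-null.
            if (i > 0 && tags[i - 1] != null && tags[i] != null) {
                out.add("transit(" + tags[i - 1] + "," + tags[i] + ")");
            }
        }
        // An end state is only trained if the last category is known.
        if (tags[n - 1] != null) {
            out.add("end(" + tags[n - 1] + ")");
        }
        return out;
    }
}
```

For a three-token tagging whose second emission and third category are null, only the start state, the first emission, and the transition between the first two categories would be trained.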
| Constructor Summary | |
|---|---|
| HmmCharLmEstimator() | Construct an HMM estimator with default parameter settings. |
| HmmCharLmEstimator(int charLmMaxNGram, int maxCharacters, double charLmInterpolation) | Construct an HMM estimator with the specified maximum character n-gram size, maximum number of characters in the data, and character n-gram interpolation parameter, with no state smoothing. |
| HmmCharLmEstimator(int charLmMaxNGram, int maxCharacters, double charLmInterpolation, boolean smootheStates) | Construct an HMM estimator with the specified maximum character n-gram size, maximum number of characters in the data, character n-gram interpolation parameter, and state smoothing. |
| Method Summary | | |
|---|---|---|
| void | compileTo(ObjectOutput objOut) | Compiles a copy of this estimated HMM to the specified object output. |
| NGramBoundaryLM | emissionLm(String state) | Returns the language model used for emission probabilities for the specified state. |
| double | emitLog2Prob(String state, CharSequence emission) | Returns the log (base 2) of the emission estimate. |
| double | emitProb(String state, CharSequence emission) | Returns the estimate of the probability of the specified string being emitted from the specified state. |
| double | endProb(String state) | Returns the end probability for the specified state. |
| double | startProb(String state) | Returns the start probability for the specified state. |
| void | trainEmit(String state, CharSequence emission) | Train the emission estimator with the specified training instance consisting of a state and emission. |
| void | trainEnd(String state) | Train the end state estimator with the specified end state. |
| void | trainStart(String state) | Train the start state estimator with the specified start state. |
| void | trainTransit(String sourceState, String targetState) | Trains the transition estimator from the specified transition from the specified source state to the specified target state. |
| double | transitProb(String source, String target) | Returns the transition estimate from the specified source state to the specified target state. |
| Methods inherited from class com.aliasi.hmm.AbstractHmmEstimator |
|---|
handle, numTrainingCases, numTrainingTokens |
| Methods inherited from class com.aliasi.hmm.AbstractHmm |
|---|
addState, emitLog2Prob, emitProb, endLog2Prob, endLog2Prob, endProb, startLog2Prob, startLog2Prob, startProb, stateSymbolTable, transitLog2Prob, transitLog2Prob, transitProb |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public HmmCharLmEstimator()
Construct an HMM estimator with default parameter settings. The defaults are
6 for the maximum character n-gram, Character.MAX_VALUE-1 for
the maximum number of characters, 6.0 for the
character n-gram interpolation factor, and no state likelihood
smoothing.
public HmmCharLmEstimator(int charLmMaxNGram,
int maxCharacters,
double charLmInterpolation)
Construct an HMM estimator with the specified maximum character n-gram size, maximum number of characters in the data, and character n-gram interpolation parameter, with no state smoothing. For details on the language model parameters, see NGramBoundaryLM.NGramBoundaryLM(int,int,double,char).
charLmMaxNGram - Maximum n-gram for emission character language models.
maxCharacters - Maximum number of unique characters in the training and test data.
charLmInterpolation - Interpolation parameter for character language models.
IllegalArgumentException - If the max n-gram is less
than one, the max characters is less than 1 or greater than
Character.MAX_VALUE-1, or if the interpolation
parameter is negative or greater than 1.0.
public HmmCharLmEstimator(int charLmMaxNGram,
int maxCharacters,
double charLmInterpolation,
boolean smootheStates)
Construct an HMM estimator with the specified maximum character n-gram size, maximum number of characters in the data, character n-gram interpolation parameter, and state smoothing. For details on the language model parameters, see NGramBoundaryLM.NGramBoundaryLM(int,int,double,char).
For information on state smoothing, see the class documentation
above.
charLmMaxNGram - Maximum n-gram for emission character language models.
maxCharacters - Maximum number of unique characters in the training and test data.
charLmInterpolation - Interpolation parameter for character language models.
smootheStates - Flag indicating if add-one smoothing is carried out for HMM states.
IllegalArgumentException - If the max n-gram is less
than one, the max characters is less than 1 or greater than
Character.MAX_VALUE-1, or if the interpolation
parameter is negative or greater than 1.0.
| Method Detail |
|---|
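public void trainStart(String state)
Description copied from class: AbstractHmmEstimator
Train the start state estimator with the specified start state.
trainStart in class AbstractHmmEstimator
state - State being trained.
public void trainEnd(String state)
Description copied from class: AbstractHmmEstimator
Train the end state estimator with the specified end state.
trainEnd in class AbstractHmmEstimator
state - State being trained.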
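public void trainEmit(String state,
CharSequence emission)
Description copied from class: AbstractHmmEstimator
Train the emission estimator with the specified training instance consisting of a state and emission.
trainEmit in class AbstractHmmEstimator
state - State being trained.
emission - Emission from state being trained.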
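public void trainTransit(String sourceState,
String targetState)
Description copied from class: AbstractHmmEstimator
Trains the transition estimator from the specified transition from the specified source state to the specified target state.
trainTransit in class AbstractHmmEstimator
sourceState - State from which the transition is made.
targetState - State to which the transition is made.
public double startProb(String state)
Description copied from class: AbstractHmm
Returns the start probability for the specified state.
startProb in interface HiddenMarkovModel
startProb in class AbstractHmm
state - HMM state.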
public double endProb(String state)
Description copied from class: AbstractHmm
Returns the end probability for the specified state.
endProb in interface HiddenMarkovModel
endProb in class AbstractHmm
state - HMM state.
public double transitProb(String source,
String target)
Returns the transition estimate from the specified source state to the specified target state.
Each transition may be trained once through trainTransit(String,String), in
order to produce add-one smoothing. Typically, maximum
likelihood estimates of state transitions are fine for HMMs
trained with large sets of supervised data.
transitProb in interface HiddenMarkovModel
transitProb in class AbstractHmm
source - Originating state for the transition.
target - Resulting state after the transition.
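public double emitProb(String state,
CharSequence emission)
Returns the estimate of the probability of the specified string being emitted from the specified state.
emitProb in interface HiddenMarkovModel
emitProb in class AbstractHmm
state - State of HMM.
emission - String emitted by state.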
public double emitLog2Prob(String state,
CharSequence emission)
Description copied from class: AbstractHmm
Returns the log (base 2) of the emission estimate. See AbstractHmm.emitProb(String,CharSequence) for more information.
This method is implemented in terms of Math.log2(double) and AbstractHmm.emitProb(String,CharSequence).
emitLog2Prob in interface HiddenMarkovModel
emitLog2Prob in class AbstractHmm
state - Label of state.
emission - Character sequence emitted.
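The relationship between the emission probability and its base-2 logarithm can be sketched as follows. The standard java.lang.Math class has no log2 method, so the Math.log2(double) referenced above is presumably a LingPipe utility; the hypothetical Log2Sketch class below computes the same quantity by change of base and takes the emission probability as a plain number to stay standalone.

```java
// Hypothetical helper, not part of LingPipe: base-2 logarithm of an
// emission probability, computed by change of base.
class Log2Sketch {
    // log2(x) = ln(x) / ln(2)
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // The emission probability is passed in directly; in an HMM it
    // would come from the emission estimate for a state and string.
    static double emitLog2Prob(double emitProb) {
        return log2(emitProb);
    }
}
```

A probability of zero maps to negative infinity under this definition, so callers that sum log probabilities need to be prepared for that value.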
public NGramBoundaryLM emissionLm(String state)
Returns the language model used for emission probabilities for the specified state.
state - State of the HMM.
public void compileTo(ObjectOutput objOut)
throws IOException
Description copied from class: AbstractHmmEstimator
Compiles a copy of this estimated HMM to the specified object output. The object read back in will be an instance of HiddenMarkovModel, but will
most likely not be an instance of the same class as the object
being compiled.
compileTo in interface Compilable
compileTo in class AbstractHmmEstimator
objOut - Object output to which this estimator is compiled.
IOException - If there is an I/O exception compiling this object.
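The idea that a compiled model is read back as a different class from the estimator that wrote it can be illustrated with standard Java serialization. Everything in this sketch is hypothetical: the classes CompileSketch, Estimator, and CompiledModel are stand-ins written for this example and say nothing about how LingPipe actually implements compilation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not LingPipe code: a trainable estimator that
// compiles itself to a separate, read-only model class.
class CompileSketch {

    // Read-only model holding fixed probabilities.
    static class CompiledModel implements Serializable {
        private static final long serialVersionUID = 1L;
        private final HashMap<String, Double> startProbs;

        CompiledModel(HashMap<String, Double> startProbs) {
            this.startProbs = startProbs;
        }

        double startProb(String state) {
            return startProbs.getOrDefault(state, 0.0);
        }
    }

    // Trainable estimator holding raw counts.
    static class Estimator {
        private final Map<String, Integer> counts = new HashMap<>();
        private int total = 0;

        void trainStart(String state) {
            counts.merge(state, 1, Integer::sum);
            total++;
        }

        // Writes a CompiledModel, not the estimator itself.
        void compileTo(ObjectOutput objOut) throws IOException {
            HashMap<String, Double> probs = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                probs.put(e.getKey(), e.getValue() / (double) total);
            }
            objOut.writeObject(new CompiledModel(probs));
        }
    }

    // Compiles to an in-memory buffer and reads the result back.
    static Object roundTrip(Estimator estimator) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                estimator.compileTo(out);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Separating the two classes lets the compiled form drop training state such as counts and keep only what is needed to answer probability queries.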