com.aliasi.hmm
Class HmmCharLmEstimator

java.lang.Object
  extended by com.aliasi.hmm.AbstractHmm
      extended by com.aliasi.hmm.AbstractHmmEstimator
          extended by com.aliasi.hmm.HmmCharLmEstimator
All Implemented Interfaces:
Handler, ObjectHandler<Tagging<String>>, TagHandler, HiddenMarkovModel, Compilable

public class HmmCharLmEstimator
extends AbstractHmmEstimator

An HmmCharLmEstimator employs a maximum a posteriori transition estimator and a bounded character language model emission estimator.

Emission Language Models

The emission language models are instances of NGramBoundaryLM. As such, they explicitly model start-of-token (prefix) and end-of-token (suffix) and basic token-shape features. The language model parameters are the usual ones: n-gram length, interpolation ratio (controls amount of smoothing), and number of characters (controls final smoothing).
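As an illustration of what modeling token boundaries involves, the sketch below pads a token with a boundary marker on each side and lists its character n-grams. It is a simplified stand-in, not the NGramBoundaryLM implementation; the class BoundaryNGrams, the method nGrams, and the choice of boundary character are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LingPipe.
final class BoundaryNGrams {

    // Returns every character n-gram of length n from the token after
    // padding it with one boundary character at the start and the end.
    static List<String> nGrams(String token, int n, char boundary) {
        String padded = boundary + token + boundary;
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            result.add(padded.substring(i, i + n));
        }
        return result;
    }
}
```

With the token "Dr", n = 2, and '#' as the boundary, the n-grams are "#D", "Dr", and "r#". The first and last of these are the ones that let a model distinguish token-initial and token-final characters.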

Transition Estimates and Smoothing

The initial state and final state estimators are multinomial distributions, as is the conditional estimator of the next state given a previous state. The default behavior is to use maximum likelihood estimates with no smoothing for initial state, final state, and transition likelihoods in the model. That is, the estimated likelihood of a state being an initial state is proportional to its training data frequency, with the actual likelihood being the training data frequency divided by the total training data frequency across tags.
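The unsmoothed estimate for initial states can be sketched as a count divided by a total. The class below is a hypothetical stand-in that borrows the method names trainStart and startProb for readability; it does not reproduce the library's internals, and returning 0.0 before any training is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a maximum likelihood start-state estimator.
final class StartMle {
    private final Map<String, Integer> startCounts = new HashMap<>();
    private int total = 0;

    // Records one observation of the state as an initial state.
    void trainStart(String state) {
        startCounts.merge(state, 1, Integer::sum);
        total++;
    }

    // Maximum likelihood estimate: count(state) / total count.
    // Assumption: 0.0 is returned when nothing has been trained.
    double startProb(String state) {
        if (total == 0) {
            return 0.0;
        }
        return startCounts.getOrDefault(state, 0) / (double) total;
    }
}
```

After training on the start states N, N, V, N, this sketch estimates 0.75 for N, 0.25 for V, and 0.0 for any state never seen in initial position.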

With the constructor HmmCharLmEstimator(int,int,double,boolean), a flag may be specified to use smoothing for states. The smoothing used is add-one smoothing, also called Laplace smoothing. For each state, it adds one to the count for that state being an initial state and for that state being a final state. For each pair of states, it adds one to the count of the transition between them (including the self transition, which is counted only once). This smoothing is equivalent to putting an alpha=1 uniform Dirichlet prior on the initial state, final state, and conditional next state estimators, with the resulting estimates being the maximum a posteriori estimates.
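The effect of add-one smoothing on transition estimates can be sketched with a plain count matrix. The class AddOneTransitions is hypothetical and only shows the arithmetic for transitions between states; it is not the library's implementation and ignores the separate start and end estimators.

```java
// Hypothetical illustration of add-one (Laplace) smoothing.
final class AddOneTransitions {

    // counts[i][j] is the number of observed transitions from state i
    // to state j. Returns probs[i][j] = (counts[i][j] + 1) / (rowTotal + k),
    // where k is the number of states.
    static double[][] smooth(int[][] counts) {
        int k = counts.length;
        double[][] probs = new double[k][k];
        for (int i = 0; i < k; i++) {
            double smoothedRowTotal = 0.0;
            for (int j = 0; j < k; j++) {
                smoothedRowTotal += counts[i][j] + 1;
            }
            for (int j = 0; j < k; j++) {
                probs[i][j] = (counts[i][j] + 1) / smoothedRowTotal;
            }
        }
        return probs;
    }
}
```

For two states with counts {{3, 0}, {1, 1}}, the first row becomes 4/5 and 1/5 in place of the unsmoothed 1 and 0, so the unseen transition is no longer assigned zero probability.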

Training with Partial Data

In the real world, corpora are noisy or incomplete. As of version 3.4.0, this estimator accepts taggings with null categories. If a category is null, its emission is not trained, nor are the transitions to it, transitions from it, start states involving it, or end states involving it.

The estimator will also accept inputs with null emissions. In the case of a null emission or null category, the emission model will not be trained for that particular token/category pair.
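The skipping rules for null categories and null emissions can be sketched as a counting pass over one tagged sequence. The class PartialTaggingCounter and its tab-separated map keys are assumptions made for this example; the sketch follows the behavior described above and is not the library's code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical counter that ignores events involving null tags or tokens.
final class PartialTaggingCounter {
    final Map<String, Integer> startCounts = new HashMap<>();
    final Map<String, Integer> endCounts = new HashMap<>();
    final Map<String, Integer> transitionCounts = new HashMap<>();
    final Map<String, Integer> emissionCounts = new HashMap<>();

    // tokens and tags are parallel arrays; either may contain nulls.
    void handle(String[] tokens, String[] tags) {
        int n = tags.length;
        if (n == 0) {
            return;
        }
        // Start and end events are counted only for known tags.
        if (tags[0] != null) {
            increment(startCounts, tags[0]);
        }
        if (tags[n - 1] != null) {
            increment(endCounts, tags[n - 1]);
        }
        for (int i = 0; i < n; i++) {
            // An emission needs both a known tag and a known token.
            if (tags[i] != null && tokens[i] != null) {
                increment(emissionCounts, tags[i] + "\t" + tokens[i]);
            }
            // A transition needs known tags on both sides.
            if (i > 0 && tags[i - 1] != null && tags[i] != null) {
                increment(transitionCounts, tags[i - 1] + "\t" + tags[i]);
            }
        }
    }

    private static void increment(Map<String, Integer> counts, String key) {
        counts.merge(key, 1, Integer::sum);
    }
}
```

Given the tokens John, ran, home with tags NP, null, N, the sketch counts NP as a start state, N as an end state, and two emissions, but no transitions, because both transitions touch the unknown middle tag.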

The HMM evaluator has been updated to support unknown taggings. See the class documentation for HmmEvaluator.

Since:
LingPipe2.1
Version:
3.8
Author:
Bob Carpenter

Constructor Summary
HmmCharLmEstimator()
          Construct an HMM estimator with default parameter settings.
HmmCharLmEstimator(int charLmMaxNGram, int maxCharacters, double charLmInterpolation)
          Construct an HMM estimator with the specified maximum character n-gram size, maximum number of characters in the data, and character n-gram interpolation parameter, with no state smoothing.
HmmCharLmEstimator(int charLmMaxNGram, int maxCharacters, double charLmInterpolation, boolean smootheStates)
          Construct an HMM estimator with the specified maximum character n-gram size, maximum number of characters in the data, character n-gram interpolation parameter, and state smoothing.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Compiles a copy of this estimated HMM to the specified object output.
 NGramBoundaryLM emissionLm(String state)
          Returns the language model used for emission probabilities for the specified state.
 double emitLog2Prob(String state, CharSequence emission)
          Returns the log (base 2) of the emission estimate.
 double emitProb(String state, CharSequence emission)
          Returns the estimate of the probability of the specified string being emitted from the specified state.
 double endProb(String state)
          Returns the end probability for the specified state.
 double startProb(String state)
          Returns the start probability for the specified state.
 void trainEmit(String state, CharSequence emission)
          Train the emission estimator with the specified training instance consisting of a state and emission.
 void trainEnd(String state)
          Train the end state estimator with the specified end state.
 void trainStart(String state)
          Train the start state estimator with the specified start state.
 void trainTransit(String sourceState, String targetState)
          Trains the transition estimator with the specified transition from the specified source state to the specified target state.
 double transitProb(String source, String target)
          Returns the transition estimate from the specified source state to the specified target state.
 
Methods inherited from class com.aliasi.hmm.AbstractHmmEstimator
handle, handle, numTrainingCases, numTrainingTokens
 
Methods inherited from class com.aliasi.hmm.AbstractHmm
addState, emitLog2Prob, emitProb, endLog2Prob, endLog2Prob, endProb, startLog2Prob, startLog2Prob, startProb, stateSymbolTable, transitLog2Prob, transitLog2Prob, transitProb
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HmmCharLmEstimator

public HmmCharLmEstimator()
Construct an HMM estimator with default parameter settings. The defaults are 6 for the maximum character n-gram, Character.MAX_VALUE-1 for the maximum number of characters, 6.0 for the character n-gram interpolation factor, and no state likelihood smoothing.


HmmCharLmEstimator

public HmmCharLmEstimator(int charLmMaxNGram,
                          int maxCharacters,
                          double charLmInterpolation)
Construct an HMM estimator with the specified maximum character n-gram size, maximum number of characters in the data, and character n-gram interpolation parameter, with no state smoothing. For more information on these parameters, see NGramBoundaryLM.NGramBoundaryLM(int,int,double,char).

Parameters:
charLmMaxNGram - Maximum n-gram for emission character language models.
maxCharacters - Maximum number of unique characters in the training and test data.
charLmInterpolation - Interpolation parameter for character language models.
Throws:
IllegalArgumentException - If the max n-gram is less than one, the max characters is less than 1 or greater than Character.MAX_VALUE-1, or if the interpolation parameter is negative or greater than 1.0.

HmmCharLmEstimator

public HmmCharLmEstimator(int charLmMaxNGram,
                          int maxCharacters,
                          double charLmInterpolation,
                          boolean smootheStates)
Construct an HMM estimator with the specified maximum character n-gram size, maximum number of characters in the data, character n-gram interpolation parameter, and state smoothing. For more information on these parameters, see NGramBoundaryLM.NGramBoundaryLM(int,int,double,char). For information on state smoothing, see the class documentation above.

Parameters:
charLmMaxNGram - Maximum n-gram for emission character language models.
maxCharacters - Maximum number of unique characters in the training and test data.
charLmInterpolation - Interpolation parameter for character language models.
smootheStates - Flag indicating whether add-one smoothing is carried out for HMM states.
Throws:
IllegalArgumentException - If the max n-gram is less than one, the max characters is less than 1 or greater than Character.MAX_VALUE-1, or if the interpolation parameter is negative or greater than 1.0.

Method Detail

trainStart

public void trainStart(String state)
Description copied from class: AbstractHmmEstimator
Train the start state estimator with the specified start state. This increases the likelihood that the specified state will be the state of the first token.

Specified by:
trainStart in class AbstractHmmEstimator
Parameters:
state - State being trained.

trainEnd

public void trainEnd(String state)
Description copied from class: AbstractHmmEstimator
Train the end state estimator with the specified end state. This increases the likelihood that the specified state will be the state of the last token.

Specified by:
trainEnd in class AbstractHmmEstimator
Parameters:
state - State being trained.

trainEmit

public void trainEmit(String state,
                      CharSequence emission)
Description copied from class: AbstractHmmEstimator
Train the emission estimator with the specified training instance consisting of a state and emission. This method may be used for dictionary-based training for a particular state.

Specified by:
trainEmit in class AbstractHmmEstimator
Parameters:
state - State being trained.
emission - Emission from state being trained.

trainTransit

public void trainTransit(String sourceState,
                         String targetState)
Description copied from class: AbstractHmmEstimator
Trains the transition estimator with the specified transition from the specified source state to the specified target state.

Specified by:
trainTransit in class AbstractHmmEstimator
Parameters:
sourceState - State from which the transition is made.
targetState - State to which the transition is made.

startProb

public double startProb(String state)
Description copied from class: AbstractHmm
Returns the start probability for the specified state.

Specified by:
startProb in interface HiddenMarkovModel
Specified by:
startProb in class AbstractHmm
Parameters:
state - HMM state.
Returns:
Start probability of specified state.

endProb

public double endProb(String state)
Description copied from class: AbstractHmm
Returns the end probability for the specified state.

Specified by:
endProb in interface HiddenMarkovModel
Specified by:
endProb in class AbstractHmm
Parameters:
state - HMM state.
Returns:
End probability of specified state.

transitProb

public double transitProb(String source,
                          String target)
Returns the transition estimate from the specified source state to the specified target state. For this estimator, this is just the maximum likelihood estimate. If all transitions should be allowed, then each pair of states should be presented in both orders to trainTransit(String,String), in order to produce add-one smoothing. Typically, maximum likelihood estimates of state transitions are fine for HMMs trained with large sets of supervised data.

Specified by:
transitProb in interface HiddenMarkovModel
Specified by:
transitProb in class AbstractHmm
Parameters:
source - Originating state for the transition.
target - Resulting state after the transition.
Returns:
Maximum likelihood estimate of transition probability given training data.

emitProb

public double emitProb(String state,
                       CharSequence emission)
Returns the estimate of the probability of the specified string being emitted from the specified state. For a character language-model based HMM, this is just the language model estimate of the string likelihood of the emission for the particular state.

Specified by:
emitProb in interface HiddenMarkovModel
Specified by:
emitProb in class AbstractHmm
Parameters:
state - State of HMM.
emission - String emitted by state.
Returns:
Estimate of probability of state emitting string.

emitLog2Prob

public double emitLog2Prob(String state,
                           CharSequence emission)
Description copied from class: AbstractHmm
Returns the log (base 2) of the emission estimate. See AbstractHmm.emitProb(String,CharSequence) for more information.

This method is implemented in terms of com.aliasi.util.Math.log2(double) and AbstractHmm.emitProb(String,CharSequence).

Specified by:
emitLog2Prob in interface HiddenMarkovModel
Overrides:
emitLog2Prob in class AbstractHmm
Parameters:
state - Label of state.
emission - Character sequence emitted.
Returns:
Log (base 2) estimate of likelihood of the state emitting the string.
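
The relationship between the linear and log (base 2) emission estimates can be sketched in plain Java. The class Log2Sketch is hypothetical and uses the change-of-base identity with java.lang.Math; it is not the library's implementation.

```java
// Hypothetical helper showing log base 2 via change of base.
final class Log2Sketch {

    // log2(x) = ln(x) / ln(2)
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Converts a linear emission probability to its log (base 2) value.
    static double emitLog2Prob(double emitProb) {
        return log2(emitProb);
    }
}
```

An emission probability of 0.25 corresponds to a log (base 2) value of -2, and a probability of 0.0 maps to negative infinity, so callers that sum log estimates should expect infinite values for impossible emissions.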

emissionLm

public NGramBoundaryLM emissionLm(String state)
Returns the language model used for emission probabilities for the specified state. By grabbing the models directly in this way, they may be pruned, etc., before being compiled.

Parameters:
state - State of the HMM.
Returns:
The language model for the specified state.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Description copied from class: AbstractHmmEstimator
Compiles a copy of this estimated HMM to the specified object output. Reading in the resulting bytes with an object input will produce an instance of HiddenMarkovModel, but will most likely not be an instance of the same class as the object being compiled.

Specified by:
compileTo in interface Compilable
Specified by:
compileTo in class AbstractHmmEstimator
Parameters:
objOut - Object output to which this estimator is compiled.
Throws:
IOException - If there is an I/O exception compiling this object.
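
The idea that reading back compiled bytes yields an object of a different class than the one written can be sketched with standard Java serialization. Everything in the block is hypothetical: the Estimator and Compiled classes are toy stand-ins, and the use of writeReplace is an assumption about one way to get this behavior, not a description of how the library compiles models.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical round trip: a trainable object is written, and a
// compact read-only object is read back.
final class CompileRoundTrip {

    // Toy trainable model holding mutable counts.
    static final class Estimator implements Serializable {
        private static final long serialVersionUID = 1L;
        private int hits = 0;
        private int total = 0;

        void train(boolean hit) {
            if (hit) {
                hits++;
            }
            total++;
        }

        // Substitutes a compiled form when this object is serialized.
        private Object writeReplace() {
            return new Compiled(hits / (double) total);
        }
    }

    // Toy compiled model holding only the final estimate.
    static final class Compiled implements Serializable {
        private static final long serialVersionUID = 1L;
        final double prob;

        Compiled(double prob) {
            this.prob = prob;
        }
    }

    // Writes the model to a byte array and reads it back.
    static Object roundTrip(Object model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(model);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, passing a trained Estimator to roundTrip returns a Compiled instance, which mirrors the note above that the deserialized object need not share the class of the object being compiled.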