com.aliasi.hmm
Class HmmDecoder

java.lang.Object
  extended by com.aliasi.hmm.HmmDecoder
All Implemented Interfaces:
MarginalTagger<String>, NBestTagger<String>, Tagger<String>

public class HmmDecoder
extends Object
implements Tagger<String>, NBestTagger<String>, MarginalTagger<String>

An HmmDecoder provides implementations of first-best, n-best and marginal taggers for hidden Markov models (HMMs). A decoder is constructed from a hidden Markov model.

First-best Output

HMM decoders implement the interface Tagger, which specifies a first-best tagging method tag(List). This method provides the likely (first best) path of HMM states (tags) given a sequence of string emissions (outputs). First-best decoding is implemented using Viterbi's dynamic programming algorithm.

N-best Output

HMM decoders also implement the interface NBestTagger, which specifies the method tagNBest(List,int) and tagNBestConditional(List,int). These methods both return an iterator over scored taggings. N-best decoding is implemented using the Viterbi algorithm in a forward pass and the A* algorithm in the backward pass using the Viterbi estimates as exact completion estimates. The variant conditional method further normalizes outputs to posterior conditional probabilities, and is a bit more expensive to compute.

Confidence and Lattice Output

HMM decoders also implement the MarginalTagger interface, which specifies a method tagMarginal(List) for providing marginal probability estimates for categories for a token given the input string. Marginal decoding is implemented using the standard forward-backward algorithm. The lattice is an instance of TagLattice, which itself implements TagLattice; see that class's documentation for information on how to retrieve cumulative (total) probabilities for input token sequences and posterior conditional probabilities (confidence scores) per token.

Caching

The decoder is able to cache emission probabilities, the computation of which is often the bottleneck in decoding. Caches may be provided in the constructor for both ordinary (linear) probabilities and log probabilities of emissions. A cache is simply an instance of Map from strings to arrays of doubles.

The first-best and n-best decoders only uses log probabilities, whereas the n-best normalized and lattice decoders use linear probabilities. Only the probabilities computed are cached, so if a program only does first-best processing, only linear estimates are cached.

Any implementation of Map may be used as a cache, but particular attention must be paid to thread safety and scalability. A fully synchronized cache can be created with:

 Map<String,double[]> cache 
     = java.util.Collections.synchronizedMap(new HashMap<String,double[]>());
LingPipe's map implementation FastCache is designed specifically to be used as a cache in settings such as these.

It is often (e.g. on English newsire) easy to get high token coverage (e.g. 97%) with a rather modestly sized cache (e.g. 100K tokens). Other corpora and languages may vary and we encourage experimentation with efficiency versus memory for caching. Note that run times will speed up as more and more estimates are returned from the cache rather than being computed directly.

Synchrnonization and Thread Safety

This class does not perform any underlying sychronization. If the hidden Markov model is not thread safe, then it must be synchronized. Similarly for the caches. Note that FastCache, while not synchronized, is thread safe. Similarly, the compilation of an HMM trained with HmmCharLmEstimator is thread safe, in fact allowing safe concurrent access because it is immutable.

Beam and Pruning

For first-best and n-best decoding, a beam may be used to prune unlikely hypotheses. This beam is set during construction or through the method setLog2EmissionBeam(double) (setting and access must be concurrent read/exclusive write synchronized from the caller). The beam works token by token. As each token is considered, any tag whose emission log (base 2) likelihood is more than the beam less than the bes5t emission estimate is eliminated from further consideration.

Since:
LingPipe2.1
Version:
4.0.0
Author:
Bob Carpenter

Constructor Summary
HmmDecoder(HiddenMarkovModel hmm)
          Construct an HMM decoder from the specified HMM.
HmmDecoder(HiddenMarkovModel hmm, Map<String,double[]> emissionCache, Map<String,double[]> emissionLog2Cache)
          Construct an HMM decoder from the specified HMM using the specified caches for linear and log probabilities.
HmmDecoder(HiddenMarkovModel hmm, Map<String,double[]> emissionCache, Map<String,double[]> emissionLog2Cache, double log2Beam, double log2EmissionBeam)
          Construct an HMM decoder from the specified HMM using the specified caches for linear and log probabilities, with the specified beam width for emission estimates.
 
Method Summary
 Map<String,double[]> emissionCache()
          Returns the mapping used to cache emission probabilities, or null if not caching.
 Map<String,double[]> emissionLog2Cache()
          Returns the mapping used to cache log (base 2) emission probabilities, or null if not caching.
 HiddenMarkovModel getHmm()
          Returns the hidden Markov model underlying this decoder.
 void setEmissionCache(Map<String,double[]> cache)
          Sets the emission cache to the specified value.
 void setEmissionLog2Cache(Map<String,double[]> cache)
          Sets the log emission cache to the specified value.
 void setLog2Beam(double log2Beam)
          Sets the value of the log2 beam to the specified value.
 void setLog2EmissionBeam(double log2EmissionBeam)
          Sets the log (base 2) emission beam width.
 Tagging<String> tag(List<String> tokens)
          Return the tagging for the specified list of tokens.
 TagLattice<String> tagMarginal(List<String> tokens)
          Return the marginal tagging for the specified list of input tokens.
 Iterator<ScoredTagging<String>> tagNBest(List<String> tokens, int maxResults)
          Return an iterator over the n-best scored taggings for the specified input tokens up to a specified maximum n.
 Iterator<ScoredTagging<String>> tagNBestConditional(List<String> tokens, int maxResults)
          Return an iterator over the n-best scored taggings for the specified input tokens up to a specified maximum n, with scores normalized to conditional probabilities.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HmmDecoder

public HmmDecoder(HiddenMarkovModel hmm)
Construct an HMM decoder from the specified HMM. No caching is applied to estimates, and the beams are set to positive infinity, turning off pruning. This constructor is appropriate for dynamic models with changing probability estimates.

Parameters:
hmm - Model to use as basis of decoding.

HmmDecoder

public HmmDecoder(HiddenMarkovModel hmm,
                  Map<String,double[]> emissionCache,
                  Map<String,double[]> emissionLog2Cache)
Construct an HMM decoder from the specified HMM using the specified caches for linear and log probabilities. The beams are set to positive infinity, turning off pruning. Either or both of the caches may be null, in which case the corresponding values will not be cached.

Parameters:
hmm - Model to use for decoding.
emissionCache - Map to use for emission caching.
emissionLog2Cache - Map to use for log emission caching.

HmmDecoder

public HmmDecoder(HiddenMarkovModel hmm,
                  Map<String,double[]> emissionCache,
                  Map<String,double[]> emissionLog2Cache,
                  double log2Beam,
                  double log2EmissionBeam)
Construct an HMM decoder from the specified HMM using the specified caches for linear and log probabilities, with the specified beam width for emission estimates. Either or both of the caches may be null, in which case the corresponding values will not be cached.

Parameters:
hmm - Model to use for decoding.
emissionCache - Map to use for emission caching.
emissionLog2Cache - Map to use for log emission caching.
log2Beam - The log (base 2) beam for pruning full hypotheses.
log2EmissionBeam - The log (base 2) beam for pruning emission hypotheses.
Throws:
IllegalArgumentException - If either beam is not a non-negative number.
Method Detail

getHmm

public HiddenMarkovModel getHmm()
Returns the hidden Markov model underlying this decoder. The returned value is the actual HMM used by this decoder, so changes to it will affect this decoder.

Returns:
The HMM for this decoder.

emissionCache

public Map<String,double[]> emissionCache()
Returns the mapping used to cache emission probabilities, or null if not caching. This is the actual mapping, so changes to it will affect this decoder.

Returns:
The emission probability cache.

emissionLog2Cache

public Map<String,double[]> emissionLog2Cache()
Returns the mapping used to cache log (base 2) emission probabilities, or null if not caching. This is the actual mapping, so changes to it will affect this decoder.

Returns:
The emission probability cache.

setEmissionCache

public void setEmissionCache(Map<String,double[]> cache)
Sets the emission cache to the specified value.

Warning: This method should not be executed concurrently with any calls to decoding, as it may produce an inconsistent result. The typical application will be to set a cache before using a decoder.

Parameters:
cache - Cache for linear emission estimates.

setLog2EmissionBeam

public void setLog2EmissionBeam(double log2EmissionBeam)
Sets the log (base 2) emission beam width. Any tag with a log (base 2) emission probability more than the beam width less than the best hypothesis is discarded. Setting the beam width to zero results in pruning to category with the best emission. Setting the beam width to positive infinity effectively turns off the beam.

Parameters:
log2EmissionBeam - Width of beam.
Throws:
IllegalArgumentException - If the beam width is negative.

setLog2Beam

public void setLog2Beam(double log2Beam)
Sets the value of the log2 beam to the specified value. This beam controls pruning based on full Viterbi values for a given lattice slice. See the class documentation above for more details.

Parameters:
log2Beam - The log (base 2) Viterbi beam.
Throws:
IllegalArgumentException - If the beam width is negative.

setEmissionLog2Cache

public void setEmissionLog2Cache(Map<String,double[]> cache)
Sets the log emission cache to the specified value.

Warning: This method should not be executed concurrently with any calls to decoding, as it may produce an inconsistent result. The typical application will be to set a cache before using a decoder.

Parameters:
cache - Cache for linear emission estimates.

tag

public Tagging<String> tag(List<String> tokens)
Description copied from interface: Tagger
Return the tagging for the specified list of tokens.

Specified by:
tag in interface Tagger<String>
Parameters:
tokens - Input tokens to tag.
Returns:
Tagging for the specified input tokens.

tagNBest

public Iterator<ScoredTagging<String>> tagNBest(List<String> tokens,
                                                int maxResults)
Description copied from interface: NBestTagger
Return an iterator over the n-best scored taggings for the specified input tokens up to a specified maximum n.

Specified by:
tagNBest in interface NBestTagger<String>
Parameters:
tokens - Input tokens to tag.
maxResults - Maximum number of results to return.
Returns:
Iterator over the n-best scored taggings for the specified tokens.

tagNBestConditional

public Iterator<ScoredTagging<String>> tagNBestConditional(List<String> tokens,
                                                           int maxResults)
Description copied from interface: NBestTagger
Return an iterator over the n-best scored taggings for the specified input tokens up to a specified maximum n, with scores normalized to conditional probabilities.

Optional operation.

Specified by:
tagNBestConditional in interface NBestTagger<String>
Parameters:
tokens - Input tokens to tag.
maxResults - Maximum number of results to return.
Returns:
Iterator over the n-best scored taggings for the specified tokens.

tagMarginal

public TagLattice<String> tagMarginal(List<String> tokens)
Description copied from interface: MarginalTagger
Return the marginal tagging for the specified list of input tokens.

Specified by:
tagMarginal in interface MarginalTagger<String>
Parameters:
tokens - Input tokens to tag.
Returns:
The lattice of tags for the specified tokens.