com.aliasi.lm
Class UniformBoundaryLM

java.lang.Object
  extended by com.aliasi.lm.UniformBoundaryLM
All Implemented Interfaces:
LanguageModel, LanguageModel.Dynamic, LanguageModel.Sequence, Compilable

public class UniformBoundaryLM
extends Object
implements LanguageModel.Dynamic, LanguageModel.Sequence

A UniformBoundaryLM implements a uniform sequence language model with a specified number of outcomes and the same probability assigned to the end-of-stream marker. The formula for computing sequence likelihood estimates is:

log2Estimate(cSeq) = - (cSeq.length()+1) * log2 (numOutcomes+1)
Adding one to the number of outcomes makes the end-of-sequence just as likely as any other character. Adding one to the sequence length adds the log likelihood of the end-of-sequence marker itself.
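
The formula can be checked by hand with a small standalone sketch. The class and helper names below are hypothetical and are not part of LingPipe; the code only restates the arithmetic described above, assuming each character and the end-of-sequence marker has probability 1/(numOutcomes+1).

```java
class UniformEstimateSketch {

    // Base-2 logarithm via change of base.
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Hypothetical helper: log (base 2) estimate of a sequence under a
    // uniform model with numOutcomes characters plus one boundary symbol.
    static double log2Estimate(CharSequence cSeq, int numOutcomes) {
        double log2PerSymbol = -log2(numOutcomes + 1.0);
        // One term per character, plus one for the end-of-sequence marker.
        return log2PerSymbol * (cSeq.length() + 1.0);
    }

    public static void main(String[] args) {
        // 255 outcomes plus the boundary gives 256 equally likely symbols,
        // i.e. 8 bits per symbol. "abc" has 3 characters plus the boundary,
        // so the estimate is approximately -32.0.
        System.out.println(log2Estimate("abc", 255));
        // The empty sequence still pays for the boundary: approximately -8.0.
        System.out.println(log2Estimate("", 255));
    }
}
```

With zero outcomes the per-symbol term is log2(1) = 0, so every sequence receives a zero log estimate.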

This model is defined as dynamic for convenience. Calls to the training methods have no effect.

Since:
LingPipe 2.0
Version:
3.8.1
Author:
Bob Carpenter

Nested Class Summary
 
Nested classes/interfaces inherited from interface com.aliasi.lm.LanguageModel
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
 
Field Summary
static UniformBoundaryLM ZERO_LM
          A constant uniform boundary language model returning zero log estimates.
 
Constructor Summary
UniformBoundaryLM()
          Construct a uniform boundary language model with the full set of characters.
UniformBoundaryLM(double crossEntropyRate)
          Create a constant uniform boundary LM with the specified character cross-entropy rate.
UniformBoundaryLM(int numOutcomes)
          Construct a uniform boundary language model with the specified number of outcomes.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Writes a compiled version of this model to the specified object output.
 double log2Estimate(char[] cs, int start, int end)
          Returns an estimate of the log (base 2) probability of the specified character slice.
 double log2Estimate(CharSequence cSeq)
          Returns an estimate of the log (base 2) probability of the specified character sequence.
 int numOutcomes()
          Returns the number of outcomes for this uniform model.
 void train(char[] cs, int start, int end)
          Ignores the training data.
 void train(char[] cs, int start, int end, int count)
          Ignores the training data.
 void train(CharSequence cSeq)
          Ignores the training data.
 void train(CharSequence cSeq, int count)
          Ignores the training data.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ZERO_LM

public static final UniformBoundaryLM ZERO_LM
A constant uniform boundary language model returning zero log estimates. This is done by setting the number of characters to zero.

This constant is particularly useful for removing the contribution of whitespace characters to token n-gram language models.

Constructor Detail

UniformBoundaryLM

public UniformBoundaryLM()
Construct a uniform boundary language model with the full set of characters.


UniformBoundaryLM

public UniformBoundaryLM(int numOutcomes)
Construct a uniform boundary language model with the specified number of outcomes. The estimate includes the end-of-stream boundary outcome, so the per-character probability estimate is 1/(numOutcomes+1).

Parameters:
numOutcomes - Number of outcomes.

UniformBoundaryLM

public UniformBoundaryLM(double crossEntropyRate)
Create a constant uniform boundary LM with the specified character cross-entropy rate. Recall that cross-entropy is the negative average log probability per character. Because a boundary model also counts the final terminator, the log estimate returned is:
log2 P(cs) = - crossEntropyRate * (cs.length() + 1)
The number of outcomes is set by raising two to the cross-entropy rate, rounding down, and subtracting one for the boundary character:
numOutcomes = (int) (2.0^crossEntropyRate) - 1
If the above expression evaluates to less than zero, the number of outcomes is rounded up to zero.

Parameters:
crossEntropyRate - The cross-entropy rate of the model.
Throws:
IllegalArgumentException - If the cross-entropy rate is not finite and non-negative.
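
The relationship between the cross-entropy rate, the derived number of outcomes, and the returned estimate can be sketched as follows. This is an illustration of the documented formulas only; the class and method names are hypothetical, and the argument check is an assumption modeled on the documented exception, not LingPipe's actual source.

```java
class CrossEntropySketch {

    // Hypothetical helper mirroring the documented rule:
    // numOutcomes = (int) (2.0^crossEntropyRate) - 1, floored at zero.
    static int numOutcomes(double crossEntropyRate) {
        // Assumed validation: reject NaN, infinite, and negative rates.
        if (Double.isNaN(crossEntropyRate)
            || Double.isInfinite(crossEntropyRate)
            || crossEntropyRate < 0.0) {
            throw new IllegalArgumentException(
                "Cross-entropy rate must be finite and non-negative.");
        }
        // The cast applies to the power before the subtraction.
        int n = (int) Math.pow(2.0, crossEntropyRate) - 1;
        return Math.max(0, n);
    }

    // Hypothetical helper: log2 P(cs) = - crossEntropyRate * (cs.length() + 1).
    static double log2Estimate(CharSequence cs, double crossEntropyRate) {
        return -crossEntropyRate * (cs.length() + 1);
    }

    public static void main(String[] args) {
        System.out.println(numOutcomes(8.0));        // 2^8 - 1 = 255
        System.out.println(numOutcomes(1.5));        // floor(2.83) - 1 = 1
        System.out.println(log2Estimate("abc", 2.0)); // -2 * 4 = -8.0
    }
}
```

A rate of 0.0 yields zero outcomes and a zero log estimate for every sequence.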
Method Detail

numOutcomes

public int numOutcomes()
Returns the number of outcomes for this uniform model.

Returns:
The number of outcomes for this uniform model.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Writes a compiled version of this model to the specified object output. The object read back in will also be an instance of UniformBoundaryLM.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this model is written.
Throws:
IOException - If there is an I/O error during the write.

train

public void train(char[] cs,
                  int start,
                  int end)
Ignores the training data.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Ignored.
start - Ignored.
end - Ignored.

train

public void train(char[] cs,
                  int start,
                  int end,
                  int count)
Ignores the training data.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Ignored.
start - Ignored.
end - Ignored.
count - Ignored.

train

public void train(CharSequence cSeq)
Ignores the training data.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Ignored.

train

public void train(CharSequence cSeq,
                  int count)
Ignores the training data.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Ignored.
count - Ignored.

log2Estimate

public double log2Estimate(char[] cs,
                           int start,
                           int end)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character slice.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
Log estimate of likelihood of specified character sequence.
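
For a uniform model the slice estimate depends only on the slice length, end - start, and not on the characters themselves. A minimal sketch under that assumption follows; the class, the helper, and the choice of IndexOutOfBoundsException for bad indices are hypothetical and are not taken from LingPipe.

```java
class SliceEstimateSketch {

    // Hypothetical helper: estimate for the slice cs[start..end), where
    // end is one plus the index of the last character in the slice.
    static double log2Estimate(char[] cs, int start, int end, int numOutcomes) {
        // Assumed bounds check; the real class may report errors differently.
        if (start < 0 || end > cs.length || start > end) {
            throw new IndexOutOfBoundsException(
                "start=" + start + " end=" + end + " length=" + cs.length);
        }
        int length = end - start;
        double log2PerSymbol = -Math.log(numOutcomes + 1.0) / Math.log(2.0);
        // Slice characters plus one end-of-sequence marker.
        return log2PerSymbol * (length + 1.0);
    }

    public static void main(String[] args) {
        char[] cs = "hello world".toCharArray();
        // Slice "world" (indices 6..11) with 3 outcomes: 4 equally likely
        // symbols at 2 bits each, 6 symbols in total, approximately -12.0.
        System.out.println(log2Estimate(cs, 6, 11, 3));
    }
}
```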

log2Estimate

public double log2Estimate(CharSequence cSeq)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character sequence.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cSeq - Character sequence to estimate.
Returns:
Log estimate of likelihood of specified character sequence.