com.aliasi.lm
Class NGramBoundaryLM

java.lang.Object
  extended by com.aliasi.lm.NGramBoundaryLM
All Implemented Interfaces:
Handler, TextHandler, LanguageModel, LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Sequence, Model<CharSequence>, Compilable

public class NGramBoundaryLM
extends Object
implements TextHandler, LanguageModel.Sequence, LanguageModel.Conditional, LanguageModel.Dynamic, Model<CharSequence>

An NGramBoundaryLM provides a dynamic sequence language model for which training, estimation and pruning may be interleaved. A sequence language model normalizes probabilities over all sequences.

The model may be compiled to an object output; the compiled model read from the corresponding object input will be an instance of CompiledNGramBoundaryLM.

This class wraps an n-gram process language model. A special boundary character, boundaryChar, is supplied at construction time and is counted toward the total number of characters used to define the estimator. For each training event, the boundary character is inserted both before and after the character sequence provided. The unigram count of the boundary character must then be decremented so that the initial boundary character isn't counted in estimates. During estimation, the initial boundary character is used as context and the final one is used to estimate the end-of-stream likelihood. Thus if Ppr is the underlying process model then the boundary model defines estimates by:

Pb(c1,...,cN)
  = Ppr(boundaryChar|boundaryChar,c1,...,cN)
    * Π1<=i<=N Ppr(ci|boundaryChar,c1,...,ci-1)
  = Ppr(boundaryChar,c1,...,cN,boundaryChar) / Ppr(boundaryChar)
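
As a rough illustration of the definition above, the following self-contained sketch wraps a toy character-bigram model with a boundary character. The class BoundaryEstimateSketch, its method names and its add-one smoothing are assumptions made for this example; they are not LingPipe's implementation, whose process model is configured with an interpolation hyperparameter.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy boundary model over character bigrams; illustration only. */
class BoundaryEstimateSketch {
    static final char BOUNDARY = '\uFFFF';

    // counts.get(prev).get(next) = number of times next followed prev
    private final Map<Character, Map<Character, Integer>> counts = new HashMap<>();
    private final int numChars; // alphabet size, including the boundary character

    BoundaryEstimateSketch(int numChars) {
        this.numChars = numChars;
    }

    /** Trains on boundary + cs + boundary, mirroring the training events described above. */
    void train(CharSequence cs) {
        String wrapped = BOUNDARY + cs.toString() + BOUNDARY;
        for (int i = 1; i < wrapped.length(); i++) {
            counts.computeIfAbsent(wrapped.charAt(i - 1), k -> new HashMap<>())
                  .merge(wrapped.charAt(i), 1, Integer::sum);
        }
    }

    /** Add-one smoothed estimate of P(next | prev); a stand-in for Ppr. */
    double conditional(char prev, char next) {
        Map<Character, Integer> row = counts.get(prev);
        int seen = 0;
        int total = 0;
        if (row != null) {
            seen = row.getOrDefault(next, 0);
            for (int c : row.values()) {
                total += c;
            }
        }
        return (seen + 1.0) / (total + numChars);
    }

    /** Sums the log2 conditionals for c1..cN and for the final boundary character. */
    double log2Estimate(CharSequence cs) {
        String wrapped = BOUNDARY + cs.toString() + BOUNDARY;
        double log2Sum = 0.0;
        for (int i = 1; i < wrapped.length(); i++) {
            double p = conditional(wrapped.charAt(i - 1), wrapped.charAt(i));
            log2Sum += Math.log(p) / Math.log(2.0);
        }
        return log2Sum;
    }
}
```

The initial boundary character appears only as context, while the final one is estimated like any other character, so the product of conditionals becomes a sum in log space.
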
The result of serializing and deserializing an n-gram boundary language model is a compiled implementation of a conditional sequence language model. The serialization format is the boundary character followed by the serialization of the contained writable process language model.

Models may be pruned by pruning the substring counter returned by substringCounter(). See the documentation for the class of the return object, TrieCharSeqCounter, for more information.

Since:
LingPipe2.0
Version:
3.9.1
Author:
Bob Carpenter

Nested Class Summary
 
Nested classes/interfaces inherited from interface com.aliasi.lm.LanguageModel
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
 
Constructor Summary
NGramBoundaryLM(int maxNGram)
          Constructs a dynamic n-gram sequence language model with the specified maximum n-gram and default values for other parameters.
NGramBoundaryLM(int maxNGram, int numChars)
          Constructs a dynamic n-gram sequence language model with the specified maximum n-gram, specified maximum number of observed characters, and default values for other parameters.
NGramBoundaryLM(int maxNGram, int numChars, double lambdaFactor, char boundaryChar)
          Construct a dynamic n-gram sequence language model with the specified maximum n-gram length, number of characters, interpolation ratio hyperparameter and boundary character.
NGramBoundaryLM(NGramProcessLM processLm, char boundaryChar)
          Construct an n-gram boundary language model with the specified boundary character and underlying process language model.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Writes a compiled version of this boundary language model to the specified object output.
 NGramProcessLM getProcessLM()
          Returns the underlying n-gram process language model for this boundary language model.
 void handle(char[] cs, int start, int length)
          Deprecated. Use handle(CharSequence) instead.
 void handle(CharSequence cSeq)
          Train the language model on the specified character sequence.
 double log2ConditionalEstimate(char[] cs, int start, int end)
          Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters.
 double log2ConditionalEstimate(CharSequence cs)
          Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified character sequence given the previous characters.
 double log2Estimate(char[] cs, int start, int end)
          Returns an estimate of the log (base 2) probability of the specified character slice.
 double log2Estimate(CharSequence cs)
          Returns an estimate of the log (base 2) probability of the specified character sequence.
 double log2Prob(CharSequence cSeq)
          This method is a convenience implementation of the Model interface which delegates the call to log2Estimate(CharSequence).
 char[] observedCharacters()
          Returns the characters that have been observed for this language model, including the special boundary character.
 double prob(CharSequence cSeq)
          This method is a convenience implementation of the Model interface which returns the result of raising 2.0 to the power of the result of a call to log2Estimate(CharSequence).
static NGramBoundaryLM readFrom(InputStream in)
          Read a boundary language model from the specified input stream.
 TrieCharSeqCounter substringCounter()
          Returns the underlying substring counter for this language model.
 String toString()
          Returns a string-based representation of this language model.
 void train(char[] cs, int start, int end)
          Update the model with the training data provided by the specified character slice.
 void train(char[] cs, int start, int end, int count)
          Update the model with the training data provided by the specified character slice with the specified count.
 void train(CharSequence cs)
          Update the model with the training data provided by the specified character sequence with a count of one.
 void train(CharSequence cs, int count)
          Update the model with the training data provided by the specified character sequence with the specified count.
 void writeTo(OutputStream out)
          Writes this language model to the specified output stream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

NGramBoundaryLM

public NGramBoundaryLM(int maxNGram)
Constructs a dynamic n-gram sequence language model with the specified maximum n-gram and default values for other parameters.

The default number of characters is Character.MAX_VALUE-1, the default interpolation parameter ratio is equal to the n-gram length, and the boundary character is the Unicode noncharacter U+FFFF.

Parameters:
maxNGram - Maximum n-gram length in model.

NGramBoundaryLM

public NGramBoundaryLM(int maxNGram,
                       int numChars)
Constructs a dynamic n-gram sequence language model with the specified maximum n-gram, specified maximum number of observed characters, and default values for other parameters.

The default interpolation parameter ratio is equal to the n-gram length, and the boundary character is the Unicode noncharacter U+FFFF.

Parameters:
maxNGram - Maximum n-gram length in model.
numChars - Maximum number of distinct characters seen in training and test sets.

NGramBoundaryLM

public NGramBoundaryLM(int maxNGram,
                       int numChars,
                       double lambdaFactor,
                       char boundaryChar)
Construct a dynamic n-gram sequence language model with the specified maximum n-gram length, number of characters, interpolation ratio hyperparameter and boundary character. Note that the boundary character must not occur as a regular character in the input. Unicode provides several options for marker characters; for instance, the noncharacters U+FFFF and U+FFFE are reserved for internal use by applications and are not meant to appear in interchanged Unicode text, which makes them good choices for boundary characters. The byte-order mark U+FEFF is a less safe choice because it can legitimately appear at the start of a text stream. See: Unicode Standard, Chapter 15.8: Noncharacters

Parameters:
maxNGram - Maximum n-gram length in model.
numChars - Maximum number of distinct characters seen in training and test sets.
lambdaFactor - Interpolation ratio hyperparameter.
boundaryChar - Boundary character.
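
The requirement that the boundary character never occur as a regular input character can be checked before training. The sketch below is a hypothetical helper, not part of LingPipe; the class name BoundaryCharCheck and both method names are assumptions made for illustration.

```java
/** Hypothetical pre-training checks for a boundary character; not part of LingPipe. */
class BoundaryCharCheck {

    /** Returns true if the boundary character occurs anywhere in the text. */
    static boolean containsBoundary(CharSequence text, char boundaryChar) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == boundaryChar) {
                return true;
            }
        }
        return false;
    }

    /** Returns a copy of the text with every boundary character removed. */
    static String stripBoundary(CharSequence text, char boundaryChar) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != boundaryChar) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Whether to reject or strip offending input is an application decision; the sketch shows both options.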

NGramBoundaryLM

public NGramBoundaryLM(NGramProcessLM processLm,
                       char boundaryChar)
Construct an n-gram boundary language model with the specified boundary character and underlying process language model.

This constructor may be used to reconstitute a serialized model. The trie character sequence counter of the underlying process language model can be written out and later read back in. The restored counter may be used to construct a process language model, which in turn may be passed to this constructor to rebuild the boundary language model.

Parameters:
processLm - Underlying process language model.
boundaryChar - Character used to encode boundaries.
Method Detail

writeTo

public void writeTo(OutputStream out)
             throws IOException
Writes this language model to the specified output stream.

A bit output is wrapped around the output stream for writing. The format begins with a delta-encoding of the boundary character plus 1, and is followed by the bit output of the underlying process language model.

Parameters:
out - Output stream to which the language model is written.
Throws:
IOException - If there is an underlying I/O error.
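
To make the first field of the format concrete, here is a minimal sketch of a delta code, assuming that "delta-encoding" refers to the Elias delta code; the source does not spell this out. The class EliasDeltaSketch is hypothetical and renders bits as a string of '0' and '1' characters instead of writing to a bit output. Elias delta codes represent positive integers only, which is one plausible reason for adding 1 to the boundary character before encoding.

```java
/** Illustration of Elias delta coding; bits are returned as a '0'/'1' string. */
class EliasDeltaSketch {

    static String encode(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("Elias delta codes positive integers only: " + n);
        }
        // Binary form of n; it has N bits and always starts with 1.
        String valueBits = Long.toBinaryString(n);
        // Binary form of the bit length N.
        String lengthBits = Long.toBinaryString(valueBits.length());
        StringBuilder sb = new StringBuilder();
        // Prefix of zeros, one fewer than the number of bits in N.
        for (int i = 1; i < lengthBits.length(); i++) {
            sb.append('0');
        }
        sb.append(lengthBits);
        // Remaining bits of n with the leading 1 dropped.
        sb.append(valueBits, 1, valueBits.length());
        return sb.toString();
    }
}
```

For the boundary character U+FFFF, the encoded value would be 0xFFFF + 1 = 65536 under this assumption.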

readFrom

public static NGramBoundaryLM readFrom(InputStream in)
                                throws IOException
Read a boundary language model from the specified input stream.

See writeTo(OutputStream) for a description of the binary format.

Parameters:
in - Input stream from which to read the model.
Returns:
Boundary language model read from stream.
Throws:
IOException - If there is an underlying I/O error.

getProcessLM

public NGramProcessLM getProcessLM()
Returns the underlying n-gram process language model for this boundary language model. Changes to the returned model affect this language model.

Returns:
The underlying process language model.

observedCharacters

public char[] observedCharacters()
Returns the characters that have been observed for this language model, including the special boundary character.

Specified by:
observedCharacters in interface LanguageModel.Conditional
Returns:
The observed characters for this language model.

substringCounter

public TrieCharSeqCounter substringCounter()
Returns the underlying substring counter for this language model. This model may be pruned by pruning the counter returned by this method.

Returns:
The underlying substring counter for this language model.
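
The idea of pruning by count can be sketched with a flat map standing in for the trie-based counter. This is only an illustration under that assumption: PruneSketch and its prune method are hypothetical, and the actual pruning operations are those documented on TrieCharSeqCounter.

```java
import java.util.Map;

/** Toy count-threshold pruning over a flat map of substring counts. */
class PruneSketch {

    /**
     * Removes every substring whose count is below minCount.
     * Returns the number of entries removed.
     */
    static int prune(Map<String, Integer> substringCounts, int minCount) {
        int before = substringCounts.size();
        // Removing from the values view also removes the matching keys.
        substringCounts.values().removeIf(count -> count < minCount);
        return before - substringCounts.size();
    }
}
```

Because training and estimation may be interleaved with pruning, a step like this could be run periodically to bound memory use, at the cost of discarding rare substrings.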

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Writes a compiled version of this boundary language model to the specified object output. The result may be read back in by casting the result of ObjectInput.readObject() to CompiledNGramBoundaryLM.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this model is compiled.
Throws:
IOException - If there is an I/O exception during the write.

handle

@Deprecated
public void handle(char[] cs,
                              int start,
                              int length)
Deprecated. Use handle(CharSequence) instead.

Convenience implementation of the TextHandler interface, which delegates to train(char[],int,int). Note that this method uses start and length encoding of a slice, whereas the training method uses start and end encodings.

Specified by:
handle in interface TextHandler
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
length - Number of characters in slice.
Throws:
IndexOutOfBoundsException - If the indices do not fall within the specified character array.
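
The difference between the two slice encodings can be shown with a small conversion, sketched here without any LingPipe dependency. SliceConversion and toEnd are hypothetical names used only for this example.

```java
/** Converts a (start, length) slice to the (start, end) form, where end is exclusive. */
class SliceConversion {

    static int toEnd(char[] cs, int start, int length) {
        // Written to avoid integer overflow in start + length.
        if (start < 0 || length < 0 || start > cs.length - length) {
            throw new IndexOutOfBoundsException(
                "start=" + start + " length=" + length + " array length=" + cs.length);
        }
        return start + length;
    }
}
```

For the array "hello world".toCharArray(), the slice "world" is (start=6, length=5) in the handle encoding and (start=6, end=11) in the train encoding.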

handle

public void handle(CharSequence cSeq)
Train the language model on the specified character sequence. This method just delegates to train(CharSequence).

Parameters:
cSeq - Character sequence on which to train.

train

public void train(CharSequence cs,
                  int count)
Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character sequence with the specified count. Calling train(cs,n) is equivalent to calling train(cs) a total of n times.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The character sequence to use as training data.
count - Number of instances to train.
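
The stated equivalence between one counted call and repeated single calls can be checked on a toy substring counter. CountedTrainingSketch is a hypothetical stand-in, not the LingPipe class; rejecting negative counts is a choice made for this sketch only.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy counter that increments every substring of the training text. */
class CountedTrainingSketch {
    final Map<String, Integer> counts = new HashMap<>();

    void train(CharSequence cs) {
        train(cs, 1);
    }

    void train(CharSequence cs, int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative: " + count);
        }
        if (count == 0) {
            return; // same effect as calling train(cs) zero times
        }
        String s = cs.toString();
        for (int i = 0; i < s.length(); i++) {
            for (int j = i + 1; j <= s.length(); j++) {
                // One increment of size count replaces count increments of size one.
                counts.merge(s.substring(i, j), count, Integer::sum);
            }
        }
    }
}
```

A counted call does the same bookkeeping in a single pass, which matters when the same sequence is observed many times.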

train

public void train(CharSequence cs)
Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character sequence with a count of one.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The character sequence to use as training data.

train

public void train(char[] cs,
                  int start,
                  int end)
Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character slice.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.

train

public void train(char[] cs,
                  int start,
                  int end,
                  int count)
Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character slice with the specified count. Calling train(cs,start,end,n) is equivalent to calling train(cs,start,end) a total of n times.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.
count - Number of instances to train.

log2ConditionalEstimate

public double log2ConditionalEstimate(CharSequence cs)
Description copied from interface: LanguageModel.Conditional
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified character sequence given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cs - Character sequence to estimate.
Returns:
The log conditional probability estimate.

log2ConditionalEstimate

public double log2ConditionalEstimate(char[] cs,
                                      int start,
                                      int end)
Description copied from interface: LanguageModel.Conditional
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus the index of the last character in the slice.
Returns:
The log conditional probability estimate.

log2Estimate

public double log2Estimate(CharSequence cs)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character sequence.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Character sequence to estimate.
Returns:
Log estimate of likelihood of specified character sequence.

log2Estimate

public double log2Estimate(char[] cs,
                           int start,
                           int end)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character slice.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
Log estimate of likelihood of specified character sequence.

log2Prob

public double log2Prob(CharSequence cSeq)
This method is a convenience implementation of the Model interface which delegates the call to log2Estimate(CharSequence).

Specified by:
log2Prob in interface Model<CharSequence>
Parameters:
cSeq - Character sequence whose probability is returned.
Returns:
The log (base 2) probability of the specified character sequence.

prob

public double prob(CharSequence cSeq)
This method is a convenience implementation of the Model interface which returns the result of raising 2.0 to the power of the result of a call to log2Estimate(CharSequence).

Specified by:
prob in interface Model<CharSequence>
Parameters:
cSeq - Character sequence whose probability is returned.
Returns:
The probability of the specified character sequence.
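
The relationship between the log (base 2) estimate and the linear probability can be sketched as follows. Log2ToProb is a hypothetical helper written for this example; it only mirrors the arithmetic described above.

```java
/** Converts between log (base 2) estimates and linear probabilities. */
class Log2ToProb {

    /** Raises 2.0 to the power of the log (base 2) estimate. */
    static double prob(double log2Estimate) {
        return Math.pow(2.0, log2Estimate);
    }

    /** Inverse conversion, using the change-of-base identity. */
    static double log2(double p) {
        return Math.log(p) / Math.log(2.0);
    }
}
```

A log estimate of -3.0 corresponds to a probability of 0.125. For long sequences the log estimate can be so negative that the linear value underflows to 0.0 in double precision, so comparisons are usually safer in log space.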

toString

public String toString()
Returns a string-based representation of this language model. It displays the boundary character and the contained process language model.

Overrides:
toString in class Object
Returns:
A string-based representation of this language model.