com.aliasi.lm
Class NGramProcessLM

java.lang.Object
  extended by com.aliasi.lm.NGramProcessLM
All Implemented Interfaces:
Handler, ObjectHandler<CharSequence>, LanguageModel, LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, Model<CharSequence>, Compilable, Serializable

public class NGramProcessLM
extends Object
implements Model<CharSequence>, LanguageModel.Process, LanguageModel.Conditional, LanguageModel.Dynamic, ObjectHandler<CharSequence>, Serializable

An NGramProcessLM provides a dynamic conditional process language model, for which training, estimation, and pruning may be interleaved. A process language model normalizes probabilities for a given length of input.

The model may be compiled to an object output stream; the compiled model read back in will be an instance of CompiledNGramProcessLM.

This class implements a generative language model based on the chain rule, as specified by LanguageModel.Conditional. The maximum likelihood estimator (see CharSeqCounter) is smoothed by linear interpolation with the next lower-order context model:

P'(ck|cj,...,ck-1)
= lambda(cj,...,ck-1) * PML(ck|cj,...,ck-1)
    + (1-lambda(cj,...,ck-1)) * P'(ck|cj+1,...,ck-1)

The PML terms in the above definition are maximum likelihood estimates based on frequency:

 PML(ck|cj,...,ck-1)
 = count(cj,...,ck-1, ck)
 / extCount(cj,...,ck-1)
The count is just the number of times a given string has been seen in the data, whereas extCount is the number of times an extension to the string has been seen in the data:
 extCount(c1,...,cn)
 = Σd count(c1,...,cn,d)

In the parametric Witten-Bell method, the interpolation ratio lambda is defined based on extensions of the context of estimation to be:

lambda(c1,...,cn)
= extCount(c1,...,cn)
    / (extCount(c1,...,cn) + L * numExtensions(c1,...,cn))
where c1,...,cn is the conditioning context for estimation, extCount is as defined above, and numExtensions is the number of extensions of a context:
 numExtensions(c1,...,cn)
 = cardinality( { d | count(c1,...,cn,d) > 0 } )
and where L is a hyperparameter of the distribution (described below).
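The lambda definition above can be checked with a minimal self-contained arithmetic sketch; the class and method names here are illustrative and are not part of the LingPipe API:

```java
// Witten-Bell interpolation ratio, per the formula above:
//   lambda(c1,...,cn) = extCount / (extCount + L * numExtensions)
public class WittenBellLambda {

    static double lambda(long extCount, long numExtensions, double hyperparamL) {
        return extCount / (extCount + hyperparamL * numExtensions);
    }

    public static void main(String[] args) {
        // A context seen 10 times, followed by 4 distinct characters, L = 5:
        // lambda = 10 / (10 + 5 * 4) = 1/3, so the maximum likelihood
        // estimate gets weight 1/3 and the lower-order model gets 2/3.
        System.out.println(lambda(10, 4, 5.0));
    }
}
```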

As a base case, P(ck) is interpolated with the uniform distribution PU, with interpolation defined as usual with the argument to lambda being the empty (i.e. zero length) sequence:

 P(d) = lambda() * PML(d)
      + (1-lambda()) * PU(d)
The uniform distribution PU depends only on the number of possible characters in the training and test data:
 PU(c) = 1/alphabetSize
where alphabetSize is the maximum number of distinct characters in this model.

The free hyperparameter L in the smoothing equation determines the balance between higher-order and lower-order models. A higher value for L gives more weight to lower-order contexts. As the amount of data grows against a fixed alphabet of characters, the impact of L is reduced. In Witten and Bell's original paper, the hyperparameter L was set to 1.0, which is not a particularly good choice for most text sources. A value of the lambda factor that is roughly the length of the longest n-gram seems to be a good rule of thumb.

Methods are provided for computing a sample cross-entropy rate for a character sequence. The sample cross-entropy H(c1,...,cn;PM) for sequence c1,...,cn in probability model PM is defined to be the negative average log (base 2) probability of the characters in the sequence according to the model. In symbols:

H(c1,...,cn;PM) = (-log2 PM(c1,...,cn))/n
The cross-entropy rate of distribution P' with respect to a distribution P is defined by:
H(P',P) = -Σx P(x) * log2 P'(x)
The Shannon-McMillan-Breiman theorem shows that as the length of the sample drawn from the true distribution P grows, the sample cross-entropy rate approaches the actual cross-entropy rate. In symbols:
H(P,PM) = limn->infinity H(c1,...,cn;PM)
The entropy of a distribution P is defined as its cross-entropy against itself, H(P,P). A distribution's entropy is a lower bound on its cross-entropy against any model; in symbols, H(P',P) >= H(P,P) for all distributions P'.
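For example, the sample cross-entropy rate of a test string under a trained model can be computed from log2Estimate(CharSequence). A minimal sketch, assuming a LingPipe classpath; the training data is a toy string:

```java
import com.aliasi.lm.NGramProcessLM;

public class SampleCrossEntropy {
    public static void main(String[] args) {
        NGramProcessLM lm = new NGramProcessLM(5);   // counts up to 5-grams
        lm.train("abracadabra abracadabra");         // toy training data
        String test = "abracadabra";
        // H(c1,...,cn;PM) = -log2 PM(c1,...,cn) / n
        double rate = -lm.log2Estimate(test) / test.length();
        System.out.println(rate + " bits/char");
    }
}
```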

Pruning

Models may be pruned by pruning the underlying substring counter for the language model. This counter is returned by the method substringCounter(). See the class documentation for the result type TrieCharSeqCounter for more information.
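For instance, counts below a threshold may be removed before compiling. A sketch, assuming the prune(int) method described in TrieCharSeqCounter's class documentation:

```java
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.lm.TrieCharSeqCounter;

public class PruneDemo {
    public static void main(String[] args) {
        NGramProcessLM lm = new NGramProcessLM(5);
        lm.train("abracadabra abracadabra");
        TrieCharSeqCounter counter = lm.substringCounter();
        counter.prune(2);  // drop substrings seen fewer than 2 times;
                           // estimates in lm change accordingly
    }
}
```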

Serialization

Models may be serialized in the usual way by creating an object output stream and writing the object:

 NGramProcessLM lm = ...;
 ObjectOutput out = ...;
 out.writeObject(lm);
Reading just reverses the process:
 ObjectInput in = ...;
 NGramProcessLM lm = (NGramProcessLM) in.readObject();
Serialization is based on the methods writeTo(OutputStream) and readFrom(InputStream). These write compressed forms of the model to streams in binary format.

Warning: The object input and output used for serialization must extend InputStream and OutputStream. The only implementations of ObjectInput and ObjectOutput as of the 1.6 JDK do extend the streams, so this will only be a problem with customized object input or output objects. If you need this method to work with custom input and output objects that do not extend the corresponding streams, drop us a line and we can perhaps refactor the output methods to remove this restriction.

References

For information on the Witten-Bell interpolation method, see:

Witten, Ian H. and Timothy C. Bell. 1991. The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37(4):1085-1094.

Since:
LingPipe2.0
Version:
4.0.0
Author:
Bob Carpenter
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from interface com.aliasi.lm.LanguageModel
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
 
Constructor Summary
NGramProcessLM(int maxNGram)
          Constructs an n-gram process language model with the specified maximum n-gram length.
NGramProcessLM(int numChars, double lambdaFactor, TrieCharSeqCounter counter)
          Construct an n-gram process language model with the specified number of characters, interpolation parameter and character sequence counter.
NGramProcessLM(int maxNGram, int numChars)
          Construct an n-gram process language model with the specified maximum n-gram length and maximum number of characters.
NGramProcessLM(int maxNGram, int numChars, double lambdaFactor)
          Construct an n-gram process language model with the specified maximum n-gram length, number of characters, and interpolation ratio hyperparameter.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Writes a compiled version of this process language model to the specified object output.
 double getLambdaFactor()
          Returns the current setting of the interpolation ratio hyperparameter.
 void handle(CharSequence cSeq)
          Implements the object handler interface over character sequences for training.
 double log2ConditionalEstimate(char[] cs, int start, int end)
          Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters.
 double log2ConditionalEstimate(char[] cs, int start, int end, int maxNGram, double lambdaFactor)
          Returns the log (base 2) conditional estimate for a specified character slice with a specified maximum n-gram and specified hyperparameter.
 double log2ConditionalEstimate(CharSequence cSeq)
          Returns the log (base 2) of the probabilty estimate for the conditional probability of the last character in the specified character sequence given the previous characters.
 double log2ConditionalEstimate(CharSequence cSeq, int maxNGram, double lambdaFactor)
          Returns the log (base 2) conditional estimate of the last character in the specified character sequence given the previous characters based only on counts of n-grams up to the specified maximum n-gram.
 double log2Estimate(char[] cs, int start, int end)
          Returns an estimate of the log (base 2) probability of the specified character slice.
 double log2Estimate(CharSequence cSeq)
          Returns an estimate of the log (base 2) probability of the specified character sequence.
 double log2Prob(CharSequence cSeq)
          Returns the log (base 2) of the probability of the specified object.
 int maxNGram()
          Returns the maximum n-gram length for this model.
 char[] observedCharacters()
          Returns the array of characters that have been observed for this model.
 double prob(CharSequence cSeq)
          Returns the probability of the specified object.
static NGramProcessLM readFrom(InputStream in)
          Reads a language model from the specified input stream.
 void setLambdaFactor(double lambdaFactor)
          Sets the value of the interpolation ratio hyperparameter to the specified value.
 void setNumChars(int numChars)
          Sets the number of characters for this language model.
 TrieCharSeqCounter substringCounter()
          Returns the substring counter for this language model.
 String toString()
          Returns a string-based representation of this language model.
 void train(char[] cs, int start, int end)
          Update the model with the training data provided by the specified character slice.
 void train(char[] cs, int start, int end, int incr)
          Update the model with the training data provided by the specified character slice with the specified count.
 void train(CharSequence cSeq)
          Update the model with the training data provided by the specified character sequence with a count of one.
 void train(CharSequence cSeq, int incr)
          Update the model with the training data provided by the specified character sequence with the specified count.
 void trainConditional(char[] cs, int start, int end, int condEnd)
          Trains the specified conditional outcome(s) of the specified character slice given the background slice.
 void writeTo(OutputStream out)
          Writes this language model to the specified output stream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

NGramProcessLM

public NGramProcessLM(int maxNGram)
Constructs an n-gram process language model with the specified maximum n-gram length. The number of characters is set to the default of Character.MAX_VALUE and the interpolation parameter is set to the default of being equal to the n-gram length.

Parameters:
maxNGram - Maximum length n-gram for which counts are stored.
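A typical construction, training, and estimation sequence looks like the following sketch; the training text is a toy example, not a tuned configuration:

```java
import com.aliasi.lm.NGramProcessLM;

public class ConstructorDemo {
    public static void main(String[] args) {
        // Counts up to 5-grams; numChars defaults to Character.MAX_VALUE
        // and the lambda factor defaults to the n-gram length, 5.0.
        NGramProcessLM lm = new NGramProcessLM(5);
        lm.train("the cat sat on the mat");
        System.out.println(lm.log2Estimate("the cat"));
    }
}
```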

NGramProcessLM

public NGramProcessLM(int maxNGram,
                      int numChars)
Construct an n-gram process language model with the specified maximum n-gram length and maximum number of characters. The interpolation hyperparameter will be set to the same value as the maximum n-gram length.

Parameters:
maxNGram - Maximum length n-gram for which counts are stored.
numChars - Maximum number of characters in training data.
Throws:
IllegalArgumentException - If the maximum n-gram is less than 1 or if the number of characters is not between 1 and Character.MAX_VALUE.

NGramProcessLM

public NGramProcessLM(int maxNGram,
                      int numChars,
                      double lambdaFactor)
Construct an n-gram process language model with the specified maximum n-gram length, number of characters, and interpolation ratio hyperparameter.

Parameters:
maxNGram - Maximum length n-gram for which counts are stored.
numChars - Maximum number of characters in training data.
lambdaFactor - Interpolation ratio hyperparameter (see the class documentation).
Throws:
IllegalArgumentException - If the maximum n-gram is less than 1, the number of characters is not between 1 and Character.MAX_VALUE, or if the lambda factor is not greater than or equal to 0.

NGramProcessLM

public NGramProcessLM(int numChars,
                      double lambdaFactor,
                      TrieCharSeqCounter counter)
Construct an n-gram process language model with the specified number of characters, interpolation parameter and character sequence counter. The maximum n-gram is determined by the sequence counter.

The counter argument allows serialized counters to be read back in and used to create an n-gram process LM.

Parameters:
numChars - Maximum number of characters in training and test data.
lambdaFactor - Interpolation parameter (see class doc).
counter - Character sequence counter to use.
Throws:
IllegalArgumentException - If the number of characters is not between 1 and Character.MAX_VALUE, or if the lambda factor is not greater than or equal to 0.
Method Detail

writeTo

public void writeTo(OutputStream out)
             throws IOException
Writes this language model to the specified output stream.

A language model is written using a BitOutput wrapped around the specified output stream. This bit output is used to delta encode the maximum n-gram, number of characters, lambda factor times 1,000,000, and then the underlying sequence counter using TrieCharSeqCounter.writeCounter(CharSeqCounter,TrieWriter,int). The bit output is flushed, but the output stream is not closed.

A language model can be read and written using the following code, given a file f:

 NGramProcessLM lm = ...;
 File f = ...;
 OutputStream out = new FileOutputStream(f);
 BufferedOutputStream bufOut = new BufferedOutputStream(out);
 lm.writeTo(bufOut);
 bufOut.close();

 ...
 InputStream in = new FileInputStream(f);
 BufferedInputStream bufIn = new BufferedInputStream(in);
 NGramProcessLM lm2 = NGramProcessLM.readFrom(bufIn);
 bufIn.close();

Parameters:
out - Output stream to which to write language model.
Throws:
IOException - If there is an underlying I/O error.

readFrom

public static NGramProcessLM readFrom(InputStream in)
                               throws IOException
Reads a language model from the specified input stream.

See writeTo(OutputStream) for information on the binary I/O format.

Parameters:
in - Input stream from which to read a language model.
Returns:
The language model read from the stream.
Throws:
IOException - If there is an underlying I/O error.

log2Prob

public double log2Prob(CharSequence cSeq)
Description copied from interface: Model
Returns the log (base 2) of the probability of the specified object.

Specified by:
log2Prob in interface Model<CharSequence>
Parameters:
cSeq - The object whose probability is returned.
Returns:
The log (base 2) probability of the specified object.

prob

public double prob(CharSequence cSeq)
Description copied from interface: Model
Returns the probability of the specified object.

Specified by:
prob in interface Model<CharSequence>
Parameters:
cSeq - The object whose probability is returned.
Returns:
The probability of the specified object.

log2Estimate

public final double log2Estimate(CharSequence cSeq)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character sequence.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cSeq - Character sequence to estimate.
Returns:
Log estimate of likelihood of specified character sequence.

log2Estimate

public final double log2Estimate(char[] cs,
                                 int start,
                                 int end)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character slice.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
Log estimate of likelihood of specified character sequence.

train

public void train(CharSequence cSeq)
Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character sequence with a count of one.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - The character sequence to use as training data.

train

public void train(CharSequence cSeq,
                  int incr)
Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character sequence with the specified count. Calling train(cSeq,n) is equivalent to calling train(cSeq) a total of n times.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - The character sequence to use as training data.
incr - Number of instances to train.

train

public void train(char[] cs,
                  int start,
                  int end)
Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character slice.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.

train

public void train(char[] cs,
                  int start,
                  int end,
                  int incr)
Description copied from interface: LanguageModel.Dynamic
Update the model with the training data provided by the specified character slice with the specified count. Calling train(cs,start,end,n) is equivalent to calling train(cs,start,end) a total of n times.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the training slice.
incr - Number of instances to train.

handle

public void handle(CharSequence cSeq)
Implements the object handler interface over character sequences for training. The implementation delegates to train(CharSequence).

Specified by:
handle in interface ObjectHandler<CharSequence>
Parameters:
cSeq - Character sequence on which to train.

trainConditional

public void trainConditional(char[] cs,
                             int start,
                             int end,
                             int condEnd)
Trains the specified conditional outcome(s) of the specified character slice given the background slice.

This method is just shorthand for incrementing the counts of all substrings of cs from position start to end-1 inclusive, then decrementing all of the counts of substrings from position start to condEnd-1. For instance, if cs is "abcde".toCharArray(), then calling trainConditional(cs,0,5,2) will increment the counts of cde given ab, but will not increment the counts of ab directly. This increases the following probabilities:

P('e'|"abcd")   P('e'|"bcd")   P('e'|"cd")   P('e'|"d")   P('e'|"")
P('d'|"abc")    P('d'|"bc")    P('d'|"c")    P('d'|"")
P('c'|"ab")     P('c'|"b")     P('c'|"")
but does not increase the following probabilities:
P('b'|"a")   P('b'|"")
P('a'|"")

Parameters:
cs - Array of characters.
start - Start position for slice.
end - One past end position for slice.
condEnd - One past the end of the conditional portion of the slice.
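The "abcde" example above looks like the following sketch in code:

```java
import com.aliasi.lm.NGramProcessLM;

public class TrainConditionalDemo {
    public static void main(String[] args) {
        NGramProcessLM lm = new NGramProcessLM(5);
        char[] cs = "abcde".toCharArray();
        // Train "cde" as an outcome conditioned on the context "ab":
        // counts for the full slice are incremented, then the
        // contribution of the bare context "ab" is removed.
        lm.trainConditional(cs, 0, 5, 2);
        // Conditional estimates such as P('c'|"ab") now reflect the training:
        System.out.println(lm.log2ConditionalEstimate("abc"));
    }
}
```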

observedCharacters

public char[] observedCharacters()
Description copied from interface: LanguageModel.Conditional
Returns the array of characters that have been observed for this model. The character array will be sorted into ascending unicode order.

Specified by:
observedCharacters in interface LanguageModel.Conditional
Returns:
The array of observed characters for this model.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Writes a compiled version of this process language model to the specified object output.

The object written will be an instance of CompiledNGramProcessLM. It may be read in by casting the result of ObjectInput.readObject().

Compilation is time-consuming, because it must traverse the entire trie structure and, for each node, estimate its log probability and, if it is internal, its log interpolation value. Given that the time taken is proportional to the size of the trie, pruning first may greatly speed up this operation and reduce the size of the compiled object that is written.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which a compiled version of this language model will be written.
Throws:
IOException - If there is an I/O exception writing the compiled object.
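A compile-and-read-back round trip can be sketched through a byte array; in practice the object streams would typically wrap files:

```java
import com.aliasi.lm.CompiledNGramProcessLM;
import com.aliasi.lm.NGramProcessLM;
import java.io.*;

public class CompileDemo {
    public static void main(String[] args) throws Exception {
        NGramProcessLM lm = new NGramProcessLM(5);
        lm.train("abracadabra abracadabra");

        // Compile the dynamic model to an object output stream.
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        ObjectOutputStream objOut = new ObjectOutputStream(bytesOut);
        lm.compileTo(objOut);
        objOut.close();

        // Read it back; the result is a CompiledNGramProcessLM.
        ObjectInputStream objIn = new ObjectInputStream(
            new ByteArrayInputStream(bytesOut.toByteArray()));
        CompiledNGramProcessLM compiled =
            (CompiledNGramProcessLM) objIn.readObject();
        objIn.close();
        System.out.println(compiled.log2Estimate("abracadabra"));
    }
}
```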

log2ConditionalEstimate

public double log2ConditionalEstimate(CharSequence cSeq)
Description copied from interface: LanguageModel.Conditional
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified character sequence given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cSeq - Character sequence to estimate.
Returns:
The log conditional probability estimate.

log2ConditionalEstimate

public double log2ConditionalEstimate(char[] cs,
                                      int start,
                                      int end)
Description copied from interface: LanguageModel.Conditional
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus the index of the last character in the slice.
Returns:
The log conditional probability estimate.

substringCounter

public TrieCharSeqCounter substringCounter()
Returns the substring counter for this language model. Modifying the counts in the returned counter, such as by pruning, will change the estimates in this language model.

Returns:
Substring counter for this language model.

maxNGram

public int maxNGram()
Returns the maximum n-gram length for this model.

Returns:
The maximum n-gram length for this model.

log2ConditionalEstimate

public double log2ConditionalEstimate(CharSequence cSeq,
                                      int maxNGram,
                                      double lambdaFactor)
Returns the log (base 2) conditional estimate of the last character in the specified character sequence given the previous characters based only on counts of n-grams up to the specified maximum n-gram. If the maximum n-gram argument is greater than or equal to the one supplied at construction time, the results will be the same as the ordinary conditional estimate.

Parameters:
cSeq - Character sequence to estimate.
maxNGram - Maximum length of n-gram count to use for estimate.
lambdaFactor - Value of interpolation hyperparameter for this estimate.
Returns:
Log (base 2) conditional estimate.
Throws:
IllegalArgumentException - If the character sequence is not at least one character long.

log2ConditionalEstimate

public double log2ConditionalEstimate(char[] cs,
                                      int start,
                                      int end,
                                      int maxNGram,
                                      double lambdaFactor)
Returns the log (base 2) conditional estimate for a specified character slice with a specified maximum n-gram and specified hyperparameter.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
maxNGram - Maximum length of n-gram to use in estimates.
lambdaFactor - Value of interpolation hyperparameter.
Returns:
Log (base 2) conditional estimate of the last character in the slice given the previous characters.
Throws:
IndexOutOfBoundsException - If the start index and end index minus one are out of range of the character array or if the character slice is less than one character long.

getLambdaFactor

public double getLambdaFactor()
Returns the current setting of the interpolation ratio hyperparameter. See the class documentation above for information on how the interpolation ratio is used in estimates.

Returns:
The current setting of the interpolation ratio hyperparameter.

setLambdaFactor

public final void setLambdaFactor(double lambdaFactor)
Sets the value of the interpolation ratio hyperparameter to the specified value. See the class documentation above for information on how the interpolation ratio is used in estimates.

Parameters:
lambdaFactor - New value for interpolation ratio hyperparameter.
Throws:
IllegalArgumentException - If the value is not greater than or equal to zero.

setNumChars

public final void setNumChars(int numChars)
Sets the number of characters for this language model. All subsequent estimates will be based on this number. See the class documentation above for information on how the number of characters is used to determine the base case uniform distribution.

Parameters:
numChars - New number of characters for this language model.
Throws:
IllegalArgumentException - If the number of characters is less than 0 or more than Character.MAX_VALUE.

toString

public String toString()
Returns a string-based representation of this language model.

Overrides:
toString in class Object
Returns:
A string-based representation of this language model.