com.aliasi.lm
Class CompiledNGramProcessLM

java.lang.Object
  extended by com.aliasi.lm.CompiledNGramProcessLM
All Implemented Interfaces:
LanguageModel, LanguageModel.Conditional, LanguageModel.Process, Model<CharSequence>

public class CompiledNGramProcessLM
extends Object
implements LanguageModel.Process, LanguageModel.Conditional, Model<CharSequence>

A CompiledNGramProcessLM implements a conditional process language model. Instances are constructed by reading a serialized version of an instance of NGramProcessLM through data input.

Compiled models contain precompulted estimates and smoothing parameters. For instance, consider an 3-gram process language model trained on the string "abracadabra", which yields the following counts:

    11
     a 5
       br 2
       ca 1
       da 1
    bra 2
    cad 1
    dab 1
    r 2
      a 2
         c 1
 
For instance, count("")=11, count("a")=5, count("ab")=2, count("abr")=2, count("br")=2, count("bra")=2, count("r")=2, and count("ra")=2, and count("rac")=1. Assuming a lambda factor hyperparameter set to 4.0, the compiled model consists of the following five parallel arrays:
    char (2) int (4) float (4) float (4) int (4)
Index Context Char Suffix log2 P(char|context) log2 (1 - λ(context.char)) First Dtr
0n/an/an/an/a -0.63226821
1""a0-2.6098135 -0.41503756
2""b0-3.8987012 -0.58496259
3""c0-4.845262 -0.3219280810
4""d0-4.845262 -0.3219280811
5""r0-3.8987012 -0.584962512
6"a"b2-2.5122285 -0.584962513
7"a"c3-3.4966948 -0.3219280814
8"a"d4-3.4966948 -0.3219280815
9"b"r5-1.4034244 -0.584962516
10"c"a1-1.5948515 -0.3219280817
11"d"a1-1.5948515 -0.3219280818
12"r"a1-1.1760978 -0.3219280819
13"ab"r9-0.77261907 This space intentionally left left blank: maximum length n-grams are not contexts.
14"ac"a10-1.1051782
15"ad"a11-1.1051782
16"br"a12-0.6703262
17"ca"d8-1.8843123
18"da"b6-1.5554274
19"ra"c7-1.8843123
The actual data is contained in the last five columns of the table. Thus internal nodes require 10 bytes and terminal nodes require 18 bytes. The indices in the first column and the implicit context represented by the node are for convenience. Each of these indices picks out a row in the table that corresponds to a node in the trie structure representing the counts. The nodes are arranged according to a unicode-order breadth-first traversal of the count trie. Each non-terminal node stores a character, the index of the suffix node (as in a suffix tree), a probability estimate for the character given the context of characters that led to the character, an interpolation value for the context represented by the context leading to the character plus the character itself, and finally a pointer to the first daughter of the node in the array. Note that the last daughter of a node is thus picked out by the first daughter of the next node minus one. For instance, the daughters of node 1 are 6, 7 and 8, and the daguther of node 9 is 16. The terminal nodes have no daughters; note that memory is not allocated for these terminal cells.

The second column indicates the context derived by walking from the root node (index 0) down to the node. For instance, the path "bra" leads from the root node 0 ("") to the node 2 ("b"), to node 9 ("br"), and finally to node 16 ("bra"). The tree is traversed by starting at the root node and looking up daughters by binary search.

If a full context and node are in the tree, the estimate can simply be looked up. For instance, log2 P(b) = -3.898, log2 P(r|b) = -1.403, and log2 P(a|br) = -0.670. If the context is available, but not the outcome, the result is just the addition of the log one minus lambda estimates down to the first context for which the outcome exists. These contexts to explore are all suffixes of the context currently being explored and may be looked up in the suffix array. For instance, consider:

P(c|br) = λ(br) PML(c|br) + (1-λ(br)) P(c|b)
In this case, the outcome c is not available for the context br (the only daughter is a), hence the maximum likelihood estimate is zero, and the result reduces to the second term:
P(c|br) = (1-λ(br)) P(c|b)
and hence
log2 P(c|br) = log2((1-λ(br)) P(c|b)) = log2(1-λ(br)) + log2 P(c|b)
This continues until the outcome is found. In this case, we continue with the next term in the same way:
= log2 (1-λ(br)) + log2 (1- λ(b)) + log2 P(c)
= -0.584 + -0.584 + -0.4845
Because 3-grams are the upper bound on context length, and terminal nodes have no daughters, we conserve length at the end of the smoothing and daughter arrays.

In practice, this smoothing may require going all the way down to the uniform model. The uniform model estimate is stored separately. The interpolation parameter for the root node works as for any other node.

For estimates of sequences, the final node used will act as the first potential context for the next estimate. This replaces a number of lookups equal to the n-gram length by binary character search with a simple array lookup.

Implementation Note: The suffix indices are not included in the binary serialization format. Instead, they are initialized right after the binary data is read in. This requires walking the trie-structure and computing each node's suffix by doing a walk from the root. For instance, a million node 8-gram model will require at most 8 million binary character searches during initialization.

Since:
LingPipe2.0
Version:
3.6
Author:
Bob Carpenter

Nested Class Summary
 
Nested classes/interfaces inherited from interface com.aliasi.lm.LanguageModel
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
 
Nested classes/interfaces inherited from interface com.aliasi.lm.LanguageModel
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
 
Field Summary
static int ROOT_NODE_INDEX
          The index of the root node, namely 0.
 
Method Summary
 double log2ConditionalEstimate(char[] cs, int start, int end)
          Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters.
 double log2ConditionalEstimate(CharSequence cSeq)
          Returns the log (base 2) of the probabilty estimate for the conditional probability of the last character in the specified character sequence given the previous characters.
 double log2Estimate(char[] cs, int start, int end)
          Returns an estimate of the log (base 2) probability of the specified character slice.
 double log2Estimate(CharSequence cSeq)
          Returns an estimate of the log (base 2) probability of the specified character sequence.
 double log2Estimate(int contextIndex, char nextChar)
          Returns the log (base 2) estimate of the specified character in the context with the specified index.
 double log2Prob(CharSequence cSeq)
          This method is a convenience impelementation of the Model interface which delegates the call to log2Estimate(CharSequence).
 int longestContextIndex(String context)
          Return the index in the parallel array structure underlying of the maximum length suffix of the specified string that is defined as a context.
 int maxNGram()
          Returns the maximum length n-gram used for this compiled language model.
 int nextContext(int contextIndex, char nextChar)
          Returns the index of the context formed by appending the specified character to the context of the specified index.
 int numNodes()
          Returns the total number of nodes in this language model's trie structure.
 char[] observedCharacters()
          Returns the array of characters that have been observed for this language model.
 double prob(CharSequence cSeq)
          This method is a convenience implementation of the Model interface which returns the result of raising 2.0 to the power of the result of a call to log2Estimate(CharSequence).
 String toString()
          Returns a string-based representation of this compiled n-gram.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

ROOT_NODE_INDEX

public static final int ROOT_NODE_INDEX
The index of the root node, namely 0.

See Also:
Constant Field Values
Method Detail

observedCharacters

public char[] observedCharacters()
Returns the array of characters that have been observed for this language model. These are returned in increasing unicode order.

It is assumed that the observed characters are those stored unigram probabilities in the trie.

Specified by:
observedCharacters in interface LanguageModel.Conditional
Returns:
The array of characters that have been observed for this language model.

maxNGram

public int maxNGram()
Returns the maximum length n-gram used for this compiled language model. The maximum amount of context used for estimates will be one less than this.

Returns:
The maximum length n-gram counted in this language model.

numNodes

public int numNodes()
Returns the total number of nodes in this language model's trie structure. This represents the total number of sequences for which there are precompiled conditional probability estimates.

Returns:
The total number of nodes in the underlying trie structure.

longestContextIndex

public int longestContextIndex(String context)
Return the index in the parallel array structure underlying of the maximum length suffix of the specified string that is defined as a context.

Parameters:
context - String of context.
Returns:
int Index of maximum length suffix of the specified context.

log2Prob

public double log2Prob(CharSequence cSeq)
This method is a convenience impelementation of the Model interface which delegates the call to log2Estimate(CharSequence).

Specified by:
log2Prob in interface Model<CharSequence>
Parameters:
cSeq - Character sequence whose probability is returned.
Returns:
The log (base 2) probability of the specified character sequence.

prob

public double prob(CharSequence cSeq)
This method is a convenience implementation of the Model interface which returns the result of raising 2.0 to the power of the result of a call to log2Estimate(CharSequence).

Specified by:
prob in interface Model<CharSequence>
Parameters:
cSeq - Character sequence whose probability is returned.
Returns:
The log probability of the specified character sequence.

log2Estimate

public final double log2Estimate(CharSequence cSeq)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character sequence.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cSeq - Character sequence to estimate.
Returns:
Log estimate of likelihood of specified character sequence.

log2Estimate

public final double log2Estimate(int contextIndex,
                                 char nextChar)
Returns the log (base 2) estimate of the specified character in the context with the specified index. This corresponds to values returned by the conditional estimates when the context and outcome character are specified in a singlecharacter sequence or slice.

The main use of this method is to incrementally compute conditional estimates and contexts, in conjunction with the method nextContext(int,char).

Parameters:
contextIndex - Index of context of estimate.
nextChar - Character being estimated.
Returns:
Log (base 2) estimate of character in context.

nextContext

public int nextContext(int contextIndex,
                       char nextChar)
Returns the index of the context formed by appending the specified character to the context of the specified index. The main use of this method is to incrementally compute conditional estimates and contexts, in conjunction with the method log2Estimate(int,char).

Note that the index of the root node is always 0.

Parameters:
contextIndex - Index of present context.
nextChar - Next character.
Returns:
Index of context formed by appending next character to the present context.
Throws:
IllegalArgumentException - If the context index is less than zero or greater than the last context index.

log2Estimate

public final double log2Estimate(char[] cs,
                                 int start,
                                 int end)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character slice.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
Log estimate of likelihood of specified character sequence.

log2ConditionalEstimate

public double log2ConditionalEstimate(CharSequence cSeq)
Description copied from interface: LanguageModel.Conditional
Returns the log (base 2) of the probabilty estimate for the conditional probability of the last character in the specified character sequence given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cSeq - Character sequence to estimate.
Returns:
The log conditional probability estimate.

log2ConditionalEstimate

public double log2ConditionalEstimate(char[] cs,
                                      int start,
                                      int end)
Description copied from interface: LanguageModel.Conditional
Returns the log (base 2) of the probability estimate for the conditional probability of the last character in the specified slice given the previous characters.

Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus the index of the last character in the slice.
Returns:
The log conditional probability estimate.

toString

public String toString()
Returns a string-based representation of this compiled n-gram. It writes one row per line of the parallel indices. It should probably not be called with very large models, as the resulting string will be much larger than the model itself.

Overrides:
toString in class Object
Returns:
String-based representation of this model.