com.aliasi.lm
Class TokenizedLM

java.lang.Object
  extended by com.aliasi.lm.TokenizedLM
All Implemented Interfaces:
Handler, ObjectHandler<CharSequence>, TextHandler, LanguageModel, LanguageModel.Dynamic, LanguageModel.Sequence, LanguageModel.Tokenized, Compilable

public class TokenizedLM
extends Object
implements TextHandler, LanguageModel.Dynamic, LanguageModel.Sequence, LanguageModel.Tokenized, ObjectHandler<CharSequence>

A TokenizedLM provides a dynamic sequence language model which models token sequences with an n-gram model, and whitespace and unknown tokens with their own sequence language models.

A tokenized language model factors the probability assigned to a character sequence as follows:

P(cs) = Ptok(toks(cs)) * Π_{t in unknownToks(cs)} Punk(t) * Π_{w in whitespaces(cs)} Pwhsp(w)

where toks(cs) is the sequence of tokens produced by tokenizing cs, unknownToks(cs) is the subsequence of those tokens that are unknown (not in the symbol table), whitespaces(cs) is the sequence of whitespaces produced by the tokenizer, Ptok is the token n-gram model, Punk is the unknown token sequence model, and Pwhsp is the whitespace sequence model.

The token n-gram model itself uses the same method of counting and smoothing as described in the class documentation for NGramProcessLM. Like NGramBoundaryLM, boundary tokens are inserted before and after other tokens. And like the n-gram character boundary model, the initial boundary estimate is subtracted from the overall estimate for normalization purposes.

Tokens are all converted to integer identifiers using an internal dynamic symbol table. All symbols in symbol tables get non-negative identifiers; the negative value -1 is used for the unknown token in models, just as in symbol tables. The value -2 is used for the boundary marker in the counters.

In order for all estimates to be non-zero, the integer sequence counter used to back the token model is initialized with a count of 1 for the end-of-stream identifier (-2). The unknown token count for any context is taken to be the number of outcomes in that context. Because unknowns are estimated directly in this manner, there is no need to interpolate the unigram model with a uniform model for unknown outcome. Instead, the occurrence of an unknown is modeled directly and its identity is modeled by the unknown token language model.

In order to produce a properly normalized sequence model, the tokens and whitespaces returned by the tokenizer should concatenate together to produce the original input. Note that this condition is not checked at runtime. Sequences may, however, be normalized before being trained and evaluated by a language model. For instance, all alphabetic characters might be reduced to lower case, all punctuation characters removed, and all non-empty sequences of whitespace reduced to a single space character. A language model may then be defined over this normalized space of input rather than the original space (and may thus use a reduced number of characters for its uniform estimates). Although this normalization may be carried out by a tokenizer in practice, for instance for use in a tokenized classifier, such normalization is consistent with the interface specifications for LanguageModel.Sequence or LanguageModel.Dynamic only if it is performed outside the model.
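For example, the following sketch constructs a token trigram model, trains it on two sentences, and estimates the log (base 2) probability of a third. The use of com.aliasi.tokenizer.IndoEuropeanTokenizerFactory and its static INSTANCE field is an illustrative assumption; any TokenizerFactory may be substituted.

 import com.aliasi.lm.TokenizedLM;
 import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
 import com.aliasi.tokenizer.TokenizerFactory;

 public class TokenizedLmDemo {
     public static void main(String[] args) {
         // Tokenizer factory; IndoEuropeanTokenizerFactory is an illustrative choice.
         TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;

         // Token trigram model with default unknown-token and whitespace models.
         TokenizedLM lm = new TokenizedLM(tf, 3);

         // handle(CharSequence) delegates to train(CharSequence,int) with count 1.
         lm.handle("John likes fast computers.");
         lm.handle("Mary likes fast cars.");

         // Estimate the log (base 2) probability of a new character sequence.
         System.out.println(lm.log2Estimate("John likes fast cars."));
     }
 }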

Since:
LingPipe2.0
Version:
3.9.1
Author:
Bob Carpenter

Nested Class Summary
 
Nested classes/interfaces inherited from interface com.aliasi.lm.LanguageModel
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
 
Field Summary
static int BOUNDARY_TOKEN
          The symbol used for boundaries in the counter, -2.
static int UNKNOWN_TOKEN
          The symbol used for unknown symbol IDs, -1.
 
Constructor Summary
TokenizedLM(TokenizerFactory factory, int nGramOrder)
          Constructs a tokenized language model with the specified tokenization factory and n-gram order.
TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor)
          Construct a tokenized language model with the specified tokenization factory and n-gram order, sequence models for unknown tokens and whitespace, and an interpolation hyperparameter.
TokenizedLM(TokenizerFactory tokenizerFactory, int nGramOrder, LanguageModel.Sequence unknownTokenModel, LanguageModel.Sequence whitespaceModel, double lambdaFactor, boolean initialIncrementBoundary)
          Construct a tokenized language model with the specified tokenization factory and n-gram order, sequence models for unknown tokens and whitespace, and an interpolation hyperparameter, as well as a flag indicating whether to automatically increment a null input to avoid numerical problems with zero counts.
 
Method Summary
 double chiSquaredIndependence(int[] nGram)
          Returns the maximum value of Pearson's χ² independence test statistic resulting from splitting the specified n-gram in half to derive a contingency matrix.
 ScoredObject<String[]>[] collocations(int nGram, int minCount, int maxReturned)
          Deprecated. Use collocationSet(int,int,int) instead.
 SortedSet<ScoredObject<String[]>> collocationSet(int nGram, int minCount, int maxReturned)
          Returns a sorted set of collocations in order of confidence that their token sequences are not independent.
 void compileTo(ObjectOutput objOut)
          Writes a compiled version of this tokenized language model to the specified object output.
 ScoredObject<String[]>[] frequentTerms(int nGram, int maxReturned)
          Deprecated. Use frequentTermSet(int,int) instead.
 SortedSet<ScoredObject<String[]>> frequentTermSet(int nGram, int maxReturned)
          Returns the most frequent n-gram terms in the training data up to the specified maximum number.
 void handle(char[] cs, int start, int length)
          Deprecated. Use handle(CharSequence) instead.
 void handle(CharSequence cs)
          Trains the language model on the specified character sequence.
 void handleNGrams(int nGramLength, int minCount, ObjectHandler<String[]> handler)
          Visits the n-grams of the specified length with at least the specified minimum count stored in the underlying counter of this tokenized language model and passes them to the specified handler.
 ScoredObject<String[]>[] infrequentTerms(int nGram, int maxReturned)
          Deprecated. Use infrequentTermSet(int,int) instead.
 SortedSet<ScoredObject<String[]>> infrequentTermSet(int nGram, int maxReturned)
          Returns the least frequent n-gram terms in the training data up to the specified maximum number.
 double lambdaFactor()
          Returns the interpolation ratio, or lambda factor, for interpolating in this tokenized language model.
 double log2Estimate(char[] cs, int start, int end)
          Returns an estimate of the log (base 2) probability of the specified character slice.
 double log2Estimate(CharSequence cSeq)
          Returns an estimate of the log (base 2) probability of the specified character sequence.
 ScoredObject<String[]>[] newTerms(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM)
          Deprecated. Use newTermSet(int,int,int,LanguageModel.Tokenized) instead.
 SortedSet<ScoredObject<String[]>> newTermSet(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM)
          Returns a list of scored n-grams ordered by the significance of the degree to which their counts in this model exceed their expected counts in a specified background model.
 int nGramOrder()
          Returns the order of the token n-gram model underlying this tokenized language model.
 ScoredObject<String[]>[] oldTerms(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM)
          Deprecated. Use oldTermSet(int,int,int,LanguageModel.Tokenized) instead.
 SortedSet<ScoredObject<String[]>> oldTermSet(int nGram, int minCount, int maxReturned, LanguageModel.Tokenized backgroundLM)
          Returns a list of scored n-grams ordered in reverse order of significance with respect to the background model.
 double processLog2Probability(String[] tokens)
          Returns the log (base 2) probability of the specified tokens in the underlying token n-gram distribution.
 TrieIntSeqCounter sequenceCounter()
          Returns the integer sequence counter underlying this model.
 SymbolTable symbolTable()
          Returns the symbol table underlying this tokenized language model's token n-gram model.
 TokenizerFactory tokenizerFactory()
          Returns the tokenizer factory for this tokenized language model.
 double tokenLog2Probability(String[] tokens, int start, int end)
          Returns the log (base 2) probability of the specified token slice in the underlying token n-gram distribution.
 double tokenProbability(String[] tokens, int start, int end)
          Returns the probability of the specified token slice in the token n-gram distribution.
 String toString()
          Returns a string-based representation of the token counts for this language model.
 void train(char[] cs, int start, int end)
          Deprecated. Use handle(CharSequence) instead.
 void train(char[] cs, int start, int end, int count)
          Deprecated. Use train(CharSequence,int) instead.
 void train(CharSequence cSeq)
          Deprecated. Use handle(CharSequence) instead.
 void train(CharSequence cSeq, int count)
          Trains the token sequence model, whitespace model (if dynamic) and unknown token model (if dynamic) with the specified count number of instances.
 void trainSequence(char[] cs, int start, int end, int count)
          Deprecated. Use trainSequence(CharSequence,int) instead.
 void trainSequence(CharSequence cSeq, int count)
          This method increments the count of the entire sequence specified.
 LanguageModel.Sequence unknownTokenLM()
          Returns the unknown token sequence language model for this tokenized language model.
 LanguageModel.Sequence whitespaceLM()
          Returns the whitespace language model for this tokenized language model.
 double z(int[] nGram, int nGramSampleCount, int totalSampleCount)
          Returns the z-score of the specified n-gram with the specified count out of a total sample count, as measured against the expectation of this tokenized language model.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

UNKNOWN_TOKEN

public static final int UNKNOWN_TOKEN
The symbol used for unknown symbol IDs, -1.

See Also:
Constant Field Values

BOUNDARY_TOKEN

public static final int BOUNDARY_TOKEN
The symbol used for boundaries in the counter, -2.

See Also:
Constant Field Values
Constructor Detail

TokenizedLM

public TokenizedLM(TokenizerFactory factory,
                   int nGramOrder)
Constructs a tokenized language model with the specified tokenization factory and n-gram order. The unknown token and whitespace models are both uniform sequence language models with default parameters, as described in the documentation for the constructor UniformBoundaryLM.UniformBoundaryLM(). The default interpolation hyperparameter is equal to the n-gram order.

Parameters:
factory - Tokenizer factory for the model.
nGramOrder - N-gram order.
Throws:
IllegalArgumentException - If the n-gram order is less than 0.

TokenizedLM

public TokenizedLM(TokenizerFactory tokenizerFactory,
                   int nGramOrder,
                   LanguageModel.Sequence unknownTokenModel,
                   LanguageModel.Sequence whitespaceModel,
                   double lambdaFactor)
Construct a tokenized language model with the specified tokenization factory and n-gram order, sequence models for unknown tokens and whitespace, and an interpolation hyperparameter.

In order for this model to be serializable, the unknown token and whitespace models should be serializable. If they are not, a runtime exception will be thrown when attempting to serialize this model. If these models implement LanguageModel.Dynamic, they will be trained by calls to the training methods.

Parameters:
tokenizerFactory - Tokenizer factory for the model.
nGramOrder - Length of maximum n-gram for model.
unknownTokenModel - Sequence model for unknown tokens.
whitespaceModel - Sequence model for all whitespace.
lambdaFactor - Value of the interpolation hyperparameter.
Throws:
IllegalArgumentException - If the n-gram order is less than 1 or the interpolation is not a non-negative number.

TokenizedLM

public TokenizedLM(TokenizerFactory tokenizerFactory,
                   int nGramOrder,
                   LanguageModel.Sequence unknownTokenModel,
                   LanguageModel.Sequence whitespaceModel,
                   double lambdaFactor,
                   boolean initialIncrementBoundary)
Construct a tokenized language model with the specified tokenization factory and n-gram order, sequence models for unknown tokens and whitespace, and an interpolation hyperparameter, as well as a flag indicating whether to automatically increment a null input to avoid numerical problems with zero counts.

In order for this model to be serializable, the unknown token and whitespace models should be serializable. If they are not, a runtime exception will be thrown when attempting to serialize this model. If these models implement LanguageModel.Dynamic, they will be trained by calls to the training methods.
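For example, a model with explicit smoothing models might be constructed as follows. This is a sketch: the choice of NGramBoundaryLM for unknown tokens, UniformBoundaryLM for whitespace, and the particular order and hyperparameter values are illustrative assumptions.

 import com.aliasi.lm.LanguageModel;
 import com.aliasi.lm.NGramBoundaryLM;
 import com.aliasi.lm.TokenizedLM;
 import com.aliasi.lm.UniformBoundaryLM;
 import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
 import com.aliasi.tokenizer.TokenizerFactory;

 public class CustomSmoothingDemo {
     public static void main(String[] args) {
         TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;

         // Character-level model for the spelling of unknown tokens; because it
         // implements LanguageModel.Dynamic, it is trained by calls to train()/handle().
         LanguageModel.Sequence unknownTokenModel = new NGramBoundaryLM(5);

         // Whitespace model; training does not change a uniform model's estimates.
         LanguageModel.Sequence whitespaceModel = new UniformBoundaryLM();

         double lambdaFactor = 4.0; // interpolation hyperparameter

         TokenizedLM lm =
             new TokenizedLM(tf, 3, unknownTokenModel, whitespaceModel, lambdaFactor);

         lm.handle("a slightly faster computer");
         System.out.println(lm.log2Estimate("a faster computer"));
     }
 }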

Parameters:
tokenizerFactory - Tokenizer factory for the model.
nGramOrder - Length of maximum n-gram for model.
unknownTokenModel - Sequence model for unknown tokens.
whitespaceModel - Sequence model for all whitespace.
lambdaFactor - Value of the interpolation hyperparameter.
initialIncrementBoundary - Flag indicating whether or not to increment the subsequence { BOUNDARY_TOKEN } automatically after construction to avoid NaN error states.
Throws:
IllegalArgumentException - If the n-gram order is less than 1 or the interpolation is not a non-negative number.
Method Detail

lambdaFactor

public double lambdaFactor()
Returns the interpolation ratio, or lambda factor, for interpolating in this tokenized language model. See the class documentation above for more details.

Returns:
The interpolation ratio for this LM.

sequenceCounter

public TrieIntSeqCounter sequenceCounter()
Returns the integer sequence counter underlying this model. Symbols are mapped to integers using the symbol table returned by symbolTable(). Changes to this counter affect this tokenized language model.

Returns:
The sequence counter underlying this model.

symbolTable

public SymbolTable symbolTable()
Returns the symbol table underlying this tokenized language model's token n-gram model. Changes to the symbol table affect this tokenized language model.

Returns:
The symbol table underlying this language model.

nGramOrder

public int nGramOrder()
Returns the order of the token n-gram model underlying this tokenized language model.

Returns:
The order of the token n-gram model underlying this tokenized language model.

tokenizerFactory

public TokenizerFactory tokenizerFactory()
Returns the tokenizer factory for this tokenized language model.

Returns:
The tokenizer factory for this tokenized language model.

unknownTokenLM

public LanguageModel.Sequence unknownTokenLM()
Returns the unknown token sequence language model for this tokenized language model. Changes to the returned language model affect this tokenized language model.

Returns:
The unknown token language model.

whitespaceLM

public LanguageModel.Sequence whitespaceLM()
Returns the whitespace language model for this tokenized language model. Changes to the returned language model affect this tokenized language model.

Returns:
The whitespace language model.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Writes a compiled version of this tokenized language model to the specified object output. When the model is read back in it will be an instance of CompiledTokenizedLM.
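For example, a model might be compiled to a file and read back as follows. This is a sketch; the file name is hypothetical, and ObjectOutputStream is used because it implements ObjectOutput.

 import com.aliasi.lm.LanguageModel;
 import com.aliasi.lm.TokenizedLM;
 import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.ObjectInputStream;
 import java.io.ObjectOutputStream;

 public class CompileDemo {
     public static void main(String[] args) throws IOException, ClassNotFoundException {
         TokenizedLM lm = new TokenizedLM(IndoEuropeanTokenizerFactory.INSTANCE, 3);
         lm.handle("some training text for the model");

         File modelFile = new File("tokenized-lm.bin"); // hypothetical file name

         // Write the compiled model.
         ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(modelFile));
         lm.compileTo(out);
         out.close();

         // Read it back; the deserialized object is an instance of CompiledTokenizedLM.
         ObjectInputStream in = new ObjectInputStream(new FileInputStream(modelFile));
         LanguageModel compiledLm = (LanguageModel) in.readObject();
         in.close();

         System.out.println(compiledLm.log2Estimate("some text"));
     }
 }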

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which a compiled version of this model is written.
Throws:
IOException - If there is an I/O error writing the output.

handleNGrams

public void handleNGrams(int nGramLength,
                         int minCount,
                         ObjectHandler<String[]> handler)
Visits the n-grams of the specified length with at least the specified minimum count stored in the underlying counter of this tokenized language model and passes them to the specified handler.
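For example, the stored bigrams with count at least one may be printed as follows. This sketch assumes ObjectHandler from com.aliasi.corpus and the Indo-European tokenizer factory.

 import com.aliasi.corpus.ObjectHandler;
 import com.aliasi.lm.TokenizedLM;
 import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

 import java.util.Arrays;

 public class NGramVisitDemo {
     public static void main(String[] args) {
         TokenizedLM lm = new TokenizedLM(IndoEuropeanTokenizerFactory.INSTANCE, 3);
         lm.handle("the fast computer beat the slow computer");

         // Visit every stored bigram with count >= 1 and print its tokens.
         lm.handleNGrams(2, 1, new ObjectHandler<String[]>() {
                 public void handle(String[] nGram) {
                     System.out.println(Arrays.asList(nGram));
                 }
             });
     }
 }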

Parameters:
nGramLength - Length of n-grams visited.
minCount - Minimum count of a visited n-gram.
handler - Handler whose handle method is called for each visited n-gram.

train

@Deprecated
public void train(CharSequence cSeq)
Deprecated. Use handle(CharSequence) instead.

Trains the token sequence model, whitespace model (if dynamic) and unknown token model (if dynamic).

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Character sequence to train.

train

public void train(CharSequence cSeq,
                  int count)
Trains the token sequence model, whitespace model (if dynamic) and unknown token model (if dynamic) with the specified count number of instances. Calling train(cs,n) is equivalent to calling train(cs) a total of n times.

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cSeq - Character sequence to train.
count - Number of instances to train.
Throws:
IllegalArgumentException - If the count is not positive.

train

@Deprecated
public void train(char[] cs,
                             int start,
                             int end)
Deprecated. Use handle(CharSequence) instead.

Trains the token sequence model, whitespace model (if dynamic) and unknown token model (if dynamic).

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one plus last character in slice.
Throws:
IndexOutOfBoundsException - If the indices are out of range for the character array.

handle

@Deprecated
public void handle(char[] cs,
                              int start,
                              int length)
Deprecated. Use handle(CharSequence) instead.

This method is a convenience implementation of the TextHandler interface which delegates calls to train(char[], int, int).

Specified by:
handle in interface TextHandler
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
length - Length of slice.
Throws:
IndexOutOfBoundsException - If the indices are out of range for the character array.

handle

public void handle(CharSequence cs)
Trains the language model on the specified character sequence.

This method delegates to the train(CharSequence,int) method.

This method implements the ObjectHandler<CharSequence> interface.

Specified by:
handle in interface ObjectHandler<CharSequence>
Parameters:
cs - Object to be handled.

train

@Deprecated
public void train(char[] cs,
                             int start,
                             int end,
                             int count)
Deprecated. Use train(CharSequence,int) instead.

Trains the token sequence model, whitespace model (if dynamic) and unknown token model (if dynamic).

Specified by:
train in interface LanguageModel.Dynamic
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one plus last character in slice.
count - Number of instances of sequence to train.
Throws:
IndexOutOfBoundsException - If the indices are out of range for the character array.
IllegalArgumentException - If the count is negative.

trainSequence

@Deprecated
public void trainSequence(char[] cs,
                                     int start,
                                     int end,
                                     int count)
Deprecated. Use trainSequence(CharSequence,int) instead.

This method trains the last token in the sequence given the previous tokens. See trainSequence(CharSequence, int) for more information.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one plus last character in slice.
count - Number of instances of sequence to train.
Throws:
IndexOutOfBoundsException - If the indices are out of range for the character array.
IllegalArgumentException - If the count is negative.

trainSequence

public void trainSequence(CharSequence cSeq,
                          int count)
This method increments the count of the entire sequence specified. Note that this method does not increment any of the token subsequences and does not increment the whitespace or token smoothing models.

This method may be used to train a tokenized language model from individual character sequence counts. Because the token smoothing models are not trained by this method, it may be used to construct a pure token model; to train token smoothing with character subsequences, call train(CharSequence,int) on character sequences corresponding to unigrams rather than this method.

For instance, with com.aliasi.tokenizer.IndoEuropeanTokenizerFactory, calling trainSequence("the fast computer",5) would extract three tokens, the, fast and computer, and would increment the count of the three-token sequence, but not the counts of any of its subsequences.

If the number of tokens is longer than the maximum n-gram length, only the final tokens are trained. For instance, with an n-gram length of 2, and the Indo-European tokenizer factory, calling trainSequence("a slightly faster computer",93) is equivalent to calling trainSequence("faster computer",93).

All tokens trained are added to the symbol table. This does not include any initial tokens that are not used because the maximum n-gram length is too short.
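The behavior described above might be exercised as follows (a sketch assuming the Indo-European tokenizer factory):

 import com.aliasi.lm.TokenizedLM;
 import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

 public class TrainSequenceDemo {
     public static void main(String[] args) {
         TokenizedLM lm = new TokenizedLM(IndoEuropeanTokenizerFactory.INSTANCE, 3);

         // Increments the count of the trigram (the, fast, computer) by 5,
         // but not the counts of its subsequences or the smoothing models.
         lm.trainSequence("the fast computer", 5);

         // With n-gram order 3, only the final three tokens are counted, so this
         // is equivalent to trainSequence("slightly faster computer", 93).
         lm.trainSequence("a slightly faster computer", 93);

         // toString() reports the token counts stored in the model.
         System.out.println(lm);
     }
 }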

Parameters:
cSeq - Character sequence to train.
count - Number of instances to train.
Throws:
IllegalArgumentException - If the count is negative.

log2Estimate

public double log2Estimate(CharSequence cSeq)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character sequence.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cSeq - Character sequence to estimate.
Returns:
Log estimate of likelihood of specified character sequence.

log2Estimate

public double log2Estimate(char[] cs,
                           int start,
                           int end)
Description copied from interface: LanguageModel
Returns an estimate of the log (base 2) probability of the specified character slice.

Specified by:
log2Estimate in interface LanguageModel
Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
Log estimate of likelihood of specified character sequence.

tokenProbability

public double tokenProbability(String[] tokens,
                               int start,
                               int end)
Description copied from interface: LanguageModel.Tokenized
Returns the probability of the specified token slice in the token n-gram distribution. This estimate includes the estimates of the actual token for unknown tokens.

Specified by:
tokenProbability in interface LanguageModel.Tokenized
Parameters:
tokens - Underlying array of tokens.
start - Index of first token in slice.
end - Index of one past the last token in the slice.
Returns:
The probability of the token slice.

tokenLog2Probability

public double tokenLog2Probability(String[] tokens,
                                   int start,
                                   int end)
Description copied from interface: LanguageModel.Tokenized
Returns the log (base 2) probability of the specified token slice in the underlying token n-gram distribution. This includes the estimation of the actual token for unknown tokens.

Specified by:
tokenLog2Probability in interface LanguageModel.Tokenized
Parameters:
tokens - Underlying array of tokens.
start - Index of first token in slice.
end - Index of one past the last token in the slice.
Returns:
The log (base 2) probability of the token slice.

processLog2Probability

public double processLog2Probability(String[] tokens)
Returns the log (base 2) probability of the specified tokens in the underlying token n-gram distribution. This includes the estimation of the actual token for unknown tokens.

Parameters:
tokens - Tokens whose probability is returned.
Returns:
The log (base 2) probability of the tokens.

collocations

@Deprecated
public ScoredObject<String[]>[] collocations(int nGram,
                                                        int minCount,
                                                        int maxReturned)
Deprecated. Use collocationSet(int,int,int) instead.

See collocationSet(int,int,int).

Parameters:
nGram - Length of n-grams to search for collocations.
minCount - Minimum count for a returned n-gram.
maxReturned - Maximum number of results returned.
Returns:
Array of collocations in confidence order.

collocationSet

public SortedSet<ScoredObject<String[]>> collocationSet(int nGram,
                                                        int minCount,
                                                        int maxReturned)
Returns a sorted set of collocations in order of confidence that their token sequences are not independent. The objects contained in the returned scored objects will be instances of String[] containing tokens. The length of n-gram, minimum count for a result, and maximum number of results returned are all specified. The confidence ordering is based on the result of Pearson's χ² independence statistic as computed by chiSquaredIndependence(int[]).
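For example (a sketch assuming the Indo-European tokenizer factory and ScoredObject from com.aliasi.util):

 import com.aliasi.lm.TokenizedLM;
 import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
 import com.aliasi.util.ScoredObject;

 import java.util.Arrays;
 import java.util.SortedSet;

 public class CollocationDemo {
     public static void main(String[] args) {
         TokenizedLM lm = new TokenizedLM(IndoEuropeanTokenizerFactory.INSTANCE, 3);
         lm.handle("New York is larger than New Haven");
         lm.handle("New York traffic is slower than New Haven traffic");

         // Bigram collocations seen at least twice, at most 10 returned,
         // ordered by the chi-squared independence statistic.
         SortedSet<ScoredObject<String[]>> collocations = lm.collocationSet(2, 2, 10);

         for (ScoredObject<String[]> so : collocations) {
             System.out.println(Arrays.asList(so.getObject()) + " : " + so.score());
         }
     }
 }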

Parameters:
nGram - Length of n-grams to search for collocations.
minCount - Minimum count for a returned n-gram.
maxReturned - Maximum number of results returned.
Returns:
Sorted set of collocations in confidence order.

newTerms

@Deprecated
public ScoredObject<String[]>[] newTerms(int nGram,
                                                    int minCount,
                                                    int maxReturned,
                                                    LanguageModel.Tokenized backgroundLM)
Deprecated. Use newTermSet(int,int,int,LanguageModel.Tokenized) instead.

See newTermSet(int,int,int,LanguageModel.Tokenized).

Parameters:
nGram - Length of n-grams to search for significant new terms.
minCount - Minimum count for a returned n-gram.
maxReturned - Maximum number of results returned.
backgroundLM - Background language model against which significance is measured.
Returns:
Array of new terms ordered by significance.

newTermSet

public SortedSet<ScoredObject<String[]>> newTermSet(int nGram,
                                                    int minCount,
                                                    int maxReturned,
                                                    LanguageModel.Tokenized backgroundLM)
Returns a sorted set of scored n-grams ordered by the significance of the degree to which their counts in this model exceed their expected counts in a specified background model. The returned set contains ScoredObject instances whose objects are terms represented as string arrays and whose scores are the significance scores for the terms. For instance, the new terms may be printed in order of significance by:

 SortedSet<ScoredObject<String[]>> terms = newTermSet(3,5,100,bgLM);
 for (ScoredObject<String[]> scoredTerm : terms) {
     String[] term = scoredTerm.getObject();
     double score = scoredTerm.score();
     ...
 }

The exact scoring used is the z-score as defined in BinomialDistribution.z(double,int,int), with the success probability defined by the n-gram's probability estimate in the background model, the number of successes being the count of the n-gram in this model, and the number of trials being the total count in this model.

See oldTermSet(int,int,int,LanguageModel.Tokenized) for a method that returns the least significant terms in this model relative to a background model.

Parameters:
nGram - Length of n-grams to search for significant new terms.
minCount - Minimum count for a returned n-gram.
maxReturned - Maximum number of results returned.
backgroundLM - Background language model against which significance is measured.
Returns:
New terms ordered by significance.

oldTerms

@Deprecated
public ScoredObject<String[]>[] oldTerms(int nGram,
                                                    int minCount,
                                                    int maxReturned,
                                                    LanguageModel.Tokenized backgroundLM)
Deprecated. Use oldTermSet(int,int,int,LanguageModel.Tokenized) instead.

See oldTermSet(int,int,int,LanguageModel.Tokenized).

Parameters:
nGram - Length of n-grams to search for significant old terms.
minCount - Minimum count in background model for a returned n-gram.
maxReturned - Maximum number of results returned.
backgroundLM - Background language model from which counts are derived.
Returns:
Array of old terms ordered by significance.

oldTermSet

public SortedSet<ScoredObject<String[]>> oldTermSet(int nGram,
                                                    int minCount,
                                                    int maxReturned,
                                                    LanguageModel.Tokenized backgroundLM)
Returns a list of scored n-grams ordered in reverse order of significance with respect to the background model. In other words, these are ones that occur less often in this model than they would have been expected to given the background model.

Note that only terms that exist in the foreground model are considered. By contrast, reversing the roles of the models in the sister method newTermSet(int,int,int,LanguageModel.Tokenized) considers every n-gram in the background model and may return slightly different results.

Parameters:
nGram - Length of n-grams to search for significant old terms.
minCount - Minimum count in background model for a returned n-gram.
maxReturned - Maximum number of results returned.
backgroundLM - Background language model from which counts are derived.
Returns:
Old terms ordered by significance.

frequentTerms

@Deprecated
public ScoredObject<String[]>[] frequentTerms(int nGram,
                                                         int maxReturned)
Deprecated. Use frequentTermSet(int,int) instead.

See frequentTermSet(int,int).

Parameters:
nGram - Length of n-grams to search.
maxReturned - Maximum number of results returned.

frequentTermSet

public SortedSet<ScoredObject<String[]>> frequentTermSet(int nGram,
                                                         int maxReturned)
Returns the most frequent n-gram terms in the training data up to the specified maximum number. The terms are ordered by raw counts and returned in order. The scored objects in the returned set have objects that are the terms themselves and scores based on count.

See infrequentTermSet(int,int) to retrieve the least frequent terms.

Parameters:
nGram - Length of n-grams to search.
maxReturned - Maximum number of results returned.

infrequentTerms

@Deprecated
public ScoredObject<String[]>[] infrequentTerms(int nGram,
                                                           int maxReturned)
Deprecated. Use infrequentTermSet(int,int) instead.

See infrequentTermSet(int,int).

Parameters:
nGram - Length of n-grams to search.
maxReturned - Maximum number of results returned.

infrequentTermSet

public SortedSet<ScoredObject<String[]>> infrequentTermSet(int nGram,
                                                           int maxReturned)
Returns the least frequent n-gram terms in the training data up to the specified maximum number. The terms are ordered by raw counts and returned in reverse order. The scored objects in the returned set have objects that are the terms themselves and scores based on count.

See frequentTermSet(int,int) to retrieve the most frequent terms.

Parameters:
nGram - Length of n-grams to search.
maxReturned - Maximum number of results returned.

chiSquaredIndependence

public double chiSquaredIndependence(int[] nGram)
Returns the maximum value of Pearson's χ² (chi-squared) independence test statistic resulting from splitting the specified n-gram in half to derive a contingency matrix. Higher return values indicate more dependence among the terms in the n-gram.

The input n-gram is split into two halves, Term1 and Term2, each of which is a non-empty sequence of integers. Term1 consists of the tokens indexed 0 to mid-1 and Term2 from mid to end-1.

The contingency matrix for computing the independence statistic is:

              +Term2          -Term2
 +Term1    Term(+,+)       Term(+,-)
 -Term1    Term(-,+)       Term(-,-)

where the values for a specified integer sequence nGram and midpoint 0 < mid < end are:
Term(+,+) = count(nGram,0,end)
Term(+,-) = count(nGram,0,mid) - count(nGram,0,end)
Term(-,+) = count(nGram,mid,end) - count(nGram,0,end)
Term(-,-) = totalCount - Term(+,+) - Term(+,-) - Term(-,+)
Note that using the overall total count provides a slight overapproximation of the count of appropriate-length n-grams.

For further information on the independence test, see the documentation for Statistics.chiSquaredIndependence(double,double,double,double).
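For example, the statistic for a trained bigram might be computed as follows. This is a sketch assuming the Indo-European tokenizer factory; the tokens are mapped to integer identifiers via the model's symbol table.

 import com.aliasi.lm.TokenizedLM;
 import com.aliasi.symbol.SymbolTable;
 import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

 public class ChiSquaredDemo {
     public static void main(String[] args) {
         TokenizedLM lm = new TokenizedLM(IndoEuropeanTokenizerFactory.INSTANCE, 3);
         lm.handle("strong tea and strong coffee");
         lm.handle("strong tea is better than weak tea");

         // Map tokens to their integer identifiers in the model's symbol table.
         SymbolTable symbolTable = lm.symbolTable();
         int[] nGram = new int[] {
             symbolTable.symbolToID("strong"),
             symbolTable.symbolToID("tea")
         };

         // Larger values indicate stronger evidence that the halves are not independent.
         System.out.println(lm.chiSquaredIndependence(nGram));
     }
 }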

Parameters:
nGram - Array of integers whose independence statistic is returned.
Returns:
Maximum independence test statistic score over splits of the n-gram.
Throws:
IllegalArgumentException - If the specified n-gram is not at least two elements long.

z

public double z(int[] nGram,
                int nGramSampleCount,
                int totalSampleCount)
Returns the z-score of the specified n-gram with the specified count out of a total sample count, as measured against the expectation of this tokenized language model. Negative z-scores mean the sample n-gram count is lower than expected and positive z-scores mean the sample n-gram count is higher than expected. Z-scores close to zero indicate the sample count is in line with expectations according to this language model.

Formulas for z-scores and an explanation of their scaling by deviation is described in the documentation for the static method BinomialDistribution.z(double,int,int).

Parameters:
nGram - The n-gram to test.
nGramSampleCount - The number of observations of the n-gram in the sample.
totalSampleCount - The total number of samples.
Returns:
The z-score for the specified sample counts against the expectations of this language model.

toString

public String toString()
Returns a string-based representation of the token counts for this language model.

Overrides:
toString in class Object
Returns:
A string-based representation of this model.