com.aliasi.lm
Interface LanguageModel

All Known Subinterfaces:
LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Process, LanguageModel.Sequence, LanguageModel.Tokenized
All Known Implementing Classes:
CompiledNGramBoundaryLM, CompiledNGramProcessLM, CompiledTokenizedLM, NGramBoundaryLM, NGramProcessLM, TokenizedLM, UniformBoundaryLM, UniformProcessLM

public interface LanguageModel

A LanguageModel provides an estimate of the probability of a sequence of characters. Sequences of characters may be specified either as an array slice or as a Java CharSequence, an interface implemented by String, StringBuilder, and the java.nio buffer class CharBuffer.

There are several subinterfaces of LanguageModel. The primary distinction is between LanguageModel.Sequence and LanguageModel.Process, which place different normalization requirements on their estimates. A sequence model requires the estimates to sum to 1.0 over all character sequences, whereas a process model requires, for each length, that the estimates sum to 1.0 over all sequences of that length. Every language model should be marked by one of these two subinterfaces.
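The two normalization requirements can be checked numerically on a toy example. The sketch below is not LingPipe code: the class NormalizationSketch and its methods are hypothetical, and it assumes a uniform distribution over an alphabet of k characters. The process-style estimate assigns each character probability 1/k, while the sequence-style estimate reserves one extra outcome for the end of the sequence, so each of the n + 1 outcomes in a length-n sequence has probability 1/(k + 1).

```java
// Toy illustration of the two normalization schemes; hypothetical names, not LingPipe code.
final class NormalizationSketch {

    // Process-style uniform estimate: log2 of (1/k)^length.
    static double processLog2Estimate(int length, int k) {
        return -length * (Math.log(k) / Math.log(2.0));
    }

    // Sequence-style uniform estimate: k characters plus one end-of-sequence
    // outcome, so a sequence of the given length uses length + 1 outcomes.
    static double sequenceLog2Estimate(int length, int k) {
        return -(length + 1) * (Math.log(k + 1) / Math.log(2.0));
    }

    // Total process probability over all k^length sequences of one fixed length.
    static double processMassAtLength(int length, int k) {
        return Math.pow(k, length) * Math.pow(2.0, processLog2Estimate(length, k));
    }

    // Total sequence probability over all sequences of length 0..maxLength.
    static double sequenceMassUpTo(int maxLength, int k) {
        double total = 0.0;
        for (int n = 0; n <= maxLength; ++n) {
            total += Math.pow(k, n) * Math.pow(2.0, sequenceLog2Estimate(n, k));
        }
        return total;
    }

    public static void main(String[] args) {
        int k = 4; // assumed alphabet size for the demonstration
        for (int n = 0; n <= 3; ++n) {
            System.out.println("process mass at length " + n + ": "
                               + processMassAtLength(n, k));
        }
        System.out.println("sequence mass up to length 200: "
                           + sequenceMassUpTo(200, k));
    }
}
```

In this toy setting the process mass is close to 1.0 at every individual length, whereas the sequence mass only approaches 1.0 once all lengths are summed together.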

The LanguageModel.Conditional interface provides additional methods for conditional estimates. The LanguageModel.Dynamic interface provides a method for training the model with sample character sequence data. Finally, several of the language model implementations are serializable to an object output stream.

Since:
LingPipe 2.0
Version:
2.2
Author:
Bob Carpenter

Nested Class Summary
static interface LanguageModel.Conditional
          A LanguageModel.Conditional is a language model that implements conditional estimates of characters given previous characters.
static interface LanguageModel.Dynamic
          A LanguageModel.Dynamic accepts training events in the form of character slices or sequences.
static interface LanguageModel.Process
          A LanguageModel.Process is normalized by length.
static interface LanguageModel.Sequence
          A LanguageModel.Sequence is normalized over all character sequences.
static interface LanguageModel.Tokenized
          A LanguageModel.Tokenized provides a means of estimating the probability of a sequence of tokens.
 
Method Summary
 double log2Estimate(char[] cs, int start, int end)
          Returns an estimate of the log (base 2) probability of the specified character slice.
 double log2Estimate(CharSequence cs)
          Returns an estimate of the log (base 2) probability of the specified character sequence.
 

Method Detail

log2Estimate

double log2Estimate(char[] cs,
                    int start,
                    int end)
Returns an estimate of the log (base 2) probability of the specified character slice.

Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
Log (base 2) estimate of the probability of the specified character slice.
Throws:
IndexOutOfBoundsException - If the start index or the end index minus one is outside the bounds of the character array.
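The slice convention (start inclusive, end exclusive) can be illustrated with a small stand-in. The sketch below does not use LingPipe: SketchLanguageModel is a hypothetical local interface that only mirrors the two signatures documented on this page, and UniformSketchModel is a hypothetical toy implementation that assigns every character the same probability. The exact bounds check is an assumption about reasonable behavior, not a statement of how any LingPipe class validates its arguments.

```java
// Demo entry point; all types in this file are hypothetical stand-ins.
final class SliceDemo {
    public static void main(String[] args) {
        SketchLanguageModel lm = new UniformSketchModel(256);
        char[] cs = "hello world".toCharArray();
        // Slice [0, 5) covers "hello"; end is one plus the index of the last character.
        System.out.println(lm.log2Estimate(cs, 0, 5));
        System.out.println(lm.log2Estimate("hello"));
        try {
            lm.log2Estimate(cs, 0, cs.length + 1); // end runs past the array
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}

// Mirrors the two documented method signatures; not com.aliasi.lm.LanguageModel.
interface SketchLanguageModel {
    double log2Estimate(char[] cs, int start, int end);
    double log2Estimate(CharSequence cs);
}

// Toy model: each character independently has probability 1/numChars.
final class UniformSketchModel implements SketchLanguageModel {
    private final double log2PerChar;

    UniformSketchModel(int numChars) {
        this.log2PerChar = -Math.log(numChars) / Math.log(2.0);
    }

    @Override
    public double log2Estimate(char[] cs, int start, int end) {
        // Assumed validation: the slice must lie within the array.
        if (start < 0 || end > cs.length || start > end) {
            throw new IndexOutOfBoundsException(
                "start=" + start + " end=" + end + " length=" + cs.length);
        }
        return (end - start) * log2PerChar;
    }

    @Override
    public double log2Estimate(CharSequence cs) {
        return cs.length() * log2PerChar;
    }
}
```

Under these assumptions the slice call and the CharSequence call agree whenever they cover the same characters, which is the relationship a caller would typically expect between the two overloads.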

log2Estimate

double log2Estimate(CharSequence cs)
Returns an estimate of the log (base 2) probability of the specified character sequence.

Parameters:
cs - Character sequence to estimate.
Returns:
Log (base 2) estimate of the probability of the specified character sequence.
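Because the returned value is a base 2 logarithm, it can be turned back into a probability or into an average number of bits per character. The helper below is a minimal sketch with hypothetical names (Log2Conversions, toProbability, bitsPerChar); it operates on a plain double and makes no assumption about which model produced the estimate.

```java
// Hypothetical helpers for interpreting a log (base 2) estimate.
final class Log2Conversions {

    // Recover the probability from its base 2 logarithm.
    static double toProbability(double log2Estimate) {
        return Math.pow(2.0, log2Estimate);
    }

    // Average negative log2 probability per character.
    // The caller must pass a positive length; length 0 is rejected here.
    static double bitsPerChar(double log2Estimate, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        return -log2Estimate / length;
    }

    public static void main(String[] args) {
        double log2Estimate = -6.0; // example value, not produced by a real model
        System.out.println(toProbability(log2Estimate));
        System.out.println(bitsPerChar(log2Estimate, 3));
    }
}
```

Working in log space avoids numerical underflow for long sequences, so converting back to a raw probability is usually only sensible for short inputs.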