com.aliasi.lm
Class NGramBoundaryLM
java.lang.Object
com.aliasi.lm.NGramBoundaryLM
- All Implemented Interfaces:
- LanguageModel, LanguageModel.Conditional, LanguageModel.Dynamic, LanguageModel.Sequence, Model<CharSequence>, Compilable
public class NGramBoundaryLM
- extends Object
- implements LanguageModel.Sequence, LanguageModel.Conditional, LanguageModel.Dynamic, Model<CharSequence>
An NGramBoundaryLM provides a dynamic sequence
language model for which training, estimation and pruning may be
interleaved. A sequence language model normalizes probabilities
over all sequences.
The model may be compiled to an object output; the compiled
model read from the corresponding object input will be
an instance of CompiledNGramBoundaryLM.
This class wraps an n-gram process language model by supplying a
special boundary character boundaryChar at
construction time which will be added to the total number of
characters in defining the estimator. For each training event, the
boundary character is inserted both before and after the character
sequence provided. The actual unigram count of this boundary must
then be decremented so that the initial character isn't counted in
estimates. During estimation, the initial boundary character is
used as context and the final one is used to estimate the
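end-of-stream likelihood. Thus if $P_{pr}$
is the underlying process model, then the boundary model defines
estimates by:
\[
\begin{aligned}
P_b(c_1,\ldots,c_N)
&= P_{pr}(\mathit{boundaryChar} \mid \mathit{boundaryChar}, c_1,\ldots,c_N)
\cdot \prod_{1 \le i \le N} P_{pr}(c_i \mid \mathit{boundaryChar}, c_1,\ldots,c_{i-1}) \\
&= P_{pr}(\mathit{boundaryChar}, c_1,\ldots,c_N, \mathit{boundaryChar}) \,/\, P_{pr}(\mathit{boundaryChar})
\end{aligned}
\]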
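The relationship between the boundary estimate and the process estimate can be sketched in a few lines. The following is a toy illustration, not LingPipe code: the class `BoundaryEstimateSketch`, its methods, and the uniform four-symbol process model are all hypothetical. It works in log space, where the boundary estimate is the process estimate of the wrapped sequence minus the process estimate of the lone boundary character. Unlike a real n-gram model, the toy conditional estimator receives the full context rather than a truncated one.

```java
import java.util.function.BiFunction;

/** Toy illustration only; none of these names are part of LingPipe. */
final class BoundaryEstimateSketch {

    /** Logarithm base 2. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * Joint log2 probability of s under a process model supplied as a
     * conditional estimator cond(context, nextChar), using the chain rule.
     */
    static double log2Process(BiFunction<String, Character, Double> cond, String s) {
        double sum = 0.0;
        for (int i = 0; i < s.length(); i++) {
            sum += log2(cond.apply(s.substring(0, i), s.charAt(i)));
        }
        return sum;
    }

    /**
     * Boundary estimate: wrap cs in boundary characters, then remove the
     * contribution of the initial boundary, which only serves as context.
     */
    static double log2Boundary(BiFunction<String, Character, Double> cond,
                               String cs, char boundaryChar) {
        String wrapped = boundaryChar + cs + boundaryChar;
        return log2Process(cond, wrapped)
            - log2Process(cond, String.valueOf(boundaryChar));
    }

    /** Hypothetical uniform process model over a four-symbol alphabet. */
    static double uniformDemo() {
        BiFunction<String, Character, Double> uniform = (context, c) -> 0.25;
        return log2Boundary(uniform, "ab", '\uFFFF');
    }

    public static void main(String[] args) {
        System.out.println(uniformDemo());
    }
}
```

Under the uniform toy model, each of the three scored events (`a`, `b`, and the final boundary) contributes two bits, which is why the final boundary acts as an end-of-stream estimate.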
The result of serializing and deserializing an n-gram boundary
language model is a compiled implementation of a conditional
sequence language model. The serialization format is the boundary character
followed by the serialization of the contained writable process
language model.
Models may be pruned by pruning the substring counter returned
by substringCounter(). See the documentation for the
class of the return object, TrieCharSeqCounter, for more
information.
- Since:
- LingPipe 2.0
- Version:
- 3.5.1
- Author:
- Bob Carpenter
Constructor Summary
NGramBoundaryLM(int maxNGram)
Constructs a dynamic n-gram sequence language model with the
specified maximum n-gram and default values for other
parameters. |
NGramBoundaryLM(int maxNGram,
int numChars)
Constructs a dynamic n-gram sequence language model with the
specified maximum n-gram, specified maximum number of observed
characters, and default values for other parameters. |
NGramBoundaryLM(int maxNGram,
int numChars,
double lambdaFactor,
char boundaryChar)
Construct a dynamic n-gram sequence language model with the
specified maximum n-gram length, number of characters,
interpolation ratio hyperparameter and boundary character. |
NGramBoundaryLM(NGramProcessLM processLm,
char boundaryChar)
Construct an n-gram boundary language model with the specified
boundary character and underlying process language model. |
Method Summary
void |
compileTo(ObjectOutput objOut)
Writes a compiled version of this boundary language model to
the specified object output. |
NGramProcessLM |
getProcessLM()
Returns the underlying n-gram process language model
for this boundary language model. |
double |
log2ConditionalEstimate(char[] cs,
int start,
int end)
Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
slice given the previous characters. |
double |
log2ConditionalEstimate(CharSequence cs)
Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
character sequence given the previous characters. |
double |
log2Estimate(char[] cs,
int start,
int end)
Returns an estimate of the log (base 2) probability of the
specified character slice. |
double |
log2Estimate(CharSequence cs)
Returns an estimate of the log (base 2) probability of the
specified character sequence. |
double |
log2Prob(CharSequence cSeq)
This method is a convenience implementation of the Model interface which delegates the call to log2Estimate(CharSequence). |
char[] |
observedCharacters()
Returns the characters that have been observed for this
language model, including the special boundary character. |
double |
prob(CharSequence cSeq)
This method is a convenience implementation of the Model
interface which returns the result of raising 2.0 to the
power of the result of a call to log2Estimate(CharSequence). |
static NGramBoundaryLM |
readFrom(InputStream in)
Reads a boundary language model from the specified input
stream. |
TrieCharSeqCounter |
substringCounter()
Returns the underlying substring counter for this language
model. |
String |
toString()
Returns a string-based representation of this language model. |
void |
train(char[] cs,
int start,
int end)
Update the model with the training data provided by
the specified character slice. |
void |
train(char[] cs,
int start,
int end,
int count)
Update the model with the training data provided by the
specified character slice with the specified count. |
void |
train(CharSequence cs)
Update the model with the training data provided by the
specified character sequence with a count of one. |
void |
train(CharSequence cs,
int count)
Update the model with the training data provided by the
specified character sequence with the specified count. |
void |
writeTo(OutputStream out)
Writes this language model to the specified output stream. |
NGramBoundaryLM
public NGramBoundaryLM(int maxNGram)
- Constructs a dynamic n-gram sequence language model with the
specified maximum n-gram and default values for other
parameters.
The default number of characters is Character.MAX_VALUE-1, the default interpolation
parameter ratio is equal to the n-gram length, and the boundary
character is the Unicode noncharacter U+FFFF.
- Parameters:
maxNGram - Maximum n-gram length in model.
NGramBoundaryLM
public NGramBoundaryLM(int maxNGram,
int numChars)
- Constructs a dynamic n-gram sequence language model with the
specified maximum n-gram, specified maximum number of observed
characters, and default values for other parameters.
The default interpolation
parameter ratio is equal to the n-gram length, and the boundary
character is the Unicode noncharacter U+FFFF.
- Parameters:
maxNGram - Maximum n-gram length in model.
numChars - Maximum number of characters seen in training
and test sets.
NGramBoundaryLM
public NGramBoundaryLM(int maxNGram,
int numChars,
double lambdaFactor,
char boundaryChar)
- Construct a dynamic n-gram sequence language model with the
specified maximum n-gram length, number of characters,
interpolation ratio hyperparameter and boundary character.
Note that the boundary character must not occur as a regular
character in the input. Unicode provides several options for
marker characters; for instance, noncharacters such as
U+FFFF or U+FFFE may be used
internally by applications but are not meant to appear in interchanged Unicode
character streams, and thus make ideal choices for boundary
characters. See:
Unicode Standard, Chapter 15.8: NonCharacters
- Parameters:
maxNGram - Maximum n-gram length in model.
numChars - Maximum number of characters seen in training
and test sets.
lambdaFactor - Interpolation ratio hyperparameter.
boundaryChar - Boundary character.
NGramBoundaryLM
public NGramBoundaryLM(NGramProcessLM processLm,
char boundaryChar)
- Construct an n-gram boundary language model with the specified
boundary character and underlying process language model.
This constructor may be used to reconstitute a serialized
model. By writing the trie character sequence counter for the
underlying process language model, it may be read back in.
This may be used to construct a process language model, which
may be used to reconstruct a boundary language model using
this constructor.
- Parameters:
processLm - Underlying process language model.
boundaryChar - Character used to encode boundaries.
writeTo
public void writeTo(OutputStream out)
throws IOException
- Writes this language model to the specified output stream.
A bit output is wrapped around the output stream for
writing. The format begins with a delta-encoding of
the boundary character plus 1, and is followed by the
bit output of the underlying process language model.
- Parameters:
out - Output stream to which the language model is written.
- Throws:
IOException - If there is an underlying I/O error.
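One plausible reason for writing the boundary character plus 1 is that delta codes only represent positive integers, while a boundary character may have the value 0. The sketch below assumes that "delta-encoding" refers to Elias delta coding; that assumption, the class `EliasDeltaSketch`, and its string-of-bits output are illustrative only and are not taken from LingPipe's bit output implementation.

```java
/** Toy Elias delta encoder; an assumption about what "delta-encoding" means here. */
final class EliasDeltaSketch {

    /** Returns the Elias delta code of n (n >= 1) as a string of '0' and '1'. */
    static String encode(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("delta codes require n >= 1");
        }
        // Number of significant bits in n.
        int numBits = 64 - Long.numberOfLeadingZeros(n);
        // Number of significant bits in numBits.
        int lenBits = 32 - Integer.numberOfLeadingZeros(numBits);
        StringBuilder sb = new StringBuilder();
        // Unary prefix: lenBits - 1 zeros.
        for (int i = 1; i < lenBits; i++) {
            sb.append('0');
        }
        // numBits written in binary using lenBits bits.
        for (int i = lenBits - 1; i >= 0; i--) {
            sb.append((numBits >>> i) & 1);
        }
        // The bits of n below its leading 1 bit.
        for (int i = numBits - 2; i >= 0; i--) {
            sb.append((n >>> i) & 1L);
        }
        return sb.toString();
    }

    /** Shifts the character value by one so that U+0000 is encodable. */
    static String encodeBoundaryChar(char boundaryChar) {
        return encode(boundaryChar + 1L);
    }

    public static void main(String[] args) {
        System.out.println(encodeBoundaryChar('\uFFFF'));
    }
}
```

A reader would subtract 1 after decoding to recover the boundary character.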
readFrom
public static NGramBoundaryLM readFrom(InputStream in)
throws IOException
- Reads a boundary language model from the specified input
stream.
See writeTo(OutputStream) for a description
of the binary format.
- Parameters:
in - Input stream from which to read the model.
- Returns:
- Boundary language model read from stream.
- Throws:
IOException - If there is an underlying I/O error.
getProcessLM
public NGramProcessLM getProcessLM()
- Returns the underlying n-gram process language model
for this boundary language model. Changes to the returned
model affect this language model.
- Returns:
- The underlying process language model.
observedCharacters
public char[] observedCharacters()
- Returns the characters that have been observed for this
language model, including the special boundary character.
- Specified by:
observedCharacters in interface LanguageModel.Conditional
- Returns:
- The observed characters for this language model.
substringCounter
public TrieCharSeqCounter substringCounter()
- Returns the underlying substring counter for this language
model. This model may be pruned by pruning the counter
returned by this method.
- Returns:
- The underlying substring counter for this language model.
compileTo
public void compileTo(ObjectOutput objOut)
throws IOException
- Writes a compiled version of this boundary language model to
the specified object output. The result may be read back in
by casting the result of
ObjectInput.readObject() to
CompiledNGramBoundaryLM.
- Specified by:
compileTo in interface Compilable
- Parameters:
objOut - Object output to which this model is compiled.
- Throws:
IOException - If there is an I/O exception during the
write.
train
public void train(CharSequence cs,
int count)
- Description copied from interface:
LanguageModel.Dynamic
- Update the model with the training data provided by the
specified character sequence with the specified count.
Calling this method,
train(cs,n), is equivalent
to calling train(cs) a total of n
times.
- Specified by:
train in interface LanguageModel.Dynamic
- Parameters:
cs - The character sequence to use as training data.
count - Number of instances to train.
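The equivalence between a counted training call and repeated single calls, together with the boundary handling described for this class, can be sketched with a toy substring counter. This is a simplified stand-in and not LingPipe's trie-based counter: the class `BoundaryCountSketch`, its `counts` map, and the `$` boundary used below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy n-gram counter with boundary wrapping; not LingPipe's implementation. */
final class BoundaryCountSketch {
    final int maxNGram;
    final char boundaryChar;
    final Map<String, Integer> counts = new HashMap<>();

    BoundaryCountSketch(int maxNGram, char boundaryChar) {
        this.maxNGram = maxNGram;
        this.boundaryChar = boundaryChar;
    }

    /** Trains with a count of one. */
    void train(CharSequence cs) {
        train(cs, 1);
    }

    /** Trains as if cs had been observed count times. */
    void train(CharSequence cs, int count) {
        // Insert the boundary character before and after the sequence.
        String wrapped = boundaryChar + cs.toString() + boundaryChar;
        // Count every substring of length 1..maxNGram.
        for (int start = 0; start < wrapped.length(); start++) {
            for (int len = 1; len <= maxNGram && start + len <= wrapped.length(); len++) {
                counts.merge(wrapped.substring(start, start + len), count, Integer::sum);
            }
        }
        // The initial boundary is context only, so remove its unigram count.
        counts.merge(String.valueOf(boundaryChar), -count, Integer::sum);
    }

    public static void main(String[] args) {
        BoundaryCountSketch counter = new BoundaryCountSketch(2, '$');
        counter.train("ab", 3);
        System.out.println(counter.counts);
    }
}
```

In this sketch, training once with a count of 3 leaves the map in the same state as three separate calls, and the boundary unigram ends up counted once per training instance rather than twice.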
train
public void train(CharSequence cs)
- Description copied from interface:
LanguageModel.Dynamic
- Update the model with the training data provided by the
specified character sequence with a count of one.
- Specified by:
train in interface LanguageModel.Dynamic
- Parameters:
cs - The character sequence to use as training data.
train
public void train(char[] cs,
int start,
int end)
- Description copied from interface:
LanguageModel.Dynamic
- Update the model with the training data provided by
the specified character slice.
- Specified by:
train in interface LanguageModel.Dynamic
- Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the
training slice.
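Because `end` is one plus the index of the last character, a slice covers `end - start` characters, with `start` inclusive and `end` exclusive. The helper below is a hypothetical illustration of that convention using only the standard library; the assumption that the slice methods treat their input like the equivalent string is suggested by the parameter descriptions but is not stated outright in this documentation.

```java
/** Toy helper showing the (start inclusive, end exclusive) slice convention. */
final class SliceSketch {

    /** Converts the slice cs[start..end) to the equivalent string. */
    static String sliceToString(char[] cs, int start, int end) {
        if (start < 0 || end > cs.length || start > end) {
            throw new IndexOutOfBoundsException(
                "start=" + start + " end=" + end + " length=" + cs.length);
        }
        // The String constructor takes an offset and a length, not an end index.
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] cs = "hello world".toCharArray();
        System.out.println(sliceToString(cs, 6, 11));
    }
}
```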
train
public void train(char[] cs,
int start,
int end,
int count)
- Description copied from interface:
LanguageModel.Dynamic
- Update the model with the training data provided by the
specified character slice with the specified count.
Calling this method,
train(cs,start,end,n), is equivalent
to calling train(cs,start,end) a total of
n times.
- Specified by:
train in interface LanguageModel.Dynamic
- Parameters:
cs - The underlying character array for the slice.
start - Index of first character in the slice.
end - Index of one plus the last character in the
training slice.
count - Number of instances to train.
log2ConditionalEstimate
public double log2ConditionalEstimate(CharSequence cs)
- Description copied from interface:
LanguageModel.Conditional
- Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
character sequence given the previous characters.
- Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
- Parameters:
cs - Character sequence to estimate.
- Returns:
- The log conditional probability estimate.
log2ConditionalEstimate
public double log2ConditionalEstimate(char[] cs,
int start,
int end)
- Description copied from interface:
LanguageModel.Conditional
- Returns the log (base 2) of the probability estimate for the
conditional probability of the last character in the specified
slice given the previous characters.
- Specified by:
log2ConditionalEstimate in interface LanguageModel.Conditional
- Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus the index of the last character in the slice.
- Returns:
- The log conditional probability estimate.
log2Estimate
public double log2Estimate(CharSequence cs)
- Description copied from interface:
LanguageModel
- Returns an estimate of the log (base 2) probability of the
specified character sequence.
- Specified by:
log2Estimate in interface LanguageModel
- Parameters:
cs - Character sequence to estimate.
- Returns:
- Log estimate of likelihood of specified character
sequence.
log2Estimate
public double log2Estimate(char[] cs,
int start,
int end)
- Description copied from interface:
LanguageModel
- Returns an estimate of the log (base 2) probability of the
specified character slice.
- Specified by:
log2Estimate in interface LanguageModel
- Parameters:
cs - Underlying array of characters.
start - Index of first character in slice.
end - One plus index of last character in slice.
- Returns:
- Log estimate of likelihood of specified character
sequence.
log2Prob
public double log2Prob(CharSequence cSeq)
- This method is a convenience implementation of the
Model interface which delegates the call to log2Estimate(CharSequence).
- Specified by:
log2Prob in interface Model<CharSequence>
- Parameters:
cSeq - Character sequence whose probability is returned.
- Returns:
- The log (base 2) probability of the specified character sequence.
prob
public double prob(CharSequence cSeq)
- This method is a convenience implementation of the
Model
interface which returns the result of raising 2.0 to the
power of the result of a call to log2Estimate(CharSequence).
- Specified by:
prob in interface Model<CharSequence>
- Parameters:
cSeq - Character sequence whose probability is returned.
- Returns:
- The probability of the specified character sequence.
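Raising 2.0 to the log (base 2) estimate recovers a linear probability, but for long sequences the result can underflow toward zero in double precision. The sketch below uses hypothetical helper names and made-up log values purely for illustration; it suggests comparing models or sequences in log space rather than on linear probabilities.

```java
/** Toy helpers relating log2 estimates to linear probabilities. */
final class Log2ProbSketch {

    /** Converts a log (base 2) estimate to a linear probability. */
    static double probFromLog2(double log2Estimate) {
        return Math.pow(2.0, log2Estimate);
    }

    /**
     * Compares two estimates in log space; positive if the first is more
     * likely. This stays meaningful even when both probabilities underflow.
     */
    static int compareLog2(double log2A, double log2B) {
        return Double.compare(log2A, log2B);
    }

    public static void main(String[] args) {
        System.out.println(probFromLog2(-3.0));
        System.out.println(probFromLog2(-1100.0));
        System.out.println(compareLog2(-1100.0, -1200.0));
    }
}
```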
toString
public String toString()
- Returns a string-based representation of this language model.
It displays the boundary character and the contained
process language model.
- Overrides:
toString in class Object
- Returns:
- A string-based representation of this language model.