com.aliasi.spell
Class CompiledSpellChecker

java.lang.Object
  extended by com.aliasi.spell.CompiledSpellChecker
All Implemented Interfaces:
SpellChecker

public class CompiledSpellChecker
extends Object
implements SpellChecker

The CompiledSpellChecker class implements a first-best spell checker based on models of what users are likely to mean and what errors they are likely to make in expressing their meaning. This class is based on a character language model which represents likely user intentions, and a weighted edit distance, which represents how noise is introduced into the signal via typos, brainos, or other sources such as case-normalization, diacritic removal, bad character encodings, etc.

The usual way of creating a compiled checker is through an instance of TrainSpellChecker. The result of compiling the spell checker training class and reading it back in is a compiled spell checker. Only the basic models, weighted edit distance, and token set are supplied through compilation; all other parameters described below need to be set after an instance is read in from its compiled form. The token set may be null at construction time and may be set later.

This class adopts the noisy-channel model approach to decoding likely user intentions given received signals. Spelling correction simply returns the most likely intended message given the message actually received. In symbols:

didYouMean(received)
= ArgMax_intended P(intended | received)
= ArgMax_intended P(intended, received) / P(received)
= ArgMax_intended P(intended, received)
= ArgMax_intended P(intended) * P(received | intended)
The estimator P(intended), called the source model, estimates which signals are likely to be sent along the channel. For instance, the source might be a model of user's intent in entering information on a web page. The estimator P(received|intended), called the channel model, estimates how intended messages are likely to be garbled.
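The decoding rule above can be sketched in a few lines of plain Java. This is a toy illustration of the noisy-channel argmax, not the LingPipe implementation; the candidate strings and the log (base 2) scores in the two maps are made-up values for the received string "teh".

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NoisyChannelDemo {

    // Hypothetical source model: log2 P(intended).
    static final Map<String,Double> SOURCE = new LinkedHashMap<>();
    // Hypothetical channel model: log2 P(received | intended) for received = "teh".
    static final Map<String,Double> CHANNEL = new LinkedHashMap<>();
    static {
        SOURCE.put("the", -2.0);  CHANNEL.put("the", -3.0);  // transposition
        SOURCE.put("teh", -14.0); CHANNEL.put("teh", -0.1);  // exact match, rare word
        SOURCE.put("ten", -6.0);  CHANNEL.put("ten", -5.0);  // substitution
    }

    // didYouMean = argmax over intended of log2 P(intended) + log2 P(received|intended)
    static String didYouMean(String received) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String intended : SOURCE.keySet()) {
            double score = SOURCE.get(intended) + CHANNEL.get(intended);
            if (score > bestScore) {
                bestScore = score;
                best = intended;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "the" wins: -2 + -3 = -5 beats "teh" (-14.1) and "ten" (-11)
        System.out.println(didYouMean("teh"));
    }
}
```

The point of the toy scores is that the received string itself may lose: even though the channel strongly favors an exact match, a much higher source probability for "the" dominates the joint score.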

For this class, the source language model must be a compiled n-gram character language model. Compiled models are required for the efficiency of their suffix-tree encodings in evaluating sequences of characters. Optimizing held-out sample cross-entropy is not necessarily the best approach to building these language models, because they are being used here in a discriminative fashion, much as in language-model-based classification, tagging or chunking.

For this class, the channel model must be a weighted edit distance. For traditional spelling correction, this is a model of typos and brainos. There are two static constant weighted edit distances supplied in this class which are useful for other decoding tasks. The CASE_RESTORING distance may be used to restore case in single-case text. The TOKENIZING model may be used to tokenize untokenized text, and is used in our Chinese tokenization demo.

All input is normalized for whitespace by removing initial and final whitespace and reducing all other whitespace sequences to a single space character. A single space character is used as the initial context for the source language model. A single final uneditable space character is estimated at the end by the language model, thus adapting the process language model for use as a bounded sequence language model, just as in the language model package itself.
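The whitespace normalization described above amounts to a trim plus collapsing runs of whitespace. A minimal sketch of the assumed behavior (not LingPipe's own code):

```java
public class WhitespaceNorm {
    // Trim leading/trailing whitespace, then collapse each internal run of
    // whitespace (spaces, tabs, newlines) to a single space character.
    static String normalize(String input) {
        return input.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("  rec ipe\t for   disaster ") + "]");
        // [rec ipe for disaster]
    }
}
```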

This class optionally restricts corrections to sequences of valid tokens. The valid tokens are supplied as a set either during construction time or later. If the set of valid tokens is null, then the output is not token sensitive, and results may include tokens that are not in the training data. Token-matching is case sensitive.

If a set of valid tokens is supplied, then a tokenizer factory should also be supplied to carry out tokenization normalization on input. This tokenizer factory will be used to separate input tokens with single spaces. This tokenization may also be done externally and normalized text passed into the didYouMean method; this approach makes sense if the tokenization is happening elsewhere already.

There are a number of tuning parameters for this class. The coarsest form of tuning simply sets whether or not particular edits may be performed. For instance, setAllowDelete(boolean) is used to turn deletion on or off. Although edits with negative infinity scores will never be used, it is more efficient to simply disallow them if they are all infinite. This is used in the Chinese tokenizer, for instance, to only allow insertions and matches.

There are three scoring parameters that determine how expensive input characters are to edit. The first of these is setKnownTokenEditCost(double), which provides a penalty to be added to the cost of editing characters that fall within known tokens. This value is only used for token-sensitive correctors. Setting this to a low value makes it less likely to suggest an edit on a known token. The default value is -2.0, which on a log (base 2) scale makes editing characters in known tokens 1/4 as likely as editing characters in unknown tokens.

The next two scoring parameters provide penalties for editing the first or second character in a token, whether it is known or not. In most cases, users make more mistakes later in words than in the first few characters. These values are controlled independently through values provided at construction time or by using the methods setFirstCharEditCost(double) and setSecondCharEditCost(double). The default values for these are -2.0 and -1.0 respectively.
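The arithmetic behind these defaults: a log (base 2) penalty of p multiplies an edit's probability by 2^p. A quick check of the default penalties:

```java
public class PenaltyFactor {
    // A log (base 2) penalty p makes an edit 2^p times as likely.
    static double factor(double log2Penalty) {
        return Math.pow(2.0, log2Penalty);
    }

    public static void main(String[] args) {
        System.out.println(factor(-2.0)); // 0.25: known-token and first-char defaults
        System.out.println(factor(-1.0)); // 0.5:  second-char default
    }
}
```

So with the defaults, an edit on the first character of a token is 1/4 as likely as the same edit later in the token, and an edit on the second character is 1/2 as likely.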

The final tuning parameter is controlled with setNumConsecutiveInsertionsAllowed(int), which determines how many characters may be inserted in a row. The default value is 1; setting this to 2 or higher may seriously slow down correction, especially if correction is not token sensitive.

Search is further controlled by an n-best parameter, which specifies the number of ongoing hypotheses considered after inspecting each character. This value is settable either in the constructor or, for models compiled from a trainer, by using the method setNBest(int). The lower this value, the faster the resulting spelling correction. The danger is that with low values there may be search errors, in which the correct hypothesis is pruned because it did not look promising enough early on. In general, this value should be set as low as possible without causing search errors.

This class requires external concurrent-read/synchronous-write (CRSW) synchronization. All of the methods beginning with set must be executed exclusively in order to guarantee consistent results; all other methods may be executed concurrently. The didYouMean(String) method for spelling correction may be called concurrently with the same blocking and thread safety constraints as the underlying language model and edit distance, both of which are called repeatedly by this method. If both the language model and edit distance are thread safe and non-blocking, as in all of LingPipe's implementations, then didYouMean will also be concurrently executable and non-blocking.
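One standard way to provide the external CRSW synchronization described above is a read-write lock: set-methods take the write lock (exclusive), and read-methods take the read lock (shared). The class below is a stand-in for a thread-shared checker, not LingPipe code; the field and the scoring logic are placeholders.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GuardedChecker {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private double firstCharEditCost = -2.0; // stand-in for checker state

    // A set-method: must run exclusively, so take the write lock.
    public void setFirstCharEditCost(double cost) {
        lock.writeLock().lock();
        try {
            firstCharEditCost = cost;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // A read-method: many threads may hold the read lock concurrently.
    public double score(String input) {
        lock.readLock().lock();
        try {
            return firstCharEditCost * input.length(); // placeholder computation
        } finally {
            lock.readLock().unlock();
        }
    }
}
```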

Blocking Corrections

There are two ways to block tokens from being edited. The first is by setting a minimum length of edited tokens. Standard language models trained on texts tend to overestimate the likelihood of queries that contain well-known short words or phrases like of or a. The method setMinimumTokenLengthToCorrect(int) sets a minimum token length that is corrected. The default value is 0.

The second way to block corrections is to provide a set of tokens that are never corrected. One way to construct such a set during training is by taking large-count tokens from the counter returned by TrainSpellChecker.tokenCounter().

Note that these methods are heuristics that move the spelling corrector in the same direction as two existing parameters. First, there is the pair of methods setFirstCharEditCost(double) and setSecondCharEditCost(double) which make it less likely to edit the first two characters (which are all of the characters in a two-character token). Second, there is a flexible penalty for editing known tokens that may be set with setKnownTokenEditCost(double).

Blocking corrections has a positive effect on speed, because it eliminates any search over the tokens that are excluded from correction.

N-best Output

It is possible to retrieve a list of possible spelling corrections, ordered by plausibility. The method didYouMeanNBest(String) returns an iterator over corrections in decreasing order of likelihood. Note that the same exact string may be proposed more than once as a correction because of alternative edits leading to the same result. For instance, "a" may be turned into "b" by substitution in one step, or by deletion and insertion (or insertion then deletion) in two steps. These alternatives typically have different scores and only the highest-scoring one is maintained at any given stage of the algorithm by the first-best analyzer.
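The merging behavior of the first-best analyzer described above (keep only the highest score when several edit paths yield the same string) can be sketched with a map-based maximum. This is an illustrative sketch, not the analyzer's actual data structure:

```java
import java.util.HashMap;
import java.util.Map;

public class HypothesisMerge {
    // For hypotheses that produce identical strings, keep the highest log score.
    static Map<String,Double> merge(String[] strings, double[] scores) {
        Map<String,Double> best = new HashMap<>();
        for (int i = 0; i < strings.length; ++i)
            best.merge(strings[i], scores[i], Math::max);
        return best;
    }

    public static void main(String[] args) {
        // "b" reached by one-step substitution (-3.0) and by
        // two-step delete-then-insert (-7.0): only -3.0 survives.
        Map<String,Double> m = merge(new String[] { "b", "b" },
                                     new double[] { -3.0, -7.0 });
        System.out.println(m.get("b"));
    }
}
```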

The n-best analyzer needs a much wider n-best list in order to return sensible results, especially for very long inputs. The specified n-best size for the spell checker should, in fact, be substantially larger than the desired number of n-best results.

Since:
LingPipe2.0
Version:
3.8
Author:
Bob Carpenter

Field Summary
static WeightedEditDistance CASE_RESTORING
          A weighted edit distance ordered by similarity that treats case variants as zero cost and all other edits as infinite cost.
static WeightedEditDistance TOKENIZING
          A weighted edit distance ordered by similarity that allows free space insertion.
 
Constructor Summary
CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, Set<String> tokenSet)
          Construct a compiled spell checker based on the specified language model and edit distance, with a null tokenizer factory, the specified set of valid output tokens, and default values for n-best size, known token edit cost, and first and second character edit costs.
CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, Set<String> tokenSet, int nBestSize)
          Construct a compiled spell checker based on the specified language model and edit distance, a null tokenizer factory, the set of valid output tokens, and maximum n-best size, with default known token and first and second character edit costs.
CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory factory, Set<String> tokenSet, int nBestSize)
          Construct a compiled spell checker based on the specified language model and edit distance, tokenizer factory, the set of valid output tokens, and maximum n-best size, with default known token and first and second character edit costs.
CompiledSpellChecker(CompiledNGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory factory, Set<String> tokenSet, int nBestSize, double knownTokenEditCost, double firstCharEditCost, double secondCharEditCost)
          Construct a compiled spell checker based on the specified language model and similarity edit distance, set of valid output tokens, maximum n-best size per character, and the specified edit penalties for editing known tokens or the first or second characters of tokens.
 
Method Summary
 boolean allowDelete()
          Returns true if this spell checker allows deletions.
 boolean allowInsert()
          Returns true if this spell checker allows insertions.
 boolean allowMatch()
          Returns true if this spell checker allows matches.
 boolean allowSubstitute()
          Returns true if this spell checker allows substitutions.
 boolean allowTranspose()
          Returns true if this spell checker allows transpositions.
 String didYouMean(String receivedMsg)
          Returns a first-best hypothesis of the intended message given a received message.
 Iterator<ScoredObject<String>> didYouMeanNBest(String receivedMsg)
          Returns an iterator over the n-best spelling corrections for the specified input string.
 Set<String> doNotEditTokens()
          Returns an unmodifiable view of the set of tokens that will never be edited in this compiled spell checker.
 WeightedEditDistance editDistance()
          Returns the weighted edit distance for this compiled spell checker.
 double firstCharEditCost()
          Returns the cost penalty for editing the first character in a token.
 double knownTokenEditCost()
          Returns the cost penalty for editing a character in a known token.
 CompiledNGramProcessLM languageModel()
          Returns the compiled language model for this spell checker.
 int minimumTokenLengthToCorrect()
          Returns the minimum length of token that will be corrected.
 int nBestSize()
          Returns the n-best size for this spell checker.
 int numConsecutiveInsertionsAllowed()
          Returns the number of consecutive insertions allowed.
 String parametersToString()
          Returns a string-based representation of the parameters of this compiled spell checker.
 double secondCharEditCost()
          Returns the cost penalty for editing the second character in a token.
 void setAllowDelete(boolean allowDelete)
          Sets this spell checker to allow deletions if the specified value is true and to disallow them if it is false.
 void setAllowInsert(boolean allowInsert)
          Sets this spell checker to allow insertions if the specified value is true and to disallow them if it is false.
 void setAllowMatch(boolean allowMatch)
          Sets this spell checker to allow matches if the specified value is true and to disallow them if it is false.
 void setAllowSubstitute(boolean allowSubstitute)
          Sets this spell checker to allow substitutions if the specified value is true and to disallow them if it is false.
 void setAllowTranspose(boolean allowTranspose)
          Sets this spell checker to allow transpositions if the specified value is true and to disallow them if it is false.
 void setDoNotEditTokens(Set<String> tokens)
          Updates the set of do-not-edit tokens to be the specified value.
 void setEditDistance(WeightedEditDistance editDistance)
          Sets the edit distance for this spell checker to the specified value.
 void setFirstCharEditCost(double cost)
          Set the first character edit cost to the specified value.
 void setKnownTokenEditCost(double cost)
          Set the known token edit cost to the specified value.
 void setLanguageModel(CompiledNGramProcessLM lm)
          Sets the language model for this spell checker to the specified value.
 void setMinimumTokenLengthToCorrect(int tokenCharLength)
          Sets a minimum character length for tokens to be eligible for editing.
 void setNBest(int size)
          Sets the n-best size to the specified value.
 void setNumConsecutiveInsertionsAllowed(int numAllowed)
          Set the number of consecutive insertions allowed to the specified value.
 void setSecondCharEditCost(double cost)
          Set the second character edit cost to the specified value.
 void setTokenizerFactory(TokenizerFactory factory)
          Sets the tokenizer factory for input processing to the specified value.
 void setTokenSet(Set<String> tokenSet)
          Sets the set of tokens that can be produced by editing.
 TokenizerFactory tokenizerFactory()
          Returns the tokenizer factory for this spell checker.
 Set<String> tokenSet()
          Returns an unmodifiable view of the set of tokens for this spell checker.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CASE_RESTORING

public static WeightedEditDistance CASE_RESTORING
A weighted edit distance ordered by similarity that treats case variants as zero cost and all other edits as infinite cost. The infinite cost is Double.NEGATIVE_INFINITY. See WeightedEditDistance for more information on similarity-based distances.

If this model is used for spelling correction, the result is a system that simply chooses the most likely case for output characters given an input character and does not change anything else.

Case here is determined by the methods Character.isUpperCase(char) and Character.isLowerCase(char), and case-insensitive equality is tested by converting characters to lower case using Character.toLowerCase(char).

This edit distance is compilable and the result of writing it and reading it is referentially equal to this instance.


TOKENIZING

public static WeightedEditDistance TOKENIZING
A weighted edit distance ordered by similarity that allows free space insertion. The cost of inserting a space is zero, the cost of matching is zero, and all other costs are infinite. See WeightedEditDistance for more information on similarity-based distances.

If this model is used for spelling correction, the result is a system that will retokenize input with no spaces. For instance, if the source model is trained with Chinese tokens separated by spaces and the input is a sequence of Chinese characters not separated by spaces, the output is a space-separated tokenization. If the source model is valid pronunciations separated by spaces and the input is pronunciations not separated by spaces, the result is a tokenization.

This edit distance is compilable and the result of writing it and reading it is referentially equal to this instance.

Constructor Detail

CompiledSpellChecker

public CompiledSpellChecker(CompiledNGramProcessLM lm,
                            WeightedEditDistance editDistance,
                            TokenizerFactory factory,
                            Set<String> tokenSet,
                            int nBestSize,
                            double knownTokenEditCost,
                            double firstCharEditCost,
                            double secondCharEditCost)
Construct a compiled spell checker based on the specified language model and similarity edit distance, set of valid output tokens, maximum n-best size per character, and the specified edit penalties for editing known tokens or the first or second characters of tokens. The set of do-not-edit tokens is initially empty; set it using setDoNotEditTokens(Set).

The weighted edit distance is required to be a similarity measure for compatibility with the order of log likelihoods in the source (language) model. See WeightedEditDistance for more information about similarity versus dissimilarity distance measures.

If the set of tokens is null, the constructed spelling checker will not be token-sensitive. That is, it will allow edits to strings which are not tokens in the token set.

Parameters:
lm - Source language model.
editDistance - Channel edit distance model.
factory - Tokenizer factory for tokenizing inputs.
tokenSet - Set of valid tokens for outputs or null if output is not token sensitive.
nBestSize - Size of the n-best list for spell checking; hypotheses that fall outside the list at each character are pruned.
knownTokenEditCost - Penalty for editing known tokens per edit.
firstCharEditCost - Penalty for editing while scanning the first character in a token.
secondCharEditCost - Penalty for editing while scanning the second character in a token.

CompiledSpellChecker

public CompiledSpellChecker(CompiledNGramProcessLM lm,
                            WeightedEditDistance editDistance,
                            TokenizerFactory factory,
                            Set<String> tokenSet,
                            int nBestSize)
Construct a compiled spell checker based on the specified language model and edit distance, tokenizer factory, the set of valid output tokens, and maximum n-best size, with default known token and first and second character edit costs. The set of do-not-edit tokens is initially empty; set it using setDoNotEditTokens(Set).

Parameters:
lm - Source language model.
editDistance - Channel edit distance model.
factory - Tokenizer factory for tokenizing inputs.
tokenSet - Set of valid tokens for outputs or null if output is not token sensitive.
nBestSize - Size of the n-best list for spell checking; hypotheses that fall outside the list at each character are pruned.
Throws:
IllegalArgumentException - If the edit distance is not a similarity measure.

CompiledSpellChecker

public CompiledSpellChecker(CompiledNGramProcessLM lm,
                            WeightedEditDistance editDistance,
                            Set<String> tokenSet,
                            int nBestSize)
Construct a compiled spell checker based on the specified language model and edit distance, a null tokenizer factory, the set of valid output tokens, and maximum n-best size, with default known token and first and second character edit costs. The set of do-not-edit tokens is initially empty; set it using setDoNotEditTokens(Set).

Parameters:
lm - Source language model.
editDistance - Channel edit distance model.
tokenSet - Set of valid tokens for outputs or null if output is not token sensitive.
nBestSize - Size of the n-best list for spell checking; hypotheses that fall outside the list at each character are pruned.

CompiledSpellChecker

public CompiledSpellChecker(CompiledNGramProcessLM lm,
                            WeightedEditDistance editDistance,
                            Set<String> tokenSet)
Construct a compiled spell checker based on the specified language model and edit distance, with a null tokenizer factory, the specified set of valid output tokens, and default values for n-best size, known token edit cost, and first and second character edit costs. The set of do-not-edit tokens is initially empty; set it using setDoNotEditTokens(Set).

Parameters:
lm - Source language model.
editDistance - Channel edit distance model.
tokenSet - Set of valid tokens for outputs or null if output is not token sensitive.
Method Detail

languageModel

public CompiledNGramProcessLM languageModel()
Returns the compiled language model for this spell checker. Compiled language models are themselves immutable, and the language model for a spell checker may not be changed, but the result returned by this method may be used to construct a new compiled spell checker.

Returns:
The language model for this spell checker.

editDistance

public WeightedEditDistance editDistance()
Returns the weighted edit distance for this compiled spell checker.

Returns:
The edit distance for this spell checker.

tokenizerFactory

public TokenizerFactory tokenizerFactory()
Returns the tokenizer factory for this spell checker.

Returns:
The tokenizer factory for this spell checker.

tokenSet

public Set<String> tokenSet()
Returns an unmodifiable view of the set of tokens for this spell checker. In order to change the token set, construct a new set and use setTokenSet(Set).

Returns:
The set of tokens for this spell checker.

doNotEditTokens

public Set<String> doNotEditTokens()
Returns an unmodifiable view of the set of tokens that will never be edited in this compiled spell checker. To change the value of this set, use setDoNotEditTokens(Set).

Returns:
The set of tokens that will not be edited.

setDoNotEditTokens

public void setDoNotEditTokens(Set<String> tokens)
Updates the set of do-not-edit tokens to be the specified value. If one of these tokens shows up in the input, it will also show up in any correction supplied.

Parameters:
tokens - Set of tokens not to edit.

nBestSize

public int nBestSize()
Returns the n-best size for this spell checker. See the class documentation above and the documentation for the method setNBest(int) for more information.

Returns:
The n-best size for this spell checker.

knownTokenEditCost

public double knownTokenEditCost()
Returns the cost penalty for editing a character in a known token. This penalty is added to each edit within a known token.

Returns:
Known token edit penalty.

firstCharEditCost

public double firstCharEditCost()
Returns the cost penalty for editing the first character in a token. This penalty is added to each edit while scanning the first character of a token in the input.

As a special case, transposition only pays a single penalty based on the penalty of the first character in the transposition.

Returns:
First character edit penalty.

secondCharEditCost

public double secondCharEditCost()
Returns the cost penalty for editing the second character in a token. This penalty is added for each edit while scanning the second character in an input.

Returns:
Second character edit penalty.

setKnownTokenEditCost

public void setKnownTokenEditCost(double cost)
Set the known token edit cost to the specified value.

Parameters:
cost - New value for known token edit cost.

setFirstCharEditCost

public void setFirstCharEditCost(double cost)
Set the first character edit cost to the specified value.

Parameters:
cost - New value for the first character edit cost.

setSecondCharEditCost

public void setSecondCharEditCost(double cost)
Set the second character edit cost to the specified value.

Parameters:
cost - New value for the second character edit cost.

numConsecutiveInsertionsAllowed

public int numConsecutiveInsertionsAllowed()
Returns the number of consecutive insertions allowed. This will be zero if insertions are not allowed.

Returns:
The number of consecutive insertions allowed.


allowInsert

public boolean allowInsert()
Returns true if this spell checker allows insertions.

Returns:
true if this spell checker allows insertions.

allowDelete

public boolean allowDelete()
Returns true if this spell checker allows deletions.

Returns:
true if this spell checker allows deletions.

allowMatch

public boolean allowMatch()
Returns true if this spell checker allows matches.

Returns:
true if this spell checker allows matches.

allowSubstitute

public boolean allowSubstitute()
Returns true if this spell checker allows substitutions.

Returns:
true if this spell checker allows substitutions.

allowTranspose

public boolean allowTranspose()
Returns true if this spell checker allows transpositions.

Returns:
true if this spell checker allows transpositions.

setEditDistance

public void setEditDistance(WeightedEditDistance editDistance)
Sets the edit distance for this spell checker to the specified value.

Parameters:
editDistance - Edit distance to use for spell checking.

setMinimumTokenLengthToCorrect

public void setMinimumTokenLengthToCorrect(int tokenCharLength)
Sets a minimum character length for tokens to be eligible for editing.

Parameters:
tokenCharLength - Minimum length in characters for a token to be eligible for correction.
Throws:
IllegalArgumentException - If the character length specified is less than 0.

minimumTokenLengthToCorrect

public int minimumTokenLengthToCorrect()
Returns the minimum length of token that will be corrected. This value is initially 0, but may be set using setMinimumTokenLengthToCorrect(int).

Returns:
The minimum token length to correct.

setLanguageModel

public void setLanguageModel(CompiledNGramProcessLM lm)
Sets the language model for this spell checker to the specified value.

Parameters:
lm - New language model for this spell checker.

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory factory)
Sets the tokenizer factory for input processing to the specified value. If the value is null, no tokenization is performed on the input.

Parameters:
factory - Tokenizer factory for this spell checker.

setTokenSet

public final void setTokenSet(Set<String> tokenSet)
Sets the set of tokens that can be produced by editing. If the specified set is null, editing will not be token sensitive; passing null turns token sensitivity off rather than changing the token set.

Warning: Spelling correction without tokenization may be slow, especially with a large n-best size.

Parameters:
tokenSet - The new set of tokens or null if not tokenizing.

setNBest

public void setNBest(int size)
Sets the n-best size to the specified value. The n-best size controls the number of hypotheses maintained going forward for each character in the input. A higher value indicates a broader and slower search for corrections.

Parameters:
size - Size of the n-best lists at each character.
Throws:
IllegalArgumentException - If the size is less than one.

setAllowInsert

public void setAllowInsert(boolean allowInsert)
Sets this spell checker to allow insertions if the specified value is true and to disallow them if it is false. If the value is false, then the number of consecutive insertions allowed is also set to zero.

Parameters:
allowInsert - New insertion mode.

setAllowDelete

public void setAllowDelete(boolean allowDelete)
Sets this spell checker to allow deletions if the specified value is true and to disallow them if it is false.

Parameters:
allowDelete - New deletion mode.

setAllowMatch

public void setAllowMatch(boolean allowMatch)
Sets this spell checker to allow matches if the specified value is true and to disallow them if it is false.

Parameters:
allowMatch - New match mode.

setAllowSubstitute

public void setAllowSubstitute(boolean allowSubstitute)
Sets this spell checker to allow substitutions if the specified value is true and to disallow them if it is false.

Parameters:
allowSubstitute - New substitution mode.

setAllowTranspose

public void setAllowTranspose(boolean allowTranspose)
Sets this spell checker to allow transpositions if the specified value is true and to disallow them if it is false.

Parameters:
allowTranspose - New transposition mode.

setNumConsecutiveInsertionsAllowed

public void setNumConsecutiveInsertionsAllowed(int numAllowed)
Set the number of consecutive insertions allowed to the specified value. The value must not be negative. If the number of insertions allowed is greater than zero, the insertion mode is also set to true.

Parameters:
numAllowed - Number of insertions allowed in a row.
Throws:
IllegalArgumentException - If the number specified is less than zero.

didYouMean

public String didYouMean(String receivedMsg)
Returns a first-best hypothesis of the intended message given a received message. This method returns null if the received message is itself the best hypothesis. The exact definition of hypothesis ranking is provided in the class documentation above.

Specified by:
didYouMean in interface SpellChecker
Parameters:
receivedMsg - The message received over the noisy channel.
Returns:
The first-best hypothesis of the intended source message.

didYouMeanNBest

public Iterator<ScoredObject<String>> didYouMeanNBest(String receivedMsg)
Returns an iterator over the n-best spelling corrections for the specified input string. The iterator produces instances of ScoredObject, the object of which is the corrected string and the score of which is the joint score of edit (channel) costs and language model (source) cost of the output.

Unlike for HMMs and chunking, this n-best list is not exact, because of heuristic pruning during spelling correction. The maximum number of returned results is determined by the n-best parameter, as set through setNBest(int). The larger the n-best size, the higher the quality of the results, even for results early in the list. For instance, the first five corrections are not necessarily the same with a 5-element, 10-element, or 1000-element n-best size.

A rough confidence measure may be determined by comparing the scores, which are log (base 2) edit (channel) plus log (base 2) language model (source) scores. A very crude measure is to compare the score of the first result to the score of the second result; if there is a large gap, confidence is high. A tighter measure is to convert the log probabilities back to linear, add them all up, and then divide. For instance, if there were results:

Rank  String  Log (2) Prob  Prob   Conf
0     foo     -2            0.250  0.571
1     for     -3            0.125  0.285
2     food    -4            0.062  0.143
3     of      -10           0.001  0.002
Here there are four results, with log probabilities -2, -3, -4 and -10, which have the corresponding linear probabilities. The sum of these probabilities is 0.438. Hence the confidence in the top-ranked answer is 0.250/0.438=0.571.
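The confidence computation worked through above is a straightforward normalization: convert each log (base 2) score back to a linear probability, sum, and divide. A minimal sketch:

```java
public class NBestConfidence {
    // Convert log (base 2) scores to normalized linear probabilities.
    static double[] confidences(double[] log2Scores) {
        double sum = 0.0;
        double[] linear = new double[log2Scores.length];
        for (int i = 0; i < log2Scores.length; ++i) {
            linear[i] = Math.pow(2.0, log2Scores[i]);
            sum += linear[i];
        }
        for (int i = 0; i < linear.length; ++i)
            linear[i] /= sum;
        return linear;
    }

    public static void main(String[] args) {
        double[] conf = confidences(new double[] { -2, -3, -4, -10 });
        // Top hypothesis: ~0.570 (the 0.571 above comes from rounding
        // the sum to 0.438 before dividing).
        System.out.println(conf[0]);
    }
}
```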

Warning: Spell checking with n-best output is currently implemented with a very naive algorithm and is thus very slow compared to first-best spelling correction. The reason is that the dynamic programming is turned off for n-best spelling correction, so a lot of redundant computation is done.

Parameters:
receivedMsg - Input message.
Returns:
Iterator over n-best spelling suggestions.

parametersToString

public String parametersToString()
Returns a string-based representation of the parameters of this compiled spell checker.

Returns:
A string representing the parameters of this spell checker.