com.aliasi.spell
Class TrainSpellChecker

java.lang.Object
  extended by com.aliasi.spell.TrainSpellChecker
All Implemented Interfaces:
Handler, ObjectHandler<CharSequence>, TextHandler, Compilable, Serializable

public class TrainSpellChecker
extends Object
implements TextHandler, ObjectHandler<CharSequence>, Compilable, Serializable

A TrainSpellChecker instance provides a mechanism for collecting training data for a compiled spell checker. Training instances are nothing more than character sequences which represent likely user queries.

Data Normalization

When training the source language model, all training data is whitespace normalized: a single space is inserted at the start and at the end, and every internal whitespace sequence is converted to a single space character.
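
The normalization described above can be approximated with the standard library alone. The sketch below is illustrative only: WhitespaceNormalizationSketch and normalize are hypothetical names, not LingPipe classes, and the result for empty or all-whitespace input (a single space here) is an assumption that this documentation does not settle.

```java
final class WhitespaceNormalizationSketch {

    // Hypothetical helper, not part of LingPipe: adds an initial and final
    // space and collapses each whitespace run to a single space character.
    static String normalize(CharSequence text) {
        StringBuilder sb = new StringBuilder();
        sb.append(' ');                       // initial whitespace
        boolean inWord = false;
        for (int i = 0; i < text.length(); ++i) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                if (inWord) {                 // first whitespace after a word
                    sb.append(' ');
                    inWord = false;
                }                             // further whitespace is dropped
            } else {
                sb.append(c);
                inWord = true;
            }
        }
        if (inWord) {
            sb.append(' ');                   // final whitespace
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make the boundary spaces visible.
        System.out.println("[" + normalize("  new   york\tcity ") + "]");
    }
}
```

A language model trained directly, outside the trainer, would need input prepared in an equivalent way so that both see the same boundary and spacing conventions.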

Token Sensitivity

A tokenization factory may be optionally specified for training token-sensitive spell checkers. With tokenization, input is further normalized to insert a single whitespace between all tokens not already separated by a space in the input. The tokens are then output during compilation and read back into the compiled spell checker. The set of tokens output may be pruned to remove any below a given count threshold. The resulting set of tokens is used to constrain the set of alternative spellings suggested during spelling correction to include only tokens in the observed token set.
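
The counting and pruning of tokens can be pictured with a plain map. This is a minimal sketch under stated assumptions: TokenPruningSketch, countTokens, and prune are hypothetical names, whitespace splitting stands in for a real tokenizer factory, and the code does not call LingPipe.

```java
import java.util.HashMap;
import java.util.Map;

final class TokenPruningSketch {

    // Count how often each token occurs in the training queries.
    // Assumption: whitespace splitting replaces a real tokenizer factory.
    static Map<String, Integer> countTokens(String... queries) {
        Map<String, Integer> counts = new HashMap<>();
        for (String query : queries) {
            for (String token : query.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Remove every token whose count is below minCount.
    static void prune(Map<String, Integer> counts, int minCount) {
        counts.values().removeIf(count -> count < minCount);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countTokens("new york", "new jersey", "nwe york");
        prune(counts, 2);
        // Only tokens that survive pruning would remain available
        // as suggested corrections in a token-sensitive checker.
        System.out.println(counts.keySet());
    }
}
```

Pruning with a threshold of 2 drops the tokens seen only once, including the misspelling "nwe", which is the kind of noise a count threshold is meant to filter from the observed token set.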

Direct Training

As an alternative to using the spell checker trainer, a language model may be trained directly and supplied in compiled form along with a weighted edit distance to the public constructors for compiled spell checkers. It's critical that the normalization happens the same way as for the spell checker trainer.

Weighted Edit Distance

In constructing a spell checker trainer, a compilable weighted edit distance must be specified. This edit distance model will be compiled along with the language model and token set and used as the channel model in the compiled spell checker.

Compilation

After training, a model is written out through the Compilable interface using compileTo(ObjectOutput). When this model is read back in, it will be an instance of CompiledSpellChecker. The compiled spell checkers allow many runtime parameters to be tuned; see the class documentation for full details.

Serialization

A spell checker trainer may be serialized in the usual way:
 TrainSpellChecker trainer = ...;
 ObjectOutput out = ...;
 out.writeObject(trainer);
And then read back in by reversing this operation:
 ObjectInput in = ...;
 TrainSpellChecker trainer
   = (TrainSpellChecker) in.readObject();

The resulting round trip produces a trainer that is functionally identical to the original one. Serialization is useful for storing models for which more training data will be available later.

Warning: The object input and output used for serialization must extend InputStream and OutputStream. The only implementations of ObjectInput and ObjectOutput as of the 1.6 JDK do extend the streams, so this will only be a problem with customized object input or output objects. If you need this method to work with custom input and output objects that do not extend the corresponding streams, drop us a line and we can perhaps refactor the output methods to remove this restriction. [Note: This warning was inherited from NGramProcessLM.]
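
For reference, a round trip through the standard ObjectOutputStream and ObjectInputStream classes might look like the sketch below. Both classes extend the corresponding streams, so they satisfy the restriction in the warning. The class SerializationRoundTripSketch and its roundTrip method are hypothetical names for illustration; the demo serializes an ordinary list, and a TrainSpellChecker, being Serializable, could presumably be passed in the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;

final class SerializationRoundTripSketch {

    // Writes the object to an in-memory buffer and reads it back.
    // Checked exceptions are wrapped so callers need no throws clause.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in payload; a serializable trainer would go here instead.
        ArrayList<String> data = new ArrayList<>(Arrays.asList("new", "york"));
        ArrayList<String> copy = roundTrip(data);
        System.out.println(copy.equals(data));
    }
}
```

In practice the byte array streams would be replaced by file streams when the trainer is stored for later training.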

Since:
LingPipe2.0
Version:
3.9.1
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
TrainSpellChecker(NGramProcessLM lm, WeightedEditDistance editDistance)
          Construct a non-tokenizing spell checker trainer from the specified language model and edit distance.
TrainSpellChecker(NGramProcessLM lm, WeightedEditDistance editDistance, TokenizerFactory tokenizerFactory)
          Construct a spell checker trainer from the specified n-gram process language model, tokenizer factory and edit distance.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Writes a compiled spell checker to the specified object output.
 WeightedEditDistance editDistance()
          Returns the weighted edit distance (channel model) underlying this spell checker trainer.
 void handle(char[] cs, int start, int length)
          Deprecated. Use handle(CharSequence) instead.
 void handle(CharSequence cSeq)
          Train the spell checker on the specified character sequence.
 NGramProcessLM languageModel()
          Returns the n-gram process language model (source model) underlying this spell checker trainer.
 long numTrainingChars()
          Returns the total length in characters of all text used to train the spell checker.
 void pruneLM(int minCount)
          Prunes the underlying character language model to remove substring counts of less than the specified minimum.
 void pruneTokens(int minCount)
          Prunes the set of collected tokens of all tokens with count less than the specified minimum.
 ObjectToCounterMap<String> tokenCounter()
          Returns the counter for the tokens in the training set.
 void train(CharSequence cSeq)
          Deprecated. Use handle(CharSequence) instead.
 void train(CharSequence cSeq, int count)
          Train the spelling checker on the specified character sequence as if it had appeared with a frequency given by the specified count.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TrainSpellChecker

public TrainSpellChecker(NGramProcessLM lm,
                         WeightedEditDistance editDistance)
Construct a non-tokenizing spell checker trainer from the specified language model and edit distance. See SpellChecker for more information on the language model and edit distance models in the compiled spell checker.

Parameters:
lm - Compilable language model.
editDistance - Compilable weighted edit distance.
Throws:
IllegalArgumentException - If the edit distance is not compilable.

TrainSpellChecker

public TrainSpellChecker(NGramProcessLM lm,
                         WeightedEditDistance editDistance,
                         TokenizerFactory tokenizerFactory)
Construct a spell checker trainer from the specified n-gram process language model, tokenizer factory and edit distance. The language model must be an instance of the character-level n-gram process language model class. The edit distance must be compilable. The tokenizer factory may be null, in which case tokens are not saved as part of training and the compiled spell checker is not token sensitive. If the tokenizer factory is specified, it must be compilable.

Parameters:
lm - Compilable language model.
editDistance - Compilable weighted edit distance.
tokenizerFactory - Optional tokenizer factory.
Throws:
IllegalArgumentException - If the edit distance is not compilable or if the tokenizer factory is non-null and not compilable.
Method Detail

languageModel

public NGramProcessLM languageModel()
Returns the n-gram process language model (source model) underlying this spell checker trainer.

The returned value is a reference to the language model held by the trainer, so any changes to it will affect this spell checker.

Returns:
The n-gram process LM for this trainer.

editDistance

public WeightedEditDistance editDistance()
Returns the weighted edit distance (channel model) underlying this spell checker trainer.

The returned value is a reference to the edit distance held by the trainer, so any changes to it will affect this spell checker.

Returns:
The edit distance for this trainer.

tokenCounter

public ObjectToCounterMap<String> tokenCounter()
Returns the counter for the tokens in the training set. This may be used to print out the tokens with their counts for later perusal. The value returned is the actual counter, so any changes made to it will be reflected in this spell checker. Pruning the token counts may have eliminated tokens in the training data from the counter.

Returns:
The counter for the tokens in the training set.

train

@Deprecated
public void train(CharSequence cSeq)
Deprecated. Use handle(CharSequence) instead.

Train the spelling checker on the specified character sequence. The sequence is normalized by converting all whitespace sequences to a single space character and inserting an initial and final whitespace. If a tokenization factory is specified, a single space character is inserted between any tokens not already separated by whitespace.

Parameters:
cSeq - Character sequence for training.

train

public void train(CharSequence cSeq,
                  int count)
Train the spelling checker on the specified character sequence as if it had appeared with a frequency given by the specified count.

See the method train(CharSequence) for information on the normalization carried out on the input character sequence.

Although calling this method is equivalent to calling train(CharSequence) the specified count number of times, this method is much more efficient because it does not require iteration.

This method may be used to boost the training for a specified input, or just to combine inputs into single method calls.
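
The equivalence between a counted call and repeated single calls can be illustrated with a simple counter. This is only a sketch: CountedTrainingSketch is a hypothetical stand-in that records counts in a map, not the LingPipe trainer, and treating a count of zero as a no-op is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class CountedTrainingSketch {

    private final Map<String, Integer> counts = new HashMap<>();

    // Single-instance training delegates to the counted form.
    void train(String text) {
        train(text, 1);
    }

    // Adds the whole count in one update rather than looping count times.
    void train(String text, int count) {
        if (count < 0) {
            throw new IllegalArgumentException(
                "Count must be non-negative, found count=" + count);
        }
        if (count == 0) {
            return;  // assumption: zero-count training changes nothing
        }
        counts.merge(text, count, Integer::sum);
    }

    int count(String text) {
        return counts.getOrDefault(text, 0);
    }

    public static void main(String[] args) {
        CountedTrainingSketch counted = new CountedTrainingSketch();
        counted.train("new york", 3);

        CountedTrainingSketch repeated = new CountedTrainingSketch();
        for (int i = 0; i < 3; ++i) {
            repeated.train("new york");
        }
        System.out.println(counted.count("new york") == repeated.count("new york"));
    }
}
```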

Parameters:
cSeq - Character sequence for training.
count - Frequency of sequence to train.
Throws:
IllegalArgumentException - If the specified count is negative.

numTrainingChars

public long numTrainingChars()
Returns the total length in characters of all text used to train the spell checker.

Returns:
The number of training characters seen.

handle

@Deprecated
public void handle(char[] cs,
                   int start,
                   int length)
Deprecated. Use handle(CharSequence) instead.

Train the spelling checker on the specified character slice. This method implements the handle method required by the TextHandler interface. It otherwise behaves exactly the same way as train(CharSequence).

Specified by:
handle in interface TextHandler
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
length - Number of characters in the slice.

handle

public void handle(CharSequence cSeq)
Train the spell checker on the specified character sequence. The sequence is normalized by converting all whitespace sequences to a single space character and inserting an initial and final whitespace. If a tokenization factory is specified, a single space character is inserted between any tokens not already separated by whitespace.

Specified by:
handle in interface ObjectHandler<CharSequence>
Parameters:
cSeq - Characters for training.

pruneTokens

public void pruneTokens(int minCount)
Prunes the set of collected tokens of all tokens with count less than the specified minimum. If there was no tokenization factory specified for this spell checker, this method will have no effect.

Parameters:
minCount - Minimum count of preserved token.

pruneLM

public void pruneLM(int minCount)
Prunes the underlying character language model to remove substring counts of less than the specified minimum.

Parameters:
minCount - Minimum count of preserved substrings.

compileTo

public void compileTo(ObjectOutput objOut)
               throws IOException
Writes a compiled spell checker to the specified object output. The class of the spell checker read back in is CompiledSpellChecker.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this spell checker is written.
Throws:
IOException - If there is an I/O error while writing.