What is Spelling Correction?

Spelling correction takes a user input text and provides a "corrected" form. Typically, this is a search engine request, and we address that case first in this tutorial.

LingPipe's spelling corrector is also general enough to carry out correction tasks such as restoring case to single-case text, restoring spaces to text without spaces, removing automatically inserted hyphenation, replacing ligatures mangled by PDF conversion, etc. We turn to these more general cases in the second half of this tutorial.

Automatic Phrase Completion

Auto-completion is an alternative kind of spell checking that tries to find the best completions for a user-entered prefix, with spelling correction applied while matching the prefix against candidate completions. See the section Auto-Completion for a demo.

How's it Done?

LingPipe's spelling correction is based on a noisy-channel model, which models user mistakes (typos and brainos) and expected user input (based on the data). Mistakes are modeled by weighted edit distances and expected input by a character language model.
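
In equation form, the noisy-channel decomposition (the standard formulation; the notation here is not LingPipe-specific) picks the intended text t that maximizes the product of a source model and a channel model given the observed input o:

\[
\hat{t} \;=\; \arg\max_{t} P(t \mid o) \;=\; \arg\max_{t} \, P(o \mid t)\, P(t)
\]

Here P(t) is estimated by the character language model trained on the expected input, and P(o | t) by the weighted edit distance modeling the mistakes.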

Who Thought of Doing it this Way?

The noisy-channel model was invented by Claude Shannon of Bell Laboratories in the 1940s. The original motivation was transmitting signals over noisy telephone lines. The original reference is cited below. Spelling correction is a widely used application of the noisy channel model.

Popular Web Implementations

Google's "Did You Mean"

Google provides spelling correction for user queries. For example, consider the following query (you may click to try it on Google):

This query has "about" 503 results. Google displays results in the usual way, preceded by the following question:

Did you mean: breck baldwin

The corrected query has "about" 385,000 hits.

Their interface subtly highlights the token that changed by underlining; in this case, the token "brec" is corrected to "breck". Because the correction is styled as a link (blue and underlined), the user may click on it directly to run the corrected search.

Yahoo's "Did You Mean"

Yahoo uses exactly the same mechanism. For the search:

there are "about" 425 hits. Yahoo returns a slightly different correction:

Did you mean: bruce baldwin

The corrected search has "about" 4,080,000 hits. LingPipe may actually be tuned to simulate either of these first-best suggestions.

It's interesting to note that Yahoo doesn't provide the additional GUI focus of double-underlining on the corrected element. It's also interesting to note that they serve ads based on the corrected query.

MSN's "Were you looking for"

Now this one looks different:

MSN asks us if we were looking for a location:

* Were you looking for   'brec' near Baldwin, NY

Clicking on that link takes you to an MSN local page. Many other queries produced the same spelling correction as Google, Yahoo and Amazon.

Amazon's A9's "Did you mean"

The same query submitted to Amazon's search engine:

produces the same output as Google's, only with the same single underlining as Yahoo, lacking the additional underlining of the corrected term.

Amazon's "Did you mean"

Searches on Amazon's product pages are handled differently. Their current interface for a query such as:

indicates there are no matches for the query and asks if we meant "300gb drive", while also somewhat confusingly showing results matching the corrected query.

With that same query, Google wants to insert a space, asking if we meant " 301 gb drive". MSN, like Amazon, also suggests "300gb drive". Yahoo just returns results, apparently treating "301gb" as a single term; more interestingly:

suggests a dozen alternatives, none of which appear helpful for finding that 300gb drive. Even a closely related query like 401gb drive yields very different corrections from all of the major search engines. So do other configurations, because all of these terms are distributed differently in the data.

How does it Work?

The Basic Model

The basic technology works as follows: The documents that the search engine provides access to are added both to the search index and to a language model. The language model stores the phrases it has seen and maintains statistics about them. When a query is submitted, the src/QuerySpellCheck.java class looks for possible character edits, namely substitutions, insertions, deletions, and transpositions, that make the query a better fit to the language model. So if you type 'Gretski' as a query, and the underlying data is from rec.sport.hockey, the language model will be much more familiar with the mildly edited 'Gretzky' and will suggest it as an alternative.

Domain Sensitivity

The big advantage of this approach over dictionary-based spell checking is that the corrections are motivated by data in the search index. So "trt" will be corrected to "tort" in a legal domain, "tart" in a cooking domain, and "TRt" in a bio-informatics domain. On Google, there is no suggested correction, presumably because of web entities such as the domain trt.com and the broadcasters Thessaly Radio Television and Turkiye Radyo Televizyon, both also known as TRT.

Context-Sensitive Correction

Both Yahoo and Google perform context-sensitive correction. For instance, the query frod (an Old English term from the German meaning wise or experienced) has a suggested correction of ford (the automotive company, among others), whereas the query frod baggins has the corrected query frodo baggins (a 20th century English fictional character). That's the Yahoo behavior. Google doesn't correct frod baggins, even though there are about 785 hits for it versus 820,000 for Frodo Baggins. On the other hand, Google does correct frdo and frdo baggins. Amazon behaves similarly, but MSN corrects frd baggins to ford baggins rather than frodo baggins.

LingPipe's model supports exactly this kind of context-sensitive correction.

Running an example

If you have not already, download and install LingPipe. That done, change directory to this one:

> cd LING_PIPE/demos/tutorial/querySpellCheck

Then you may use the ant task:

ant querySpellCheck

though it is preferable to use a command line in this case because the program is interactive. For a command-line invocation, use the following invocation (on a single line and substituting ':' for ';' if you are on Unix/Linux/Mac):

java
-cp "../../../lingpipe-4.1.0.jar;
     ../../lib/lucene-core-2.3.0.jar;
     querySpellCheck.jar"
QuerySpellCheck

A session looks like the following, with user input shown after the > prompt:

PHASE I: TRAINING
     CONFIGURATION:
     Model File: SpellCheck.model
     N-gram Length: 5
     File=..\..\data\rec.sport.hockey\train/52550
     File=..\..\data\rec.sport.hockey\train/52551
...
     File=..\..\data\rec.sport.hockey\train/55022
Writing model to file=SpellCheck.model
Writing lucene index to =lucene
Reading model from file=SpellCheck.model

PHASE II: CORRECTION
     Constructing Spell Checker from File

Enter query <RETURN>.
     Use just <RETURN> to exit.

>Wayn Gretsky
Found 0 document(s) that matched query 'Wayn Gretsky':
Found 43 document(s) matching best alt='Wayne Gretzky':

>Wayne Gretzky
Found 43 document(s) that matched query 'Wayne Gretzky':
 No spelling correction found.

>Stanley Cub Plaoofs
Found 90 document(s) that matched query 'Stanley Cub Plaoofs':
Found 208 document(s) matching best alt='Stanley Cup Playoffs':

>StanleyCup
Found 0 document(s) that matched query 'StanleyCup':
Found 143 document(s) matching best alt='Stanley Cup':

>Stan ley Cup
Found 132 document(s) that matched query 'Stan ley Cup':
Found 143 document(s) matching best alt='Stanley Cup':

>
     Detected empty line.
     Ending test.

Pretty obviously, the query spell checker is increasing the number of returned results by fitting the queries better to the indexed data. This won't necessarily always do the right thing, but it allows for some fuzziness in getting the spelling right, which addresses a long-standing problem with search engines.

One difference between the spell checker in LingPipe and many other common spell checkers is that it is able to split tokens, as in the query "StanleyCup" as well as combine tokens, as seen for the query "Stan ley Cup".

Note that this is a pretty trivial amount of data and we're only training with character 5-grams. To compensate, we have set the edit costs very low, but the result is overly eager correction. Typical spell-checking applications rely on orders-of-magnitude differences in occurrences, with edit distances set around the -8 to -10 level, requiring the correction to be 250 to 1000 times more likely in the training data than the user's original for each edit made. With Stanley Cup only occurring 143 times, we have to fudge the typical settings a bit for demo purposes.
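
To see where the "250 to 1000 times" figure comes from, recall that scores are log (base 2) probabilities, so edit costs of -8 to -10 require the corrected form to be more likely by a factor of

\[
2^{8} = 256 \quad\text{to}\quad 2^{10} = 1024 .
\]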

Also note that we have not been at all careful in catching the many types of errors thrown by Lucene's rather fastidious query parser. So if you enter a query like "((]-<", you will receive a query parser exception for your troubles.

Getting Under the Hood

The QuerySpellCheck.java class brings together quite a few components as well as the Lucene search engine.

The basic flow is: set up the spell checker and the Lucene index, train the language model and index the documents, write both to disk, read them back in, and run an interactive query-correction loop. Each step is described in the subsections below.

Setting up the Spell Checker

There are a fair number of steps required to set up the relevant classes. First we need a TrainSpellChecker object which we will train up on the data which we are indexing with the search engine. The setup is:

FixedWeightEditDistance fixedEdit =
    new FixedWeightEditDistance(MATCH_WEIGHT,
                                DELETE_WEIGHT,
                                INSERT_WEIGHT,
                                SUBSTITUTE_WEIGHT,
                                TRANSPOSE_WEIGHT);

NGramProcessLM lm = new NGramProcessLM(NGRAM_LENGTH);
TokenizerFactory tokenizerFactory
    = new IndoEuropeanTokenizerFactory();
TrainSpellChecker sc
    = new TrainSpellChecker(lm,fixedEdit,
                            tokenizerFactory);

Working from top to bottom, we start by creating a FixedWeightEditDistance object, which sets the weights for operations like deletions, insertions, etc. Reasonable values are supplied in the demo, but they can be adjusted to account for variations in particular data sets. There is a lot of room for experimenting with this class, but that is beyond the scope of this tutorial.
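
For concreteness, here is a sketch of how those weight constants might be declared. The values are illustrative only (they match the parameter dump shown later in the evaluation section), not a recommendation for every data set:

static final double MATCH_WEIGHT      =  0.0;
static final double DELETE_WEIGHT     = -2.0;
static final double INSERT_WEIGHT     = -2.0;
static final double SUBSTITUTE_WEIGHT = -2.0;
static final double TRANSPOSE_WEIGHT  = -2.0;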

The next step is to create an NGramProcessLM, which does the language modeling at the character level for our demo. All it needs to know is the n-gram length, that is, the number of characters to use when modeling the data. Smaller data sets will do better with shorter n-grams. Once again, this is something you can play with to evaluate the quality of your queries and the suggested corrections.

In addition, we need a TokenizerFactory to help with the editing of queries--this tokenizer should ideally match the tokenization done by the Lucene analyzer, which must be an implementation of org.apache.lucene.analysis.Analyzer--see below.

It all comes together with the TrainSpellChecker class which will do the training sensitive to tokenization.

Setting up the Lucene search engine

Lucene builds the search index with the IndexWriter class, which takes a directory to write the index to, an analyzer to tokenize the data, and a limit on the maximum field length.

    static final File LUCENE_INDEX_DIR = new File("lucene");
    static final StandardAnalyzer ANALYZER = new StandardAnalyzer(Version.LUCENE_30);
    static final String TEXT_FIELD = "text";
...
        FSDirectory fsDir 
            = new SimpleFSDirectory(LUCENE_INDEX_DIR,
                                    new NativeFSLockFactory());
        IndexWriter luceneIndexWriter 
            = new IndexWriter(fsDir,
                              ANALYZER,
                              IndexWriter.MaxFieldLength.LIMITED);
...

In this case we have chosen to set a static variable ANALYZER to an instance of Lucene's StandardAnalyzer, which applies some standard filters to the data and converts it to lowercase. The last argument limits the maximum size of a field, with a default well above the size of documents we're dealing with.

Indexing documents and training spelling correction

Adding the data is very simple: all we need to do is iterate over the documents and send them off to the respective methods for language-model training and index addition.

String[] filesToIndex = DATA.list();
for (int i = 0; i < filesToIndex.length; ++i) {
    System.out.println("     File="
                       + DATA + "/"
                       +filesToIndex[i]);
    File file = new File(DATA,filesToIndex[i]);
    String charSequence
        = Files.readFromFile(file);
    sc.handle(charSequence);
    Document luceneDoc = new Document();
    Field textField
        = new Field(TEXT_FIELD,charSequence,
                    Field.Store.YES,Field.Index.TOKENIZED);
    luceneDoc.add(textField);
    luceneIndexWriter.addDocument(luceneDoc);
}

System.out.println("Writing model to file="
                   + MODEL_FILE);
writeModel(sc,MODEL_FILE);

System.out.println("Writing lucene index to ="
                   + LUCENE_INDEX_DIR);
luceneIndexWriter.close();

Once we have the string for the file, training the language model is a one-liner: sc.handle(charSequence). Getting the data into Lucene is a bit more work because Lucene structures the data in a more complex way in support of all those nice search-engine features. We create a new Document object, add the data under a field name (which we will need when querying), and finally add the document to the index.

Now we illustrate writing the spell checker and Lucene index to disk, where they can be read back in by an application. Lucene wraps closing and writing the index to disk nicely and there is a single method call luceneIndexWriter.close() that does all the work. Writing out the spell checker is a bit more work and the steps are in the demo writeModel method.

private static void writeModel(TrainSpellChecker sc,
                               File modelFile)
    throws IOException {

    FileOutputStream fileOut
        = new FileOutputStream(modelFile);
    BufferedOutputStream bufOut
        = new BufferedOutputStream(fileOut);
    ObjectOutputStream objOut
        = new ObjectOutputStream(bufOut);

    // write the spell checker to the file
    sc.compileTo(objOut);

    Streams.closeOutputStream(objOut);
    Streams.closeOutputStream(bufOut);
    Streams.closeOutputStream(fileOut);
}

This is a bunch of Java I/O with the compileTo method doing the LingPipe specific stuff. (Note: For more robust Java coding, the closes would be in a try/finally block.)

Reading in from Disk

Once the model and index are written to disk, we can read them in and correct queries. The demo's readModel method combines the relevant Java I/O to de-serialize the class. Note that the class has changed to CompiledSpellChecker, which is required for generating corrections; the main reason for this is that compiled language models execute much faster than their dynamic counterparts.
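
The exact readModel code is not reproduced in this text, but a minimal sketch, assuming the signature used in the demo and standard Java serialization I/O, looks like this:

private static CompiledSpellChecker readModel(File modelFile)
    throws IOException, ClassNotFoundException {

    InputStream in
        = new BufferedInputStream(new FileInputStream(modelFile));
    ObjectInputStream objIn = new ObjectInputStream(in);
    try {
        // the compiled model deserializes directly to a CompiledSpellChecker
        return (CompiledSpellChecker) objIn.readObject();
    } finally {
        objIn.close();
    }
}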

// read compiled model from model file
System.out.println("Reading model from file="
                   + MODEL_FILE);
CompiledSpellChecker compiledSC
    = readModel(MODEL_FILE);
IndexSearcher luceneIndex
    = new IndexSearcher(fsDir);

System.out.print(TEST_INTRO);
testModelLoop(compiledSC,luceneIndex);

Loading the Lucene index similarly changes the core class to IndexSearcher, and we are ready to start correcting queries. The last line of the above code calls the interactive command-line loop.

Query correction loop

An actual deployment of a search engine requires much more than what the demo presents--in particular, some sort of access to documents. But hopefully the code below shows enough to get spell checking up and running within whatever fuller-featured deployment you have in mind.

The method below has an endless while loop reading from the command prompt. The flow is to get the query, run it against Lucene and report the counts. Then look for an alternate query and run that against the Lucene index and again report the results.

static void testModelLoop(SpellChecker sc,
                          IndexSearcher luceneIndex)
        throws IOException, ParseException {

    InputStreamReader isReader
        = new InputStreamReader(System.in);
    BufferedReader bufReader
        = new BufferedReader(isReader);

    QueryParser queryParser 
        = new QueryParser(Version.LUCENE_30,
                          TEXT_FIELD,
                          ANALYZER);

    while (true) {
        // collect query or end if null
        System.out.print(">");
        System.out.flush();
        String queryString = bufReader.readLine();
        if (queryString == null || queryString.length() == 0)
            break;

        Query query = queryParser.parse(queryString);
        TopDocs results = luceneIndex.search(query,MAX_HITS);
        System.out.println("Found "
            + results.totalHits
            + " document(s) that matched query '"
            + query + "':");

        // compute alternative spelling
        String bestAlternative = sc.didYouMean(queryString);

        Query alternativeQuery
            = queryParser.parse(bestAlternative);
        TopDocs results2
            = luceneIndex.search(alternativeQuery,MAX_HITS);

        System.out.println("Found " + results2.totalHits
            + " document(s) matched best alternate '"
            + bestAlternative + "':");
    }
}

Looking more closely at the code, other than messing about with I/O and killing the loop when an empty line is entered, there are just a few key methods. First we use the Lucene QueryParser class to produce a Query. The parser is constructed with the Lucene version, the field in the Lucene index to be searched, here TEXT_FIELD again (we only have one, but you could imagine a much more complex setup with title, author, etc.), and an analyzer used to tokenize queries, here ANALYZER again; the query string itself is then handed to the parse method. It is generally wise to use the same analyzer for indexing and for parsing queries.

Next we collect the documents that match the query with TopDocs results = luceneIndex.search(query,MAX_HITS);. That's all we need from Lucene to report the number of documents --- the TopDocs object also provides access to the matching documents in order of score through its scoreDocs array.

Finally we get to the whole reason for this tutorial with String bestAlternative = sc.didYouMean(queryString); which takes the language model we built along with the text of the query at hand, and attempts to find a 'better match' through simple edits of the query scored against the language model. If it doesn't find any better-scoring variations on the query, it returns the original back as the best choice. Then the demo runs the returned query and reports on the search results.

Tuning and Deployment

A deployed query spell checker will need to be quite a bit fancier than this demo, but this should make it clear how to add the capability to whatever your vision is. Tuning these models can be a bit complex, and the size of the underlying data can be an issue as well. Ideally, you will have an evaluation set of queries, both good and bad, generated by actual users, that you can use to select appropriate parameters.

Sizing, Efficiency and Tuning

In this section, we'll explore the effects of various model components on sizing and run time. Sizing will vary among the training component, the on-disk compiled model and the in-memory representation used by the compiled spell checker. Training efficiency is simple, but run-time efficiency is a more complex issue.

Prefixes and Branching Factor

Many of the efficiency issues are determined by the set of prefixes of tokens in the training set. For instance, the word "Corpus" has prefixes "", "C", "Co", "Cor", "Corp", "Corpu", "Corpus". The word "Corn" has five prefixes, the first four of which are shared with "Corpus". If we train using strings "Corn" and "Corpus", the prefix "Cor" may be followed by either the character 'n' or the character 'p'. The number of characters that may follow a prefix is called the branching factor of the prefix.
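
As a toy illustration of prefixes and branching (plain Java, not part of the LingPipe API), the following collects the set of characters that can follow each prefix of a set of tokens:

String[] tokens = { "Corn", "Corpus" };
Map<String,Set<Character>> followers
    = new HashMap<String,Set<Character>>();
for (String token : tokens) {
    for (int i = 0; i < token.length(); ++i) {
        String prefix = token.substring(0,i);
        Set<Character> nextChars = followers.get(prefix);
        if (nextChars == null) {
            nextChars = new HashSet<Character>();
            followers.put(prefix,nextChars);
        }
        nextChars.add(Character.valueOf(token.charAt(i)));
    }
}
// branching factor of "Cor" is 2, since 'n' or 'p' may follow
System.out.println(followers.get("Cor").size());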

Character Language Model

The main component of a spelling model is a character language model which models the training text. The size required for a character language model is proportional to the number of unique substrings found in the training set. These are not token sensitive, but are bounded in length by the length of n-gram used in the language models.

Training

During training, a dynamic character language model is trained with all of the data provided. This is represented as a reasonably compact in-memory trie structure. The only factor affecting speed and size is the length of the n-gram in the character language model. Longer n-grams require more space and take more time to train.

The language model itself is available through the method TrainSpellChecker.lm(), though pruning may be carried out through the method TrainSpellChecker.pruneLM(int). This pruning will affect the in-memory representation and the compiled model. It is an irreversible operation. Pruning periodically to a low count threshold such as 2 may allow much more data to be trained than would be possible in an unpruned model. Because language data follows a power-law (Zipf) distribution, most of the sequences stored have counts of 1. It is usually better to train a large model and prune it than train the largest possible unpruned model. Pruning may or may not hurt accuracy with respect to an unpruned model; at a very low pruning threshold, such as 2, there is relatively little impact on accuracy of probability estimates.
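
A hedged sketch of periodic pruning during training follows; the batch size and threshold are illustrative only, and sc and DATA are the trainer and data directory from the demo above:

// prune the dynamic LM back to counts of at least 2 every 1000 files
int numTrained = 0;
for (String fileName : DATA.list()) {
    String text = Files.readFromFile(new File(DATA,fileName));
    sc.handle(text);
    if (++numTrained % 1000 == 0)
        sc.pruneLM(2);
}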

Compilation and Run Time

Additional memory is required for the compiler to perform a breadth-first walk over the language model's trie. Usually this isn't a very large overhead, but it'll be larger for languages with large branching factors.

When a language model is compiled, the bytes output correspond to parallel arrays representing the trie that are the same size as in memory. The compiled model will usually be a bit more compact than the in-memory model. The size is determined by the number of internal nodes and number of terminal nodes in the trie.

Token Sensitivity

A spell checker only optionally contains a tokenizer. If a tokenizer is provided, the set of tokens found in the training corpus is stored and used to control search. This only slightly slows down training, but does require the space to store the tokens, which is usually small compared to the space required to store the language model.

Training

The other component stored during training is the set of tokens along with their counts. The counts are stored as an instance of util.ObjectToCounterMap, which is available through the method TrainSpellChecker.tokenCounter(). The set of tokens may be pruned back to only include ones with counts above a specified threshold through the method TrainSpellChecker.pruneTokens(int).

Compiled Model

The compiled model only stores the set of tokens as a simple array of strings.

Run Time

The set of tokens is converted into a character-level trie structure that is used to control search. In LingPipe's encoding, a Java object is stored for every prefix in the set of tokens along with the character used to get there. Thus the runtime size of a model may be larger than on disk if there is a non-empty token set.

Search for corrections is carried out from left-to-right. At each point, a partial hypothesis is expanded with every character which could follow the partial hypothesis and lead to one of the known tokens (the user input is always allowed as an output hypothesis). Thus the branching factor controls the amount of time required to search for spelling corrections.

Pruning the set of tokens may help or hurt accuracy; it usually has little effect with low pruning thresholds, but an increasingly negative effect on recall (finding corrections for ill-formed input) as the number of known tokens decreases.

Unicode Normalization and Case Sensitivity

The spell checker is case sensitive. If you need a case-insensitive spell checker, the tokenizer factory is the right place to implement case normalization; you will need to provide a compilable tokenizer factory that handles it. The alternative is to make sure all the training data is case normalized before being sent to the trainer, and that queries are likewise normalized before being sent to the spell checker. If normalization happens in the tokenizer factory, there is no way for the two to get out of sync. If you are training without tokenization, then normalization must be handled externally.
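
If you take the external-normalization route, a minimal sketch is simply to lower-case text on both sides before it reaches the spell checker (the use of Locale.ENGLISH here is illustrative):

// normalize case before training ...
sc.handle(trainingText.toLowerCase(Locale.ENGLISH));

// ... and before requesting a correction
String bestAlt
    = compiledSC.didYouMean(queryString.toLowerCase(Locale.ENGLISH));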

In addition to case sensitivity, the spell checker is sensitive to the particular Unicode form of characters. If a wide range of character forms is present (e.g. Latin1 characters mixed with separate base characters and combining diacritics, or mixed with half-width CJK variants), a package like the International Components for Unicode (ICU) should be used for normalization. ICU is also the best package available for generic case normalization, which it calls "case folding".

Compiled Spell Checker Parameters

The remaining parameters only apply to the compiled spell checker, and thus are all defined as methods in the class spell.CompiledSpellChecker.

Edits Allowed

There are five methods which control which edit operations are allowed: setAllowDelete(boolean), setAllowInsert(boolean), setAllowMatch(boolean), setAllowSubstitute(boolean), and setAllowTranspose(boolean). For instance, calling setAllowDelete(true) allows characters to be deleted in edits.

Allowing all edits almost always leads to more accurate decoders. Turning off edit types always makes decoding faster. The most expensive edits by far are insert and substitute; delete, match and transpose are all bounded by the input string.

In some special applications, such as using spelling correctors for Chinese tokenization or case correction, only certain edits will be allowed in the model. For instance, only insert and match for Chinese tokenization and only substitute and match for case correction.
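
For example, a configuration for case correction would turn off everything except substitution and matching, using the setters listed above:

compiledSC.setAllowMatch(true);
compiledSC.setAllowSubstitute(true);
compiledSC.setAllowInsert(false);
compiledSC.setAllowDelete(false);
compiledSC.setAllowTranspose(false);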

Number of Consecutive Inserts

Insertion is the most expensive operation, because it may apply consecutively without bound as long as it is extending a known token (or any time at all if there are no tokens). Thus there is a method setNumConsecutiveInsertionsAllowed(int) which controls how many insertions are allowed in a row. Set this to 1, its default value, if at all possible. This usually helps accuracy and has a huge impact on speed.

N-best Size

The n-best size restricts how many potential edits are considered at once. The smaller the n-best size, the faster the decoder. The problem is that with a small n-best size, the hypothesis that would eventually be the best may fall off the search space; this is known as a search error. Typically, the n-best size is tuned to the lowest value that does not lead to performance degradation through search errors.

If the spell checker is used to provide n-best results, the n-best size should be set significantly higher than the number of n-best results desired. This will prevent any of the top n final hypotheses falling off the queue of hypotheses too early, essentially preventing n-best search errors.

Do-not-Edit Tokens

A set of tokens may be provided through the method setDoNotEditTokens(Set<String>). If these tokens are seen in the input, they will not be considered for edits. This can have a huge impact on run time: the larger this set is, the faster the decoder will run. As with other parameters, configure the set of do-not-edit tokens to be as large as possible. Usually this is done by taking the token counter from the spell checker trainer and saving the tokens with high counts. Unfortunately, there is not a direct way to save this set of do-not-edit tokens as part of the model because it would have broken backward compatibility on the model format.
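
A sketch of building such a set from the trainer's token counter follows; the count threshold is illustrative, while tokenCounter() and setDoNotEditTokens() are the methods documented above (trainer and compiledSC are the trainer and compiled checker from the demos):

ObjectToCounterMap<String> tokenCounts = trainer.tokenCounter();
Set<String> doNotEdit = new HashSet<String>();
for (String token : tokenCounts.keySet())
    if (tokenCounts.getCount(token) >= 1000)  // illustrative threshold
        doNotEdit.add(token);
compiledSC.setDoNotEditTokens(doNotEdit);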

Minimum Token Length to Correct

Short tokens wreak havoc with both the language models and the search space. That's because strings like "of the" are so popular in text relative to the number of times they are misspelled. The method setMinimumTokenLengthToCorrect(int) sets the minimum token length in the input that is considered for correction. We have found that for English, setting this to 3 is a workable value. Users rarely misspell words shorter than this.

First- and Second-Character Edit Costs

In practice, users make fewer errors on the first and second characters in words than in the middles or ends of words. This is not reflected in the weighted edit distance directly, so we allow it to be set programmatically in the decoder. The methods setFirstCharEditCost(double) and setSecondCharEditCost(double) provide additional costs for edits of the first or second character in a user's input token. Note that high first- and second-character edit costs make it less likely that short strings will be edited. We have found values of -2.0 for the first character and -1.0 for the second character to be good initial values for English.

Known Token Edit Cost

The language model will often want to edit tokens that are known, because the resulting string is much more likely. In practice, though, if a user enters a correctly spelled (known) token, it is much less likely to be an error. The known-token edit cost corrects for a bias in queries toward rarer known words that is not in line with the underlying character language model. The method setKnownTokenEditCost(double) may be used to provide an additional penalty for editing a known token. We have found values of -2.0 or -3.0 to be good initial values for English.
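
Pulling the remaining run-time parameters together, a configuration sketch using the English starting values suggested above would be:

compiledSC.setNumConsecutiveInsertionsAllowed(1);
compiledSC.setMinimumTokenLengthToCorrect(3);
compiledSC.setFirstCharEditCost(-2.0);
compiledSC.setSecondCharEditCost(-1.0);
compiledSC.setKnownTokenEditCost(-2.0);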

Evaluating Spelling Correctors

In this section, we provide examples of how to evaluate spelling correction. Evaluation is based on a human-annotated reference file matching queries with their corrected versions.

Training a Spell Checker

We've included a spell-checker trainer based on the hockey data in the directory $LINGPIPE/demos/data/rec.sport.hockey/train that not only compiles the model to a file, but also saves the token counter to a file.

To run the trainer, use the ant target train:

> ant train
Visiting directory=..\..\data\rec.sport.hockey\train
Training on file=..\..\data\rec.sport.hockey\train\52550
Training on file=..\..\data\rec.sport.hockey\train\52551
Training on file=..\..\data\rec.sport.hockey\train\52552
...
Training on file=..\..\data\rec.sport.hockey\train\55022

Compiling spell checker to file=rec.sport.hockey.CompiledSpellChecker

Serializing token counts to file=rec.sport.hockey.ObjectToCounterMap

FINISHED NORMALLY.

The training code may be found in the file src/TrainSpell.java. Because it is so similar to the earlier code, we only comment on how the models are constructed and written to file.

    ...
    NGramProcessLM lm = new NGramProcessLM(nGram);
    FixedWeightEditDistance fixedEdit = new FixedWeightEditDistance(); // dummy
    TokenizerFactory tokenizerFactory
        = new IndoEuropeanTokenizerFactory();
    TrainSpellChecker trainer
        = new TrainSpellChecker(lm,fixedEdit,tokenizerFactory);
    ...
    // train
    ...
    AbstractExternalizable.compileTo(trainer,outputModelFile);
    AbstractExternalizable.serializeTo(trainer.tokenCounter(),outputTokensFile);

Creating a Gold-Standard Reference File

To evaluate a spell checker, we need some indication of what the expected results are. For this demo, we define a simple file-based format for specifying the truth.

Typically, the queries being evaluated for spell checking will be derived from an actual query log. The format is set up to allow the original lines from the query log to be maintained.

There is an example for the hockey data in $LINGPIPE/demos/data/spell-eval-annotated.txt:

D:2178  Canad dian 2006-03-01 20:37:39
O:Canad dian
C:Canadian

D:2178  CanadianHockey  2006-03-01 20:38:46
O:CanadianHockey
C:Canadian Hockey

D:2178  hokey   2006-03-01 20:38:46
O:hokey
C:hockey

D:2334  wayne gretski   2006-03-05 18:41:50
O:wayne gretski
C:Wayne Gretzky

D:2334  Calgary 2006-03-05 18:41:50
O:Calgary
C:Calgary

D:2334  hky      2006-03-05 18:41:50
O:hky
C:hockey

D:2334  exuberation      2006-03-05 18:41:50
O:exuberation
C:exuberation

The format is based on line-triplets with prefixes indicating function. The opening line of a triplet begins with D: and simply repeats the raw query log information. This is often helpful for purposes external to this evaluation. We carry these lines along for display, but they have no other role.

The next two lines are marked with O: and C: for original and corrected query. Note that if the original query was correct, the correct form should be identical. Also note that spelling correction is case sensitive, so the data in the gold-standard reference file is also case sensitive. For the purposes of this demo, we assume the queries do not have line breaks and that the characters are encoded in Latin1 (ISO-8859-1).

Running the Evaluation

The evaluation of the trained hockey spelling corrector against the demo evaluation data file is done through another preconfigured ant target, evalSimple.

Parameter Dump

The evaluator first prints out the parameters of the spell checker being evaluated:

> ant evalSimple

SEARCH
  N-best size=32

TOKEN SENSITIVITY
  Token sensitive=true
  # Known Tokens=9673

EDITS ALLOWED
  Allow insert=true
  Allow delete=true
  Allow match=true
  Allow substitute=true
  Allow transpose=true
  Num consecutive insertions allowed=1
  Minimum Length Token Edit=3
  # of do-not-Edit Tokens=0

EDIT COSTS
  Edit Distance=Edit Distance Class=class com.aliasi.spell.FixedWeightEditDistanceFixedWeightEditDistance costs:  match weight=0.0  insert weight=-2.0  delete weight=-2.0  substitute weight=-2.0  transpose weight=-2.0
  Known Token Edit Cost=-2.0
  First Char Edit Cost=-2.0
  Second Char Edit Cost=-1.0

EDIT DISTANCE
Edit Distance Class=class com.aliasi.spell.FixedWeightEditDistanceFixedWeightEditDistance costs:  match weight=0.0  insert weight=-2.0  delete weight=-2.0  substitute weight=-2.0  transpose weight=-2.0

TOKENIZER FACTORY
com.aliasi.tokenizer.IndoEuropeanTokenizerFactory@46ae506e

...

Case by Case Evaluation

Next, the program goes case by case through the reference data, evaluating each case:

...

ec:
D:2178  hock ey 2006-03-01 20:37:39
O: Canad (0) dian (0)  : -43.46794876269996
C: Canadian (71)  : -13.696078942157328
S: Canadian (71)  : -13.696078942157328
------- user error, spell check correctly

ec:
D:2178  CanadianHockey  2006-03-01 20:38:46
O: CanadianHockey (0)  : -49.16564917261712
C: Canadian (71) Hockey (191)  : -20.98497186028908
S: Canadian (71) Hockey (191)  : -20.98497186028908
------- user error, spell check correctly

ec:
D:2178  hokey   2006-03-01 20:38:46
O: hokey (0)  : -23.082462430000305
C: hockey (434)  : -11.477494647726417
S: hockey (434)  : -11.477494647726417
------- user error, spell check correctly

ee:
D:2334  wayne gretski   2006-03-05 18:41:50
O: wayne (4) gretski (0)  : -67.6034702733159
C: Wayne (39) Gretzky (68)  : -21.767181660390634
S: wayne (4) Gretzky (68)  : -24.96715794910415
------- user error, spell check wrong suggestion

cc:
D:2334  Calgary 2006-03-05 18:41:50
O: Calgary (128)  : -13.610888023802545
C: Calgary (128)  : -13.610888023802545
S: Calgary (128)  : -13.610888023802545
------- user correct, spell check correct

e_:
D:2334  zz      2006-03-05 18:41:50
O: hky (0)  : -17.589905977249146
C: hockey (434)  : -11.477494647726417
S: hky (0)  : -17.589905977249146
------- user error, spell check no suggestion

ee:
D:2334  zz      2006-03-05 18:41:50
O: exuberation (0)  : -42.65078176371753
C: exuberation (0)  : -42.65078176371753
S: edu (956)  : -10.196045160293579
------- user correct, spell check incorrect

...

The output looks very much like the input with further annotations. For each annotation entry the EvaluateSpell class runs the original query from the O: line in the reference file against the didYouMean() method of the spell checker. The C: line is also from the reference file, and represents the intended correction. The S: line contains the string returned by the didYouMean() method. The system is correct when the system answer matches the correct result.

The training-data count for each token is displayed in parentheses. For instance, Wayne appeared 39 times in the data and wayne only 4 times. Tokens that appear zero times are special in that they are considered unknown tokens. Unknown tokens have less of a penalty to edit, but cannot be the output of editing when the spell checker is configured to be token sensitive.

In addition, each of the O:, C: and S: lines has the log2Estimate() from the underlying language model in the CompiledSpellChecker appended. This score is very important for tuning, as will become evident.

There are 5 possible outcomes in the scoring, which are reflected in compact form on the first line of each output entry:

Class   Error?         Description
ee:     user error     didYouMean() wrong
ec:     user error     didYouMean() right
e_:     user error     didYouMean() no suggestion
ce:     user correct   didYouMean() wrong
cc:     user correct   didYouMean() no suggestion

Evaluation Summary

The final output is a summary of the various counts:

...
user error: sys correct 3, sys incorrect 1, sys no sugg 1
user correct: sys no suggestion 1, sys incorrect 1
Out of vocab query count: 1
Out of vocab query wrong: 1
Score 0.30000000000000004

Scoring Metric

The final report statistics are self-explanatory other than for the score, which is computed as follows:

   double score
       = userErrorSystemNoSuggestion * -2.0
       + userErrorSystemWrongCorrection * -1.0
       + userErrorSystemCorrect * 1.0
       + userCorrectSystemWrongCorrection * -1.5;

The idea is to compute an overall utility for the spell checker. The scoring parameters will depend on the business use case. The above scoring system represents a reward of 1 in the case of the system properly correcting a user error, a penalty of -1 for suggesting the wrong correction, a penalty of -2 for not making any suggestion at all when there is a user error, and a penalty of -1.5 when the user was correct and the system proposed an erroneous correction. The motivation for this particular scoring is that the worst kind of error is one where the user query is an error, but the system makes no suggestion, because this is the case where the user will never find the right answer. If we provide a correction for a correct user input, at least they see the correct results for the search initially.

The Tuning Cycle

After creating labeled evaluation data, various system parameters may be tested for their performance. Typically, we start with reasonable defaults (which will be much larger penalties for edits when we get to realistically sized data), and then tune the parameters one at a time in an attempt to improve the system.

It's very easy to tune posterior weights, like edit distances. It's harder to tune the underlying language model, as it must be recompiled. It's possible to serialize a large language model and then prune it to evaluate, though we do not provide code to do that as part of this tutorial.

It's important to realize that a reference collection used for tuning plays the role of development data. It's tempting to highly optimize scores on development data, but doing so is dangerous due to the likelihood of overfitting. A statistically sound evaluation would have you set aside two labeled reference data sets. One would be used for development, that is, to tune the parameters. The second would then be used to report final test results. This is a much sounder estimate of future performance than testing on a single development set.

Automatic Phrase Completion

LingPipe provides an alternative to spelling checking that finds the most likely completion among a set of fixed phrases for the text entered so far by a user.

Web Form Auto-Completion

Auto-completion is by now familiar from the web, for instance on Google's home page. If I type <anaz> as a query, Google pops up the following suggestions:

Google auto completion example

Note that the application is performing spell checking at the same time as completion. For instance, the top suggestion is "amazon", even though the query so far is "anaz". This is not surprising given that the number of results reported for phrases starting with "anaz" is so small.

Next note that it's not doing word suggestion, but phrase suggestion. Some of the results like "anazao salon" are two words.

One important difference between the auto-completion and spell checking is that auto-completion typically operates over a fixed set of phrases. What this means is that if I type a query <I want to find anaz>, there are no suggested completions.

Finding Phrases with Counts

The source of phrases for a web search is typically high frequency queries from the query logs. In Google's case, they don't return suggestions in order of number of results; if they did, anazapta would occur above anazao, since they both match exactly on the prefix.

For our demos, we'll assume that the user supplies a list of phrases followed by counts. For instance, we supply a list of U.S. states in $LINGPIPE/demos/data/us-state-populations-06.txt. This initial segment of the population count file is:

Alabama 4599000
Alaska 670000
Arizona 6166000
Arkansas 2811000
California 36458000
Colorado 4753000
Connecticut 3505000
Delaware 853000
District of Columbia 58200
Florida 18090000
...

The separators are just single space characters. We've just used state populations (estimated in the middle of 2006), which may or may not be a good proxy for search popularity. It's a reasonable proxy if you are asking people randomly selected from the U.S. population to list their home state. But if you're selecting a travel destination, Hawaii and the District of Columbia are much more popular than their populations would suggest. In general, the counts are going to be used to balance which results are most popular, and should reflect the results likely to be desired by users.

Running the Command-Line Demo

We've included a command-line demo, as well as a GUI demo we describe in the next section. To run the command-line demo, run the ant target:

> ant complete-cmd

|N|
 -3.95 New York
 -5.08 North Carolina
 -5.10 New Jersey
 -6.90 Nevada
 -7.26 New Mexico

|New|
 -3.95 New York
 -5.10 New Jersey
 -7.26 New Mexico
 -7.83 New Hampshire
-16.90 Nevada

|New Y|
 -3.95 New York
-15.10 New Jersey
-17.26 New Mexico
-17.83 New Hampshire

|New Yor|
 -3.95 New York

|Mew |
-13.95 New York
-15.10 New Jersey
-17.26 New Mexico
-17.83 New Hampshire

|U|
 -6.87 Utah
-13.04 California
-13.67 Texas
-13.95 New York
-14.05 Florida

|Uta|
 -6.87 Utah
-23.04 California

|ZebraFish|

The results report the queries (between vertical bars so the spaces are clear) followed by up to five auto-complete suggestions. The scores are log (base 2) probabilities, scaled the same way as for spelling (more on that below).

Note that we get auto-complete suggestions from non-matching prefixes, as in the query <New> getting the auto-complete suggestion "Nevada".

Running the GUI Demo

If you'd like to play around with queries, you can either add new command-line arguments to the command in ant, or you can run the GUI demo, which displays results in real-time as you type:

> ant complete-gui

which pops up an interface that looks like this:

LingPipe auto-completion GUI demo

It's configured to only suggest three completions per input.

The Search Problem

As explained in the javadoc for spell.AutoCompleter, the auto-completer finds the best scoring phrases for a given prefix. The score of a phrase versus a prefix is the sum of the score of the phrase and the maximum score of the prefix against any prefix of the phrase. The score for a phrase is just its log (base 2) maximum likelihood probability estimate; that is, the log of its count divided by the sum of all counts. The score for matching "Mew Y" against "New York" requires a substitution of the initial 'M' for 'N'. Scores are computed by weighted edit distance, just as for spelling.
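
In symbols, restating the description above (the notation is ours, not LingPipe's), for a user prefix q and a candidate phrase w with count c(w):

\[
\mathrm{score}(q,w) \;=\; \log_2 \frac{c(w)}{\sum_{w'} c(w')}
\;+\; \max_{p \,\in\, \mathrm{prefixes}(w)} \mathrm{editScore}(q,p)
\]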

Code Walkthrough

Configuring auto-completion is very much like configuring spelling, only with a fixed list of phrases and counts supplied to the constructor along with search parameters and a weighted edit distance.

We'll walk through the command-line demo, all of the code for which is in the main() method in the file src/AutoCompleteCommand.java.

The initial part of the code just reads the phrase counts from the file into a map from String to Float objects:

public static void main(String[] args) throws IOException {
    File wordsFile = new File(args[0]);
    String[] lines = FileLineReader.readLineArray(wordsFile,"ISO-8859-1");
    Map<String,Float> counter = new HashMap<String,Float>(200000);
    for (String line : lines) {
        int i = line.lastIndexOf(' ');
        if (i < 0) continue;
        String phrase = line.substring(0,i);
        String countString = line.substring(i+1);
        Float count = Float.valueOf(countString);
        counter.put(phrase,count);
    }

Note the use of the FileLineReader.readLineArray() method, which reads the lines from a file into an array using a specified character encoding. The code in the loop just parses the lines apart based on the last space and adds them to the map.

The next step is to configure the edit distance. This will measure how close a prefix of a target phrase is to the query prefix. The demo uses a fixed-weight edit distance, but any edit distance may be used in general:

    double matchWeight = 0.0;
    double insertWeight = -10.0;
    double substituteWeight = -10.0;
    double deleteWeight = -10.0;
    double transposeWeight = Double.NEGATIVE_INFINITY;
    FixedWeightEditDistance editDistance
        = new FixedWeightEditDistance(matchWeight,
                                      deleteWeight,
                                      insertWeight,
                                      substituteWeight,
                                      transposeWeight);

Finally, we need to configure the auto-completer itself:

    int maxResults = 5;
    int maxQueueSize = 10000;
    double minScore = -25.0;
    AutoCompleter completer
        = new AutoCompleter(counter, editDistance,
                            maxResults, maxQueueSize, minScore);

In addition to the phrase counter (in general, a map from strings to numbers) and edit distance, the configuration specifies the maximum number of results to return, the maximum search queue size, and the minimum score of returned results (phrases with match scores below the minimum will be discarded rather than displayed).

Finally, we walk through the arguments and compute the auto-completions.

    for (int i = 1; i < args.length; ++i) {
        SortedSet<ScoredObject<String>> completions
            = completer.complete(args[i]);

which are returned as an instance of java.util.SortedSet containing ScoredObject<String> objects, where a scored string consists of a string and a score. We then just walk over the scored strings and print them out:

        System.out.println("\n|" + args[i] + "|");
        for (ScoredObject<String> so : completions)
            System.out.printf("%6.2f %s\n", so.score(), so.getObject());
    }
}

Tuning Auto-Completion

There are really only three things to tune for auto-completion: the edit distance, the maximum number of results, and the maximum queue size. The edit distance is tuned in exactly the same way as it is for spelling.

The maximum number of results to return is more of an application decision than a tuning decision. Having said that, smaller result sets are faster to compute.

The maximum queue size indicates how big the set of hypotheses can get inside the auto-completer before being pruned. To tune, this should be set as low as possible without causing search errors.

References

For further reading, check out these summaries:

Here are a couple of other alternatives, both explained by Tom White.

For the case restoration problem, a good reference evaluation is the following paper from IBM:

For accent restoration, David Yarowsky describes a classifier-based approach for Spanish and French (ironically, wrongly cased in the IBM bibliography due to human error with BibTeX).