com.aliasi.spell
Class AutoCompleter

java.lang.Object
  extended by com.aliasi.spell.AutoCompleter
All Implemented Interfaces:
Serializable

public class AutoCompleter
extends Object
implements Serializable

An AutoCompleter maintains a dictionary of phrases with counts and provides suggested completions based on prefix matching by weighted edit distance and phrase likelihood.

The maximum number of results cannot be changed dynamically, because it is used during construction to set up all of the compiled trie structures used for decoding. To change the maximum number of results, simply construct a new auto completer using the same phrase counts as edit distance as this completer; these may be retrieved using the methods phraseCountMap() and editDistance().

Scoring

For a given input, the complete(String) method returns a sorted set of scored objects, containing the most likely phrase completions and their score, up to the maximum number of results specified during construction.

Scores for the phrases are defined by the log (base 2) of their probability estimates:

 score(phrase) = log2 p'(phrase)
where probabilities are estimated using maximum likelihood:
 p'(phrase) = count(phrase) / Σphrase' count(phrase')

Additive smoothing may be easily carried out on the inputs, so it is not carried out by this class.

The score for a prefix matching a phrase is given by:

 score(prefix,phrase)
   = MAXphrase.startsWith(prefix') editDistance.distance(prefix,prefix') + log2 p'(phrase)
In words, the score for a prefix matching a phrase is the sum of log probability of the phrase plus the edit distance between the prefix and the best matching prefix of the phrase. The edit distances should thus be scaled as log probabilities in order to combine with the phrase probabilities properly. See the class documentation for TrainSpellChecker for general advice on combining the tuning of edit distance with that of phrase probabilities.

Thread Safety

After safe publication, an AutoCompleter is completely thread safe. Setting the maximum search queue size is safe because integer writes are atomic, but it is not declared volate, and hence may not be visible to other threads without synchronization.

Serialization

An AutoCompleter may be serialized if and only if its weighted edit distance is serializable. If so, the result of serializing and reconstituting an auto-completer will produce an auto-completer with the same behavior as the one serialized, modulo the edit distance serialization, which is under control of the edit distance implementation.

Since:
Lingpipe3.8
Version:
3.8
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
AutoCompleter(Map<String,? extends Number> phraseCounts, WeightedEditDistance editDistance, int maxResultsPerPrefix, int maxSearchQueueSize, double minScore)
          Construct an automatic completer from the specified phrases, phrase counts, edit distance, and search parameters.
 
Method Summary
 SortedSet<ScoredObject<String>> complete(String in)
          Returns a set of scored phrases sorted into decreasing order of score.
 WeightedEditDistance editDistance()
          Returns the weighted edit distance for this auto-completer.
 int maxResultsPerPrefix()
          Returns the maximum number of results returned by this auto completer for each input prefix.
 int maxSearchQueueSize()
          Returns the maximum number of elements on the search queue.
 Map<String,Float> phraseCountMap()
          Returns the phrase counter for this auto completer.
 void setMaxSearchQueueSize(int size)
          Sets the maximum search queue size to the specified value.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AutoCompleter

public AutoCompleter(Map<String,? extends Number> phraseCounts,
                     WeightedEditDistance editDistance,
                     int maxResultsPerPrefix,
                     int maxSearchQueueSize,
                     double minScore)
Construct an automatic completer from the specified phrases, phrase counts, edit distance, and search parameters.

Parameters:
phraseCounts - Map from phrases to counts.
editDistance - Distance used to compare mismatched suggestions.
maxResultsPerPrefix - The maximum number of results that can be returned.
maxSearchQueueSize - The beam size for searching for matches.
minScore - Minimum score for outcome to be retained in results.
Throws:
IllegalArgumentException - If any of the counts is not finite or negative, or if the max results or max queue sizes are not positive, or if the minimum score is not finite and negative.
Method Detail

maxResultsPerPrefix

public int maxResultsPerPrefix()
Returns the maximum number of results returned by this auto completer for each input prefix.

Returns:
The maximum number of results returned.

editDistance

public WeightedEditDistance editDistance()
Returns the weighted edit distance for this auto-completer.


phraseCountMap

public Map<String,Float> phraseCountMap()
Returns the phrase counter for this auto completer. The result will not be identical to the phrase counter used to construct this map because this method reconstitutes the map with Float values calculated from compiled probability estimates. Changes to the returned count map will not affect this class.

Returns:
The phrase counter for this auto completer.

maxSearchQueueSize

public int maxSearchQueueSize()
Returns the maximum number of elements on the search queue. This number is the size of the beam. The value may be set using setMaxSearchQueueSize(int).

Returns:
The maximum search queue size.

setMaxSearchQueueSize

public void setMaxSearchQueueSize(int size)
Sets the maximum search queue size to the specified value. Larger values may produce more accurate search, but may take longer to perform completions.

This operation is thread safe because integer sets are atomic. But changes may not be visible to other threads if not synchronized.

Parameters:
size - The new search queue size.
Throws:
IllegalArgumentException - If the size is zero or negative.

complete

public SortedSet<ScoredObject<String>> complete(String in)
Returns a set of scored phrases sorted into decreasing order of score. The scores are determined as described in the class documentation above.

To print out all the matches in descending order of scores, use:

for (ScoredObject<String> so : complete(String))
      println("phrase=" + so.getObject() + " score=" + so.score());
 

Parameters:
in - The string to complete.
Returns:
The best scoring completions of the string.