com.aliasi.spell
Class WeightedEditDistance

java.lang.Object
  extended by com.aliasi.spell.WeightedEditDistance
All Implemented Interfaces:
Distance<CharSequence>, Proximity<CharSequence>
Direct Known Subclasses:
FixedWeightEditDistance

public abstract class WeightedEditDistance
extends Object
implements Distance<CharSequence>, Proximity<CharSequence>

The WeightedEditDistance class implements both the proximity and distance interfaces based on the negative proximity weights assigned to independent atomic edit operations.

Weights Scaled as Log Probability

Weights on edit operations are scaled as log probabilities. Practically speaking, this means that the larger the weight, the more likely the edit operation; keep in mind that -1 is larger than -3, representing 2-1 = 1/2 and 2-3 = 1/8 respectively on a linear probability scale.

Proximity and Edit Sequences

The log probability of a sequence of independent edits is the sum of the log probabilities of the individual edits. Proximity between strings s1 and s2 is defined as the maximum sum of edit weights over sequences of edits that convert s1 to s2.

Like the individual edit weights, proximity is scaled as a log probability of the complete edit. The larger the proximity, the closer the strings; again, keep in mind that -10 is larger than -20, representing roughly 1/1000 and 1/1,000,000 on the linear probability scale.

Distance is Negative Proximity

Distance is just negative proximity. This scales edit distances in the usual way, with distance of 3 between strings indicating they are further away from each other than strings at distance 1.25.

Relation to Simple Edit Distance

This class generalizes the behavior of the class spell.EditDistance without extending it in the inheritance sense. Weighted edit distance agrees with edit distance (up to arithmetic precision) as a distance assuming the following weights: match weight is 0, substitute, insert and delete weights are -1, and the transposition weight is -1 if transpositions are allowed in the edit distance and Double.NEGATIVE_INFINITY otherwise.

Symmetry

If the substitution and transposition weights are symmetric and the insert and delete costs of a character are equal, then weighted edit distance will be symmetric.

Metricity

If the match weight of all characters is zero, then the distance between a character sequence and itself will be zero.

If transpose weights are negative infinity so that transposition is not allowed, and if the assignment of substitution weights forms a metric (see Distance for a definition), and if delete and insert weights are non-negative and equal for all characters, and if match weights are all zero, then weighted edit distance will form a proper metric. Other values may also form metrics, such as a weight of -1 for all edits other than transpose.

Probabilistic Channel

A probabilistic relational model between strings is defined if the weights are properly scaled as log probabilities. Because probabilities are between 0 and 1, log probabilities will be between negative infinity and zero. Proximity between two strings in and out is defined by:

 proximity(in,out)
 = Maxedit(in)=out log2 P(edit)
 
where the cost of the edit is defined to be:
log2 P(edit)
= log2 P(edit0,...,editn-1)
~ log2 P(edit0) + ... + log P(editn-1)
The last line is an approximation assuming edits are independent.

In order to create a proper probabilistic channel, exponentiated edit weights must sum to 1.0. This is not technically possible with a local model if transposition is allowed, because of boundary conditions and independence assumptions. It is possible to define a proper channel if transposition is off, and if all edit weights for a position (including all sequences of arbitrarily long insertions) sum to 1.0. In particular, if any edits at all are allowed (have finite weights), then there must be a non-zero weight assigned to matching, otherwise exponentiated edit weight sum would exceed 1.0. It is always possible to add an offset to normalize the values to a probability model (the offset will be negative if the sum exceeds 1.0 and positive if it falls below 1.0 and zero otherwise).

A fully probabilistic model would have to take the sum over all edits rather than the maximum. This class makes the so-called Viterbi approximation, assuming the full probability is close to that of the best probability, or at least proportional to it.

Since:
LingPipe2.0
Version:
3.0
Author:
Bob Carpenter

Constructor Summary
WeightedEditDistance()
          Construct a weighted edit distance.
 
Method Summary
abstract  double deleteWeight(char cDeleted)
          Returns the weight of deleting the specified character.
 double distance(CharSequence csIn, CharSequence csOut)
          Returns the weighted edit distance between the specified character sequences.
abstract  double insertWeight(char cInserted)
          Returns the weight of inserting the specified character.
abstract  double matchWeight(char cMatched)
          Returns the weight of matching the specified character.
 double proximity(CharSequence csIn, CharSequence csOut)
          Returns the weighted proximity between the specified character sequences.
abstract  double substituteWeight(char cDeleted, char cInserted)
          Returns the weight of substituting the inserted character for the deleted character.
abstract  double transposeWeight(char cFirst, char cSecond)
          Returns the weight of transposing the specified characters.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WeightedEditDistance

public WeightedEditDistance()
Construct a weighted edit distance.

Method Detail

distance

public double distance(CharSequence csIn,
                       CharSequence csOut)
Returns the weighted edit distance between the specified character sequences. If the edit distances are interpreted as entropies, this distance may be interpreted as the entropy of the best edit path converting the input character sequence to the output sequence. The first argument is taken to be the input and the second argument the output.

This method is thread safe and may be accessed concurrently if the abstract weighting methods are thread safe.

Specified by:
distance in interface Distance<CharSequence>
Parameters:
csIn - First character sequence.
csOut - Second character sequence.
Returns:
The edit distance between the sequences.

proximity

public double proximity(CharSequence csIn,
                        CharSequence csOut)
Returns the weighted proximity between the specified character sequences. The first argument is taken to be the input and the second argument the output.

This method is thread safe and may be accessed concurrently if the abstract weighting methods are thread safe.

Specified by:
proximity in interface Proximity<CharSequence>
Parameters:
csIn - First character sequence.
csOut - Second character sequence.
Returns:
The edit distance between the sequences.

matchWeight

public abstract double matchWeight(char cMatched)
Returns the weight of matching the specified character. For most weighted edit distances, the match weight is zero so that identical strings are total distance zero apart.

All weights should be less than or equal to zero, with heavier weights being larger absolute valued negatives. Basically, the weights may be treated as unscaled log probabilities. Thus valid values will range between 0.0 (probablity 1) and Double.NEGATIVE_INFINITY (probability 0). See the class documentation above for more information.

Parameters:
cMatched - Character matched.
Returns:
Weight of matching character.

deleteWeight

public abstract double deleteWeight(char cDeleted)
Returns the weight of deleting the specified character.

All weights should be less than or equal to zero, with heavier weights being larger absolute valued negatives. Basically, the weights may be treated as unscaled log probabilities. Thus valid values will range between 0.0 (probablity 1) and Double.NEGATIVE_INFINITY (probability 0). See the class documentation above for more information.

Parameters:
cDeleted - Character deleted.
Returns:
Weight of deleting character.

insertWeight

public abstract double insertWeight(char cInserted)
Returns the weight of inserting the specified character.

All weights should be less than or equal to zero, with heavier weights being larger absolute valued negatives. Basically, the weights may be treated as unscaled log probabilities. Thus valid values will range between 0.0 (probablity 1) and Double.NEGATIVE_INFINITY (probability 0). See the class documentation above for more information.

Parameters:
cInserted - Character inserted.
Returns:
Weight of inserting character.

substituteWeight

public abstract double substituteWeight(char cDeleted,
                                        char cInserted)
Returns the weight of substituting the inserted character for the deleted character.

All weights should be less than or equal to zero, with heavier weights being larger absolute valued negatives. Basically, the weights may be treated as unscaled log probabilities. Thus valid values will range between 0.0 (probablity 1) and Double.NEGATIVE_INFINITY (probability 0). See the class documentation above for more information.

Parameters:
cDeleted - Deleted character.
cInserted - Inserted character.
Returns:
The weight of substituting the inserted character for the deleted character.

transposeWeight

public abstract double transposeWeight(char cFirst,
                                       char cSecond)
Returns the weight of transposing the specified characters. Note that the order of arguments follows that of the input.

All weights should be less than or equal to zero, with heavier weights being larger absolute valued negatives. Basically, the weights may be treated as unscaled log probabilities. Thus valid values will range between 0.0 (probablity 1) and Double.NEGATIVE_INFINITY (probability 0). See the class documentation above for more information.

Parameters:
cFirst - First character in input.
cSecond - Second character in input.
Returns:
The weight of transposing the specified characters.