com.aliasi.spell
Class EditDistance

java.lang.Object
  extended by com.aliasi.spell.EditDistance
All Implemented Interfaces:
Distance<CharSequence>, Proximity<CharSequence>

public class EditDistance
extends Object
implements Distance<CharSequence>, Proximity<CharSequence>

The EditDistance class implements the standard notion of edit distance, with or without transposition. The distance without transposition is known as Levenshtein distance, and with transposition as Damerau-Levenstein distance (see below about distance-like metric properties).

The edit distance between two strings is defined to be the minimum number of non-matching substring edits that is required to turn one string into the other. The available edits and their corresponding input and output substrings are:

Operation Input Output Notation
Match"a""a"m(a)
Insert"""a"i(a)
Delete"a"""d(a)
Substitute"a""b"s(b,a)
Transpose"ab""ba"t(ab)
Examples of minimal edit sequences are as follows, with distances as indicated
Input Output Edit Sequence (w. Transp) Dist (w. Transp)
ititm(i) m(t)0
gagegaugem(g) m(a) i(u) m(g) m(e)1
thhethem(t) d(h) m(h) m(e)1
tensertensorm(t) m(e) m(n) m(s) s(o,e) m(r)1
htethed(h) m(t) i(h) m(e) [t(ht) m(e)]2 [1]
htnethend(h) m(t) s(h,n) m(e) i(n) [t(ht) t(ne)]3 [2]
pwn4gownages(o,p) m(w) m(n) s(a,4) m(g) i(e)3
Note that, in general, there will be more than one way to edit a string into another string. For instance, a delete and insert may replace a transposition, so that "hte" becomes "the" through edits "d(h) m(t) i(h) m(e)" or "i(t) m(h) d(t) m(e)", as well as many many more, such as "d(h) d(t) d(e) i(t) i(h) i(e)".

Distance as Metric

Edit distance without transposition defines a proper metric. Recall that a distance measure d(x,y) forms a metric if for all x, y, z, we have (1) d(x,y) >= 0, (2) d(x,y) = d(y,x), (3) d(x,x) = 0, and (4) d(x,y) + d(y,z) >= d(x,z). All of these properties are easy to verify. But with transposition, we have strings such as AB, BA and ACB. With transposition, d(AB,BA)=1, d(BA,BCA)=1, but d(AB,BCA)= 3 >= d(AB,BA) + d(BA,BCA) = 1 + 1 = 2.

Implementation Note: This class implements edit distance using dynamic programming in time O(n*m) where n and m are the length of the sequences being compared. Using a sliding window of three lattice slices rather than allocating the entire lattice at once, the space required is that for three arrays of integers as long as the shorter of the two character sequences being compared. For details, see section 12.1.1 of:

Since:
LingPipe2.0
Version:
3.0
Author:
Bob Carpenter

Field Summary
static Distance<CharSequence> NON_TRANSPOSING
          Edit distance disallowing transposition.
static Distance<CharSequence> TRANSPOSING
          Edit distance allowing transposition.
 
Constructor Summary
EditDistance(boolean allowTransposition)
          Construct an edit distance with or without transposition based on the specified flag.
 
Method Summary
 double distance(CharSequence cSeq1, CharSequence cSeq2)
          Returns the edit distance between the specified character sequences.
static int editDistance(CharSequence cSeq1, CharSequence cSeq2, boolean allowTransposition)
          Returns the edit distance between the character sequences with or without transpositions as specified.
 double proximity(CharSequence cSeq1, CharSequence cSeq2)
          Returns the proximity between the character sequences.
 String toString()
          Returns a string representation of this edit distance.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

TRANSPOSING

public static final Distance<CharSequence> TRANSPOSING
Edit distance allowing transposition. The implementation is thread safe and may be accessed concurrently.


NON_TRANSPOSING

public static final Distance<CharSequence> NON_TRANSPOSING
Edit distance disallowing transposition. The implementation is thread safe and may be accessed concurrently.

Constructor Detail

EditDistance

public EditDistance(boolean allowTransposition)
Construct an edit distance with or without transposition based on the specified flag.

Parameters:
allowTransposition - Set to true to allow transposition edits in the constructed distance.
Method Detail

distance

public double distance(CharSequence cSeq1,
                       CharSequence cSeq2)
Returns the edit distance between the specified character sequences. Whether transposition is allowed or not is set at construction time. This method may be accessed concurrently without synchronization.

Specified by:
distance in interface Distance<CharSequence>
Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.
Returns:
Edit distance between the character sequences.

proximity

public double proximity(CharSequence cSeq1,
                        CharSequence cSeq2)
Returns the proximity between the character sequences. Proximity is defined as the negation of the distance:
 proximity(cs1,cs2) = -distance(cs1,cs2)
 
and thus proximities will all be negative or zero.

Specified by:
proximity in interface Proximity<CharSequence>
Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.
Returns:
Proximity between the character sequences.

toString

public String toString()
Returns a string representation of this edit distance.

Overrides:
toString in class Object
Returns:
A string representation of this edit distance.

editDistance

public static int editDistance(CharSequence cSeq1,
                               CharSequence cSeq2,
                               boolean allowTransposition)
Returns the edit distance between the character sequences with or without transpositions as specified. This distance is symmetric. This method is thread safe and may be accessed concurrently.

Parameters:
cSeq1 - First character sequence.
cSeq2 - Second character sequence.
allowTransposition - Set to true to allow transposition edits.
Returns:
Edit distance between the character sequences.