com.aliasi.spell
Class JaroWinklerDistance

java.lang.Object
  extended by com.aliasi.spell.JaroWinklerDistance
All Implemented Interfaces:
Distance<CharSequence>, Proximity<CharSequence>

public class JaroWinklerDistance
extends Object
implements Distance<CharSequence>, Proximity<CharSequence>

The JaroWinklerDistance class implements the original Jaro string comparison as well as Winkler's modifications. As a distance measure, Jaro-Winkler returns values between 0 (exact string match) and 1 (no matching characters). Note that this is reversed from the original definitions of Jaro and Winkler in order to produce a distance-like ordering. The original Jaro-Winkler string comparator returned 1 for a perfect match and 0 for complete mismatch; our method returns one minus the Jaro-Winkler measure.

The Jaro-Winkler distance measure was developed for name comparison in the U.S. Census. It is designed to compae surnames to surnames and given names to given names, not whole names to whole names. There is no character-specific information in this implementation, but assumptions are made about typical lengths and the significance of initial matches that may not apply to all languages.

The easiest way to understand the Jaro measure and the Winkler variants is procedurally. The Jaro measure involves two steps, first to compute the number of "matches" and second to compute the number of "transpositions". The Winkler adjustment involves a final rescoring based on an exact match score for the initial characters of both strings.

Formal Definition of Jaro-Winkler Distance

Suppose we are comparing character sequences cs1 and cs2. The Jaro-Winkler distance is defined by the following steps. After the definitions, we consider some examples.

Step 1: Matches: The match phase is a greedy alignment step of characters in one string against the characters in another string. The maximum distance (measured by array index) at which characters may be matched is defined by:

   matchRange = max(cs1.length(), cs2.length()) / 2 - 1

The match phase is a greedy alignment that proceeds character by character through the first string, though the distance metric is symmetric (that, is reversing the order of arguments does not affect the result). For each character encountered in the first string, it is matched to the first unaligned character in the second string that is an exact character match. If there is no such character within the match range window, the character is left unaligned.

Step 2: Transpositions: After matching, the subsequence of characters actually matched in both strings is extracted. These subsequences will be the same length. The number of characters in one string that do not line up (by index in the matched subsequence) with identical characters in the other string is the number of "half transpositions". The total number of transpoisitons is the number of half transpositions divided by two, rounding down.

The Jaro distance is then defined in terms of the number of matching characters matches and the number of transpositions, transposes:

   jaroProximity(cs1,cs2)
     = ( matches(cs1,cs2) / cs1.length()
         + matches(cs1,cs2) / cs2.length()
         + (matches(cs1,cs2) - transposes(cs1,cs2)) / matches(cs1,cs2) ) / 3

   jaroDistance(cs1,cs2) = 1 - jaroProximity(cs1,cs2)

In words, the measure is the average of three values; (a) the percentage of the first string matched, (b) the percentage of the second string matched, and (c) the percentage of matches that were not transposed.

Step 3: Winkler Modification The Winkler modification to the Jaro comparison, resulting in the Jaro-Winkler comparison, boosts scores for strings that match character for character initially. Let boostThreshold be the minimum score for a string that gets boosted. This value was set to 0.7 in Winkler's papers (see references below). If the Jaro score is below the boost threshold, the Jaro score is returned unadjusted. The second parameter for the Winkler modification is the size of the initial prefix considered, prefixSize. The prefix size was set to 4 in Winkler's papers. Next, let prefixMatch(cs1,cs2,prefixSize) be the number of characters in the prefix of cs1 and cs2 that exactly match (by original index), up to a maximum of prefixSize. The modified distance is then defined to be:

   jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize)
     = jaroMeasure(cs1,cs2) <= boostThreshold
     ? jaroMeasure(cs1,cs2)
     : jaroMeasure(cs1,cs2)
       + 0.1 * prefixMatch(cs1,cs2,prefixSize) * (1.0 - jaroDistance(cs1,cs2))

   jaroWinklerDistance(cs1,cs2,boostThreshold,prefixSize)
     = 1 - jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize)

Examples: We will present the alignment steps in the form of tables, with offsets in the second string below the first string positions that match. For a simple example, consider comparing the given (nick)name AL to itself. Both strings are of length 2. Thus the maximum match distance is max(2,2)/2 - 1 = 0, meaning all matches must be exact. The matches are illustrated in the following table:

cs1AL
matches01
cs2AL

The notation in the matches row is meant to indicate that the A at index 0 in cs1 is matched to the A at index 0 in cs2. Similarlty for the L at index 1 in cs1, which matches the L at index 1 in cs2. This results in matches(AL,AL) = 2. There are no transpositions, so the Jaro distance is just:

   jaroProximity(AL,AL) = 1/3*(2/2 + 2/2 + (2-0)/2) = 1.0

Applying the Winkler modification yields the same result:

   jaroWinklerProximity(AL,AL) = 1 +  0.1 * 2 * (1.0 - 1) = 1.0

Next consider a more complex case, matching MARTHA and MARHTA. Here the match distance is max(6,6)/2 - 1 = 6/2 - 1 = 2, allowing matching characters to be up to one character away. This yields the following alignment.

cs1 MARTHA
matches 012435
cs2 MARHTA

Note that the T at index 3 in the first string aligns with the T at index 4 in the second string, whereas the H at index 4 in the first string alings with the H at index 3 in the second string. The strings that do not directly align are rendered in bold. This is an instance of a transposition. The number of half transpositions is determined by comparing the subsequences of the first and second string matched, namely MARTHA and MARHTA. There are two positions with mismatched characters, 3 and 4. This results in two half transpositions, or a single transposition, for a Jaro distance of:

   jaroProximity(MARTHA,MARHTA) = 1/3 * (6/6 + 6/6 + (6 - 1)/6) = 0.944
Three initial characters match, MAR, for a Jaro-Winkler distance of:
   jaroWinklerProximity(MARTHA,MARHTA) = 0.944 + 0.1 * 3 * (1.0 - 0.944) = 0.961

Next, consider matching strings of different lengths, such as JONES and JOHNSON:

cs1 JONES
matches 013-5
cs2 JOHNSON

The unmatched characters are rendered in italics. Here the subsequence of matched characters for the two strings are JONS and JONS, so there are no transpositions. Thus the Jaro distance is:

   jaroProximity(JONES,JOHNSON)
     = 1/3 * (4/5 + 4/7 + (4 - 0)/4) = 0.790

The strings JONES and JOHNSON only match on their first two characters, JO, so the Jaro-Winkler distance is:

   jaroWinklerProximity(JONES,JOHNSON)
     = .790 + 0.1 * 2 * (1.0 - .790) = 0.832

We will now consider some artificial examples not drawn from (Winkler 2006). First, compare ABCVWXYZ and CABVWXYZ, which are of length 8, allowing alignments up to 8/4 - 1 = 3 positions away. This leads to the following alignment:

cs1 ABCVWXYZ
matches 12034567
cs2 CABVWXYZ

Here, there are 8/8 matches in both strings. There are only three half-transpositions, in the first three characters, because no position of CAB has an identical character to ABC. This yields a total of one transposition, for a Jaro score of:

   jaroProximity(ABCVWXYZ,CABVWXYZ)
     = 1/3 * (8/8 + 8/8 + (8-1)/8) = .958

There is no initial prefix match, so the Jaro-Winkler comparison produces the same result. Now consider matching ABCVWXYZ to CBAWXYZ. Here, the initial alignment is 2, 1, 0, which yields only two half transpositions. Thus under the Jaro distance, ABC is closer to CBA than to CAB, though due to integer rounding in computing the number of transpositions, this will only affect the final result if there is a further transposition in the strings.

Now consider the 10-character string ABCDUVWXYZ. This allows matches up to 10/2 - 1 = 4 positions away. If matched against DABCUVWXYZ, the result is 10 matches, and 4 half transposes, or 2 transposes. Now consider matching ABCDUVWXYZ against DBCAUVWXYZ. Here, index 0 in the first string (A) maps to index 3 in the second string, and index 3 in the first string (D) maps to index 0 in the second string, but positions 1 and 2 (B and C) map to themselves. Thus when comparing the output, there are only two half transpositions, thus making the second example DBCAUVWXYZ closer than DABCUVWXYZ to the first string ABCDUVWXYZ.

Note that the transposition count cannot be determined solely by the mapping. For instance, the string ABBBUVWXYZ matches BBBAUVWXYZ with alignment 4, 0, 1, 2, 5, 6, 7, 8, 9, 1. But there are only two half-transpositions, because only index 0 and index 3 mismatch in the subsequences of matching characters. Contrast this with ABCDUVWXYZ matching DABCUVWXYZ, which has the same alignment, but four half transpositions.

The greedy nature of the alignment phase in the Jaro-Winkler algorithm actually prevents the optimal alignments from being found in some cases. Consider the alignment of ABCAWXYZ with BCAWXYZ:

cs1 ABCAWXYZ
matches 201-3456
cs2 BCAWXYZ 

Here the first pair of A characters are matched, leading to three half transposes (the first three matched characters). A better scoring, though illegal, alignment would be the following, because it has the same number of matches, but no transposes:

cs1 ABCAWXYZ
matches -0123456
cs2 BCAWXYZ 

The illegal links are highlighted in yellow. Note that neither alignment matches in the initial character, so the Winkler adjustments do not apply.

Implementation Notes

This class's implementation is a literal translation of the C algorithm used in William E. Winkler's papers and for the 1995 U.S. Census Deduplication. The algorithm is the work of multiple authors and available from the folloiwng link:

Unlike the C version, the distance(CharSequence,CharSequence) and proximity(CharSequence,CharSequence) methods do not require its inputs to be padded with spaces. In addition, spaces are treated just like any other characters within the algorithm itself. There is also no case normalization in this class's version. Furthermore, the boundary conditions are changed so that two empty strings return a score of 1.0 rather than zero, as in the original algorithm.

Jaro's origial implementation is described in:

Winkler's modified algorithm, along with applications in record linkage, are described in the following highly readable survey article:

This document provides test cases in Table 6, which are the basis for the unit tests for this class (though note the three 0.0 results in the table do not agree with the return results of strcmp95.c or the results of this class, which matches strcmp95.c). The description of the matching procedure above is based on the actual strcmp95 code, the boundary conditions of which are not obvious from the text descriptions in the literature. An additional difference is that strcmp95, but not the algorithms in Winkler's papers nor the algorithm in this class, provides the possibility of partial matches with similar-sounding characters (e.g. c and k).

Acknowledgements

We'd like to thank Bill Winkler for helping us understand the versions of the algorithm and providing the strcmp95.c code as a reference implementation.

Since:
LingPipe2.4
Version:
3.0
Author:
Bob Carpenter

Field Summary
static JaroWinklerDistance JARO_DISTANCE
          A constant for the Jaro distance.
static JaroWinklerDistance JARO_WINKLER_DISTANCE
          A constant for the Jaro-Winkler distance with defaults set as in Winkler's papers.
 
Constructor Summary
JaroWinklerDistance()
          Construct a basic Jaro string distance without the Winkler modifications.
JaroWinklerDistance(double weightThreshold, int numChars)
          Construct a Winkler-modified Jaro string distance with the specified weight threshold for refinement and an initial number of characters over which to reweight.
 
Method Summary
 double distance(CharSequence cSeq1, CharSequence cSeq2)
          Returns the Jaro-Winkler distance between the specified character sequences.
 double proximity(CharSequence cSeq1, CharSequence cSeq2)
          Return the Jaro-Winkler comparison value between the specified character sequences.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

JARO_DISTANCE

public static final JaroWinklerDistance JARO_DISTANCE
A constant for the Jaro distance. The value is the same as would be returned by the nullary constructor JaroWinklerDistance().

Instances are thread safe, so this single distance instance may be used for all comparisons within an application.


JARO_WINKLER_DISTANCE

public static final JaroWinklerDistance JARO_WINKLER_DISTANCE
A constant for the Jaro-Winkler distance with defaults set as in Winkler's papers. The value is the same as would be returned by the nullary constructor JaroWinklerDistance(0.7,4).

Instances are thread safe, so this single distance instance may be used for all comparisons within an application.

Constructor Detail

JaroWinklerDistance

public JaroWinklerDistance()
Construct a basic Jaro string distance without the Winkler modifications. See the class documentation above for more information on the exact algorithm and its parameters.


JaroWinklerDistance

public JaroWinklerDistance(double weightThreshold,
                           int numChars)
Construct a Winkler-modified Jaro string distance with the specified weight threshold for refinement and an initial number of characters over which to reweight. See the class documentation above for more information on the exact algorithm and its parameters.

Method Detail

distance

public double distance(CharSequence cSeq1,
                       CharSequence cSeq2)
Returns the Jaro-Winkler distance between the specified character sequences. Teh distance is symmetric and will fall in the range 0 (perfect match) to 1 (no overlap). See the class definition above for formal definitions.

This method is defined to be:

   distance(cSeq1,cSeq2) = 1 - proximity(cSeq1,cSeq2)

Specified by:
distance in interface Distance<CharSequence>
Parameters:
cSeq1 - First character sequence to compare.
cSeq2 - Second character sequence to compare.
Returns:
The Jaro-Winkler comparison value for the two character sequences.

proximity

public double proximity(CharSequence cSeq1,
                        CharSequence cSeq2)
Return the Jaro-Winkler comparison value between the specified character sequences. The comparison is symmetric and will fall in the range 0 (no match) to 1 (perfect match)inclusive. See the class definition above for an exact definition of Jaro-Winkler string comparison.

The method distance(CharSequence,CharSequence) returns a distance measure that is one minus the comparison value.

Specified by:
proximity in interface Proximity<CharSequence>
Parameters:
cSeq1 - First character sequence to compare.
cSeq2 - Second character sequence to compare.
Returns:
The Jaro-Winkler comparison value for the two character sequences.