About String Comparison

String comparison attempts to measure the similarity between strings. This is useful for applications ranging from database deduplication and record linkage to terminology extraction, spell checking, and k-nearest-neighbors classifiers.

In this tutorial, we demonstrate the ways in which string comparisons are used in LingPipe.

Distances and Proximities

There are two interfaces at the heart of comparisons in LingPipe, util.Distance<E> and util.Proximity<E>. Their signatures are both very simple. The distance interface specifies a single method which computes a distance between two objects of type E:

public interface Distance<E> {
    public double distance(E e1, E e2);
}

The proximity interface also specifies a single method:

public interface Proximity<E> {
    public double proximity(E e1, E e2);
}

String similarity can be specified in terms of distance. If two strings are more similar, the distance between them will be less. Similarity can also be specified in terms of proximity. If two strings are more similar, the proximity between them will be greater. It's always possible to convert a proximity into a distance or vice-versa by either negation (additive inverse) or inversion (multiplicative inverse).

Simple Edit Distance

There are a variety of distance measures defined in terms of edit operations of various kinds.

Damerau-Levenstein Distance

The simplest form of edit distances were introduced in the by Damerau (1964) and Levenshtein (1965). The distance between two strings is defined as the minimal number of edits required to convert one into the other.

In the simpler Levenshtein distance, the allowable edit operations are the deletion of a single character, the insertion of a single character and the substitutione of one character for another. In Damerau's version, transposition is also an allowable edit.

The distance between "Bob" and "Bob" is zero (0), because no edits are required to convert a string into itself. The edit distance between strings is only zero if the strings are identical.

The distance between "Brett" and "Brent" is one (1), because it requires a substitution of an 'n' for a 't'. The distance between "Brett" and Bret is one, requiring the deletion of one of the two 't' characters in "Brett". The sequence of edits must be minimal, but need not be unique. Further note that "Bret" can be converted to "Brett" with a single insertion of a 't' character.

The distance between "Bob" and "bob" is also 1, as it requires the substitution of a lowercase 'b' for its uppercase equivalent 'B'.

Running Edit Distance Demo

The edit distance demo may be run through ant: using the edit-distance target:

> ant edit-distance

 String1       String2  Dist1  Dist2   Prox1  Prox2
     hte           hte    0.0    0.0    -0.0   -0.0
     hte          htne    1.0    1.0    -1.0   -1.0
     hte          thhe    2.0    2.0    -2.0   -2.0
     hte           the    2.0    1.0    -2.0   -1.0
     hte          then    3.0    2.0    -3.0   -2.0
     hte           The    2.0    2.0    -2.0   -2.0
     hte           THE    3.0    3.0    -3.0   -3.0
    htne           hte    1.0    1.0    -1.0   -1.0
    htne          htne    0.0    0.0    -0.0   -0.0
    htne          thhe    3.0    2.0    -3.0   -2.0
    htne           the    2.0    2.0    -2.0   -2.0
    htne          then    3.0    2.0    -3.0   -2.0
    htne           The    3.0    3.0    -3.0   -3.0
    htne           THE    4.0    4.0    -4.0   -4.0
    thhe           hte    2.0    2.0    -2.0   -2.0
    thhe          htne    3.0    2.0    -3.0   -2.0
    thhe          thhe    0.0    0.0    -0.0   -0.0
    thhe           the    1.0    1.0    -1.0   -1.0
    thhe          then    2.0    2.0    -2.0   -2.0
    thhe           The    2.0    2.0    -2.0   -2.0
    thhe           THE    4.0    4.0    -4.0   -4.0
     the           hte    2.0    1.0    -2.0   -1.0
     the          htne    2.0    2.0    -2.0   -2.0
     the          thhe    1.0    1.0    -1.0   -1.0
     the           the    0.0    0.0    -0.0   -0.0
     the          then    1.0    1.0    -1.0   -1.0
     the           The    1.0    1.0    -1.0   -1.0
     the           THE    3.0    3.0    -3.0   -3.0
    then           hte    3.0    2.0    -3.0   -2.0
    then          htne    3.0    2.0    -3.0   -2.0
    then          thhe    2.0    2.0    -2.0   -2.0
    then           the    1.0    1.0    -1.0   -1.0
    then          then    0.0    0.0    -0.0   -0.0
    then           The    2.0    2.0    -2.0   -2.0
    then           THE    4.0    4.0    -4.0   -4.0
     The           hte    2.0    2.0    -2.0   -2.0
     The          htne    3.0    3.0    -3.0   -3.0
     The          thhe    2.0    2.0    -2.0   -2.0
     The           the    1.0    1.0    -1.0   -1.0
     The          then    2.0    2.0    -2.0   -2.0
     The           The    0.0    0.0    -0.0   -0.0
     The           THE    2.0    2.0    -2.0   -2.0
     THE           hte    3.0    3.0    -3.0   -3.0
     THE          htne    4.0    4.0    -4.0   -4.0
     THE          thhe    4.0    4.0    -4.0   -4.0
     THE           the    3.0    3.0    -3.0   -3.0
     THE          then    4.0    4.0    -4.0   -4.0
     THE           The    2.0    2.0    -2.0   -2.0
     THE           THE    0.0    0.0    -0.0   -0.0

The demo takes its command line arguments and returns the edit distances with (Dist2) and without (Dist1) transposition, and the proximities, with (Prox2) and without (Prox1) transposition.

First note that the distance between any string itself is zero (0). Second, note that the measure is symmetric, so that the order of the string arguments doesn't matter.

Looking at "The" and "the", we find an edit distance of 1.0, corresponding to the substitution of a lower-case 't' for an upper-case 'T'.

The distance with transposition is less than or equal to that without transposition. For instance, "the" and "hte" have a distance of one (1) with transposition, because "the" can be converted to "hte" by a single transposition of the first two characters. Without transposition, two (2) edits are required; this can either be two substitutions, or a delete and insert in either order.

Proximity is Inverse Distance

For simple edit distance, proximities are defined by additively inverting distances with negation:

proximity(x,y) = -distance(x,y)

This means that the greater the distance, the smaller the proximity and vice-versa.

Code Walk Through

The code for the edit distance demo is trivial. First, we assign edit distances to constants, including edit distance with and without transpose, and proximity, with and without transpose:

static final Distance<CharSequence> D1
    = new EditDistance(false);

static final Distance<CharSequence> D2
    = new EditDistance(true);

static final Proximity<CharSequence> P1
    = new EditDistance(false);

static final Proximity<CharSequence> P2
    = new EditDistance(true);

As these assginments make clear, the class spell.EditDistance implements both the util.Distance<CharSequence> and the util.Proximity<CharSequence> interfaces. The boolean argument to the constructor indicates whether transposition is considered an edit or not.

The rest of the code simply loops over the pair of strings in the input argumentws prints out the distances and proximities, with and without transpositions:

public static void main(String[] args) {
    for (String s1 : args)
        for (String s2 : args)
            System.out.printf("%12s  %12s  %5.1f  %5.1f   %5.1f  %5.1f\n",
                              s1,
                              s2,
                              D1.distance(s1,s2),
                              D2.distance(s1,s2),
                              P1.proximity(s1,s2),
                              P2.proximity(s1,s2));
}

The input strings may be changed in the ant target in build.xml.

Weighted Edit Distance

Edit distance has been generalized in many different ways. In LingPipe, we've considered simple weighted edit distance, where the cost of a substitution, insertion, deletion and transposition may vary and may depend on the actual characters being edited. This general setup subsumes the Needlman-Wunsch algorithm used in molecular biology.

Fixed Weight Edit Distance

The simplest form of weighted edit distance simply sets a constant cost for each of the edit operations: match, substitute, insert, delete, transpose. The LingPipe class spell.FixedWeightEditDistance implements this approach.

Running the Demo

The demo may be invoked from the ant target fixed-edit-distance:

> ant fixed-edit-distance
match= 0.0  del=-10.0  ins=-8.0  subst=-9.0  trans=-9.0

     String1       String2   Dist
         hte           hte   -0.0
         hte          htne    8.0
         hte          thhe   17.0
         hte           the    9.0
         hte          then   17.0
         hte           The   18.0
         hte           THE   27.0
        htne           hte   10.0
        htne          htne   -0.0
        htne          thhe   18.0
        htne           the   17.0
        htne          then   18.0
        htne           The   26.0
        htne           THE   35.0
        thhe           hte   17.0
        thhe          htne   18.0
        thhe          thhe   -0.0
        thhe           the   10.0
        thhe          then   18.0
        thhe           The   17.0
        thhe           THE   35.0
         the           hte    9.0
         the          htne   17.0
         the          thhe    8.0
         the           the   -0.0
         the          then    8.0
         the           The    9.0
         the           THE   27.0
        then           hte   19.0
        then          htne   18.0
        then          thhe   18.0
        then           the   10.0
        then          then   -0.0
        then           The   19.0
        then           THE   35.0
         The           hte   16.0
         The          htne   24.0
         The          thhe   17.0
         The           the    9.0
         The          then   17.0
         The           The   -0.0
         The           THE   18.0
         THE           hte   27.0
         THE          htne   35.0
         THE          thhe   35.0
         THE           the   27.0
         THE          then   35.0
         THE           The   18.0
         THE           THE   -0.0

We do not print proximities this time; they're just the negations of the edit distances.

Note that in these results, because the match score is 0.0, identical strings are still at distance 0.0. With match scores greater than 0.0, matching a string against itself will result ina distance that is its length times the negative match weight.

Further note that the results are not symmetric. The "distance" from "the" to "then" is 8.0, representing a single insertion, whereas the distance measured the other way is 10.0.

Code Walk Through

The code for this demo is in src/FixedEditDistanceDemo.java. A fixed-weight edit distance is an immutable object where weights are set in the constructor:

FixedWeightEditDistance(double matchWeight,
                        double deleteWeight,
                        double insertWeight,
                        double substituteWeight,
                        double transposeWeight);

In the demo, these values are read from the command line, parsed into doubles, and spuplied to the constructor. Otherwise, this program is just like the last one, only we do not report proximities, but only distances.

These weights should be negative. In applications like spell checking, they will be weighted as log probabilities. When used for distance, they'll be negated to give edit distances.

General Weighted Edit Distance

The class spell.WeightedEditDistance implements a more general notion of weighted edit distance that allows the weights to be set on a per-character basis. These weights are mediated through abstract methods such as:

abstract double deleteWeight(char cDeleted);
abstract double insertWeight(char cInserted);
abstract double substituteWeight(char cDeleted, char cInserted);
abstract double transposeWeight(char cFirst, char cSecond);

Simple Custom Weighted Edit Distance

We've implemented a customized edit disance as a static nested class in src/WeightedEditDistanceDemo.java. Here's the complete code for the weighted edit distance implementation:

static class CasePunctuationDistance extends WeightedEditDistance {

    public double deleteWeight(char c) {
        return (Character.isLetter(c) || Character.isDigit(c))
            ? -1
            : 0;
    }

    public double insertWeight(char c) {
        return deleteWeight(c);
    }

    public double substituteWeight(char cDeleted, char cInserted) {
        return (Character.toLowerCase(cDeleted) == Character.toLowerCase(cInserted))
            ? 0
            : -1;
    }

    public double matchWeight(char cMatched) {
        return 0;
    }

    public double transposeWeight(char cFirst, char cSecond) {
        return Double.NEGATIVE_INFINITY;
    }
}

Match weight is set to zero for all characters.

The weight for deleting alpha-numeric characters is -1, and the weight for deleting all other characters is 0. This includes space, punctuation, etc. The insertion weights are set to be equal to the deletion weights, so that insertion and deletion are symmetric.

Substitution is free (zero cost) for letters that are equal other than for case, and -1 if the letters are not the same modulo case. This actually introduces multiple paths, as substituting a character for itself has cost 0 either through a match or substitute operation. The match could actually bet set to Double.NEGATIVE_INFINITY without changing any scores here.

Finally, the transposition weight is set to negative infinity, effectively turning off transposition.

Running the Demo

The demo is run like all the others, with inputs specified as command-line arguments and outputs computing all pairwise distances.

> ant weighted-edit-distance

      the           the   -0.0
      the           The   -0.0
      the           THE   -0.0
      the         T H E   -0.0
      the           hte    2.0
      the          then    1.0
      The           the   -0.0
      The           The   -0.0
      The           THE   -0.0
      The         T H E   -0.0
      The           hte    2.0
      The          then    1.0
      THE           the   -0.0
      THE           The   -0.0
      THE           THE   -0.0
      THE         T H E   -0.0
      THE           hte    2.0
      THE          then    1.0
    T H E           the   -0.0
    T H E           The   -0.0
    T H E           THE   -0.0
    T H E         T H E   -0.0
    T H E           hte    2.0
    T H E          then    1.0
      hte           the    2.0
      hte           The    2.0
      hte           THE    2.0
      hte         T H E    2.0
      hte           hte   -0.0
      hte          then    3.0
     then           the    1.0
     then           The    1.0
     then           THE    1.0
     then         T H E    1.0
     then           hte    3.0
     then          then   -0.0

Note that case variation and space no longer matter. Note that any pair of spellings of "the" have distance zero, including "the", "The", "THE", and "T H E".

To see that transposition is indeed turned off, consider that the distance between "the" and "hte" is 2, because it requires a pair of substitutions.

Weighted Edit Distance in LingPipe

Tsuruoka and Tsujii Distance

LingPipe provides a pre-specified weighted edit distance for comparing protein names, following the specification found in Tsuruoka and Tsujii (2003).

The implementation of T & T eidt distance is defined as a constant in the approximate dictionary chunker package, namely dict.ApproxDictionaryChunker.TT_DISTANCE. It has special weights for spaces or hyphens, digit subsitution, capital for lowercase, and so on. Follow the link to the constant for a full version.

Spell Checking

The spell checker uses weighted edit distances to model likelihoods of errors in formulating queries (typos and brainos). The spell checking tutorial contains a section discussing the tuning of weighted edit distances for applications.

Chinese Word Segmentation

The Chinese word segmenter is an application of spell checking, using a custom weighted edit distance in which the only allowable edit is the insertion of a space (between tokens). See the chinese tokenization tutorial for more information on the role of weighted edit distance.

Case Restoration

Spell checking may be used for restoring case to uniform case data by employing weighted edit distance that charges no penalty for converting among case variants.

Jaccard Distance

Another common method for comparing strings, which is actually much more efficient to implement, is the so-called "Jaccard distance". The Jaccard distance implementation in spell.JaccardDistance operates at a token level, comparing two strings by first tokenizing them and then dividing the number of tokens shared by the strings by the total number of tokens.

Running the Demo

The demo may be run using the jaccard target in ant:

> ant jaccard

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
distance=0.00  proximity=1.00

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=It is important in multicellular organisms as it helps to suppress cancer.
distance=0.91  proximity=0.09

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=The name p53 is in reference to its apparent molecular mass: it runs as a 53-kilodalton (kDa) protein on SDS-PAGE.
distance=0.79  proximity=0.21

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=The human gene that encodes for p53 is TP53.
distance=0.83  proximity=0.17

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=The gene is named TP53 after the protein it codes for (TP53 is another name for p53).
distance=0.76  proximity=0.24

...


String1=The gene is named TP53 after the protein it codes for (TP53 is another name for p53).
String2=The human gene that encodes for p53 is TP53.
distance=0.65  proximity=0.35

...

First note that the distance between a string and itself is zero. All distances are between zero (0) and one (1). Proximity for Jaccard distance is defined as one minus the distance, so proximity also ranges between zero and one.

To see why the distance is 0.35 for the last (and closest) example, consider the size of the set of words in both sentences divided by the total number of words in both:

size({The, gene, is, TP53, p53, for, . })
/ size({The, gene, is, named, TP53, after, the, protein, it, codes, for, (, another, name, p53, ), ., human, that, encodes})
= 7/20 = 0.35.

Code Walk Through

The code for this demo is in src/JaccardDistanceDemo.java. It is almost trivial, and only varies from the other examples in terms of how the distance object is set up:

   TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.FACTORY;
   JaccardDistance jaccard = new JaccardDistance(tokenizerFactory);

The JaccardDistance class implements both the distance and proximity interfaces, with proximity defined to be one minus the distance.

Tuning Jaccard (and TF/IDF) Distances

Any tokenizer factory may be plugged into the Jaccard distance. As shown in the next section, a TF/IDF-based distance may use exactly the same variety of tokenizers with roughly the same effect.

Stemming, Stoplisting and Case Normalization

Recall that tokenizers may perform various filtering operations, such as stoplisting (removing common or function words), case normalization, and stemming. For instance, using a case normalizing tokenizer filter would produce results in which "The" and "the" were considered a single (matching) token. Similarly, using a stemming filter, the tokens "named" and "name" would be considered equivalent. With stoplisting, function words like "the" and usually punctuation like "," would be removed.

Character- and Line-Level Comparisons

By setting up a character tokenizer factory (tokenizer.CharacterTokenizer, the comparison will be at the character level rather than the token level (and whitespace will be ignored).

At the other extreme, a line tokenizer factory (tokenizer.LineTokenizerFactory), would compare documents for the number of matching lines rather than the number of matching words.

N-gram Comparisons

With an n-gram tokenizer factory (tokenizer.NGramTokenizerFactory), comparisons will be on the basis of character substrings rather than whole words.

Jaro-Winkler Distance

There are a family of distance measures defined by the U.S. Census Bureau for comparing single person names. The original metric was defined by Matt Jaro and later refined by Bill Winkler. A complete definition of the measure may be found in the javadoc for the implementing class, spell.JaroWinklerDistance, as well as in Winkler's (2006) survey paper.

Running the Demo

The demo may be run from the ant target jaro-winkler:

> ant jaro-winkler


          String1             String2   Dist   Prox
           MARTHA              MARHTA  0.039  0.961
            JONES             JOHNSON  0.168  0.832
       DUNNINGHAM          CUNNINGHAM  0.067  0.933
         MICHELLE             MICHAEL  0.079  0.921
        NICHLESON           NICHULSON  0.044  0.956
           MASSEY              MASSIE  0.067  0.933
           ABROMS              ABRAMS  0.078  0.922
           HARDIN            MARTINEZ  0.278  0.722
            ITMAN               SMITH  0.533  0.467
        JERALDINE           GERALDINE  0.074  0.926
           JULIES              JULIUS  0.067  0.933
            TANYA               TONYA  0.120  0.880
           DWAYNE               DUANE  0.160  0.840
             SEAN               SUSAN  0.195  0.805
              JON                JOHN  0.067  0.933
              JON                 JAN  0.200  0.800

The demo compares a sequence of name pairs, printing out distance and proximity information. Note that proximity is just one minus the distance, and both are scaled between zero and one.

Code Walk Through

The code is trivial. We simply use the distance constant in the Jaro-Winkler class, and then read in arguments to split and compare:

public static void main(String[] args) {
    JaroWinklerDistance jaroWinkler = JaroWinklerDistance.JARO_WINKLER_DISTANCE;
    for (String s : args) {
        String[] pair = s.split("\\|");
        String s1 = pair[0];
        String s2 = pair[1];
        System.out.printf("%18s  %18s  %5.3f  %5.3f\n",
                          s1, s2,
                          jaroWinkler.distance(s1,s2),
                          jaroWinkler.proximity(s1,s2));
    }
}

The command-line arguments are then provided as pairs of strings separated by a vertical bar, as in:

<target name="jaro-winkler"
  ...
  <java classname="JaroWinklerDemo" ...>
    <arg value="MARTHA|MARHTA"/>
    <arg value="JONES|JOHNSON"/>
    ...

TF/IDF Distance

LingPipe implements a second kind of token-based distance in the class spell.TfIdfDistance. By varying tokenizers, different behaviors may be had with the same underlying implementation.

TF/IDF distance is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. The basic idea is that two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents. See the class documentation for a full definition of TF/IDF distance.

Running the Demo

The TF/IDF distance demo may be run from ant using the tf-idf target:

> ant tfIdf


               Term  Doc Freq       IDF
                and         1      1.95
              known         1      1.95
            encodes         1      1.95
               gene         3      0.85
...
                p53         5      0.34
          Mutations         1      1.95
          molecular         1      1.95
...
                kDa         1      1.95
                  .         7      0.00
                 is         6      0.15
          reference         1      1.95

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
distance=0.00  proximity=1.00

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=It is important in multicellular organisms as it helps to suppress cancer.
distance=0.95  proximity=0.05

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=The name p53 is in reference to its apparent molecular mass: it runs as a 53-kilodalton (kDa) protein on SDS-PAGE.
distance=0.82  proximity=0.18

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=The human gene that encodes for p53 is TP53.
distance=0.87  proximity=0.13

String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=The gene is named TP53 after the protein it codes for (TP53 is another name for p53).
distance=0.84  proximity=0.16

...

String1=The gene is named TP53 after the protein it codes for (TP53 is another name for p53).
String2=The human gene that encodes for p53 is TP53.
distance=0.60  proximity=0.40

...

The first part of the program is set up to dump out the complete set of terms found in the set of strings being compared, followed by their document frequency (the number of strings in which they appear), followed by their inverse document frequency weight. Note that the word "gene" shows up in 3 documents and has an IDF of 0.85, whereas the word "encodes" shows up in only one document and thus has a weight of 1.95. Finally note that the period (.) shows up in each string, so has an IDF weight of 0.0, effectively throwing it out of the comparison process. In practice, IDF weighting means that rarer words count for more in the overall comparison. Specifically, sharing the token "kDa" counts for more than sharing the word "is".

Following the dump of terms and IDFs is the comparison values. As with Jaccard distance, strings are distance zero from themselves, and all distances are between zero and one. Also like Jaccard distance, the proximity is defined as one minus the distance.

A further similarity with Jaccard distance is that various tokenization factories may be plugged in to form the basis of the comparison.

Code Walk Through

The code for this demo may be found in src/TfIdfDistanceDemo.java.

The code for TF/IDF distance is more involved than that for Jaccard distance, because the IDF values need to be computed from data. In the demo, the IDF values are populated with the strings being compared:

    TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.FACTORY;
    TfIdfDistance tfIdf = new TfIdfDistance(tokenizerFactory);

    for (String s : args)
        tfIdf.handle(s);

A TF/IDF distance is constructed just like a Jaccard distance. But then it is given training instances for the IDF weighting through the method trainIdf(String). Each string sent to training counts as a document.

In general, the IDF values may be constructed from anywhere (for instance, by running over a large corpus of sentences for sentence-level IDF).

The next thing the code does is print out the terms, their document frequencies, and their inverse document frequency multipliers:

    for (String term : tfIdf.termSet())
        System.out.printf("  %18s  %8d  %8.2f\n",
                          term,
                          tfIdf.docFrequency(term),
                          tfIdf.idf(term));

The rest of the code just iterates through the arguments and prints out distances and proximities like all the other demos.

References