About String Comparison
String comparison attempts to measure the similarity between strings. This is useful for applications ranging from database deduplication and record linkage to terminology extraction, spell checking, and k-nearest-neighbors classifiers.
In this tutorial, we demonstrate the ways in which string comparisons are used in LingPipe.
Distances and Proximities
There are two interfaces at the heart of comparisons
in LingPipe, util.Distance<E>
and util.Proximity<E>.
Their signatures
are both very simple. The distance interface specifies a single
method which computes a distance between two objects of type
E:
public interface Distance<E> {
public double distance(E e1, E e2);
}
The proximity interface also specifies a single method:
public interface Proximity<E> {
public double proximity(E e1, E e2);
}
String similarity can be specified in terms of distance. If two strings are more similar, the distance between them will be less. Similarity can also be specified in terms of proximity. If two strings are more similar, the proximity between them will be greater. It's always possible to convert a proximity into a distance or vice-versa by either negation (additive inverse) or inversion (multiplicative inverse).
Simple Edit Distance
There are a variety of distance measures defined in terms of edit operations of various kinds.
Damerau-Levenstein Distance
The simplest form of edit distances were introduced in the by Damerau (1964) and Levenshtein (1965). The distance between two strings is defined as the minimal number of edits required to convert one into the other.
In the simpler Levenshtein distance, the allowable edit operations are the deletion of a single character, the insertion of a single character and the substitutione of one character for another. In Damerau's version, transposition is also an allowable edit.
The distance between "Bob" and
"Bob" is zero (0), because no edits are
required to convert a string into itself. The edit distance between
strings is only zero if the strings are identical.
The distance between "Brett" and
"Brent" is one (1), because it requires a
substitution of an 'n' for a 't'. The
distance between "Brett" and Bret
is one, requiring the deletion of one of the two 't'
characters in "Brett". The sequence of
edits must be minimal, but need not be unique. Further note
that "Bret" can be converted to
"Brett" with a single insertion of
a 't' character.
The distance between "Bob" and
"bob" is also 1, as it requires the
substitution of a lowercase 'b' for its uppercase
equivalent 'B'.
Running Edit Distance Demo
The edit distance demo may be run through ant:
using the edit-distance target:
> ant edit-distance
String1 String2 Dist1 Dist2 Prox1 Prox2
hte hte 0.0 0.0 -0.0 -0.0
hte htne 1.0 1.0 -1.0 -1.0
hte thhe 2.0 2.0 -2.0 -2.0
hte the 2.0 1.0 -2.0 -1.0
hte then 3.0 2.0 -3.0 -2.0
hte The 2.0 2.0 -2.0 -2.0
hte THE 3.0 3.0 -3.0 -3.0
htne hte 1.0 1.0 -1.0 -1.0
htne htne 0.0 0.0 -0.0 -0.0
htne thhe 3.0 2.0 -3.0 -2.0
htne the 2.0 2.0 -2.0 -2.0
htne then 3.0 2.0 -3.0 -2.0
htne The 3.0 3.0 -3.0 -3.0
htne THE 4.0 4.0 -4.0 -4.0
thhe hte 2.0 2.0 -2.0 -2.0
thhe htne 3.0 2.0 -3.0 -2.0
thhe thhe 0.0 0.0 -0.0 -0.0
thhe the 1.0 1.0 -1.0 -1.0
thhe then 2.0 2.0 -2.0 -2.0
thhe The 2.0 2.0 -2.0 -2.0
thhe THE 4.0 4.0 -4.0 -4.0
the hte 2.0 1.0 -2.0 -1.0
the htne 2.0 2.0 -2.0 -2.0
the thhe 1.0 1.0 -1.0 -1.0
the the 0.0 0.0 -0.0 -0.0
the then 1.0 1.0 -1.0 -1.0
the The 1.0 1.0 -1.0 -1.0
the THE 3.0 3.0 -3.0 -3.0
then hte 3.0 2.0 -3.0 -2.0
then htne 3.0 2.0 -3.0 -2.0
then thhe 2.0 2.0 -2.0 -2.0
then the 1.0 1.0 -1.0 -1.0
then then 0.0 0.0 -0.0 -0.0
then The 2.0 2.0 -2.0 -2.0
then THE 4.0 4.0 -4.0 -4.0
The hte 2.0 2.0 -2.0 -2.0
The htne 3.0 3.0 -3.0 -3.0
The thhe 2.0 2.0 -2.0 -2.0
The the 1.0 1.0 -1.0 -1.0
The then 2.0 2.0 -2.0 -2.0
The The 0.0 0.0 -0.0 -0.0
The THE 2.0 2.0 -2.0 -2.0
THE hte 3.0 3.0 -3.0 -3.0
THE htne 4.0 4.0 -4.0 -4.0
THE thhe 4.0 4.0 -4.0 -4.0
THE the 3.0 3.0 -3.0 -3.0
THE then 4.0 4.0 -4.0 -4.0
THE The 2.0 2.0 -2.0 -2.0
THE THE 0.0 0.0 -0.0 -0.0
The demo takes its command line arguments and returns the edit distances with (Dist2) and without (Dist1) transposition, and the proximities, with (Prox2) and without (Prox1) transposition.
First note that the distance between any string itself is zero (0). Second, note that the measure is symmetric, so that the order of the string arguments doesn't matter.
Looking at "The" and
"the", we find an edit distance of 1.0,
corresponding to the substitution of a lower-case 't'
for an upper-case 'T'.
The distance with transposition is less than or equal to that
without transposition. For instance, "the" and
"hte" have a distance of one (1) with
transposition, because "the" can be converted
to "hte" by a single transposition of the first
two characters. Without transposition, two (2) edits are required;
this can either be two substitutions, or a delete and insert in either
order.
Proximity is Inverse Distance
For simple edit distance, proximities are defined by additively inverting distances with negation:
proximity(x,y) = -distance(x,y)
This means that the greater the distance, the smaller the proximity and vice-versa.
Code Walk Through
The code for the edit distance demo is trivial. First, we assign edit distances to constants, including edit distance with and without transpose, and proximity, with and without transpose:
static final Distance<CharSequence> D1
= new EditDistance(false);
static final Distance<CharSequence> D2
= new EditDistance(true);
static final Proximity<CharSequence> P1
= new EditDistance(false);
static final Proximity<CharSequence> P2
= new EditDistance(true);
As these assginments make clear, the class
spell.EditDistance implements both the
util.Distance<CharSequence> and the
util.Proximity<CharSequence> interfaces. The
boolean argument to the constructor indicates whether transposition is
considered an edit or not.
The rest of the code simply loops over the pair of strings in the input argumentws prints out the distances and proximities, with and without transpositions:
public static void main(String[] args) {
for (String s1 : args)
for (String s2 : args)
System.out.printf("%12s %12s %5.1f %5.1f %5.1f %5.1f\n",
s1,
s2,
D1.distance(s1,s2),
D2.distance(s1,s2),
P1.proximity(s1,s2),
P2.proximity(s1,s2));
}
The input strings may be changed in the ant target in build.xml.
Weighted Edit Distance
Edit distance has been generalized in many different ways. In LingPipe, we've considered simple weighted edit distance, where the cost of a substitution, insertion, deletion and transposition may vary and may depend on the actual characters being edited. This general setup subsumes the Needlman-Wunsch algorithm used in molecular biology.
Fixed Weight Edit Distance
The simplest form of weighted edit distance simply sets a constant
cost for each of the edit operations: match, substitute, insert,
delete, transpose. The LingPipe class spell.FixedWeightEditDistance
implements this approach.
Running the Demo
The demo may be invoked from the ant target fixed-edit-distance:
> ant fixed-edit-distance
match= 0.0 del=-10.0 ins=-8.0 subst=-9.0 trans=-9.0
String1 String2 Dist
hte hte -0.0
hte htne 8.0
hte thhe 17.0
hte the 9.0
hte then 17.0
hte The 18.0
hte THE 27.0
htne hte 10.0
htne htne -0.0
htne thhe 18.0
htne the 17.0
htne then 18.0
htne The 26.0
htne THE 35.0
thhe hte 17.0
thhe htne 18.0
thhe thhe -0.0
thhe the 10.0
thhe then 18.0
thhe The 17.0
thhe THE 35.0
the hte 9.0
the htne 17.0
the thhe 8.0
the the -0.0
the then 8.0
the The 9.0
the THE 27.0
then hte 19.0
then htne 18.0
then thhe 18.0
then the 10.0
then then -0.0
then The 19.0
then THE 35.0
The hte 16.0
The htne 24.0
The thhe 17.0
The the 9.0
The then 17.0
The The -0.0
The THE 18.0
THE hte 27.0
THE htne 35.0
THE thhe 35.0
THE the 27.0
THE then 35.0
THE The 18.0
THE THE -0.0
We do not print proximities this time; they're just the negations of the edit distances.
Note that in these results, because the match score is 0.0, identical strings are still at distance 0.0. With match scores greater than 0.0, matching a string against itself will result ina distance that is its length times the negative match weight.
Further note that the results are not symmetric. The "distance"
from "the" to "then" is
8.0, representing a single insertion, whereas the distance measured
the other way is 10.0.
Code Walk Through
The code for this demo is in
src/FixedEditDistanceDemo.java.
A fixed-weight edit distance is an immutable object where weights are set in
the constructor:
FixedWeightEditDistance(double matchWeight,
double deleteWeight,
double insertWeight,
double substituteWeight,
double transposeWeight);
In the demo, these values are read from the command line, parsed into doubles, and spuplied to the constructor. Otherwise, this program is just like the last one, only we do not report proximities, but only distances.
These weights should be negative. In applications like spell checking, they will be weighted as log probabilities. When used for distance, they'll be negated to give edit distances.
General Weighted Edit Distance
The class spell.WeightedEditDistance
implements a more general notion of weighted edit distance that allows the
weights to be set on a per-character basis. These weights are mediated
through abstract methods such as:
abstract double deleteWeight(char cDeleted); abstract double insertWeight(char cInserted); abstract double substituteWeight(char cDeleted, char cInserted); abstract double transposeWeight(char cFirst, char cSecond);
Simple Custom Weighted Edit Distance
We've implemented a customized edit disance as a static nested class in
src/WeightedEditDistanceDemo.java. Here's the complete code for the weighted edit distance implementation:
static class CasePunctuationDistance extends WeightedEditDistance {
public double deleteWeight(char c) {
return (Character.isLetter(c) || Character.isDigit(c))
? -1
: 0;
}
public double insertWeight(char c) {
return deleteWeight(c);
}
public double substituteWeight(char cDeleted, char cInserted) {
return (Character.toLowerCase(cDeleted) == Character.toLowerCase(cInserted))
? 0
: -1;
}
public double matchWeight(char cMatched) {
return 0;
}
public double transposeWeight(char cFirst, char cSecond) {
return Double.NEGATIVE_INFINITY;
}
}
Match weight is set to zero for all characters.
The weight for deleting alpha-numeric characters is -1, and the weight for deleting all other characters is 0. This includes space, punctuation, etc. The insertion weights are set to be equal to the deletion weights, so that insertion and deletion are symmetric.
Substitution is free (zero cost) for letters that are equal other than for case, and -1 if the letters are not the same modulo case. This actually introduces multiple paths, as substituting a character for itself has cost 0 either through a match or substitute operation. The match could actually bet set to Double.NEGATIVE_INFINITY without changing any scores here.
Finally, the transposition weight is set to negative infinity, effectively turning off transposition.
Running the Demo
The demo is run like all the others, with inputs specified as command-line arguments and outputs computing all pairwise distances.
> ant weighted-edit-distance
the the -0.0
the The -0.0
the THE -0.0
the T H E -0.0
the hte 2.0
the then 1.0
The the -0.0
The The -0.0
The THE -0.0
The T H E -0.0
The hte 2.0
The then 1.0
THE the -0.0
THE The -0.0
THE THE -0.0
THE T H E -0.0
THE hte 2.0
THE then 1.0
T H E the -0.0
T H E The -0.0
T H E THE -0.0
T H E T H E -0.0
T H E hte 2.0
T H E then 1.0
hte the 2.0
hte The 2.0
hte THE 2.0
hte T H E 2.0
hte hte -0.0
hte then 3.0
then the 1.0
then The 1.0
then THE 1.0
then T H E 1.0
then hte 3.0
then then -0.0
Note that case variation and space no longer matter.
Note that any pair of spellings of "the"
have distance zero,
including "the",
"The",
"THE", and
"T H E".
To see that transposition is indeed turned off,
consider that the distance between "the"
and "hte" is 2, because it requires
a pair of substitutions.
Weighted Edit Distance in LingPipe
Tsuruoka and Tsujii Distance
LingPipe provides a pre-specified weighted edit distance for comparing protein names, following the specification found in Tsuruoka and Tsujii (2003).
The implementation of T & T eidt distance is defined as a
constant in the approximate dictionary chunker package, namely dict.ApproxDictionaryChunker.TT_DISTANCE.
It has special weights for spaces or hyphens, digit subsitution,
capital for lowercase, and so on. Follow the link to the constant for a full
version.
Spell Checking
The spell checker uses weighted edit distances to model likelihoods of errors in formulating queries (typos and brainos). The spell checking tutorial contains a section discussing the tuning of weighted edit distances for applications.
Chinese Word Segmentation
The Chinese word segmenter is an application of spell checking, using a custom weighted edit distance in which the only allowable edit is the insertion of a space (between tokens). See the chinese tokenization tutorial for more information on the role of weighted edit distance.
Case Restoration
Spell checking may be used for restoring case to uniform case data by employing weighted edit distance that charges no penalty for converting among case variants.
Jaccard Distance
Another common method for comparing strings, which is actually much
more efficient to implement, is the so-called "Jaccard
distance". The Jaccard distance implementation in spell.JaccardDistance
operates at a token level, comparing two strings by first
tokenizing them and then dividing the number of tokens shared
by the strings by the total number of tokens.
Running the Demo
The demo may be run using the jaccard target in
ant:
> ant jaccard String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor. String2=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor. distance=0.00 proximity=1.00 String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor. String2=It is important in multicellular organisms as it helps to suppress cancer. distance=0.91 proximity=0.09 String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor. String2=The name p53 is in reference to its apparent molecular mass: it runs as a 53-kilodalton (kDa) protein on SDS-PAGE. distance=0.79 proximity=0.21 String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor. String2=The human gene that encodes for p53 is TP53. distance=0.83 proximity=0.17 String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor. String2=The gene is named TP53 after the protein it codes for (TP53 is another name for p53). distance=0.76 proximity=0.24 ... String1=The gene is named TP53 after the protein it codes for (TP53 is another name for p53). String2=The human gene that encodes for p53 is TP53. distance=0.65 proximity=0.35 ...
First note that the distance between a string and itself is zero. All distances are between zero (0) and one (1). Proximity for Jaccard distance is defined as one minus the distance, so proximity also ranges between zero and one.
To see why the distance is 0.35 for the last (and closest) example, consider the size of the set of words in both sentences divided by the total number of words in both:
size({The, gene, is, TP53, p53, for, . })
/ size({The, gene, is, named, TP53, after, the, protein, it, codes, for, (, another, name, p53, ), ., human, that, encodes})
= 7/20 = 0.35.
Code Walk Through
The code for this demo is in src/JaccardDistanceDemo.java.
It is almost trivial, and only varies from the other examples in terms
of how the distance object is set up:
TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.FACTORY; JaccardDistance jaccard = new JaccardDistance(tokenizerFactory);
The JaccardDistance class implements both the distance
and proximity interfaces, with proximity defined to be one minus
the distance.
Tuning Jaccard (and TF/IDF) Distances
Any tokenizer factory may be plugged into the Jaccard distance. As shown in the next section, a TF/IDF-based distance may use exactly the same variety of tokenizers with roughly the same effect.
Stemming, Stoplisting and Case Normalization
Recall that tokenizers may perform various filtering operations,
such as stoplisting (removing common or function words), case
normalization, and stemming. For instance, using a case normalizing
tokenizer filter would produce results in which "The"
and "the" were considered a single (matching)
token. Similarly, using a stemming filter, the tokens
"named"
and "name" would be considered equivalent.
With stoplisting, function words like "the"
and usually punctuation like "," would be
removed.
Character- and Line-Level Comparisons
By setting up a character tokenizer factory (tokenizer.CharacterTokenizer,
the comparison will be at the character level rather than the token
level (and whitespace will be ignored).
At the other extreme, a line tokenizer factory
(tokenizer.LineTokenizerFactory),
would compare
documents for the number of matching lines rather than
the number of matching words.
N-gram Comparisons
With an n-gram tokenizer factory
(tokenizer.NGramTokenizerFactory),
comparisons will be on the basis of character substrings rather
than whole words.
Jaro-Winkler Distance
There are a family of distance measures defined by the U.S. Census
Bureau for comparing single person names. The original metric was
defined by Matt Jaro and later refined by Bill Winkler. A complete
definition of the measure may be found in the javadoc for the
implementing class, spell.JaroWinklerDistance,
as well as in Winkler's
(2006) survey paper.
Running the Demo
The demo may be run from the ant target jaro-winkler:
> ant jaro-winkler
String1 String2 Dist Prox
MARTHA MARHTA 0.039 0.961
JONES JOHNSON 0.168 0.832
DUNNINGHAM CUNNINGHAM 0.067 0.933
MICHELLE MICHAEL 0.079 0.921
NICHLESON NICHULSON 0.044 0.956
MASSEY MASSIE 0.067 0.933
ABROMS ABRAMS 0.078 0.922
HARDIN MARTINEZ 0.278 0.722
ITMAN SMITH 0.533 0.467
JERALDINE GERALDINE 0.074 0.926
JULIES JULIUS 0.067 0.933
TANYA TONYA 0.120 0.880
DWAYNE DUANE 0.160 0.840
SEAN SUSAN 0.195 0.805
JON JOHN 0.067 0.933
JON JAN 0.200 0.800
The demo compares a sequence of name pairs, printing out distance and proximity information. Note that proximity is just one minus the distance, and both are scaled between zero and one.
Code Walk Through
The code is trivial. We simply use the distance constant in the Jaro-Winkler class, and then read in arguments to split and compare:
public static void main(String[] args) {
JaroWinklerDistance jaroWinkler = JaroWinklerDistance.JARO_WINKLER_DISTANCE;
for (String s : args) {
String[] pair = s.split("\\|");
String s1 = pair[0];
String s2 = pair[1];
System.out.printf("%18s %18s %5.3f %5.3f\n",
s1, s2,
jaroWinkler.distance(s1,s2),
jaroWinkler.proximity(s1,s2));
}
}
The command-line arguments are then provided as pairs of strings separated by a vertical bar, as in:
<target name="jaro-winkler"
...
<java classname="JaroWinklerDemo" ...>
<arg value="MARTHA|MARHTA"/>
<arg value="JONES|JOHNSON"/>
...
TF/IDF Distance
LingPipe implements a second kind of token-based distance
in the class spell.TfIdfDistance.
By varying tokenizers, different behaviors may be had with the same
underlying implementation.
TF/IDF distance is based on vector similarity (using the cosine measure of angular similarity) over dampened and discriminatively weighted term frequencies. The basic idea is that two strings are more similar if they contain many of the same tokens with the same relative number of occurrences of each. Tokens are weighted more heavily if they occur in few documents. See the class documentation for a full definition of TF/IDF distance.
Running the Demo
The TF/IDF distance demo may be run from ant using the tf-idf
target:
> ant tfIdf
Term Doc Freq IDF
and 1 1.95
known 1 1.95
encodes 1 1.95
gene 3 0.85
...
p53 5 0.34
Mutations 1 1.95
molecular 1 1.95
...
kDa 1 1.95
. 7 0.00
is 6 0.15
reference 1 1.95
String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
distance=0.00 proximity=1.00
String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=It is important in multicellular organisms as it helps to suppress cancer.
distance=0.95 proximity=0.05
String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=The name p53 is in reference to its apparent molecular mass: it runs as a 53-kilodalton (kDa) protein on SDS-PAGE.
distance=0.82 proximity=0.18
String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=The human gene that encodes for p53 is TP53.
distance=0.87 proximity=0.13
String1=p53, also known as protein 53 (TP53), is a transcription factor that regulates the cell cycle and hence functions as a tumor suppressor.
String2=The gene is named TP53 after the protein it codes for (TP53 is another name for p53).
distance=0.84 proximity=0.16
...
String1=The gene is named TP53 after the protein it codes for (TP53 is another name for p53).
String2=The human gene that encodes for p53 is TP53.
distance=0.60 proximity=0.40
...
The first part of the program is set up to dump out the complete
set of terms found in the set of strings being compared, followed
by their document frequency (the number of strings in which they
appear), followed by their inverse document frequency weight.
Note that the word "gene" shows up in 3 documents
and has an IDF of 0.85, whereas the word "encodes"
shows up in only one document and thus has a weight of 1.95.
Finally note that the period (.) shows up in
each string, so has an IDF weight of 0.0, effectively throwing it
out of the comparison process. In practice, IDF weighting
means that rarer words count for more in the overall comparison.
Specifically, sharing the token "kDa" counts for
more than sharing the word "is".
Following the dump of terms and IDFs is the comparison values. As with Jaccard distance, strings are distance zero from themselves, and all distances are between zero and one. Also like Jaccard distance, the proximity is defined as one minus the distance.
A further similarity with Jaccard distance is that various tokenization factories may be plugged in to form the basis of the comparison.
Code Walk Through
The code for this demo may be found in src/TfIdfDistanceDemo.java.
The code for TF/IDF distance is more involved than that for Jaccard distance, because the IDF values need to be computed from data. In the demo, the IDF values are populated with the strings being compared:
TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.FACTORY;
TfIdfDistance tfIdf = new TfIdfDistance(tokenizerFactory);
for (String s : args)
tfIdf.handle(s);
A TF/IDF distance is constructed just like a Jaccard distance.
But then it is given training instances for the IDF weighting
through the method trainIdf(String). Each
string sent to training counts as a document.
In general, the IDF values may be constructed from anywhere (for instance, by running over a large corpus of sentences for sentence-level IDF).
The next thing the code does is print out the terms, their document frequencies, and their inverse document frequency multipliers:
for (String term : tfIdf.termSet())
System.out.printf(" %18s %8d %8.2f\n",
term,
tfIdf.docFrequency(term),
tfIdf.idf(term));
The rest of the code just iterates through the arguments and prints out distances and proximities like all the other demos.
References
- Damerau, Fred J. 1964. A technique for computer detection and correction of spelling errors, Communications of the ACM 7(3):171-176.
- Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady.
- Navarro, Gonzalo. 2001. A guided tour to approximate string matching. ACM Computing Surveys 33(1):31-88.
- Needleman, Saul and Christian Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 48(3):443-53.
- Winkler, William E. 2006. Overview of Record Linkage and Current Research Directions. Statistical Research Division, U.S. Census Bureau.