Class SoundexFilterTokenizer

  extended by com.aliasi.tokenizer.Tokenizer
      extended by com.aliasi.tokenizer.FilterTokenizer
          extended by com.aliasi.tokenizer.SoundexFilterTokenizer
All Implemented Interfaces:

Deprecated. Use SoundexTokenizerFactory instead.

public class SoundexFilterTokenizer
extends FilterTokenizer

The SoundexFilterTokenizer replaces each token with its Soundex encoding. Soundex maps a sequence of characters to a crude four-character approximation of its pronunciation: the initial letter followed by three digits.

The process for converting an input to its Soundex representation is fairly straightforward for inputs that are all ASCII letters. Soundex is case insensitive, but is only defined for strings of ASCII letters. Thus, to begin, all characters that are not Latin1 letters are removed, and all Latin1 letters are stripped of their diacritics. The algorithm then proceeds according to its standard definition:

  1. Normalize input by removing all characters that are not Latin1 letters, and converting all other characters to uppercase ASCII after first removing any diacritics.
  2. If the input is empty, return "0000".
  3. Set the first letter of the output to the first letter of the input.
  4. While there are fewer than four characters of output and input letters remain, do:
    1. If the next letter is a vowel, unset the last letter's code.
    2. If the next letter is A, E, I, O, U, H, W, or Y, continue.
    3. If the next letter's code is equal to the previous letter's code, continue.
    4. Set the next character of output to the current letter's code.
  5. If there are fewer than four characters of output, pad the output with zeros (0).
  6. Return the output string.

The table of individual character encodings is as follows:

Characters                Code
B, F, P, V                1
C, G, J, K, Q, S, X, Z    2
D, T                      3
L                         4
M, N                      5
R                         6

Here are some examples of translations from the unit tests, drawn from the sources cited below.

Tokens                    Soundex Encoding
Robert, Rupert            R163
Euler, Ellery             E460
Gauss, Ghosh              G200
Hilbert, Heilbronn        H416
Knuth, Kant               K530
Lloyd, Liddy              L300
Lukasiewicz, Lissajous    L222
Wachs, Waugh              W200
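
The numbered steps above can be sketched in a few lines of Java. This is an illustrative reading of the description, not the LingPipe source: the class name SoundexSketch and its encode method are hypothetical. It assumes that L is coded 4 and R is coded 6, as in standard Soundex and as the examples above require. It also assumes that only A, E, I, O, and U count as vowels for step 4.1, and that Latin1 letters with no canonical decomposition (such as Æ or Ø) are simply dropped during normalization. The actual implementation may differ on these points.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Illustrative sketch only; the class and method names are hypothetical. */
class SoundexSketch {

    // Code for each letter A..Z; '0' marks the uncoded letters
    // A, E, I, O, U, H, W, Y.           ABCDEFGHIJKLMNOPQRSTUVWXYZ
    private static final String CODES = "01230120022455012623010202";

    static String encode(String token) {
        // Step 1: drop characters outside Latin1, strip diacritics,
        // uppercase, and keep only A-Z (assumption: letters that do not
        // decompose to an ASCII base letter are dropped).
        String latin1 = token.replaceAll("[^\\x00-\\xFF]", "");
        String letters = Normalizer.normalize(latin1, Normalizer.Form.NFD)
                .toUpperCase(Locale.ROOT)
                .replaceAll("[^A-Z]", "");

        // Step 2: nothing left to encode.
        if (letters.isEmpty()) {
            return "0000";
        }

        // Step 3: the first letter is copied through unchanged.
        StringBuilder out = new StringBuilder(4);
        char first = letters.charAt(0);
        out.append(first);
        char lastCode = CODES.charAt(first - 'A');

        // Step 4: encode remaining letters until four characters exist.
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            char code = CODES.charAt(c - 'A');
            if ("AEIOU".indexOf(c) >= 0) {
                lastCode = '0';      // 4.1: a vowel unsets the last code
            }
            if (code == '0') {
                continue;            // 4.2: uncoded letters emit nothing
            }
            if (code == lastCode) {
                continue;            // 4.3: same code as previous letter
            }
            out.append(code);        // 4.4: emit the letter's code
            lastCode = code;
        }

        // Step 5: pad with zeros to four characters.
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();       // Step 6
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert")); // prints R163
        System.out.println(encode("Wachs"));  // prints W200
        System.out.println(encode(","));      // prints 0000
    }
}
```

Because H and W are skipped without unsetting the last code, the C and S in Wachs collapse into a single 2, which is why the table lists W200 for it.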

As a tokenizer filter, the SoundexFilterTokenizer simply replaces each token with its Soundex equivalent. Note that this may produce a large number of 0000 outputs if it is fed standard text containing punctuation, numbers, etc., because any token that contains no letters encodes to 0000.

Note: To keep the tokenizer filter deterministic, names with prefixes are coded with the prefix included. Recall that Soundex treats the following words as prefixes, and suggests providing both the encoding computed with the prefix and the encoding computed without it:

 Van, Con, De, Di, La, Le

These are not accorded any special treatment by this implementation.
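
A caller who wants both encodings could generate the two name variants before encoding them. The following is a minimal sketch of that idea, not part of this class or of LingPipe: the class name PrefixVariants and its variants method are hypothetical. It assumes that a prefix is only recognized when it is followed by whitespace or an uppercase letter (as in "Van Deusen" or "VanDeusen"), a heuristic chosen here so that names like "Dean" are left alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of LingPipe. */
class PrefixVariants {

    // The prefixes listed in the note above.
    private static final String[] PREFIXES = {"Van", "Con", "De", "Di", "La", "Le"};

    /**
     * Returns the name itself, followed by the name with one leading
     * prefix removed when such a prefix is recognized.
     */
    static List<String> variants(String name) {
        List<String> result = new ArrayList<>();
        result.add(name);
        for (String prefix : PREFIXES) {
            int n = prefix.length();
            // Case-insensitive match of the prefix at the start of the name.
            if (name.length() > n && name.regionMatches(true, 0, prefix, 0, n)) {
                char next = name.charAt(n);
                // Assumption: a real prefix is followed by a space or a capital.
                if (Character.isWhitespace(next) || Character.isUpperCase(next)) {
                    String rest = name.substring(n).trim();
                    if (!rest.isEmpty()) {
                        result.add(rest);
                    }
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(variants("VanDeusen")); // prints [VanDeusen, Deusen]
    }
}
```

Each returned variant can then be passed to the Soundex encoder separately, which gives the with-prefix and without-prefix codes that the Soundex convention suggests.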

References and Historical Notes

Soundex was invented and patented by Robert C. Russell in 1918. The original version involved eight categories, including one for vowels, and did not treat the initial character specially for coding purposes. The original Soundex retained the first vowel. It also used some positional information, such as the deletion of final s and z.

The version in this class is the one described by Donald Knuth in The Art of Computer Programming and by the United States National Archives and Records Administration, which has been used for the United States Census.

Author: Bob Carpenter

Field Summary
Fields inherited from class com.aliasi.tokenizer.FilterTokenizer
Constructor Summary
SoundexFilterTokenizer(Tokenizer tokenizer)
          Deprecated. Use SoundexTokenizerFactory instead.
Method Summary
 String filter(String token)
          Deprecated. Returns the Soundex equivalent of the specified token.
static String soundexEncoding(String token)
          Deprecated. Use SoundexTokenizerFactory.soundexEncoding(String) instead.
Methods inherited from class com.aliasi.tokenizer.FilterTokenizer
baseTokenizer, lastTokenEndPosition, lastTokenStartPosition, nextToken, nextWhitespace, setTokenizer, toString
Methods inherited from class com.aliasi.tokenizer.Tokenizer
iterator, tokenize, tokenize
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Constructor Detail


public SoundexFilterTokenizer(Tokenizer tokenizer)
Deprecated. Use SoundexTokenizerFactory instead.

Constructs a Soundex filter for the specified tokenizer.

Parameters:
tokenizer - Tokenizer to filter.
Method Detail


public String filter(String token)
Returns the Soundex equivalent of the specified token. This method simply calls the static method soundexEncoding(String) on the specified token.

Parameters:
token - Token to be converted to Soundex.
Returns:
The Soundex representation of the specified token.


public static String soundexEncoding(String token)
Deprecated. Use SoundexTokenizerFactory.soundexEncoding(String) instead.

Returns the Soundex encoding of the specified token.

Parameters:
token - Token to be encoded.
Returns:
The Soundex encoding of the specified token.