Class SoundexTokenizerFactory

  extended by com.aliasi.tokenizer.ModifiedTokenizerFactory
      extended by com.aliasi.tokenizer.ModifyTokenTokenizerFactory
          extended by com.aliasi.tokenizer.SoundexTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Serializable

public class SoundexTokenizerFactory
extends ModifyTokenTokenizerFactory
implements Serializable

A SoundexTokenizerFactory modifies the output of a base tokenizer factory to produce tokens in Soundex representation. Soundex replaces each token with a crude four-character approximation of its pronunciation: the initial letter followed by three digits.

Soundex Representations

The process for converting an input to its Soundex representation is fairly straightforward for inputs that are all ASCII letters. Soundex is case insensitive and is only defined for strings of ASCII letters. Thus, to begin, all characters that are not Latin1 letters are removed, and the remaining Latin1 letters are stripped of their diacritics. The algorithm then proceeds according to its standard definition:

  1. Normalize input by removing all characters that are not Latin1 letters, removing any diacritics from the remaining letters, and converting the result to uppercase ASCII.
  2. If the input is empty, return "0000".
  3. Set the first character of the output to the first letter of the input.
  4. While there are fewer than four characters of output and input letters remain, do:
    1. If the next letter is a vowel, unset the last letter's code.
    2. If the next letter is A, E, I, O, U, H, W, or Y, continue.
    3. If the next letter's code is equal to the previous letter's code, continue.
    4. Set the next character of output to the next letter's code.
  5. If there are fewer than four characters of output, pad the output with zeros (0).
  6. Return the output string.

The table of individual character encodings is as follows:

Letters                   Code
B, F, P, V                1
C, G, J, K, Q, S, X, Z    2
D, T                      3
L                         4
M, N                      5
R                         6
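
The steps and code table above can be sketched in Java roughly as follows. This is an illustrative sketch, not the source of soundexEncoding: the class name SoundexSketch and the helpers codeOf and normalize are hypothetical. It assumes that Y is grouped with the vowels, that L and R take their standard Soundex codes 4 and 6, and that diacritics can be removed with java.text.Normalizer, so Latin1 letters with no ASCII base letter (such as ß) are simply dropped.

```java
import java.text.Normalizer;

class SoundexSketch {

    // Code for an uppercase ASCII letter, following the table above.
    // '0' marks vowels (assumption: Y is grouped with them); '-' marks H and W.
    static char codeOf(char letter) {
        switch (letter) {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            case 'H': case 'W': return '-';
            default: return '0';
        }
    }

    // Step 1: keep Latin1 letters, strip diacritics, uppercase to ASCII.
    static String normalize(String token) {
        StringBuilder latin1Letters = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (c < 256 && Character.isLetter(c)) latin1Letters.append(c);
        }
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(latin1Letters, Normalizer.Form.NFD);
        StringBuilder ascii = new StringBuilder();
        for (char c : decomposed.toCharArray()) {
            char upper = Character.toUpperCase(c);
            if (upper >= 'A' && upper <= 'Z') ascii.append(upper);
        }
        return ascii.toString();
    }

    // Steps 2 through 6.
    static String encode(String token) {
        String letters = normalize(token);
        if (letters.isEmpty()) return "0000";
        StringBuilder out = new StringBuilder(4);
        out.append(letters.charAt(0));
        char last = codeOf(letters.charAt(0));
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char next = codeOf(letters.charAt(i));
            if (next == '0') { last = '0'; continue; } // vowel: unset last code
            if (next == '-') continue;                 // H, W: skip, keep last code
            if (next == last) continue;                // repeated code: skip
            out.append(next);
            last = next;
        }
        while (out.length() < 4) out.append('0');      // pad with zeros
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert"));      // R163
        System.out.println(encode("Lukasiewicz")); // L222
        System.out.println(encode("1918"));        // 0000
    }
}
```

Because H and W are skipped without unsetting the previous code, Wachs encodes to W200 rather than W220, whereas the vowels in Lukasiewicz allow the code 2 to repeat.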

Here are some examples of translations from the unit tests, drawn from the sources cited below.

Tokens                    Soundex Encoding    Notes
Robert, Rupert            R163
Euler, Ellery             E460
Gauss, Ghosh              G200
Hilbert, Heilbronn        H416
Knuth, Kant               K530
Lloyd, Liddy              L300
Lukasiewicz, Lissajous    L222
Wachs, Waugh              W200

As a tokenizer filter, the SoundexTokenizerFactory simply replaces each token with its Soundex equivalent. Note that this may produce a large number of 0000 outputs if it is fed standard text containing punctuation, numbers, and other non-letter tokens.

Note: In order to produce a deterministic tokenizer filter, names with prefixes are coded with the prefix. Recall that Soundex treats the following words as prefixes, and suggests providing both the encoding computed with the prefix and the encoding computed without it:

 Van, Con, De, Di, La, Le

These are not accorded any special treatment by this implementation.
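
The two notes above can be illustrated with a small self-contained Java sketch. It is not LingPipe code: the class SoundexFilterSketch and its methods are hypothetical, and the compact encoder is restricted to ASCII input, again assuming Y is grouped with the vowels and that L and R have the standard codes 4 and 6. The withoutPrefix helper shows the alternative encoding that Soundex suggests for prefixed names and that this class deliberately does not compute. Its prefix matching is naive and would also strip "De" from a name such as "Dean".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SoundexFilterSketch {

    // Codes for A..Z: '0' for vowels (and Y, by assumption), '-' for H and W.
    private static final String CODES = "0123012-02245501262301-202";

    private static final List<String> PREFIXES =
        Arrays.asList("Van", "Con", "De", "Di", "La", "Le");

    // Compact ASCII-only encoder; non-letters are ignored.
    static String soundex(String token) {
        StringBuilder out = new StringBuilder(4);
        char last = '0';
        for (int i = 0; i < token.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(token.charAt(i));
            if (c < 'A' || c > 'Z') continue;
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {          // first letter is kept as is
                out.append(c);
                last = code;
            } else if (code == '0') {         // vowel: unset last code
                last = '0';
            } else if (code != '-' && code != last) {
                out.append(code);
                last = code;
            }
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    // Filter behavior: every token, including punctuation, is replaced.
    static List<String> filter(List<String> tokens) {
        List<String> encoded = new ArrayList<>();
        for (String token : tokens) encoded.add(soundex(token));
        return encoded;
    }

    // Encoding with a listed prefix removed, or null if no prefix matches.
    static String withoutPrefix(String name) {
        for (String prefix : PREFIXES) {
            int n = prefix.length();
            if (name.length() > n && name.regionMatches(true, 0, prefix, 0, n)) {
                return soundex(name.substring(n));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Punctuation and numbers all collapse to 0000.
        System.out.println(filter(Arrays.asList("Robert", ",", "born", "1918", ".")));
        System.out.println(soundex("VanDeusen"));       // coded with the prefix
        System.out.println(withoutPrefix("VanDeusen")); // coded without the prefix
    }
}
```

In practice, placing a filter that drops non-alphabetic tokens ahead of the Soundex step is one way to avoid the flood of 0000 tokens.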

Thread Safety

A SoundexTokenizerFactory is thread safe if its base tokenizer factory is thread safe.

Serialization

A SoundexTokenizerFactory is serializable if its base tokenizer factory is serializable.

References and Historical Notes

Soundex was invented and patented by Robert C. Russell in 1918. The original version involved eight categories, including one for vowels, without the initial character being treated specially as to coding. The first vowel was retained in the original Soundex. Furthermore, some positional information was added, such as the deletion of final s and z.

The version in this class is the one described by Donald Knuth in The Art of Computer Programming, which is also the version described by the United States National Archives and Records Administration and used for the United States Census.

Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
SoundexTokenizerFactory(TokenizerFactory factory)
          Construct a Soundex-based tokenizer factory that converts tokens produced by the specified base factory into their Soundex representations.
Method Summary
 String modifyToken(String token)
          Returns the Soundex encoding of the specified token.
static String soundexEncoding(String token)
          Returns the Soundex encoding of the specified token.
Methods inherited from class com.aliasi.tokenizer.ModifyTokenTokenizerFactory
modify, modifyWhitespace
Methods inherited from class com.aliasi.tokenizer.ModifiedTokenizerFactory
baseTokenizerFactory, tokenizer
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


public SoundexTokenizerFactory(TokenizerFactory factory)
Construct a Soundex-based tokenizer factory that converts tokens produced by the specified base factory into their Soundex representations.

Parameters:
factory - Base tokenizer factory.
Method Detail


public String modifyToken(String token)
Returns the Soundex encoding of the specified token.

See the class documentation above for more information on the encoding.

Overrides:
modifyToken in class ModifyTokenTokenizerFactory
Parameters:
token - Input token.
Returns:
The Soundex encoding of the input token.


public static String soundexEncoding(String token)
Returns the Soundex encoding of the specified token.

Parameters:
token - Token to be encoded.
Returns:
The Soundex encoding of the specified token.