java.lang.Object
  com.aliasi.tokenizer.Tokenizer
    com.aliasi.tokenizer.FilterTokenizer
      com.aliasi.tokenizer.SoundexFilterTokenizer
@Deprecated public class SoundexFilterTokenizer
A SoundexFilterTokenizer replaces each token with
its Soundex encoding. Soundex replaces sequences of characters
with a crude four-character approximation of their pronunciation:
the initial letter followed by three digits.
The process for converting an input to its Soundex representation is fairly straightforward for inputs that are all ASCII letters. Soundex is case insensitive, but is only defined for strings of ASCII letters. Thus, to begin, all characters that are not Latin1 letters are removed, and all Latin1 letters are stripped of their diacritics. The algorithm then proceeds according to its standard definition.
The table of individual character encodings is as follows:
Characters                Code
B, F, P, V                1
C, G, J, K, Q, S, X, Z    2
D, T                      3
L                         4
M, N                      5
R                         6
Here are some examples of translations from the unit tests, drawn from the sources cited below.
Tokens                    Soundex Encoding
Gutierrez                 G362
Pfister                   P236
Jackson                   J250
Tymczak                   T522
Ashcraft                  A261
Robert, Rupert            R163
Euler, Ellery             E460
Gauss, Ghosh              G200
Hilbert, Heilbronn        H416
Knuth, Kant               K530
Lloyd, Liddy              L300
Lukasiewicz, Lissajous    L222
Wachs, Waugh              W200
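The encoding rules can be sketched in a few lines of Java. This is a minimal illustration of the standard Knuth/NARA rules, not the source of this class: the class name SoundexSketch and the method name encode are hypothetical. It also makes two simplifying assumptions: diacritics are removed with java.text.Normalizer, and letters that do not decompose to an ASCII letter are dropped.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical illustration class; not part of com.aliasi.tokenizer. */
final class SoundexSketch {

    // Code digit for each letter A..Z; '0' marks uncoded letters
    // (A, E, I, O, U, Y, H, W).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String token) {
        // Assumption: strip diacritics by canonical decomposition, then
        // keep only ASCII letters and fold them to upper case.
        String letters = Normalizer.normalize(token, Normalizer.Form.NFD)
            .replaceAll("[^A-Za-z]", "")
            .toUpperCase(Locale.ROOT);
        if (letters.isEmpty()) {
            return "0000"; // nothing left to encode
        }
        StringBuilder out = new StringBuilder(4);
        out.append(letters.charAt(0)); // the initial letter is kept as is
        char last = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate equal codes
            }
            char digit = CODES.charAt(c - 'A');
            if (digit != '0' && digit != last) {
                out.append(digit); // new consonant code
            }
            last = digit; // a vowel resets the previous code to '0'
        }
        while (out.length() < 4) {
            out.append('0'); // pad short encodings with zeros
        }
        return out.toString();
    }
}
```

Under these assumptions, SoundexSketch.encode("Ashcraft") returns A261 and SoundexSketch.encode("Tymczak") returns T522, matching the table above, while a token with no letters such as "42" yields 0000.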
As a tokenizer filter, the
SoundexFilterTokenizer simply replaces each token with its Soundex equivalent. Note that this may produce very many
0000 outputs if it is fed standard text with punctuation, numbers, etc.
Note: In order to produce a deterministic tokenizer filter, names with prefixes are coded with the prefix. Recall that Soundex treats the following set of words as prefixes, and suggests providing both the Soundex encoding computed with the prefix and the one computed without it: Van, Con, De, Di, La, Le
These are not accorded any special treatment by this implementation.
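A caller who wants both encodings can generate the name variants before encoding. The following is only a sketch: the class PrefixVariants and its method variants are hypothetical, and the rule that a prefix must be followed by a space or an upper-case letter is a heuristic assumed here to avoid splitting names such as Leonard.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of com.aliasi.tokenizer. */
final class PrefixVariants {

    // Prefixes listed in the note above.
    private static final String[] PREFIXES = {"Van", "Con", "De", "Di", "La", "Le"};

    /** Returns the name itself, plus the name without its prefix if it has one. */
    static List<String> variants(String name) {
        List<String> result = new ArrayList<>();
        result.add(name);
        for (String prefix : PREFIXES) {
            if (name.length() > prefix.length() && name.startsWith(prefix)) {
                char next = name.charAt(prefix.length());
                // Assumed heuristic: treat as a prefix only when followed
                // by a space or an upper-case letter (VanDeusen, De Witt).
                if (next == ' ' || Character.isUpperCase(next)) {
                    String rest = name.substring(prefix.length()).trim();
                    if (!rest.isEmpty()) {
                        result.add(rest);
                    }
                    break; // strip at most one prefix
                }
            }
        }
        return result;
    }
}
```

Each returned variant can then be encoded separately, for example with the static soundexEncoding(String) method documented below.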
References and Historical Notes
Soundex was invented and patented by Robert C. Russell in 1918. The original version involved eight categories, including one for vowels, and did not treat the initial character specially when coding. The first vowel was retained in the original Soundex. Furthermore, some positional information was added, such as the deletion of final
The version in this class is the one described by Donald Knuth in The Art of Computer Programming and by the United States National Archives and Records Administration, whose version has been used for the United States Census.
- Knuth, D. 1973. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley. 2nd Edition. Pages 394-395.
- Wikipedia. Soundex.
- United States National Archives and Records Administration. Using the Census Soundex. General Information Leaflet 55.
- Robert C. Russell. 1918. United States Patent 1,261,167.
- Robert C. Russell. 1922. United States Patent 1,435,663.
Author: Bob Carpenter
Fields inherited from class com.aliasi.tokenizer.FilterTokenizer
Deprecated. Returns the Soundex equivalent of the specified token.
Methods inherited from class com.aliasi.tokenizer.FilterTokenizer
baseTokenizer, lastTokenEndPosition, lastTokenStartPosition, nextToken, nextWhitespace, setTokenizer, toString
Methods inherited from class com.aliasi.tokenizer.Tokenizer
iterator, tokenize, tokenize
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
SoundexFilterTokenizer
@Deprecated public SoundexFilterTokenizer(Tokenizer tokenizer)
- Deprecated. Use
- Construct a Soundex filter for the specified tokenizer.
tokenizer - Tokenizer to filter.
- Returns the Soundex equivalent of the specified token. This method simply calls the static method
soundexEncoding(String) on the specified token.
token - Token to be converted to Soundex.
- The Soundex representation of the specified token.
soundexEncoding
@Deprecated public static String soundexEncoding(String token)
- Deprecated. Use
- Returns the Soundex encoding of the specified token.
token - Token to be encoded.
- The Soundex encoding of the specified token.