Class IndoEuropeanTokenCategorizer

java.lang.Object
  extended by com.aliasi.tokenizer.IndoEuropeanTokenCategorizer
All Implemented Interfaces:
TokenCategorizer, Compilable

public final class IndoEuropeanTokenCategorizer
extends Object
implements Compilable, TokenCategorizer

An IndoEuropeanTokenCategorizer is a generic token categorizer for Indo-European languages that is based on character "shape".

The token categories returned by categorize(String) are listed below. The categories are tested in the order listed, and the first one that matches the token is returned.

Category Description
NULL-TOK Zero-length string.
1-DIG A single digit.
2-DIG A two-digit string.
3-DIG A three-digit string.
4-DIG A four-digit string.
5+-DIG A string consisting only of digits, five or more digits long.
DIG-LET Contains digits and letters.
DIG-- Contains digits and hyphens.
DIG-/ Contains digits and slashes.
DIG-, Contains digits and commas.
DIG-. Contains digits and periods.
1-LET-UP A single uppercase letter.
1-LET-LOW A single lowercase letter.
LET-UP Uppercase letters only.
LET-LOW Lowercase letters only.
LET-CAP Uppercase letter followed by one or more lowercase letters.
LET-MIX Letters only, containing both uppercase and lowercase.
PUNC- A sequence of punctuation characters.
OTHER Anything else.
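
The first-match rule over the table above can be illustrated with a small stand-alone sketch. This is not the LingPipe source; the class name ShapeCategorizerSketch is hypothetical, and the reading of each row is an assumption: "contains" is taken literally (at least one such character), and "punctuation" is taken to mean any character that is not a letter, digit, or whitespace. The library's own tests may differ in edge cases.

```java
// Illustrative sketch only; this is not the LingPipe source code.
final class ShapeCategorizerSketch {

    private ShapeCategorizerSketch() { }

    // Returns the first category, in table order, whose test the token passes.
    static String categorize(String token) {
        int n = token.length();
        if (n == 0) return "NULL-TOK";

        // Count character classes once, then apply the tests in table order.
        int digits = 0, letters = 0, upper = 0, lower = 0, punct = 0;
        for (int i = 0; i < n; i++) {
            char c = token.charAt(i);
            if (Character.isDigit(c)) digits++;
            if (Character.isLetter(c)) letters++;
            if (Character.isUpperCase(c)) upper++;
            if (Character.isLowerCase(c)) lower++;
            // Assumption: punctuation = not a letter, digit, or whitespace.
            if (!Character.isLetterOrDigit(c) && !Character.isWhitespace(c)) punct++;
        }

        // All-digit tokens: 1-DIG through 4-DIG, then 5+-DIG.
        if (digits == n) {
            return (n <= 4) ? n + "-DIG" : "5+-DIG";
        }
        // Tokens that contain a digit plus something else.
        if (digits > 0) {
            if (letters > 0) return "DIG-LET";
            if (token.indexOf('-') >= 0) return "DIG--";
            if (token.indexOf('/') >= 0) return "DIG-/";
            if (token.indexOf(',') >= 0) return "DIG-,";
            if (token.indexOf('.') >= 0) return "DIG-.";
        }
        // Letter-only tokens, most specific shapes first.
        if (letters == n) {
            if (n == 1 && upper == 1) return "1-LET-UP";
            if (n == 1 && lower == 1) return "1-LET-LOW";
            if (upper == n) return "LET-UP";
            if (lower == n) return "LET-LOW";
            if (Character.isUpperCase(token.charAt(0)) && lower == n - 1) return "LET-CAP";
            if (upper > 0 && lower > 0) return "LET-MIX";
        }
        if (punct == n) return "PUNC-";
        return "OTHER";
    }
}
```

Because the tests run in table order, a token such as "1,000.5" is assigned DIG-, in this sketch rather than DIG-., since the comma test comes first.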

Author:
Bob Carpenter

Field Summary
static IndoEuropeanTokenCategorizer CATEGORIZER
          This is a constant Indo-European token categorizer.
Constructor Summary
IndoEuropeanTokenCategorizer()
          Deprecated. Use singleton CATEGORIZER object instead.
Method Summary
 String[] categories()
          Returns a copy of the array of strings representing all the categories produced by this categorizer.
 String categorize(String token)
          Returns the type of a token, based on its structure or other information.
 void compileTo(ObjectOutput objOut)
          Compiles this token categorizer to the specified object output.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail


public static final IndoEuropeanTokenCategorizer CATEGORIZER
This is a constant Indo-European token categorizer. Because the categorizer is thread safe, this can be used in lieu of creating a new instance with the zero-argument constructor.

Constructor Detail


public IndoEuropeanTokenCategorizer()
Deprecated. Use singleton CATEGORIZER object instead.

Construct an Indo-European token categorizer. Because the token categorizer is thread safe and every instance of it is the same, the constant CATEGORIZER may be used in place of any instance constructed through this constructor.

Method Detail


public String categorize(String token)
Returns the type of a token, based on its structure or other information. The returned type is a string that is used as a proxy for the token. Estimates are stored both for tokens and for their classes, and the class-based estimates are interpolated with the word-based estimates once the most specific matching context is found.

Specified by:
categorize in interface TokenCategorizer
Parameters:
token - Token whose class is returned.
Returns:
String representing the class of the token.


public String[] categories()
Returns a copy of the array of strings representing all the categories produced by this categorizer.

Specified by:
categories in interface TokenCategorizer
Returns:
Copy of the categories for this categorizer.
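
Returning a copy means a caller can modify the returned array without affecting later calls. A minimal sketch of that defensive-copy idiom follows; the class name CategoryListSketch is hypothetical and the category list is abbreviated, so this shows the idiom only and is not the library's implementation.

```java
// Illustrative sketch only; this is not the LingPipe source code.
final class CategoryListSketch {

    // Abbreviated, hypothetical subset of the categories for illustration.
    private static final String[] CATEGORIES = { "NULL-TOK", "1-DIG", "OTHER" };

    private CategoryListSketch() { }

    // Hands out a fresh copy so callers cannot alter the internal array.
    static String[] categories() {
        return CATEGORIES.clone();
    }
}
```

A caller that overwrites an element of the returned array changes only its own copy; the next call to categories() still returns the original values.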


public void compileTo(ObjectOutput objOut)
               throws IOException
Compiles this token categorizer to the specified object output. The categorizer read back in is reference identical to the static constant CATEGORIZER.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this categorizer is written.
Throws:
IOException - If there is an underlying I/O exception during the write.
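
This page does not describe how the compiled form is read back as the same object. One common way to get reference identity after a round trip through Java serialization is a readResolve() method that substitutes the shared constant for the freshly deserialized object. The sketch below shows that technique with plain java.io.Serializable; the class SingletonSketch and its roundTrip helper are hypothetical, and the library may use a different mechanism.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrative sketch only; this is not the LingPipe source code.
final class SingletonSketch implements Serializable {

    private static final long serialVersionUID = 1L;

    // The single shared instance, analogous in role to a CATEGORIZER constant.
    static final SingletonSketch INSTANCE = new SingletonSketch();

    private SingletonSketch() { }

    // Invoked by Java deserialization; the returned object replaces the
    // newly created one, so readers always receive the shared instance.
    private Object readResolve() {
        return INSTANCE;
    }

    // Hypothetical helper: writes an object to memory and reads it back.
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

With this arrangement, SingletonSketch.roundTrip(SingletonSketch.INSTANCE) == SingletonSketch.INSTANCE holds, which is the same kind of guarantee stated above for the compiled categorizer.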