Class IndoEuropeanTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.IndoEuropeanTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Compilable, Serializable

public class IndoEuropeanTokenizerFactory
extends Object
implements Compilable, TokenizerFactory, Serializable

An IndoEuropeanTokenizerFactory creates tokenizers with built-in support for alpha-numerics, numbers, and other common constructs in Indo-European languages.

The tokenization rules are roughly based on those used in MUC-6, but are necessarily finer grained, because the MUC tokenizers relied on lexical and semantic information, such as whether a string was an abbreviation.

A token is any sequence of characters satisfying one of the following patterns.

Pattern          Description
AlphaNumeric     Any sequence of uppercase or lowercase letters or digits, as defined by Character.isDigit(char) and Character.isLetter(char), and including the Devanagari characters (Unicode U+0900 to U+097F)
Numerical        Any sequence of numbers, commas, and periods
Hyphen Sequence  Any number of hyphens (-)
Equals Sequence  Any number of equals signs (=)
Double Quotes    Double forward quotes (``) or double backward quotes ('')

Whitespace is defined as any sequence of whitespace characters, including the Unicode non-breaking space (U+00A0, decimal 160). The tokenizer operates in a longest-leftmost fashion, returning the longest possible token starting at the current position in the underlying character array.
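
The longest-leftmost rule can be illustrated with a small standalone sketch. This is not the LingPipe implementation: the class LongestLeftmostSketch and all of its methods are hypothetical, and it makes two simplifying assumptions that the description above does not state, namely that a Numerical token starts with a digit and admits a comma or period only when a digit follows, and that any character matching no pattern becomes a one-character token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of longest-leftmost matching; not LingPipe code.
public class LongestLeftmostSketch {

    // Character.isWhitespace excludes the non-breaking space, so add it explicitly.
    static boolean isSpace(char c) {
        return Character.isWhitespace(c) || c == '\u00A0';
    }

    // Letters, digits, and the Devanagari block U+0900 to U+097F.
    static boolean isAlphaNumeric(char c) {
        return Character.isLetter(c) || Character.isDigit(c)
            || (c >= '\u0900' && c <= '\u097F');
    }

    // Length of the run of characters equal to c that starts at pos.
    static int runLength(String s, int pos, char c) {
        int end = pos;
        while (end < s.length() && s.charAt(end) == c) ++end;
        return end - pos;
    }

    static int alphaNumericLength(String s, int pos) {
        int end = pos;
        while (end < s.length() && isAlphaNumeric(s.charAt(end))) ++end;
        return end - pos;
    }

    // Assumption: starts with a digit; '.' or ',' is kept only if a digit follows.
    static int numericalLength(String s, int pos) {
        if (pos >= s.length() || !Character.isDigit(s.charAt(pos))) return 0;
        int end = pos + 1;
        while (end < s.length()) {
            char c = s.charAt(end);
            if (Character.isDigit(c)) {
                ++end;
            } else if ((c == '.' || c == ',') && end + 1 < s.length()
                       && Character.isDigit(s.charAt(end + 1))) {
                end += 2;
            } else {
                break;
            }
        }
        return end - pos;
    }

    // At each non-whitespace position, emit the longest match of any pattern.
    public static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            char c = s.charAt(pos);
            if (isSpace(c)) { ++pos; continue; }
            int len = Math.max(alphaNumericLength(s, pos), numericalLength(s, pos));
            if (c == '-' || c == '=') len = Math.max(len, runLength(s, pos, c));
            if ((c == '`' || c == '\'') && runLength(s, pos, c) >= 2) len = Math.max(len, 2);
            if (len == 0) len = 1; // assumption: unmatched characters stand alone
            tokens.add(s.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Under the assumptions above this prints: [$, 1,000.50, --, ``, ok, '']
        System.out.println(tokenize("$1,000.50 -- ``ok''"));
    }
}
```

In the sketch, "1,000.50" comes out as one token because the Numerical match (8 characters) is longer than the AlphaNumeric match (1 character) at the same starting position.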

Thread Safety

The Indo-European tokenizer factory is completely thread safe.


All instances of Indo-European tokenizer factories behave the same way. Because they are thread safe, use the singleton INSTANCE instead of constructing a fresh instance.

Serialization and Compilation

A serialized instance of this class deserializes to a new instance, whereas a compiled instance reads back as the factory instance.

Author:
Bob Carpenter
See Also:
Serialized Form

Field Summary
static TokenizerFactory FACTORY
          Deprecated. Use INSTANCE instead.
static IndoEuropeanTokenizerFactory INSTANCE
          The singleton instance of an Indo-European tokenizer factory.
Constructor Summary
IndoEuropeanTokenizerFactory()
          Deprecated. Use singleton instance INSTANCE instead.
Method Summary
 void compileTo(ObjectOutput objOut)
          Deprecated. Use the Serializable interface instead.
 Tokenizer tokenizer(char[] ch, int start, int length)
          Returns an Indo-European tokenizer for the specified subsequence of characters.
 String toString()
          Returns the name of this class.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Detail


public static final IndoEuropeanTokenizerFactory INSTANCE
The singleton instance of an Indo-European tokenizer factory.


public static final TokenizerFactory FACTORY
Deprecated. Use INSTANCE instead.
An instance of an Indo-European tokenizer factory. This is the same instance as provided by INSTANCE.

Constructor Detail


public IndoEuropeanTokenizerFactory()
Deprecated. Use singleton instance INSTANCE instead.

Constructs a tokenizer factory for Indo-European languages.

Implementation Note: All Indo-European tokenizer factories behave the same way and are thread safe, so the singleton INSTANCE (the same object as the deprecated constant FACTORY) may be used anywhere a freshly constructed Indo-European tokenizer factory would be used, without loss of performance.

Method Detail


public Tokenizer tokenizer(char[] ch,
                           int start,
                           int length)
Returns an Indo-European tokenizer for the specified subsequence of characters.

Specified by:
tokenizer in interface TokenizerFactory
Parameters:
ch - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.


public String toString()
Returns the name of this class.

Overrides:
toString in class Object
Returns:
The name of this class.


public void compileTo(ObjectOutput objOut)
               throws IOException
Deprecated. Use the Serializable interface instead.

Compiles this tokenizer factory to the specified object output. The tokenizer factory read back in is reference identical to the static constant FACTORY.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this tokenizer factory is compiled.
Throws:
IOException - If there is an I/O error during the write.
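
The idea of a written object reading back as a reference-identical shared constant can be sketched in plain Java. The example below uses the standard readResolve hook of Java serialization on a hypothetical class, DemoFactory. It is only an illustration of the general technique: this page does not say how the compiled form is implemented, and the mechanism in this class may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration; DemoFactory is not part of any library.
public class SingletonRoundTripSketch {

    // A stateless factory with a single shared instance.
    public static final class DemoFactory implements Serializable {
        private static final long serialVersionUID = 1L;
        public static final DemoFactory INSTANCE = new DemoFactory();

        private DemoFactory() { }

        // Invoked by Java serialization after the object is read;
        // the freshly read copy is discarded in favor of the shared instance.
        private Object readResolve() {
            return INSTANCE;
        }
    }

    // Writes obj to an in-memory buffer and reads it back.
    public static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object readBack = roundTrip(DemoFactory.INSTANCE);
        // Reference comparison: true because readResolve returns INSTANCE.
        System.out.println(readBack == DemoFactory.INSTANCE);
    }
}
```

Without the readResolve method, the same round trip would produce a distinct DemoFactory object, which corresponds to the "new instance" behavior described for the serialized form.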