com.aliasi.tokenizer
Class IndoEuropeanTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.IndoEuropeanTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Serializable

public class IndoEuropeanTokenizerFactory
extends Object
implements TokenizerFactory, Serializable

An IndoEuropeanTokenizerFactory creates tokenizers with built-in support for alpha-numerics, numbers, and other common constructs in Indo-European languages.

The tokenization rules are roughly based on those used in MUC-6, but are necessarily finer grained, because the MUC tokenizers were based on lexical and semantic information such as whether a string was an abbreviation.

A token is any sequence of characters satisfying one of the following patterns.

Pattern - Description
AlphaNumeric - Any sequence of uppercase or lowercase letters or digits, as defined by Character.isDigit(char) and Character.isLetter(char), and including the Devanagari characters (Unicode 0x0900 to 0x097F).
Numerical - Any sequence of digits, commas, and periods.
Hyphen Sequence - Any number of hyphens (-).
Equals Sequence - Any number of equals signs (=).
Double Quotes - Double forward quotes (``) or double backward quotes ('').

Whitespace is defined as any sequence of whitespace characters, including the Unicode non-breaking space (code point 160). The tokenizer operates in a longest-leftmost fashion, returning the longest possible token starting at the current position in the underlying character array.
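
The longest-leftmost rule can be illustrated with a small standalone sketch. This is not the LingPipe implementation; it is a minimal approximation of the pattern table above using only the JDK. The class name LongestLeftmostSketch and its methods are hypothetical, and two behaviors are assumptions not stated in this documentation: a Numerical token must begin with a digit, and any character matching no pattern becomes a one-character token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration only; not part of LingPipe.
class LongestLeftmostSketch {

    // Whitespace as described above: Java whitespace plus code point 160.
    static boolean isSpace(char c) {
        return Character.isWhitespace(c) || c == 160;
    }

    // Letters, digits, and the Devanagari range 0x0900 to 0x097F.
    static boolean isAlphaNum(char c) {
        return Character.isLetter(c) || Character.isDigit(c)
            || (c >= 0x0900 && c <= 0x097F);
    }

    // Length of the run of characters accepted by p, starting at pos.
    static int run(char[] ch, int pos, int end, IntPredicate p) {
        int i = pos;
        while (i < end && p.test(ch[i])) ++i;
        return i - pos;
    }

    static List<String> tokenize(char[] ch, int start, int length) {
        int end = start + length;
        List<String> tokens = new ArrayList<>();
        int pos = start;
        while (pos < end) {
            if (isSpace(ch[pos])) { ++pos; continue; }
            // Assumption: a character matching no pattern is a 1-char token.
            int best = 1;
            best = Math.max(best, run(ch, pos, end, c -> isAlphaNum((char) c)));
            // Assumption: numerical tokens must start with a digit.
            if (Character.isDigit(ch[pos])) {
                best = Math.max(best, run(ch, pos, end,
                    c -> Character.isDigit((char) c) || c == ',' || c == '.'));
            }
            best = Math.max(best, run(ch, pos, end, c -> c == '-'));
            best = Math.max(best, run(ch, pos, end, c -> c == '='));
            // Double quotes: exactly two backticks or two apostrophes.
            if (pos + 1 < end && ch[pos] == ch[pos + 1]
                    && (ch[pos] == '`' || ch[pos] == '\'')) {
                best = Math.max(best, 2);
            }
            // Emit the longest candidate found at this position.
            tokens.add(new String(ch, pos, best));
            pos += best;
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] cs = "It cost 1,250.50 -- roughly ``cheap''.".toCharArray();
        System.out.println(tokenize(cs, 0, cs.length));
    }
}
```

At the position of "1,250.50" the AlphaNumeric pattern matches only "1", while the Numerical pattern matches all eight characters, so the longer match is emitted. The library's real behavior at edge cases (for example a number followed directly by a sentence-final period) may differ from this sketch.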

Thread Safety

The Indo-European tokenizer factory is completely thread safe.
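
As a rough illustration of what sharing a thread-safe factory looks like in practice, the sketch below has several worker threads call one shared factory instance. It uses only the JDK. WhitespaceFactory is a hypothetical stand-in, not a LingPipe class, and it merely splits on whitespace. The point is the usage pattern: the factory holds no mutable state, and each call hands back a fresh tokenizer object that is used by a single thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of LingPipe.
class SharedFactorySketch {

    // Stand-in factory with no fields, so there is no shared state to race on.
    static final class WhitespaceFactory {
        static final WhitespaceFactory INSTANCE = new WhitespaceFactory();

        // Each call returns a fresh iterator that owns its own position.
        Iterator<String> tokenizer(char[] ch, int start, int length) {
            String s = new String(ch, start, length).trim();
            List<String> toks = s.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(s.split("\\s+"));
            return toks.iterator();
        }
    }

    // Counts tokens in each text on a worker thread, all sharing INSTANCE.
    static List<Integer> countInParallel(List<String> texts) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String text : texts) {
                futures.add(pool.submit(() -> {
                    char[] cs = text.toCharArray();
                    Iterator<String> it =
                        WhitespaceFactory.INSTANCE.tokenizer(cs, 0, cs.length);
                    int n = 0;
                    while (it.hasNext()) { it.next(); ++n; }
                    return n;
                }));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> f : futures) counts.add(f.get());
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countInParallel(Arrays.asList("a b c", "", "one  two")));
    }
}
```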

Singleton versus Construction

All instances of Indo-European tokenizer factories behave the same way. Because they are thread safe, prefer the singleton INSTANCE. A public constructor is also provided, but constructing a new instance offers no benefit over using INSTANCE.

Serialization

The serialized versions of this class deserialize to the same singleton as produced by INSTANCE.
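
One common way to get this behavior in Java is a readResolve() method that replaces the deserialized object with the singleton. The sketch below shows that idiom with JDK classes only. SingletonFactorySketch and roundTrip are hypothetical names, and this documentation does not say which mechanism the class actually uses, so treat it as an illustration of the guarantee rather than of the implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical illustration only; not part of LingPipe.
class SingletonFactorySketch implements Serializable {

    private static final long serialVersionUID = 1L;

    static final SingletonFactorySketch INSTANCE = new SingletonFactorySketch();

    private SingletonFactorySketch() { }

    // Invoked by Java serialization after the object is read;
    // the returned singleton replaces the freshly read copy.
    private Object readResolve() {
        return INSTANCE;
    }

    // Serializes an object to bytes in memory and reads it back.
    static Object roundTrip(Object in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream oin = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return oin.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Reference equality, not just equals(): the same object comes back.
        System.out.println(roundTrip(INSTANCE) == INSTANCE);
    }
}
```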

Since:
LingPipe 1.0
Version:
4.0.0
Author:
Bob Carpenter
See Also:
Serialized Form

Field Summary
static IndoEuropeanTokenizerFactory INSTANCE
          The singleton instance of an Indo-European tokenizer factory.
 
Constructor Summary
IndoEuropeanTokenizerFactory()
          Construct a tokenizer factory for Indo-European languages.
 
Method Summary
 Tokenizer tokenizer(char[] ch, int start, int length)
          Returns an Indo-European tokenizer for the specified subsequence of characters.
 String toString()
          Returns the name of this class.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

INSTANCE

public static final IndoEuropeanTokenizerFactory INSTANCE
The singleton instance of an Indo-European tokenizer factory.

Constructor Detail

IndoEuropeanTokenizerFactory

public IndoEuropeanTokenizerFactory()
Construct a tokenizer factory for Indo-European languages.

Implementation Note: All Indo-European tokenizer factories behave the same way, and they are thread safe, so the constant INSTANCE may be used anywhere a freshly constructed Indo-European tokenizer factory would be used, without loss of performance.

Method Detail

tokenizer

public Tokenizer tokenizer(char[] ch,
                           int start,
                           int length)
Returns an Indo-European tokenizer for the specified subsequence of characters.

Specified by:
tokenizer in interface TokenizerFactory
Parameters:
ch - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.

toString

public String toString()
Returns the name of this class.

Overrides:
toString in class Object
Returns:
The name of this class.