com.aliasi.tokenizer
Class CharacterTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.CharacterTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Serializable

public class CharacterTokenizerFactory
extends Object
implements Serializable, TokenizerFactory

A CharacterTokenizerFactory treats each non-whitespace character in the input as a distinct token. This factory is useful for languages such as Chinese, whose writing system has thousands of characters and does not separate words with whitespace, which makes tokenization difficult for standard tokenizers.
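
The behavior described above can be illustrated with a small stand-alone sketch. It is not LingPipe's implementation: the class CharTokenSketch and the method charTokens are hypothetical names, and the sketch assumes that "whitespace" means whatever Character.isWhitespace reports and that a "character" is a single Java char.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics the documented behavior using only the JDK.
class CharTokenSketch {

    // Returns one single-character token per non-whitespace char in the text.
    static List<String> charTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumption: whitespace is defined by Character.isWhitespace.
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(charTokens("ab c")); // prints [a, b, c]
        // Three Chinese characters separated by one space yield three tokens.
        System.out.println(charTokens("我 爱你").size()); // prints 3
    }
}
```

Because the sketch works on individual char values, a character outside the Basic Multilingual Plane would be split into two tokens; whether the library behaves the same way is not stated in this page.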

Thread Safety

Character tokenizer factories are completely thread safe.

Singleton

Because the tokenizer factory is thread safe and immutable, the recommended usage is through the static singleton instance INSTANCE.

Serialization and Compilation

Character tokenizer factories may be serialized. The deserialized version will be equal to the singleton INSTANCE.
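
One common way for a serializable singleton to survive a serialization round trip is a readResolve method that hands back the shared instance. The sketch below shows that general Java technique only; this page does not say how CharacterTokenizerFactory achieves it, and the class SingletonFactorySketch and its roundTrip helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stateless factory whose deserialized form is the singleton.
class SingletonFactorySketch implements Serializable {
    private static final long serialVersionUID = 1L;

    static final SingletonFactorySketch INSTANCE = new SingletonFactorySketch();

    private SingletonFactorySketch() { }

    // Called by Java serialization after reading; replaces the fresh copy
    // with the shared instance.
    private Object readResolve() {
        return INSTANCE;
    }

    // Serializes the object to an in-memory buffer and reads it back.
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Object copy = roundTrip(INSTANCE);
        System.out.println(copy == INSTANCE); // prints true
    }
}
```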

Since:
LingPipe 1.0
Version:
4.0.0
Author:
Bob Carpenter
See Also:
Serialized Form

Field Summary
static TokenizerFactory INSTANCE
          An instance of a character tokenizer factory, which may be used wherever a character tokenizer factory is needed.
 
Method Summary
 Tokenizer tokenizer(char[] ch, int start, int length)
          Returns a character tokenizer for the specified character array slice.
 String toString()
          Returns a string representation of this tokenizer factory, which is just its name.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

INSTANCE

public static final TokenizerFactory INSTANCE
An instance of a character tokenizer factory, which may be used wherever a character tokenizer factory is needed. This instance is returned by compilation.

Method Detail

tokenizer

public Tokenizer tokenizer(char[] ch,
                           int start,
                           int length)
Returns a character tokenizer for the specified character array slice.

Specified by:
tokenizer in interface TokenizerFactory
Parameters:
ch - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.
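
The start and length parameters select the slice ch[start] through ch[start + length - 1]. With the library on the classpath the call would be CharacterTokenizerFactory.INSTANCE.tokenizer(ch, start, length); the stand-alone sketch below only illustrates the slice arithmetic. The class SliceTokenSketch is hypothetical, and its bounds check is an assumption, since this page does not say how out-of-range arguments are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in showing how (ch, start, length) selects a slice.
class SliceTokenSketch {

    // Returns one token per non-whitespace char in ch[start .. start+length-1].
    static List<String> charTokens(char[] ch, int start, int length) {
        // Assumption: reject slices that fall outside the array.
        if (start < 0 || length < 0 || start > ch.length - length) {
            throw new IndexOutOfBoundsException(
                "start=" + start + " length=" + length);
        }
        List<String> tokens = new ArrayList<>();
        int end = start + length; // exclusive end index
        for (int i = start; i < end; i++) {
            if (!Character.isWhitespace(ch[i])) {
                tokens.add(String.valueOf(ch[i]));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] ch = "ab cd ef".toCharArray();
        // Slice covers indices 3..6, that is "cd e".
        System.out.println(charTokens(ch, 3, 4)); // prints [c, d, e]
    }
}
```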

toString

public String toString()
Returns a string representation of this tokenizer factory, which is just its name.

Overrides:
toString in class Object
Returns:
The string representation of this tokenizer factory.