com.aliasi.tokenizer
Class CharacterTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.CharacterTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Compilable, Serializable

public class CharacterTokenizerFactory
extends Object
implements Compilable, Serializable, TokenizerFactory

A CharacterTokenizerFactory treats each non-whitespace character in the input as a distinct token. This factory is useful for languages such as Chinese, which has thousands of characters and presents a difficult tokenization problem for standard tokenizers.
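
The per-character rule described above can be illustrated with a small plain-Java sketch. It does not use LingPipe, and the class name CharTokenSketch and its tokenize method are hypothetical. It also assumes that "whitespace" means whatever Character.isWhitespace(char) accepts, which this page does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not LingPipe code: illustrates the per-character rule only.
final class CharTokenSketch {

    // Returns one single-character token for each non-whitespace char in the input.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumption: whitespace is defined by Character.isWhitespace(char).
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ab c")); // prints "[a, b, c]"
    }
}
```

Under the same rule, a two-character Chinese string followed by a space and "ab" would produce four tokens, one per character, with the space dropped.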

Thread Safety

Character tokenizer factories are completely thread safe.

Singleton

Because the tokenizer factory is thread safe and immutable, the recommended usage is through the static singleton instance INSTANCE.

Serialization and Compilation

Character tokenizer factories may be serialized. The deserialized version will be equal to the singleton INSTANCE.
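
One common way for a Java class to deserialize back to a singleton is a readResolve method. The sketch below shows that idiom on a hypothetical class, SingletonFactorySketch, with a hypothetical roundTrip helper. Whether CharacterTokenizerFactory uses this exact mechanism is an assumption; this page states only the resulting behavior.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical class, not LingPipe code: a serializable singleton via readResolve.
final class SingletonFactorySketch implements Serializable {
    private static final long serialVersionUID = 1L;

    static final SingletonFactorySketch INSTANCE = new SingletonFactorySketch();

    private SingletonFactorySketch() { }

    // Invoked by Java serialization after reading; replaces the fresh copy with INSTANCE.
    private Object readResolve() {
        return INSTANCE;
    }

    // Serializes the object to an in-memory buffer and reads it back.
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(INSTANCE) == INSTANCE); // prints "true"
    }
}
```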

Since:
LingPipe 1.0
Version:
3.8
Author:
Bob Carpenter
See Also:
Serialized Form

Field Summary
static TokenizerFactory FACTORY
          Deprecated. Use INSTANCE instead.
static TokenizerFactory INSTANCE
          An instance of a character tokenizer factory, which may be used wherever a character tokenizer factory is needed.
 
Constructor Summary
CharacterTokenizerFactory()
          Deprecated. Use singleton instance INSTANCE instead.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Deprecated. Use Serializable interface instead.
 Tokenizer tokenizer(char[] ch, int start, int length)
          Returns a character tokenizer for the specified character array slice.
 String toString()
          Returns a string representation of this tokenizer factory, which is just its name.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

INSTANCE

public static final TokenizerFactory INSTANCE
An instance of a character tokenizer factory, which may be used wherever a character tokenizer factory is needed. This instance is returned by compilation.


FACTORY

@Deprecated
public static final TokenizerFactory FACTORY
Deprecated. Use INSTANCE instead.
This constant refers to the same factory as INSTANCE.

Constructor Detail

CharacterTokenizerFactory

@Deprecated
public CharacterTokenizerFactory()
Deprecated. Use singleton instance INSTANCE instead.

Construct a character tokenizer factory.

Implementation Note: All character tokenizer factories behave the same way, and they are thread safe, so the constant INSTANCE may be used anywhere a freshly constructed character tokenizer factory is used, without loss of performance.

Method Detail

tokenizer

public Tokenizer tokenizer(char[] ch,
                           int start,
                           int length)
Returns a character tokenizer for the specified character array slice.

Specified by:
tokenizer in interface TokenizerFactory
Parameters:
ch - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.
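
To make the slice parameters concrete, here is a minimal plain-Java sketch of a tokenizer that reads only ch[start] through ch[start + length - 1]. The class SliceTokenizerSketch is hypothetical and is not the Tokenizer returned by this method. The argument check and the convention of returning null when the slice is exhausted are assumptions made for illustration.

```java
// Hypothetical stand-in, not LingPipe's Tokenizer: walks a char[] slice one token at a time.
final class SliceTokenizerSketch {
    private final char[] ch;
    private final int end; // exclusive end index of the slice
    private int pos;       // next index to examine

    SliceTokenizerSketch(char[] ch, int start, int length) {
        // Assumption: out-of-range slices are rejected up front.
        if (start < 0 || length < 0 || length > ch.length - start) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.ch = ch;
        this.pos = start;
        this.end = start + length;
    }

    // Returns the next single-character token, or null once the slice is exhausted.
    String nextToken() {
        while (pos < end) {
            char c = ch[pos++];
            if (!Character.isWhitespace(c)) {
                return String.valueOf(c);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        char[] cs = "xx a b yy".toCharArray();
        // The slice covers indices 2..6, that is " a b ".
        SliceTokenizerSketch t = new SliceTokenizerSketch(cs, 2, 5);
        for (String tok = t.nextToken(); tok != null; tok = t.nextToken()) {
            System.out.println(tok); // prints "a" then "b"
        }
    }
}
```

Characters outside the slice ("xx" and "yy" here) are never examined, so a caller can tokenize part of a larger buffer without copying it.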

toString

public String toString()
Returns a string representation of this tokenizer factory, which is just its name.

Overrides:
toString in class Object
Returns:
The string representation of this tokenizer factory.

compileTo

@Deprecated
public void compileTo(ObjectOutput objOut)
               throws IOException
Deprecated. Use Serializable interface instead.

Compiles this tokenizer factory to the specified object output. The tokenizer factory read back in is reference identical to the static constant INSTANCE, which is the same object as the deprecated constant FACTORY.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this tokenizer factory is compiled.
Throws:
IOException - If there is an I/O error during the write.