com.aliasi.tokenizer
Class NGramTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.NGramTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Compilable, Serializable

public class NGramTokenizerFactory
extends Object
implements TokenizerFactory, Serializable, Compilable

An NGramTokenizerFactory creates n-gram tokenizers with a specified minimum and maximum n-gram length.

An NGramTokenizer returns the character n-grams of a specified character sequence whose lengths fall between a minimum and a maximum, inclusive. Whitespace handling is the default behavior of Tokenizer.nextWhitespace(), which returns a string consisting of a single space character.

For example, the result of

new NGramTokenizer("abcd".toCharArray(),0,4,2,3).tokenize()
is the string array:
{ "ab", "bc", "cd", "abc", "bcd" }
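The ordering in this example (all n-grams of the shortest length first, left to right, then the next length) can be reproduced with a short standalone sketch. This is not LingPipe's implementation; the class NGramSketch and the method ngrams are hypothetical names introduced here for illustration, and the sketch assumes the enumeration order shown in the example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of LingPipe.
class NGramSketch {

    // Lists the character n-grams of cs[start .. start+length),
    // shortest length first, left to right within each length.
    // Assumption of this sketch: lengths below 1 are skipped.
    static List<String> ngrams(char[] cs, int start, int length,
                               int minNGram, int maxNGram) {
        List<String> result = new ArrayList<>();
        for (int n = Math.max(1, minNGram); n <= maxNGram && n <= length; n++) {
            // Slide a window of width n across the selected slice.
            for (int i = start; i + n <= start + length; i++) {
                result.add(new String(cs, i, n));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [ab, bc, cd, abc, bcd]
        System.out.println(ngrams("abcd".toCharArray(), 0, 4, 2, 3));
    }
}
```

The start and length arguments select a slice of the array, so characters outside that slice never appear in any n-gram.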

Thread Safety

N-gram tokenizer factories are completely thread safe.

Serialization

N-gram tokenizer factories are serializable.

Since:
LingPipe 1.0
Version:
3.8
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
NGramTokenizerFactory(int minNGram, int maxNGram)
          Create an n-gram tokenizer factory with the specified minimum and maximum n-gram lengths.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Deprecated. Use the Serializable interface instead.
 int maxNGram()
          Returns the maximum length of the n-grams produced by tokenizers created by this factory.
 int minNGram()
          Returns the minimum length of the n-grams produced by tokenizers created by this factory.
 Tokenizer tokenizer(char[] cs, int start, int length)
          Returns an n-gram tokenizer for the specified characters with the minimum and maximum n-gram lengths as specified in the constructor.
 String toString()
          Returns a description of this n-gram tokenizer factory, including minimum and maximum token lengths.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

NGramTokenizerFactory

public NGramTokenizerFactory(int minNGram,
                             int maxNGram)
Create an n-gram tokenizer factory with the specified minimum and maximum n-gram lengths.

Parameters:
minNGram - Minimum n-gram length.
maxNGram - Maximum n-gram length.
Throws:
IllegalArgumentException - If the minimum is greater than the maximum or if the maximum is less than one.
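
The documented argument check can be illustrated with a minimal standalone sketch. It is an assumption about how such validation might look, not the library's source; NGramBoundsSketch is a hypothetical class that mirrors only the two conditions listed under Throws.

```java
// Hypothetical illustration class; not part of LingPipe.
class NGramBoundsSketch {
    final int minNGram;
    final int maxNGram;

    NGramBoundsSketch(int minNGram, int maxNGram) {
        // Documented condition: the maximum must be at least one.
        if (maxNGram < 1) {
            throw new IllegalArgumentException(
                "Maximum n-gram length must be at least 1; found " + maxNGram);
        }
        // Documented condition: the minimum must not exceed the maximum.
        if (minNGram > maxNGram) {
            throw new IllegalArgumentException(
                "Minimum n-gram length " + minNGram
                + " is greater than maximum " + maxNGram);
        }
        this.minNGram = minNGram;
        this.maxNGram = maxNGram;
    }
}
```

Under these two checks, new NGramBoundsSketch(2, 3) succeeds, while new NGramBoundsSketch(4, 3) and new NGramBoundsSketch(0, 0) both throw IllegalArgumentException.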
Method Detail

compileTo

@Deprecated
public void compileTo(ObjectOutput objOut)
               throws IOException
Deprecated. Use the Serializable interface instead.

Compiles this n-gram tokenizer factory to the specified object output stream.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Output stream to which to write the tokenizer factory.
Throws:
IOException - If there is an exception writing the parameters.

minNGram

public int minNGram()
Returns the minimum length of the n-grams produced by tokenizers created by this factory.

Returns:
The minimum n-gram length.

maxNGram

public int maxNGram()
Returns the maximum length of the n-grams produced by tokenizers created by this factory.

Returns:
The maximum n-gram length.

tokenizer

public Tokenizer tokenizer(char[] cs,
                           int start,
                           int length)
Returns an n-gram tokenizer for the specified characters with the minimum and maximum n-gram lengths as specified in the constructor.

Specified by:
tokenizer in interface TokenizerFactory
Parameters:
cs - Underlying character array.
start - Index of first character in array to tokenize.
length - Number of characters to tokenize.

toString

public String toString()
Returns a description of this n-gram tokenizer factory, including minimum and maximum token lengths.

Overrides:
toString in class Object
Returns:
A description of this n-gram tokenizer factory.