com.aliasi.tokenizer
Class TokenNGramTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.TokenNGramTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Serializable

public class TokenNGramTokenizerFactory
extends Object
implements TokenizerFactory, Serializable

A TokenNGramTokenizerFactory wraps a base tokenizer factory to produce tokens that are n-grams of the base tokens, for every n between a specified minimum and maximum length.

For example, suppose we have a regex tokenizer factory that generates tokens based on contiguous non-whitespace characters. We can use it to build a token n-gram tokenizer factory that generates token bigrams and trigrams made up of the tokens from the base tokenizer.

 TokenizerFactory tf 
    = new RegExTokenizerFactory("\\S+");
 TokenizerFactory ntf 
    = new TokenNGramTokenizerFactory(tf,2,3);
The sequences of tokens produced by ntf for some inputs are as follows.
String       Tokens
"a"          (none)
"a b"        "a b"
"a b c"      "a b", "b c", "a b c"
"a b c d"    "a b", "b c", "c d", "a b c", "b c d"
The start and end positions are calculated based on the positions for the base tokens provided by the base tokenizer.
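
The ordering and positions described above can be imitated with a small JDK-only sketch. It is not LingPipe's implementation. The class TokenNGramSketch, the Tok record, and the methods baseTokens, nGrams, and texts are hypothetical names introduced for illustration. The sketch assumes that n-grams are emitted shortest length first, that base tokens are joined with a single space, and that an n-gram starts where its first base token starts and ends (exclusively) where its last base token ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; all names here are hypothetical, not LingPipe classes.
class TokenNGramSketch {

    // A token's text with its start (inclusive) and end (exclusive) offsets.
    static final class Tok {
        final String text;
        final int start;
        final int end;
        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
        @Override public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    // Base tokens: maximal runs of non-whitespace characters, as matched by \S+.
    static List<Tok> baseTokens(String in) {
        List<Tok> toks = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(in);
        while (m.find())
            toks.add(new Tok(m.group(), m.start(), m.end()));
        return toks;
    }

    // All n-grams for n = min..max, shorter n-grams first, left to right.
    static List<Tok> nGrams(String in, int min, int max) {
        List<Tok> base = baseTokens(in);
        List<Tok> out = new ArrayList<>();
        for (int n = min; n <= max; ++n) {
            for (int i = 0; i + n <= base.size(); ++i) {
                StringBuilder sb = new StringBuilder();
                for (int j = i; j < i + n; ++j) {
                    if (j > i) sb.append(' ');
                    sb.append(base.get(j).text);
                }
                // Assumed position rule: first token's start, last token's end.
                out.add(new Tok(sb.toString(),
                                base.get(i).start,
                                base.get(i + n - 1).end));
            }
        }
        return out;
    }

    // Convenience view of just the token texts.
    static List<String> texts(List<Tok> toks) {
        List<String> out = new ArrayList<>();
        for (Tok t : toks) out.add(t.text);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("a b c d", 2, 3));
    }
}
```

Under these assumptions, the input "a b c d" with bounds 2 and 3 yields the same five n-grams, in the same order, as the last row of the table.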

Thread Safety

A token n-gram tokenizer factory is thread safe if its embedded tokenizer factory is thread safe.

Serializability

A token n-gram tokenizer factory is serializable if its embedded tokenizer factory is serializable. The reconstituted object will be of this same class with the same parameters.
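
As a rough illustration of what "reconstituted with the same parameters" means, the following JDK-only sketch round-trips a wrapper object through standard Java serialization. BaseStandIn and WrapperStandIn are hypothetical stand-ins, not LingPipe classes, and the sketch assumes nothing about how LingPipe serializes its own factories.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrative sketch only; the stand-in classes are hypothetical.
class SerializationSketch {

    // Stand-in for a serializable base tokenizer factory.
    static class BaseStandIn implements Serializable {
        private static final long serialVersionUID = 1L;
        final String regex;
        BaseStandIn(String regex) { this.regex = regex; }
    }

    // Stand-in for a wrapper holding a base factory and n-gram bounds.
    static class WrapperStandIn implements Serializable {
        private static final long serialVersionUID = 1L;
        final BaseStandIn base;
        final int min;
        final int max;
        WrapperStandIn(BaseStandIn base, int min, int max) {
            this.base = base;
            this.min = min;
            this.max = max;
        }
    }

    // Serialize to a byte array and read the object back.
    static WrapperStandIn roundTrip(WrapperStandIn in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream objIn = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (WrapperStandIn) objIn.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        WrapperStandIn copy
            = roundTrip(new WrapperStandIn(new BaseStandIn("\\S+"), 2, 3));
        System.out.println(copy.base.regex + " " + copy.min + " " + copy.max);
    }
}
```

If BaseStandIn did not implement Serializable, writeObject would throw a NotSerializableException, which mirrors the condition stated above: the wrapper is only serializable when the embedded factory is.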

Since:
LingPipe 4.0.1
Version:
4.0.1
Author:
Bob Carpenter, Breck Baldwin
See Also:
Serialized Form

Constructor Summary
TokenNGramTokenizerFactory(TokenizerFactory factory, int min, int max)
          Construct a token n-gram tokenizer factory using the specified base factory that produces n-grams within the specified minimum and maximum length bounds.
 
Method Summary
 TokenizerFactory baseTokenizerFactory()
          Return the base tokenizer factory used to generate the underlying tokens from which n-grams are generated.
 int maxNGram()
          Return the maximum n-gram length.
 int minNGram()
          Return the minimum n-gram length.
 Tokenizer tokenizer(char[] cs, int start, int len)
          Returns a tokenizer for the specified subsequence of characters.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TokenNGramTokenizerFactory

public TokenNGramTokenizerFactory(TokenizerFactory factory,
                                  int min,
                                  int max)
Construct a token n-gram tokenizer factory using the specified base factory that produces n-grams within the specified minimum and maximum length bounds.

Parameters:
factory - Base tokenizer factory.
min - Minimum n-gram length (inclusive).
max - Maximum n-gram length (inclusive).
Throws:
IllegalArgumentException - If the minimum is less than 1 or the maximum is less than the minimum.
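
The documented bounds check can be expressed as the following sketch. NGramBoundsSketch and checkBounds are hypothetical names, and the exception messages are invented for illustration; only the two conditions come from the Throws clause above.

```java
// Illustrative sketch only; not LingPipe's actual validation code.
class NGramBoundsSketch {

    // Rejects bounds that the constructor is documented to reject.
    static void checkBounds(int min, int max) {
        if (min < 1)
            throw new IllegalArgumentException(
                "Minimum must be at least 1. Found min=" + min);
        if (max < min)
            throw new IllegalArgumentException(
                "Maximum must be at least the minimum."
                + " Found min=" + min + " max=" + max);
    }

    public static void main(String[] args) {
        checkBounds(2, 3);     // valid: bigrams and trigrams
        checkBounds(1, 1);     // valid: unigrams only
        try {
            checkBounds(3, 2); // invalid: max < min
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```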
Method Detail

minNGram

public int minNGram()
Return the minimum n-gram length.

Returns:
Minimum n-gram length.

maxNGram

public int maxNGram()
Return the maximum n-gram length.

Returns:
Maximum n-gram length.

baseTokenizerFactory

public TokenizerFactory baseTokenizerFactory()
Return the base tokenizer factory used to generate the underlying tokens from which n-grams are generated.

Returns:
Underlying tokenizer factory.

tokenizer

public Tokenizer tokenizer(char[] cs,
                           int start,
                           int len)
Description copied from interface: TokenizerFactory
Returns a tokenizer for the specified subsequence of characters.

Specified by:
tokenizer in interface TokenizerFactory
Parameters:
cs - Characters to tokenize.
start - Index of first character to tokenize.
len - Number of characters to tokenize.
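
Note that the third argument is a length, not an end index. The JDK-only sketch below shows which characters a (cs, start, len) triple selects; SliceSketch and slice are hypothetical names used only for illustration.

```java
// Illustrative sketch only; shows the (array, start, length) convention.
class SliceSketch {

    // Returns the characters cs[start] through cs[start + len - 1] as a string.
    static String slice(char[] cs, int start, int len) {
        return new String(cs, start, len);
    }

    public static void main(String[] args) {
        char[] cs = "xx a b c yy".toCharArray();
        // Start at index 3 and take 5 characters.
        System.out.println(slice(cs, 3, 5));
    }
}
```

To tokenize a whole array, pass a start of 0 and a length of cs.length.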

toString

public String toString()
Overrides:
toString in class Object