java.lang.Object
  com.aliasi.tokenizer.TokenNGramTokenizerFactory
public class TokenNGramTokenizerFactory
A TokenNGramTokenizerFactory wraps a base tokenizer factory to
produce token n-grams within a specified range of sizes.
For example, suppose we have a regex tokenizer factory that generates tokens from contiguous non-whitespace characters. We can use it to build a token n-gram tokenizer factory that generates token bigrams and trigrams made up of the tokens from the base tokenizer.
TokenizerFactory tf
= new RegExTokenizerFactory("\\S+");
TokenizerFactory ntf
= new TokenNGramTokenizerFactory(tf,2,3);
The sequences of tokens produced by ntf for some
inputs are as follows.
The start and end positions are calculated based on the positions for the base tokens provided by the base tokenizer.
| String | Tokens |
|---|---|
| "a" | (none) |
| "a b" | "a b" |
| "a b c" | "a b", "b c", "a b c" |
| "a b c d" | "a b", "b c", "c d", "a b c", "b c d" |
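The n-gram construction can be sketched in plain Java without LingPipe on the classpath. This is a minimal illustration, not the library's implementation: the names TokenNGramSketch, Span, baseTokens and ngrams are hypothetical. Joining tokens with a single space and emitting all bigrams before trigrams follows the example outputs above. Taking the n-gram start from its first base token and its end (exclusive) from its last base token is an assumption about how the positions are derived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the behavior described above; not LingPipe code.
class TokenNGramSketch {

    /** A token plus its character offsets in the input (end is exclusive). */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /** Base tokens: maximal runs of non-whitespace, matching the pattern \S+. */
    static List<Span> baseTokens(String input) {
        List<Span> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(input);
        while (m.find()) {
            tokens.add(new Span(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    /** All n-grams of length min..max, shorter lengths first, left to right. */
    static List<Span> ngrams(List<Span> tokens, int min, int max) {
        List<Span> out = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                StringBuilder sb = new StringBuilder();
                for (int k = i; k < i + n; k++) {
                    if (k > i) sb.append(' ');
                    sb.append(tokens.get(k).text);
                }
                // Assumed position rule: first token's start, last token's end.
                out.add(new Span(sb.toString(),
                                 tokens.get(i).start,
                                 tokens.get(i + n - 1).end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "a b", "a b c", "a b c d"}) {
            System.out.println("\"" + s + "\" -> " + ngrams(baseTokens(s), 2, 3));
        }
    }
}
```

An input with fewer base tokens than the minimum length, such as "a" with a minimum of 2, yields no n-grams at all.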
| Constructor Summary | |
|---|---|
| TokenNGramTokenizerFactory(TokenizerFactory factory, int min, int max) | Construct a token n-gram tokenizer factory using the specified base factory that produces n-grams within the specified minimum and maximum length bounds. |
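The constructor detail states that the minimum must be at least 1 and the maximum must not be less than the minimum. A minimal sketch of that check follows. The class NGramBounds and its error messages are hypothetical and only illustrate the documented rule. They are not taken from the library.

```java
// Hypothetical holder for n-gram length bounds; not part of LingPipe.
final class NGramBounds {
    final int min;
    final int max;

    NGramBounds(int min, int max) {
        // Documented rule: minimum less than 1 is rejected.
        if (min < 1) {
            throw new IllegalArgumentException(
                "Minimum n-gram length must be at least 1; found min=" + min);
        }
        // Documented rule: maximum less than minimum is rejected.
        if (max < min) {
            throw new IllegalArgumentException(
                "Maximum must be at least the minimum; found min=" + min
                + " max=" + max);
        }
        this.min = min;
        this.max = max;
    }

    public static void main(String[] args) {
        NGramBounds bounds = new NGramBounds(2, 3);
        System.out.println("accepted: " + bounds.min + ".." + bounds.max);
        try {
            new NGramBounds(0, 3);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under this rule, equal bounds are valid, so passing 2 for both values would request bigrams only.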
| Method Summary | | |
|---|---|---|
| TokenizerFactory | baseTokenizerFactory() | Return the base tokenizer factory used to generate the underlying tokens from which n-grams are generated. |
| int | maxNGram() | Return the maximum n-gram length. |
| int | minNGram() | Return the minimum n-gram length. |
| Tokenizer | tokenizer(char[] cs, int start, int len) | Returns a tokenizer for the specified subsequence of characters. |
| String | toString() | |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Constructor Detail |
|---|
public TokenNGramTokenizerFactory(TokenizerFactory factory,
int min,
int max)
Parameters:
factory - Base tokenizer factory.
min - Minimum n-gram length (inclusive).
max - Maximum n-gram length (inclusive).
Throws:
IllegalArgumentException - If the minimum is less than 1 or the maximum is less than the minimum.

| Method Detail |
|---|
public int minNGram()
public int maxNGram()
public TokenizerFactory baseTokenizerFactory()
public Tokenizer tokenizer(char[] cs,
int start,
int len)
Returns a tokenizer for the specified subsequence of characters.
Specified by:
tokenizer in interface TokenizerFactory
Parameters:
cs - Characters to tokenize.
start - Index of first character to tokenize.
len - Number of characters to tokenize.

public String toString()
Overrides:
toString in class Object
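The cs, start and len parameters of tokenizer(char[], int, int) select a slice of a character array: start is the index of the first character and len is the number of characters. The sketch below shows which characters such a slice covers, using only the standard String constructor. SliceSketch and slice are hypothetical names. Whether the token positions reported by the real tokenizer are relative to start or to the whole array is not stated here, so the sketch makes no claim about that.

```java
// Hypothetical helper showing the (cs, start, len) slice convention.
final class SliceSketch {

    /** Returns the characters cs[start] .. cs[start + len - 1] as a string. */
    static String slice(char[] cs, int start, int len) {
        return new String(cs, start, len);
    }

    public static void main(String[] args) {
        char[] cs = "xx a b c yy".toCharArray();
        // Index 3 is the 'a'; five characters cover "a b c".
        System.out.println("[" + slice(cs, 3, 5) + "]");
    }
}
```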