| Interface Summary | |
|---|---|
| TokenCategorizer | A TokenCategorizer supplies a string-based category for string-based tokens. |
| TokenizerFactory | A TokenizerFactory constructs tokenizers from subsequences of character arrays. |
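
The two interfaces above can be pictured with a minimal, self-contained sketch. The names `SketchTokenizerFactory`, `SketchTokenCategorizer`, and the category labels are hypothetical stand-ins rather than this package's actual signatures; the sketch assumes only what the summary states, namely that a factory works on a slice of a character array and that a categorizer maps a token string to a category string.

```java
import java.util.ArrayList;
import java.util.List;

public class InterfaceSketch {

    // Hypothetical stand-in: produces tokens from the slice cs[start, start + length).
    interface SketchTokenizerFactory {
        List<String> tokenize(char[] cs, int start, int length);
    }

    // Hypothetical stand-in: maps a token to a string category.
    interface SketchTokenCategorizer {
        String categorize(String token);
    }

    // Illustrative factory: splits the slice on runs of whitespace.
    static final SketchTokenizerFactory WHITESPACE_SPLIT = (cs, start, length) -> {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = start; i < start + length; i++) {
            if (Character.isWhitespace(cs[i])) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(cs[i]);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    };

    // Illustrative categorizer based on character "shape"; labels are made up.
    static final SketchTokenCategorizer SHAPE = token -> {
        if (token.isEmpty()) return "EMPTY";
        if (token.chars().allMatch(Character::isDigit)) return "DIGITS";
        if (token.chars().allMatch(Character::isLowerCase)) return "LOWERCASE";
        if (Character.isUpperCase(token.charAt(0))) return "CAPITALIZED";
        return "OTHER";
    };

    public static void main(String[] args) {
        char[] cs = "Tokenizers split 42 words".toCharArray();
        for (String token : WHITESPACE_SPLIT.tokenize(cs, 0, cs.length)) {
            System.out.println(token + " -> " + SHAPE.categorize(token));
        }
    }
}
```

Passing a start offset and length, rather than a whole string, lets a caller tokenize part of a larger buffer without copying it.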
| Class Summary | |
|---|---|
| CharacterTokenCategorizer | Returns a category for tokens made up out of a single character. |
| CharacterTokenizerFactory | A CharacterTokenizerFactory considers each non-whitespace character in the input to be a distinct token. |
| EnglishStopTokenizerFactory | An EnglishStopTokenizerFactory applies an English stop list to a contained base tokenizer factory. |
| IndoEuropeanTokenCategorizer | An IndoEuropeanTokenCategorizer is a generic token categorizer for Indo-European languages that is based on character "shape". |
| IndoEuropeanTokenizerFactory | An IndoEuropeanTokenizerFactory creates tokenizers with built-in support for alpha-numerics, numbers, and other common constructs in Indo-European languages. |
| LineTokenizerFactory | A LineTokenizerFactory treats each line of an input as a token. |
| LowerCaseTokenizerFactory | A LowerCaseTokenizerFactory filters the tokenizers produced by a base tokenizer factory to produce lower case output. |
| ModifiedTokenizerFactory | A ModifiedTokenizerFactory is an abstract tokenizer factory that modifies a tokenizer returned by a base tokenizer factory. |
| ModifyTokenTokenizerFactory | The abstract base class ModifyTokenTokenizerFactory adapts token and whitespace modifiers to modify tokenizer factories. |
| NGramTokenizerFactory | An NGramTokenizerFactory creates n-gram tokenizers of a specified minimum and maximum length. |
| PorterStemmerTokenizerFactory | A PorterStemmerTokenizerFactory applies Porter's stemmer to the tokenizers produced by a base tokenizer factory. |
| RegExFilteredTokenizerFactory | A RegExFilteredTokenizerFactory modifies the tokens returned by a base tokenizer factory's tokenizer by removing those that do not match a regular expression pattern. |
| RegExTokenizerFactory | A RegExTokenizerFactory creates a tokenizer factory out of a regular expression. |
| SoundexTokenizerFactory | A SoundexTokenizerFactory modifies the output of a base tokenizer factory to produce tokens in soundex representation. |
| StopTokenizerFactory | A StopTokenizerFactory modifies a base tokenizer factory by removing tokens in a specified stop set. |
| TokenChunker | A TokenChunker provides an implementation of the Chunker interface based on an underlying tokenizer factory. |
| TokenFeatureExtractor | A TokenFeatureExtractor produces feature vectors representing token counts from character sequences. |
| Tokenization | A Tokenization represents the result of tokenizing a string. |
| Tokenizer | The abstract class Tokenizer serves as a base for tokenizer implementations, which provide streams of tokens, whitespaces, and positions. |
| TokenLengthTokenizerFactory | A TokenLengthTokenizerFactory filters the tokenizers produced by a base tokenizer to only return tokens between specified lower and upper length limits. |
| TokenNGramTokenizerFactory | A TokenNGramTokenizerFactory wraps a base tokenizer to produce token n-gram tokens of a specified size. |
| WhitespaceNormTokenizerFactory | A WhitespaceNormTokenizerFactory filters the tokenizers produced by a base tokenizer factory to convert non-empty whitespaces to a single space and leave empty (zero-length) whitespaces alone. |
Classes for tokenizing character sequences.
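
Many of the classes summarized here wrap a base tokenizer factory and modify its output (lower-casing, stop-word removal), while the n-gram factory emits character n-grams between a minimum and maximum length. The following is a rough, self-contained sketch of those ideas using plain `java.util` types. It does not use this package's classes; every name in it (`FilterChainSketch`, `lowerCase`, `stop`, `charNGrams`) is hypothetical, and the order in which n-grams are emitted is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

public class FilterChainSketch {

    // Base step (illustrative): split text on runs of whitespace.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Wraps a base tokenizing function and lower-cases every token it returns.
    static Function<String, List<String>> lowerCase(Function<String, List<String>> base) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (String t : base.apply(text)) out.add(t.toLowerCase(Locale.ENGLISH));
            return out;
        };
    }

    // Wraps a base tokenizing function and drops tokens found in the stop set.
    static Function<String, List<String>> stop(Function<String, List<String>> base,
                                               Set<String> stopSet) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (String t : base.apply(text)) {
                if (!stopSet.contains(t)) out.add(t);
            }
            return out;
        };
    }

    // Emits all character n-grams with minN <= n <= maxN, shortest lengths first.
    static List<String> charNGrams(String text, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        Set<String> stopSet = new HashSet<>(Arrays.asList("the", "a"));
        // Lower-case first, then remove stop words, so "The" is matched as "the".
        Function<String, List<String>> pipeline =
            stop(lowerCase(FilterChainSketch::splitOnWhitespace), stopSet);
        System.out.println(pipeline.apply("The cat saw a Dog"));
        System.out.println(charNGrams("abc", 1, 2));
    }
}
```

The order of wrapping matters in this sketch: applying the stop filter before lower-casing would let a capitalized stop word such as "The" pass through a lower-case stop set.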