Package com.aliasi.tokenizer

Classes for tokenizing character sequences.

Interface Summary
TokenCategorizer A TokenCategorizer supplies a string-based category for string-based tokens.
TokenizerFactory A TokenizerFactory constructs tokenizers from subsequences of character arrays.
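
The two interface roles summarized above can be illustrated with a minimal, stand-alone sketch. It does not use this package: the names TokenizerSketch, Categorizer, Factory, WHITESPACE and SHAPE are hypothetical, and the method shapes are assumptions chosen for illustration, not the signatures of TokenCategorizer or TokenizerFactory. The factory works on a slice of a character array and the categorizer maps a token string to a category string.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    /** Hypothetical stand-in for a token categorizer: token in, category name out. */
    interface Categorizer {
        String categorize(String token);
    }

    /** Hypothetical stand-in for a tokenizer factory over a slice of a char array. */
    interface Factory {
        List<String> tokenize(char[] cs, int start, int length);
    }

    /** Splits the slice cs[start .. start+length) on whitespace. */
    static final Factory WHITESPACE = (cs, start, length) -> {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = start; i < start + length; i++) {
            if (Character.isWhitespace(cs[i])) {
                // Whitespace ends the token in progress, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(cs[i]);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    };

    /** Assigns a coarse category from the kinds of characters in the token. */
    static final Categorizer SHAPE = token -> {
        if (token.isEmpty()) return "EMPTY";
        if (token.chars().allMatch(Character::isDigit)) return "DIGITS";
        if (token.chars().allMatch(Character::isLetter)) return "LETTERS";
        return "OTHER";
    };

    public static void main(String[] args) {
        char[] cs = "xx call 911 now!".toCharArray();
        // Tokenize only the slice after the leading "xx ".
        for (String token : WHITESPACE.tokenize(cs, 3, cs.length - 3)) {
            System.out.println(token + " -> " + SHAPE.categorize(token));
        }
    }
}
```

Passing an offset and length rather than a whole string lets a caller tokenize part of a larger buffer without copying it first.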
 

Class Summary
CharacterTokenCategorizer Returns a category for tokens made up out of a single character.
CharacterTokenizerFactory A CharacterTokenizerFactory considers each non-whitespace character in the input to be a distinct token.
EnglishStopTokenizerFactory An EnglishStopTokenizerFactory applies an English stop list to a contained base tokenizer factory.
IndoEuropeanTokenCategorizer An IndoEuropeanTokenCategorizer is a generic token categorizer for Indo-European languages that is based on character "shape".
IndoEuropeanTokenizerFactory An IndoEuropeanTokenizerFactory creates tokenizers with built-in support for alpha-numerics, numbers, and other common constructs in Indo-European languages.
LineTokenizerFactory A LineTokenizerFactory treats each line of an input as a token.
LowerCaseTokenizerFactory A LowerCaseTokenizerFactory filters the tokenizers produced by a base tokenizer factory to produce lower case output.
ModifiedTokenizerFactory A ModifiedTokenizerFactory is an abstract tokenizer factory that modifies a tokenizer returned by a base tokenizer factory.
ModifyTokenTokenizerFactory The abstract base class ModifyTokenTokenizerFactory adapts token and whitespace modifiers to modify tokenizer factories.
NGramTokenizerFactory An NGramTokenizerFactory creates n-gram tokenizers of a specified minimum and maximum length.
PorterStemmerTokenizerFactory A PorterStemmerTokenizerFactory applies Porter's stemmer to the tokenizers produced by a base tokenizer factory.
RegExFilteredTokenizerFactory A RegExFilteredTokenizerFactory modifies the tokens returned by a base tokenizer factory's tokenizer by removing those that do not match a regular expression pattern.
RegExTokenizerFactory A RegExTokenizerFactory creates a tokenizer factory out of a regular expression.
SoundexTokenizerFactory A SoundexTokenizerFactory modifies the output of a base tokenizer factory to produce tokens in soundex representation.
StopTokenizerFactory A StopTokenizerFactory modifies a base tokenizer factory by removing tokens in a specified stop set.
TokenChunker A TokenChunker provides an implementation of the Chunker interface based on an underlying tokenizer factory.
TokenFeatureExtractor A TokenFeatureExtractor produces feature vectors from character sequences representing token counts.
Tokenization A Tokenization represents the result of tokenizing a string.
Tokenizer The abstract class Tokenizer serves as a base for tokenizer implementations, which provide streams of tokens, whitespaces, and positions.
TokenLengthTokenizerFactory A TokenLengthTokenizerFactory filters the tokenizers produced by a base tokenizer to only return tokens between specified lower and upper length limits.
TokenNGramTokenizerFactory A TokenNGramTokenizerFactory wraps a base tokenizer to produce token n-gram tokens of a specified size.
WhitespaceNormTokenizerFactory A WhitespaceNormTokenizerFactory filters the tokenizers produced by a base tokenizer factory to convert non-empty whitespaces to a single space and leave empty (zero-length) whitespaces alone.
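
Several of the classes above either generate tokens directly (character n-grams between a minimum and maximum length) or filter the output of a base tokenizer (lower-casing, stop-set removal). The following stand-alone sketch illustrates both ideas with plain Java collections. It is not how this package implements them: FilterSketch, charNGrams and lowerCaseThenStop are hypothetical names, and the n-gram ordering (shortest first) is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class FilterSketch {

    /** All character n-grams of text with length in [minN, maxN], shortest first. */
    static List<String> charNGrams(String text, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            // Slide a window of width n across the text.
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
        }
        return grams;
    }

    /** Lower-cases each token, then drops tokens found in the stop set. */
    static List<String> lowerCaseThenStop(List<String> tokens, Set<String> stopSet) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ENGLISH))
                .filter(t -> !stopSet.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("abc", 1, 2));
        // prints [a, b, c, ab, bc]
        System.out.println(lowerCaseThenStop(List.of("The", "Cat", "sat"), Set.of("the")));
        // prints [cat, sat]
    }
}
```

The order of the filters matters: lower-casing before the stop-set check lets a lower-case stop set match "The" as well as "the".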
 
