Package com.aliasi.tokenizer

Classes for tokenizing character sequences.


Interface Summary
TokenCategorizer A TokenCategorizer supplies a string-based category for string-based tokens.
TokenizerFactory A TokenizerFactory constructs tokenizers from subsequences of character arrays.

Class Summary
CharacterTokenCategorizer Returns a category for tokens made up out of a single character.
CharacterTokenizerFactory A CharacterTokenizerFactory considers each non-whitespace character in the input to be a distinct token.
EnglishStopListFilterTokenizer Deprecated. Use EnglishStopTokenizerFactory instead.
EnglishStopTokenizerFactory An EnglishStopTokenizerFactory applies an English stop list to a contained base tokenizer factory.
FilterTokenizer Deprecated. Use ModifiedTokenizerFactory instead.
IndoEuropeanTokenCategorizer An IndoEuropeanTokenCategorizer is a generic token categorizer for Indo-European languages that is based on character "shape".
IndoEuropeanTokenizerFactory An IndoEuropeanTokenizerFactory creates tokenizers with built-in support for alpha-numerics, numbers, and other common constructs in Indo-European languages.
LengthStopFilterTokenizer Deprecated. Use TokenLengthTokenizerFactory or ModifyTokenTokenizerFactory.modify(Tokenizer) instead.
LineTokenizerFactory A LineTokenizerFactory treats each line of an input as a token.
LowerCaseFilterTokenizer Deprecated. Use LowerCaseTokenizerFactory instead.
LowerCaseTokenizerFactory A LowerCaseTokenizerFactory filters the tokenizers produced by a base tokenizer factory to produce lower case output.
ModifiedTokenizerFactory A ModifiedTokenizerFactory is an abstract tokenizer factory that modifies a tokenizer returned by a base tokenizer factory.
ModifyTokenTokenizerFactory The abstract base class ModifyTokenTokenizerFactory adapts token and whitespace modifiers to modify tokenizer factories.
NGramTokenizerFactory An NGramTokenizerFactory creates n-gram tokenizers of a specified minimum and maximum length.
NormalizeWhiteSpaceFilterTokenizer Deprecated. Use WhitespaceNormTokenizerFactory instead.
PorterStemmer Deprecated. Use PorterStemmerTokenizerFactory.stem(String) instead.
PorterStemmerFilterTokenizer Deprecated. Use PorterStemmerTokenizerFactory instead.
PorterStemmerTokenizerFactory A PorterStemmerTokenizerFactory applies Porter's stemmer to the tokenizers produced by a base tokenizer factory.
PunctuationStopListTokenizer Deprecated. Use RegExFilteredTokenizerFactory with a pattern matching the characters specified in Strings.allPunctuation(String).
RegExFilteredTokenizerFactory A RegExFilteredTokenizerFactory modifies the tokens returned by a base tokenizer factory's tokenizer by removing those that do not match a regular expression pattern.
RegExTokenizerFactory A RegExTokenizerFactory creates a tokenizer factory out of a regular expression.
SoundexFilterTokenizer Deprecated. Use SoundexTokenizerFactory instead.
SoundexTokenizerFactory A SoundexTokenizerFactory modifies the output of a base tokenizer factory to produce tokens in soundex representation.
StopFilterTokenizer Deprecated. Use ModifyTokenTokenizerFactory instead.
StopListFilterTokenizer Deprecated. Use StopTokenizerFactory instead.
StopTokenizerFactory A StopTokenizerFactory modifies a base tokenizer factory by removing tokens in a specified stop set.
TokenChunker A TokenChunker provides an implementation of the Chunker interface based on an underlying tokenizer factory.
TokenFeatureExtractor A TokenFeatureExtractor produces feature vectors representing token counts from character sequences.
TokenFilterTokenizer Deprecated. Use ModifyTokenTokenizerFactory instead.
Tokenization A Tokenization represents the result of tokenizing a string.
Tokenizer The abstract class Tokenizer serves as a base for tokenizer implementations, which provide streams of tokens, whitespaces, and positions.
TokenLengthTokenizerFactory A TokenLengthTokenizerFactory filters the tokenizers produced by a base tokenizer factory to return only tokens between specified lower and upper length limits.
WhitespaceNormTokenizerFactory A WhitespaceNormTokenizerFactory filters the tokenizers produced by a base tokenizer factory to convert non-empty whitespaces to a single space and leave empty (zero-length) whitespaces alone.

Package com.aliasi.tokenizer Description

Classes for tokenizing character sequences.
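
Many of the factories summarized above wrap a base tokenizer factory and modify its output (lower-casing, stop-list removal, length filtering). The following is a minimal sketch of that wrapping pattern in plain Java. It does not use the classes in this package: SketchFactory, whitespaceSplit, lowerCase, stop, lengthBetween, and demo are hypothetical names introduced only for illustration, and the sketch returns a list of tokens directly rather than a Tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-ins; none of these types belong to com.aliasi.tokenizer.
class TokenizerSketch {

    /** Hypothetical analogue of a tokenizer factory: tokenizes a slice of a character array. */
    interface SketchFactory {
        List<String> tokenize(char[] cs, int start, int length);
    }

    /** Base factory: splits the slice on whitespace and discards the whitespace. */
    static SketchFactory whitespaceSplit() {
        return (cs, start, length) -> {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (int i = start; i < start + length; i++) {
                if (Character.isWhitespace(cs[i])) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append(cs[i]);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
            return tokens;
        };
    }

    /** Wrapper: lower-cases every token produced by the base factory. */
    static SketchFactory lowerCase(SketchFactory base) {
        return (cs, start, length) -> {
            List<String> out = new ArrayList<>();
            for (String token : base.tokenize(cs, start, length)) {
                out.add(token.toLowerCase(Locale.ENGLISH));
            }
            return out;
        };
    }

    /** Wrapper: removes tokens that appear in the given stop set. */
    static SketchFactory stop(SketchFactory base, Set<String> stopSet) {
        return (cs, start, length) -> {
            List<String> out = new ArrayList<>();
            for (String token : base.tokenize(cs, start, length)) {
                if (!stopSet.contains(token)) {
                    out.add(token);
                }
            }
            return out;
        };
    }

    /** Wrapper: keeps only tokens whose length lies in [min, max]. */
    static SketchFactory lengthBetween(SketchFactory base, int min, int max) {
        return (cs, start, length) -> {
            List<String> out = new ArrayList<>();
            for (String token : base.tokenize(cs, start, length)) {
                if (token.length() >= min && token.length() <= max) {
                    out.add(token);
                }
            }
            return out;
        };
    }

    /** Composes the wrappers: split, lower-case, remove stop words, then filter by length. */
    static List<String> demo(String text) {
        SketchFactory factory = lengthBetween(
            stop(lowerCase(whitespaceSplit()), Set.of("the", "a", "of")), 2, 10);
        char[] cs = text.toCharArray();
        return factory.tokenize(cs, 0, cs.length);
    }

    public static void main(String[] args) {
        System.out.println(demo("The Quick brown fox")); // prints [quick, brown, fox]
    }
}
```

The order of wrapping matters in this sketch: the stop set is lower case, so lower-casing has to be applied before the stop filter for "The" to be removed.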