java.lang.Object
  com.aliasi.tokenizer.IndoEuropeanTokenizerFactory
public class IndoEuropeanTokenizerFactory
An IndoEuropeanTokenizerFactory creates tokenizers for
subsequences of character arrays.
A tokenizer for Indo-European languages. The tokenization rules
are roughly based on those used in MUC-6, but are necessarily finer
grained, because the MUC tokenizers were based on lexical and
semantic information such as whether a string was an abbreviation.
A token is any sequence of characters satisfying one of the following patterns.

| Pattern | Description |
|---|---|
| AlphaNumeric | Any sequence of upper or lowercase letters or digits, as defined by Character.isDigit(char) and Character.isLetter(char), and including the Devanagari characters (unicode 0x0900 to 0x097F) |
| Numerical | Any sequence of numbers, commas, and periods. |
| Hyphen Sequence | Any number of hyphens (-) |
| Equals Sequence | Any number of equals signs (=) |
| Double Quotes | Double forward quotes (``) or double backward quotes ('') |

Whitespaces are defined as any sequence of whitespace characters, including the unicode non-breakable space (unicode 160). The tokenizer operates in a longest-leftmost
fashion, returning the longest possible token starting at the
current position in the underlying character array.
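The longest-leftmost rule described above can be sketched in plain Java without the library. This is an illustrative approximation, not the class's actual implementation: the class name LongestLeftmostSketch and all of its methods are hypothetical, a Numerical token is assumed to start with a digit, and any character matching no pattern is assumed to become a single-character token (the description above does not say how such characters are handled).

```java
import java.util.ArrayList;
import java.util.List;

final class LongestLeftmostSketch {

    // Whitespace as described: Java whitespace plus the non-breakable space (code point 160).
    static boolean isWhite(char c) {
        return Character.isWhitespace(c) || c == '\u00A0';
    }

    // Letters, digits, and the Devanagari block 0x0900 to 0x097F.
    static boolean isAlphaNum(char c) {
        return Character.isLetter(c) || Character.isDigit(c) || (c >= '\u0900' && c <= '\u097F');
    }

    static int alphaNumLength(char[] ch, int pos, int end) {
        int i = pos;
        while (i < end && isAlphaNum(ch[i])) ++i;
        return i - pos;
    }

    // Assumption: a numerical token must begin with a digit.
    static int numericalLength(char[] ch, int pos, int end) {
        if (pos >= end || !Character.isDigit(ch[pos])) return 0;
        int i = pos;
        while (i < end && (Character.isDigit(ch[i]) || ch[i] == ',' || ch[i] == '.')) ++i;
        return i - pos;
    }

    // Length of a run of one repeated character, used for hyphens and equals signs.
    static int repeatLength(char[] ch, int pos, int end, char c) {
        int i = pos;
        while (i < end && ch[i] == c) ++i;
        return i - pos;
    }

    // Two backquotes or two apostrophes.
    static int doubleQuoteLength(char[] ch, int pos, int end) {
        if (pos + 1 >= end) return 0;
        boolean forward = ch[pos] == '`' && ch[pos + 1] == '`';
        boolean backward = ch[pos] == '\'' && ch[pos + 1] == '\'';
        return (forward || backward) ? 2 : 0;
    }

    // Tokenizes ch[start .. start+length), taking the longest match at each position.
    static List<String> tokenize(char[] ch, int start, int length) {
        List<String> tokens = new ArrayList<>();
        int end = start + length;
        int pos = start;
        while (pos < end) {
            if (isWhite(ch[pos])) {
                ++pos;
                continue;
            }
            int best = 1; // assumption: unmatched characters are single-character tokens
            best = Math.max(best, alphaNumLength(ch, pos, end));
            best = Math.max(best, numericalLength(ch, pos, end));
            best = Math.max(best, repeatLength(ch, pos, end, '-'));
            best = Math.max(best, repeatLength(ch, pos, end, '='));
            best = Math.max(best, doubleQuoteLength(ch, pos, end));
            tokens.add(new String(ch, pos, best));
            pos += best;
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[] text = "Price=3,000.50 -- ``ok''".toCharArray();
        for (String token : tokenize(text, 0, text.length)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

At the character 3 both the AlphaNumeric and the Numerical pattern match, and the longer Numerical match 3,000.50 wins, which is the longest-leftmost behavior in miniature.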
The serialized version of this class deserializes to a new instance; the compiled version deserializes to the shared factory instance FACTORY.
| Field Summary | |
|---|---|
| static TokenizerFactory | FACTORY: An instance of an Indo-European tokenizer factory. |
| Constructor Summary | |
|---|---|
| IndoEuropeanTokenizerFactory() | Construct a tokenizer factory for Indo-European languages. |
| Method Summary | |
|---|---|
| void | compileTo(ObjectOutput objOut): Compiles this tokenizer factory to the specified object output. |
| Tokenizer | tokenizer(char[] ch, int start, int length): Returns an Indo-European tokenizer for the specified subsequence of characters. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final TokenizerFactory FACTORY
| Constructor Detail |
|---|
public IndoEuropeanTokenizerFactory()
Implementation Note: All Indo-European tokenizer
factories behave the same way, and they are thread safe, so the
constant FACTORY may be used anywhere a freshly
constructed Indo-European tokenizer factory is used, without loss
of performance.
| Method Detail |
|---|
public Tokenizer tokenizer(char[] ch,
int start,
int length)
Specified by: tokenizer in interface TokenizerFactory
Parameters:
ch - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.
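public void compileTo(ObjectOutput objOut)
throws IOException
Compiles this tokenizer factory to the specified object output. The compiled version deserializes to the factory instance FACTORY.
Specified by: compileTo in interface Compilable
Parameters:
objOut - Object output to which this tokenizer factory is compiled.
Throws:
IOException - If there is an I/O error during the write.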
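One common way to make an object read back as a shared constant is to write a small serializable placeholder whose readResolve method returns that constant. The sketch below shows the idea using only JDK serialization. It is not taken from this class's source: DemoFactory, Placeholder, and roundTrip are hypothetical names, and the real compileTo may work differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SharedFactorySketch {

    // Hypothetical stand-in for a stateless factory with a shared constant.
    static final class DemoFactory {
        static final DemoFactory FACTORY = new DemoFactory();

        // Writes a placeholder rather than this object's own state.
        void compileTo(ObjectOutput objOut) throws IOException {
            objOut.writeObject(new Placeholder());
        }
    }

    // Placeholder that resolves to the shared constant when it is read back.
    static final class Placeholder implements Serializable {
        private static final long serialVersionUID = 1L;

        private Object readResolve() {
            return DemoFactory.FACTORY;
        }
    }

    // Compiles the factory to an in-memory buffer and reads the result back.
    static Object roundTrip(DemoFactory factory) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                factory.compileTo(out);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object readBack = roundTrip(new DemoFactory());
        System.out.println(readBack == DemoFactory.FACTORY);
    }
}
```

Because the factory in this sketch carries no state, nothing is lost by discarding the freshly constructed instance and returning the constant instead.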