java.lang.Object
  com.aliasi.tokenizer.IndoEuropeanTokenizerFactory
public class IndoEuropeanTokenizerFactory
An IndoEuropeanTokenizerFactory creates tokenizers
with built-in support for alpha-numerics, numbers, and other
common constructs in Indo-European languages.
The tokenization rules are roughly based on those used in MUC-6, but are necessarily finer grained, because the MUC tokenizers were based on lexical and semantic information such as whether a string was an abbreviation.
A token is any sequence of characters satisfying one of the following patterns.
| Pattern | Description |
|---|---|
| AlphaNumeric | Any sequence of upper or lowercase letters or digits, as defined by Character.isDigit(char) and Character.isLetter(char), and including the Devanagari characters (unicode 0x0900 to 0x097F) |
| Numerical | Any sequence of numbers, commas, and periods. |
| Hyphen Sequence | Any number of hyphens (-) |
| Equals Sequence | Any number of equals signs (=) |
| Double Quotes | Double forward quotes (``) or double backward quotes ('') |

Whitespaces are defined as any sequence of whitespace characters, including the unicode non-breakable space (unicode 160). The tokenizer operates in a longest-leftmost fashion, returning the longest possible token starting at the current position in the underlying character array.
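The longest-leftmost rule can be illustrated with a small stand-alone sketch. This is not LingPipe's implementation: the class `LongestLeftmostSketch` and its helper methods are hypothetical names, and two behaviors are assumptions not stated in the description above, namely that a numerical token starts and ends with a digit, and that any character matching no pattern becomes a one-character token.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the names and fallback rules are assumptions. */
final class LongestLeftmostSketch {

    private LongestLeftmostSketch() { }

    /** Whitespace, including the non-breakable space (unicode 160). */
    static boolean isSpace(char c) {
        return Character.isWhitespace(c) || c == '\u00A0';
    }

    /** Letters, digits, and the Devanagari block 0x0900 to 0x097F. */
    static boolean isAlphaNumeric(char c) {
        return Character.isLetter(c) || Character.isDigit(c)
            || (c >= '\u0900' && c <= '\u097F');
    }

    /** End of the alphanumeric run starting at pos (pos if there is none). */
    static int alphaNumericEnd(char[] ch, int pos, int end) {
        int i = pos;
        while (i < end && isAlphaNumeric(ch[i])) ++i;
        return i;
    }

    /** Assumption: a numerical token starts and ends with a digit. */
    static int numericalEnd(char[] ch, int pos, int end) {
        int lastDigitEnd = pos;
        if (!Character.isDigit(ch[pos])) return pos;
        for (int i = pos; i < end; ++i) {
            char c = ch[i];
            if (Character.isDigit(c)) lastDigitEnd = i + 1;
            else if (c != ',' && c != '.') break;
        }
        return lastDigitEnd;
    }

    /** End of a run of the single character c starting at pos. */
    static int repeatEnd(char[] ch, int pos, int end, char c) {
        int i = pos;
        while (i < end && ch[i] == c) ++i;
        return i;
    }

    /** Matches exactly two backquotes or two single quotes. */
    static int doubleQuoteEnd(char[] ch, int pos, int end) {
        boolean quote = ch[pos] == '`' || ch[pos] == '\'';
        return (quote && pos + 1 < end && ch[pos + 1] == ch[pos]) ? pos + 2 : pos;
    }

    /** Tokenizes the length characters of ch beginning at index start. */
    static List<String> tokenize(char[] ch, int start, int length) {
        List<String> tokens = new ArrayList<>();
        int end = start + length;
        int pos = start;
        while (pos < end) {
            if (isSpace(ch[pos])) {
                ++pos; // whitespace separates tokens and is not returned
                continue;
            }
            int best = pos + 1; // assumption: fall back to one character
            best = Math.max(best, alphaNumericEnd(ch, pos, end));
            best = Math.max(best, numericalEnd(ch, pos, end));
            best = Math.max(best, repeatEnd(ch, pos, end, '-'));
            best = Math.max(best, repeatEnd(ch, pos, end, '='));
            best = Math.max(best, doubleQuoteEnd(ch, pos, end));
            tokens.add(new String(ch, pos, best - pos)); // longest match wins
            pos = best;
        }
        return tokens;
    }
}
```

Under these assumptions, tokenizing `It cost 1,234.50 -- ``ok''.` yields the tokens `It`, `cost`, `1,234.50`, `--`, ` `` `, `ok`, `''`, and `.`. At the digit `1` the numerical pattern matches more characters than the alphanumeric pattern, so the longer match is returned.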
Use the singleton INSTANCE instead of constructing a fresh instance.
The serialized and compiled versions of this class deserialize to a new instance and the factory instance respectively.
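Because the serialized form deserializes to a new instance, code should not rely on reference equality with the singleton after a serialization round trip. The following sketch shows the general Java behavior using a hypothetical stand-in class, `StatelessFactorySketch`; it does not use LingPipe and makes no claim about how this class implements serialization internally.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a stateless, serializable factory. */
final class StatelessFactorySketch implements Serializable {

    private static final long serialVersionUID = 1L;

    /** Shared instance, analogous to a singleton constant. */
    static final StatelessFactorySketch INSTANCE = new StatelessFactorySketch();

    /** Serializes obj to an in-memory buffer and reads it back. */
    static Object roundTrip(Object obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here `StatelessFactorySketch.roundTrip(StatelessFactorySketch.INSTANCE)` returns an object of the same class that is not the same reference as `INSTANCE`, since the sketch defines no `readResolve` method. For a stateless factory this is harmless as long as callers compare by type or behavior rather than with `==`.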
| Field Summary | |
|---|---|
| static TokenizerFactory | FACTORY: Deprecated. Use INSTANCE instead. |
| static IndoEuropeanTokenizerFactory | INSTANCE: The singleton instance of an Indo-European tokenizer factory. |
| Constructor Summary | |
|---|---|
| IndoEuropeanTokenizerFactory() | Deprecated. Use singleton instance INSTANCE instead. |
| Method Summary | |
|---|---|
| void | compileTo(ObjectOutput objOut): Deprecated. Use the Serializable interface instead. |
| Tokenizer | tokenizer(char[] ch, int start, int length): Returns a tokenizer for Indo-European for the specified subsequence of characters. |
| String | toString(): Returns the name of this class. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
public static final IndoEuropeanTokenizerFactory INSTANCE
The singleton instance of an Indo-European tokenizer factory.

@Deprecated public static final TokenizerFactory FACTORY
Deprecated. Use INSTANCE instead.
| Constructor Detail |
|---|
@Deprecated public IndoEuropeanTokenizerFactory()
Deprecated. Use singleton instance INSTANCE instead.

Implementation Note: All Indo-European tokenizer factories behave the same way, and they are thread safe, so the constant INSTANCE (or the deprecated constant FACTORY) may be used anywhere a freshly constructed Indo-European tokenizer factory is used, without loss of performance.
| Method Detail |
|---|
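public Tokenizer tokenizer(char[] ch,
int start,
int length)
Returns a tokenizer for Indo-European for the specified subsequence of characters.
Specified by: tokenizer in interface TokenizerFactory
Parameters:
ch - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.

public String toString()
Returns the name of this class.
Overrides: toString in class Object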
@Deprecated
public void compileTo(ObjectOutput objOut)
throws IOException
Deprecated. Use the Serializable interface instead.
The compiled version of this class deserializes to the constant FACTORY.
Specified by: compileTo in interface Compilable
Parameters:
objOut - Object output to which this tokenizer factory is compiled.
Throws:
IOException - If there is an I/O error during the write.