com.aliasi.tokenizer
Class RegExTokenizerFactory

java.lang.Object
  extended by com.aliasi.tokenizer.RegExTokenizerFactory
All Implemented Interfaces:
TokenizerFactory, Compilable, Serializable
Direct Known Subclasses:
LineTokenizerFactory

public class RegExTokenizerFactory
extends Object
implements Compilable, Serializable, TokenizerFactory

A RegExTokenizerFactory is a tokenizer factory whose tokens are defined by a regular expression. The regular expression is represented as an instance of Pattern, and matching is carried out with the java.util.regex package. The pattern provided when the factory is constructed is used to create instances of Matcher for use in tokenizers. The method Matcher.find(int) is called to find the next token in an input sequence.

For instance, consider a regular expression which takes a token to be a sequence of alphabetic characters, a sequence of numeric characters, or any other single non-whitespace character:

      [a-zA-Z]+|[0-9]+|\S
This can be used to construct a tokenizer factory:
     String regex = "[a-zA-Z]+|[0-9]+|\\S";
     TokenizerFactory tf = new RegExTokenizerFactory(regex);
     char[] cs = "abc de 123. ".toCharArray();
     Tokenizer tokenizer = tf.tokenizer(cs,0,cs.length);
Note that the backslash character (\) in the regular expression must itself be escaped with a backslash in the Java string literal, resulting in \\. None of the disjuncts in the regular expression contains a space, because matched tokens should not contain whitespace. Finally, note the use of Kleene plus (+) rather than Kleene star (*) to ensure that tokens are at least one character long. In fact, the constructor will throw an exception if the pattern matches the empty string.

The tokenizer above will return the following tokens, whitespaces and character offsets:

     whitespaces: "", " ", " ", "", " "
          tokens: "abc", "de", "123", "."
    token starts: 0, 4, 7, 10
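
The output listed above can be reproduced with java.util.regex alone. The following is a minimal sketch, not LingPipe's implementation: the class RegexTokenSketch and its fields are hypothetical names, and it assumes that the text between consecutive matches is what is reported as whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of LingPipe: collects tokens, the text
// between them, and token start offsets by repeatedly calling Matcher.find(int).
class RegexTokenSketch {
    final List<String> tokens = new ArrayList<String>();
    final List<String> whitespaces = new ArrayList<String>();
    final List<Integer> tokenStarts = new ArrayList<Integer>();

    RegexTokenSketch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int pos = 0;
        while (matcher.find(pos)) {
            // Guard against patterns that match the empty string, which
            // would otherwise never advance past the current position.
            if (matcher.end() == matcher.start()) {
                throw new IllegalArgumentException("pattern matched the empty string");
            }
            // Everything skipped before the match is treated as whitespace.
            whitespaces.add(input.substring(pos, matcher.start()));
            tokens.add(matcher.group());
            tokenStarts.add(matcher.start());
            pos = matcher.end();
        }
        // Trailing text after the last token (possibly empty).
        whitespaces.add(input.substring(pos));
    }

    public static void main(String[] args) {
        RegexTokenSketch s =
            new RegexTokenSketch("[a-zA-Z]+|[0-9]+|\\S", "abc de 123. ");
        System.out.println("tokens: " + s.tokens);
        System.out.println("token starts: " + s.tokenStarts);
        System.out.println("whitespace count: " + s.whitespaces.size());
    }
}
```

There is always one more whitespace than there are tokens, because whitespace is collected both before the first token and after the last one.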

Thread Safety

A regular-expression-based tokenizer factory is completely thread safe.
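
The general java.util.regex idiom behind this kind of thread safety is to share one immutable Pattern and give each caller its own Matcher, since Pattern instances are safe for concurrent use and Matcher instances are not. The sketch below illustrates that idiom only; SharedPatternSketch and its methods are hypothetical names, and nothing here is taken from LingPipe's source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: one shared Pattern, one Matcher per call.
class SharedPatternSketch {
    static final Pattern SHARED = Pattern.compile("[a-zA-Z]+|[0-9]+|\\S");

    static int countTokens(String text) {
        // A fresh Matcher per call; Matcher objects must not be shared.
        Matcher matcher = SHARED.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    static List<Integer> countInParallel(List<String> texts) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (final String text : texts) {
                Callable<Integer> task = () -> countTokens(text);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<Integer>();
            for (Future<Integer> future : futures) {
                counts.add(future.get()); // results in submission order
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<String>();
        texts.add("abc de 123. ");
        texts.add("x");
        System.out.println(countInParallel(texts));
    }
}
```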

Serialization

A regular-expression-based tokenizer factory may be serialized.
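
Because the class implements Serializable, an instance can be written with ObjectOutputStream.writeObject and read back with ObjectInputStream.readObject. The sketch below shows that round trip on a Pattern, which is itself Serializable, so that the example runs without LingPipe on the classpath. It is an illustration under that substitution, not a description of the factory's serialized form; PatternSerializationSketch is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.regex.Pattern;

// Hypothetical illustration: serialize a Pattern to bytes and read it back.
class PatternSerializationSketch {
    static Pattern roundTrip(Pattern pattern) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(pattern);
            out.close();
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            try {
                return (Pattern) in.readObject();
            } finally {
                in.close();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Pattern original = Pattern.compile("[a-z]+", Pattern.CASE_INSENSITIVE);
        Pattern copy = roundTrip(original);
        // The regular expression and the flags both survive the round trip.
        System.out.println(copy.pattern() + " flags=" + copy.flags());
    }
}
```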

Since:
LingPipe 2.1
Version:
3.8
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
RegExTokenizerFactory(Pattern pattern)
          Construct a regular expression tokenizer factory with the specified pattern for matching.
RegExTokenizerFactory(String regex)
          Construct a regular expression tokenizer factory using the specified regular expression for matching.
RegExTokenizerFactory(String regex, int flags)
          Construct a regular expression tokenizer factory using the specified regular expression for matching according to the specified flags.
 
Method Summary
 void compileTo(ObjectOutput objOut)
          Deprecated. Use the Serializable interface instead.
 Pattern pattern()
          Returns the regular expression pattern backing this tokenizer factory.
 Tokenizer tokenizer(char[] cs, int start, int length)
          Returns a tokenizer for the specified subsequence of characters.
 String toString()
          Return a description of this regex-based tokenizer factory including its pattern's regular expression and flags.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RegExTokenizerFactory

public RegExTokenizerFactory(String regex)
Construct a regular expression tokenizer factory using the specified regular expression for matching.

Parameters:
regex - The regular expression.
Throws:
PatternSyntaxException - If the expression's syntax is invalid.

RegExTokenizerFactory

public RegExTokenizerFactory(String regex,
                             int flags)
Construct a regular expression tokenizer factory using the specified regular expression for matching according to the specified flags. The value of flags should be a bitwise disjunction (single vertical bar "|") of any of the following flags: Pattern.CASE_INSENSITIVE, Pattern.MULTILINE, Pattern.DOTALL, Pattern.UNICODE_CASE and Pattern.CANON_EQ. See Pattern.compile(String,int) for more information.

Parameters:
regex - The regular expression.
flags - The match flags.
Throws:
PatternSyntaxException - If the expression's syntax is invalid.
IllegalArgumentException - If bit values other than those corresponding to defined match flags are set in the flags.
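
To show how combined flags change what counts as a token, the sketch below calls Pattern.compile(String,int) directly, which is the method the description above refers to. FlagsSketch and firstToken are hypothetical names used only for illustration; the sketch does not use RegExTokenizerFactory itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of combining match flags with bitwise OR.
class FlagsSketch {
    // Returns the first match of regex in text, or null if there is none.
    static String firstToken(String regex, int flags, String text) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        // Without flags, [a-z]+ skips the uppercase run.
        System.out.println(firstToken("[a-z]+", 0, "ABC def"));     // def
        // With case-insensitive matching, the uppercase run is a match.
        System.out.println(firstToken("[a-z]+", flags, "ABC def")); // ABC
    }
}
```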

RegExTokenizerFactory

public RegExTokenizerFactory(Pattern pattern)
Construct a regular expression tokenizer factory with the specified pattern for matching.

Parameters:
pattern - Pattern to use for matching.

Method Detail

pattern

public Pattern pattern()
Returns the regular expression pattern backing this tokenizer factory.

Returns:
The pattern for this factory.

tokenizer

public Tokenizer tokenizer(char[] cs,
                           int start,
                           int length)
Description copied from interface: TokenizerFactory
Returns a tokenizer for the specified subsequence of characters.

Specified by:
tokenizer in interface TokenizerFactory
Parameters:
cs - Characters to tokenize.
start - Index of first character to tokenize.
length - Number of characters to tokenize.
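
Note that the third argument is a character count, not an end index. The sketch below illustrates that convention using only java.util.regex: it copies the slice cs[start] through cs[start+length-1] into a String and matches against it. SliceSketch and tokensInSlice are hypothetical names, and the sketch makes no claim about how LingPipe's tokenizer reports offsets within the slice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the (start, length) slice convention.
class SliceSketch {
    static List<String> tokensInSlice(Pattern pattern, char[] cs, int start, int length) {
        // new String(char[], offset, count) copies exactly `length` characters.
        String slice = new String(cs, start, length);
        Matcher matcher = pattern.matcher(slice);
        List<String> tokens = new ArrayList<String>();
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[a-zA-Z]+|[0-9]+|\\S");
        char[] cs = "## abc de 123. ##".toCharArray();
        // Start at index 3 and take 12 characters: "abc de 123. "
        System.out.println(tokensInSlice(pattern, cs, 3, 12));
    }
}
```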

compileTo

@Deprecated
public void compileTo(ObjectOutput objOut)
               throws IOException
Deprecated. Use the Serializable interface instead.

Description copied from interface: Compilable
Compile this object to the specified object output.

Specified by:
compileTo in interface Compilable
Parameters:
objOut - Object output to which this object is compiled.
Throws:
IOException - If there is an I/O error compiling the object.

toString

public String toString()
Return a description of this regex-based tokenizer factory including its pattern's regular expression and flags.

Overrides:
toString in class Object
Returns:
A description of this regex-based tokenizer factory.