com.aliasi.chunk
Class RegExChunker

java.lang.Object
  extended by com.aliasi.chunk.RegExChunker
All Implemented Interfaces:
Chunker, Compilable, Serializable

public class RegExChunker
extends Object
implements Chunker, Compilable, Serializable

A RegExChunker finds chunks that match a regular expression. Specifically, a matcher is created for the input, and its Matcher.find() method is used to iterate over the matching text segments, each of which is converted to a chunk.

The behavior of the find method is largely determined by the specific instance of Pattern on which the chunker is based. For more information, see Sun's RegEx Tutorial.

All found chunks receive the same type and score, both of which are specified at construction time.
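
The find-based iteration described above can be sketched with java.util.regex alone. This is an illustration, not the LingPipe implementation: the class FindLoopSketch, the Span type, and the findSpans method are hypothetical names introduced here, and Span is only a stand-in for a chunk (a start offset, an exclusive end offset, a type, and a score).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain java.util.regex, no LingPipe classes involved.
final class FindLoopSketch {

    // Hypothetical stand-in for a chunk: [start, end) span plus type and score.
    static final class Span {
        final int start;
        final int end;
        final String type;
        final double score;

        Span(int start, int end, String type, double score) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.score = score;
        }
    }

    // Repeatedly call Matcher.find() and record each match as a span.
    // Every span gets the same type and score, fixed by the caller.
    static List<Span> findSpans(Pattern pattern, CharSequence text,
                                String type, double score) {
        List<Span> spans = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            spans.add(new Span(matcher.start(), matcher.end(), type, score));
        }
        return spans;
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("[0-9]+");
        for (Span s : findSpans(digits, "call 555 or 1234", "NUMBER", 1.0)) {
            System.out.println(s.start + "-" + s.end + ":" + s.type
                               + " score=" + s.score);
        }
    }
}
```

Matcher.find() returns non-overlapping matches from left to right, so the spans collected this way never overlap.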

Warning: Java uses the same regular expression matching as Perl. Perl uses a greedy strategy for quantifiers, so something like .* matches as many characters as possible. In contrast, disjunction uses a first-match strategy: alternatives are tried from left to right, and the first one that matches is taken. For example, the regular expression ab|abc will not produce the same chunker as abc|ab; for input abcde, the former will return ab as a chunk, whereas the latter will return abc. This first-match behavior of disjunctions takes precedence over any quantifiers applied to the strings.
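
The difference between the two orderings can be checked directly against java.util.regex. The sketch below is only an illustration; the class AlternationOrderSketch and the helper firstMatch are hypothetical names, and the helper returns the text of the first match found, or null if there is none.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: shows first-match disjunction and greedy quantifiers.
final class AlternationOrderSketch {

    // Return the text of the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The leftmost alternative that matches wins, even if a later one is longer.
        System.out.println(firstMatch("ab|abc", "abcde")); // ab
        System.out.println(firstMatch("abc|ab", "abcde")); // abc

        // A greedy quantifier takes as many characters as it can.
        System.out.println(firstMatch("a.*", "abcde"));    // abcde
    }
}
```

When the longer alternative is the one wanted, listing it first (abc|ab) is the simplest fix.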

Compilation and Serialization

For convenience, this class implements both the util.Compilable and java.io.Serializable interfaces. Both store the same state: the string underlying the regex pattern, the chunk type, and the score. The reconstituted object will also be an instance of this class.
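
As a rough picture of what gets stored, the following sketch round-trips the three values named above through standard Java serialization. It does not use RegExChunker or LingPipe's compilation utilities; StoredStateSketch, Spec, and roundTrip are hypothetical names, and Spec is only a stand-in that holds the regex string, chunk type, and chunk score.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustration only: a stand-in for the stored state, not the LingPipe class.
final class StoredStateSketch {

    // Hypothetical holder for the regex string, chunk type, and chunk score.
    static final class Spec implements Serializable {
        private static final long serialVersionUID = 1L;
        final String regex;
        final String chunkType;
        final double chunkScore;

        Spec(String regex, String chunkType, double chunkScore) {
            this.regex = regex;
            this.chunkType = chunkType;
            this.chunkScore = chunkScore;
        }
    }

    // Write the spec to an in-memory byte array and read it back.
    static Spec roundTrip(Spec spec) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(spec);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Spec) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Spec copy = roundTrip(new Spec("[0-9]+", "NUMBER", 1.0));
        System.out.println(copy.regex + " " + copy.chunkType + " " + copy.chunkScore);
    }
}
```

Because the regex is kept as a string, a chunker rebuilt from this state would recompile the pattern after it is read back.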

Since:
LingPipe 2.3
Version:
3.8
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
RegExChunker(Pattern pattern, String chunkType, double chunkScore)
          Construct a chunker based on the specified regular expression pattern, producing the specified chunk type and score.
RegExChunker(String regex, String chunkType, double chunkScore)
          Construct a chunker based on the specified regular expression, producing the specified chunk type and score.
 
Method Summary
 Chunking chunk(char[] cs, int start, int end)
          Return the chunking of the specified character slice.
 Chunking chunk(CharSequence cSeq)
          Return the chunking of the specified character sequence.
 void compileTo(ObjectOutput out)
          Compiles this regular-expression chunker to the specified object output.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RegExChunker

public RegExChunker(String regex,
                    String chunkType,
                    double chunkScore)
Construct a chunker based on the specified regular expression, producing the specified chunk type and score. The regular expression is compiled using the default method Pattern.compile(String).

Parameters:
regex - Regular expression for chunks.
chunkType - Type for all found chunks.
chunkScore - Score for all found chunks.

RegExChunker

public RegExChunker(Pattern pattern,
                    String chunkType,
                    double chunkScore)
Construct a chunker based on the specified regular expression pattern, producing the specified chunk type and score.

Parameters:
pattern - Regular expression pattern for chunks.
chunkType - Type for all found chunks.
chunkScore - Score for all found chunks.
Method Detail

chunk

public Chunking chunk(CharSequence cSeq)
Return the chunking of the specified character sequence. Chunkings are defined by the behavior of Matcher.find() as applied to the regular expression pattern underlying this chunker.

Specified by:
chunk in interface Chunker
Parameters:
cSeq - Character sequence to chunk.
Returns:
A chunking of the character sequence.

compileTo

public void compileTo(ObjectOutput out)
               throws IOException
Compiles this regular-expression chunker to the specified object output. When read back in, the object will be an instance of this class.

Specified by:
compileTo in interface Compilable
Parameters:
out - Object output to which this chunker is compiled.
Throws:
IOException - If there is an underlying I/O error during the write.

chunk

public Chunking chunk(char[] cs,
                      int start,
                      int end)
Return the chunking of the specified character slice.

Specified by:
chunk in interface Chunker
Parameters:
cs - Underlying character sequence.
start - Index of first character in slice.
end - Index of one past the last character in the slice.
Returns:
The chunking over the specified character slice.
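
The slice convention (start inclusive, end exclusive) can be sketched with java.util.regex as follows. This is an illustration under stated assumptions, not the LingPipe implementation: the class SliceSketch and the method findInSlice are hypothetical, and the sketch reports offsets relative to the start of the slice, which is an assumption about how positions are counted rather than something this page states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain java.util.regex over a char[] slice.
final class SliceSketch {

    // Find matches in cs[start..end), where end is one past the last character.
    // Each result is {matchStart, matchEnd}, counted from the start of the slice
    // (an assumption of this sketch).
    static List<int[]> findInSlice(Pattern pattern, char[] cs, int start, int end) {
        // The String constructor takes an offset and a length, so end - start.
        String slice = new String(cs, start, end - start);
        List<int[]> spans = new ArrayList<>();
        Matcher matcher = pattern.matcher(slice);
        while (matcher.find()) {
            spans.add(new int[] { matcher.start(), matcher.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        char[] cs = "xx42yy7".toCharArray();
        // Slice covers indices 2..6, i.e. the text "42yy7".
        for (int[] span : findInSlice(Pattern.compile("[0-9]+"), cs, 2, 7)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```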