com.aliasi.corpus.parsers
Class RegexLineTagParser

java.lang.Object
  extended by com.aliasi.corpus.Parser<H>
      extended by com.aliasi.corpus.StringParser<TagHandler>
          extended by com.aliasi.corpus.parsers.RegexLineTagParser

Deprecated. Use LineTaggingParser instead.

@Deprecated
public class RegexLineTagParser
extends StringParser<TagHandler>

Provides a means of generating a tag parser by extracting zone boundaries and token/tag pairs from lines of data using regular expressions. It is a useful base for implementing the CoNLL text parsers, which zone inputs by sentence.

The parser is specified by means of three regular expressions. If the ignore regular expression is matched, an input line is ignored. This is useful for ignoring empty lines and comments in some inputs. The eos regular expression recognizes lines that are ends of sentences. Whenever such a line is found, the zone currently being processed is sent to the handler. Finally, the match regular expression is used to extract tags and tokens from input lines, with the token index and tag index specifying the subgroup matched in the regular expression.
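
As a rough illustration of how the three expressions interact, the following self-contained sketch classifies lines using only java.util.regex. It is not the LingPipe implementation. The class name LineZoneSketch, the order of the checks (ignore, then eos, then match), the use of whole-line matching, the flushing of a final unterminated zone, and the tab-separated input format are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the line-classification logic described above.
final class LineZoneSketch {

    // Returns one list of "token/tag" strings per zone (sentence).
    static List<List<String>> zones(String input,
                                    String matchRegex,
                                    int tokenGroup,
                                    int tagGroup,
                                    String ignoreRegex,
                                    String eosRegex) {
        Pattern match = Pattern.compile(matchRegex);
        Pattern ignore = Pattern.compile(ignoreRegex);
        Pattern eos = Pattern.compile(eosRegex);

        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : input.split("\n", -1)) {
            if (ignore.matcher(line).matches()) {
                continue;                       // skipped line, e.g. a comment
            }
            if (eos.matcher(line).matches()) {
                if (!current.isEmpty()) {       // close the zone in progress
                    result.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            Matcher m = match.matcher(line);
            if (m.matches()) {                  // extract token and tag subgroups
                current.add(m.group(tokenGroup) + "/" + m.group(tagGroup));
            }
        }
        if (!current.isEmpty()) {
            result.add(current);                // assumption: flush a trailing zone
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "# comment\nJohn\tB-PER\nran\tO\n\nMary\tB-PER\n";
        System.out.println(
            zones(input, "(\\S+)\\t(\\S+)", 1, 2, "#.*", "\\s*"));
        // prints [[John/B-PER, ran/O], [Mary/B-PER]]
    }
}
```

In the real parser each completed zone is passed to the TagHandler rather than collected in a list.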

Here is a worked example for the CoNLL 2002 data set, a subsequence of which looks like:

 -DOCSTART- -DOCSTART- O
 Met Prep O
 tien Num O
 miljoen Num O
 komen V O
 we Pron O
 , Punc O
 denk V O
 ik Pron O
 , Punc O
 al Adv O
 een Art O
 heel Adj O
 eind N O
 . Punc O

 Dirk N B-PER
 ...
 
And here are the regular expressions used to parse it:
 String TOKEN_TAG_LINE_REGEX
     = "(\\S+)\\s(\\S+\\s)?(O|[BI]-\\S+)"; // token ?posTag entityTag

 int TOKEN_GROUP = 1; // token
 int TAG_GROUP = 3;   // entityTag

 String IGNORE_LINE_REGEX
     = "-DOCSTART(.*)";  // lines that start with "-DOCSTART"

 String EOS_REGEX
     = "\\A\\Z";         // empty/blank lines

 Parser<TagHandler> parser
     = new RegexLineTagParser(TOKEN_TAG_LINE_REGEX,
                              TOKEN_GROUP, TAG_GROUP,
                              IGNORE_LINE_REGEX,
                              EOS_REGEX);
 
Lines starting with "-DOCSTART" are ignored, and blank lines end sentences. Tokens and entity tags are extracted by matching the regular expression and pulling out match group 1 as the token and match group 3 as the tag. An optional part-of-speech tag between the token and the entity tag is ignored.
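
The group extraction can be checked in isolation with java.util.regex. The sketch below is only an illustration: the class name ConllLineSketch and the helper tokenAndTag are hypothetical, whole-line matching is an assumption, and the pattern is the match expression from the example with the tag prefix written as the character class [BI].

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing which subgroups carry the token and the tag.
final class ConllLineSketch {

    static final Pattern TOKEN_TAG_LINE =
        Pattern.compile("(\\S+)\\s(\\S+\\s)?(O|[BI]-\\S+)");

    static final int TOKEN_GROUP = 1; // token
    static final int TAG_GROUP = 3;   // entityTag

    // Returns {token, tag}, or null if the line does not match.
    static String[] tokenAndTag(String line) {
        Matcher m = TOKEN_TAG_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // Group 2 (the optional part-of-speech tag) is deliberately unused.
        return new String[] { m.group(TOKEN_GROUP), m.group(TAG_GROUP) };
    }

    public static void main(String[] args) {
        String[] withPos = tokenAndTag("Dirk N B-PER");
        System.out.println(withPos[0] + " " + withPos[1]); // prints Dirk B-PER
        String[] noPos = tokenAndTag("Dirk B-PER");
        System.out.println(noPos[0] + " " + noPos[1]);     // prints Dirk B-PER
    }
}
```

Both calls yield the same token and tag, which shows why group 3 rather than group 2 is passed as the tag index.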

Since:
LingPipe 2.4.0
Version:
3.8
Author:
Bob Carpenter

Constructor Summary
RegexLineTagParser(String matchRegex, int tokenGroup, int tagGroup, String ignoreRegex, String eosRegex)
          Deprecated. Construct a regular expression tag parser from the specified regular expressions and indexes.
RegexLineTagParser(TagHandler handler, String matchRegex, int tokenGroup, int tagGroup, String ignoreRegex, String eosRegex)
          Deprecated. Being rewritten for new types in 4.0.
 
Method Summary
 TagHandler getTagHandler()
          Deprecated. Use generic Parser.getHandler().
 void parseString(char[] cs, int start, int end)
          Deprecated. Parse the specified character slice as a string input.
 
Methods inherited from class com.aliasi.corpus.StringParser
parse
 
Methods inherited from class com.aliasi.corpus.Parser
getHandler, parse, parse, parseString, setHandler
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RegexLineTagParser

public RegexLineTagParser(String matchRegex,
                          int tokenGroup,
                          int tagGroup,
                          String ignoreRegex,
                          String eosRegex)
Deprecated. 
Construct a regular expression tag parser from the specified regular expressions and indexes. See the class documentation for further information.

Parameters:
matchRegex - Regular expression for matching tokens and tags.
tokenGroup - Index of group in regular expression for token.
tagGroup - Index of group in regular expression for tag.
ignoreRegex - Lines matching this regular expression are skipped.
eosRegex - Regular expression matching end-of-sentence lines; each match closes the current zone and sends it to the handler.

RegexLineTagParser

@Deprecated
public RegexLineTagParser(TagHandler handler,
                          String matchRegex,
                          int tokenGroup,
                          int tagGroup,
                          String ignoreRegex,
                          String eosRegex)
Deprecated. Being rewritten for new types in 4.0.

Construct a regular expression tag parser from the specified regular expressions and indexes. See the class documentation for further information.

Parameters:
handler - Tag handler for this parser.
matchRegex - Regular expression for matching tokens and tags.
tokenGroup - Index of group in regular expression for token.
tagGroup - Index of group in regular expression for tag.
ignoreRegex - Lines matching this regular expression are skipped.
eosRegex - Regular expression matching end-of-sentence lines; each match closes the current zone and sends it to the handler.

Method Detail

parseString

public void parseString(char[] cs,
                        int start,
                        int end)
Deprecated. 
Description copied from class: Parser
Parse the specified character slice as a string input. Extracted content is passed to the current handler.

Specified by:
parseString in class Parser<TagHandler>
Parameters:
cs - Characters underlying slice.
start - Index of first character in slice.
end - One past the index of the last character in slice.
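
The slice is half-open: start is inclusive and end is exclusive. A minimal sketch of how such a slice maps to a string, using a hypothetical helper named sliceToString that is not part of LingPipe:

```java
// Hypothetical helper illustrating the (cs, start, end) slice convention.
final class SliceSketch {

    // The String constructor takes an offset and a length, so the
    // length is computed as end - start.
    static String sliceToString(char[] cs, int start, int end) {
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] cs = "xxDirk N B-PERyy".toCharArray();
        System.out.println(sliceToString(cs, 2, 14)); // prints Dirk N B-PER
    }
}
```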

getTagHandler

@Deprecated
public TagHandler getTagHandler()
Deprecated. Use generic Parser.getHandler().

Returns the tag handler for this tag parser. This is just a convenience cast of Parser.getHandler().

Returns:
Tag handler.