com.aliasi.tag
Class LineTaggingParser
java.lang.Object
com.aliasi.corpus.Parser<H>
com.aliasi.corpus.StringParser<ObjectHandler<Tagging<String>>>
com.aliasi.tag.LineTaggingParser
public class LineTaggingParser
- extends StringParser<ObjectHandler<Tagging<String>>>
Provides a parser for taggings that extracts zone boundaries and
token/tag pairs from lines of data using regular expressions.
This is a useful base implementation for CoNLL and other formats,
which zone inputs by sentence and provide a single token per line.
The parser is specified by means of three regular expressions.
If the ignore regular expression is matched, an input line is
ignored. This is useful for ignoring empty lines and comments in
some inputs. The eos regular expression recognizes lines that are
ends of sentences. Whenever such a line is found, the zone
currently being processed is sent to the handler. Finally, the
match regular expression is used to extract tags and tokens from
input lines, with the token index and tag index specifying the
subgroup matched in the regular expression.
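The line-by-line dispatch described above can be sketched with plain java.util.regex. This is an illustration only, not the LingPipe implementation: the class name LineZoneSketch and the method zones are hypothetical, whole-line matching with Matcher.matches() is an assumption, and throwing on a line that matches none of the three expressions is likewise an assumption. Zones are returned as lists of "token/tag" strings rather than being sent to a handler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the ignore / eos / match dispatch; not LingPipe code.
final class LineZoneSketch {

    /** Groups the lines of input into zones of "token/tag" strings. */
    public static List<List<String>> zones(String input,
                                           String matchRegex,
                                           int tokenGroup,
                                           int tagGroup,
                                           String ignoreRegex,
                                           String eosRegex) {
        Pattern match = Pattern.compile(matchRegex);
        Pattern ignore = Pattern.compile(ignoreRegex);
        Pattern eos = Pattern.compile(eosRegex);

        List<List<String>> zones = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : input.split("\n")) {
            // 1. Ignored lines are skipped before anything else is tried.
            if (ignore.matcher(line).matches()) continue;
            // 2. An end-of-sentence line closes the zone being built.
            if (eos.matcher(line).matches()) {
                if (!current.isEmpty()) {
                    zones.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            // 3. Every other line must yield a token and a tag.
            Matcher m = match.matcher(line);
            if (!m.matches()) {
                // Assumption: unparseable lines are treated as errors.
                throw new IllegalArgumentException("Unparseable line: " + line);
            }
            current.add(m.group(tokenGroup) + "/" + m.group(tagGroup));
        }
        // Emit a final zone that was not closed by an eos line.
        if (!current.isEmpty()) zones.add(current);
        return zones;
    }

    public static void main(String[] args) {
        String input = "# comment\nthe DT\ndog NN\n\nbarks VBZ\n";
        System.out.println(zones(input, "(\\S+)\\s(\\S+)", 1, 2, "#.*", ""));
    }
}
```

Checking the ignore expression first matters when an ignorable line would also satisfy the match expression, since it would otherwise be read as a token/tag pair.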
Here is a worked example for the CoNLL 2002 data set, a subsequence
of which looks like:
-DOCSTART- -DOCSTART- O
Met Prep O
tien Num O
miljoen Num O
komen V O
we Pron O
, Punc O
denk V O
ik Pron O
, Punc O
al Adv O
een Art O
heel Adj O
eind N O
. Punc O
Dirk N B-PER
...
And here are the regular expressions used to parse it:
    String TOKEN_TAG_LINE_REGEX
        = "(\\S+)\\s(\\S+\\s)?(O|[BI]-\\S+)"; // token ?posTag entityTag
    int TOKEN_GROUP = 1; // token
    int TAG_GROUP = 3;   // entityTag
    String IGNORE_LINE_REGEX
        = "-DOCSTART(.*)"; // lines that start with "-DOCSTART"
    String EOS_REGEX
        = "\\A\\Z";        // empty/blank lines
    Parser<ObjectHandler<Tagging<String>>> parser
        = new LineTaggingParser(TOKEN_TAG_LINE_REGEX,
                                TOKEN_GROUP, TAG_GROUP,
                                IGNORE_LINE_REGEX,
                                EOS_REGEX);
Lines starting with "-DOCSTART" are ignored and blank lines end
sentences. Tokens and entity tags are extracted by matching the
regular expression and pulling out match group 1 as the token and
match group 3 as the tag. An optional part-of-speech tag between
the token and the entity tag on the line is ignored.
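To see how the groups fall out on individual lines, here is a small standalone sketch using java.util.regex. The class name ConllGroupSketch and the method tokenAndTag are hypothetical, and whole-line matching is an assumption. The sketch writes the entity prefix class as [BI], which accepts the same B- and I- prefixes as the expression above without also admitting a literal '|' character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of group extraction for CoNLL-style lines.
final class ConllGroupSketch {

    // Group 1: token, group 2: optional POS tag, group 3: entity tag.
    static final Pattern TOKEN_TAG_LINE =
        Pattern.compile("(\\S+)\\s(\\S+\\s)?(O|[BI]-\\S+)");

    /** Returns {token, entityTag}, or null if the line does not match. */
    public static String[] tokenAndTag(String line) {
        Matcher m = TOKEN_TAG_LINE.matcher(line);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(3) };
    }

    public static void main(String[] args) {
        String[] lines = { "Dirk N B-PER", "Met Prep O", "Dirk B-PER" };
        for (String line : lines) {
            String[] tt = tokenAndTag(line);
            System.out.println(tt[0] + " -> " + tt[1]);
        }
    }
}
```

Note that a line such as "-DOCSTART- -DOCSTART- O" also satisfies this expression, which is why the ignore expression has to exclude it.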
- Since:
- LingPipe 3.9.1
- Version:
- 3.9.1
- Author:
- Bob Carpenter
Constructor Summary
LineTaggingParser(String matchRegex, int tokenGroup, int tagGroup, String ignoreRegex, String eosRegex)
    Construct a regular expression tagging parser from the specified regular expressions and indexes.
Method Summary
void parseString(char[] cs, int start, int end)
    Parse the specified character slice as a string input.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LineTaggingParser
public LineTaggingParser(String matchRegex,
int tokenGroup,
int tagGroup,
String ignoreRegex,
String eosRegex)
- Construct a regular expression tagging parser from the
specified regular expressions and indexes. See the class
documentation for further information.
- Parameters:
matchRegex - Regular expression for matching tokens and tags.
tokenGroup - Index of group in regular expression for token.
tagGroup - Index of group in regular expression for tag.
ignoreRegex - Lines matching this regular expression are skipped.
eosRegex - Matches end of sentence for grouping handle events.
parseString
public void parseString(char[] cs,
int start,
int end)
- Description copied from class:
Parser
- Parse the specified character slice as a string input. Extracted
content is passed to the current handler.
- Specified by:
parseString in class Parser<ObjectHandler<Tagging<String>>>
- Parameters:
cs - Characters underlying slice.
start - Index of first character in slice.
end - One past the index of the last character in slice.
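The slice convention is half-open: start is inclusive and end is exclusive, so the slice holds end - start characters. A minimal sketch of recovering the slice as a string follows; the class name CharSliceSketch and the method slice are hypothetical and are not part of LingPipe.

```java
// Hypothetical helper illustrating the (cs, start, end) slice convention.
final class CharSliceSketch {

    /** Returns the characters cs[start] through cs[end - 1] as a String. */
    public static String slice(char[] cs, int start, int end) {
        // String(char[], offset, count) takes a length, not an end index.
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] cs = "Met Prep O\ntien Num O".toCharArray();
        System.out.println(slice(cs, 0, cs.length)); // the whole array
        System.out.println(slice(cs, 4, 8));         // one field of line one
    }
}
```

Passing 0 and cs.length therefore covers the whole array.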