Class AbstractMedTagParser

  extended by com.aliasi.corpus.Parser<H>
      extended by com.aliasi.corpus.StringParser<TagHandler>
          extended by com.aliasi.corpus.parsers.AbstractMedTagParser
Direct Known Subclasses:
GeneTagParser, MedPostPosParser

Deprecated. This class will move to the demos in 4.0.

public abstract class AbstractMedTagParser
extends StringParser<TagHandler>

The AbstractMedTagParser class provides an adapter for NCBI's MedTag corpora, including GeneTag and MedPost. The MedTag format is sentence based, consisting of a number of pairs of lines of the following form:

 P00073344A0367
 tok_tag tok_tag ... tok_tag
 P00083846T0000
 tok_tag tok_tag ... tok_tag

The initial part of the first line, P00073344, provides the PubMed identifier of the citation from which the text was extracted. The second part of the first line, A0367, indicates that the sentence comes from the abstract, beginning at character offset 367. Text may be extracted from titles or abstracts; the third line marks a sentence beginning at the first character (index 0000) of the title (T) of the citation with PubMed ID 83846.
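
As an illustration of the identifier layout described above, the following sketch decodes one identifier line into its three fields. The class MedTagId and its fields are hypothetical names introduced here and are not part of LingPipe; the pattern assumes only what the description states: a leading P, the PubMed ID digits, an A or T section marker, and the offset digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of LingPipe): decodes a MedTag identifier line. */
final class MedTagId {
    // 'P', PubMed ID digits, 'A' (abstract) or 'T' (title), character offset digits.
    private static final Pattern ID_LINE = Pattern.compile("P(\\d+)([AT])(\\d+)");

    final int pubMedId;
    final char section;
    final int offset;

    private MedTagId(int pubMedId, char section, int offset) {
        this.pubMedId = pubMedId;
        this.section = section;
        this.offset = offset;
    }

    /** Parses a line such as P00073344A0367; rejects anything else. */
    static MedTagId parse(String line) {
        Matcher m = ID_LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a MedTag identifier line: " + line);
        }
        return new MedTagId(
            Integer.parseInt(m.group(1)),   // leading zeros are dropped
            m.group(2).charAt(0),
            Integer.parseInt(m.group(3)));
    }

    public static void main(String[] args) {
        MedTagId id = MedTagId.parse("P00073344A0367");
        System.out.println(id.pubMedId + " " + id.section + " " + id.offset);
    }
}
```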

The second and fourth lines each consist of a sequence of token/tag pairs, with the token and tag in each pair separated by an underscore. The tags are part-of-speech tags in the MedPost corpus and chunk entity tags in the GeneTag corpus. Note that with this format, whitespace information is lost.
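
A minimal sketch of how such a token line could be split into parallel token and tag arrays follows. The class MedTagLine is a hypothetical name, not the library's implementation; the sketch assumes that pairs are separated by whitespace and that the tag follows the last underscore in each pair, so a token that itself contains an underscore is kept whole. The sample sentence and its tags are made up for illustration.

```java
/** Hypothetical helper (not part of LingPipe): splits a tok_tag line. */
final class MedTagLine {
    final String[] tokens;
    final String[] tags;

    private MedTagLine(String[] tokens, String[] tags) {
        this.tokens = tokens;
        this.tags = tags;
    }

    /** Splits on whitespace, then on the last underscore of each pair. */
    static MedTagLine parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new MedTagLine(new String[0], new String[0]);
        }
        String[] pairs = trimmed.split("\\s+");
        String[] tokens = new String[pairs.length];
        String[] tags = new String[pairs.length];
        for (int i = 0; i < pairs.length; i++) {
            int sep = pairs[i].lastIndexOf('_');
            if (sep < 0) {
                throw new IllegalArgumentException("Missing tag in: " + pairs[i]);
            }
            tokens[i] = pairs[i].substring(0, sep);
            tags[i] = pairs[i].substring(sep + 1);
        }
        return new MedTagLine(tokens, tags);
    }

    public static void main(String[] args) {
        // Made-up sentence; the tag names are illustrative only.
        MedTagLine parsed = MedTagLine.parse("The_DD gene_NN was_VBD expressed_VVN ._.");
        for (int i = 0; i < parsed.tokens.length; i++) {
            System.out.println(parsed.tokens[i] + "\t" + parsed.tags[i]);
        }
    }
}
```

Because the pairs are split on whitespace, the original spacing between tokens cannot be recovered from the result.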

Subclasses must override the parseTokensTags(String[],String[],String[]) method to actually do the parsing of a sentence once its tags are extracted.

For more information on the MedTag project, see:

Author:
Bob Carpenter

Constructor Summary
AbstractMedTagParser()
          Deprecated. Construct an abstract MedTag parser with no handler specified.
AbstractMedTagParser(TagHandler handler)
          Deprecated. Moving in 4.0.
Method Summary
 void parseString(char[] cs, int start, int end)
          Deprecated. Parse the specified input source and send extracted taggings to the current handler.
protected abstract  void parseTokensTags(String[] tokens, String[] whitespaces, String[] tags)
          Deprecated. This method handles the raw tokens and tags pulled from a MedTag corpus.
 TagHandler tagHandler()
          Deprecated. Use generic Parser.getHandler() instead.
Methods inherited from class com.aliasi.corpus.StringParser
Methods inherited from class com.aliasi.corpus.Parser
getHandler, parse, parse, parseString, setHandler
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


public AbstractMedTagParser()
Construct an abstract MedTag parser with no handler specified.


public AbstractMedTagParser(TagHandler handler)
Deprecated. Moving in 4.0.

Construct an abstract MedTag parser with the specified tag handler.

Parameters:
handler - Tag handler.
Method Detail


public TagHandler tagHandler()
Deprecated. Use generic Parser.getHandler() instead.

Returns the tag handler for this parser.

Returns:
The tag handler for this parser.


public void parseString(char[] cs,
                        int start,
                        int end)
Parse the specified input source and send extracted taggings to the current handler. This string should correspond to the contents of an input file.

Specified by:
parseString in class Parser<TagHandler>
Parameters:
cs - Character array underlying the string.
start - Index of the first character of the string.
end - Index of one past the last character of the string.
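
Since end is one past the last character, the range is half-open and the string length is end - start. The short sketch below shows the corresponding slice using the standard String(char[], int, int) constructor, which takes an offset and a count; CharRangeDemo is a hypothetical name used only for illustration.

```java
/** Hypothetical demo class: shows the half-open [start, end) convention. */
final class CharRangeDemo {
    /** Returns the string covered by cs[start] up to, but excluding, cs[end]. */
    static String slice(char[] cs, int start, int end) {
        // The String constructor takes (value, offset, count), so count is end - start.
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] buffer = "xxP00073344A0367yy".toCharArray();
        System.out.println(slice(buffer, 2, 16));
    }
}
```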


protected abstract void parseTokensTags(String[] tokens,
                                        String[] whitespaces,
                                        String[] tags)
This method handles the raw tokens and tags pulled from a MedTag corpus. This method must be implemented by subclasses, and it must call the contained handler on the result.

Parameters:
tokens - Raw tokens to handle.
whitespaces - Raw whitespaces to handle.
tags - Raw tags to handle.
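
The contract above (transform the raw arrays as needed, then call the contained handler) might be sketched as follows. To keep the example self-contained, SketchTagHandler and SketchMedTagParser are hypothetical stand-ins that mirror only the shape of the handler and parser described on this page; they are not the LingPipe classes. The tag normalization and sample data are likewise invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Demo entry point: collects what the handler receives. */
final class SubclassSketchDemo {
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        SketchTagHandler collector = (toks, ws, tags) -> {
            for (int i = 0; i < toks.length; i++) {
                seen.add(toks[i] + "/" + tags[i]);
            }
        };
        UpperCaseTagParser parser = new UpperCaseTagParser(collector);
        // Whitespace contents are left empty in this sketch.
        parser.parseTokensTags(
            new String[] {"p53", "binds"}, new String[0], new String[] {"gene", "o"});
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

/** Hypothetical stand-in for a tag handler. */
interface SketchTagHandler {
    void handle(String[] tokens, String[] whitespaces, String[] tags);
}

/** Hypothetical stand-in for the abstract parser: holds a handler. */
abstract class SketchMedTagParser {
    private final SketchTagHandler handler;

    SketchMedTagParser(SketchTagHandler handler) {
        this.handler = handler;
    }

    SketchTagHandler getHandler() {
        return handler;
    }

    protected abstract void parseTokensTags(
        String[] tokens, String[] whitespaces, String[] tags);
}

/** Example subclass: upper-cases the tags, then forwards to the handler. */
final class UpperCaseTagParser extends SketchMedTagParser {
    UpperCaseTagParser(SketchTagHandler handler) {
        super(handler);
    }

    @Override
    protected void parseTokensTags(String[] tokens, String[] whitespaces, String[] tags) {
        String[] normalized = new String[tags.length];
        for (int i = 0; i < tags.length; i++) {
            normalized[i] = tags[i].toUpperCase(Locale.ROOT);
        }
        // The subclass is responsible for passing the result to the handler.
        getHandler().handle(tokens, whitespaces, normalized);
    }
}
```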