com.aliasi.corpus.parsers
Class GeneTagChunkParser

java.lang.Object
  extended by com.aliasi.corpus.Parser<H>
      extended by com.aliasi.corpus.StringParser<ObjectHandler<Chunking>>
          extended by com.aliasi.corpus.parsers.GeneTagChunkParser

Deprecated. This class will move to the demos in 4.0.

@Deprecated
public class GeneTagChunkParser
extends StringParser<ObjectHandler<Chunking>>

The GeneTagChunkParser class parses the offset-annotated first-best GeneTag named entity corpus into a chunk-based representation. GeneTag was created at the United States National Center for Biotechnology Information (NCBI) and is part of their MedTag distribution, which also includes a part-of-speech corpus (see MedPostPosParser).

NCBI distributes the GeneTag corpus freely for public use as a "United States Government Work" (see the included README file for more information).

The GeneTag corpus is distributed in both a tagged version and an offset representation. The offset representation supplies raw sentences in medtag/genetag/genetag.sent:

 P00010943A0733
 Flurazepam thus appears to be an effective hypnotic drug with the optimum dose for use in general practice being 15 mg at night.
 P00013683A0210
 When extracorporeal CO2 removal approximated CO2 production (VCO2), alveolar ventilation almost ceased.
 ...
 P00001606T0076
 Comparison with alkaline phosphatases and 5-nucleotidase
 ...
 
Every other line is an identifier containing the character 'P', the PubMed identifier for the MEDLINE citation, either 'A' or 'T' depending on whether the sentence is from the abstract or title, and then a character offset into the abstract.
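
The identifier layout described above can be split into its parts with a regular expression. The following is a minimal sketch, not part of LingPipe: the class GeneTagSentenceId and its fields are hypothetical names, and the assumption that both numeric fields consist only of decimal digits is taken from the sample identifiers shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of LingPipe.
final class GeneTagSentenceId {
    // 'P', PubMed identifier digits, 'A' (abstract) or 'T' (title), offset digits.
    private static final Pattern ID_PATTERN = Pattern.compile("P(\\d+)([AT])(\\d+)");

    final String pubMedId;   // kept as a string to preserve leading zeros
    final boolean fromTitle; // true for 'T', false for 'A'
    final int offset;        // the trailing character offset field

    private GeneTagSentenceId(String pubMedId, boolean fromTitle, int offset) {
        this.pubMedId = pubMedId;
        this.fromTitle = fromTitle;
        this.offset = offset;
    }

    // Parses one identifier line; rejects anything that does not fit the layout.
    static GeneTagSentenceId parse(String line) {
        Matcher m = ID_PATTERN.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a GeneTag sentence identifier: " + line);
        }
        return new GeneTagSentenceId(m.group(1), m.group(2).equals("T"),
                                     Integer.parseInt(m.group(3)));
    }

    public static void main(String[] args) {
        GeneTagSentenceId id = parse("P00001606T0076");
        System.out.println(id.pubMedId + " "
                           + (id.fromTitle ? "title" : "abstract") + " " + id.offset);
    }
}
```

The PubMed identifier is kept as a string so that the leading zeros seen in the corpus are not lost.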

First-best gold-standard taggings are in medtag/genetag/Gold.format:

 P00001606T0076|14 33|alkaline phosphatases
 P00001606T0076|37 50|5-nucleotidase
 P00015731A0090|36 52|carbonic anhydrase
 ...
 
These are arranged one per line, beginning with the sentence identifier, then the start and end character offsets, and finally the text of the chunk, with the three fields separated by vertical bars.
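
A line of the gold format file can be broken into its fields by splitting on the vertical bar. The sketch below is for illustration only and is not LingPipe's own code; the class name GoldLine and its field names are hypothetical, and it assumes the two offsets are separated by whitespace as in the sample lines above.

```java
// Hypothetical value class for illustration; not part of LingPipe.
final class GoldLine {
    final String sentenceId; // e.g. P00001606T0076
    final int start;         // first offset on the line
    final int end;           // second offset on the line
    final String text;       // chunk text as given in the file

    private GoldLine(String sentenceId, int start, int end, String text) {
        this.sentenceId = sentenceId;
        this.start = start;
        this.end = end;
        this.text = text;
    }

    // Parses a line of the form: identifier|start end|text
    static GoldLine parse(String line) {
        // Limit of 3 keeps any further '|' characters inside the text field.
        String[] fields = line.trim().split("\\|", 3);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        String[] offsets = fields[1].trim().split("\\s+");
        if (offsets.length != 2) {
            throw new IllegalArgumentException("Expected 2 offsets: " + line);
        }
        return new GoldLine(fields[0],
                            Integer.parseInt(offsets[0]),
                            Integer.parseInt(offsets[1]),
                            fields[2]);
    }

    public static void main(String[] args) {
        GoldLine g = parse("P00001606T0076|14 33|alkaline phosphatases");
        System.out.println(g.sentenceId + " [" + g.start + "," + g.end + "] " + g.text);
    }
}
```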

Warning: The offsets only count non-whitespace characters. Thus the term "alkaline phosphatases" does not show up between characters 14 and 33 as one might expect. Consider the following numbering:

 Comparison with alkaline phosphatases and 5-nucleotidase
 01234567890123456789012345678901234567890123456789012345
 0         1         2         3         4         5
 
The characters between 14 and 33 inclusive are "h alkaline phosphata", not the phrase "alkaline phosphatases" we are looking for. Instead, the terms are indexed without counting whitespace. Thus the appropriate numbering is actually:

 Comparison with alkaline phosphatases and 5-nucleotidase
 0123456789 0123 45678901 234567890123 456 78901234567890
 0          1          2          3           4         5
 
and it's evident that the desired phrase now runs from characters numbered 14 to 33 inclusive.
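
Offsets that skip whitespace can be mapped back to ordinary Java string offsets by walking the sentence and counting only non-whitespace characters. The following is a minimal sketch under stated assumptions, not LingPipe's implementation: the class and method names are hypothetical, and using Character.isWhitespace as the whitespace test is an assumption, since the corpus documentation quoted here does not define whitespace precisely.

```java
// Hypothetical helper for illustration; not part of LingPipe.
final class NonWhitespaceOffsets {

    // Converts inclusive offsets that ignore whitespace into a Java span
    // {start, end} where start is inclusive and end is exclusive.
    static int[] toCharSpan(String sentence, int start, int end) {
        int charStart = -1;
        int count = 0; // non-whitespace characters seen before position i
        for (int i = 0; i < sentence.length(); i++) {
            if (Character.isWhitespace(sentence.charAt(i))) {
                continue; // whitespace does not advance the corpus offset
            }
            if (count == start) {
                charStart = i;
            }
            if (count == end) {
                if (charStart < 0) {
                    throw new IllegalArgumentException("start is greater than end");
                }
                return new int[] { charStart, i + 1 };
            }
            count++;
        }
        throw new IllegalArgumentException("offsets are outside the sentence");
    }

    public static void main(String[] args) {
        String s = "Comparison with alkaline phosphatases and 5-nucleotidase";
        int[] span = toCharSpan(s, 14, 33);
        System.out.println(s.substring(span[0], span[1]));
    }
}
```

Returning an exclusive end offset lets the result be passed directly to String.substring, which follows the same convention.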

Because the corpus is split across two files, this parser cannot be implemented as neatly as the other ones. Instead, the gold format file must be provided in the constructor so that its chunks are already available when the sentence file is parsed.

The authors of the corpus do not indicate its character set, but creating a histogram over the bytes shows that the data set contains only 87 distinct ASCII characters. In general, MEDLINE titles and abstracts may contain non-ASCII Latin characters (see the description in MEDLINE characters overview and the full character set in MEDLINE character database).
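
A byte histogram of the kind mentioned above can be computed in a few lines. This is a sketch for illustration, not the code used to produce the figure quoted above; the class ByteHistogram and its methods are hypothetical names, and the file to inspect is taken from the first command-line argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical utility for illustration; not part of LingPipe.
final class ByteHistogram {

    // Counts how often each of the 256 possible byte values occurs.
    static long[] count(byte[] data) {
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++; // mask to treat the byte as unsigned
        }
        return counts;
    }

    // Number of distinct byte values that occur at least once.
    static int distinct(long[] counts) {
        int n = 0;
        for (long c : counts) {
            if (c > 0) n++;
        }
        return n;
    }

    // True if no byte value above 127 occurs, i.e. the data is pure ASCII.
    static boolean allAscii(long[] counts) {
        for (int i = 128; i < 256; i++) {
            if (counts[i] > 0) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        long[] counts = count(Files.readAllBytes(Paths.get(args[0])));
        System.out.println("distinct bytes: " + distinct(counts));
        System.out.println("ASCII only: " + allAscii(counts));
    }
}
```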

Since:
LingPipe2.1
Version:
3.9.1
Author:
Bob Carpenter

Field Summary
static String GENE_CHUNK_TYPE
          Deprecated. The type assigned to the chunks extracted by this parser, namely "GENE".
 
Constructor Summary
GeneTagChunkParser(File goldFormatFile)
          Deprecated. Construct a GeneTag chunk parser with the specified gold standard file and no specified handler.
GeneTagChunkParser(File goldFormatFile, ObjectHandler<Chunking> handler)
          Deprecated. Construct a GeneTag chunk parser with the specified gold standard file and the specified chunk handler.
 
Method Summary
 ObjectHandler<Chunking> getChunkHandler()
          Deprecated. Use generic Parser.getHandler() instead.
 void parseString(char[] cs, int start, int end)
          Deprecated. Parse the specified character slice as a string input.
 
Methods inherited from class com.aliasi.corpus.StringParser
parse
 
Methods inherited from class com.aliasi.corpus.Parser
getHandler, parse, parse, parseString, setHandler
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

GENE_CHUNK_TYPE

public static final String GENE_CHUNK_TYPE
Deprecated. 
The type assigned to the chunks extracted by this parser, namely "GENE".

See Also:
Constant Field Values

Constructor Detail

GeneTagChunkParser

public GeneTagChunkParser(File goldFormatFile)
                   throws IOException
Deprecated. 
Construct a GeneTag chunk parser with the specified gold standard file and no specified handler.

Parameters:
goldFormatFile - The gold standard format file.
Throws:
IOException - If there is an I/O error reading the gold standard file.

GeneTagChunkParser

public GeneTagChunkParser(File goldFormatFile,
                          ObjectHandler<Chunking> handler)
                   throws IOException
Deprecated. 
Construct a GeneTag chunk parser with the specified gold standard file and the specified chunk handler.

Parameters:
goldFormatFile - The gold standard format file.
handler - Chunk handler for this parser.
Throws:
IOException - If there is an I/O error reading the gold standard file.

Method Detail

getChunkHandler

@Deprecated
public ObjectHandler<Chunking> getChunkHandler()
Deprecated. Use generic Parser.getHandler() instead.

Returns the chunk handler for this parser.

Returns:
The chunk handler.

parseString

public void parseString(char[] cs,
                        int start,
                        int end)
Deprecated. 
Description copied from class: Parser
Parse the specified character slice as a string input. Extracted content is passed to the current handler.

Specified by:
parseString in class Parser<ObjectHandler<Chunking>>
Parameters:
cs - Characters underlying slice.
start - Index of first character in slice.
end - One past the index of the last character in slice.
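
The slice convention above (start inclusive, end exclusive) means the slice length is end - start. The sketch below illustrates that convention by turning a slice of sentence-file content into identifier and sentence pairs. It is not the body of parseString; the class SentenceSliceReader is a hypothetical name, and it assumes identifier and sentence lines strictly alternate with no blank lines in between.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of LingPipe.
final class SentenceSliceReader {

    // Reads alternating identifier and sentence lines from cs[start..end).
    static Map<String, String> readSentences(char[] cs, int start, int end) {
        // end is one past the last character, so the length is end - start.
        String content = new String(cs, start, end - start);
        String[] lines = content.split("\\r?\\n");
        Map<String, String> sentences = new LinkedHashMap<>();
        for (int i = 0; i + 1 < lines.length; i += 2) {
            sentences.put(lines[i].trim(), lines[i + 1]);
        }
        return sentences;
    }

    public static void main(String[] args) {
        char[] cs = ("P00001606T0076\n"
                     + "Comparison with alkaline phosphatases and 5-nucleotidase\n")
                    .toCharArray();
        System.out.println(readSentences(cs, 0, cs.length));
    }
}
```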