com.aliasi.corpus.parsers
Class GeneTagParser

java.lang.Object
  extended by com.aliasi.corpus.Parser<H>
      extended by com.aliasi.corpus.StringParser<TagHandler>
          extended by com.aliasi.corpus.parsers.AbstractMedTagParser
              extended by com.aliasi.corpus.parsers.GeneTagParser

Deprecated. This class will move to the demos in 4.0.

@Deprecated
public class GeneTagParser
extends AbstractMedTagParser

The GeneTagParser class provides a tag parser for the GeneTag named-entity corpus. GeneTag was created at the United States National Center for Biotechnology Information (NCBI) and is part of their MedTag distribution, which also includes a part-of-speech corpus (see MedPostPosParser).

NCBI distributes the GeneTag corpus freely for public use as a "United States Government Work" (see included README file for more information):

The GeneTag corpus is in the single file /medtag/genetag/genetag.tag relative to the directory into which the distribution is unpacked.

An excerpt of three training sentences in the file, each preceded by its identifier line, is:

 P00073344A0367
 In_TAG 2_TAG subjects_TAG the_TAG phytomitogen_TAG reactivity_TAG of_TAG the_TAG lymphocytes_TAG was_TAG improved_TAG after_TAG treatment_TAG ._TAG
 P00083846T0000
 Albumin_GENE2 and_TAG cyclic_TAG AMP_TAG levels_TAG in_TAG peritoneal_TAG fluids_TAG in_TAG the_TAG child_TAG
 P00088391A0181
 On_TAG the_TAG other_TAG hand_TAG factor_GENE1 IX_GENE1 activity_TAG is_TAG decreased_TAG in_TAG coumarin_TAG treatment_TAG with_TAG factor_GENE2 IX_GENE2 antigen_TAG remaining_TAG normal_TAG ._TAG
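
As a minimal sketch of how a sentence line in this layout could be taken apart (this is not the LingPipe implementation), the following splits a line on whitespace and then separates each item at its last underscore. The class name GeneTagLineSketch and the method splitLine are hypothetical, and the rule that the last underscore separates token from tag is an assumption based on the excerpt above. Identifier lines such as P00073344A0367 are not handled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LingPipe.
class GeneTagLineSketch {

    /**
     * Splits one whitespace-separated line of token_TAG items into
     * {token, tag} pairs. Assumes the last underscore in each item
     * separates the token from its tag.
     */
    static List<String[]> splitLine(String line) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return pairs; // nothing to split
        }
        for (String item : trimmed.split("\\s+")) {
            int sep = item.lastIndexOf('_');
            if (sep < 0) {
                throw new IllegalArgumentException("No tag separator in: " + item);
            }
            pairs.add(new String[] { item.substring(0, sep), item.substring(sep + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints one token and its tag per line, separated by a tab.
        for (String[] pair : splitLine("Albumin_GENE2 and_TAG cyclic_TAG AMP_TAG levels_TAG")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Splitting at the last underscore rather than the first keeps any token that itself contains an underscore intact, under the assumption stated above.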
 
GeneTag marks up individual sentences with a combination of GENE1, GENE2 and TAG tags. A chunk is a contiguous sequence of tokens that all carry the same gene tag, either GENE1 or GENE2. The indices 1 and 2 do not distinguish gene types; they exist so that two adjacent gene mentions can be told apart. In fact, the corpus is annotated such that the gene references alternate between the two indices, even across sentences.

The output tagging is in the standard LingPipe BIO format.
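
The conversion from GENE1, GENE2 and TAG to BIO tags can be sketched as follows. This is an illustration of the scheme described above, not the code used by this class; the class name GeneTagBioSketch and the method toBio are hypothetical. It assumes that a gene token begins a new chunk whenever its tag differs from the tag of the preceding token in the same sentence.

```java
// Hypothetical helper, not part of LingPipe.
class GeneTagBioSketch {

    /**
     * Maps corpus tags (GENE1, GENE2, TAG) for one sentence to BIO tags.
     * A gene tag that repeats the previous tag continues the chunk (I-GENE);
     * a gene tag that differs from the previous tag starts a chunk (B-GENE);
     * anything else is outside a chunk (O).
     */
    static String[] toBio(String[] tags) {
        String[] bio = new String[tags.length];
        String previous = "TAG"; // sentence start behaves like a non-gene tag
        for (int i = 0; i < tags.length; i++) {
            String tag = tags[i];
            if (!tag.startsWith("GENE")) {
                bio[i] = "O";
            } else if (tag.equals(previous)) {
                bio[i] = "I-GENE";
            } else {
                bio[i] = "B-GENE";
            }
            previous = tag;
        }
        return bio;
    }

    public static void main(String[] args) {
        // Tags for "hand factor IX activity" from the third excerpt sentence.
        String[] bio = toBio(new String[] { "TAG", "GENE1", "GENE1", "TAG" });
        System.out.println(String.join(" ", bio));
    }
}
```

Because adjacent gene mentions carry different indices, two genes in a row come out as two separate B-GENE chunks rather than one merged chunk.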

The primary reference for GeneTag is:

Since:
LingPipe 2.1
Version:
3.9.1
Author:
Bob Carpenter

Field Summary
static String B_GENE_TAG
          Deprecated. The tag used to start gene spans, "B-GENE".
static String GENE_TYPE
          Deprecated. The type of gene chunks, namely "GENE".
static String I_GENE_TAG
          Deprecated. The tag used to continue gene spans, "I-GENE".
 
Constructor Summary
GeneTagParser()
          Deprecated. Construct a GeneTag corpus parser with no handler specified.
GeneTagParser(TagHandler handler)
          Deprecated. Moving to demos in 4.0
 
Method Summary
protected  void parseTokensTags(String[] tokens, String[] whitespaces, String[] tags)
          Deprecated. Implementation of the tag normalizer for the GeneTag corpus.
 
Methods inherited from class com.aliasi.corpus.parsers.AbstractMedTagParser
parseString, tagHandler
 
Methods inherited from class com.aliasi.corpus.StringParser
parse
 
Methods inherited from class com.aliasi.corpus.Parser
getHandler, parse, parse, parseString, setHandler
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

GENE_TYPE

public static String GENE_TYPE
Deprecated. 
The type of gene chunks, namely "GENE".


B_GENE_TAG

public static final String B_GENE_TAG
Deprecated. 
The tag used to start gene spans, "B-GENE".


I_GENE_TAG

public static final String I_GENE_TAG
Deprecated. 
The tag used to continue gene spans, "I-GENE".

Constructor Detail

GeneTagParser

public GeneTagParser()
Deprecated. 
Construct a GeneTag corpus parser with no handler specified.


GeneTagParser

@Deprecated
public GeneTagParser(TagHandler handler)
Deprecated. Moving to demos in 4.0

Construct a GeneTag corpus parser with the specified handler.

Parameters:
handler - Tag handler for taggings.
Method Detail

parseTokensTags

protected void parseTokensTags(String[] tokens,
                               String[] whitespaces,
                               String[] tags)
Deprecated. 
Implementation of the tag normalizer for the GeneTag corpus. This method converts the tags from the corpus-specific format into LingPipe's BIO format (B-GENE, I-GENE and O).

Specified by:
parseTokensTags in class AbstractMedTagParser
Parameters:
tokens - Raw tokens to handle.
whitespaces - Raw whitespaces to handle.
tags - Raw tags to handle.