com.aliasi.corpus.parsers
Class GeniaSentenceParser

java.lang.Object
  extended by com.aliasi.corpus.Parser<H>
      extended by com.aliasi.corpus.InputSourceParser<H>
          extended by com.aliasi.corpus.XMLParser
              extended by com.aliasi.corpus.parsers.GeniaSentenceParser

public class GeniaSentenceParser
extends XMLParser

A GeniaSentenceParser provides a chunk parser for the XML version of the GENIA corpus. The type assigned to sentence chunks is the constant SentenceChunker.SENTENCE_CHUNK_TYPE. The parser returns only the sentences from citation abstracts, not the sentences in citation titles.

The following example is drawn from the initial part of the merged 3.02 version of the GENIA corpus (with some content elided and replaced by ellipses (...), but all spaces and line breaks left as is):

<set>
<article>
<articleinfo>
<bibliomisc>MEDLINE:95369245</bibliomisc>
</articleinfo>
<title>
<sentence>...</sentence>
</title>
<abstract>
<sentence><w c="NN">Activation</w> <w c="IN">of</w> <w c="DT">the</w> <cons lex="CD28_surface_receptor" sem="G#protein_family_or_group"><cons lex="CD28" sem="G#protein_molecule"><w c="NN">CD28</w></cons> <w c="NN">surface</w> <w c="NN">receptor</w></cons> <w c="VBZ">provides</w> <w c="DT">a</w> <w c="JJ">major</w> <w c="JJ">costimulatory</w> <w c="NN">signal</w> <w c="IN">for</w> <cons lex="T_cell_activation" sem="G#other_name"><w c="NN">T</w> <w c="NN">cell</w> <w c="NN">activation</w></cons> <w c="VBG">resulting</w> <w c="IN">in</w> <w c="VBN">enhanced</w> <w c="NN">production</w> <w c="IN">of</w> <cons lex="interleukin-2" sem="G#protein_molecule"><w c="NN">interleukin-2</w></cons> <w c="(">(</w><cons lex="IL-2" sem="G#protein_molecule"><w c="NN">IL-2</w></cons><w c=")">)</w> <w c="CC">and</w> <cons lex="cell_proliferation" sem="G#other_name"><w c="NN">cell</w> <w c="NN">proliferation</w></cons><w c=".">.</w></sentence>
<sentence>...</sentence>
...
 
Parsing only requires extracting all of the text content (including informative spaces) from the sentence elements.
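The extraction described above can be sketched with the JDK's SAX API alone. This is a minimal illustration under stated assumptions, not LingPipe's implementation: the class GeniaSentenceSketch, the nested SentenceCollector, and the method extractAbstractSentences are hypothetical names, and the sketch returns plain strings rather than sentence chunks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not part of LingPipe.
final class GeniaSentenceSketch {

    /** Collects the text of each sentence element nested inside an abstract element. */
    static final class SentenceCollector extends DefaultHandler {
        final List<String> sentences = new ArrayList<>();
        private final StringBuilder buf = new StringBuilder();
        private boolean inAbstract = false;
        private boolean inSentence = false;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("abstract".equals(qName)) {
                inAbstract = true;
            } else if (inAbstract && "sentence".equals(qName)) {
                // Start a fresh buffer; sentences under title are skipped.
                inSentence = true;
                buf.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Keep all character data, including spaces between w elements.
            if (inSentence) {
                buf.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("abstract".equals(qName)) {
                inAbstract = false;
            } else if (inSentence && "sentence".equals(qName)) {
                inSentence = false;
                sentences.add(buf.toString());
            }
        }
    }

    /** Parses the XML string and returns the text of each abstract sentence. */
    static List<String> extractAbstractSentences(String xml) throws Exception {
        SentenceCollector collector = new SentenceCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), collector);
        return collector.sentences;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<set><article>"
            + "<title><sentence><w c=\"NN\">Title</w></sentence></title>"
            + "<abstract><sentence><w c=\"NN\">Activation</w> <w c=\"IN\">of</w> "
            + "<cons lex=\"CD28\" sem=\"G#protein_molecule\"><w c=\"NN\">CD28</w></cons>"
            + "<w c=\".\">.</w></sentence></abstract>"
            + "</article></set>";
        // prints [Activation of CD28.]
        System.out.println(extractAbstractSentences(xml));
    }
}
```

The markup inside each sentence (w and cons elements) is ignored; only character data is accumulated, so the spacing of the original text is preserved.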

The GENIA corpus is available free of charge from:

Since:
LingPipe2.1.1
Version:
2.1.1
Author:
Bob Carpenter

Field Summary
static String GENIA_ABSTRACT_ELT
          The tag used for abstract elements in GENIA, namely abstract.
static String GENIA_SENTENCE_ELT
          The tag used for sentence elements in GENIA, namely sentence.
 
Constructor Summary
GeniaSentenceParser()
          Construct a GENIA sentence chunk parser with no designated chunk handler.
GeniaSentenceParser(ChunkHandler handler)
          Construct a GENIA sentence chunk parser with the specified chunk handler.
 
Method Summary
 ChunkHandler getChunkHandler()
          Returns the chunk handler for this sentence parser.
protected  DefaultHandler getXMLHandler()
          Returns the embedded XML handler.
 void setHandler(Handler handler)
          Sets the handler to the specified chunk handler.
 
Methods inherited from class com.aliasi.corpus.XMLParser
parse
 
Methods inherited from class com.aliasi.corpus.InputSourceParser
parseString
 
Methods inherited from class com.aliasi.corpus.Parser
getHandler, parse, parse, parseString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

GENIA_SENTENCE_ELT

public static final String GENIA_SENTENCE_ELT
The tag used for sentence elements in GENIA, namely sentence.

See Also:
Constant Field Values

GENIA_ABSTRACT_ELT

public static final String GENIA_ABSTRACT_ELT
The tag used for abstract elements in GENIA, namely abstract.

See Also:
Constant Field Values

Constructor Detail

GeniaSentenceParser

public GeniaSentenceParser()
                    throws SAXException
Construct a GENIA sentence chunk parser with no designated chunk handler. Chunk handlers may be later set using the method setHandler(Handler).

Throws:
SAXException - If there is an error configuring the SAX XML reader required for parsing.

GeniaSentenceParser

public GeniaSentenceParser(ChunkHandler handler)
                    throws SAXException
Construct a GENIA sentence chunk parser with the specified chunk handler.

Parameters:
handler - The chunk handler used to process sentences found by this parser.
Throws:
SAXException - If there is an error configuring the SAX XML reader required for parsing.

Method Detail

getXMLHandler

protected DefaultHandler getXMLHandler()
Returns the embedded XML handler. This method implements the required method for the abstract superclass XMLParser.

Specified by:
getXMLHandler in class XMLParser
Returns:
The XML handler for this class.

setHandler

public void setHandler(Handler handler)
Sets the handler to the specified chunk handler. If the handler is not a chunk handler, an IllegalArgumentException is thrown.

Overrides:
setHandler in class Parser
Parameters:
handler - New chunk handler.
Throws:
IllegalArgumentException - If the handler is not a chunk handler.

getChunkHandler

public ChunkHandler getChunkHandler()
Returns the chunk handler for this sentence parser. The result is the same as that returned by the superclass method Parser.getHandler(), but cast to type ChunkHandler.

Returns:
The chunk handler for this sentence parser.
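
The relationship between setHandler(Handler) and getChunkHandler() follows a common pattern: a subclass narrows the handler type at set time so that a later cast is safe. The sketch below illustrates that pattern only. Every type in it (Handler, ChunkHandler, BaseParser, SentenceParser) is a hypothetical stand-in defined locally, not the LingPipe class of the same or similar name, and the handle(String) signature is an assumption made for illustration.

```java
// Hypothetical sketch of the type-check-then-cast pattern; not LingPipe code.
final class HandlerCheckSketch {

    /** Stand-in for a general handler marker interface. */
    interface Handler { }

    /** Stand-in for a handler that accepts chunk text (signature assumed). */
    interface ChunkHandler extends Handler {
        void handle(String chunkText);
    }

    /** Stand-in for a generic parser base class that stores any handler. */
    static class BaseParser {
        private Handler handler;

        public void setHandler(Handler handler) {
            this.handler = handler;
        }

        public Handler getHandler() {
            return handler;
        }
    }

    /** Subclass that only accepts chunk handlers. */
    static class SentenceParser extends BaseParser {
        @Override
        public void setHandler(Handler handler) {
            // Reject anything that is not a chunk handler. In this sketch a
            // null argument is rejected too, because instanceof is false for null.
            if (!(handler instanceof ChunkHandler)) {
                throw new IllegalArgumentException("Handler must be a chunk handler.");
            }
            super.setHandler(handler);
        }

        public ChunkHandler getChunkHandler() {
            // Safe because setHandler only ever stores chunk handlers.
            return (ChunkHandler) getHandler();
        }
    }

    public static void main(String[] args) {
        SentenceParser parser = new SentenceParser();
        ChunkHandler printer = text -> System.out.println("sentence: " + text);
        parser.setHandler(printer);
        parser.getChunkHandler().handle("Example sentence.");
        try {
            parser.setHandler(new Handler() { });
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Checking the type in the setter keeps the typed getter free of runtime surprises: the cast can only fail if the stored handler bypassed the check.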