Class Muc6ChunkParser

  extended by com.aliasi.corpus.Parser<H>
      extended by com.aliasi.corpus.InputSourceParser<H>
          extended by com.aliasi.corpus.XMLParser<ObjectHandler<Chunking>>
              extended by com.aliasi.corpus.parsers.Muc6ChunkParser

Deprecated. This class will move to the demos in 4.0.

public class Muc6ChunkParser
extends XMLParser<ObjectHandler<Chunking>>

A Muc6ChunkParser parses MUC6-formatted named-entity corpora in XML.

SGML to XML Munging

Because the MUC corpora are formatted using SGML, we employed a program to munge the actual data by replacing unknown entity references with simple equivalents.

We also added a DTD declaration with the UTF-8 character format (the original data is all in the ASCII range, 0-127). Finally, we removed STORYID and SLUG elements and all of their content.
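A minimal sketch of such a cleanup pass follows, using only the JDK. The class name `Muc6MungeSketch` and the two entity mappings are hypothetical placeholders, since the actual replacement table is not reproduced here. An XML declaration naming UTF-8 stands in for the declaration described above, and the element removal assumes STORYID and SLUG elements are never nested.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an SGML-to-XML cleanup pass; not the program actually used.
final class Muc6MungeSketch {

    // HYPOTHETICAL entity references and their plain-text stand-ins.
    static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("&MD;", "--");     // assumed mapping
        REPLACEMENTS.put("&AMP;", "&amp;"); // assumed mapping
    }

    static String munge(String sgml) {
        String text = sgml;
        // Replace each unknown entity reference with its simple equivalent.
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            text = text.replace(e.getKey(), e.getValue());
        }
        // Remove STORYID and SLUG elements together with all of their content.
        text = text.replaceAll("(?s)<(STORYID|SLUG)\\b[^>]*>.*?</\\1>", "");
        // Prepend a declaration naming UTF-8 (assumed form of the declaration).
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + text;
    }

    public static void main(String[] args) {
        System.out.println(munge("<DOC><SLUG>x</SLUG><s>A &MD; B</s></DOC>"));
    }
}
```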

Corpus Format Requirements

The data files must be well-formed XML, as an XML parser is used to parse them. Training is restricted to the sentence (s) elements, in which entities are wrapped in ENAMEX elements. The only requirement for this format is that it be organized by sentence, with named entities marked with the ENAMEX element, as in:

 <s> After 20 years of pushing labor proposals to
 overhaul the nation's health-care system, <ENAMEX
 TYPE="ORGANIZATION">the AFL-CIO</ENAMEX> is finding interest from
 an unlikely quarter: big business.  </s>

Any other containing elements, such as the paragraph (p) elements in the MUC6 data, will be ignored. There should be no additional element markup within the s elements other than the ENAMEX elements. These ENAMEX elements must have an attribute TYPE whose value is the entity type of the element. For most of the chunkers, extra whitespace does not matter; the extra whitespace above is courtesy of the original corpus.
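To make the format rules concrete, here is a sketch of how a SAX handler could pull typed entity spans out of such a file using only the JDK. It is not this class's implementation: the name `EnamexSketch` and the `TYPE:start-end:text` string encoding are assumptions made for illustration. Offsets are counted in characters from the start of the enclosing sentence, and ENAMEX elements outside a sentence are skipped.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: collects ENAMEX spans found inside sentence elements.
final class EnamexSketch extends DefaultHandler {
    final List<String> chunks = new ArrayList<>(); // "TYPE:start-end:text"
    private final String sentenceTag;
    private StringBuilder sentence; // non-null only while inside a sentence
    private String type;            // TYPE attribute of the open ENAMEX
    private int start;              // sentence offset where the entity begins

    EnamexSketch(String sentenceTag) { this.sentenceTag = sentenceTag; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals(sentenceTag)) {
            sentence = new StringBuilder();
        } else if (sentence != null && qName.equals("ENAMEX")) {
            type = atts.getValue("TYPE");
            start = sentence.length();
        }
    }

    @Override
    public void characters(char[] ch, int off, int len) {
        if (sentence != null) sentence.append(ch, off, len);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals(sentenceTag)) {
            sentence = null;
        } else if (sentence != null && qName.equals("ENAMEX")) {
            chunks.add(type + ":" + start + "-" + sentence.length()
                       + ":" + sentence.substring(start));
            type = null;
        }
    }

    // Parses an XML string and returns the entity spans it contains.
    static List<String> extract(String xml, String sentenceTag) {
        EnamexSketch handler = new EnamexSketch(sentenceTag);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.chunks;
    }
}
```

Passing the sentence tag as a parameter mirrors the idea that only content inside sentence elements is considered; paragraph or document wrappers need no special handling because they are simply never matched.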

Author:
Bob Carpenter

Constructor Summary
Muc6ChunkParser()
          Deprecated. Construct a MUC6 chunk parser with no handler specified.
Muc6ChunkParser(ObjectHandler<Chunking> handler)
          Deprecated. Construct a MUC6 chunk parser with the specified chunk handler.
Method Summary
protected  DefaultHandler getXMLHandler()
          Deprecated. Return the default handler for SAX events.
 void setSentenceTag(String tag)
          Deprecated. Sets the value of the sentence tag to be the specified value.
Methods inherited from class com.aliasi.corpus.XMLParser
Methods inherited from class com.aliasi.corpus.InputSourceParser
Methods inherited from class com.aliasi.corpus.Parser
getHandler, parse, parse, parseString, setHandler
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


public Muc6ChunkParser()
Construct a MUC6 chunk parser with no handler specified.


public Muc6ChunkParser(ObjectHandler<Chunking> handler)
Construct a MUC6 chunk parser with the specified chunk handler.

Parameters:
handler - Chunk handler for the parser.
Method Detail


protected DefaultHandler getXMLHandler()
Description copied from class: XMLParser
Return the default handler for SAX events. This default handler should wrap the Handler specified for this class and pass events to it extracted from the XML. Typical concrete implementations of this method will extract the underlying handler using Parser.getHandler() and wrap it in a default handler.

This method is called exactly once in each parse method in this class. Thus dynamic updates to the underlying handler may be picked up by this adapter method.

Specified by:
getXMLHandler in class XMLParser<ObjectHandler<Chunking>>
Returns:
SAX handler for XML parsing.
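The wrapping pattern described for this method can be illustrated with a small JDK-only analogue. Everything below is hypothetical (`SketchXmlParser`, `SketchSentenceParser`, and the use of `Consumer<String>` as the handler type) and is not LingPipe code. It only shows why fetching the SAX adapter once per parse lets a handler that was replaced between parses take effect.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a parser that owns a replaceable handler.
abstract class SketchXmlParser<H> {
    private H handler;
    H getHandler() { return handler; }
    void setHandler(H handler) { this.handler = handler; }

    // Subclasses adapt the current handler to SAX events.
    protected abstract DefaultHandler getXMLHandler();

    void parseString(String xml) {
        try {
            // Requested once per parse, so a newly set handler is picked up.
            DefaultHandler sax = getXMLHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), sax);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}

// Passes the text of each <s> element to the wrapped handler.
final class SketchSentenceParser extends SketchXmlParser<Consumer<String>> {
    @Override
    protected DefaultHandler getXMLHandler() {
        final Consumer<String> target = getHandler(); // read at parse time
        return new DefaultHandler() {
            private StringBuilder buf;
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                if (q.equals("s")) buf = new StringBuilder();
            }
            @Override
            public void characters(char[] ch, int off, int len) {
                if (buf != null) buf.append(ch, off, len);
            }
            @Override
            public void endElement(String u, String l, String q) {
                if (q.equals("s")) { target.accept(buf.toString()); buf = null; }
            }
        };
    }
}
```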


public void setSentenceTag(String tag)
Sets the value of the sentence tag to be the specified value. Only elements within sentences will be picked up by the parser.

Parameters:
tag - Tag marking sentence elements.