Class GigawordTextParser

  extended by com.aliasi.corpus.Parser<H>
      extended by com.aliasi.corpus.StringParser<TextHandler>
          extended by com.aliasi.corpus.parsers.GigawordTextParser

Deprecated. This class will move to the demos in 4.0.

public class GigawordTextParser
extends StringParser<TextHandler>

A text parser for the Linguistic Data Consortium's English Gigaword Corpus. This parser extracts the text from the articles tagged as stories, passing it to the handler one story at a time.

There is an extensive read-me for the corpus available online:

README File for the Gigaword English Text Corpus

The Gigaword corpus contains over 1.75 billion words of English. Roughly 1.5 billion of those words appear in documents of type story. The distribution directory is organized by news source: Agence France-Presse, AP, New York Times, and Xinhua. In addition to stories, there are three other types of document: multi-part blurbs, advisories to editors, and other. The "other" category includes lists such as sports scores and stock prices. Unfortunately, this division is only approximate, and stock listings and scores appear in some stories, too. This is to be expected given the corpus README, which states:

... the most frequent classification error will tend to be the use of `` type="story" '' on DOCs that are actually some other type.

The corpus is distributed in files organized by source. Each file is roughly 12MB gzipped and 36MB unzipped, and is formatted as a sequence of SGML documents. Here's an example document from the distribution's read-me:

 <DOC id="..." type="...">
 <HEADLINE>
 The Headline Element is Optional -- not all DOCs have one
 </HEADLINE>
 <DATELINE>
 The Dateline Element is Optional -- not all DOCs have one
 </DATELINE>
 <TEXT>
 <P>
 Paragraph tags are only used if the 'type' attribute of the DOC happens
 to be "story"
 </P>
 <P>
 Note that all data files use the UNIX-standard "\n" form of line
 termination, and text lines are generally wrapped to a width of 80
 characters or less.
 </P>
 </TEXT>
 </DOC>

This parser extracts the text content of the TEXT elements within documents of type story. The only characters appearing in the corpus are printable ASCII characters, including whitespace. Newlines are replaced with spaces, a tab character is inserted to start each paragraph, and the two variations of the single entity used, &AMP; and &amp;, are replaced with the ampersand (&) character. No other transformations on the text are performed by this parser. The creators of the corpus note somewhat confusingly:

All other [besides ampersand escape] specialized control characters have been filtered out, and unusual punctuation (such as the underscore character, used in NYT and APW to represent an "em-dash" character) has been left as-is, or converted to simple equivalents (e.g. hyphens).
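The transformations described above (newlines to spaces, a tab before each paragraph, and ampersand entity replacement) can be illustrated with a minimal sketch. This is not the parser's actual implementation: the class GigawordTextSketch and the method normalizeText are hypothetical names, and trimming the whitespace around each paragraph is an assumption made here for readability.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GigawordTextSketch {

    // Matches one paragraph element; DOTALL lets '.' span the wrapped lines.
    private static final Pattern PARAGRAPH =
        Pattern.compile("<P>(.*?)</P>", Pattern.DOTALL);

    /**
     * Hypothetical helper: normalizes the raw content of one TEXT element.
     * Each paragraph is prefixed with a tab, newlines become spaces, and
     * both spellings of the ampersand entity become '&'.
     */
    static String normalizeText(String textElementContent) {
        StringBuilder sb = new StringBuilder();
        Matcher m = PARAGRAPH.matcher(textElementContent);
        while (m.find()) {
            String para = m.group(1)
                .trim()                    // assumption: drop edge whitespace
                .replace('\n', ' ')        // newlines to spaces
                .replace("&AMP;", "&")     // upper-case entity variant
                .replace("&amp;", "&");    // lower-case entity variant
            sb.append('\t').append(para);  // tab starts each paragraph
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "<P>\nAT&AMP;T shares rose\non Monday.\n</P>\n"
                   + "<P>\nR&amp;D spending fell.\n</P>\n";
        // Tabs are shown as the two characters \t to make them visible.
        // prints: \tAT&T shares rose on Monday.\tR&D spending fell.
        System.out.println(normalizeText(raw).replace("\t", "\\t"));
    }
}
```

Because a tab marks the start of each paragraph, a consumer of the normalized text could recover paragraph boundaries by splitting on the tab character.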

Links to the sources for the corpus are:

Author:
Bob Carpenter

Constructor Summary
GigawordTextParser()
          Deprecated. Construct a Gigaword text parser with a null handler.
GigawordTextParser(TextHandler handler)
          Deprecated. See class documentation.
Method Summary
 void parseString(char[] cs, int start, int end)
          Deprecated. See class documentation.
Methods inherited from class com.aliasi.corpus.StringParser
Methods inherited from class com.aliasi.corpus.Parser
getHandler, parse, parse, parseString, setHandler
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


public GigawordTextParser()
Construct a Gigaword text parser with a null handler.


public GigawordTextParser(TextHandler handler)
Deprecated. See class documentation.

Construct a Gigaword text parser with the specified text handler as the current handler.

Parameters:
handler - Text handler for extracted text.
Method Detail


public void parseString(char[] cs,
                                   int start,
                                   int end)
                 throws IOException
Deprecated. See class documentation.

Parse the specified character slice as a Gigaword document, passing the text content of stories to the contained handler. See the class documentation above for more information.

Specified by:
parseString in class Parser<TextHandler>
Parameters:
cs - Underlying characters.
start - Index of first character.
end - Index of one past the last character.
Throws:
IOException - If there is an exception reading from the specified input stream.
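
The slice convention (start inclusive, end exclusive) and the one-callback-per-story behavior can be sketched without the library itself. The code below is a rough, self-contained illustration under stated assumptions, not the class's implementation: StorySliceSketch, SliceHandler, and parseSlice are hypothetical names, SliceHandler is only a stand-in for a text handler callback, the sample documents and their ids are made up, and the paragraph normalization described in the class documentation is reduced here to stripping the P tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StorySliceSketch {

    /** Hypothetical stand-in for a text handler callback. */
    interface SliceHandler {
        void handle(char[] cs, int start, int length);
    }

    // Captures the type attribute and the body of each DOC element.
    private static final Pattern DOC = Pattern.compile(
        "<DOC id=\"[^\"]*\" type=\"([^\"]*)\"\\s*>(.*?)</DOC>", Pattern.DOTALL);

    // Captures the content of the TEXT element inside a DOC body.
    private static final Pattern TEXT =
        Pattern.compile("<TEXT>(.*?)</TEXT>", Pattern.DOTALL);

    /** Passes the text of each story in cs[start, end) to the handler. */
    static void parseSlice(char[] cs, int start, int end, SliceHandler handler) {
        // end is one past the last character, so the length is end - start.
        String in = new String(cs, start, end - start);
        Matcher doc = DOC.matcher(in);
        while (doc.find()) {
            if (!"story".equals(doc.group(1))) {
                continue;  // skip documents that are not stories
            }
            Matcher text = TEXT.matcher(doc.group(2));
            if (!text.find()) {
                continue;  // no TEXT element in this document
            }
            // Simplification: strip paragraph tags and edge whitespace only.
            char[] out = text.group(1).replaceAll("</?P>", "").trim().toCharArray();
            handler.handle(out, 0, out.length);  // one call per story
        }
    }

    public static void main(String[] args) {
        String corpus =
            "<DOC id=\"A1\" type=\"story\">\n<TEXT>\n<P>\nFirst story.\n</P>\n</TEXT>\n</DOC>\n"
          + "<DOC id=\"A2\" type=\"other\">\n<TEXT>\nScores and prices.\n</TEXT>\n</DOC>\n"
          + "<DOC id=\"A3\" type=\"story\">\n<TEXT>\n<P>\nSecond story.\n</P>\n</TEXT>\n</DOC>\n";
        List<String> stories = new ArrayList<>();
        char[] cs = corpus.toCharArray();
        parseSlice(cs, 0, cs.length, (c, s, n) -> stories.add(new String(c, s, n)));
        System.out.println(stories.size() + " stories handled");
    }
}
```

Collecting into a list is only for demonstration; a handler would more typically process each story as it arrives, which avoids holding a whole 36MB file's worth of extracted text in memory.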