java.lang.Object
  com.aliasi.corpus.Parser<H>
    com.aliasi.corpus.StringParser<TextHandler>
      com.aliasi.corpus.parsers.GigawordTextParser
public class GigawordTextParser
A text parser for the Linguistic Data Consortium's English Gigaword Corpus. This parser extracts the text from the articles tagged as stories, passing it to the handler one story at a time.
There is an extensive read-me for the corpus available online:
README File for the Gigaword English Text Corpus
The Gigaword corpus contains over 1.75 billion words of English.
Roughly 1.5 billion words of that appear in documents of type
story. The distribution directory is organized by
news source: Agence France-Presse, Associated Press, New York Times,
and Xinhua. In addition to stories, there are three other types of
document: multi-part blurbs, advisories to editors, and other. The
"other" category includes lists such as sports scores and stock
prices. Unfortunately, this division is only approximate:
stock listings and scores appear in some stories, too. This
is expected given the corpus README, which states:
... the most frequent classification error will tend to be the use of `` type="story" '' on DOCs that are actually some other type.
The corpus is distributed in files organized by source. Each file is roughly 12MB gzipped and 36MB unzipped. The format is as a sequence of SGML documents. Here's an example document from the distribution's read-me:
    <DOC id="..." type="...">
    <HEADLINE>
    The Headline Element is Optional -- not all DOCs have one
    </HEADLINE>
    <DATELINE>
    The Dateline Element is Optional -- not all DOCs have one
    </DATELINE>
    <TEXT>
    <P>
    Paragraph tags are only used if the 'type' attribute of the DOC
    happens to be "story"
    </P>
    <P>
    Note that all data files use the UNIX-standard "\n" form of line
    termination, and text lines are generally wrapped to a width of 80
    characters or less.
    </P>
    </TEXT>
    </DOC>
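The document layout shown in the read-me example can be scanned with a couple of regular expressions. The following is a minimal sketch, not LingPipe's implementation: the class name `StoryExtractorSketch` and the method `storyTexts` are hypothetical, and the pattern assumes the `id` attribute precedes `type`, as in the read-me example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of LingPipe.
final class StoryExtractorSketch {

    // One DOC element: group 1 is the type attribute, group 2 is the body.
    // Assumes attributes appear in the order id, type.
    private static final Pattern DOC = Pattern.compile(
        "<DOC\\s+id=\"[^\"]*\"\\s+type=\"([^\"]*)\"\\s*>(.*?)</DOC>",
        Pattern.DOTALL);

    // The TEXT element inside a DOC body: group 1 is its raw content.
    private static final Pattern TEXT = Pattern.compile(
        "<TEXT>(.*?)</TEXT>", Pattern.DOTALL);

    // Returns the raw TEXT content of each DOC whose type is "story".
    static List<String> storyTexts(String sgml) {
        List<String> result = new ArrayList<>();
        Matcher doc = DOC.matcher(sgml);
        while (doc.find()) {
            if (!"story".equals(doc.group(1))) {
                continue; // skip multi-part blurbs, advisories, and other
            }
            Matcher text = TEXT.matcher(doc.group(2));
            if (text.find()) {
                result.add(text.group(1));
            }
        }
        return result;
    }
}
```

The returned strings still contain the paragraph tags and line breaks of the source file; any cleanup is a separate step.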
This parser extracts the text content of the TEXT
elements within documents of type story. The only
characters appearing in the corpus are printable ASCII characters
including whitespace. Newlines are replaced with spaces, a tab
character is inserted to start each paragraph, and the two
variations of the single entity used, &AMP; and
&amp;, are replaced with the ampersand
(&) character. No other transformations on the
text are performed by this parser. The creators of the corpus note
somewhat confusingly:
All other [besides ampersand escape] specialized control characters have been filtered out, and unusual punctuation (such as the underscore character, used in NYT and APW to represent an "em-dash" character) has been left as-is, or converted to simple equivalents (e.g. hyphens).
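The three transformations described above (newlines to spaces, a tab at the start of each paragraph, and ampersand entity replacement) can be sketched in a few lines. This is an illustration under stated assumptions, not the parser's actual code: the class `TextNormalizerSketch` and method `normalize` are hypothetical names, and the exact whitespace the real parser leaves around paragraph boundaries is not specified here and may differ.

```java
// Hypothetical helper for illustration only; not part of LingPipe.
final class TextNormalizerSketch {

    // Applies the documented transformations to the raw content of a
    // TEXT element. Whitespace around paragraph boundaries is an
    // assumption of this sketch.
    static String normalize(String textContent) {
        return textContent
            .replace("</P>", "")      // closing paragraph tags carry no text
            .replace("<P>", "\t")     // a tab starts each paragraph
            .replace('\n', ' ')       // newlines become spaces
            .replace("&AMP;", "&")    // upper-case entity variant
            .replace("&amp;", "&");   // lower-case entity variant
    }

    public static void main(String[] args) {
        String raw = "\n<P>\nAT&amp;T shares rose.\n</P>\n<P>\nR&AMP;D spending fell.\n</P>\n";
        System.out.println(normalize(raw));
    }
}
```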
Links to the sources for the corpus are:
| Constructor Summary | |
|---|---|
| GigawordTextParser() | Construct a Gigaword text parser with a null handler. |
| GigawordTextParser(TextHandler handler) | Construct a Gigaword text parser with the specified text handler as the current handler. |
| Method Summary | |
|---|---|
| void | parseString(char[] cs, int start, int end): Parse the specified character slice as a Gigaword document, passing the text content of stories to the contained handler. |
| Methods inherited from class com.aliasi.corpus.StringParser |
|---|
parse |
| Methods inherited from class com.aliasi.corpus.Parser |
|---|
getHandler, parse, parse, parseString, setHandler |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
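| Constructor Detail |
|---|
public GigawordTextParser()
Construct a Gigaword text parser with a null handler.
public GigawordTextParser(TextHandler handler)
Construct a Gigaword text parser with the specified text handler as the current handler.
Parameters:
handler - Text handler for extracted text.

| Method Detail |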
|---|
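public void parseString(char[] cs,
int start,
int end)
throws IOException
Parse the specified character slice as a Gigaword document, passing the text content of stories to the contained handler.
Specified by:
parseString in class Parser<TextHandler>
Parameters:
cs - Underlying characters.
start - Index of first character.
end - Index of one past the last character.
Throws:
IOException - If there is an exception reading from the specified input stream.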
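The `start` and `end` parameters describe a half-open slice: `start` is inclusive and `end` is one past the last character, so the slice length is `end - start`. The sketch below illustrates that convention with a hypothetical helper (`SliceSketch.slice` is not part of LingPipe); a caller wanting to parse an entire array would pass `0` and `cs.length`.

```java
// Hypothetical helper for illustration only; not part of LingPipe.
final class SliceSketch {

    // Returns the characters in [start, end) as a string.
    static String slice(char[] cs, int start, int end) {
        if (start < 0 || end > cs.length || start > end) {
            throw new IndexOutOfBoundsException(
                "start=" + start + " end=" + end + " length=" + cs.length);
        }
        // String(char[], offset, count) takes a count, not an end index.
        return new String(cs, start, end - start);
    }

    public static void main(String[] args) {
        char[] cs = "abcdef".toCharArray();
        System.out.println(slice(cs, 1, 4));         // prints "bcd"
        System.out.println(slice(cs, 0, cs.length)); // prints "abcdef"
    }
}
```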