java.lang.Object
com.aliasi.corpus.Parser<H>
com.aliasi.corpus.StringParser<ClassificationHandler<CharSequence,Classification>>
com.aliasi.corpus.parsers.Reuters21578Parser
@Deprecated public class Reuters21578Parser
A Reuters21578Parser provides a parser for the
Reuters-21578 text categorization test collection. The Reuters
collection consists of business stories from Reuters published in
the 1980s. There are a total of 123 topics, all having to do with
business (e.g. "money-supply" and "earn"), but
the count of their training documents ranges from almost 4000 down
to 1; there are only 26 topics with 100 or more training documents.
The parser produces classifications for the handler that are
binary, relative to a specified topic. That is, a topic such as
earn is fixed, and the classifications are binary,
assigning the accept category to a document that is tagged as
belonging to the earn topic in the corpus, and
assigning a reject category to documents not assigned to the
earn topic. The categories are the default accept and
reject categories in BinaryLMClassifier, namely BinaryLMClassifier.DEFAULT_ACCEPT_CATEGORY and BinaryLMClassifier.DEFAULT_REJECT_CATEGORY.
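The binary labeling described above can be sketched as follows. This is a minimal illustration, not the parser's actual implementation: the class `BinaryTopicLabeler` and the strings `"accept"` and `"reject"` are hypothetical stand-ins, since the values of the `BinaryLMClassifier` constants are not given here.

```java
import java.util.Set;

// Hypothetical helper illustrating binary classification relative to one topic.
final class BinaryTopicLabeler {
    // Assumed placeholder values; the real categories are
    // BinaryLMClassifier.DEFAULT_ACCEPT_CATEGORY and DEFAULT_REJECT_CATEGORY.
    static final String ACCEPT = "accept";
    static final String REJECT = "reject";

    private final String topic;

    BinaryTopicLabeler(String topic) {
        this.topic = topic;
    }

    // Returns ACCEPT if the document is tagged with the fixed topic, REJECT otherwise.
    String categoryFor(Set<String> documentTopics) {
        return documentTopics.contains(topic) ? ACCEPT : REJECT;
    }
}
```

With the topic fixed to `earn`, a document tagged with `earn` (possibly among other topics) maps to the accept category, and a document tagged only with `money-supply` maps to the reject category.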
The typical usage scenario involves setting up a parser, then
parsing all of the SGML documents making up the corpus. It's also
possible to encapsulate this logic using a higher-order interface,
that of Corpus. There is a static factory method
corpus(String,File) that constructs an implementation of Corpus
from the Reuters collection for a specified topic. This
higher-order interface is useful for batch learning algorithms
that require a corpus as input, like the perceptron classifier.
Here is the complete list of topics found in the Reuters corpus. This is just the result of scraping the topics, not necessarily the count of topics in the Mod-Apte split. The topics are drawn from general subjects, economic indicators, corporate reports, currencies, and commodities (including energy). The topics are described further in the following file distributed with the corpus:
cat-descriptions_120396.txt
This class distinguishes between training and test data based on the encoding in the corpus itself. Each document in the corpus is marked as being either a test or training document (or neither). This class parses either or both of the test and training documents from the corpus based on the boolean flags provided to the constructor.
This class uses the "Modified Apte" (ModApte) split of the corpus into training and test segments, which is defined as follows (see the README cited below for more details):
| Category | Number of Documents | SGML Pattern |
|---|---|---|
| Training | 9,603 | LEWISSPLIT="TRAIN"; TOPICS="YES" |
| Test | 3,299 | LEWISSPLIT="TEST"; TOPICS="YES" |
| Unused | 8,676 | LEWISSPLIT="NOT-USED"; TOPICS="YES" or TOPICS="NO" or TOPICS="BYPASS" |
Note that some of the listed topics occur only in the unused portion of the corpus; see the README cited below for more information.
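The split rules in the table can be expressed as a small decision function. The sketch below is an assumption about how one might apply the table, not the parser's own code: the class `ModApteSplit`, its `Segment` enum, and the regular expressions are hypothetical, and the input is assumed to be the opening tag of a document carrying the LEWISSPLIT and TOPICS attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper applying the ModApte split table to a document's opening tag.
final class ModApteSplit {
    enum Segment { TRAINING, TEST, UNUSED }

    private static final Pattern LEWISSPLIT = Pattern.compile("LEWISSPLIT=\"([^\"]*)\"");
    private static final Pattern TOPICS = Pattern.compile("TOPICS=\"([^\"]*)\"");

    // Training and test documents both require TOPICS="YES";
    // every other combination falls into the unused segment.
    static Segment segmentOf(String openingTag) {
        String split = attribute(LEWISSPLIT, openingTag);
        String topics = attribute(TOPICS, openingTag);
        if ("YES".equals(topics)) {
            if ("TRAIN".equals(split)) return Segment.TRAINING;
            if ("TEST".equals(split)) return Segment.TEST;
        }
        return Segment.UNUSED;
    }

    // Returns the attribute value matched by the pattern, or null if it is absent.
    private static String attribute(Pattern pattern, String tag) {
        Matcher m = pattern.matcher(tag);
        return m.find() ? m.group(1) : null;
    }
}
```

Note that a document marked LEWISSPLIT="TRAIN" but TOPICS="NO" is treated as unused, which follows from the last row of the table.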
The corpus is distributed as 22 SGML files encoded in ASCII
(reut2-000.sgm through reut2-021.sgm). It is
these SGML files which are parsed by this parser.
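For reference, the 22 file names follow a simple zero-padded pattern, so they can be generated rather than listed by hand. The class and method names below are hypothetical and only illustrate the naming scheme stated above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that enumerates the corpus file names reut2-000.sgm .. reut2-021.sgm.
final class ReutersFileNames {
    static List<String> sgmlFileNames() {
        List<String> names = new ArrayList<>();
        for (int i = 0; i <= 21; i++) {
            // Locale.ROOT keeps the digits ASCII regardless of the default locale.
            names.add(String.format(Locale.ROOT, "reut2-%03d.sgm", i));
        }
        return names;
    }
}
```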
The Reuters-21578 collection may be downloaded for research purposes from:
| Constructor Summary |
|---|
| Reuters21578Parser(String topic, boolean includeTrainingDocuments, boolean includeTestDocuments): Deprecated. Construct a Reuters-21578 test collection parser for the specified topic that includes test and/or training documents as specified. |

| Method Summary | |
|---|---|
| static String[] | availableTopics(): Deprecated. Returns an array consisting of all of the available topics in the Reuters collection. |
| static Corpus<ClassificationHandler<CharSequence,Classification>> | corpus(String topic, File directory): Deprecated. See class documentation. |
| static boolean | isAvailableTopic(String topic): Deprecated. Returns true if the specified topic is available in the Reuters collection. |
| void | parseString(char[] cs, int start, int end): Deprecated. Implements the parser for character array slices. |

| Methods inherited from class com.aliasi.corpus.StringParser |
|---|
| parse |

| Methods inherited from class com.aliasi.corpus.Parser |
|---|
| getHandler, parse, parse, parseString, setHandler |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |

| Constructor Detail |
|---|
public Reuters21578Parser(String topic,
boolean includeTrainingDocuments,
boolean includeTestDocuments)
The topic specified must be available as part of the
Reuters classification. If it isn't, the constructor will throw
an IllegalArgumentException. The set of legal topics is
available through availableTopics(), and a topic
may be tested through isAvailableTopic(String).
Parameters:
topic - One of the topics in the Reuters collection.
includeTrainingDocuments - Set to true to handle training documents.
includeTestDocuments - Set to true to handle test documents.
Throws:
IllegalArgumentException - If the topic isn't an available topic for the Reuters collection.

| Method Detail |
|---|
public void parseString(char[] cs,
int start,
int end)
Specified by:
parseString in class Parser<ClassificationHandler<CharSequence,Classification>>
Parameters:
cs - Underlying character array.
start - Index of first character in the slice.
end - Index of the last character in the slice plus 1.

public static String[] availableTopics()
Returns an array consisting of all of the available topics in the Reuters collection. The array is a copy, so changing it has no effect on this class.
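The copy semantics described above amount to a defensive copy of an internal array. The following is a minimal sketch of that pattern under stated assumptions: `TopicRegistry` is a hypothetical class, and its two-element topic array is an abbreviated placeholder rather than the real topic list.

```java
// Hypothetical registry illustrating a defensively copied topic array.
final class TopicRegistry {
    // Abbreviated placeholder; the real collection has many more topics.
    private static final String[] TOPICS = { "earn", "money-supply" };

    // Returns a fresh copy, so callers cannot modify the internal array.
    static String[] availableTopics() {
        return TOPICS.clone();
    }

    // Checks membership against the internal array, not against any caller's copy.
    static boolean isAvailableTopic(String topic) {
        for (String t : TOPICS) {
            if (t.equals(topic)) return true;
        }
        return false;
    }
}
```

Overwriting an element of the returned array leaves later calls to `availableTopics()` and `isAvailableTopic(String)` unaffected.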
public static boolean isAvailableTopic(String topic)
Returns true if the specified topic is
available in the Reuters collection.
Parameters:
topic - Topic to test.
Returns:
true if the topic is available for classification.
@Deprecated
public static Corpus<ClassificationHandler<CharSequence,Classification>> corpus(String topic,
File directory)
throws IOException
The directory specified is read each time the methods of the returned corpus are called. This streams the relevant parts of the corpus as needed, which requires less memory but more time. It also requires the directory to remain in place for as long as the returned corpus is in use.
Parameters:
topic - Topic for the corpus.
directory - Directory in which to find the corpus files.
Throws:
IOException - If there is an underlying I/O error reading the corpus data.
IllegalArgumentException - If the topic is not available in the Reuters collection.
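The streaming trade-off described above can be illustrated with a corpus object that keeps only the directory reference and re-reads the files on every visit. This is a sketch under assumptions, not the implementation behind corpus(String,File): `StreamingDirectoryCorpus`, its `visit` method, and the `.sgm` file filter are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical corpus that holds no document text in memory between visits.
final class StreamingDirectoryCorpus {
    private final File directory;

    StreamingDirectoryCorpus(File directory) {
        this.directory = directory;
    }

    // Lists and reads the directory on every call, handing one file's text at a time
    // to the handler. The directory must therefore still exist when this is called.
    void visit(Consumer<String> handler) throws IOException {
        File[] files = directory.listFiles((dir, name) -> name.endsWith(".sgm"));
        if (files == null) {
            throw new IOException("Cannot list directory: " + directory);
        }
        Arrays.sort(files); // deterministic file order
        for (File file : files) {
            byte[] bytes = Files.readAllBytes(file.toPath());
            handler.accept(new String(bytes, StandardCharsets.US_ASCII));
        }
    }
}
```

Because nothing is cached, a file added to the directory after construction is picked up by the next visit, and deleting the directory makes later visits fail with an IOException.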