com.aliasi.corpus.parsers
Class Reuters21578Parser

java.lang.Object
  extended by com.aliasi.corpus.Parser<H>
      extended by com.aliasi.corpus.StringParser<ClassificationHandler<CharSequence,Classification>>
          extended by com.aliasi.corpus.parsers.Reuters21578Parser

Deprecated. This class will move to the demos in 4.0.

@Deprecated
public class Reuters21578Parser
extends StringParser<ClassificationHandler<CharSequence,Classification>>

A Reuters21578Parser provides a parser for the Reuters-21578 text categorization test collection. The Reuters collection consists of business stories from Reuters published in the 1980s. There are a total of 123 topics, all having to do with business (e.g. "money-supply" and "earn"), but the count of their training documents ranges from almost 4000 down to 1; there are only 26 topics with 100 or more training documents.

The parser produces classifications for the handler that are binary, relative to a specified topic. That is, a topic such as earn is fixed, and the classifications are binary, assigning the accept category to a document that is tagged as belonging to the earn topic in the corpus, and assigning a reject category to documents not assigned to the earn topic. The categories are the default accept and reject categories in BinaryLMClassifier, namely BinaryLMClassifier.DEFAULT_ACCEPT_CATEGORY and BinaryLMClassifier.DEFAULT_REJECT_CATEGORY.
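
The binary labeling rule described above can be sketched in a few lines of plain Java. This is only an illustration, not LingPipe code: the class BinaryTopicLabeler and the label strings "accept" and "reject" are hypothetical stand-ins for the parser's internals and for the BinaryLMClassifier constants.

```java
import java.util.Set;

// Illustrative sketch only; this class is hypothetical and not part of LingPipe.
class BinaryTopicLabeler {
    // Stand-in labels (assumption). The real parser uses
    // BinaryLMClassifier.DEFAULT_ACCEPT_CATEGORY and
    // BinaryLMClassifier.DEFAULT_REJECT_CATEGORY.
    static final String ACCEPT = "accept";
    static final String REJECT = "reject";

    private final String targetTopic;

    BinaryTopicLabeler(String targetTopic) {
        this.targetTopic = targetTopic;
    }

    // Accept a document if it is tagged with the fixed target topic,
    // and reject it otherwise, regardless of its other topics.
    String label(Set<String> documentTopics) {
        return documentTopics.contains(targetTopic) ? ACCEPT : REJECT;
    }
}
```

With the topic fixed to earn, a document tagged with both earn and acq would be labeled accept, while a document tagged only with crude would be labeled reject.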

Corpus Factory

The typical usage scenario involves setting up a parser and then parsing all of the SGML documents making up the corpus. It is also possible to encapsulate this logic using a higher-order interface, Corpus. The static factory method corpus(String,File) constructs an implementation of Corpus from the Reuters collection for a specified topic. This higher-order interface is useful for batch learning algorithms that require a corpus as input, such as the perceptron classifier.

Available Topics

Here is the complete list of topics found in the Reuters corpus. The counts are simply the result of scraping the topic tags, and are not necessarily the counts of topics in the ModApte split. The topics are drawn from general subjects, economic indicators, corporate reports, currencies, and commodities (including energy). The topics are described further in the following file distributed with the corpus:

Rank  Topic           Count
1     earn            3987
2     acq             2448
3     money           991
4     fx              801
5     crude           634
6     grain           628
7     trade           552
8     interest        513
9     wheat           306
10    ship            305
11    corn            255
12    oil             238
13    dlr             217
14    gas             195
15    oilseed         192
16    supply          190
17    sugar           184
18    gnp             163
19    coffee          145
20    veg             137
21    gold            135
22    nat             130
23    soybean         120
24    bop             116
25    livestock       114
26    cpi             112
27    reserves        84
28    meal            82
29    copper          78
30    cocoa           76
31    jobs            76
32    carcass         75
33    yen             69
34    iron            67
35    rice            67
36    steel           67
37    cotton          66
38    ipi             65
39    alum            63
40    barley          54
41    soy             52
42    feed            51
43    rubber          51
44    zinc            44
45    palm            43
46    chem            41
47    pet             41
48    silver          37
49    lead            35
50    rapeseed        35
51    sorghum         35
52    tin             33
53    metal           32
54    strategic       32
55    wpi             32
56    orange          29
57    fuel            28
58    hog             27
59    retail          27
60    heat            25
61    housing         21
62    stg             21
63    income          18
64    lei             17
65    lumber          17
66    sunseed         17
67    dmk             15
68    tea             15
69    oat             14
70    coconut         13
71    cattle          12
72    groundnut       12
73    platinum        12
74    nickel          11
75    sun             10
76    l               9
77    rape            9
78    jet             8
79    debt            7
80    instal          7
81    inventories     7
82    naphtha         7
83    potato          6
84    propane         6
85    austdlr         4
86    belly           4
87    cpu             4
88    nzdlr           4
89    plywood         4
90    pork            4
91    tapioca         4
92    cake            3
93    can             3
94    copra           3
95    dfl             3
96    f               3
97    lin             3
98    lit             3
99    nkr             3
100   palladium       3
101   palmkernel      3
102   rand            3
103   saudriyal       3
104   sfr             3
105   castor          2
106   cornglutenfeed  2
107   fishmeal        2
108   linseed         2
109   rye             2
110   wool            2
111   bean            1
112   bfr             1
113   castorseed      1
114   citruspulp      1
115   cottonseed      1
116   cruzado         1
117   dkr             1
118   hk              1
119   peseta          1
120   red             1
121   ringgit         1
122   rupiah          1
123   skr             1

Modified Apte Split

This class distinguishes between training and test data based on the encoding in the corpus itself. Each document in the corpus is marked as being either a test or training document (or neither). This class parses either or both of the test and training documents from the corpus based on the boolean flags provided to the constructor.

This class uses the "Modified Apte" (ModApte) split of the corpus into training and test segments, which is defined as follows (see the README cited below for more details):

Category  Number of Documents  SGML Pattern
Training  9,603                LEWISSPLIT="TRAIN"; TOPICS="YES"
Test      3,299                LEWISSPLIT="TEST"; TOPICS="YES"
Unused    8,676                LEWISSPLIT="NOT-USED"; TOPICS="YES"
                               or TOPICS="NO"
                               or TOPICS="BYPASS"

Note that some of the listed topics occur only in the unused portion of the corpus; see the README cited below for more information.
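
The split rule in the table can be expressed as a small decision function. The following is a minimal sketch under the assumption that the LEWISSPLIT and TOPICS attribute values have already been extracted from a document's SGML tag as strings; the class ModApteSplit and its method are hypothetical and do not show how this parser is actually implemented.

```java
// Illustrative sketch only; this class is hypothetical and not part of LingPipe.
class ModApteSplit {
    enum Segment { TRAINING, TEST, UNUSED }

    // lewisSplit and topics are assumed to be the values of the LEWISSPLIT
    // and TOPICS attributes of a single document.
    static Segment segment(String lewisSplit, String topics) {
        if (!"YES".equals(topics)) {
            return Segment.UNUSED;      // TOPICS="NO" or TOPICS="BYPASS"
        }
        if ("TRAIN".equals(lewisSplit)) {
            return Segment.TRAINING;    // LEWISSPLIT="TRAIN"; TOPICS="YES"
        }
        if ("TEST".equals(lewisSplit)) {
            return Segment.TEST;        // LEWISSPLIT="TEST"; TOPICS="YES"
        }
        return Segment.UNUSED;          // LEWISSPLIT="NOT-USED"; TOPICS="YES"
    }
}
```

Checking TOPICS first mirrors the table: a document with TOPICS="NO" or TOPICS="BYPASS" is unused whatever its LEWISSPLIT value.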

Corpus Organization

The corpus is distributed as 22 SGML files encoded in ASCII (reut2-000.sgm through reut2-021.sgm). It is these SGML files which are parsed by this parser.
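
For code that needs to locate the corpus files, the 22 file names can be generated from the naming pattern above. This is a minimal sketch; the class ReutersFileNames is a hypothetical helper, not something this class provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; this class is hypothetical and not part of LingPipe.
class ReutersFileNames {
    // Returns reut2-000.sgm through reut2-021.sgm, in order.
    static List<String> fileNames() {
        List<String> names = new ArrayList<String>();
        for (int i = 0; i < 22; ++i) {
            // %03d zero-pads the index to three digits; Locale.ROOT keeps ASCII digits.
            names.add(String.format(Locale.ROOT, "reut2-%03d.sgm", i));
        }
        return names;
    }
}
```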

Obtaining the Corpus

The Reuters-21578 collection may be downloaded for research purposes from:

It is distributed with the following read-me file, which provides (1) the exact licensing terms, (2) the format of the corpus, and (3) a set of references.

Since:
LingPipe 3.2.1
Version:
3.9.1
Author:
Bob Carpenter

Constructor Summary
Reuters21578Parser(String topic, boolean includeTrainingDocuments, boolean includeTestDocuments)
          Deprecated. Construct a Reuters-21578 test collection parser for the specified topic that includes test and/or training documents as specified.
 
Method Summary
static String[] availableTopics()
          Deprecated. Returns an array consisting of all of the available topics in the Reuters collection.
static Corpus<ClassificationHandler<CharSequence,Classification>> corpus(String topic, File directory)
          Deprecated. See class documentation.
static boolean isAvailableTopic(String topic)
          Deprecated. Returns true if the specified topic is available in the Reuters collection.
 void parseString(char[] cs, int start, int end)
          Deprecated. Implements the parser for character array slices.
 
Methods inherited from class com.aliasi.corpus.StringParser
parse
 
Methods inherited from class com.aliasi.corpus.Parser
getHandler, parse, parse, parseString, setHandler
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Reuters21578Parser

public Reuters21578Parser(String topic,
                          boolean includeTrainingDocuments,
                          boolean includeTestDocuments)
Deprecated. 
Construct a Reuters-21578 test collection parser for the specified topic that includes test and/or training documents as specified. See the corpus documentation above for a description of the corpus itself.

The topic specified must be available as part of the Reuters classification. If it isn't, the constructor will raise an illegal argument exception. The set of legal topics is available through availableTopics(), and a topic may be tested through isAvailableTopic(String).

Parameters:
topic - One of the topics in the Reuters collection.
includeTrainingDocuments - Set to true to handle training documents.
includeTestDocuments - Set to true to handle test documents.
Throws:
IllegalArgumentException - If the topic isn't an available topic for the Reuters collection.
Method Detail

parseString

public void parseString(char[] cs,
                        int start,
                        int end)
Deprecated. 
Implements the parser for character array slices. All other parse methods eventually call this implementation.

Specified by:
parseString in class Parser<ClassificationHandler<CharSequence,Classification>>
Parameters:
cs - Underlying character array.
start - Index of first character in the slice.
end - Index of the last character in the slice plus 1.
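
The slice convention used by these parameters (start inclusive, end exclusive) matches the usual Java half-open range, so the slice length is end - start. A minimal sketch of that convention, using a hypothetical helper that is not part of this class:

```java
// Illustrative sketch only; this class is hypothetical and not part of LingPipe.
class SliceDemo {
    // Returns the text covered by the slice (cs, start, end), where end is
    // the index of the last character in the slice plus 1.
    static String sliceToString(char[] cs, int start, int end) {
        return new String(cs, start, end - start);
    }
}
```

For the array holding the characters of "Reuters corpus", the slice (0, 7) covers "Reuters" and the slice (8, 14) covers "corpus".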

availableTopics

public static String[] availableTopics()
Deprecated. 
Returns an array consisting of all of the available topics in the Reuters collection. The complete list is shown in the class javadoc above.

The list is a copy, so changing it has no effect on this class.
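
The copy semantics can be illustrated with a defensive-copy sketch. The class TopicRegistry and its abbreviated topic list are hypothetical; the sketch shows one common way to get this behavior and is not a claim about how Reuters21578Parser stores its topics.

```java
// Illustrative sketch only; this class is hypothetical and not part of LingPipe.
class TopicRegistry {
    // Abbreviated list for illustration; the real collection has 123 topics.
    private static final String[] TOPICS = { "earn", "acq", "crude" };

    // Returns a fresh copy, so callers cannot modify the internal array.
    static String[] availableTopics() {
        return TOPICS.clone();
    }

    // Returns true if the topic is in the internal list.
    static boolean isAvailableTopic(String topic) {
        for (String t : TOPICS) {
            if (t.equals(topic)) {
                return true;
            }
        }
        return false;
    }
}
```

Overwriting an element of the returned array leaves later calls to availableTopics() and isAvailableTopic(String) unaffected.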

Returns:
The topics for the Reuters collection.

isAvailableTopic

public static boolean isAvailableTopic(String topic)
Deprecated. 
Returns true if the specified topic is available in the Reuters collection.

Parameters:
topic - Topic to test.
Returns:
true if the topic is available for classification.

corpus

@Deprecated
public static Corpus<ClassificationHandler<CharSequence,Classification>> corpus(String topic,
                                                                                           File directory)
                                                                         throws IOException
Deprecated. See class documentation.

Returns the corpus representation of the Reuters collection, for the specified topic, reading the SGML files from the specified directory.

The directory specified is read each time the methods of the returned corpus are called. This streams the relevant parts of the corpus as needed, which requires less memory but more time. It also requires the directory and its files to remain in place for as long as the corpus is in use.

Parameters:
topic - Topic for the corpus.
directory - Directory in which to find the corpus files.
Returns:
The corpus for the specified topic.
Throws:
IOException - If there is an underlying I/O error reading the corpus data.
IllegalArgumentException - If the topic is not available in the Reuters collection.