com.aliasi.corpus.parsers
Class SvmLightClassificationParser

java.lang.Object
  extended by com.aliasi.corpus.Parser<H>
      extended by com.aliasi.corpus.InputSourceParser<H>
          extended by com.aliasi.corpus.LineParser<ClassificationHandler<Vector,Classification>>
              extended by com.aliasi.corpus.parsers.SvmLightClassificationParser

Deprecated. This class will move to the demos in 4.0.

@Deprecated
public class SvmLightClassificationParser
extends LineParser<ClassificationHandler<Vector,Classification>>

The SvmLightClassificationParser class parses (a generalization of) the widely used SVMlight format for vector classification.

The Format

The SVMlight format is line-based, with each line representing a classification instance. A line consists of a category followed by an arbitrary number of feature/value pairs, followed by an optional comment.

The following example is drawn from the SVMlight documentation:

 -1 1:0.43 3:0.12 9284:0.2 # abcdef

The category is -1, the feature 1 has value 0.43, the feature 3 has value 0.12 and the feature 9284 has value 0.2, with # abcdef being a comment.

This class generalizes the format to allow arbitrary string-based categories in addition to the -1, 0 and 1 allowed by SVMlight.

This class also generalizes the format to allow the features to appear in any order and to treat features occurring more than once as having a value equal to the sum of their specified values.

This class does not parse numerical regression data files in which the category is a floating point value, nor does it deal with the ranking mode of SVMlight.

No spaces are allowed around the colons separating dimensions and their values; all other whitespace in a line may be one or more spaces or tabs.

Blank lines are ignored.
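The format rules above can be illustrated with a minimal, standard-library-only sketch. It is not the implementation of this class: the class name SvmLightLineSketch, the nested Instance type, and the choice to return null for blank or comment-only lines are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the generalized line format; not LingPipe code.
final class SvmLightLineSketch {

    /** A category plus sparse dimension-to-value entries. */
    static final class Instance {
        final String category;
        final Map<Integer, Double> values;

        Instance(String category, Map<Integer, Double> values) {
            this.category = category;
            this.values = values;
        }
    }

    /**
     * Parses one line. Returns null for blank lines; treating
     * comment-only lines the same way is an assumption of this sketch.
     */
    static Instance parse(String line) {
        // Drop the optional trailing comment introduced by '#'.
        int hash = line.indexOf('#');
        String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
        if (data.isEmpty()) {
            return null;
        }
        // Tokens are separated by one or more spaces or tabs.
        String[] tokens = data.split("[ \\t]+");
        Map<Integer, Double> values = new TreeMap<>();
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing colon in pair: " + tokens[i]);
            }
            int dimension = Integer.parseInt(tokens[i].substring(0, colon));
            double value = Double.parseDouble(tokens[i].substring(colon + 1));
            // Features may appear in any order; repeated features are summed.
            values.merge(dimension, value, Double::sum);
        }
        // The first token is the category, kept as an arbitrary string.
        return new Instance(tokens[0], values);
    }
}
```

In this sketch, a line such as spam 3:0.5 1:1.0 3:0.25 yields the category spam with dimension 3 holding the sum of its two values.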

Target Vectors and Dimensionality

The vectors produced by this parser will be instances of SparseFloatVector. The dimensionality of these vectors must be specified in the constructor for the parser.

References

The home page and primary reference for SVMlight is:

Many other packages use (some variant of) the SVMlight basic classification format, including:

Since:
LingPipe 3.5
Version:
3.9.1
Author:
Bob Carpenter

Constructor Summary
SvmLightClassificationParser(boolean addIntercept, int dataDimensionality)
          Deprecated. Construct a classification parser for the SVMlight format for data of the specified dimensionality and specified automatic intercept flag.
SvmLightClassificationParser(ClassificationHandler<Vector,Classification> handler, boolean addIntercept, int dataDimensionality)
          Deprecated. See the class doc.
 
Method Summary
 Set<String> categoriesFound()
          Deprecated. Returns an immutable set of the categories that have been found so far.
 int maxDimensionFound()
          Deprecated. Returns the maximum dimension index found so far.
protected  void parseLine(String line, int lineNumber)
          Deprecated. Parses a line of the data, pulling out category and dimension/value pairs.
 
Methods inherited from class com.aliasi.corpus.LineParser
parse
 
Methods inherited from class com.aliasi.corpus.InputSourceParser
parseString
 
Methods inherited from class com.aliasi.corpus.Parser
getHandler, parse, parse, parseString, setHandler
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SvmLightClassificationParser

public SvmLightClassificationParser(boolean addIntercept,
                                    int dataDimensionality)
Deprecated. 
Construct a classification parser for the SVMlight format for data of the specified dimensionality and specified automatic intercept flag.

The dimensionality may be chosen to be Integer.MAX_VALUE without any efficiency problems because the created vectors are sparse.

If the intercept flag is set to true, the value of dimension 0 will always be 1.0. Note that this will override any other value for dimension 0 found in the line.

Parameters:
addIntercept - Flag indicating whether an intercept should be automatically added.
dataDimensionality - Number of dimensions in the data.
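The intercept behavior and the cost of a very large dimensionality can be illustrated with a small sketch. It stands in for a sparse vector with a plain map from dimension to value; the class InterceptSketch and its method withIntercept are hypothetical names, not part of LingPipe.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; a map stands in for a sparse vector.
final class InterceptSketch {

    /**
     * Returns a copy of the parsed entries. If addIntercept is true,
     * dimension 0 is forced to 1.0, replacing any parsed value for it.
     */
    static Map<Integer, Double> withIntercept(Map<Integer, Double> parsed,
                                              boolean addIntercept) {
        Map<Integer, Double> result = new TreeMap<>(parsed);
        if (addIntercept) {
            result.put(0, 1.0); // overrides any value given for dimension 0
        }
        return result;
    }
}
```

Because only the non-zero entries are stored, the memory used by a representation like this depends on the number of entries, not on the declared dimensionality, which is why a bound as large as Integer.MAX_VALUE is harmless.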

SvmLightClassificationParser

@Deprecated
public SvmLightClassificationParser(ClassificationHandler<Vector,Classification> handler,
                                               boolean addIntercept,
                                               int dataDimensionality)
Deprecated. See the class doc.

Construct a classification parser for the SVMlight format for data of the specified dimensionality, specified automatic intercept flag, and specified classification handler.

The dimensionality may be chosen to be Integer.MAX_VALUE without any efficiency problems because the created vectors are sparse.

If the intercept flag is set to true, the value of dimension 0 will always be 1.0.

Parameters:
handler - Classification handler for data.
addIntercept - Flag indicating whether an intercept should be automatically added.
dataDimensionality - Number of dimensions in the data.
Method Detail

parseLine

protected void parseLine(String line,
                         int lineNumber)
Deprecated. 
Parses a line of the data, pulling out category and dimension/value pairs.

Specified by:
parseLine in class LineParser<ClassificationHandler<Vector,Classification>>
Parameters:
line - Line of data to parse.
lineNumber - Number of line being parsed.
Throws:
NumberFormatException - If there is a dimension that is not parsable as an integer or a value that is not parsable as a double.
IllegalArgumentException - If the line is ill-formed in any other way.
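Since java.lang.NumberFormatException is a subclass of IllegalArgumentException, callers that want to tell the two cases apart need to catch the more specific exception first. The sketch below shows that ordering on a single dimension/value token; LineErrorSketch and classify are hypothetical names, and the checks are a simplified stand-in for whatever validation this class actually performs.

```java
// Hypothetical sketch of distinguishing the two failure modes.
final class LineErrorSketch {

    /** Classifies one dimension:value token as "ok", "number format" or "ill-formed". */
    static String classify(String token) {
        try {
            int colon = token.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("No colon in pair: " + token);
            }
            Integer.parseInt(token.substring(0, colon));     // dimension must be an int
            Double.parseDouble(token.substring(colon + 1));  // value must be a double
            return "ok";
        } catch (NumberFormatException e) {
            // Must come first: it is a subclass of IllegalArgumentException.
            return "number format";
        } catch (IllegalArgumentException e) {
            return "ill-formed";
        }
    }
}
```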

maxDimensionFound

public int maxDimensionFound()
Deprecated. 
Returns the maximum dimension index found so far. If the parser runs over all data of interest, the dimensionality may be set to one plus the returned value.

Returns:
The maximum dimension index found in the data so far.
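The relationship between the maximum dimension index and the dimensionality can be sketched with a first pass over the data that only tracks the largest index seen. DimensionScanSketch and its methods are hypothetical, and returning -1 when no dimension has been seen is a choice made for this sketch, not documented behavior of this class.

```java
import java.util.List;

// Hypothetical sketch of a first pass that sizes the vectors.
final class DimensionScanSketch {

    /** Largest dimension index in the lines, or -1 if none was seen (sketch-only convention). */
    static int maxDimensionFound(List<String> lines) {
        int max = -1;
        for (String line : lines) {
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue; // blank lines are ignored
            }
            String[] tokens = data.split("[ \\t]+");
            // Token 0 is the category; the rest are dimension:value pairs.
            for (int i = 1; i < tokens.length; i++) {
                int colon = tokens[i].indexOf(':');
                if (colon < 0) {
                    throw new IllegalArgumentException("Missing colon in pair: " + tokens[i]);
                }
                max = Math.max(max, Integer.parseInt(tokens[i].substring(0, colon)));
            }
        }
        return max;
    }

    /** Dimensions are indexed from 0, so the dimensionality is one plus the largest index. */
    static int dimensionality(List<String> lines) {
        return maxDimensionFound(lines) + 1;
    }
}
```

For the example line from the SVMlight documentation, the largest index is 9284, so a dimensionality of 9285 is sufficient.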

categoriesFound

public Set<String> categoriesFound()
Deprecated. 
Returns an immutable set of the categories that have been found so far.

Returns:
An immutable set of the categories that have been found so far.