com.aliasi.corpus
Class XValidatingObjectCorpus<E>

java.lang.Object
  extended by com.aliasi.corpus.Corpus<ObjectHandler<E>>
      extended by com.aliasi.corpus.XValidatingObjectCorpus<E>
Type Parameters:
E - the type of objects handled.
All Implemented Interfaces:
Handler, ObjectHandler<E>, Serializable

public class XValidatingObjectCorpus<E>
extends Corpus<ObjectHandler<E>>
implements ObjectHandler<E>, Serializable

An XValidatingObjectCorpus holds a list of items which it uses to provide training and testing items using cross-validation.

Handler Implementation

The method handle(Object) is used to add items to the corpus. The items will be stored in the order in which they are received (though they may be permuted later).

When used as a handler, this class simply collects the items and stores them in a list. This allows an instance of this class to be used like any other object handler.

Cross Validation

Cross-validation divides a corpus up into roughly equal sized parts, called folds, assigning one of the parts as the test section and the other parts as training sections. A typical number of folds is 10, with 90% of the data being used for training and 10% for testing. The number of folds is set in the constructor.

Initially, the fold is set to 0, but it may be reset later using setFold(int). Iterating the fold from 0 through numFolds() minus 1 works through all folds. The method size() returns the size of the corpus and fold() returns the current fold.

For cases where numFolds() is greater than zero, the start and end of a fold are defined by:

 start(fold) = (int) (size() * fold / (double) numFolds())
 
 end(fold) = start(fold+1)
If numFolds() is 0, the start and end for the fold are 0, so that visiting the training part of the corpus visits the entire corpus.
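
The boundary formulas above can be sketched as a small standalone class. This is only an illustration of the arithmetic, not LingPipe's source; the class name FoldBounds and its static methods are hypothetical, and the corpus size, fold, and number of folds are passed in as plain arguments.

```java
// Sketch only: mirrors the start/end formulas in the class documentation.
// FoldBounds and its method names are illustrative, not part of LingPipe.
final class FoldBounds {

    // Index of the first test item for the given fold (0 when there are no folds).
    static int start(int size, int fold, int numFolds) {
        if (numFolds == 0) return 0;
        return (int) (size * fold / (double) numFolds);
    }

    // Index one past the last test item; equal to the start of the next fold.
    static int end(int size, int fold, int numFolds) {
        if (numFolds == 0) return 0;
        return start(size, fold + 1, numFolds);
    }

    public static void main(String[] args) {
        int size = 10;
        int numFolds = 3;
        // Prints the half-open ranges [0, 3), [3, 6), and [6, 10).
        for (int fold = 0; fold < numFolds; ++fold) {
            System.out.println("fold " + fold + ": test items ["
                + start(size, fold, numFolds) + ", "
                + end(size, fold, numFolds) + ")");
        }
    }
}
```

Because the division is rounded down, folds may differ in size by one item when the corpus size is not a multiple of the number of folds, which is why the parts are only roughly equal.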

Permuting the Corpus

The randomization method permuteCorpus(Random) randomizes the list of items. This can be useful for removing local dependencies. See the section on thread safety below for more information on the interaction of permutation and thread safety.
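
As a rough illustration of why the randomizer is supplied by the caller, the sketch below shuffles a plain list with a seeded java.util.Random. It assumes the permutation behaves like Collections.shuffle(List,Random), which this documentation does not confirm; the class SeededPermutation and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: a stand-in for permuting a corpus, assuming shuffle-like behavior.
final class SeededPermutation {

    // Returns a shuffled copy of items; the same seed always yields the same order.
    static <E> List<E> permuted(List<E> items, long seed) {
        List<E> copy = new ArrayList<E>(items);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(permuted(items, 42L));
        System.out.println(permuted(items, 42L)); // same order as the line above
    }
}
```

Fixing the seed makes a cross-validation run repeatable, since the same items land in the same folds each time.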

Use Without Cross Validation

No matter how the folds are set, using Corpus.visitCorpus(Handler) will run the specified handler over all of the data collected in this corpus.

If the number of folds is set to 1, then Corpus.visitTest(Handler) visits the entire corpus.

If the number of folds is set to 0, Corpus.visitTrain(Handler) visits the entire corpus.
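
The two degenerate settings follow directly from the fold boundary formulas in the Cross Validation section. The sketch below applies those formulas to a plain list; FoldSplit and its methods are hypothetical names, and the order in which it returns training items (items before the test section, then items after it) is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: splits a list into test and training sections by fold.
final class FoldSplit {

    static int start(int size, int fold, int numFolds) {
        return numFolds == 0 ? 0 : (int) (size * fold / (double) numFolds);
    }

    // Test section: items in [start(fold), start(fold + 1)).
    static <E> List<E> test(List<E> items, int fold, int numFolds) {
        int n = items.size();
        int from = start(n, fold, numFolds);
        int to = numFolds == 0 ? 0 : start(n, fold + 1, numFolds);
        return new ArrayList<E>(items.subList(from, to));
    }

    // Training section: everything outside the test section.
    static <E> List<E> train(List<E> items, int fold, int numFolds) {
        int n = items.size();
        int from = start(n, fold, numFolds);
        int to = numFolds == 0 ? 0 : start(n, fold + 1, numFolds);
        List<E> result = new ArrayList<E>(items.subList(0, from));
        result.addAll(items.subList(to, n));
        return result;
    }

    public static void main(String[] args) {
        List<Integer> items = Arrays.asList(1, 2, 3, 4);
        System.out.println(test(items, 0, 1));  // one fold: every item is a test item
        System.out.println(train(items, 0, 1)); // one fold: no training items
        System.out.println(test(items, 0, 0));  // zero folds: no test items
        System.out.println(train(items, 0, 0)); // zero folds: every item is a training item
    }
}
```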

Thread Safety

This class must be used with external read/write synchronization: concurrent reads are safe, but a write requires exclusive access (concurrent-read/single-write). The write operations are the constructor and the methods handle(), setFold(), setNumFolds(), and permuteCorpus(). The read operations are the visit methods, size(), and the fold-reporting methods.

Specifically, if the corpus is not being written to, folds may be visited concurrently.

Multi-Threaded Cross-Validation

After the items in a cross-validating corpus are added and optionally permuted, it is possible to carry out multi-threaded cross-validation with views of the corpus. The method itemView() returns a view of a corpus with an immutable item list. But it allows the number of folds and fold to be set. In particular, as long as the underlying corpus is not modified, a view for each fold may be created and run concurrently.

If a common evaluator is used, access to it must be synchronized to set the appropriate model and run the evaluation. If a separate evaluation is used per thread, there is no need for synchronization.
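
The pattern described above might look roughly like the following, using only the JDK. It does not call LingPipe: an unmodifiable list stands in for the item view, the class ParallelFolds is hypothetical, and each task merely counts its test items where a real task would train and evaluate a model. It assumes the number of folds is at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: one task per fold over a shared, read-only item list.
final class ParallelFolds {

    static int start(int size, int fold, int numFolds) {
        return (int) (size * fold / (double) numFolds);
    }

    // Returns the number of test items seen by each fold's task, in fold order.
    // Each task keeps its own state, so no synchronization is needed.
    static List<Integer> testSizes(List<String> items, int numFolds) {
        List<String> view = Collections.unmodifiableList(items);
        ExecutorService pool = Executors.newFixedThreadPool(numFolds);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int fold = 0; fold < numFolds; ++fold) {
                int f = fold;
                Callable<Integer> task = () -> {
                    int from = start(view.size(), f, numFolds);
                    int to = start(view.size(), f + 1, numFolds);
                    // A real task would train on the rest and evaluate on this slice.
                    return view.subList(from, to).size();
                };
                futures.add(pool.submit(task));
            }
            List<Integer> sizes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                sizes.add(future.get());
            }
            return sizes;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        System.out.println(testSizes(items, 3));
    }
}
```

Collecting results through one Future per fold keeps per-fold evaluation separate; a shared evaluator would instead need a synchronized block around each update.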

Serialization

An XValidatingObjectCorpus may be serialized. The corpus read back in will have the same items in the same permutation, with the same number of folds and the same current fold as the corpus had at the point it was serialized.
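
The round trip can be illustrated with standard Java serialization. The sketch below uses a hypothetical stand-in class, TinyCorpus, rather than the LingPipe class itself, and simply checks that the item order, number of folds, and current fold survive writing and reading.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Sketch only: TinyCorpus is a hypothetical stand-in for a serializable corpus.
final class RoundTrip {

    static final class TinyCorpus implements Serializable {
        private static final long serialVersionUID = 1L;
        final List<String> items = new ArrayList<String>();
        int numFolds;
        int fold;
    }

    // Serializes the corpus to a byte array and reads a fresh copy back.
    static TinyCorpus copyViaSerialization(TinyCorpus corpus) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(corpus);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TinyCorpus) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TinyCorpus corpus = new TinyCorpus();
        corpus.items.add("x");
        corpus.items.add("y");
        corpus.numFolds = 10;
        corpus.fold = 3;
        TinyCorpus copy = copyViaSerialization(corpus);
        System.out.println(copy.items + " folds=" + copy.numFolds + " fold=" + copy.fold);
    }
}
```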

Since:
LingPipe3.9
Version:
3.9.2
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
XValidatingObjectCorpus(int numFolds)
          Construct a cross-validating corpus with the specified number of folds.
 
Method Summary
 int fold()
          Returns the current fold.
 void handle(E e)
          Add the specified item to the end of the corpus.
 XValidatingObjectCorpus<E> itemView()
          Returns a cross-validating corpus whose items are an immutable view of the items in this corpus, but whose number of folds or fold may be changed.
 int numFolds()
          Return the number of folds for this cross-validating corpus.
 void permuteCorpus(Random random)
          Randomly permutes the corpus using the specified randomizer.
 void setFold(int fold)
          Set the current fold to the specified value.
 void setNumFolds(int numFolds)
          Sets the number of folds to the specified value.
 int size()
          Return the number of items in this corpus.
 void visitCorpus(ObjectHandler<E> handler)
          Visit the entire corpus, sending all extracted events to the specified handler.
 void visitCorpus(ObjectHandler<E> trainHandler, ObjectHandler<E> testHandler)
          Visit the entire corpus, first sending training events to the specified training handler and then sending testing events to the test handler.
 void visitTest(ObjectHandler<E> handler)
          Send all of the test items to the specified handler.
 void visitTest(ObjectHandler<E> handler, int fold)
          Visit the test portion of the specified fold with the specified handler.
 void visitTrain(ObjectHandler<E> handler)
          Send all of the training items to the specified handler.
 void visitTrain(ObjectHandler<E> handler, int fold)
          Visit the training portion of the specified fold with the specified handler.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

XValidatingObjectCorpus

public XValidatingObjectCorpus(int numFolds)
Construct a cross-validating corpus with the specified number of folds. The initial fold is set to 0.

See the class documentation above for information on how the number of folds is used.

Parameters:
numFolds - Number of folds in the corpus.
Throws:
IllegalArgumentException - If the number of folds is negative.
Method Detail

itemView

public XValidatingObjectCorpus<E> itemView()
Returns a cross-validating corpus whose items are an immutable view of the items in this corpus, but whose number of folds or fold may be changed. The main purpose of this method is to allow thread-safe cross-validation. See the class documentation for details.

Attempts to modify the items or their order using handle() or permuteCorpus() will raise an UnsupportedOperationException. Note that permuting a zero-length or length-one list does not modify it, so permuting an unmodifiable list of length zero or one does not raise an unsupported operation exception.
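
The length-zero and length-one exception can be reproduced with JDK collections alone. The sketch below assumes the view behaves like Collections.unmodifiableList and that permutation behaves like Collections.shuffle(List,Random), neither of which this documentation confirms; the class UnmodifiableShuffle is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: shows when shuffling an unmodifiable list is rejected.
final class UnmodifiableShuffle {

    // Returns true if shuffling an unmodifiable view of items throws
    // UnsupportedOperationException, false if the shuffle completes.
    static boolean shuffleRejected(List<String> items) {
        List<String> view = Collections.unmodifiableList(new ArrayList<String>(items));
        try {
            Collections.shuffle(view, new Random(0L));
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A single-item list needs no swaps, so nothing is written to the view.
        System.out.println(shuffleRejected(Arrays.asList("a")));
        // Longer lists require swaps, which the unmodifiable view rejects.
        System.out.println(shuffleRejected(Arrays.asList("a", "b", "c")));
    }
}
```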

Returns:
View of this cross-validating corpus with a list of items defined as an immutable view of the items in this corpus.

numFolds

public int numFolds()
Return the number of folds for this cross-validating corpus.

Returns:
Current number of folds.

setNumFolds

public void setNumFolds(int numFolds)
Sets the number of folds to the specified value.

See the class documentation above for information on how the number of folds is used.

Parameters:
numFolds - Number of folds.
Throws:
IllegalArgumentException - If the number of folds is negative.

fold

public int fold()
Returns the current fold.

Returns:
The current fold.

permuteCorpus

public void permuteCorpus(Random random)
Randomly permutes the corpus using the specified randomizer.

Parameters:
random - Randomizer to use for permutation.

setFold

public void setFold(int fold)
Set the current fold to the specified value.

Warning: If the number of folds is set to zero, this method will throw an exception.

Throws:
IllegalArgumentException - If the fold is negative or is greater than or equal to the number of folds.

size

public int size()
Return the number of items in this corpus.

Returns:
Number of items.

handle

public void handle(E e)
Add the specified item to the end of the corpus.

Specified by:
handle in interface ObjectHandler<E>
Parameters:
e - Item to add to corpus.

visitTrain

public void visitTrain(ObjectHandler<E> handler)
Send all of the training items to the specified handler. See the class documentation above for a specification of which items are visited based on the value of the number of folds and the current fold.

Overrides:
visitTrain in class Corpus<ObjectHandler<E>>
Parameters:
handler - Handler receiving training items.

visitTest

public void visitTest(ObjectHandler<E> handler)
Send all of the test items to the specified handler. See the class documentation above for a specification of which items are visited based on the value of the number of folds and the current fold.

Overrides:
visitTest in class Corpus<ObjectHandler<E>>
Parameters:
handler - Handler receiving test items.

visitCorpus

public void visitCorpus(ObjectHandler<E> handler)
Description copied from class: Corpus
Visit the entire corpus, sending all extracted events to the specified handler.

This is just a convenience method that is defined by:

 visitCorpus(handler,handler);
 

Overrides:
visitCorpus in class Corpus<ObjectHandler<E>>
Parameters:
handler - Handler for events extracted from the corpus.

visitCorpus

public void visitCorpus(ObjectHandler<E> trainHandler,
                        ObjectHandler<E> testHandler)
Description copied from class: Corpus
Visit the entire corpus, first sending training events to the specified training handler and then sending testing events to the test handler.

This is just a convenience method that is defined by:

 visitTrain(trainHandler);
 visitTest(testHandler);
 

Overrides:
visitCorpus in class Corpus<ObjectHandler<E>>
Parameters:
trainHandler - Handler for training events from the corpus.
testHandler - Handler for testing events from the corpus.

visitTest

public void visitTest(ObjectHandler<E> handler,
                      int fold)
Visit the test portion of the specified fold with the specified handler.

This method ignores the value of the current fold.

Parameters:
handler - Handler for objects in corpus.
fold - Fold whose test portion is visited.

visitTrain

public void visitTrain(ObjectHandler<E> handler,
                       int fold)
Visit the training portion of the specified fold with the specified handler.

This method ignores the value of the current fold.

Parameters:
handler - Handler for objects in corpus.
fold - Fold whose training portion is visited.