com.aliasi.tokenizer
Class Tokenizer

java.lang.Object
  extended by com.aliasi.tokenizer.Tokenizer
All Implemented Interfaces:
Iterable<String>
Direct Known Subclasses:
FilterTokenizer

public abstract class Tokenizer
extends Object
implements Iterable<String>

Abstract base class for tokenizers. Acts as an iterator over both the whitespace and token streams. The next whitespace is returned by nextWhitespace(), and the next token by nextToken(). Some tokenizers may implement lastTokenStartPosition(), which returns the offset of the most recently returned token's first character in the underlying character stream.

The entire underlying character sequence may be reconstructed by alternating the next whitespace and the next token, beginning with the first whitespace, until both streams are exhausted. Offsets returned by lastTokenStartPosition() are not guaranteed to be offsets into this reconstructed character sequence.

Concrete subclasses must implement nextToken() to return the next token. They may override nextWhitespace() to return the next whitespace string; the implementation in this class returns a single space, Strings.SINGLE_SPACE_STRING. Subclasses may also override lastTokenStartPosition(), which otherwise throws an UnsupportedOperationException.
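
As a rough illustration of the subclass contract and of the reconstruction property described above, the sketch below splits text on whitespace. To stay self-contained it extends a local stand-in, SketchTokenizer, that mirrors only the members documented here. SketchTokenizer, WhitespaceSketchTokenizer, and SubclassDemo are hypothetical names, not part of LingPipe; real code would extend com.aliasi.tokenizer.Tokenizer instead.

```java
// All class names in this sketch are hypothetical stand-ins, not LingPipe classes.
class SubclassDemo {
    // Rebuilds the input by alternating whitespace and token, whitespace first.
    static String reconstruct(SketchTokenizer tokenizer) {
        StringBuilder sb = new StringBuilder();
        sb.append(tokenizer.nextWhitespace());
        String token;
        while ((token = tokenizer.nextToken()) != null) {
            sb.append(token);
            sb.append(tokenizer.nextWhitespace());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "  Hello,  world ";
        String rebuilt = reconstruct(new WhitespaceSketchTokenizer(text));
        System.out.println(rebuilt.equals(text));
    }
}

// Stand-in mirroring the documented contract of the abstract base class.
abstract class SketchTokenizer {
    // Subclasses must return the next token, or null when none remain.
    public abstract String nextToken();

    // Documented default: a single space.
    public String nextWhitespace() {
        return " ";
    }

    // Optional operation: unsupported unless a subclass overrides it.
    public int lastTokenStartPosition() {
        throw new UnsupportedOperationException();
    }
}

// Concrete subclass: tokens are maximal runs of non-whitespace characters.
class WhitespaceSketchTokenizer extends SketchTokenizer {
    private final String text;
    private int pos = 0;

    WhitespaceSketchTokenizer(String text) {
        this.text = text;
    }

    // Index of the first non-whitespace character at or after pos.
    private int endOfWhitespace() {
        int i = pos;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        return i;
    }

    @Override
    public String nextWhitespace() {
        // Does not advance pos, so repeated calls return the same string.
        return text.substring(pos, endOfWhitespace());
    }

    @Override
    public String nextToken() {
        int start = endOfWhitespace(); // skips any whitespace not yet returned
        if (start == text.length()) {
            pos = start;
            return null;
        }
        int end = start;
        while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
            end++;
        }
        pos = end;
        return text.substring(start, end);
    }
}
```

Because nextWhitespace() in this sketch never advances the position, calling it repeatedly between tokens is harmless, which matches the behavior described for nextWhitespace() in this class.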

Since:
LingPipe 1.0
Version:
3.1
Author:
Bob Carpenter

Constructor Summary
protected Tokenizer()
          Construct a tokenizer.
 
Method Summary
 Iterator<String> iterator()
          Returns an iterator over the tokens remaining in this tokenizer.
 int lastTokenStartPosition()
          Returns the offset of the first character of the most recently returned token (optional operation).
abstract  String nextToken()
          Returns the next token in the stream, or null if there are no more tokens.
 String nextWhitespace()
          Returns the next whitespace.
 String[] tokenize()
          Returns the remaining tokens in an array of strings.
 void tokenize(List<? super String> tokens, List<? super String> whitespaces)
          Adds the remaining tokens and whitespaces to the specified lists.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Tokenizer

protected Tokenizer()
Construct a tokenizer.

Method Detail

iterator

public Iterator<String> iterator()
Returns an iterator over the tokens remaining in this tokenizer.

The returned iterator is not thread safe with respect to the underlying tokenizer. Specifically, it maintains a handle to this tokenizer: calls to the iterator's hasNext() and next() methods call this tokenizer's nextToken() method, so iterating consumes tokens from the tokenizer itself.

Specified by:
iterator in interface Iterable<String>
Returns:
An iterator over the tokens remaining in this tokenizer.
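
The following sketch shows one way such an iterator could behave, using a one-token buffer so that hasNext() can look ahead. It is an assumption about the shape of the implementation, not LingPipe's actual code; IterableSketchTokenizer, ArraySketchTokenizer, and IteratorDemo are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// All class names in this sketch are hypothetical stand-ins, not LingPipe classes.
class IteratorDemo {
    // Drains the tokenizer with a for-each loop, which calls iterator() once.
    static List<String> collect(IterableSketchTokenizer tokenizer) {
        List<String> out = new ArrayList<>();
        for (String token : tokenizer) {
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(new ArraySketchTokenizer("John", "ran", ".")));
    }
}

// Stand-in base class exposing only nextToken() and iterator().
abstract class IterableSketchTokenizer implements Iterable<String> {
    public abstract String nextToken();

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            // Token fetched by hasNext() but not yet handed out by next().
            private String buffered;

            @Override
            public boolean hasNext() {
                if (buffered == null) {
                    buffered = nextToken(); // advances the enclosing tokenizer
                }
                return buffered != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String token = buffered;
                buffered = null;
                return token;
            }
        };
    }
}

// Trivial tokenizer that replays a fixed array of tokens.
class ArraySketchTokenizer extends IterableSketchTokenizer {
    private final String[] source;
    private int index = 0;

    ArraySketchTokenizer(String... source) {
        this.source = source;
    }

    @Override
    public String nextToken() {
        return index < source.length ? source[index++] : null;
    }
}
```

Since the iterator shares state with the tokenizer, mixing direct nextToken() calls with iteration splits the token stream between the two callers, and a second iterator sees only what remains.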

nextToken

public abstract String nextToken()
Returns the next token in the stream, or null if there are no more tokens. Flushes any whitespace that has not been returned.

Returns:
The next token, or null if there are no more tokens.

nextWhitespace

public String nextWhitespace()
Returns the next whitespace. Repeated calls return the same result until nextToken() is called. The default implementation in this class returns a single space, Strings.SINGLE_SPACE_STRING.

Returns:
The next whitespace.

lastTokenStartPosition

public int lastTokenStartPosition()
Returns the offset of the first character of the most recently returned token (optional operation). A tokenizer should return -1 if no token has been returned yet.

The implementation in this class simply throws an UnsupportedOperationException. Subclasses should override this method if they support character offset indexing.

Returns:
The character offset of the first character of the most recently returned token.
Throws:
UnsupportedOperationException - If this method is not supported.
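
A minimal sketch of a subclass that supports offsets is given below, assuming offsets are measured from the start of the input string. OffsetBaseSketch, OffsetSketchTokenizer, and OffsetDemo are hypothetical stand-ins written for this example and are not LingPipe classes.

```java
// All class names in this sketch are hypothetical stand-ins, not LingPipe classes.
class OffsetDemo {
    public static void main(String[] args) {
        OffsetSketchTokenizer tokenizer = new OffsetSketchTokenizer("ab  cd");
        System.out.println("before any token: " + tokenizer.lastTokenStartPosition());
        String token;
        while ((token = tokenizer.nextToken()) != null) {
            System.out.println(token + " starts at " + tokenizer.lastTokenStartPosition());
        }
    }
}

// Stand-in base class: the offset operation is unsupported by default.
abstract class OffsetBaseSketch {
    public abstract String nextToken();

    public int lastTokenStartPosition() {
        throw new UnsupportedOperationException();
    }
}

// Whitespace-splitting tokenizer that records where each token began.
class OffsetSketchTokenizer extends OffsetBaseSketch {
    private final String text;
    private int pos = 0;
    private int lastStart = -1; // -1 until the first token is returned

    OffsetSketchTokenizer(String text) {
        this.text = text;
    }

    @Override
    public String nextToken() {
        // Skip whitespace preceding the next token.
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos == text.length()) {
            return null; // lastStart keeps the offset of the final token
        }
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        lastStart = start;
        return text.substring(start, pos);
    }

    @Override
    public int lastTokenStartPosition() {
        return lastStart;
    }
}
```

Callers that accept arbitrary tokenizers would typically wrap the call in a try/catch for UnsupportedOperationException, since the operation is optional.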

tokenize

public void tokenize(List<? super String> tokens,
                     List<? super String> whitespaces)
Adds the remaining tokens and whitespaces to the specified lists.

Parameters:
tokens - List to which tokens are added.
whitespaces - List to which whitespaces are added.
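
The sketch below illustrates how this method could fill the two lists. The order of additions (one leading whitespace, then a whitespace after every token) is an assumption inferred from the reconstruction rule in the class description, not a statement of LingPipe's actual implementation. ListSketchTokenizer and TokenizeListsDemo are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// All class names in this sketch are hypothetical stand-ins, not LingPipe classes.
class TokenizeListsDemo {
    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        // The wildcard bound accepts any list whose element type is a supertype of String.
        List<Object> whitespaces = new ArrayList<>();
        new ListSketchTokenizer("John", "ran").tokenize(tokens, whitespaces);
        System.out.println(tokens + " with " + whitespaces.size() + " whitespaces");
    }
}

// Stand-in tokenizer replaying fixed tokens, using the single-space default.
class ListSketchTokenizer {
    private final String[] source;
    private int index = 0;

    ListSketchTokenizer(String... source) {
        this.source = source;
    }

    public String nextToken() {
        return index < source.length ? source[index++] : null;
    }

    public String nextWhitespace() {
        return " ";
    }

    // Assumed ordering: leading whitespace, then (token, whitespace) pairs.
    public void tokenize(List<? super String> tokens,
                         List<? super String> whitespaces) {
        whitespaces.add(nextWhitespace());
        String token;
        while ((token = nextToken()) != null) {
            tokens.add(token);
            whitespaces.add(nextWhitespace());
        }
    }
}
```

Under this assumed ordering the whitespace list ends up one element longer than the token list, so interleaving the two lists, whitespace first, rebuilds the character sequence.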

tokenize

public String[] tokenize()
Returns the remaining tokens in an array of strings. If called before any tokens have been consumed, this returns all of the tokens produced by this tokenizer. Flushes all remaining whitespace.

Returns:
Array of tokens remaining in this tokenizer.