com.aliasi.tokenizer
Class Tokenizer

java.lang.Object
  extended by com.aliasi.tokenizer.Tokenizer
All Implemented Interfaces:
Iterable<String>
Direct Known Subclasses:
FilterTokenizer

public abstract class Tokenizer
extends Object
implements Iterable<String>

The abstract class Tokenizer serves as a base for tokenizer implementations, which provide streams of tokens, whitespaces, and positions.

A tokenizer acts as an iterator over both whitespace and token streams. The next whitespace is returned through nextWhitespace(), and the next token through nextToken(). Some tokenizers may implement lastTokenStartPosition(), which returns the offset of the most recently returned token's first character in an underlying character stream.

Tokenizers implement the Iterable interface to allow easy iteration over just the tokens using for-each loops.

The entire underlying character sequence may be reconstructed by alternating the next whitespace and the next token, beginning with the first whitespace, until both streams are exhausted. Offsets returned by lastTokenStartPosition() are not guaranteed to index into this reconstructed sequence of characters.
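The alternation described above can be sketched as follows. This is a minimal illustration, not LingPipe code: ReconBase is a hypothetical stand-in for Tokenizer so that the example runs without the library, and ReconWhitespaceTokenizer and ReconstructSketch are likewise invented names.

```java
// Hypothetical stand-in for com.aliasi.tokenizer.Tokenizer (assumption, not the real class).
abstract class ReconBase {
    abstract String nextToken();
    String nextWhitespace() { return " "; }
}

// Hypothetical tokenizer: tokens are maximal runs of non-whitespace characters.
class ReconWhitespaceTokenizer extends ReconBase {
    private final String text;
    private int pos = 0;

    ReconWhitespaceTokenizer(String text) { this.text = text; }

    @Override
    String nextWhitespace() {
        // Look ahead without consuming, so repeated calls return the same string.
        int end = pos;
        while (end < text.length() && Character.isWhitespace(text.charAt(end))) end++;
        return text.substring(pos, end);
    }

    @Override
    String nextToken() {
        // Flush any whitespace that has not been returned.
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos == text.length()) return null;
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
        return text.substring(start, pos);
    }
}

class ReconstructSketch {
    // Alternate whitespace and token, starting with whitespace, until tokens run out.
    static String reconstruct(ReconBase tokenizer) {
        StringBuilder sb = new StringBuilder();
        while (true) {
            sb.append(tokenizer.nextWhitespace());
            String token = tokenizer.nextToken();
            if (token == null) break;
            sb.append(token);
        }
        return sb.toString();
    }
}
```

The whitespace is read before each call to nextToken() because nextToken() is documented to flush whitespace that has not yet been returned.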

Concrete subclasses must implement nextToken() to return the next token. They may override nextWhitespace() to return the next whitespace string; the implementation in this class returns a single space, Strings.SINGLE_SPACE_STRING. Subclasses may also implement lastTokenStartPosition() and lastTokenEndPosition(), each of which otherwise throws an UnsupportedOperationException.
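A subclass that supplies only nextToken() might look like the sketch below. DefaultsBase is a hypothetical stand-in that mirrors the defaults described above (a single-space whitespace and an unsupported start position); it is an assumption made so the example runs without LingPipe, and ArraySketchTokenizer and DefaultsSketch are invented names.

```java
// Hypothetical stand-in mirroring the documented defaults (assumption, not the real class).
abstract class DefaultsBase {
    abstract String nextToken();

    // Documented default: a single space.
    String nextWhitespace() { return " "; }

    // Documented default: optional operation, unsupported unless overridden.
    int lastTokenStartPosition() { throw new UnsupportedOperationException(); }
}

// Minimal concrete subclass: only nextToken() is implemented.
class ArraySketchTokenizer extends DefaultsBase {
    private final String[] tokens;
    private int next = 0;

    ArraySketchTokenizer(String... tokens) { this.tokens = tokens.clone(); }

    @Override
    String nextToken() {
        // Return null once every token has been handed out.
        return next < tokens.length ? tokens[next++] : null;
    }
}

class DefaultsSketch {
    // Reports whether the tokenizer leaves the start position unsupported.
    static boolean startPositionUnsupported(DefaultsBase tokenizer) {
        try {
            tokenizer.lastTokenStartPosition();
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

Callers that need offsets should therefore be prepared for an UnsupportedOperationException from tokenizers that do not track positions.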

Since:
LingPipe 1.0
Version:
3.8.1
Author:
Bob Carpenter

Constructor Summary
Tokenizer()
          Construct a tokenizer.
 
Method Summary
 Iterator<String> iterator()
          Returns an iterator over the tokens remaining in this tokenizer.
 int lastTokenEndPosition()
          Returns the offset of one position past the last character of the most recently returned token (optional operation).
 int lastTokenStartPosition()
          Returns the offset of the first character of the most recently returned token (optional operation).
abstract  String nextToken()
          Returns the next token in the stream, or null if there are no more tokens.
 String nextWhitespace()
          Returns the next whitespace.
 String[] tokenize()
          Returns the remaining tokens in an array of strings.
 void tokenize(List<? super String> tokens, List<? super String> whitespaces)
          Adds the remaining tokens and whitespaces to the specified lists.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Tokenizer

public Tokenizer()
Construct a tokenizer.

Method Detail

iterator

public Iterator<String> iterator()
Returns an iterator over the tokens remaining in this tokenizer.

The returned iterator is not thread safe with respect to the underlying tokenizer. Specifically, it maintains a handle to this tokenizer. Calls to the iterator's hasNext() and next() methods call this tokenizer's nextToken() method.

Specified by:
iterator in interface Iterable<String>
Returns:
An iterator over the tokens remaining in this tokenizer.
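Because the iterator holds a handle to the tokenizer, the two share state. The sketch below shows one way such an iterator could be written; the one-token buffering is an assumption for illustration, not LingPipe's source, and IterBase, FixedTokens, and IteratorSketch are hypothetical names.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for the tokenizer base class (assumption, not the real class).
abstract class IterBase implements Iterable<String> {
    abstract String nextToken();

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private String buffered = null; // token fetched by hasNext() but not yet returned

            @Override
            public boolean hasNext() {
                if (buffered == null) buffered = nextToken(); // advances the tokenizer
                return buffered != null;
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                String token = buffered;
                buffered = null;
                return token;
            }
        };
    }
}

class FixedTokens extends IterBase {
    private final String[] tokens;
    private int next = 0;

    FixedTokens(String... tokens) { this.tokens = tokens.clone(); }

    @Override
    String nextToken() { return next < tokens.length ? tokens[next++] : null; }
}

class IteratorSketch {
    static String firstThenRest() {
        FixedTokens tokenizer = new FixedTokens("a", "b", "c");
        Iterator<String> it = tokenizer.iterator();
        String first = it.next();               // pulls "a" from the tokenizer
        String direct = tokenizer.nextToken();  // "b": the iterator did not copy the tokens
        StringBuilder rest = new StringBuilder();
        for (String token : tokenizer) rest.append(token); // only "c" remains
        return first + direct + rest;
    }
}
```

Mixing direct calls to nextToken() with iteration, as done here for demonstration, is usually best avoided for the same reason the iterator is not thread safe.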

nextToken

public abstract String nextToken()
Returns the next token in the stream, or null if there are no more tokens. Flushes any whitespace that has not been returned.

Returns:
The next token, or null if there are no more tokens.

nextWhitespace

public String nextWhitespace()
Returns the next whitespace. Repeated calls return the same result until nextToken() is called.

The default implementation in this class is to return a single space, Strings.SINGLE_SPACE_STRING.

Returns:
The next space.

lastTokenStartPosition

public int lastTokenStartPosition()
Returns the offset of the first character of the most recently returned token (optional operation). A tokenizer should return -1 if no token has been returned yet.

The position returned is relative to the beginning of the slice of the character array being tokenized, not the beginning of the array itself.

The implementation here simply throws an unsupported operation exception. Subclasses should override this method if they support character offset indexing.

Returns:
The character offset of the first character of the most recently returned token, or -1 if no token has yet been returned.
Throws:
UnsupportedOperationException - If this method is not supported.

lastTokenEndPosition

public int lastTokenEndPosition()
Returns the offset of one position past the last character of the most recently returned token (optional operation). A tokenizer should return -1 if no token has been returned yet.

The position returned is relative to the beginning of the slice of the character array being tokenized, not the beginning of the array itself.

The implementation here throws an unsupported operation exception. Subclasses should override this method to support offset indexing.

Returns:
One plus the offset of the last character of the most recently returned token, or -1 if no token has yet been returned.
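The slice-relative convention for start and end positions can be illustrated with a small sketch. PositionBase is a hypothetical stand-in for Tokenizer, and OffsetTrackingTokenizer and PositionSketch are invented for this example; the whitespace-splitting rule is an assumption.

```java
// Hypothetical stand-in for the tokenizer base class (assumption, not the real class).
abstract class PositionBase {
    abstract String nextToken();
    int lastTokenStartPosition() { throw new UnsupportedOperationException(); }
    int lastTokenEndPosition() { throw new UnsupportedOperationException(); }
}

// Tokenizes the slice chars[offset .. offset+length) and reports slice-relative offsets.
class OffsetTrackingTokenizer extends PositionBase {
    private final char[] chars;
    private final int sliceStart;
    private final int sliceEnd;
    private int pos;
    private int lastStart = -1; // -1 until a token has been returned
    private int lastEnd = -1;

    OffsetTrackingTokenizer(char[] chars, int offset, int length) {
        this.chars = chars;
        this.sliceStart = offset;
        this.sliceEnd = offset + length;
        this.pos = offset;
    }

    @Override
    String nextToken() {
        while (pos < sliceEnd && Character.isWhitespace(chars[pos])) pos++;
        if (pos == sliceEnd) return null;
        int start = pos;
        while (pos < sliceEnd && !Character.isWhitespace(chars[pos])) pos++;
        lastStart = start - sliceStart; // relative to the slice, not the array
        lastEnd = pos - sliceStart;     // one past the token's last character
        return new String(chars, start, pos - start);
    }

    @Override
    int lastTokenStartPosition() { return lastStart; }

    @Override
    int lastTokenEndPosition() { return lastEnd; }
}

class PositionSketch {
    // Returns {start before any token, start of first token, end of first token}.
    static int[] positionsOfFirstToken() {
        char[] chars = "xx  hello world".toCharArray();
        // The slice skips the leading "xx", so it begins at array index 2.
        OffsetTrackingTokenizer t = new OffsetTrackingTokenizer(chars, 2, chars.length - 2);
        int before = t.lastTokenStartPosition();
        t.nextToken(); // "hello", at array indices 4 through 8
        return new int[] { before, t.lastTokenStartPosition(), t.lastTokenEndPosition() };
    }
}
```

Here the token "hello" begins at array index 4 but is reported at offset 2, because offsets count from the start of the slice.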

tokenize

public void tokenize(List<? super String> tokens,
                     List<? super String> whitespaces)
Adds the remaining tokens and whitespaces to the specified lists.

Parameters:
tokens - List to which tokens are added.
whitespaces - List to which whitespaces are added.
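A rough sketch of how the two lists might be filled follows. It assumes the whitespace list receives one more entry than the token list (the whitespace before each token plus the trailing whitespace), which matches the reconstruction scheme in the class description but is not stated for this method. ListBase, ListWhitespaceTokenizer, and TokenizeSketch are hypothetical names, not LingPipe classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokenizer base class (assumption, not the real class).
abstract class ListBase {
    abstract String nextToken();
    String nextWhitespace() { return " "; }

    // Assumed behavior: whitespace, then (token, whitespace) pairs until tokens run out.
    void tokenize(List<? super String> tokens, List<? super String> whitespaces) {
        whitespaces.add(nextWhitespace());
        String token;
        while ((token = nextToken()) != null) {
            tokens.add(token);
            whitespaces.add(nextWhitespace());
        }
    }
}

// Hypothetical tokenizer: tokens are maximal runs of non-whitespace characters.
class ListWhitespaceTokenizer extends ListBase {
    private final String text;
    private int pos = 0;

    ListWhitespaceTokenizer(String text) { this.text = text; }

    @Override
    String nextWhitespace() {
        // Look ahead without consuming the whitespace.
        int end = pos;
        while (end < text.length() && Character.isWhitespace(text.charAt(end))) end++;
        return text.substring(pos, end);
    }

    @Override
    String nextToken() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos == text.length()) return null;
        int start = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) pos++;
        return text.substring(start, pos);
    }
}

class TokenizeSketch {
    // Returns the token list at index 0 and the whitespace list at index 1.
    static List<List<String>> split(String text) {
        List<String> tokens = new ArrayList<>();
        List<String> whitespaces = new ArrayList<>();
        new ListWhitespaceTokenizer(text).tokenize(tokens, whitespaces);
        List<List<String>> result = new ArrayList<>();
        result.add(tokens);
        result.add(whitespaces);
        return result;
    }
}
```

Under this assumption, interleaving the two lists, starting with the first whitespace, yields the original text.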

tokenize

public String[] tokenize()
Returns the remaining tokens in an array of strings. If called first, this returns all of the tokens produced by this tokenizer. Flushes all remaining whitespace.

Returns:
Array of tokens remaining in this tokenizer.