com.aliasi.tokenizer
Class Tokenization

java.lang.Object
  extended by com.aliasi.tokenizer.Tokenization
All Implemented Interfaces:
Serializable

public class Tokenization
extends Object
implements Serializable

A Tokenization represents the result of tokenizing a string. Tokenizations are constructed either from a character sequence and a tokenizer factory, or directly from their components. A tokenization contains the underlying text, the tokens, the whitespaces surrounding the tokens, and the start/end positions of each token in the text.
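To make the offset convention concrete, here is a minimal sketch that splits text on whitespace and records a half-open start/end span per token. The class `OffsetSketch` and its whitespace rule are assumptions made for illustration; they stand in for a tokenizer factory and are not LingPipe's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer; not part of LingPipe.
final class OffsetSketch {

    /** Returns one {start, end} pair per maximal run of non-whitespace characters. */
    static List<int[]> tokenSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip the whitespace that precedes the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i == text.length()) {
                break;
            }
            int start = i;
            // Consume the token itself; i ends one past its last character.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            spans.add(new int[] { start, i });
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "  John ran. ";
        for (int[] span : tokenSpans(text)) {
            // Prints "2-6 John" and then "7-11 ran."
            System.out.println(span[0] + "-" + span[1] + " " + text.substring(span[0], span[1]));
        }
    }
}
```

Because the end offset is exclusive, `text.substring(start, end)` recovers the token in this sketch and `end - start` is its length.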

Equality and Hash Codes

Two tokenizations are equal if they have the same text, tokens, whitespaces, and start/end positions for the tokens.

Hash codes are consistent with equality. They depend only on the text and the number of tokens, so unequal tokenizations may share a hash code.
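The following sketch shows how a hash code that looks only at the text and the token count can still be consistent with a stricter equality test. `TokKey` is a hypothetical class, and the exact way the two values are combined into a hash is an assumption; the documentation above states only which values the hash depends on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical value class; illustrates the contract only, not LingPipe's source.
final class TokKey {
    private final String text;
    private final List<String> tokens;
    private final List<String> whitespaces;
    private final int[] starts;
    private final int[] ends;

    TokKey(String text, List<String> tokens, List<String> whitespaces,
           int[] starts, int[] ends) {
        this.text = text;
        this.tokens = new ArrayList<>(tokens);
        this.whitespaces = new ArrayList<>(whitespaces);
        this.starts = starts.clone();
        this.ends = ends.clone();
    }

    @Override
    public boolean equals(Object that) {
        if (this == that) {
            return true;
        }
        if (!(that instanceof TokKey)) {
            return false;
        }
        TokKey t = (TokKey) that;
        // Equality compares every component.
        return text.equals(t.text)
            && tokens.equals(t.tokens)
            && whitespaces.equals(t.whitespaces)
            && Arrays.equals(starts, t.starts)
            && Arrays.equals(ends, t.ends);
    }

    @Override
    public int hashCode() {
        // Cheap hash over text and token count only; the formula is an assumption.
        return 31 * text.hashCode() + tokens.size();
    }
}
```

Equal objects necessarily agree on text and token count, so they hash identically. The reverse does not hold: two objects with the same text and token count but different tokens collide, which is permitted by the `Object.hashCode()` contract.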

Serialization

A tokenization may be serialized. Deserialization should produce an identical tokenization.
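A round trip through standard Java serialization can be checked with a small helper like the one below. `RoundTrip.copy` is a hypothetical utility built only on `java.io`; it works for any `Serializable` value, so under that assumption it could be applied to a tokenization and the result compared with `equals`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper; not part of LingPipe.
final class RoundTrip {

    /** Serializes the object to memory and reads it back as a fresh instance. */
    static <T extends Serializable> T copy(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T result = (T) in.readObject();
                return result;
            }
        } catch (IOException | ClassNotFoundException e) {
            // Wrapped so callers do not need to declare checked exceptions.
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }
}
```

A check of the form `RoundTrip.copy(x).equals(x)` then expresses the statement that deserialization produces an identical object.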

Thread Safety

Once safely published, instances of this class are thread safe. The text and tokenizer factory must not be modified concurrently with construction.

Since:
LingPipe 3.9
Version:
3.9
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
Tokenization(char[] cs, int start, int length, TokenizerFactory factory)
          Construct a tokenization from the specified text and tokenizer factory.
Tokenization(String text, List<String> tokens, List<String> whitespaces, int[] tokenStarts, int[] tokenEnds)
          Construct a tokenization from the specified components.
Tokenization(String text, TokenizerFactory factory)
          Construct a tokenization from the specified text and tokenizer factory.
 
Method Summary
 boolean equals(Object that)
          Returns true if the specified object is a tokenization that is equal to this one.
 int hashCode()
          Returns the hash code for this tokenization.
 int numTokens()
          Return the number of tokens in this tokenization.
 String text()
          Return the underlying text for this tokenization.
 String token(int n)
          Return the token at the specified input position.
 int tokenEnd(int n)
          Return the position one past the last character of the token at the specified position.
 List<String> tokenList()
          Returns an unmodifiable view of the list of tokens for this tokenization.
 String[] tokens()
          Returns the array of tokens underlying this tokenization.
 int tokenStart(int n)
          Return the position of the first character of the token at the specified position.
 String whitespace(int n)
          Return the whitespace before the token at the specified input position, or the last whitespace if the specified position is the number of tokens.
 List<String> whitespaceList()
          Returns an unmodifiable view of the list of whitespaces for this tokenization.
 String[] whitespaces()
          Return the array of whitespaces for this tokenization.
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Tokenization

public Tokenization(char[] cs,
                    int start,
                    int length,
                    TokenizerFactory factory)
Construct a tokenization from the specified text and tokenizer factory. The text is converted to a string so that subsequent changes to the text will not affect this class. (Note that the text should not be changed concurrently with constructing a tokenization.)

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
length - Length of slice.
factory - Tokenizer factory to use for tokenization.
Throws:
IndexOutOfBoundsException - If the slice defined by the start index and length falls outside the bounds of the array.
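The effect of converting the slice to a string can be seen with the standard `String(char[], int, int)` constructor, which copies the characters and rejects a bad range. Whether this class uses that constructor internally is an assumption; `SliceCopy` is a hypothetical name used only for this sketch.

```java
// Hypothetical demo class; not part of LingPipe.
final class SliceCopy {

    /** Copies cs[start .. start+length) into an immutable String. */
    static String snapshot(char[] cs, int start, int length) {
        // Throws StringIndexOutOfBoundsException, a subclass of
        // IndexOutOfBoundsException, if the slice is out of range.
        return new String(cs, start, length);
    }

    public static void main(String[] args) {
        char[] cs = "one two three".toCharArray();
        String text = snapshot(cs, 4, 3);
        cs[4] = 'X'; // later change to the array
        System.out.println(text); // prints "two"
    }
}
```

The copy protects against changes made after construction only. A change made by another thread while the copy is in progress can still be observed, which is why the text should not be modified concurrently with construction.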

Tokenization

public Tokenization(String text,
                    TokenizerFactory factory)
Construct a tokenization from the specified text and tokenizer factory.

Parameters:
text - Underlying text for tokenization.
factory - Tokenizer factory to perform tokenization.

Tokenization

public Tokenization(String text,
                    List<String> tokens,
                    List<String> whitespaces,
                    int[] tokenStarts,
                    int[] tokenEnds)
Construct a tokenization from the specified components. The arrays and lists are copied so that modifications to them will not affect the constructed object after construction.

Parameters:
text - Underlying text.
tokens - List of tokens.
whitespaces - List of whitespaces.
tokenStarts - Offsets of the first character of each token.
tokenEnds - Offsets of one past the last character of each token.
Throws:
IllegalArgumentException - If the number of whitespaces is not equal to the number of tokens plus one, a token start occurs after the corresponding token end, or a token start or end is out of bounds for the text.
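The documented argument checks can be sketched as a standalone validation routine. `ComponentCheck.validate` is hypothetical, and the requirement that the offset arrays have one entry per token is an assumption that the documentation above does not state.

```java
import java.util.List;

// Hypothetical validator mirroring the documented conditions; not LingPipe's source.
final class ComponentCheck {

    static void validate(String text, List<String> tokens, List<String> whitespaces,
                         int[] tokenStarts, int[] tokenEnds) {
        if (whitespaces.size() != tokens.size() + 1) {
            throw new IllegalArgumentException("need exactly one more whitespace than tokens");
        }
        // Assumption: one start and one end offset per token.
        if (tokenStarts.length != tokens.size() || tokenEnds.length != tokens.size()) {
            throw new IllegalArgumentException("need one start and one end per token");
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (tokenStarts[i] > tokenEnds[i]) {
                throw new IllegalArgumentException("start after end at token " + i);
            }
            // With start <= end, these two tests bound both offsets by [0, text.length()].
            if (tokenStarts[i] < 0 || tokenEnds[i] > text.length()) {
                throw new IllegalArgumentException("offset out of bounds at token " + i);
            }
        }
    }
}
```

For the text `"a b"` with tokens `a` and `b`, the whitespaces `""`, `" "`, `""`, starts `{0, 2}` and ends `{1, 3}` pass every check in this sketch.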
Method Detail

text

public String text()
Return the underlying text for this tokenization.

Returns:
Text for tokenization.

numTokens

public int numTokens()
Return the number of tokens in this tokenization.

Returns:
The number of tokens.

token

public String token(int n)
Return the token at the specified input position.

Parameters:
n - Position of token.
Returns:
Token at specified position.
Throws:
IndexOutOfBoundsException - If the position is less than 0 or greater than or equal to the number of tokens.

whitespace

public String whitespace(int n)
Return the whitespace before the token at the specified input position, or the last whitespace if the specified position is the number of tokens.

Parameters:
n - Position of token.
Returns:
Whitespace before the token in the specified position.
Throws:
IndexOutOfBoundsException - If the position is less than 0 or greater than the number of tokens.
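The indexing scheme, in which whitespace n precedes token n and one final whitespace follows the last token, can be illustrated by interleaving the two arrays. `Interleave.rebuild` is a hypothetical helper. It reproduces the original text only under the assumption that the tokenizer leaves token text unmodified and that the whitespaces hold every character between tokens.

```java
// Hypothetical helper; not part of LingPipe.
final class Interleave {

    /** Concatenates whitespace(0) token(0) whitespace(1) ... token(n-1) whitespace(n). */
    static String rebuild(String[] tokens, String[] whitespaces) {
        if (whitespaces.length != tokens.length + 1) {
            throw new IllegalArgumentException("need exactly one more whitespace than tokens");
        }
        StringBuilder sb = new StringBuilder();
        for (int n = 0; n < tokens.length; n++) {
            sb.append(whitespaces[n]).append(tokens[n]);
        }
        // The final whitespace sits at index numTokens.
        sb.append(whitespaces[tokens.length]);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = { "John", "ran." };
        String[] whitespaces = { "  ", " ", " " };
        System.out.println("[" + rebuild(tokens, whitespaces) + "]"); // prints "[  John ran. ]"
    }
}
```

This is why the valid whitespace positions run from 0 to the number of tokens inclusive, one more than the valid token positions.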

tokenStart

public int tokenStart(int n)
Return the position of the first character of the token at the specified position.

Parameters:
n - Position of token.
Returns:
The index of the first character in the specified token.
Throws:
IndexOutOfBoundsException - If the position is less than 0 or greater than or equal to the number of tokens.

tokenEnd

public int tokenEnd(int n)
Return the position one past the last character of the token at the specified position.

Parameters:
n - Position of token.
Returns:
The index of the last character plus one for the specified token.
Throws:
IndexOutOfBoundsException - If the position is less than 0 or greater than or equal to the number of tokens.

tokens

public String[] tokens()
Returns the array of tokens underlying this tokenization. This array's length is the number of tokens and it is indexed by token position.

The array is copied from the underlying list of tokens, so modifying it will not affect this tokenization.

Returns:
Array of tokens for this tokenization.
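The copy-on-read behavior described above can be sketched with a holder that keeps a private list and hands out a fresh array on each call. `CopyOnRead` is a hypothetical class written for this illustration, not LingPipe's source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder class; not part of LingPipe.
final class CopyOnRead {
    private final List<String> tokens;

    CopyOnRead(List<String> tokens) {
        // Defensive copy so later changes to the argument are not visible here.
        this.tokens = new ArrayList<>(tokens);
    }

    /** Returns a newly allocated array on every call. */
    String[] tokens() {
        return tokens.toArray(new String[0]);
    }

    String token(int n) {
        return tokens.get(n);
    }
}
```

Writing to the returned array changes only that array; `token(n)` continues to return the original value. The cost is one allocation per call, so callers in a tight loop may prefer to call `tokens()` once and reuse the result.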

whitespaces

public String[] whitespaces()
Return the array of whitespaces for this tokenization. The array's length is one greater than the number of tokens, and entry n is the whitespace preceding token n, with the final entry holding the whitespace after the last token.

The array is copied from the underlying list of whitespaces, so modifying it will not affect this tokenization.

Returns:
Array of whitespaces for this tokenization.

tokenList

public List<String> tokenList()
Returns an unmodifiable view of the list of tokens for this tokenization.

Returns:
List of tokens for this tokenization.
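An unmodifiable view permits reads but rejects writes. The sketch below uses the standard `Collections.unmodifiableList` wrapper to show the behavior a caller can expect; whether this class uses that particular wrapper is an assumption, and `ReadOnlyView` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; not part of LingPipe.
final class ReadOnlyView {

    /** Returns true if an attempt to overwrite element 0 is rejected. */
    static boolean rejectsWrites(List<String> list) {
        try {
            list.set(0, "changed");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> backing = new ArrayList<>(Arrays.asList("John", "ran."));
        List<String> view = Collections.unmodifiableList(backing);
        System.out.println(view.get(0));         // prints "John"
        System.out.println(rejectsWrites(view)); // prints "true"
    }
}
```

Unlike the array accessors, a view involves no copying, so it is the cheaper choice when the caller only needs to read.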

whitespaceList

public List<String> whitespaceList()
Returns an unmodifiable view of the list of whitespaces for this tokenization.

Returns:
List of whitespaces for this tokenization.

equals

public boolean equals(Object that)
Returns true if the specified object is a tokenization that is equal to this one. Equality is defined as having the same text, tokens, whitespaces, and token start and end positions.

Overrides:
equals in class Object

hashCode

public int hashCode()
Returns the hash code for this tokenization. The hash code is consistent with equality, but only considers the text and number of tokens.

Overrides:
hashCode in class Object
Returns:
The hash code for this tokenization.