com.aliasi.suffixarray
Class TokenSuffixArray

java.lang.Object
  extended by com.aliasi.suffixarray.TokenSuffixArray

public class TokenSuffixArray
extends Object

A TokenSuffixArray implements a suffix array of tokens. See CharSuffixArray for a description of suffix arrays and their applications.

Constructing a Token Suffix Array

A suffix array is constructed from a list of tokens. These may be provided directly, or as a character array and tokenizer factory.

If the maximum length is less than the length of the array, strings are truncated to be at most this length before comparison. The result isn't a standard, fully sorted suffix array, but can be faster to create and will suffice for many applications. The indexes will be sorted relative to the truncated strings, so they will be in order up to the specified length.

Document Boundary Token

The document boundary token is used to separate documents. When the document boundary token is found when comparing tokens, it's considered smaller than any other token (no matter how it would sort as a string) and also as a string terminator.

Thus if the tokenization corresponds to multiple documents, the boundary token should be used to separate them.

Tokenization Normalization for Comparison

In order to do comparisons that are case insensitive, or ignore punctuation, the tokenizer should perform the normalization.

Using Suffix Arrays

Token suffix arrays are used in exactly the same way as character suffix arrays; see CharSuffixArray for details and an example.

Thread Safety

Once constructed, a tokenized suffix array is thread safe.

Since:
4.0.2
Version:
4.1.0
Author:
Bob Carpenter

Field Summary
static String DEFAULT_DOCUMENT_BOUNDARY_TOKEN
          The default boundary token for documents.
 
Constructor Summary
TokenSuffixArray(Tokenization tokenization)
          Construct at token suffix array with no limit on suffix length and the default document-boundary token.
TokenSuffixArray(Tokenization tokenization, int maxSuffixLength)
          Construct a suffix array from the specified tokenization, comparing suffixes using up the specified maximum suffix length using the default document-boundary token.
TokenSuffixArray(Tokenization tokenization, int maxSuffixLength, String documentBoundaryToken)
          Construct a suffix array from the specified tokenization, comparing suffixes using up the specified maximum suffix length using the default document-boundary token.
 
Method Summary
 String documentBoundaryToken()
          Returns the token used to separate documents in this suffix array.
 int maxSuffixLength()
          Returns the maximum suffix length for this token suffix array.
 List<int[]> prefixMatches(int minMatchLength)
          Returns a list of maximal spans of suffix array indexes which refer to suffixes that share a prefix of at least the specified minimum match length.
 String substring(int idx, int maxTokens)
          Returns the substring of the original string that's spanned by the tokens starting at the specified suffix array index and running the specified maximum number of tokens (or until the token sequence ends).
 int suffixArray(int idx)
          Returns the value of the suffix array at the specified index.
 int suffixArrayLength()
          Returns the number of tokens in the suffix array.
 Tokenization tokenization()
          Returns the tokenization underlying this suffix array.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_DOCUMENT_BOUNDARY_TOKEN

public static final String DEFAULT_DOCUMENT_BOUNDARY_TOKEN
The default boundary token for documents.

See Also:
Constant Field Values
Constructor Detail

TokenSuffixArray

public TokenSuffixArray(Tokenization tokenization)
Construct at token suffix array with no limit on suffix length and the default document-boundary token.

Parameters:
tokenization - Tokenization on which to base the suffix array.

TokenSuffixArray

public TokenSuffixArray(Tokenization tokenization,
                        int maxSuffixLength)
Construct a suffix array from the specified tokenization, comparing suffixes using up the specified maximum suffix length using the default document-boundary token.

Parameters:
tokenization - Tokenization on which to base suffix array.
maxSuffixLength - Maximum length of token sequences to compare.

TokenSuffixArray

public TokenSuffixArray(Tokenization tokenization,
                        int maxSuffixLength,
                        String documentBoundaryToken)
Construct a suffix array from the specified tokenization, comparing suffixes using up the specified maximum suffix length using the default document-boundary token.

Parameters:
tokenization - Tokenization on which to base suffix array.
maxSuffixLength - Maximum length of token sequences to compare.
documentBoundaryToken - Token used to separate documents.
Method Detail

documentBoundaryToken

public String documentBoundaryToken()
Returns the token used to separate documents in this suffix array.

Returns:
Separator token.

maxSuffixLength

public int maxSuffixLength()
Returns the maximum suffix length for this token suffix array.

Returns:
Maximum length of suffixes.

tokenization

public Tokenization tokenization()
Returns the tokenization underlying this suffix array. The tokenization may be used to retrieve the processed tokens, the underlying text, as well as the positions of the tokens in the text.

Returns:
The tokenization for this suffix array.

suffixArray

public int suffixArray(int idx)
Returns the value of the suffix array at the specified index. This value is an index into the underlying list of tokens.

Parameters:
idx - Suffix array index.
Returns:
Index of the first token of the suffix at the specified index.

suffixArrayLength

public int suffixArrayLength()
Returns the number of tokens in the suffix array.

Returns:
Number of tokens in the suffix array.

substring

public String substring(int idx,
                        int maxTokens)
Returns the substring of the original string that's spanned by the tokens starting at the specified suffix array index and running the specified maximum number of tokens (or until the token sequence ends).

Parameters:
idx - Index in suffix array of first token.
maxTokens - Maximum number of tokens to include in string.
Returns:
Substring starting at the specified index and running the maximum number of tokens or until the end of the tokenization.

prefixMatches

public List<int[]> prefixMatches(int minMatchLength)
Returns a list of maximal spans of suffix array indexes which refer to suffixes that share a prefix of at least the specified minimum match length.

Parameters:
minMatchLength - Minimum number of tokens required to match.
Returns:
The list of pairs of start (inclusive) and end (exclsuive) positions in the suffix array that match up to the specified minimum number of tokens.