com.aliasi.suffixarray
Class DocumentTokenSuffixArray

java.lang.Object
  extended by com.aliasi.suffixarray.DocumentTokenSuffixArray

public class DocumentTokenSuffixArray
extends Object

A DocumentTokenSuffixArray implements a suffix array over a collection of named documents.

How it Works

The basic idea is that the documents are concatenated and then stored in a token suffix array. This class provides methods for extracting the document given a position in the suffix array.

The documents are concatenated with a specified distinguished token as a separator. The separator acts as an end-of-document marker that terminates comparisons.

A document suffix array is constructed from a mapping of identifiers to documents. A tokenizer factory and separator are also provided.

The underlying suffix array may be retrieved using suffixArray() and manipulated as any other token-based suffix array. The method textPositionToDocId(int) provides the means to map a position in the underlying token array to the document that spans the positions.

Since:
LingPipe 4.0.2
Version:
4.1.0
Author:
Bob Carpenter

Constructor Summary
DocumentTokenSuffixArray(Map<String,String> idToDocMap, TokenizerFactory tf, int maxSuffixLength, String documentBoundaryToken)
          Construct a suffix array from the specified identified document collection using the specified tokenizer factory, limiting comparisons to the specified maximum suffix length and separating documents with the specified boundary token.
 
Method Summary
 int docEndToken(String docId)
          Returns the index of the next token past the last token of the specified document.
 int docStartToken(String docId)
          Returns the starting token position in the underlying token suffix array of the document with the specified identifier in the overall set of documents.
 Set<String> documentNames()
          Returns an unmodifiable view of the set of document names in the collection.
 String documentText(String docName)
          Return the text of the document with the specified name.
static int largestWithoutGoingOver(int[] vals, int val)
          Given an increasing array of values and a specified value, return the largest index into the array such that the array's value at the index is smaller than or equal to the specified value.
 int numDocuments()
          Returns the number of documents in the collection.
 TokenSuffixArray suffixArray()
          Return the token suffix array backing this document suffix array.
 String textPositionToDocId(int textPosition)
          Return the identifier of the document that contains the specified position in the underlying text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DocumentTokenSuffixArray

public DocumentTokenSuffixArray(Map<String,String> idToDocMap,
                                TokenizerFactory tf,
                                int maxSuffixLength,
                                String documentBoundaryToken)
Construct a suffix array from the specified identified document collection using the specified tokenizer factory, limiting comparisons to the specified maximum suffix length and separating documents with the specified boundary token.

For this class to work properly, the tokenizer factory must tokenize the document boundary token into a single token when surrounded by spaces.

Parameters:
idToDocMap - Mapping from document identifiers to document texts.
tf - Tokenizer factory to use for matching.
maxSuffixLength - Maximum suffix length (in tokens) for comparsions.
documentBoundaryToken - Distinguished token used to separate documents.
Throws:
IllegalArgumentException - If the tokenizer factory does not tokenize the document boundary token surrounded by single whitespaces into a single token consisting of the boundary token. // raise exception if find boundary in tokens of doc?
Method Detail

suffixArray

public TokenSuffixArray suffixArray()
Return the token suffix array backing this document suffix array.

Returns:
Underlying suffix array.

textPositionToDocId

public String textPositionToDocId(int textPosition)
Return the identifier of the document that contains the specified position in the underlying text.

Parameters:
textPosition - Position in underlying list of concatenated documents.
Returns:
Position

documentText

public String documentText(String docName)
Return the text of the document with the specified name.

Parameters:
docName - Name of document.
Returns:
Text for that document.
Throws:
NullPointerException - If the document name is not known.

numDocuments

public int numDocuments()
Returns the number of documents in the collection.

Returns:
Number of documents in the collection.

documentNames

public Set<String> documentNames()
Returns an unmodifiable view of the set of document names in the collection.

Returns:
The set of document names.

docStartToken

public int docStartToken(String docId)
Returns the starting token position in the underlying token suffix array of the document with the specified identifier in the overall set of documents. Returns -1 if the document is not part of the collection.

Parameters:
docId - Document identifier.
Returns:
Position of first token in document in the underlying token suffix array.

docEndToken

public int docEndToken(String docId)
Returns the index of the next token past the last token of the specified document. Returns -1 if the document is not part of the collection.

Parameters:
docId - Document identifier.
Returns:
Position of first token in document in the underlying token suffix array.

largestWithoutGoingOver

public static int largestWithoutGoingOver(int[] vals,
                                          int val)
Given an increasing array of values and a specified value, return the largest index into the array such that the array's value at the index is smaller than or equal to the specified value. Returns -1 if there are no entries in the array less than the specified value.

Warning: No test is made that the values are in increasing order. If they are not, the behavior of this method is not specified.

Parameters:
vals - Array of values, sorted in ascending order.
val - Specified value to search.