java.lang.Object
  com.aliasi.suffixarray.DocumentTokenSuffixArray
public class DocumentTokenSuffixArray
A DocumentTokenSuffixArray implements a suffix array over a
collection of named documents.
The documents are concatenated with a specified distinguished token as a separator. The separator acts as an end-of-document marker that terminates comparisons.
A document suffix array is constructed from a mapping of identifiers to documents. A tokenizer factory and separator are also provided.
The underlying suffix array may be retrieved using suffixArray() and manipulated like any other token-based suffix array. The method textPositionToDocId(int) maps a position in the underlying token array to the document that contains that position.
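The concatenation scheme described above can be illustrated with a minimal, self-contained sketch. It is not the LingPipe implementation: the class DocConcatSketch, its fields, and the docIdAt method are hypothetical names, and plain whitespace splitting stands in for a real tokenizer factory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the LingPipe implementation.
final class DocConcatSketch {
    final List<String> tokens = new ArrayList<>();   // concatenated token stream
    final List<String> docIds = new ArrayList<>();   // document ids in insertion order
    final List<Integer> starts = new ArrayList<>();  // start offset of each document

    DocConcatSketch(Map<String, String> idToDoc, String boundary) {
        for (Map.Entry<String, String> e : idToDoc.entrySet()) {
            docIds.add(e.getKey());
            starts.add(tokens.size());
            // Assumption: whitespace splitting stands in for a tokenizer factory.
            for (String tok : e.getValue().trim().split("\\s+")) {
                if (!tok.isEmpty()) {
                    tokens.add(tok);
                }
            }
            tokens.add(boundary); // end-of-document marker
        }
    }

    // Returns the id of the document whose span (including its trailing
    // boundary token) covers the given position in the token stream.
    String docIdAt(int position) {
        if (position < 0 || position >= tokens.size()) {
            throw new IndexOutOfBoundsException("position=" + position);
        }
        int doc = 0;
        for (int i = 0; i < starts.size(); ++i) {
            if (starts.get(i) <= position) {
                doc = i; // last document starting at or before the position
            }
        }
        return docIds.get(doc);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a", "the cat sat");
        docs.put("b", "the dog");
        DocConcatSketch s = new DocConcatSketch(docs, "<EOD>");
        System.out.println(s.tokens);
        System.out.println(s.docIdAt(1) + " " + s.docIdAt(4));
    }
}
```

The linear scan in docIdAt keeps the sketch short; because the start offsets are increasing, a binary search over them would serve the same purpose.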
| Constructor Summary | |
|---|---|
| DocumentTokenSuffixArray(Map<String,String> idToDocMap, TokenizerFactory tf, int maxSuffixLength, String documentBoundaryToken) | Construct a suffix array from the specified identified document collection using the specified tokenizer factory, limiting comparisons to the specified maximum suffix length and separating documents with the specified boundary token. |
| Method Summary | | |
|---|---|---|
| int | docEndToken(String docId) | Returns the index of the next token past the last token of the specified document. |
| int | docStartToken(String docId) | Returns the starting token position in the underlying token suffix array of the document with the specified identifier in the overall set of documents. |
| Set<String> | documentNames() | Returns an unmodifiable view of the set of document names in the collection. |
| String | documentText(String docName) | Return the text of the document with the specified name. |
| static int | largestWithoutGoingOver(int[] vals, int val) | Given an increasing array of values and a specified value, return the largest index into the array such that the array's value at the index is smaller than or equal to the specified value. |
| int | numDocuments() | Returns the number of documents in the collection. |
| TokenSuffixArray | suffixArray() | Return the token suffix array backing this document suffix array. |
| String | textPositionToDocId(int textPosition) | Return the identifier of the document that contains the specified position in the underlying text. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public DocumentTokenSuffixArray(Map<String,String> idToDocMap,
TokenizerFactory tf,
int maxSuffixLength,
String documentBoundaryToken)
For this class to work properly, the tokenizer factory must tokenize the document boundary token into a single token when surrounded by spaces.
idToDocMap - Mapping from document identifiers to document texts.
tf - Tokenizer factory to use for matching.
maxSuffixLength - Maximum suffix length (in tokens) for comparisons.
documentBoundaryToken - Distinguished token used to separate documents.
IllegalArgumentException - If the tokenizer factory does not tokenize the document boundary token surrounded by single whitespaces into a single token consisting of the boundary token.
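The boundary-token requirement can be sketched as a standalone check. This is an illustration under stated assumptions, not the library's code: BoundaryCheckSketch, requireSingleToken, and both toy tokenizers are hypothetical, and a java.util.function.Function from text to a token list stands in for a tokenizer factory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the LingPipe implementation.
final class BoundaryCheckSketch {

    // Throws IllegalArgumentException unless " boundary " tokenizes to
    // exactly one token equal to the boundary itself.
    static void requireSingleToken(Function<String, List<String>> tokenizer,
                                   String boundary) {
        List<String> toks = tokenizer.apply(" " + boundary + " ");
        if (toks.size() != 1 || !toks.get(0).equals(boundary)) {
            throw new IllegalArgumentException(
                "Boundary token must tokenize to itself; got " + toks);
        }
    }

    // Toy tokenizer: splits on runs of whitespace.
    static List<String> whitespace(String text) {
        String t = text.trim();
        return t.isEmpty() ? Collections.<String>emptyList()
                           : Arrays.asList(t.split("\\s+"));
    }

    // Toy tokenizer: word runs and single punctuation characters are tokens.
    static List<String> wordsAndPunct(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+|[^\\w\\s]").matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Accepted: whitespace splitting keeps "<EOD>" as one token.
        requireSingleToken(BoundaryCheckSketch::whitespace, "<EOD>");
        try {
            // Rejected: "<EOD>" is split into "<", "EOD", ">".
            requireSingleToken(BoundaryCheckSketch::wordsAndPunct, "<EOD>");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a punctuation-splitting tokenizer, a purely alphanumeric boundary such as EOD would pass the same check, so the choice of boundary token depends on the tokenizer in use.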
// raise exception if find boundary in tokens of doc?
| Method Detail |
|---|
public TokenSuffixArray suffixArray()
public String textPositionToDocId(int textPosition)
textPosition - Position in underlying list of concatenated
documents.
public String documentText(String docName)
docName - Name of document.
NullPointerException - If the document name is not known.
public int numDocuments()
public Set<String> documentNames()
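public int docStartToken(String docId)
Returns the starting token position of the document with the specified identifier in the underlying token suffix array, or -1 if the document is not part of the collection.
docId - Document identifier.
public int docEndToken(String docId)
Returns the index of the next token past the last token of the specified document, or -1 if the document is not part of the collection.
docId - Document identifier.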
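Taken together, the start and end positions describe a half-open token range [start, end) for each document. The sketch below illustrates that reading. It is not the library's code: DocSpanSketch is a hypothetical class, its input (a map from document identifier to token count) is invented for the example, and it assumes exactly one boundary token follows each document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the LingPipe implementation.
final class DocSpanSketch {
    // document id -> {start, end}, where end is one past the last token
    private final Map<String, int[]> spans = new LinkedHashMap<>();

    // tokenCounts maps each document id to its number of tokens (assumed input).
    DocSpanSketch(Map<String, Integer> tokenCounts) {
        int pos = 0;
        for (Map.Entry<String, Integer> e : tokenCounts.entrySet()) {
            int start = pos;
            int end = start + e.getValue(); // one past the last token
            spans.put(e.getKey(), new int[] { start, end });
            pos = end + 1;                  // assumption: one boundary token per document
        }
    }

    // Start of the document's token range, or -1 for an unknown id.
    int docStartToken(String docId) {
        int[] span = spans.get(docId);
        return span == null ? -1 : span[0];
    }

    // One past the document's last token, or -1 for an unknown id.
    int docEndToken(String docId) {
        int[] span = spans.get(docId);
        return span == null ? -1 : span[1];
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a", 3);
        counts.put("b", 2);
        DocSpanSketch s = new DocSpanSketch(counts);
        System.out.println(s.docStartToken("b") + ".." + s.docEndToken("b"));
        System.out.println(s.docStartToken("missing"));
    }
}
```

Under these assumptions, the number of tokens in a document is the end position minus the start position.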
public static int largestWithoutGoingOver(int[] vals,
int val)
Warning: No test is made that the values are in increasing order. If they are not, the behavior of this method is not specified.
vals - Array of values, sorted in ascending order.
val - Specified value to search for.
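One way to meet this contract is a binary search over the sorted array. The sketch below is a possible implementation, not necessarily the one used by the library. The class name is hypothetical, and returning -1 when every element exceeds the value is an assumption, since the documentation does not specify that case.

```java
// Hypothetical illustration only; not the LingPipe implementation.
final class LargestWithoutGoingOverSketch {

    // Returns the largest index i such that vals[i] <= val, assuming vals is
    // sorted in ascending order. Returns -1 if no such index exists
    // (an assumption; the documented contract is silent on this case).
    static int largestWithoutGoingOver(int[] vals, int val) {
        int lo = 0;
        int hi = vals.length - 1;
        int best = -1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2; // avoids overflow of lo + hi
            if (vals[mid] <= val) {
                best = mid;   // candidate; look for a larger index to the right
                lo = mid + 1;
            } else {
                hi = mid - 1; // vals[mid] goes over; look to the left
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] starts = { 0, 4, 9 }; // e.g. document start offsets
        System.out.println(largestWithoutGoingOver(starts, 5));
    }
}
```

Applied to an array of document start offsets, this kind of search finds the document containing a given text position in logarithmic time.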