com.aliasi.suffixarray
Class CharSuffixArray

java.lang.Object
  extended by com.aliasi.suffixarray.CharSuffixArray

public class CharSuffixArray
extends Object

A CharSuffixArray implements a suffix array of characters.

What is a Suffix Array?

Given a string characters cs, the corresponding suffix array is an array of int values of length equal to cs.length(). The suffix array contains each integer between 0 (inclusive) and the length of cs (exclusive). The suffix array is sorted so that an index m appears before n only if the string running from index m to the end of cs (i.e., cs.substring(m,cs.length-1) is less than the string running from index n to the end of cs, using ordinary Java String comparison.

Example

The standard example is the suffix array for the array of characters derived from the string "abracadabra". Here's the string itself, with its corresponding indexes:
 abracadabra
 012345678901
 0         1
 
The suffixes and their starting indexes are
Char Array IndexSuffix
0abracadabra
1bracadabra
2racadabra
3acadabra
4cadabra
5adabra
6dabra
7abra
8bra
9ra
10a
The suffix array sorts the char array indexes based on the sort order of the corresponding suffixes as strings.
Suffix IndexValueSuffix
010a
17abra
20abracadabra
33acadabra
45adabra
58bra
61bracadabra
74cadabra
86dabra
99ra
102racadabra
Thus the suffix array itself for "abracadabra" is the int[]-type array
 suffixArray("abracadbra")
 = { 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2 }
 

Constructing a Suffix Array

A suffix array is constructed from a character array and optionally a maximum length at which to compare strings. If the maximum length is less than the length of the array, strings are truncated to be at most this length before comparison. This isn't a true suffix array, but can be faster to create and will suffice for many applications. The indexes will be sorted relative to the truncated strings, so they will be in order up to the specified length.

Using Suffix Arrays

The primary application of suffix arrays is finding duplicate substrings. The key idea is that two substrings of a string can be represented as the prefixes of two suffixes of the string. For instance, the example above, "abracadabra", has two instances of the substring "br", corresponding to the suffixes "bracadabra" starting at index 1 in the original string and "bra" starting at index 8 in the original string. Note that these two suffixes are adjacent in the suffix array, occupying indexes 5 and 6 (in reverse order, because suffix "bra" sorts before "bracadabra" as a string.

The method prefixMatches(int) will return all spans in the suffix array that match up to a specific number of characters. For instance, to find all substrings that match of length 3 from suffix array sa, the method call sa.prefixMatches(3) returns a list containing all spans as integer arrays of type int[] with spans being represented from start position (inclusive) to end position (exclusive), which would here contain elements {1,3}, {5,7} indicating first that the suffixes at positions 1 and 2, namely "abra" and "abracadabra" start with the same three characters and second that the suffixes at positions 5 and 6, "bra" and "bracadabra", start with the same three characters. Thus we found all substrings of length 3 that occur more than once, namely "abr" and "bra", along with their positions. By using the suffix array itself, the positions in the underlying string may be retrieved. For instance, the suffixes at positions 1 and 2 in the suffix array start at positions

Thread Safety

A suffix array is thread safe after construction.

Since:
4.0.2
Version:
4.1.0
Author:
Bob Carpenter

Field Summary
static char SEPARATOR
          A special separator character, used to mark the boundaries of documents within the character array.
 
Constructor Summary
CharSuffixArray(String text)
          Construct a suffix array from the specified string, with no bound on suffix length.
CharSuffixArray(String text, int maxSuffixLength)
          Construct a suffix array from the specified string, bounding comparisons for sorting by the specified maximum suffix length.
 
Method Summary
 int maxSuffixLength()
          Returns the maximum suffix length for this character suffix array.
 List<int[]> prefixMatches(int minMatchLength)
          Returns a list of maximal spans of suffix array indexes which refer to suffixes that share a prefix of at least the specified minimum match length.
 String suffix(int csIndex, int maxLength)
          Returns the string that starts at position i in the character index and runs to the end of the character array or up to the specified maximum length.
 int suffixArray(int idx)
          Return the value of the suffix array at the specified index.
 int suffixArrayLength()
          Return the number of entries in this suffix array.
 String text()
          Returns the underlying array of characters for this class.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

SEPARATOR

public static char SEPARATOR
A special separator character, used to mark the boundaries of documents within the character array. Suffixes are considered virtually to only run up to a separator. The value of the separator char is '?'.

Constructor Detail

CharSuffixArray

public CharSuffixArray(String text)
Construct a suffix array from the specified string, with no bound on suffix length.

Parameters:
text - Underlying characters making up suffix array.

CharSuffixArray

public CharSuffixArray(String text,
                       int maxSuffixLength)
Construct a suffix array from the specified string, bounding comparisons for sorting by the specified maximum suffix length.

This constructor is appropriate if no operations will be subsequently performed on suffixes greater than the maximum specified length.

Parameters:
text - Underlying text for suffix array.
maxSuffixLength - Maximum suffix length for comparison.
Method Detail

text

public String text()
Returns the underlying array of characters for this class.

Returns:
The text underlying this suffix array.

maxSuffixLength

public int maxSuffixLength()
Returns the maximum suffix length for this character suffix array.

Returns:
Maximum length of suffixes.

suffixArray

public int suffixArray(int idx)
Return the value of the suffix array at the specified index.

Parameters:
idx - Index into suffix array.
Returns:
Index of suffix start position in the underlying character array.

suffixArrayLength

public int suffixArrayLength()
Return the number of entries in this suffix array.

Returns:
Length of the suffix array.

suffix

public String suffix(int csIndex,
                     int maxLength)
Returns the string that starts at position i in the character index and runs to the end of the character array or up to the specified maximum length.

Parameters:
csIndex - Starting index in underlying array of characters.
maxLength - Maximum length of returned string.
Returns:
String starting at the specified index to the end of the character array, truncated at max length.

prefixMatches

public List<int[]> prefixMatches(int minMatchLength)
Returns a list of maximal spans of suffix array indexes which refer to suffixes that share a prefix of at least the specified minimum match length.

Parameters:
minMatchLength - Minimum number of characters required to match.
Returns:
The list of pairs of start (inclusive) and end (exclsuive) positions in the suffix array that match up to the specified minimum number of characters.