com.aliasi.lm
Class TrieCharSeqCounter

java.lang.Object
  extended by com.aliasi.lm.TrieCharSeqCounter
All Implemented Interfaces:
CharSeqCounter

public class TrieCharSeqCounter
extends Object
implements CharSeqCounter

A TrieCharSeqCounter stores counts for substrings of strings. When the counter is constructed, a maximum length is specified, and counts are only stored for strings up to that length. For instance, an n-gram language model needs only counts for strings up to length n.

Strings may be added to the counter using incrementSubstrings(char[],int,int), which increments the counts for all substrings of the specified character slice up to the specified maximum length substring. The method incrementPrefixes(char[],int,int) increments only the prefixes of the specified string. All substrings are incremented by incrementing prefixes for each suffix. A substring counter may be pruned using prune(int), which removes all substrings with count below the specified threshold.

There are a wide range of reporting methods for trie-based counters.

Implementation Note: The trie counters are a heavily unfolded implementation of a character-based Patricia (PAT) trie.

Since:
LingPipe2.0
Version:
3.8
Author:
Bob Carpenter

Constructor Summary
TrieCharSeqCounter(int maxLength)
          Construct a substring counter that stores substrings up to the specified maximum length.
 
Method Summary
 char[] charactersFollowing(char[] cs, int start, int end)
          Returns the array of characters that have been observed following the specified character slice in unicode order.
 long count(char[] cs, int start, int end)
          Returns the count for the specified character sequence.
 long count(CharSequence cSeq)
          Returns the count in the training corpus for the specified sequence of characters.
 void decrementSubstrings(char[] cs, int start, int end)
          Decrements all of the substrings of the specified character slice by one.
 void decrementUnigram(char c)
          Decrements the unigram count for the specified character.
 void decrementUnigram(char c, int count)
          Decrements the unigram count by the specified amount for the specified character.
 long extensionCount(char[] cs, int start, int end)
          Returns the sum of the counts of all character sequences one character longer than the specified character slice.
 long extensionCount(CharSequence cSeq)
          Returns the sum of the counts of all character sequences one character longer than the specified character sequence.
 void incrementPrefixes(char[] cs, int start, int end)
          Increments the count of all prefixes of the specified character sequence up to the maximum length specified in the constructor.
 void incrementPrefixes(char[] cs, int start, int end, int count)
          Increments the count of all prefixes of the specified character sequence up to the maximum length specified in the constructor.
 void incrementSubstrings(char[] cs, int start, int end)
          Increments the count of all substrings of the specified character array slice up to the maximum length specified in the constructor.
 void incrementSubstrings(char[] cs, int start, int end, int count)
          Increments by the specified count all substrings of the specified character array slice up to the maximum length specified in the constructor.
 void incrementSubstrings(CharSequence cSeq)
          Increments the count of all substrings of the specified character sequence up to the maximum length specified in the constructor.
 void incrementSubstrings(CharSequence cSeq, int count)
          Increments by the specified count all substrings of the specified character sequence up to the maximum length specified in the constructor.
 int[] nGramFrequencies(int nGramOrder)
          Returns an array of frequency counts for n-grams of the specified n-gram order sorted in descending frequency order.
 int numCharactersFollowing(char[] cs, int start, int end)
          Returns the number of characters that when appended to the end of the specified character slice produce an extended slice with a non-zero count.
 char[] observedCharacters()
          Returns an array consisting of the characters with non-zero count in unicode order.
 void prune(int minCount)
          Removes strings with counts below the specified minimum.
static TrieCharSeqCounter readCounter(TrieReader reader, int maxNGram)
          Reads a trie character sequence counter from the specified trie reader, restricting the result to the specified maximum n-gram.
static TrieCharSeqCounter readFrom(InputStream in)
          Reads a trie character sequence counter from the specified input stream.
 ObjectToCounterMap<String> topNGrams(int nGramOrder, int maxReturn)
          Returns a counter of occurrences of the highest frequency n-grams of a specified n-gram order.
 String toString()
          Returns a string representation of the trie structure of counts underlying this counter.
 long totalSequenceCount()
          Returns the sum of counts for all non-empty character sequences.
 long totalSequenceCount(int length)
          Returns the sum of the counts of all character sequences of the specified length.
 long uniqueSequenceCount()
          Returns the number of character sequences with non-zero counts, including the empty (zero length) character sequence.
 long uniqueSequenceCount(int nGramOrder)
          Returns the number of character sequences of the specified length with non-zero counts.
 long[][] uniqueTotalNGramCount()
          Returns the array of unique and total n-gram counts for each n-gram length.
static void writeCounter(CharSeqCounter counter, TrieWriter writer, int maxNGram)
          Writes the specified sequence counter to the specified trie writer, restricting output to n-grams not longer than the specified maximum.
 void writeTo(OutputStream out)
          Writes an encoding of this counter to the specified output stream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TrieCharSeqCounter

public TrieCharSeqCounter(int maxLength)
Construct a substring counter that stores substrings up to the specified maximum length.

Parameters:
maxLength - Maximum length of substrings stored by this counter.
Throws:
IllegalArgumentException - If the maximum length is negative.
Method Detail

count

public long count(char[] cs,
                  int start,
                  int end)
Description copied from interface: CharSeqCounter
Returns the count for the specified character sequence.

Specified by:
count in interface CharSeqCounter
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Returns:
Count of character array slice in model.

extensionCount

public long extensionCount(char[] cs,
                           int start,
                           int end)
Description copied from interface: CharSeqCounter
Returns the sum of the counts of all character sequences one character longer than the specified character slice.

Specified by:
extensionCount in interface CharSeqCounter
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Returns:
The sum of the counts of all character sequences one character longer than the specified character slice.

observedCharacters

public char[] observedCharacters()
Description copied from interface: CharSeqCounter
Returns an array consisting of the characters with non-zero count in unicode order. The return value of this method will be equal to the return value of charactersFollowing(new char[0],0,0).

Specified by:
observedCharacters in interface CharSeqCounter
Returns:
Array of characters with non-zero counts.

charactersFollowing

public char[] charactersFollowing(char[] cs,
                                  int start,
                                  int end)
Description copied from interface: CharSeqCounter
Returns the array of characters that have been observed following the specified character slice in unicode order. The returned array will be in ascending unicode numerical order. Note that unicode order is not necessarily the same as any localized alpha-numeric sort order. rie

Specified by:
charactersFollowing in interface CharSeqCounter
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
The number of characters following the specified character slice.

numCharactersFollowing

public int numCharactersFollowing(char[] cs,
                                  int start,
                                  int end)
Description copied from interface: CharSeqCounter
Returns the number of characters that when appended to the end of the specified character slice produce an extended slice with a non-zero count. In symbols:
numCharactersFollowing(cSlice)
  = | { c | count(cSlice.c) > 0 } |
where count(cSlice.c) represents the count of the character slice cSlice suffixed with the character c.

Specified by:
numCharactersFollowing in interface CharSeqCounter
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
The number of characters following the specified character slice.

totalSequenceCount

public long totalSequenceCount()
Returns the sum of counts for all non-empty character sequences.

Returns:
The sum of counts for all non-empty character sequences.

totalSequenceCount

public long totalSequenceCount(int length)
Returns the sum of the counts of all character sequences of the specified length.

Returns:
The sum of the counts of all character sequences of the specified length.

uniqueSequenceCount

public long uniqueSequenceCount()
Returns the number of character sequences with non-zero counts, including the empty (zero length) character sequence.

Returns:
Number of character sequences with non-zero counts.

uniqueSequenceCount

public long uniqueSequenceCount(int nGramOrder)
Returns the number of character sequences of the specified length with non-zero counts.

Returns:
The number of character sequences of the specified length with non-zero counts.

prune

public void prune(int minCount)
Removes strings with counts below the specified minimum. Counts for remaining strings are not affected. Pruning may be interleaved with updating counts in any order.

Parameters:
minCount - Minimum count required to retain a substring count.
Throws:
IllegalArgumentException - If the count is less than 1.

nGramFrequencies

public int[] nGramFrequencies(int nGramOrder)
Returns an array of frequency counts for n-grams of the specified n-gram order sorted in descending frequency order. This form of result is sometimes called a Zipf plot because of the sorting.

Parameters:
nGramOrder - Order of n-gram counted.
Returns:
Array of frequency counts, sorted in decreasing order of rank.

uniqueTotalNGramCount

public long[][] uniqueTotalNGramCount()
Returns the array of unique and total n-gram counts for each n-gram length. The return array is indexed in the first position by n-gram length, and in the second position by 0 for unique counts and 1 for total counts. Thus for 0<=n<=maxLength():
uniqueTotalNGramCount()[n][0] == uniqueNGramCount(n)
and
uniqueTotalNGramCount()[n][1] == totalNGramCount(n)
If unique and total counts are required for several n-gram depths, this method is much more efficient than calling all of the individual methods separately.

Returns:
The array of unique and total n-gram counts for each n-gram length.

topNGrams

public ObjectToCounterMap<String> topNGrams(int nGramOrder,
                                            int maxReturn)
Returns a counter of occurrences of the highest frequency n-grams of a specified n-gram order. The actual n-grams are represented as strings in the result; recall that strings are instances of CharSequence.

The maximum number of results returned must be specified, because the entire set of n-grams is usually too large to return as a counter.

Parameters:
nGramOrder - Order of n-gram to count.
maxReturn - Maximum number of objects returned.

count

public long count(CharSequence cSeq)
Returns the count in the training corpus for the specified sequence of characters. The count returned may have been reduced from the raw counts in training cases by pruning.

Parameters:
cSeq - Character sequence.
Returns:
Count of character sequence in model.

extensionCount

public long extensionCount(CharSequence cSeq)
Returns the sum of the counts of all character sequences one character longer than the specified character sequence.

Parameters:
cSeq - Character sequence.
Returns:
The sum of the counts of all character sequences one character longer than the specified character sequence.

incrementSubstrings

public void incrementSubstrings(char[] cs,
                                int start,
                                int end)
Increments the count of all substrings of the specified character array slice up to the maximum length specified in the constructor.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Throws:
IndexOutOfBoundsException - If the specified start and one plus end point are not in the bounds of character sequence.

incrementSubstrings

public void incrementSubstrings(char[] cs,
                                int start,
                                int end,
                                int count)
Increments by the specified count all substrings of the specified character array slice up to the maximum length specified in the constructor.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
count - Amount to increment.
Throws:
IndexOutOfBoundsException - If the specified start and one plus end point are not in the bounds of character sequence.

incrementSubstrings

public void incrementSubstrings(CharSequence cSeq)
Increments the count of all substrings of the specified character sequence up to the maximum length specified in the constructor.

Parameters:
cSeq - Character sequence.

incrementSubstrings

public void incrementSubstrings(CharSequence cSeq,
                                int count)
Increments by the specified count all substrings of the specified character sequence up to the maximum length specified in the constructor.

Parameters:
cSeq - Character sequence.
count - Amount to increment.

incrementPrefixes

public void incrementPrefixes(char[] cs,
                              int start,
                              int end)
Increments the count of all prefixes of the specified character sequence up to the maximum length specified in the constructor.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Throws:
IndexOutOfBoundsException - If the specified start and one plus end point are not in the bounds of character sequence.

incrementPrefixes

public void incrementPrefixes(char[] cs,
                              int start,
                              int end,
                              int count)
Increments the count of all prefixes of the specified character sequence up to the maximum length specified in the constructor.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
count - Amount to increment.
Throws:
IndexOutOfBoundsException - If the specified start and one plus end point are not in the bounds of character sequence.

decrementSubstrings

public void decrementSubstrings(char[] cs,
                                int start,
                                int end)
Decrements all of the substrings of the specified character slice by one. This method may be used in conjunction with incrementSubstrings(char[],int,int) to implement counts for conditional probability estimates without affecting underlying estimates. For example, the following code:
 char[] cs = "abcdefghi".toCharArray();
 counter.incrementSubstrings(cs,3,7);
 counter.decrementSubstrings(cs,3,5);
 
will increment the substrings of "defg" and then decrement the substrings of "de", causing the net effect of incrementing the counts of substrings "defg", "efg", "fg", "g", "def", "ef", and "f". This has the effect of increasing the estimate of g given def, without increasing the estimate of d in an empty context.

Parameters:
cs - Underlying array of characters in slice.
start - Index of first character in slice.
end - Index of one past last character in slice.
Throws:
IllegalArgumentException - If the array slice is valid.

toString

public String toString()
Returns a string representation of the trie structure of counts underlying this counter.

Warning: The resulting string will be very large if the number of substrings is large. To avoid blowing out memory, do not call this method for large counters.

Overrides:
toString in class Object
Returns:
String representation of this counter.

decrementUnigram

public void decrementUnigram(char c)
Decrements the unigram count for the specified character. This method is useful for training conditional probabilities, even though it is not powerful enough to do it in full generality.

Parameters:
c - Decrement the unigram count for the specified character.

decrementUnigram

public void decrementUnigram(char c,
                             int count)
Decrements the unigram count by the specified amount for the specified character. This method is useful for training conditional probabilities, even though it is not powerful enough to do it in full generality.

Parameters:
c - Decrement the unigram count for the specified character.
count - Amount to decrement.

writeTo

public void writeTo(OutputStream out)
             throws IOException
Writes an encoding of this counter to the specified output stream. It may be read back in using readFrom(InputStream).

The output is produced using a BitTrieWriter wrapped around a BitOutput wrapped around the specified underlying output stream. First, the bit output is used to delta-code the maximum n-gram plus 1. Then, the trie is encoded as described in BitTrieWriter. Finally, the bit output is flushed. The underlying output stream is neither flushed nor closed, allowing them to be used for other pruposes after this counter is written.

If necessary for efficiency, streams should be buffered before being passed to this method.

Parameters:
out - Underlying output stream for writing.
Throws:
IOException - If there is an underlying I/O error.

writeCounter

public static void writeCounter(CharSeqCounter counter,
                                TrieWriter writer,
                                int maxNGram)
                         throws IOException
Writes the specified sequence counter to the specified trie writer, restricting output to n-grams not longer than the specified maximum.

Parameters:
counter - Counter to write.
writer - Trie writer to which counter is written.
maxNGram - Maximum length n-gram written.
Throws:
IOException - If there is an underlying I/O error.

readFrom

public static TrieCharSeqCounter readFrom(InputStream in)
                                   throws IOException
Reads a trie character sequence counter from the specified input stream.

The expected encoding is described in writeTo(OutputStream).

If necessary for efficiency, streams should be buffered before being passed to this method.

Parameters:
in - Underlying input stream for reading.
Throws:
IOException - If there is an underlying I/O error.

readCounter

public static TrieCharSeqCounter readCounter(TrieReader reader,
                                             int maxNGram)
                                      throws IOException
Reads a trie character sequence counter from the specified trie reader, restricting the result to the specified maximum n-gram.

Parameters:
reader - Reader from which to read the trie.
maxNGram - Maximum length n-gram to read.
Returns:
The counter read from the reader.
Throws:
IOException - If there is an underlying I/O error.