com.aliasi.lm
Class TrieIntSeqCounter

java.lang.Object
  extended by com.aliasi.lm.TrieIntSeqCounter
All Implemented Interfaces:
IntSeqCounter

public class TrieIntSeqCounter
extends Object
implements IntSeqCounter

An TrieIntSeqCounter implements an integer sequence counter with a trie structure of counts.

Implementation Note: This trie-based integer sequence counter is not as tight in memory as the character tries, but is much more efficient for nodes with many daughters. It unfolds 1-daughter and 2-daughter nodes, and beyond that uses balanced binary trees (via java.util.TreeMap)

Since:
LingPipe2.0
Version:
3.9
Author:
Bob Carpenter

Constructor Summary
TrieIntSeqCounter(int maxLength)
          Construct an integer sequence counter for subsequences up to the specified maximum length.
 
Method Summary
 int count(int[] is, int start, int end)
          Returns the count of the specified sequence of integers.
 long extensionCount(int[] is, int start, int end)
          Returns the sum of the count of all sequences that extend the specified sequence by one integer.
 void handleNGrams(int nGram, int minCount, ObjectHandler<int[]> handler)
          Supplies each n-gram of the specified length and with greater than or equal to the specified minimum count to the specified handler.
 void incrementSequence(int[] is, int start, int end, int count)
          Increments the count for the specified slice by the specified amount.
 void incrementSubsequences(int[] is, int start, int end)
          Increments the count for all subsequences of the specified integer sequence up to the specified maximum length.
 void incrementSubsequences(int[] is, int start, int end, int count)
          Increments the count for all subsequences of the specified integer sequence up to the specified maximum length with the specified count.
 int[] integersFollowing(int[] is, int start, int end)
          Returns an array of the integers that follow the specified integer array slice.
 int maxLength()
          Returns the maximum length of subsequence of integers being counted.
 ObjectToCounterMap<int[]> nGramCounts(int nGram, int minCount)
          Returns a histogram of counts for n-grams of integers of the specified length, with a count of at least the specified minimum count.
 int numExtensions(int[] is, int start, int end)
          Returns the number of one integer extensions of the specified with non-zero counts.
 int[] observedIntegers()
          Returns an array of all integers that have non-zero counts in the model.
 void prune(int minCount)
          Removes all counts for sequences that are less than the minimum count.
 void rescale(double countMultiplier)
          Rescales all counts by multiplying them by the specified factor.
 String toString()
          Return a string-based representation of this integer sequence counter.
 int trieSize()
          Returns the size of this graph, measured in number of nodes in the trie structure.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TrieIntSeqCounter

public TrieIntSeqCounter(int maxLength)
Construct an integer sequence counter for subsequences up to the specified maximum length.

Parameters:
maxLength - Maximum length of subsequences counted.
Throws:
IllegalArgumentException - If the maximum length is less than zero.
Method Detail

prune

public void prune(int minCount)
Removes all counts for sequences that are less than the minimum count. This operation is safe in that it will never remove the root node. Pruning is idempotent in that pruning twice with the same count has no effect.

Parameters:
minCount - Minimum count to maintain a node.

rescale

public void rescale(double countMultiplier)
Rescales all counts by multiplying them by the specified factor. Counts are rounded down by casting back to int after being multipled by the scaling factor:
count=(int)(count*countMultiplier)
Unlike pruning, scaling has a cumulative effect and is not idempotent. For instance, a count of four scaled by half once will be two, and scaled by half twice will be one. Because of rounding, it's not even guaranteed that rescaling twice, rescale(0.5); rescale(0.5);, returns the same result as rescaling with the combined factor, rescale(0.25);.

Also unlike pruning, scaling, because of the integer rounding, may change the ratios between surviving counts. For instance, under scaling by 0.5, both 3 and 2 rescale to 1.

Parameters:
countMultiplier - Amount by which counts are scaled.

maxLength

public int maxLength()
Returns the maximum length of subsequence of integers being counted.

Returns:
The maximum length of subsequence of integers being counted.

incrementSubsequences

public void incrementSubsequences(int[] is,
                                  int start,
                                  int end)
Increments the count for all subsequences of the specified integer sequence up to the specified maximum length. For instance, calling incrementSubsequences({1,3,17,8,122},1,4) with a maximum length of 2 increments the bigram sequence counts {3,17}, {17,8} and the unigram sequence counts {3}, {17} and {8}.

Parameters:
is - Underlying array of integers.
start - Index of first integer in the slice.
end - Index of one past the last integer in the slice.
Throws:
IndexOutOfBoundsException - If the start and end minus one indices do not fall within the range of the integer array.

incrementSubsequences

public void incrementSubsequences(int[] is,
                                  int start,
                                  int end,
                                  int count)
Increments the count for all subsequences of the specified integer sequence up to the specified maximum length with the specified count. Calling incrementSubsequences(is,start,end,n) is equivalent to calling incrementSubsequences(is,start,end) a total of n times.

Parameters:
is - Underlying array of integers.
start - Index of first integer in the slice.
end - Index of one past the last integer in the slice.
count -
Throws:
IndexOutOfBoundsException - If the start and end minus one indices do not fall within the range of the integer array.
IllegalArgumentException - If the count is less than zero.

incrementSequence

public void incrementSequence(int[] is,
                              int start,
                              int end,
                              int count)
Increments the count for the specified slice by the specified amount. For instance, calling incrementSequence({1,2,3,4},1,3,15) results in the sequence 2,3 having its count incremented by 15.

If the sequence provided is longer than the maximum sequence counted, only its final counts are used. For example, if the maximum length is 3, then calling incrementSequence({1,2,3,4,5},0,5,12) is equivalent to calling incrementSequence({3,4,5},0,3,12).

Parameters:
is - Underlying array of integers.
start - Index of first integer in the slice.
end - Index of one past the last integer in the slice.
count -
Throws:
IndexOutOfBoundsException - If the start and end minus one indices do not fall within the range of the integer array.
IllegalArgumentException - If the count is less than zero.

nGramCounts

public ObjectToCounterMap<int[]> nGramCounts(int nGram,
                                             int minCount)
Returns a histogram of counts for n-grams of integers of the specified length, with a count of at least the specified minimum count. The resulting counter will be empty under if there are no n-grams in this counter of the specified length above the specified threshold. Note that one case of this is if the specified n-gram is greater than the maximum n-gram length for this counter.

Parameters:
nGram - Length of n-gram whose histrogram is returned.
minCount - Minimum count of element in histogram.
Returns:
Histogram of counts of n-grams of the specified length with counts above the specified minimum.
Throws:
IllegalArgumentException - If the n-gram length is less than 1.

trieSize

public int trieSize()
Returns the size of this graph, measured in number of nodes in the trie structure. This is equal to the number of sequences of integers for which this counter stores counts.

Returns:
The size of this counter.

handleNGrams

public void handleNGrams(int nGram,
                         int minCount,
                         ObjectHandler<int[]> handler)
Supplies each n-gram of the specified length and with greater than or equal to the specified minimum count to the specified handler.

Parameters:
nGram - Length of n-grams to visit.
minCount - Minimum count of visited n-gram.
handler - Handler for visited n-grams.

count

public int count(int[] is,
                 int start,
                 int end)
Description copied from interface: IntSeqCounter
Returns the count of the specified sequence of integers.

Specified by:
count in interface IntSeqCounter
Parameters:
is - Underlying array of integers.
start - Index of first integer in the slice.
end - Index of one past the last integer in the slice.

extensionCount

public long extensionCount(int[] is,
                           int start,
                           int end)
Description copied from interface: IntSeqCounter
Returns the sum of the count of all sequences that extend the specified sequence by one integer.

Specified by:
extensionCount in interface IntSeqCounter
Parameters:
is - Underlying array of integers.
start - Index of first integer in the slice.
end - Index of one past the last integer in the slice.

numExtensions

public int numExtensions(int[] is,
                         int start,
                         int end)
Description copied from interface: IntSeqCounter
Returns the number of one integer extensions of the specified with non-zero counts.

Specified by:
numExtensions in interface IntSeqCounter
Parameters:
is - Underlying array of integers.
start - Index of first integer in the slice.
end - Index of one past the last integer in the slice.

observedIntegers

public int[] observedIntegers()
Description copied from interface: IntSeqCounter
Returns an array of all integers that have non-zero counts in the model.

Specified by:
observedIntegers in interface IntSeqCounter
Returns:
Integers with non-zero counts in the model.

integersFollowing

public int[] integersFollowing(int[] is,
                               int start,
                               int end)
Description copied from interface: IntSeqCounter
Returns an array of the integers that follow the specified integer array slice.

Specified by:
integersFollowing in interface IntSeqCounter
Parameters:
is - Underlying array of integers.
start - Index of first integer in the slice.
end - Index of one past the last integer in the slice.

toString

public String toString()
Return a string-based representation of this integer sequence counter.

Overrides:
toString in class Object
Returns:
A string-based representation of this integer sequence counter.