com.aliasi.lm
Interface CharSeqCounter

All Known Implementing Classes:
CharSeqMultiCounter, TrieCharSeqCounter

public interface CharSeqCounter

A CharSeqCounter provides counts for sequences of characters.

The method count(char[],int,int) returns the count of the specified character array slice. The method extensionCount(char[],int,int) counts the number of single-character extensions of the specified character array slice. The maximum likelihood estimator can be computed directly from these counts by:

P_ML(c_N | c_0,...,c_{N-1})
  = count({c_0,...,c_N},0,N+1) / extensionCount({c_0,...,c_{N-1}},0,N)
The denominator is the extension count rather than a simple count of the context because of the way final suffix counts are incremented: an occurrence of the context at the end of a sequence has no following character. For instance, consider counts of all substrings of "abab"; the maximum likelihood estimate of P(a|b) is count(ba)/extensionCount(b)=1/1, not count(ba)/count(b)=1/2.
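
The "abab" example can be reproduced with a toy counter. The following is a minimal sketch, not LingPipe code: the class SubstringCounterSketch, its train method, and its mle helper are hypothetical names introduced for illustration, and the sketch assumes that training increments the count of every non-empty substring, as in the example above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy counter (not part of LingPipe): counts every
// non-empty substring of the training text.
class SubstringCounterSketch {
    private final Map<String, Long> counts = new HashMap<>();

    // Increment the count of every non-empty substring of text.
    void train(String text) {
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                counts.merge(text.substring(i, j), 1L, Long::sum);
            }
        }
    }

    // Count of the slice cs[start..end).
    long count(char[] cs, int start, int end) {
        return counts.getOrDefault(new String(cs, start, end - start), 0L);
    }

    // Sum of the counts of all sequences one character longer than the slice.
    long extensionCount(char[] cs, int start, int end) {
        String context = new String(cs, start, end - start);
        long sum = 0L;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            String key = e.getKey();
            if (key.length() == context.length() + 1 && key.startsWith(context)) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    // Maximum likelihood estimate of the last character of cs[start..end)
    // given the characters before it; NaN if the context was never extended.
    double mle(char[] cs, int start, int end) {
        long ext = extensionCount(cs, start, end - 1);
        return ext == 0L ? Double.NaN : (double) count(cs, start, end) / ext;
    }

    public static void main(String[] args) {
        SubstringCounterSketch counter = new SubstringCounterSketch();
        counter.train("abab");
        char[] ba = {'b', 'a'};
        System.out.println("count(b) = " + counter.count(ba, 0, 1));
        System.out.println("extensionCount(b) = " + counter.extensionCount(ba, 0, 1));
        System.out.println("P(a|b) = " + counter.mle(ba, 0, 2));
    }
}
```

In this sketch count(b) is 2 while extensionCount(b) is 1, because the final "b" in "abab" is never followed by another character; dividing by the extension count gives the estimate 1/1 described above.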

The method observedCharacters() returns an array of all characters that appear in at least one substring. The method charactersFollowing(char[],int,int) returns the array of characters observed following the specified character slice, whereas numCharactersFollowing(char[],int,int) returns the number of such characters. These methods are useful for computing the Witten-Bell estimator used in NGramProcessLM.
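
To suggest how these quantities might feed a Witten-Bell style estimator, the sketch below computes an interpolation weight from an extension count and a number of following characters. It uses one common textbook form of the weight; the class WittenBellWeightSketch, its method names, and the lambdaFactor parameter are assumptions made for illustration, and the exact formula used by NGramProcessLM should be checked against that class's own documentation.

```java
// Hypothetical helper (not part of LingPipe) showing one common form of
// Witten-Bell interpolation built from the two counter quantities.
class WittenBellWeightSketch {

    // Weight given to the maximum likelihood estimate for a context:
    //   extensionCount / (extensionCount + lambdaFactor * numFollowing)
    // Returns 0.0 for an unseen context, so all mass goes to the backoff.
    static double interpolationWeight(long extensionCount,
                                      int numFollowing,
                                      double lambdaFactor) {
        if (extensionCount <= 0L) {
            return 0.0;
        }
        return extensionCount / (extensionCount + lambdaFactor * numFollowing);
    }

    // Linear interpolation of the context estimate with a lower-order estimate.
    static double interpolate(double weight,
                              double mlEstimate,
                              double backoffEstimate) {
        return weight * mlEstimate + (1.0 - weight) * backoffEstimate;
    }

    public static void main(String[] args) {
        // Context "b" in "abab": extensionCount = 1, one distinct following character.
        double weight = interpolationWeight(1L, 1, 1.0);
        System.out.println("weight = " + weight);
        System.out.println("smoothed = " + interpolate(weight, 1.0, 0.5));
    }
}
```

The weight grows toward 1 as a context is extended more often relative to the number of distinct characters seen after it, which is why both extensionCount and numCharactersFollowing are needed.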

Since:
LingPipe2.0
Version:
3.0
Author:
Bob Carpenter

Method Summary
 char[] charactersFollowing(char[] cs, int start, int end)
          Returns the array of characters that have been observed following the specified character slice in unicode order.
 long count(char[] cs, int start, int end)
          Returns the count for the specified character sequence.
 long extensionCount(char[] cs, int start, int end)
          Returns the sum of the counts of all character sequences one character longer than the specified character slice.
 int numCharactersFollowing(char[] cs, int start, int end)
          Returns the number of characters that when appended to the end of the specified character slice produce an extended slice with a non-zero count.
 char[] observedCharacters()
          Returns an array consisting of the characters with non-zero count in unicode order.
 

Method Detail

count

long count(char[] cs,
           int start,
           int end)
Returns the count for the specified character sequence.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Returns:
Count of character array slice in model.
Throws:
IndexOutOfBoundsException - If the start and end minus one indices are not in the range of the character array.

extensionCount

long extensionCount(char[] cs,
                    int start,
                    int end)
Returns the sum of the counts of all character sequences one character longer than the specified character slice.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Returns:
The sum of the counts of all character sequences one character longer than the specified character slice.
Throws:
IndexOutOfBoundsException - If the start and end minus one indices are not in the range of the character array.

numCharactersFollowing

int numCharactersFollowing(char[] cs,
                           int start,
                           int end)
Returns the number of characters that when appended to the end of the specified character slice produce an extended slice with a non-zero count. In symbols:
numCharactersFollowing(cSlice)
  = | { c | count(cSlice.c) > 0 } |
where count(cSlice.c) represents the count of the character slice cSlice suffixed with the character c.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
The number of characters following the specified character slice.
Throws:
IndexOutOfBoundsException - If the start and end minus one indices are not in the range of the character array.

charactersFollowing

char[] charactersFollowing(char[] cs,
                           int start,
                           int end)
Returns the array of characters that have been observed following the specified character slice in unicode order. The returned array will be in ascending unicode numerical order. Note that unicode order is not necessarily the same as any localized alpha-numeric sort order.

Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus index of last character in slice.
Returns:
The array of characters observed following the specified character slice, in ascending unicode order.
Throws:
IndexOutOfBoundsException - If the start and end minus one indices are not in the range of the character array.

observedCharacters

char[] observedCharacters()
Returns an array consisting of the characters with non-zero count in unicode order. The return value of this method will be equal to the return value of charactersFollowing(new char[0],0,0).

Returns:
Array of characters with non-zero counts.
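
The relationships among charactersFollowing, numCharactersFollowing, and observedCharacters can be illustrated with a toy counter. This is a minimal sketch under stated assumptions, not LingPipe code: the class FollowingCharsSketch and its train method are hypothetical, and training is assumed to count every non-empty substring of the input.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy counter (not part of LingPipe) used to illustrate how
// the three character-listing methods relate to one another.
class FollowingCharsSketch {
    private final Map<String, Long> counts = new HashMap<>();

    // Increment the count of every non-empty substring of text.
    void train(String text) {
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                counts.merge(text.substring(i, j), 1L, Long::sum);
            }
        }
    }

    // Characters c such that the slice followed by c has a non-zero count,
    // in ascending char value order (TreeSet sorts Character numerically).
    char[] charactersFollowing(char[] cs, int start, int end) {
        String context = new String(cs, start, end - start);
        TreeSet<Character> following = new TreeSet<>();
        for (String key : counts.keySet()) {
            if (key.length() == context.length() + 1 && key.startsWith(context)) {
                following.add(key.charAt(key.length() - 1));
            }
        }
        char[] result = new char[following.size()];
        int i = 0;
        for (char c : following) {
            result[i++] = c;
        }
        return result;
    }

    // Size of the set returned by charactersFollowing.
    int numCharactersFollowing(char[] cs, int start, int end) {
        return charactersFollowing(cs, start, end).length;
    }

    // Characters following the empty slice, i.e. all observed characters.
    char[] observedCharacters() {
        return charactersFollowing(new char[0], 0, 0);
    }

    public static void main(String[] args) {
        FollowingCharsSketch counter = new FollowingCharsSketch();
        counter.train("abracadabra");
        char[] a = {'a'};
        System.out.println(new String(counter.charactersFollowing(a, 0, 1)));
        System.out.println(counter.numCharactersFollowing(a, 0, 1));
        System.out.println(new String(counter.observedCharacters()));
    }
}
```

Here observedCharacters is implemented directly as charactersFollowing on the empty slice, matching the equality stated for observedCharacters(), and numCharactersFollowing is the length of the array that charactersFollowing returns.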