public interface CharSeqCounter
A CharSeqCounter provides counts for sequences of characters.
The method count(char[],int,int) returns the count of
the specified character array slice. The method extensionCount(char[],int,int) counts the number of
single-character extensions of the specified character array slice.
The maximum likelihood estimator can be computed directly from
these counts by:
P_ML(c_N | c_0, ..., c_{N-1})
  = count({c_0, ..., c_N}, 0, N+1)
  / extensionCount({c_0, ..., c_{N-1}}, 0, N)
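The denominator is not a simple count of the context because of the
way final suffix counts are incremented: an occurrence of the context
at the very end of a string is counted, but it has no following
character and so contributes to no extension. For instance, consider
counts of all substrings of "abab". The character b occurs twice, but
only the first occurrence is followed by another character, so the
maximum likelihood estimate of P(a|b) is
count(ba)/extensionCount(b) = 1/1, not
count(ba)/count(b) = 1/2.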
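The "abab" example can be checked with a small, deliberately naive counter. This is only a sketch: the class NaiveSubstringCounter and its train and mle methods are hypothetical names introduced for illustration and are not part of this interface. It assumes that training increments the count of every non-empty substring of the text, as in the example above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of the documented API.
final class NaiveSubstringCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Assumption: training increments the count of every non-empty substring.
    void train(String text) {
        for (int i = 0; i < text.length(); ++i) {
            for (int j = i + 1; j <= text.length(); ++j) {
                counts.merge(text.substring(i, j), 1L, Long::sum);
            }
        }
    }

    // Count of the slice cs[start, end).
    long count(char[] cs, int start, int end) {
        return counts.getOrDefault(new String(cs, start, end - start), 0L);
    }

    // Sum of the counts of all sequences that extend the slice by one character.
    long extensionCount(char[] cs, int start, int end) {
        String context = new String(cs, start, end - start);
        long sum = 0L;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            String s = e.getKey();
            if (s.length() == context.length() + 1 && s.startsWith(context)) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    // Maximum likelihood estimate of the last character of the non-empty
    // slice cs[start, end) given the characters before it.
    // Returns NaN when the context has never been extended.
    double mle(char[] cs, int start, int end) {
        long denominator = extensionCount(cs, start, end - 1);
        if (denominator == 0L) {
            return Double.NaN;
        }
        return (double) count(cs, start, end) / denominator;
    }
}
```

After training on "abab", count of "b" is 2 while extensionCount of "b" is 1, so mle on the slice "ba" yields 1.0 rather than 0.5.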
The method observedCharacters() returns an array of all
characters that appear in at least one substring. The method
charactersFollowing(char[],int,int) returns the array of
characters observed following the specified character slice,
whereas numCharactersFollowing(char[],int,int) returns the
number of characters observed following the specified character
slice. These methods are useful for computing the Witten-Bell
estimator used in NGramProcessLM.
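The relationship between these three methods can be sketched with a naive substring-based model. The class NaiveFollowerCounter and its train method are hypothetical names used only for illustration, and the sketch assumes that every non-empty substring of the training text receives a non-zero count. Under that assumption, the observed characters are exactly the characters following the empty slice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration class; not part of the documented API.
final class NaiveFollowerCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Assumption: training increments the count of every non-empty substring.
    void train(String text) {
        for (int i = 0; i < text.length(); ++i) {
            for (int j = i + 1; j <= text.length(); ++j) {
                counts.merge(text.substring(i, j), 1L, Long::sum);
            }
        }
    }

    // Characters c such that the slice cs[start, end) followed by c has a
    // non-zero count, in ascending char order.
    char[] charactersFollowing(char[] cs, int start, int end) {
        String context = new String(cs, start, end - start);
        TreeSet<Character> followers = new TreeSet<>();
        for (String s : counts.keySet()) {
            if (s.length() == context.length() + 1 && s.startsWith(context)) {
                followers.add(s.charAt(s.length() - 1));
            }
        }
        char[] result = new char[followers.size()];
        int i = 0;
        for (char c : followers) {
            result[i++] = c;
        }
        return result;
    }

    // Size of the set computed by charactersFollowing.
    int numCharactersFollowing(char[] cs, int start, int end) {
        return charactersFollowing(cs, start, end).length;
    }

    // In this naive model, the observed characters are the characters
    // following the empty slice.
    char[] observedCharacters() {
        return charactersFollowing(new char[0], 0, 0);
    }
}
```

A production implementation would more plausibly compute numCharactersFollowing without materializing the array; the delegation here only makes the definition explicit.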
| Method Summary | |
|---|---|
| char[] | charactersFollowing(char[] cs, int start, int end) Returns the array of characters that have been observed following the specified character slice in unicode order. |
| long | count(char[] cs, int start, int end) Returns the count for the specified character sequence. |
| long | extensionCount(char[] cs, int start, int end) Returns the sum of the counts of all character sequences one character longer than the specified character slice. |
| int | numCharactersFollowing(char[] cs, int start, int end) Returns the number of characters that when appended to the end of the specified character slice produce an extended slice with a non-zero count. |
| char[] | observedCharacters() Returns an array consisting of the characters with non-zero count in unicode order. |
| Method Detail |
|---|
long count(char[] cs,
int start,
int end)
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Throws:
IndexOutOfBoundsException - If the start and end minus
one indices are not in the range of the character array.
long extensionCount(char[] cs,
int start,
int end)
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - Index of one past last character in slice.
Throws:
IndexOutOfBoundsException - If the start and end minus
one indices are not in the range of the character array.
int numCharactersFollowing(char[] cs,
int start,
int end)
numCharactersFollowing(cSlice)
= | { c | count(cSlice.c) > 0 } |
where count(cSlice.c) represents the count
of the character slice cSlice suffixed with the
character c.
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus index of last character in slice.
Throws:
IndexOutOfBoundsException - If the start and end minus
one indices are not in the range of the character array.
char[] charactersFollowing(char[] cs,
int start,
int end)
Parameters:
cs - Underlying character array.
start - Index of first character in slice.
end - One plus index of last character in slice.
Throws:
IndexOutOfBoundsException - If the start and end minus
one indices are not in the range of the character array.
char[] observedCharacters()
Equivalent to charactersFollowing(new char[0],0,0).