com.aliasi.stats
Class Statistics

java.lang.Object
  extended by com.aliasi.stats.Statistics

public class Statistics
extends Object

The Statistics class provides static utility methods for statistical computations.

Since:
LingPipe 2.0
Version:
3.5
Author:
Bob Carpenter

Method Summary
static double chiSquaredIndependence(double[][] contingencyMatrix)
          Returns the value of Pearson's C² goodness-of-fit statistic for independence over the specified contingency matrix.
static double chiSquaredIndependence(double both, double oneOnly, double twoOnly, double neither)
          Returns Pearson's C² goodness-of-fit statistic for independence over the specified binary contingency matrix.
static double correlation(double[] xs, double[] ys)
          Returns (Pearson's) correlation coefficient between two specified arrays.
static double dirichletLog2Prob(double[] alphas, double[] xs)
          Returns the log (base 2) of the probability of the specified discrete distribution given the specified Dirichlet distribution concentration parameters.
static double dirichletLog2Prob(double alpha, double[] xs)
          Returns the log (base 2) of the probability of the specified discrete distribution given the specified uniform Dirichlet with concentration parameters equal to the specified value.
static double jsDivergence(double[] p, double[] q)
          Return the Jensen-Shannon divergence between the specified multinomial distributions.
static double kappa(double observedProb, double expectedProb)
          Returns the value of the kappa statistic for the specified observed and expected probabilities.
static double klDivergence(double[] p, double[] q)
          Returns the Kullback-Leibler divergence of the second specified multinomial relative to the first.
static double klDivergenceDirichlet(double[] xs, double[] ys)
          Returns the Kullback-Leibler divergence of the second specified Dirichlet distribution relative to the first.
static double[] linearRegression(double[] xs, double[] ys)
          Returns a two-element array of linear regression coefficients for the specified x and y values.
static double[] logisticRegression(double[] xs, double[] ys, double maxValue)
          Returns a two-element array of logistic regression coefficients for the specified x and y values.
static double mean(double[] xs)
          Returns the mean of the specified array of input values.
static double[] normalize(double[] probabilityRatios)
          Return an array of probabilities resulting from normalizing the specified probability ratios.
static int[] permutation(int length)
          Returns a permutation of the integers between 0 and the specified length minus one.
static int[] permutation(int length, Random random)
          Returns a permutation of the integers between 0 and the specified length minus one using the specified randomizer.
static int sample(double[] cumulativeProbRatios, Random random)
          Returns a sample from the discrete distribution represented by the specified cumulative probability ratios, using the specified random number generator.
static double standardDeviation(double[] xs)
          Returns the standard deviation of the specified array of input values.
static double symmetrizedKlDivergence(double[] p, double[] q)
          Returns the symmetrized KL-divergence between the specified distributions.
static double symmetrizedKlDivergenceDirichlet(double[] xs, double[] ys)
          Returns the symmetric version of the Kullback-Leibler divergence between the Dirichlet distributions determined by the specified parameters.
static double variance(double[] xs)
          Returns the variance of the specified array of input values.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

klDivergenceDirichlet

public static double klDivergenceDirichlet(double[] xs,
                                           double[] ys)
Returns the Kullback-Leibler divergence of the second specified Dirichlet distribution relative to the first. The Dirichlet distributions are specified by their distribution and concentration, which are folded into a single count argument.

The KL divergence between two Dirichlet distributions with parameters xs and ys of dimensionality n is:

 D_KL(Dirichlet(xs) || Dirichlet(ys))
 = ∫ p(θ|xs) log ( p(θ|xs) / p(θ|ys) ) dθ
 = log Γ(Σ_{i<n} xs[i])
   - log Γ(Σ_{i<n} ys[i])
   - Σ_{i<n} log Γ(xs[i])
   + Σ_{i<n} log Γ(ys[i])
   + Σ_{i<n} (xs[i] - ys[i]) * (Ψ(xs[i]) - Ψ(Σ_{k<n} xs[k]))
where Γ is the gamma function (see Math.log2Gamma(double)), and where Ψ is the digamma function defined in Math.digamma(double).

In keeping with the other information-theoretic methods in this class, this method returns its result in bits (base 2) rather than nats (base e). The result is rescaled from the natural-log divergence:

 klDivergenceDirichlet(xs,ys)
 = D_KL(Dirichlet(xs) || Dirichlet(ys)) / ln 2

Parameters:
xs - Dirichlet parameter for the first distribution.
ys - Dirichlet parameter for the second distribution.
Returns:
The KL-divergence of the second Dirichlet distribution relative to the first.
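
For example, a minimal usage sketch (the concentration parameters here are illustrative, not taken from the LingPipe sources):

 double[] xs = { 2.0, 3.0, 5.0 };  // concentration parameters of the first Dirichlet
 double[] ys = { 1.0, 1.0, 1.0 };  // uniform Dirichlet(1,1,1) as the second distribution
 double kl = Statistics.klDivergenceDirichlet(xs, ys);  // divergence in bits (base 2)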

symmetrizedKlDivergenceDirichlet

public static double symmetrizedKlDivergenceDirichlet(double[] xs,
                                                      double[] ys)
Returns the symmetric version of the Kullback-Leibler divergence between the Dirichlet distributions determined by the specified parameters.

The symmetrized KL divergence for Dirichlets is just the average of both relative divergences:

 D_SKL(Dirichlet(xs), Dirichlet(ys))
 = ( D_KL(Dirichlet(xs) || Dirichlet(ys)) + D_KL(Dirichlet(ys) || Dirichlet(xs)) ) / 2
 

See the method documentation for klDivergenceDirichlet(double[],double[]) for a definition of the relative divergence for Dirichlet distributions.

Parameters:
xs - Dirichlet parameter for the first distribution.
ys - Dirichlet parameter for the second distribution.
Returns:
The symmetrized KL-divergence of the distributions.

klDivergence

public static double klDivergence(double[] p,
                                  double[] q)
Returns the Kullback-Leibler divergence of the second specified multinomial relative to the first.

The K-L divergence of a multinomial q relative to a multinomial p, both with n outcomes, is:

 D_KL(p || q)
 = Σ_{i<n} p(i) log₂ ( p(i) / q(i) )
The value is guaranteed to be non-negative, and will be 0.0 only if the two distributions are identical. If any outcome has zero probability in q and non-zero probability in p, the result is infinite.

KL divergence is not symmetric. That is, there are p and q such that D_KL(p||q) != D_KL(q||p). See symmetrizedKlDivergence(double[],double[]) and jsDivergence(double[],double[]) for symmetric variants.

KL divergence may also be expressed in terms of cross entropy. If H(p,q) is the cross entropy of p and q, and H(p) is the entropy of p, then:

 D_KL(p||q) = H(p,q) - H(p)

Parameters:
p - First multinomial distribution.
q - Second multinomial distribution.
Throws:
IllegalArgumentException - If the distributions are not the same length or have entries less than zero or greater than 1.
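
A minimal sketch of the computation as defined above (an illustration only, not LingPipe's own implementation; argument validation is omitted):

 static double klDivergence(double[] p, double[] q) {
     double divergence = 0.0;
     for (int i = 0; i < p.length; ++i)
         if (p[i] > 0.0)  // outcomes with p(i) == 0 contribute nothing
             divergence += p[i] * (Math.log(p[i] / q[i]) / Math.log(2.0));  // log base 2
     return divergence;  // Double.POSITIVE_INFINITY if some q(i) == 0 while p(i) > 0
 }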

symmetrizedKlDivergence

public static double symmetrizedKlDivergence(double[] p,
                                             double[] q)
Returns the symmetrized KL-divergence between the specified distributions. The symmetrization is carried out by averaging the two relative divergences:

 D_SKL(p, q)
 = ( D_KL(p||q) + D_KL(q||p) ) / 2

Parameters:
p - First multinomial.
q - Second multinomial.
Returns:
The symmetrized KL divergence between the multinomials.
Throws:
IllegalArgumentException - If the distributions are not the same length or have entries less than zero or greater than 1.

jsDivergence

public static double jsDivergence(double[] p,
                                  double[] q)
Return the Jensen-Shannon divergence between the specified multinomial distributions. The JS divergence is defined by:

 D_JS(p, q)
 = ( D_KL(p||m) + D_KL(q||m) ) / 2
where m is defined as the balanced linear interpolation (that is, the average) of p and q:

 m[i] = (p[i] + q[i]) / 2
The JS divergence is non-negative, equal to zero only if p and q are the same distribution, and symmetric.

Parameters:
p - First multinomial.
q - Second multinomial.
Returns:
The JS divergence between the multinomials.
Throws:
IllegalArgumentException - If the distributions are not the same length or have entries less than zero or greater than 1.
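
The definition translates directly into code; a minimal sketch in terms of klDivergence(double[],double[]) (illustration only; validation omitted):

 static double jsDivergence(double[] p, double[] q) {
     double[] m = new double[p.length];
     for (int i = 0; i < p.length; ++i)
         m[i] = (p[i] + q[i]) / 2.0;  // average of the two distributions
     return (Statistics.klDivergence(p, m) + Statistics.klDivergence(q, m)) / 2.0;
 }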

permutation

public static int[] permutation(int length)
Returns a permutation of the integers between 0 and the specified length minus one.

Parameters:
length - Size of permutation to return.
Returns:
Permutation of the specified length.

permutation

public static int[] permutation(int length,
                                Random random)
Returns a permutation of the integers between 0 and the specified length minus one using the specified randomizer.

Parameters:
length - Size of permutation to return.
random - Randomizer to use for permutation.
Returns:
Permutation of the specified length.
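
The documentation does not specify the shuffling algorithm; a standard Fisher-Yates shuffle is one way to produce such a permutation (a sketch, not necessarily LingPipe's implementation):

 static int[] permutation(int length, java.util.Random random) {
     int[] result = new int[length];
     for (int i = 0; i < length; ++i)
         result[i] = i;  // start from the identity permutation 0..length-1
     for (int i = length - 1; i > 0; --i) {
         int j = random.nextInt(i + 1);  // uniform index in [0, i]
         int temp = result[i];  // swap positions i and j
         result[i] = result[j];
         result[j] = temp;
     }
     return result;
 }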

chiSquaredIndependence

public static double chiSquaredIndependence(double both,
                                            double oneOnly,
                                            double twoOnly,
                                            double neither)
Returns Pearson's C² goodness-of-fit statistic for independence over the specified binary contingency matrix. Asymptotically, this statistic has a χ² distribution with one degree of freedom. The higher the value, the less likely the two outcomes are independent. Note that this method is just a special case of the general chi-squared independence test:
 chiSquaredIndependence(both,oneOnly,twoOnly,neither)
 = chiSquaredIndependence(new double[][] { {both, oneOnly},
                                           {twoOnly, neither} });
 
The specified values make up the following contingency matrix for boolean outcomes labeled One and Two:

           +Two      -Two
  +One     both      oneOnly
  -One     twoOnly   neither

The value both is the count of events where both outcomes occurred, oneOnly for where only the first outcome occurred, twoOnly for where only the second outcome occurred, and neither for when neither occurred. Let totalCount be the sum of all of the cells in the matrix.

From the contingency matrix, marginal probabilities for the two outcomes may be computed:

P(One) = (both + oneOnly) / totalCount
P(Two) = (both + twoOnly) / totalCount
If these probabilities are independent, the expected values of the matrix cells are:
E(both)= totalCount * P(One) * P(Two)
E(oneOnly) = totalCount * P(One) * (1-P(Two))
E(twoOnly) = totalCount * (1-P(One)) * P(Two)
E(neither) = totalCount * (1-P(One)) * (1-P(Two))
These are used to derive the independence test statistic, which is the sum of the squared differences between the observed values and the values expected under the independence assumption, normalized by the expected values:
 C² = (both - E(both))² / E(both)
    + (oneOnly - E(oneOnly))² / E(oneOnly)
    + (twoOnly - E(twoOnly))² / E(twoOnly)
    + (neither - E(neither))² / E(neither)
Unlike the higher dimensional case, this statistic applies as a hypothesis test only in the case when all expected values are at least 10.

Parameters:
both - Count of samples of both outcomes.
oneOnly - Count of samples with the first and not the second outcome.
twoOnly - Count of samples with the second and not the first outcome.
neither - Count of samples with neither outcome.
Returns:
Pearson's C² goodness-of-fit statistic for independence over the specified sample counts.
Throws:
IllegalArgumentException - If any of the arguments are not non-negative finite numbers.
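
A worked example under the definitions above, with illustrative counts both=10, oneOnly=20, twoOnly=30, neither=40:

 // totalCount = 100, P(One) = 30/100 = 0.3, P(Two) = 40/100 = 0.4
 // E(both) = 12, E(oneOnly) = 18, E(twoOnly) = 28, E(neither) = 42
 // C² = 4/12 + 4/18 + 4/28 + 4/42 ≈ 0.794
 double c2 = Statistics.chiSquaredIndependence(10.0, 20.0, 30.0, 40.0);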

linearRegression

public static double[] linearRegression(double[] xs,
                                        double[] ys)
Returns a two-element array of linear regression coefficients for the specified x and y values. The coefficients returned, { β0, β1 }, define a linear function:
 f(x) = β1 * x + β0
The coefficients returned produce the linear function f(x) with the smallest squared error:
 sqErr(f,xs,ys) = Σ_i (f(xs[i]) - ys[i])²
where all sums are for 0 <= i < xs.length. The method requires only a single pass through the two arrays, with β0 and β1 given by:

 β1 = n * Σ_i x[i] * y[i]  -  (Σ_i x[i])(Σ_i y[i])
      --------------------------------------------
          n * Σ_i x[i]*x[i]  -  (Σ_i x[i])²

 β0 = Σ_i y[i] - β1 Σ_i x[i]
      ----------------------
                n

where n = xs.length = ys.length.

Parameters:
xs - Array of x values.
ys - Parallel array of y values.
Returns:
Pair of regression coefficients.
Throws:
IllegalArgumentException - If the arrays are of length less than 2, or if the arrays are not of the same length.
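
A single-pass sketch implementing the closed-form solution above (illustration only; the length checks described under Throws are omitted):

 static double[] linearRegression(double[] xs, double[] ys) {
     int n = xs.length;
     double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0;
     for (int i = 0; i < n; ++i) {  // single pass accumulating the four sums
         sumX += xs[i];
         sumY += ys[i];
         sumXY += xs[i] * ys[i];
         sumXX += xs[i] * xs[i];
     }
     double beta1 = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
     double beta0 = (sumY - beta1 * sumX) / n;
     return new double[] { beta0, beta1 };
 }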

logisticRegression

public static double[] logisticRegression(double[] xs,
                                          double[] ys,
                                          double maxValue)
Returns a two-element array of logistic regression coefficients for the specified x and y values. The coefficients returned, { β0, β1 }, define the logistic function:
               L
 f(x) = ------------------
         1 + e^(β1*x + β0)

with the minimum squared error. See linearRegression(double[],double[]) for a definition of squared error. This function takes real values in the open interval (0, L).

Logistic regression coefficients are computed using linear regression, after transforming the y values. This is possible because of the following linear relation:

 log ((L - y) / y) = β1 * x + β0
 

Parameters:
xs - Array of x values.
ys - Array of y values.
maxValue - Maximum value of function.
Returns:
Two-element array of logistic regression coefficients.
Throws:
IllegalArgumentException - If the maximum value is not positive and finite, if the arrays are of length less than 2, or if the arrays are not of the same length.
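
A sketch of the transform-then-regress approach described above (illustration only; it assumes every ys[i] lies strictly inside the open interval (0, maxValue)):

 static double[] logisticRegression(double[] xs, double[] ys, double maxValue) {
     double[] transformed = new double[ys.length];
     for (int i = 0; i < ys.length; ++i)
         transformed[i] = Math.log((maxValue - ys[i]) / ys[i]);  // log((L-y)/y)
     // the transformed problem is linear: log((L-y)/y) = β1 * x + β0
     return Statistics.linearRegression(xs, transformed);
 }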

chiSquaredIndependence

public static double chiSquaredIndependence(double[][] contingencyMatrix)
Returns the value of Pearson's C² goodness-of-fit statistic for independence over the specified contingency matrix. Asymptotically, this statistic has a χ² distribution with (numRows-1)*(numCols-1) degrees of freedom. The higher the value, the less likely the two outcomes are independent. Pearson's C² statistic is defined as follows:
 C² = Σ_i Σ_j (matrix[i][j] - expectedCount(i,j,matrix))² / expectedCount(i,j,matrix)
where the expected count is the total count times the maximum likelihood estimates of the row i probability times the column j probability:
 expectedCount(i,j,matrix)
 = totalCount(matrix)
     * rowCount(i,matrix)/totalCount(matrix)
     * colCount(j,matrix)/totalCount(matrix)
 = rowCount(i,matrix) * colCount(j,matrix) / totalCount(matrix)
where
 rowCount(i,matrix) = Σ_{0<=j<numCols} matrix[i][j]
 colCount(j,matrix) = Σ_{0<=i<numRows} matrix[i][j]
 totalCount(matrix) = Σ_{0<=i<numRows} Σ_{0<=j<numCols} matrix[i][j]

The χ2 test is a large sample test and is only valid if all of the expected counts are at least 5. This restriction is often ignored for ranking purposes.

Parameters:
contingencyMatrix - The specified contingency matrix.
Returns:
Pearson's C² statistic for independence testing over the contingency matrix.
Throws:
IllegalArgumentException - If the matrix is not rectangular or if any value is not a non-negative finite number.
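
A sketch of the general computation (illustration only; the rectangularity and non-negativity checks are omitted):

 static double chiSquaredIndependence(double[][] matrix) {
     int numRows = matrix.length;
     int numCols = matrix[0].length;
     double[] rowCounts = new double[numRows];
     double[] colCounts = new double[numCols];
     double totalCount = 0.0;
     for (int i = 0; i < numRows; ++i)
         for (int j = 0; j < numCols; ++j) {  // accumulate the marginal counts
             rowCounts[i] += matrix[i][j];
             colCounts[j] += matrix[i][j];
             totalCount += matrix[i][j];
         }
     double c2 = 0.0;
     for (int i = 0; i < numRows; ++i)
         for (int j = 0; j < numCols; ++j) {
             double expected = rowCounts[i] * colCounts[j] / totalCount;
             double diff = matrix[i][j] - expected;
             c2 += diff * diff / expected;  // squared difference, normalized
         }
     return c2;
 }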

normalize

public static double[] normalize(double[] probabilityRatios)
Return an array of probabilities resulting from normalizing the specified probability ratios. The resulting array of probabilities is the same length as the input ratio array and each probability is simply the input array's value divided by the sum of the ratios.

Warning: This method is implemented by summing the probability ratios and then dividing each element by the sum. Because of the limited precision of double-based arithmetic, if the largest ratio is much larger than the next largest ratio, then the largest normalized probability will be one and all others will be zero. Java double values follow the IEEE 754 arithmetic standard and use 52 bits for their mantissas, so only ratios within a factor of 2⁵² ≈ 10¹⁶ of the maximum ratio will normalize to non-zero probabilities.

Parameters:
probabilityRatios - Ratios of probabilities.
Returns:
Probabilities resulting from normalizing the ratios.
Throws:
IllegalArgumentException - If the input contains a value that is not a finite non-negative number, or if the input does not contain at least one non-zero entry.
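
For example (illustrative values):

 double[] ratios = { 1.0, 2.0, 1.0 };
 double[] probs = Statistics.normalize(ratios);  // { 0.25, 0.5, 0.25 }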

kappa

public static double kappa(double observedProb,
                           double expectedProb)
Returns the value of the kappa statistic for the specified observed and expected probabilities. The kappa statistic provides a kind of adjustment for the expected (random) difficulty of a problem. It is defined by:
kappa(p,e) = (p - e) / (1 - e)

The most typical use for kappa is in evaluating classification problems of a machine versus a gold standard or between two human annotators to assess inter-annotator agreement.

Parameters:
observedProb - Observed probability.
expectedProb - Expected probability.
Returns:
The value of the kappa statistic for the specified probability and expected probability.
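
For example, an observed agreement of 0.85 against a chance agreement of 0.60 yields:

 // kappa = (0.85 - 0.60) / (1 - 0.60) = 0.25 / 0.40 = 0.625
 double k = Statistics.kappa(0.85, 0.60);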

mean

public static double mean(double[] xs)
Returns the mean of the specified array of input values. The mean of an array is defined by:
 mean(xs) = (Σ_{i < xs.length} xs[i]) / xs.length
If the array is of length zero, the result is Double.NaN.

Parameters:
xs - Array of values.
Returns:
Mean of array of values.

variance

public static double variance(double[] xs)
Returns the variance of the specified array of input values. The variance of an array of values is the mean of the squared differences between the values and the mean:
 variance(xs) = Σ_{i < xs.length} (mean(xs) - xs[i])² / xs.length
If the array is of length zero, the result is Double.NaN.

Parameters:
xs - Array of values.
Returns:
Variance of array of values.

standardDeviation

public static double standardDeviation(double[] xs)
Returns the standard deviation of the specified array of input values. The standard deviation is just the square root of the variance:
 standardDeviation(xs) = variance(xs)^(1/2)
If the array is of length zero, the result is Double.NaN.

Parameters:
xs - Array of values.
Returns:
Standard deviation of array of values.
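
The three definitions above translate directly into code; a minimal sketch (illustration only, not LingPipe's own implementation):

 static double mean(double[] xs) {
     double sum = 0.0;
     for (double x : xs)
         sum += x;
     return sum / xs.length;  // 0.0/0.0 = Double.NaN for an empty array
 }

 static double variance(double[] xs) {
     double m = mean(xs);
     double sumSqDiff = 0.0;
     for (double x : xs)
         sumSqDiff += (x - m) * (x - m);
     return sumSqDiff / xs.length;  // population variance, dividing by length
 }

 static double standardDeviation(double[] xs) {
     return Math.sqrt(variance(xs));  // square root of the variance
 }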

correlation

public static double correlation(double[] xs,
                                 double[] ys)
Returns (Pearson's) correlation coefficient between the two specified arrays. The correlation coefficient, traditionally notated as r², measures how well a best-fit linear function relates the two arrays of data. Rescaling either array by a positive constant will not affect the result.

The correlation coefficient r² is the proportion of the variance in the second array explained by a linear relation with the first array.

The definition of the correlation coefficient is:

 correlation(xs,ys)²
 = sumSqDiff(xs,ys)²
   / (sumSqDiff(xs,xs) * sumSqDiff(ys,ys))
where
 sumSqDiff(xs,ys)
 = Σ_{i < xs.length} (xs[i] - mean(xs)) * (ys[i] - mean(ys))
and thus the terms in the denominator reduce using:
 sumSqDiff(xs,xs)
 = Σ_{i < xs.length} (xs[i] - mean(xs))²


Parameters:
xs - First array of values.
ys - Second array of values.
Returns:
The correlation coefficient of the two arrays.
Throws:
IllegalArgumentException - If the arrays are not the same length.
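
A sketch of the computation following the definition above (illustration only; validation omitted):

 static double correlation(double[] xs, double[] ys) {
     double meanX = Statistics.mean(xs);
     double meanY = Statistics.mean(ys);
     double sumXY = 0.0, sumXX = 0.0, sumYY = 0.0;
     for (int i = 0; i < xs.length; ++i) {
         double dx = xs[i] - meanX;
         double dy = ys[i] - meanY;
         sumXY += dx * dy;  // sumSqDiff(xs,ys)
         sumXX += dx * dx;  // sumSqDiff(xs,xs)
         sumYY += dy * dy;  // sumSqDiff(ys,ys)
     }
     return sumXY / Math.sqrt(sumXX * sumYY);  // square root of the ratio defined above
 }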

sample

public static int sample(double[] cumulativeProbRatios,
                         Random random)
Returns a sample from the discrete distribution represented by the specified cumulative probability ratios, using the specified random number generator.

The cumulative probability ratios represent unnormalized probabilities of generating the value of their index or less, that is, unnormalized cumulative probabilities. For instance, consider the cumulative probability ratios { 0.5, 0.5, 3.9, 10.1}:

 Outcome   Value   Unnormalized Prob   Prob
 0         0.5     0.5                 0.5/10.1
 1         0.5     0.0                 0.0/10.1
 2         3.9     3.4                 3.4/10.1
 3         10.1    6.2                 6.2/10.1
A sample is taken by generating a random number x between 0.0 and the value of the last element (here 10.1). The value returned is the index i such that:
 cumulativeProbRatios[i-1] < x <= cumulativeProbRatios[i]
The corresponding probabilities given the sample input are listed in the last column in the table above.

Note that if two elements have the same value, there is no chance of generating the outcome with the higher index. In the example above, this corresponds to outcome 1 having probability 0.0.

Warning: The cumulative probability ratios are required to meet two conditions. First, all values must be finite and non-negative. Second, the values must be non-decreasing, so that cumulativeProbRatios[i] <= cumulativeProbRatios[i+1]. If either of these conditions is not met, the result is undefined.

Parameters:
cumulativeProbRatios - Cumulative probability for outcome less than or equal to index.
random - Random number generator for sampling.
Returns:
A sample from the specified distribution.
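
A linear-scan sketch of the sampling rule above (illustration only; a binary search would be faster for long arrays, and validation of the two conditions in the warning is omitted):

 static int sample(double[] cumulativeProbRatios, java.util.Random random) {
     double total = cumulativeProbRatios[cumulativeProbRatios.length - 1];
     double x = random.nextDouble() * total;  // uniform draw in [0, total)
     for (int i = 0; i < cumulativeProbRatios.length; ++i)
         if (x < cumulativeProbRatios[i])  // first interval containing x
             return i;
     return cumulativeProbRatios.length - 1;  // guard for round-off at the top end
 }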

dirichletLog2Prob

public static double dirichletLog2Prob(double alpha,
                                       double[] xs)
Returns the log (base 2) of the probability of the specified discrete distribution given the specified uniform Dirichlet with concentration parameters equal to the specified value.

See dirichletLog2Prob(double[],double[]) for more information on the Dirichlet distribution. This method returns a result equivalent to the following (though it is implemented more efficiently):

 double[] alphas = new double[xs.length];
 java.util.Arrays.fill(alphas,alpha);
 assert(dirichletLog2Prob(alpha,xs) == dirichletLog2Prob(alphas,xs));

For the uniform Dirichlet, the distribution simplifies to the following form:

 p(xs | Dirichlet(α)) = (1/Z(α)) Π_{i < k} xs[i]^(α-1)
where
 Z(α) = Γ(α)^k / Γ(k * α)

Warning: The probability distribution must be proper in the sense of having values between 0 and 1 inclusive and summing to 1.0. This property is not checked by this method.

Parameters:
alpha - Dirichlet concentration parameter to use for each dimension.
xs - The distribution whose probability is returned.
Returns:
The log (base 2) probability of the specified distribution in the uniform Dirichlet distribution with concentration parameters equal to alpha.
Throws:
IllegalArgumentException - If the values in the distribution are not between 0 and 1 inclusive or if the concentration parameter is not positive and finite.

dirichletLog2Prob

public static double dirichletLog2Prob(double[] alphas,
                                       double[] xs)
Returns the log (base 2) of the probability of the specified discrete distribution given the specified Dirichlet distribution concentration parameters. A Dirichlet distribution is a distribution over k-dimensional discrete distributions.

The Dirichlet is widely used because it is the conjugate prior for multinomials in a Bayesian setting, and thus may be used to specify a convenient distribution over discrete distributions.

A Dirichlet distribution is specified by a dimensionality k and a concentration parameter α[i] > 0 for each i < k. The probability of the distribution xs in a Dirichlet distribution of dimension k and concentration parameters α is given (up to a normalizing constant) by:

 p(xs | Dirichlet(α))
 ∝ Π_{i < k} xs[i]^(α[i]-1)
The full distribution is:
 p(xs | Dirichlet(α))
 = (1/Z(α)) * Π_{i < k} xs[i]^(α[i]-1)
where the normalizing constant is given by:
 Z(α) = Π_{i < k} Γ(α[i])
        / Γ(Σ_{i < k} α[i])

Warning: The probability distribution must be proper in the sense of having values between 0 and 1 inclusive and summing to 1.0. This property is not checked by this method.

Parameters:
alphas - The concentration parameters for the Dirichlet distribution.
xs - The outcome distribution.
Returns:
The log (base 2) probability of the outcome distribution given the Dirichlet concentration parameters.
Throws:
IllegalArgumentException - If the Dirichlet parameters and distribution are not arrays of the same length or if the distribution parameters in xs are not between 0 and 1 inclusive, or if any of the concentration parameters is not positive and finite.
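
A sketch of the computation in log space (illustration only; it assumes the log (base 2) gamma function com.aliasi.util.Math.log2Gamma(double) referenced in the klDivergenceDirichlet documentation above, and omits the argument checks):

 static double dirichletLog2Prob(double[] alphas, double[] xs) {
     double alphaSum = 0.0;
     double log2Prob = 0.0;
     for (int i = 0; i < alphas.length; ++i) {
         alphaSum += alphas[i];
         log2Prob -= com.aliasi.util.Math.log2Gamma(alphas[i]);  // -log2 Γ(α[i]) from 1/Z(α)
         log2Prob += (alphas[i] - 1.0) * (Math.log(xs[i]) / Math.log(2.0));  // log2 xs[i]^(α[i]-1)
     }
     log2Prob += com.aliasi.util.Math.log2Gamma(alphaSum);  // +log2 Γ(Σ α[i]) from 1/Z(α)
     return log2Prob;
 }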