com.aliasi.matrix
Class SvdMatrix
java.lang.Object
com.aliasi.matrix.AbstractMatrix
com.aliasi.matrix.SvdMatrix
- All Implemented Interfaces:
- Matrix
public class SvdMatrix
- extends AbstractMatrix
An SvdMatrix provides a means of storing a matrix that
has been factored via a singular-value decomposition (SVD). This
class also provides a static method for computing regularized
singular-value decompositions of partial matrices.
Singular Value Decomposition
Singular value decomposition (SVD) factors an
m×n matrix A into a product of
three matrices:
A = U * S * VT
where U is an m×k matrix,
V is an n×k matrix, and
S is a k×k matrix, where
k is the rank of the matrix A. The
multiplication (*) is matrix multiplication and
the superscripted T indicates matrix transposition.
The m-dimensional vectors making up the columns of
U are called left singular vectors, whereas
the n-dimesnional vectors making up the rows of
V are called right singular vectors. The values on
the diagonal of S are called the singular values.
The combination of the q-th left singular vector,
right singular vector, and singular value is known as a factor.
The singular value matrix S is a diagonal matrix
with positive, strictly non-increasing values, so that
S[i][i] >= S[i+1][i+1], for i < k.
The set of left and set of right singular vectors are orthonormal.
Normality means that each singular vector is of unit length (length
1). Orthogonality means that any pair of left singular
vectors is orthogonal and any pair of right singular vectors are
orthogonal (meaning their dot product is 0).
Matrices have unique singular-value decompositions
up to the re-ordering of columns with equal singular values and
up to cancelling sign changes in the singular vectors.
Construction and Value Computation
An SvdMatrix may be constructed out of the singular
vectors and singular values, or out of the vectors with singular
values scaled in.
Given that S is diagonal, the value of a particular
entry in A works out to:
A[i][j] = Σk U[i][k] * S[k][k] * V[j][k]
To save time in the application and space in the class,
we factor S into U and V
to produce a new pair of matrices U' and V'
defined by:
U' = U * sqrt(S)
V'T = sqrt(S) * VT
with the square-root performed component-wise:
sqrt(S)[i][j] = sqrt(S[i][j])
By the associativity of matrix multiplication, we have:
U * S * VT
= U * sqrt(S) * sqrt(S) * V
= U' * V'T
Thus the class implementation is able to store U'
and V', thus reducing the amount computation involved
in returning a value (using column vectors as the default vector
orientation):
A[i][j] = U'[i]T * V'[j]
Square Error and the Frobenius Norm
Suppose A and B are
m×n matrices. The square error between them is
defined by the so-called Frobenius norm, which extends the standard
vector L2 or Euclidean norm to matrices:
squareError(A,B)
= frobeniusNorm(A-B)
= Σi < m Σj < n (A[i][j] - B[i][j])2
The square error is sometimes referred to as the residual sum
of squares (RSS), because A[i][j] - B[i][j]
is the residual (difference between actual and predicted value).
Lower-order Approximations
Consider factoring a matrix A of dimension
m×n into the product of two matrices
X * YT, where X is of dimension
m×k and Y is of dimension
n×k. We may then measure the square
error squareError(A,X * Y) to determine how well
the factorization matches A. We know that if
we take U', V' from the singular value
decomposition that:
squareError(A, U' * V'T) = 0.0
Here U' and V' have a number of columns
(called factors) equal to the rank of the matrix A.
The singular value decomposition is such that the first
k columns of U' and V'
provide the best order q approximation of
A of any X and Y
of dimensions m×q and n×q
In symbols:
U'q, V'q = argmin X is m×q, Y is n×q squareError(A, X * YT)
where U'q is the restriction of U'
to its first q columns.
Often errors are reported as means, where the mean square error
(MSE) is defined by:
meanSquareError(A,B) = squareError(A,B)/(m×n)
To adjust to a linear scale, the square root of mean square
error (RMSE) is often used:
rootMeanSquareError(A,B) = sqrt(meanSquareError(A,B))
Partial Matrices
A partial matrix is one in which some of the values are unknown.
This is in contrast with a sparse matrix, in which most of the
values are zero. A variant of singular value decomposition may be
used to impute the unknown values by minimizing square error with
respect to the known values only. Unknown values are then simply
derived from the factorization U' * V'T.
Typically, the approximation is of lower order than the rank of
the matrix.
Regularized SVD via Shrinkage
Linear regression techniques such as SVD often overfit their
training data in order to derive exact answers. This problem
is mitigated somewhat by choosing low-order approximations to
the full-rank SVD. Another option is to penalize large values
in the singular vectors, thus favoring smaller values. The
most common way to do this because of its practicality is via
parameter shrinkage.
Shrinkage is a general technique in least squares fitting that
adds a penalty term proportional to the square of the size of
the parameters. Thus the square error objective function is
replaced with a regularized version:
regularizedError(A, U' * V')
= squareError(A, U' * V')
+ parameterCost(U') + parameterCost(V')
where the parameter costs for a vector X of
dimensionality q is the sum of the squared
parameters:
parameterCost(X)
= λ * Σi < q X[i]2
= λ * length(X)2
Note that the hyperparameter λ scales the
parameter cost term in the overall error.
Setting the regularization parameter to zero sets the parameter
costs to zero, resulting in an unregularized SVD computation.
In Bayesian terms, regularization is equivalent to a normal
prior on parameter values centered on zero with variance controlled
by λ. The resulting minimizer of regularized
error is the maximum a posteriori (MAP) solution. The Bayesian
approach also allows covariances to be estimated (including simple
parameter variance estimates), but these are not implemented in
this class.
Regularized Stochastic Gradient Descent
Singular value decomposition may be computed "exactly"
(modulo arithmetic precision and convergence) using an algorithm
whose time complexity is O(m3 +
n3) for an m×n matrix (equal
to O(max(m,n)3). Arithmetic precision is
especially vexing at singular values near zero and with highly
correlated rows or columns in the input matrix.
For large partial matrices, we use a form of stochastic gradient
descent which computes a single row and column singular vector (a
single factor, that is) at a time. Each factor is estimated by
iteratively visiting the data points and adjusting the unnormalized
singular vectors making up the current factor. Each adjustment is
a least-squares fitting step, where we simultaneously adjust the
left singular vectors given the right singular vectors and
vice-versa.
The least-squares adjustments take the following form. For each
value, we compute the current error (using the order k
approximation and the current order k+1 values) and
then move the vectors to reduce error. We use U' and
V' for the incremental values of the singular vectors
scaled by the singular values:
for (k = 0; k < maxOrder; ++k)
for (i < m) U'[i][k] = random.nextGaussian()*initValue;
for (j < n) V'[j][k] = random.nextGaussian()*initValue;
for (epoch = 0; epoch < maxEpochs && not converged; ++epoch)
for (i,j such that M[i][j] defined)
error = M[i][j] - U'k[i] * V'k[j] // * is vector dot product
uTemp = U'[i][k]
vTemp = V'[j][k]
U'[i][k] += learningRate[epoch] * ( error * vTemp - regularization * uTemp )
V'[j][k] += learningRate[epoch] * ( error * uTemp - regularization * vTemp )
where initValue, maxEpochs,
learningRate (see below), and
regularization are algorithm hyperparameters. Note
that the initial values of the singular vectors are set randomly to
the result of a draw from a Gaussian (normal) distribution with
mean 0.0 and variance 1.0.
Because we use the entire set of
factors in the error computation, the current factor is guaranteed
to have singular vectors orthogonal to the singular vectors already
computed.
Note that in the actual implementation, the contribution to the
error of the first k-1 factors is cached to reduce
time in the inner loop. This requires a double-length floating
point value for each defined entry in the matrix.
Gradient Interpretation
Like most linear learners, this algorithm merely moves the
parameter vectors U'[i] and U'[j] in the
direction of the gradient of the error. The gradient is the
multivariate derivative of the objective function being minimized.
Because our object is squared error, the gradient is just its
derivative, which is just (two times) the (linear) error itself. We
roll the factor of 2 into the learning rate to derive the update in
the algorithm pseudo-code above.
The term (error * vTemp) is the component of the
error gradient due to the current value of V'[i][k]
and the term (regularization * uTemp) is the component of the
gradient to the size of the parameter U'[i][k]. The
updates thus move the parameter vectors in the direction of
the gradient.
Convergence Conditions
The convergence conditions for a given factor require either
hitting the maximum number of allowable epochs, or finding the
improvement from one epoch to the next is below some relative
threshold:
regError(epoch) = regError(M,U'k(epoch) * V'k(epoch)T)
relativeImprovement(epoch+1) = relativeDiff(regError(epoch+1), regError(epoch))
relativeDiff(x,y) = abs(x-y)/(abs(x) + abs(y))
When the relative difference in square error is less than
a hyperparameter threshold minImprovement, the
epoch terminates and the algorithm moves on to the next
factor k+1.
Note that a complete matrix is a degenerate kind of partial
matrix. The gradient descent computation still works in this
situation, but is not as efficient or as accurate as an
algebraic SVD solver for small matrices.
Annealing Schedule
Learning rates that are too high are unstable, whereas learning rates
that are too low never reach their targets. To get around this
problem, the learning rate, learningRate[epoch], is
lowered as the number of epochs increase. This lowering of the
learning rate has a thermodynamic interpretation in terms of free
energy, hence the name "annealing". Larger moves are
made in earlier epochs, then the temperature is gradually lowered
so that the learner may settle into a stable fixed point. The
function learningRate[epoch] is called the annealing
schedule.
There are theoretical requirements on the annealing schedule
that guarantee convergence (up to arithmetic precision and no upper
bound on the number of epochs):
Σepoch learningRate[epoch] = infinity
Σepoch learningRate[epoch]2 < infinity
The schedule we use is the one commonly chosen to meet the
above requirements:
learningRate[epoch] = initialLearningRate / (1 + epoch/annealingRate)
where initialLearningRate is an initial learning rate and
annealingFactor determines how quickly it shrinks.
The learning rate moves from its initial size
(initialLearningRate) to one half (1/2) of its
original size after annealingRate epochs, and
moves from its initial size to one tenth (1/10) of
its initial size after 9 * annealingRate epochs,
and one hundredth of its initial size after
99 * annealingRate epochs..
Parameter Choices
The previous discussion has introduced a daunting list of
parameters required for gradient descent for singular value
decomposition. Unfortunately, the stochastic gradient descent
solver requires a fair amount of tuning to recover a low mean
square error factorization. The optimal settings will also depend
on the input matrix; for example, very sparse partial matrices are
much easier to fit than dense matrices.
Maximum Order
Determining the order of the decomposition is mostly a matter
of determining how many orders are needed for the amount of
smoothing required. Low order reconstructions are useful for
most applications. One way to determine maximum order is using
held out data for an application. Another is to look for
a point where the singular values become (relatively) insignificant.
Maximum Epochs and Early Stopping
For low mean square error factorizations, many epochs may be
necessary. Lowering the maximum number of epochs leads to what is
known as early stopping. Sometimes, early stopping leads to
more accurate held-out predictions, but it will always raise
the factorization error (training data predictions). In general,
the maximum number of epochs needs to be set empirically.
Initially, try fewer epochs and gradually raise the number of
epochs until the desired accuracy is achieved.
Minimum Improvement
By setting the minimum improvement to 0.0, the algorithm is forced
to use the maximum number of epochs. By setting it above this
level, a form of early stopping is achieved when the benefit of
another epoch of refitting falls below a given improvement
threshold. This value may also be set on an application basis
using held out data, or it may be fit to achieve a given level of
mean square error on the training (input) data. The minimum
improvement needs to be set fairly low (less than 0.000001) to
achieve reasonably precise factorizations. Note that minimum
improvement is defined relatively, as noted above.
Initial Parameter Values
Initial values of the singular vectors are not particularly
sensitive, because we are using multiplicative updates. A good
rule of thumb for starting values is the the inverse square root of
the maximum order:
initVal ~ 1 / sqrt(maxOrder)
Learning Rate, Maximum Epochs and Annealing
A good starting point for the learning rate is 0.01. The
annealing parameter should be set so that the learning rate is cut
by at least a factor of 10 in the final rounds. This calls for a
value that's roughly one tenth of the maximum number of epochs. If
the initial learning rate is set higher, then the annealing
schedule should be more agressive so that it spends a fair amount
of time in the 0.001 to 0.0001 learning rate range.
References
There are thorough Wikipedia articles on singular value decomposition
and gradient descent, although the SVD entry focuses entirely on
complete (non-partial) matrices:
Both of the standard machine learning textbooks have good
theoretical explanations of least squares, regularization,
gradient descent, and singular value decomposition, but not
all together:
- Chris Bishop. 2007. Pattern Recognition and Machine Learning. Springer.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2001.
The Elements of Statistical Learning. Springer.
The following is a particularly clear explanation of many of
the issues involved in the context of neural networks:
- Genevieve Orr, Nici Schraudolph, and Fred Cummins. 1999.
CS-449: Neural Networks. Willamette University course notes.
Our partial SVD solver is based on C code from Timely
Development (see license below). Timely based their code on Simon
Funk's blog entry. Simon Funk's approach was itself based on his
and Genevieve Gorrell's 2005 EACL paper.
Acknowledgements
The singular value decomposition code is rather loosely based on
a C program developed by Timely Development.
Here is the copyright notice for the original code:
// SVD Sample Code
//
// Copyright (C) 2007 Timely Development (www.timelydevelopment.com)
//
// Special thanks to Simon Funk and others from the Netflix Prize contest
// for providing pseudo-code and tuning hints.
//
// Feel free to use this code as you wish as long as you include
// these notices and attribution.
//
// Also, if you have alternative types of algorithms for accomplishing
// the same goal and would like to contribute, please share them as well :)
//
// STANDARD DISCLAIMER:
//
// - THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
// - OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
// - LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
// - FITNESS FOR A PARTICULAR PURPOSE.
- Since:
- LingPipe3.2
- Version:
- 4.0.0
- Author:
- Bob Carpenter
|
Constructor Summary |
SvdMatrix(double[][] rowSingularVectors,
double[][] columnSingularVectors,
double[] singularValues)
Construct an SVD matrix using the specified singular vectors
and singular values. |
SvdMatrix(double[][] rowVectors,
double[][] columnVectors,
int order)
Construct a factored matrix using the specified row and column
vectors at the specified order. |
|
Method Summary |
double[][] |
leftSingularVectors()
Returns a matrix in which the left singular vectors make up the
columns. |
int |
numColumns()
Returns the number of columns in this matrix. |
int |
numRows()
Returns the number of rows in this matrix. |
int |
order()
Returns the order of this factorization. |
static SvdMatrix |
partialSvd(int[][] columnIds,
double[][] values,
int maxOrder,
double featureInit,
double initialLearningRate,
double annealingRate,
double regularization,
Reporter reporter,
double minImprovement,
int minEpochs,
int maxEpochs)
Return the singular value decomposition of the specified
partial matrix, using the specified search parameters. |
double[][] |
rightSingularVectors()
Returns a matrix in which the right singular vectors make up
the columns. |
double |
singularValue(int order)
Returns the singular value for the specified order. |
double[] |
singularValues()
Returns the array of singular values for this decomposition. |
static SvdMatrix |
svd(double[][] values,
int maxOrder,
double featureInit,
double initialLearningRate,
double annealingRate,
double regularization,
Reporter reporter,
double minImprovement,
int minEpochs,
int maxEpochs)
Returns the signular value decomposition of the specified
complete matrix of values. |
double |
value(int row,
int column)
Returns the value of this matrix at the specified row and column. |
SvdMatrix
public SvdMatrix(double[][] rowVectors,
double[][] columnVectors,
int order)
- Construct a factored matrix using the specified row and column
vectors at the specified order. Each vector in the arrays of
row and column vectors must be of the same dimension as the
order.
See the class documentation above for more information
on singular value decomposition.
- Parameters:
rowVectors - Row vectors, indexed by row.columnVectors - Column vectors, indexed by column.order - Dimensionality of the row and column vectors.
SvdMatrix
public SvdMatrix(double[][] rowSingularVectors,
double[][] columnSingularVectors,
double[] singularValues)
- Construct an SVD matrix using the specified singular vectors
and singular values. The order of the factorization is equal to
the length of the singular values. Each singular vector must
be the same dimensionality as the array of singular values.
See the class documentation above for more information
on singular value decomposition.
- Parameters:
rowSingularVectors - Row singular vectors, indexed by row.columnSingularVectors - Column singular vectors, indexed by column.singularValues - Array of singular values, in decreasing order.
numRows
public int numRows()
- Returns the number of rows in this matrix.
- Specified by:
numRows in interface Matrix- Specified by:
numRows in class AbstractMatrix
- Returns:
- The number of rows in this matrix.
numColumns
public int numColumns()
- Returns the number of columns in this matrix.
- Specified by:
numColumns in interface Matrix- Specified by:
numColumns in class AbstractMatrix
- Returns:
- The number of columns in this matrix.
order
public int order()
- Returns the order of this factorization. The order
is the number of dimensions in the singular vectors
and the singular values.
- Returns:
- The order of this decomposition.
value
public double value(int row,
int column)
- Returns the value of this matrix at the specified row and column.
- Specified by:
value in interface Matrix- Specified by:
value in class AbstractMatrix
- Parameters:
row - Row index.column - Column index.
- Returns:
- The value of this matrix at the specified row and
column.
singularValues
public double[] singularValues()
- Returns the array of singular values for this decomposition.
- Returns:
- The singular values for this decomposition.
singularValue
public double singularValue(int order)
- Returns the singular value for the specified order.
- Parameters:
order - Dimension of the singular value.
- Returns:
- The singular value at the specified order.
leftSingularVectors
public double[][] leftSingularVectors()
- Returns a matrix in which the left singular vectors make up the
columns. The first index is for the row of the original matrix
and the second index is for the order of the singular vector.
Thus the returned matrix is
m×k, if the
original input was an m×n matrix and SVD was
performed at order k.
- Returns:
- The left singular vectors of this matrix.
rightSingularVectors
public double[][] rightSingularVectors()
- Returns a matrix in which the right singular vectors make up
the columns. The first index is for the column of the original
matrix and the second index is for the order of the singular
vector. Thus the returned matrix is
n×k, if
the original input was an m×n matrix and SVD
was performed at order k.
- Returns:
- The right singular vectors.
svd
public static SvdMatrix svd(double[][] values,
int maxOrder,
double featureInit,
double initialLearningRate,
double annealingRate,
double regularization,
Reporter reporter,
double minImprovement,
int minEpochs,
int maxEpochs)
- Returns the signular value decomposition of the specified
complete matrix of values.
For a full description of the arguments and values, see
the method documentation for
partialSvd(int[][],double[][],int,double,double,double,double,Reporter,double,int,int)
and the class documentation above.
The two-dimensional array of values must be an
m × n matrix. In particular, each row
must be of the same length. If this is not the case, an illegal
argument exception will be raised.
This is now a utility method that calls svd(double[][],int,double,double,double,double,Reporter,double,int,int)
with a reporter wrapping the specified print writer at the
debug level, or a silent print writer.
- Parameters:
values - Array of values.maxOrder - Maximum order of the decomposition.featureInit - Initial value for singular vectors.initialLearningRate - Incremental multiplier of error
determining how fast learning occurs.annealingRate - Rate at which annealing occurs; higher values
provide more gradual annealing.regularization - A regularization constant to damp learning.reporter - Reporter to which to send progress and error
reports.minImprovement - Minimum relative improvement in mean
square error required to finish an epoch.minEpochs - Minimum number of epochs for training.maxEpochs - Maximum number of epochs for training.
training, or null if no output is desired.
- Returns:
- Singular value decomposition for the specified partial matrix
at the specified order.
- Throws:
IllegalArgumentException - Under conditions listed in the
method documentation above.
partialSvd
public static SvdMatrix partialSvd(int[][] columnIds,
double[][] values,
int maxOrder,
double featureInit,
double initialLearningRate,
double annealingRate,
double regularization,
Reporter reporter,
double minImprovement,
int minEpochs,
int maxEpochs)
- Return the singular value decomposition of the specified
partial matrix, using the specified search parameters.
The writer parameter may be set to allow incremental progress
reports to that writer during training. These report on RMSE
per epoch.
See the class documentation above for a description of the
algorithm.
There are a number of constraints on the input, any
violation of which will raise an illegal argument exception.
The conditions are:
- The maximum order must be greater than zero.
- The minimum relative improvement in mean square error must be non-negative
and finite.
- The minimum number of epochs must be greater than zero
and less than or equal to the maximum number of epochs.
- The feature initialization value must be non-zero and finite.
- The learning rate must be positive and finite.
- The regularization parameter must be non-negative and finite.
- The column identitifer and value arrays must be the same
length.
- The elements of the column identifier array and the
value array must all be of the same length.
- All column identifiers must be non-negative.
- Each row of the column identifier matrix must contain
columns in strictly ascending order.
- Parameters:
columnIds - Identifiers of column index for given row and entry.values - Values at row and column index for given entry.maxOrder - Maximum order of the decomposition.featureInit - Initial value for singular vectors.initialLearningRate - Incremental multiplier of error determining how
fast learning occurs.annealingRate - Rate at which annealing occurs; higher values
provide more gradual annealing.regularization - A regularization constant to damp learning.reporter - Reporter to which progress reports are written, or
null if no reporting is required.minImprovement - Minimum relative improvement in mean square error required
to finish an epoch.minEpochs - Minimum number of epochs for training.maxEpochs - Maximum number of epochs for training.
- Returns:
- Singular value decomposition for the specified partial matrix
at the specified order.
- Throws:
IllegalArgumentException - Under conditions listed in the
method documentation above.