What is Singular Value Decomposition?
SVD as Factorization
Singular value decomposition (SVD) is a means of decomposing a a matrix into a product of three simpler matrices. In this way it is related to other matrix decompositions such as eigen decomposition, principal components analysis (PCA), and non-negative matrix factorization (NNMF).
SVD as Least Squares Approximation
Full rank SVD recreates the underlying matrix exactly, whereas lower-order SVD provides the best (in the least square error sense) approximation of a matrix by orthogonal row and column factors. These lower-order approximations to the larger matrix may uncover interesting relationships among the rows and/or columns of the underlying matrix.
SVD for Matrices with Unknown Values
LingPipe's stocastic gradient descent SVD solver is designed to accomodate a partial matrix as input. By partial, we mean a matrix only some of whose values are known. SVD may be used in this case to impute the value of the positions whose values are unknown. In so doing, it uses the least-squares property of singular value decomposition.
Applications of SVD
Singular value decomposition may be applied anywhere a least-squares fit, possibly with unknown values, is needed.
Latent Semantic Analysis
The original and most well known application of SVD in natural language processing has been for latent semantic analysis (LSA). LSA is an application of reduced-order SVD in which the rows of the input matrix represent words and the columns documents, with entries being the count of the words in the document. The singular vectors and correspdoning singular values produced by SVD allow words and documents to be mapped into the same "latent semantic space". The resulting embedding places similar words and documents as measured by co-occurrence near one another even if they never co-occurred in the training corpus.
Latent Semantic Indexing
It is straightforward to apply LSA's notion of term-document similarity to information retrieval, the resulting systems being known as latent semantic indexing (LSI). A query consisting of several terms just corresponds to the sum of their k-dimensional vector representations. The resulting k-dimensional query vector may be compared to the k-dimensional document vectors to determine similarity (typically using the cosine measure).
SVD for General Classification
Latent semantic indexing is really just a simple classification problem where each document is its own category, the training data for the category is the text extracted from the document, and where queries are just texts to classify.
Many researchers have applied SVD to other classification problems. The hope, in these cases, is for a degree of robustness arising from the lower-order representation of terms and documents. For instance, the main architect of LingPipe, Bob Carpenter, applied it with Jennifer Chu-Carroll to automatic routing of telphone calls (after a speech recognition phase), and more interestingly, to generating follow-up questions to resolve ambiguities for marginal queries. Follow-up questions were generated by looking for terms that could be added to the query to resolve it as one category or another.
SVD for Clustering and Features
Because of the simple vector representations of terms and documents produced by SVD, SVD (as well as PCA) has been widely used for clustering.
More generally, SVD has been applied to feature generation. Rather than taking raw features, or in addition to raw features, it is common to add the first few factors from the singular value decomposition of relevant inputs (such as words). These decompositions may be drawn from large data sets in the hopes of smoothing the representations with reasonable assumptions about their meanings.
SVD for Collaborative Filtering
Outside of strictly linguistic applications, SVD has been used for collaborative filtering, most notably in the context of the Netflix Prize competition. In this context, rows represent users and columns represent movies. The scores that exist are ratings provided by users for films (on a 1-5 integer scale). The contest involves guessing ratings for films a user hasn't rated; that is, imputing the unknown values.
Definition of SVD
We first consider the case where all values are known. When all values
are known,
singular value decomposition (SVD) factors an
m × n
matrix A
into a product of
three matrices:
A = U * S * V^{T}
where U
is an m × k
matrix,
V
is an n × k
matrix, and
S
is a k × k
matrix, where
k
is the rank of the matrix A
. The
multiplication (*
) is matrix multiplication and
the superscripted T
indicates matrix transposition.
Singular Vectors: An Orthonormal Basis
The m
-dimensional vectors making up the columns of
U
are called left singular vectors, whereas
the n
-dimesnional vectors making up the columns of
V
are called right singular vectors.
The set of left and set of right singular vectors are orthonormal
(that is, both orthogonal and normal). Normality means that each
singular vector is of unit length (length 1
). A set of
vectors is said to be orthogonal if any pair of vectors in the set is
orthogonal; recall that vectors are orthogonal if and only if their
product (equivalently cosine) is zero (0
). Orthogonality
that any pair of left singular vectors is The right singular vectors
are also orthonormal, being of unit length and pairwise orthogonal.
Singular Values: Factor Weighting
The values on the diagonal of S
are called the
singular values. The combination of
the q
-th left singular vector, right singular vector and
singular value is known as a factor. The singular values determine
the relative importance of the dimensions, and are arranged in
non-increasing order:
S[i][i] >= S[i+1][i+1] (for i < k)
Uniqueness
Matrices have unique singular-value decompositions up to the re-ordering of columns with equal singular values and up to offsetting sign changes in the singular vectors.
Implementing SVD Matrices
The upshot of the first equation A = U * S * V^{T}
is that a single value A[i][j]
in the matrix may be computed by the
dot product of the i
-th row vector and the j
-th column
vector, scaled by singular values:
A[i][j] = Σ_{i < k} U[i][k] * S[k][k] * V[j][k] = Σ_{i < k} U[i][k] * sqrt(S[k][k]) * sqrt(S[k][k]) * V[j][k] = Σ_{i < k} (U[i][k] * sqrt(S[k][k])) * (sqrt(S[k][k]) * V[j][k]) = (U[i] * sqrt(S_{diag})^{T}) * (V[j] * sqrt(S_{diag})^{T})^{T}
Calculating Values
In the implementation com.aliasi.matrix.SVDMatrix
of the matrix
interface com.aliasi.matrix.Matrix
, the scaled row and column
vectors are stored. For instance, here is the value of the i
th
row vector after scaling by the square root of the singular values:
U[i] * sqrt(S_{diag})^{T} = { U[i][0] * sqrt(S[0][0]), U[i][1] * sqrt(S[1][1]), ..., U[i][k] * sqrt(S[k][k]) }
Time and Space Requirements
Thus extracting a value from an SVD matrix of order k
requires
k
multiplications and additions.
The storage space required for an SVD representation of a matrix of
dimensionality m × n
of order k
is (n+m)*k
double values (of eight bytes each).
Latent Semantic Analysis and Indexing
Latent semantic analysis (LSA) is a straightforward application of singular value decomposition to term-document matrices. Latent semantic indexing (LSI) applies LSA to information retrieval. To illustrate LSI, we implement the example from Deerwester et al.'s original LSI paper (see the references below).
Running the Demo
To run the example, simply invoke the Ant target lsi
:
> cd $LINGPIPE/demos/tutorial/svd > ant lsi Buildfile: build.xml compile: [javac] Compiling 1 source file to C:\carp\mycvs\lingpipe\demos\tutorial\svd\build\classes lsi: [java] Computing SVD [java] maxFactors=2 [java] featureInit=0.01 [java] initialLearningRate=0.0050 [java] annealingRate=1000 [java] regularization0.0 [java] minImprovement=0.0 [java] minEpochs=10 [java] maxEpochs=50000 [java] [java] SCALES [java] 0 3.34 [java] 1 2.54 [java] [java] TERM VECTORS [java] ( 0.22, -0.11) human [java] ( 0.20, -0.07) interface ...
We'll explain all the output produced as we go through the code and a brief explanation of how latent semantic indexing works through SVD.
Term-Document Matrix
LSI begins with a term-document matrix. The following 9 documents are titles of technical reports. We have taken the liberty of renumbering them from 0 to 8.
Documents (Deerwester et al. (1990), Table 2, Part 1) | |
---|---|
Doc | Text |
0 | Human machine interface for Lab ABC computer applications |
1 | A survey of user opinion of computer system response time |
2 | The EPS user interface management system |
3 | System and human system engineering testing of EPS |
4 | Relation of user-perceived response time to error measurement |
5 | The generation of random, binary, unordered trees |
6 | The intersection graph of paths in trees |
7 | Graph minors IV: Widths of trees and well-quasi-ordering |
8 | Graph minors: A survey |
The terms are just the words occurring in more than one document, with closed-class words like "and" and "to" removed. The words are normalized for case by downcasing, but not otherwise stemmed or filtered. The terms may be the result of an arbitrary tokenization, including stemming, stoplisting, case normalization, and even character n-grams.
Term-Document Matrix (Deerwester et al., Table 2, Part 2) | |||||||||
---|---|---|---|---|---|---|---|---|---|
Term | Document | ||||||||
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
human | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
interface | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
computer | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
user | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
system | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 |
response | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
time | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
EPS | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
survey | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
trees | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
graph | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
minors | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
The count of a term in a document here is just the raw frequency count. In applications of SVD, these counts are often weighted using inverse document frequency and dampened using square roots or logs.
LSI Code
The code for our LSI demo can be found in the file
src/Lsi.java
. Because we
are only concerned with creating the single example, it is quite
compact.
The code starts by defining the term-document matrix and the array of terms and documents themselves:
static final String[] TERMS = new String[] { "human", "interface", "computer", "user", "system", "response", "time", "EPS", "survey", "trees", "graph", "minors" };
static final String[] DOCS = new String[] { "Human machine interface for Lab ABC computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "System and human system engineering testing of EPS", "Relation of user-perceived response time to error measurement", "The generation of random, binary, unordered trees", "The intersection graph of paths in trees", "Graph minors IV: Widths of trees and well-quasi-ordering", "Graph minors: A survey" };
static final double[][] TERM_DOCUMENT_MATRIX = new double[][] { {1, 0, 0, 1, 0, 0, 0, 0, 0 }, {1, 0, 1, 0, 0, 0, 0, 0, 0 }, {1, 1, 0, 0, 0, 0, 0, 0, 0 }, {0, 1, 1, 0, 1, 0, 0, 0, 0 }, {0, 1, 1, 2, 0, 0, 0, 0, 0 }, {0, 1, 0, 0, 1, 0, 0, 0, 0 }, {0, 1, 0, 0, 1, 0, 0, 0, 0 }, {0, 0, 1, 1, 0, 0, 0, 0, 0 }, {0, 1, 0, 0, 0, 0, 0, 0, 1 }, {0, 0, 0, 0, 0, 1, 1, 1, 0 }, {0, 0, 0, 0, 0, 0, 1, 1, 1 }, {0, 0, 0, 0, 0, 0, 0, 1, 1 } };
There is a row for each term, with the columns indexed by the document ID (0 to 8). The terms and term-document matrix would obviously have to be computed from the documents in a general application. Here, we have just copied Deerwester et al.'s matrix into a data structure by hand. We only use the term array and document array for display purposes.
We compute the singular value decomposition of the matrix
using the svd()
method of SvdMatrix
,
with the number of factors (latent semantic dimensions) set
in a constant (to 2, in this case, following Deerwester
et al.'s example):
static final int NUM_FACTORS = 2; public static void main(String[] args) throws Exception { maxFactors = 2; double featureInit = 0.01; double initialLearningRate = 0.005; int annealingRate = 1000; double regularization = 0.00; double minImprovement = 0.0000; int minEpochs = 10; int maxEpochs = 50000; SvdMatrix matrix = SvdMatrix.svd(TERM_DOCUMENT_MATRIX, maxFactors, featureInit, initialLearningRate, annealingRate, regularization, null, minImprovement, minEpochs, maxEpochs); ...
The final argument is null
, which turns
off feedback from the SVD solver during the solution process.
We discuss what it outputs and how to tune SVD in the
next sections on complete and partial SVD.
We print the output vectors for the two-dimensional SVD we have computed. The singular values, left singular vectors and right singular values are extracted from the SVD matrix:
double[] scales = matrix.singularValues(); double[][] termVectors = matrix.leftSingularVectors(); double[][] docVectors = matrix.rightSingularVectors();
We do not repeat the code here to print out these arrays, but here's what the demo program prints:
SCALES 0 3.34 1 2.54 TERM VECTORS ( 0.22, -0.11) human ( 0.20, -0.07) interface ( 0.24, 0.04) computer ( 0.40, 0.06) user ( 0.64, -0.17) system ( 0.27, 0.10) response ( 0.27, 0.10) time ( 0.30, -0.14) EPS ( 0.21, 0.27) survey ( 0.01, 0.49) trees ( 0.04, 0.62) graph ( 0.03, 0.45) minors DOC VECTORS ( 0.20, -0.06) Human machine interface for Lab ABC computer applications ( 0.61, 0.16) A survey of user opinion of computer system response time ( 0.46, -0.13) The EPS user interface management system ( 0.54, -0.23) System and human system engineering testing of EPS ( 0.28, 0.10) Relation of user-perceived response time to error measurement ( 0.00, 0.19) The generation of random, binary, unordered trees ( 0.01, 0.44) The intersection graph of paths in trees ( 0.02, 0.62) Graph minors IV: Widths of trees and well-quasi-ordering ( 0.08, 0.53) Graph minors: A survey
These values match those reported in the appendix (page 26) of Deerwester et al. (2000). This is the complete latent semantic indexing of the specified document collection.
Note that this representation makes clear how the terms and documents are cast into the same k-dimensional space for comparison. This is often called the latent semantic space. For comparison, the dimensions are scaled by their corresponding singular values. We could, for instance, use these vectors for clustering terms, clustering documents or relating terms to documents.
Next we consider the search component of latent semantic indexing. Queries are computed by taking the centroid of the term vectors corresponding to the terms in the query. For instance, Deerwester et al. use the query "human computer interaction" for illustration, and we repeat their example here. Only the terms "human" and "computer" show up. The centroid's computed pointwise, by adding the values in each dimension:
human computer (0.22, -0.11) + (0.24, 0.04) = (0.46, -0.07)
This is then matched against the document vectors using the scaling provided by the singular values, with the following result:
Query=[human, computer, interaction] Query Vector=( 0.46, -0.07 ) DOCUMENT SCORES VS. QUERY 0: 0.31 Human machine interface for Lab ABC computer applications 1: 0.91 A survey of user opinion of computer system response time 2: 0.74 The EPS user interface management system 3: 0.88 System and human system engineering testing of EPS 4: 0.41 Relation of user-perceived response time to error measurement 5: -0.03 The generation of random, binary, unordered trees 6: -0.05 The intersection graph of paths in trees 7: -0.07 Graph minors IV: Widths of trees and well-quasi-ordering 8: 0.03 Graph minors: A survey
As noted by Deerwester et al., document 2 and document 4, neither of which contain any of the search terms, score relatively high. This is b ecause of second-order relations between terms, such as the term "user" showing up in the related documents. If we were to match the query vector to terms, we get the following related query terms:
TERM SCORES VS. QUERY 0: 0.36 human 1: 0.32 interface 2: 0.36 computer 3: 0.61 user 4: 1.02 system 5: 0.39 response 6: 0.39 time 7: 0.49 EPS 8: 0.27 survey 9: -0.07 trees 10: -0.05 graph 11: -0.03 minors
Note that the query "human computer interaction" supplies terms "human" and "computer" which are both judged reasonably close to the query itself, which is the centroid of the vectors for "human" and "interface". More interestingly, terms such as "interface" and "system" are also close to the query in latent semantic space.
Given the indexing (the singular value decomposition, that is), it
is straightforward to compute these scorings. Here is the code
from src/Lsi.java
.
public static void main(String[] args) throws Exception { ... SvdMatrix matrix = SvdMatrix.svd(TERM_DOCUMENT_MATRIX, ... double[] scales = matrix.singularValues(); double[][] termVectors = matrix.leftSingularVectors(); double[][] docVectors = matrix.rightSingularVectors(); ... for (int m = 0; m < args.length; ++m) search(scales,termVectors,docVectors,args[m]);
We just iterate through the arguments and search relative to the indexing. Here's the search method:
static void search(double[] scales, double[][] termVectors, double[][] docVectors, String arg) { String[] terms = arg.split(" |,"); // space or comma separated double[] queryVector = new double[NUM_FACTORS]; Arrays.fill(queryVector,0.0); for (String term : terms) addTermVector(term,termVectors,queryVector); ...
The code first finds the terms by splitting on spaces or commas. It then initializes the query vector to be the number of dimensions in the latent semantic space, that is, the number of factors in the SVD. It then iterates over the terms and adds their term vectors to the query vector. The addition's done by brute force, looking for the term in the list of terms and then adding its vector if found. If the term isn't found in the array of terms, it's ignored.
static void addTermVector(String term, double[][] termVectors, double[] queryVector) { for (int i = 0; i < TERMS.length; ++i) { if (TERMS[i].equals(term)) { for (int j = 0; j < NUM_FACTORS; ++j) { queryVector[j] += termVectors[i][j]; } return; } } }
Finally, the query vector is printed (we do not repeat that code) and then scored against the document and term vectors:
System.out.println("DOCUMENT SCORES VS. QUERY"); for (int j = 0; j < docVectors.length; ++j) { double score = dotProduct(queryVector,docVectors[j],scales); System.out.printf(" %d: % 5.2f %s\n",j,score,DOCS[j]); } System.out.println("TERM SCORES VS. QUERY"); for (int i = 0; i < termVectors.length; ++i) { double score = dotProduct(queryVector,termVectors[i],scales); System.out.printf(" %d: % 5.2f %s\n",i,score,TERMS[i]); }
Dot product is computed relative to the scales as follows:
static double dotProduct(double[] xs, double[] ys, double[] scales) { double sum = 0.0; for (int k = 0; k < xs.length; ++k) sum += xs[k] * ys[k] * scales[k]; return sum; }
It's common to replace the dot product measure of similarity with a cosine ordering. That would look like this:
static double cosine(double[] xs, double[] ys, double[] scales) { double product = 0.0; double xsLengthSquared = 0.0; double ysLengthSquared = 0.0; for (int k = 0; k < xs.length; ++k) { double sqrtScale = Math.sqrt(scales[k]); double scaledXs = sqrtScale * xs[k]; double scaledYs = sqrtScale * ys[k]; xsLengthSquared += scaledXs * scaledXs; ysLengthSquared += scaledYs * scaledYs; product += scaledXs * scaledYs; } return product / Math.sqrt(xsLengthSquared * ysLengthSquared); }
Cosine-based scores are as follows (with dot-product scores reproduced in parens for convenience):
DOCUMENT SCORES VS. QUERY 0: 0.99 Human machine interface for Lab ABC computer applications ( 0.31) 1: 0.94 A survey of user opinion of computer system response time ( 0.91) 2: 0.99 The EPS user interface management system ( 0.74) 3: 0.98 System and human system engineering testing of EPS ( 0.88) 4: 0.90 Relation of user-perceived response time to error measurement ( 0.41) 5: -0.11 The generation of random, binary, unordered trees (-0.03) 6: -0.09 The intersection graph of paths in trees (-0.05) 7: -0.09 Graph minors IV: Widths of trees and well-quasi-ordering (-0.07) 8: 0.05 Graph minors: A survey ( 0.03) TERM SCORES VS. QUERY 0: 0.96 human ( 0.36) 1: 0.98 interface ( 0.32) 2: 0.96 computer ( 0.36) 3: 0.97 user ( 0.61) 4: 1.00 system ( 1.02) 5: 0.90 response ( 0.39) 6: 0.90 time ( 0.39) 7: 0.97 EPS ( 0.49) 8: 0.55 survey ( 0.27) 9: -0.10 trees (-0.07) 10: -0.06 graph (-0.05) 11: -0.05 minors (-0.03)
Note that cosine changes the actual ranking of results as well as the relative scores. Euclidean distance or other distance measures would be a third way to measure. Dot product is the most natural for the underlying linear model implied by the use of singular value decomposition.
And that's it. A complete demo implementation of latent semantic indexing in a few lines of code.
Complete Matrix SVD
The API is set up to allow easy calculation of SVDs for a complete matrix, that is, one where every value is known. There is a single method call, but the method requires a slew of fairly sensitive arguments, the effects of which we will endeavour to illustrate in this section. Before doing that, let's get an example up and running.
Complete SVD for Token Bigrams
For our first exercise, we will simply look at bigram counts in textual data. Bigram counts are asymmetric; for example, "of the" occurs much more frequently than "the of" in English.
The Code
The code may be found in the standalone
demo code file src/TokenBigramSvd.java
.
Downloading Pride and Prejudice
Before we can run, we need text data. We chose to analyze Jane Austen's early 19th century English novel, Pride and Prejudice. It may be downloaded from Project Gutenberg from:
- Overview: Pride and Prejudice
- Text: Pride and Prejudice [700KB ASCII]
After downloading, place the text in a file, which we'll call
$TEXT_FILE
. We suggest editing out the top of the file,
which is the Project Gutenberg header. It always helps to start with clean data.
You may use texts other than Pride and Prejudice, of course.
Running the Demo
Running from Ant
The demo may be run directly from ant, with the path to the text data file supplied on the command line:
> cd $LINGPIPE/demos/tutorial/svd > ant -Dtext.file=e:\data\gutenberg\dist\JaneAusten\PrideAndPrejudice.txt tokenBigram ...
Parameter Report
The first thing the program does is report all of its parameters. Other than the file name, these are hardcoded into the program and reported here for convenience.
... TokenBigramSVD Extracting Bigrams File=E:\data\gutenberg\dist\JaneAusten\PrideAndPrej tokenizerFactory.getClass()=class com.aliasi.tokeni input charset=ASCII Number of tokens=146592 Number of distinct tokens=6628 #Matrix entries=43930384 Computing SVD Max factors=3 featureInit=0.1 initialLearningRate=0.0010 annealingRate=10 regularization0.0 minImprovement=0.0 minEpochs=10 maxEpochs=200 output charset=ASCII ...
Factor- and Epoch-Level Report
After the report, the iterative SVD solver comes into
play. The factors are computed one at a time. For each
factor, the entire matrix is visited several times. Each
loop through the data is referred to as an "epoch".
The number of epochs is bounded below and above by parameters
passed to SVD. The SVD method may be enabled for online
reporting, which we have done here. Reports from the SVD
solver are preceded by "partialSvd|
".
It reports when each factor is started and root mean square
error for the current factorization (involving the current
factor and all previous factors).
... partialSvd| Start partialSvd| Factor=0 partialSvd| epoch=0 rmse=0.2307442009625624 partialSvd| epoch=1 rmse=0.2307009155564279 partialSvd| epoch=2 rmse=0.2306634757907735 partialSvd| epoch=3 rmse=0.2305997643148771 partialSvd| epoch=4 rmse=0.23037212972010507 partialSvd| epoch=5 rmse=0.2293394109578155 partialSvd| epoch=6 rmse=0.22494535998882884 partialSvd| epoch=7 rmse=0.21186634823241973 partialSvd| epoch=8 rmse=0.19391353624907023 partialSvd| epoch=9 rmse=0.18656738572758658 partialSvd| epoch=10 rmse=0.18585605225301427 ... partialSvd| epoch=197 rmse=0.1830182265560673 partialSvd| epoch=198 rmse=0.1830134851434906 partialSvd| epoch=199 rmse=0.18300877442023536 partialSvd| Factor=1 partialSvd| epoch=0 rmse=0.18187860408677023 partialSvd| epoch=1 rmse=0.1818195893457582 partialSvd| epoch=2 rmse=0.18177486371581822 ... partialSvd| epoch=197 rmse=0.16628383740250385 partialSvd| epoch=198 rmse=0.16627595545058857 partialSvd| epoch=199 rmse=0.16626816248929846 partialSvd| Factor=2 partialSvd| epoch=0 rmse=0.1653360181791556 partialSvd| epoch=1 rmse=0.16527773279071495 partialSvd| epoch=2 rmse=0.16523608990094368 ... partialSvd| epoch=197 rmse=0.1501681463210289 partialSvd| epoch=198 rmse=0.1501662935811939 partialSvd| epoch=199 rmse=0.15016445305391138
Note that as the epochs progress, error is gradually reduced. The algorithm is actually guaranteed to find the unique solution (up to ordering, signs and arithmetic precision) if allowed enough epochs with an appropriate learning rate (more on these tuning parameters below). We have specified a maximum of 200 epochs, at which point RMSE is only decreasing in the 5th or 6th decimal place. We see that with one factor, the original matrix was reconstructed with RMSE of 0.192, with two factors, RMSE is 0.175 and with three factors, 0.158. With as many factors as the minimum dimension of the input matrix, RMSE will converge to zero (we show this with some small artificial examples below).
Results: Factors
After computing the SVD, our demo program reports on the singular values and vectors it has found. Following Gorrell's paper, we report singular values along with the largest and smallest values in the left and right singular vectors. These are labeled by the token. Here's the report for the first factor:
... ORDER=0 singular value=992.3642458892039 Extreme Left Values 0 of 0.632 1 to 0.417 2 in 0.386 3 and 0.243 4 for 0.198 5 with 0.166 6 on 0.150 7 that 0.139 8 by 0.133 9 at 0.116 ... 7026 DEAR -0.000 7027 Michael -0.000 7028 EVEN -0.000 Extreme Right Values 0 the 0.748 1 her 0.372 2 his 0.267 3 a 0.211 4 be 0.209 5 it 0.115 ... 7027 NIECE -0.000 7028 However -0.000 ...
The first singular value is 985, which is very high for this corpus, where average counts are quite low. This says that the first factor provides high counts to bigrams with first words "of", "to", "in", and "and", and with second words like "the", "her", and "be".
As the report below shows,
"of the" is the most common bigram in Jane Austen's
text, with a count of only 461. If we used only the first
factor to predict, we would predict "of the" to
have a count of 0.614 * 0.742 * 985.0 = 448
,
which is pretty close to the empirical corpus count of 461.
This is not surprising, as the first factor will always account
for the largest patterns in the data. It further predicts
that "to the" has a high corpus count, but
only 0.441/0.614 as high as "of the". Similarly
"of her" is predicted to have a count that's
only 0.373/0.742 as high as "of the".
Subsequent factors are required to be orthogonal to the first dimension. They are also required to have singular values that are equally sized or smaller than previous factors. Here's the second factor report:
... ORDER=1 singular value=525.2432789690466 Extreme Left Values 0 of 0.281 1 in 0.191 2 for 0.087 3 that 0.069 4 on 0.068 5 and 0.067 ... 7025 not -0.185 7026 would -0.208 7027 I -0.358 7028 to -0.701 Extreme Right Values 0 the 0.160 1 a 0.082 2 his 0.061 3 it 0.053 4 she 0.042 5 I 0.040 ... 7025 do -0.149 7026 am -0.196 7027 have -0.328 7028 be -0.766 ...
Note that it not only adds positive counts for good bigrams like "she had", but positive counts bad bigrams like "she be". Remember, this is just an approximation, and it's being driven by least squares, not linguistic theory.
What's interesting to note is that there are large negative dimensions in this factor. But remember the signs. These values will reduce estimates for bigrams like "I her" and "for was", but will actually increase counts in cases where both values are negative, such as "in his".
Here's the third factor (remember we count from zero):
... [java] ORDER=2 singular value=486.5512054169151 [java] Extreme Left Values [java] 0 I 0.471 [java] 1 she 0.430 [java] 2 he 0.336 [java] 3 it 0.312 [java] 4 and 0.233 [java] ... [java] 7025 of -0.028 [java] 7026 must -0.038 [java] 7027 will -0.039 [java] 7028 to -0.304 [java] [java] Extreme Right Values [java] 0 was 0.518 [java] 1 had 0.391 [java] 2 am 0.283 [java] 3 is 0.209 [java] 4 could 0.199 [java] ... [java] 7025 the -0.042 [java] 7026 make -0.053 [java] 7027 her -0.055 [java] 7028 be -0.302 ...
The third singular value is also high, at 488.6, which means the factor is still quite significant. As the singular values decrease, so does the effect of the dimension. In factor 2, we see left tokens making up subject nouns like "he" or "she" (and "that" used as a subject relative), and right tokens making up auxiliaries and object nouns, which is rather difficult to interpret. Also note the high negative values, which accounts for the special behavior of auxiliaries like "be". Here we remove the counts for "she be", correcting an estimation error incurred by the first two factors alone.
Results: Reconstruction
Finally, with just the three factors, here is how well the original values are reconstructed for the token pairs with the highest counts in the training data:
... RECONSTRUCTED TOP COUNTS LeftToken,RightToken OriginalValue SvdValue of,the 475.0 493.20816651434825 to,be 437.0 413.3925446237487 in,the 372.0 302.55408105368076 I,am 302.0 101.95034167348793 of,her 262.0 234.09037956436723 to,the 253.0 256.92410882744946 of,his 233.0 176.70164636168101 had,been 198.0 2.177427219085274 I,have 187.0 95.79751755804523 to,her 173.0 161.8049819111378 she,had 168.0 86.28589374732717 that,he 168.0 20.290267706751138 could,not 167.0 9.925917868636159 and,the 162.0 181.341092484635 it,was 161.0 82.93542413282482 she,was 156.0 114.16278862013016 for,the 155.0 153.97709624041548 he,had 145.0 67.52377567671842 on,the 138.0 117.40751622553063 such,a 138.0 6.3652589510111826 have,been 135.0 1.6367507551545604 did,not 131.0 2.981038158810108 he,was 126.0 89.33272971786619 that,she 124.0 25.93488437393153 in,a 123.0 88.78029570435424
Note that some bigrams are reconstructed accurately, whereas others are way off. Remember that square error is the metric. For instance, the original count of "to be" is 436, the reconstructed count is 393, so the difference is 436-393=43, for a contribution to square error of 43*43=1849 (recall this is divided by the number of positions in the matrix to determine mean square error, the square root of which is reported as RMSE in the epoch reports).
Comparison with Gorrell's Results
Gorrell started with a matrix of bigram counts just as we did. But she normalized the vectors to a probability scale. Here's her report of the top dimensions of the first two factors:
Gorrell EACL 2005 (Gone with the Wind) Left Values Right Values Factor 0 of 0.5125468 the 0.8755944 in 0.49723375 her 0.28781646 and 0.39370865 a 0.23318098 to 0.2748983 his 0.14336193 on 0.21759394 she 0.1128443 at 0.17932475 it 0.06529821 for 0.16905183 he 0.063333265 with 0.16042696 you 0.058997907 from 0.13463423 their 0.05517004 Factor 1 she 0.6633538 was 0.58067155 he 0.38005337 had 0.50169927 it 0.30800354 could 0.2315106 and 0.18958427 would 0.17589279
Note that these bear some similarity to our results, which are repeated here in a similar form:
This Demo (Pride and Prejudice) Factor 0 of 0.632 the 0.748 to 0.417 her 0.372 in 0.386 his 0.267 and 0.243 a 0.211 for 0.198 be 0.209 with 0.166 it 0.115 on 0.150 my 0.099 that 0.139 their 0.092 by 0.133 Mr 0.088 at 0.116 all 0.086 Factor 1 I 0.471 was 0.518 she 0.430 be 0.391 he 0.336 am 0.283 it 0.312 is 0.209 and 0.233 could 0.199
This is close enough that we may assume that it's the text source leading to the difference. Of course, this illustrates the variability of texts; if Pride and Prejudice had the same bigram counts as Gone with the Wind, the results would've been identical.
Inspecting the Code
The code for generating the SVD is quite simple. Most of
the work's in I/O as there's only one method call for SVD. Here's
the top-level command implementation from
src/TokenBigramSvd.java
:
public static void main(String[] args) throws Exception { File textFile = new File(args[0]); MapSymbolTable symbolTable = new MapSymbolTable(); TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.FACTORY; String charset = "ASCII"; double[][] values = extractBigrams(textFile,symbolTable,tokenizerFactory,charset); int maxFactors = 3; double featureInit = 0.1; double initialLearningRate = 0.001; int annealingRate = 10; double regularization = 0.00; double minImprovement = 0.000000; int minEpochs = 10; int maxEpochs = 200; PrintWriter verbosePrintWriter = new PrintWriter(new OutputStreamWriter(System.out,charset)); Reporter reporter = Reporters.writer(verbosePrintWriter).setLevel(LogLevel.DEBUG); SvdMatrix matrix = SvdMatrix.svd(values, maxFactors, featureInit, initialLearningRate, annealingRate, regularization, reporter, minImprovement, minEpochs, maxEpochs); reportSvd(values,matrix,symbolTable); }
The first batch of code here simply extracts a matrix of bigram
counts from the specified text file and assigns it to the variable
values
. This requires a tokenizer factory. As it works,
it the bigram extractor populates a symbol table mapping the tokens to
the dimensions of the matrix; this is used later for output reporting,
but otherwise does not come into play in the SVD computation.
The second group of statements simply sets all the parameters for the run. We will explain these in detail shortly.
Next is the actual SVD computation, which is handled by a call
to the static method svd()
, which is found in the
com.aliasi.matrix.SvdMatrix
class. The result is
an instance of the class SvdMatrix
. As we mentioned
earlier, the SvdMatrix
class implements the
Matrix
interface by storing the scaled singular vectors
and using them to compute values at run time.
Note that we have created an instance of com.aliasi.io.Reporter
using a static factory method in com.aliasi.io.Reporters
,
and then set its log level (com.aliasi.io.LogLevel
) to
the enum value DEBUG
. This is passed in to the
SVD process to provide feedback on the progress of the optimizer.
The argument may be null
for no reporting, or reports
can be sent to files, standard output, etc.
The final piece of the code reports on the resulting SVD matrix using the original array of values and symbol table.
static void reportSvd(double[][] values, SvdMatrix matrix, MapSymbolTable symbolTable) { double[] singularValues = matrix.singularValues(); double[][] leftSingularVectors = matrix.leftSingularVectors(); double[][] rightSingularVectors = matrix.rightSingularVectors(); for (int order = 0; order < singularValues.length; ++order) { System.out.println("\n\nFACTOR=" + order + " singular value=" + singularValues[order]); System.out.println("Extreme Left Values"); extremeValues(leftSingularVectors,order,symbolTable); System.out.println("\nExtreme Right Values"); extremeValues(rightSingularVectors,order,symbolTable); } ...
The first thing the report method does is pull out the vector of
singular values and matrices of singular vectors. We then just walk
through the factors (the dimensionality of all three arrays) and
report on the singular value and extreme values. The extreme values
computation is straightforward. It uses our convenience class
util.ObjectToDoubleMap
to create a map from symbol to
counts, and then simply reports on the top values. Note that the
symbol table is required to map the dimension i
back
to a symbol token
.
static void extremeValues(double[][] values, int order, MapSymbolTable symbolTable) { ObjectToDoubleMap<String> topVals = new ObjectToDoubleMap<String>(); for (int i = 0; i < values.length; ++i) { String token = symbolTable.idToSymbol(i); topVals.set(token,values[i][order]); } List<String> tokensByValue = topVals.keysOrderedByValueList(); int size = tokensByValue.size(); for (int i = 0; i < 10 && i < size; ++i) { String token = tokensByValue.get(i); double value = topVals.getValue(token); System.out.printf(" %6d %-15s % 5.3f\n",i,token,value); } System.out.println("..."); for (int i = 10; --i >= 0; ) { String token = tokensByValue.get(size-i-1); double value = topVals.getValue(token); System.out.printf(" %6d %-15s % 5.3f\n",size-i-1,token,value); } }
Tuning SVD
In this section, we will cover all the parameters required for
singular value decomposition. These are described in technical detail
in the class javadoc, matrix.SvdMatrix
.
Total SVD Parameters | ||
---|---|---|
Parameter | Demo Value | Description |
values |
(counts from Pride & Prejudice) | Input matrix. |
maxOrder |
3 |
Maximum number of factors produced. |
featureInit |
0.1 |
Initial value for singular vector dimensions. Good values lead to lower initial RMSE in each epoch. Require good value to converge. |
initialLearningRate |
0.001 |
How much to change singular vectors/values away from errors. Determines whether and how quickly the algorithm terminates. |
annealingRate |
10 |
How quickly the learning rate is lowered (annealed). |
regularization |
0.0 |
How much penalty is assessed for singular value size. Setting higher than zero essentially smoothes and allows better convergence properties, but results in non-zero error at convergence. |
minImprovement |
0.0001 |
Relative reduction in RMSE per epoch required to continue to next epoch. |
minEpochs |
10 |
Minimum number of training epochs per factor. |
maxEpochs |
200 |
Maximum number of training epochs per factor. Must be set high enough to converge as much as required. |
verbosePrintWriter |
new PrintWriter( new OutputStreamWriter( System.out, charset)) |
Print writer for verbose output on factors and epochs from SVD method. |
Utility Features
Maximum Order
Specifying the maximum number of factors allows SVDs with relatively small numbers of factors (compared to the rank of the matrix) to be computed without computing the entire SVD. This should simply be determined by how close the approximation is. Using fewer factors leads to more smoothing, that is a less exact approximation, of the underlying matrix. For large-scale NLP problems, such as latent semantic indexing, typical values are around 100.
PrintWriter
A print writer may be supplied or it may be left null. If it's supplied, the SVD algorithm reports each factor and each epoch back to the printer.
Regularization
The regularization parameter places a penalty on the size of singular values. Directly, it places a penalty on the magnitude of the incremental singular vector dimensions, which has the result of reducing singular values in the end.
The regularization parameter is important for two reasons. The first reason is that it makes ill-conditioned problems well-behaved. An ill-conditioned SVD problem arises when two dimensions of the input matrix are highly correlated (in the worst case, linear multiple of each other, for a correlation of 1.0). Ill-conditioned matrices cause uncertainty in numerical solvers as there is little constraint on the magnitudes of the dimensions.
The second reason regularization is important is to reduce overfitting. Without regularization, the solver may work very hard to fit quirky facts of the data. With regularization, that becomes expensive, as it requires large values in singular vectors.
Learning and Convergence Features
Feature initialization, learning rates, and annealing rate primarily affect the convergence of the algorithm.
Feature Initialization
The initial value of all positions in the singular vectors are given by the feature initialization value. The value should be set so that there is a finite RMSE, and ideally so that the RMSE after training epochs is as low as possible.
Initial Learning Rate
The initial learning rate determines how quickly the singular vectors move to reduce their error. If the rate is too low, it will take too many epochs to move in the right direction. If the rate is too high, the algorithm will overshoot the target and oscillate rather than converging.
Let's consider some examples. If we set the learning rate down to 0.01, we cannot even get off the ground:
... initialLearningRate=0.01 ... partialSvd| Factor=0 partialSvd| epoch=0 rmse=NaN
If we start with a smaller learning rate, such as 0.0001, that is, 1.0E-4, convergence is too slow. Here's the RMSE after ten epochs for learning rate 0.0001, with the results for learning rate 0.001 in parentheses.
... initialLearningRate=1.0E-4 (0.001) ... partialSvd| Factor=0 ... partialSvd| epoch=10 rmse=0.2426697687009369 (0.19438242531824138)
This illustrates how we narrow down the acceptable values for parameters. Here we have determined that 0.01 is too fast a learning rate and 0.0001 too slow a learning rate. We could put more effort in to determine how close 0.001 is to optimal, but we have not done so for this tutorial.
Simulated Annealing
The annealing rate determines how quickly the learning rate is lowered. If it is set too low, the learning rate may be lowered too quickly and there may not be enough epochs for the factor to move to its correct value. If the annealing rate is set too high, the learning rate may not be lowered quickly enough to converge tightly on the proper value.
The learning rate in an epoch is a fraction of the learning rate proportional to the epoch. The value is the number of epochs required to increment the denominator. With an annealing rate of 10, then after 10 epochs, the learning rate is 1/(1+1) = 1/2 of its original value. After 20 epochs, the rate is 1/(1+2), and so on. After 90 epochs, the rate is 1/(1+9) = 1/10th of its original magnitude. After 990 epochs, the rate is 1/100th of its original magnitude and so on.
We started with an annealing rate of 10, which is very agressive in that the temperature cools very quickly. If we set the annealing rate even faster, say to 1, we won't converge fast enough. Here's an example run with the rate set to 1 (and the original rate 10 shown in parentheses):
... annealingRate=1 (10) ... partialSvd| epoch=10 rmse=0.23664566796409153 (0.19438242531824138) ... partialSvd| epoch=199 rmse=0.1925776063676676 (0.19184783367034072)
We are not converging fast enough with the annealing rate set so low (that is, fast).
If we set the annealing rate to 100, the temperature will cool much more slowly (again shown with the original annealing rate of 10 shown in parens).
annealingRate=100 (10) ... partialSvd| epoch=10 rmse=0.19589718967682496 (0.19438242531824138) ... partialSvd| epoch=199 rmse=0.19314865895356362 (0.19184783367034072)
Here, the temperature remains too high to converge to a stable solution. As with the other parameters, experimentation (i.e. trial and error) is the best guide.
Keep in mind that the values of learning rate and annealing rate are highly dependent on one another. As the learning rate goes up (higher temperature), the annealing rate may go down (faster cooling).
Minimum Improvement and Stopping
The minimum and maximum number of epochs control the number of epochs used in exactly the way suggested by their names.
The minimum improvement value is the relative reduction in root mean square error (RMSE) required to continue to the next epoch. If an epoch's work doesn't lower the RMSE by the percentage indicated, the loop exits and the factor is considered complete.
Setting the minimum improvement to 0.0 means that the maximum number of epochs will always be used. If the minimum improvement is set too high, say 0.1, then the singular vectors won't be oriented in the correct directions and the singular value won't have the correct magnitude. The value should be set as high as possible for the precision desired.
Using all the same parameters as above, setting the minImprovement value to 0.0001 leads to fewer loops:
... minImprovement=1.0E-4 ... partialSvd| Factor=0 ... partialSvd| epoch=27 rmse=0.19294226306082 partialSvd| exiting in epoch=27 rmse=0.19294226306082 relDiff=9.601844361088717E-5 ... partialSvd| Factor=1 ... partialSvd| epoch=37 rmse=0.17625093353760177 partialSvd| exiting in epoch=37 rmse=0.17625093353760177 relDiff=9.66111026136112E-5 ... partialSvd| Factor=2 ... partialSvd| epoch=63 rmse=0.15869432360581204 partialSvd| exiting in epoch=63 rmse=0.15869432360581204 relDiff=9.714495167956563E-5
Note that the final RMSE is 0.1587, which is much higher than the 0.1576 RMSE of the run allowed 200 epochs per factor. This reflects stopping before convergence (as does the previous run, in fact), so the singular values and vectors are slightly different. For instance, here's the first singular vector's extreme values (with the original 200 epoch per factor run's values in parens):
... minImprovement=1.0E-4 (0.000) ... FACTOR=0 singular value=980.4105120083517 (985.0349975062319) Extreme Left Values 0 of 0.624 (0.614) 1 to 0.424 (0.441) 2 in 0.387 (0.388) 3 and 0.244 (0.242) Extreme Right Values 0 the 0.740 (0.742) 1 her 0.376 (0.373) 2 his 0.269 (0.268) 3 be 0.218 (0.217)
It's clear the answer is in roughly the same ballpark. The other factors are similarly close, though they vary more as the order increases. Whether or not the exact singular values matter is going to be application dependent. Often, computing more exact answers is not necessary.
Regularization by Shrinkage
Regularized singular value decomposition optimizes a slightly different scoring function than standard SVD. The standard variety finds the solution with the least square error of any orthogonal factorization for any given order. At the order equal to the rank of the matrix (number of linearly independent rows and columns), the standard SVD returns an exact factorization (up to arithmetic precision and estimate convergence iterations).
Regularized SVD considers not only squared error from the original matrix values, but also the size of the singular values. It provides the square of the singular values times the shrinkage factor as an additional error term. In simpler terms, it penalizes large singular values proportionally to the square of their length. Regularized SVD thus minimizes the combined error: least squares plus squared singular values.
With linear regression, problems may be ill-conditioned if there are linear dependencies among the input vectors. Regularization is often used to overcome this problem. Ill-conditioning does not cause a problem for our stochastic gradient descent SVD solver.
Linear regression often overfits a set of training data. To compensate for overfitting, regularization is often used to reduce the magnitude of the regression coefficients. This use corresponds to the use in SVD.
Finally, regularization is able to eliminate dimensions, much like non-parametric process models, such as the Dirichlet process prior for a multinomial of unknown dimensionality. Like the process priors, regularization may be interpreted as a prior on the size of singular values.
The exact definition of regularization is found in the class
documentation for SvdMatrix
, which also contains
references to textbook presentations of regularization.
Running the Demo
The demo may be run using the following ant target.
> cd $LINGPIPE/demos/tutorial/svd > ant -Dregularization=0.00 -DmaxEpochs=100000 regularized Regularized SVD. TEST MATRIX 1.0, 2.0, 3.0, -4.0 1.0, 4.0, 9.0, -16.0 1.0, 8.0, 27.0, -64.0 -1.0, -16.0, -81.0, 256.0 SVD PARAMS Regularization=0.0 Computing SVD maxFactors=4 featureInit=0.01 initialLearningRate=5.0E-4 annealingRate=10000 regularization=0.0 minImprovement=0.0 minEpochs=10 maxEpochs=100000 Singular Values k=0 value= 278.47 k=1 value= 9.00 k=2 value= 1.03 k=3 value= 0.00 Reconstructed Matrix 0.96, 2.04, 2.98, -4.00 1.06, 3.94, 9.03, -15.99 0.98, 8.02, 26.99, -64.00 -1.00, -16.00, -81.00, 256.00
The are two properties supplied by command line, the regularization parameter and the maximum number of epochs to run. With 100,000 epochs, the SVD completes in under a second with an accurate reconstruction.
The code is just like our other cases, so we do not repeat it
here. It may be found in the file src/RegularizedSvd.java
.
Let's see what happens when we keep our matrix the same, but change the regularization parameter. With a value of 0.01, the output is:
> ant -Dregularization=0.01 -DmaxEpochs=100000 regularized ... Singular Values k=0 value= 278.43 k=1 value= 8.96 k=2 value= 0.99 k=3 value= 0.00 Reconstructed Matrix 0.93, 2.02, 2.99, -4.00 1.04, 3.92, 9.02, -16.00 0.99, 8.01, 26.95, -64.00 -1.00, -16.00, -81.00, 255.96
The reconstruction is not as accurate, and the singular values ar ea bit smaller. Now let's look at 0.1 and 1.0:
> ant -Dregularization=0.1 -DmaxEpochs=100000 regularized ... Singular Values k=0 value= 278.08 k=1 value= 8.60 k=2 value= 0.63 k=3 value= 0.00 Reconstructed Matrix 0.71, 1.82, 3.04, -3.99 0.91, 3.75, 8.93, -16.01 1.08, 7.95, 26.63, -64.02 -0.99, -16.01, -80.97, 255.61
And here is with regularization of 1.0:
> ant -Dregularization=1.0 -DmaxEpochs=100000 regularized ... Singular Values k=0 value= 274.54 k=1 value= 5.00 k=2 value= 0.00 k=3 value= 0.00 Reconstructed Matrix 0.22, 1.04, 2.50, -4.16 0.47, 2.58, 7.53, -16.30 0.89, 6.51, 24.07, -64.01 -1.06, -16.24, -80.61, 252.11
The reconstruction becomes less and less exact as the regularization goes up. B the time the regularization is 1.0, there are only two factors being returned. The third and forth singular values are zero. With a regularization of 10, we are down to a single factor:
> ant -Dregularization=10.0 -DmaxEpochs=100000 regularized ... Singular Values k=0 value= 239.02 k=1 value= 0.00 k=2 value= 0.00 k=3 value= 0.00 Reconstructed Matrix 0.02, 0.26, 1.28, -3.95 0.07, 0.99, 4.80, -14.83 0.28, 3.80, 18.35, -56.67 -1.07, -14.70, -70.96, 219.21
It is an empirical matter in an application to see whether regularization helps or not.
Scaling Inputs
Although standard SVD is invariant under the scaling of the input matrix, regularized regression is not. That is, if we rescale the inputs and leave the scaling factor the same, we get answers that are different (and not just different by the scaling factors). A typical kind of scaling is normalizing to Z-scores. This might be done per-row, per-column, or for the whole matrix, depending on the kind of application. Note that the Z-score conversion is not linear. Then the SVD is taken of the transformed matrix, and the results converted back using an inverse z-score. For a single problem, this doesn't matter. One can simply choose an appropriate regularization parameter. But if you are fitting a regularization parameter across problems, a Z-score conversion is likely in order.
Partial Matrix SVD
The real strength of stochastic gradient descent is that it allows SVDs to be computed for partial matrices. A partial matrix is one where some values are unknown. LingPipe's gradient descent solver works by minimizing the square error between the SVD reconstructed matrix and the known values.
Representing Partial Matrices
We represent a partial matrix A
using
a pair of parallel two-dimensional matrices:
int[][] columnIndexes; double[][] values;
The arrays values[i]
and columnIndexes[i]
represent the i
-th row of the partial matrix.
Thus values.length
must be the same as
columnIndexes.length
, namely the number of rows
in the partial matrix. The array columnIndexes[i]
represents the columns taking values values[i]
.
That is, the original matrix takes values:
A[i][columnIndexes[i][j]] = values[i][j]
Thus for each row i
,
columnIndexes[i]
and values[i]
must be the same length.
Worked Example
Suppose we have a 3 × 4 partial matrix:
A = { { 1, ?, 2, 5 }, A[0][0] = 1; A[0][2] = 2; A[0][3] = 5; { ?, ?, 9, ? }, A[1][2] = 9; { 3, 7, ?, ? } } A[2][0] = 3; A[2][1] = 7
We use ?
to indicate unknown values. This
partial matrix would be represented by the following pair of
parallel arrays:
columnIndexes = { { 0, 2, 3 }, { 2 }, { 0, 1 } } values= { { 1.0, 2.0, 5.0 }, { 9.0 }, { 3.0, 7.0 } }
Example of Partial SVD
We will illustrate partial matrix SVD using the same data as for full matrix SVD, but we'll cheat. We'll consider all the zero values as unknown, rather than as zero. The effect of ignoring zero entries is that we will not consider the error on bigrams with zero counts in fitting the SVD.
The code for the partial SVD is in src/PartialTokenBigram.svd
.
The only difference is in how the inputs are computed and which SVD method is called. We start with the same complete two-dimensional array values
of counts, only this time they are transformed into a partial
array representation:
double[][] values = TokenBigramSvd .extractBigrams(textFile,symbolTable,tokenizerFactory,charset); int[][] columnIds = columnIds(values); double[][] partialValues = partialValues(values);
The columnIds()
and partialValues()
methods are
similar, so we only show one here:
static int[][] columnIds(double[][] values) { int[][] columnIds = new int[values.length][]; for (int i = 0; i < values.length; ++i) columnIds[i] = columnIdsRow(values[i]); return columnIds; } static int[] columnIdsRow(double[] values) { int count = 0; for (int i = 0; i < values.length; ++i) if (values[i] != 0) ++count; int[] columnIdsRow = new int[count]; count = 0; for (int i = 0; i < values.length; ++i) if (values[i] != 0) columnIdsRow[count++] = i; return columnIdsRow; }
The only substantial difference with the values computation is that arrays of double precision floating point are allocated and the assignment in the innermost loop is:
static double[] partialValuesRow(double[] values) { ... if (values[i] != 0) partialValuesRow[count++] = values[i]; ...
Then, a call is made to the partial SVD method:
SvdMatrix matrix = SvdMatrix.partialSvd(columnIds, partialValues, maxFactors, ...
We have ellided all the other arguments as they are
exactly the same as for the complete matrix method
svd()
. The report on the resulting SVD matrix
is the same. We have used similar, though not identical
parameters (see the program run's report for values):
> ant -Dtext.file=c:\carp\data\gutenberg\dist\JaneAusten\pandp12.txt partialTokenBigram ... TokenBigramSVD Extracting Bigrams File=C:\carp\data\gutenberg\dist\JaneAusten\pandp12.txt tokenizerFactory.getClass()=class com.aliasi.tokenizer.IndoEuropeanTokenizerFactory input charset=ASCII Number of tokens=149362 Number of distinct tokens=7029 #Matrix entries=49406841 Computing SVD maxFactors=3 featureInit=1.0 initialLearningRate=0.0020 annealingRate=100 regularization0.0 minImprovement=0.0 minEpochs=10 maxEpochs=10000 output charset=ASCII
We run the partial SVD for many more epochs, as each epoch is much cheaper to compute:
partialSvd| Start partialSvd| Factor=0 partialSvd| epoch=0 rmse=7.133949741392208 partialSvd| epoch=1 rmse=6.722808642635284 partialSvd| epoch=2 rmse=6.406373728494666 ... partialSvd| epoch=9998 rmse=4.69495343471722 partialSvd| epoch=9999 rmse=4.694952358911906 partialSvd| Factor=1 partialSvd| epoch=0 rmse=4.586320695123342 partialSvd| epoch=1 rmse=4.541375190262957 ... partialSvd| epoch=9999 rmse=3.9478144855972626 partialSvd| Factor=2 partialSvd| epoch=0 rmse=3.871646442556527 partialSvd| epoch=1 rmse=3.8383954975292975 ... partialSvd| epoch=9999 rmse=3.2421189213528083
Overall root mean square error (RMSE) is higher, but that is largely because we are only measuring on the non-zero entries.
Because there is no penalty for estimating unseen bigrams as any values, the singular vectors have far less bigram coherence:
ORDER=0 singular value=12821.954397532272 Extreme Left Values 0 Lady 0.438 1 Miss 0.430 2 the 0.320 3 of 0.172 4 able 0.165 5 a 0.148 6 to 0.113 7 in 0.110 8 acquainted 0.096 9 obliged 0.090 Extreme Right Values 0 am 0.476 1 was 0.238 2 be 0.227 3 the 0.219 4 she 0.150 5 is 0.144 6 had 0.134 7 deal 0.129 8 been 0.128 9 he 0.110
The signular value is also much much higher. The next two singular values and vectors are similarly different than the complete values:
ORDER=1 singular value=7360.618593364118 Extreme Left Values 0 was 0.208 1 such 0.206 2 could 0.163 3 did 0.146 4 do 0.129 5 is 0.128 ... 7024 in -0.051 7025 part -0.058 7026 she -0.081 7027 he -0.085 7028 of -0.091 Extreme Right Values 0 sister 0.186 1 wife 0.166 2 been 0.160 3 mother 0.126 4 length 0.105 5 father 0.084 ... 7024 was -0.073 7025 had -0.086 7026 world -0.105 7027 am -0.618
ORDER=2 singular value=4651.550676369403 Extreme Left Values 0 I 0.306 1 no 0.118 2 that 0.091 3 and 0.075 4 of 0.069 5 resolved 0.061 ... 7024 It -0.084 7025 in -0.113 7026 to -0.114 7027 Miss -0.125 7028 a -0.397 Extreme Right Values 0 sure 0.422 1 I 0.212 2 she 0.210 3 he 0.209 4 afraid 0.095 5 they 0.080 6 of 0.078 ... 7023 least -0.049 7024 was -0.052 7025 dear -0.052 7026 is -0.078 7027 be -0.102
For the bigrams with the largest counts, our reconstruction accuracy is actually a bit better than that for the complete matrix approach:
RECONSTRUCTED TOP COUNTS LeftToken,RightToken OriginalValue SvdValue of,the 475.0 493.26184029020783 to,be 437.0 383.39913620975636 in,the 372.0 302.44315562140076 I,am 302.0 298.54304987916515 of,her 262.0 253.0247324297928 to,the 253.0 307.8277758748272 of,his 233.0 201.66868338777743 had,been 198.0 173.48675219997824 I,have 187.0 152.12066400202096 to,her 173.0 135.74997015344465 she,had 168.0 107.08355045181543 that,he 168.0 120.57469837919749 could,not 167.0 102.27916174649268 and,the 162.0 53.11393249643995 it,was 161.0 101.53309536454064 she,was 156.0 133.15050106612105 for,the 155.0 85.15527656314849 he,had 145.0 101.7485686594175 on,the 138.0 119.02961305369861 such,a 138.0 88.83854905697908 have,been 135.0 115.50775127131148 did,not 131.0 90.29156054123864 he,was 126.0 121.92024146656152 that,she 124.0 133.05950294976054 in,a 123.0 85.86606733901692
The better estimation of high count pairs comes at a cost -- we do much worse at estimating the zero counts.
Comparison with Algebraic SVD Solvers
Standard algebraic SVD solvers, such as those found in popular matrix libraries (see the book by Strang or the Wikipedia article in the references) require cubic time to solve in the dimensionality of the matrix. Standard solvers are numerically unstable, but this problem may be overcome with regularization. The biggest problem for standard solvers is that they don't apply to partial matrices.
LingPipe's SVD solver uses stochastic gradient descent.
This algorithm requires a an amount of arithmetic effort
bounded by the maximum number of epochs times the number of
matrix entries times the number of factors. For full matrices,
the number of entries is quadratic and the number of epochs
is usually a relatively small constant like 200. If a
full rank decomposition is required, we are still stuck
with a cubic algorithm. But if we only require the first
k factors, we have an O(k * n^{2})
algorithm, where k
is the number of factors
and n
is the number of dimensions. If the
matrix is partial, the n^{2}
factor
is really just the number of defined entries.
There are better techniques than stochastic gradient descent for solving full-rank singular value decompositions on modestly sized matrices. These algorithms are cubic in the dimensionality of the matrix, so are prohibitive for large matrices. Such algorithms are not necessary if a low order approximation is desired. The gradient descent algorithm finds a single factor at a time with a few hundred steps per entry per factor.
An example of full matrix SVD for small matrices, the unit test contains examples:
References
Overviews and Reference List
- Michael Berry. Latent Semantic Indexing Web Site. University of Tennessee.
- BenoĆ®t Lemaire and Philippe Dessus. Readings in Latent Semantic Analysis for Cognitive Science and Education. University of Grenoble.
- University of Colorado. LSA Page.
Latent Semantic Analysis and Indexing
- Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science. 41(6):391--407.
- Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. 1995. Using linear algebra for intelligent information retrieval. SIAM Review 37(4):573--595.
SVD Solvers
- Gorrell, G. and Webb, B., 2005. Generalized Hebbian Algorithm for Latent Semantic Analysis. Proceedings of Interspeech 2005.
- Genevieve Gorrell. 2005. Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. EACL 2006.
- Genevieve Gorrell. 2006. Generalized Hebbian Algorithm for Dimensionality Reduction in Natural Language Processing. Ph.D. Thesis. Linköping University. Sweden.
- Genevieve Gorrell. GHAPACK.
- Timely Development's Netflix Prize Page.
- Gene H. Golub and Charles F. van Loan. 1996. Matrix Computations. 3rd Edition. Johns Hopkins.
- Gilbert Strang. 2006. Linear Algebra and its Applications. 4th Edition. Thomson.
Other NLP Applications
- Jennifer Chu-Carroll and Bob Carpenter. 1999. Vector-based natural language call routing. Computational Linguistics 25(3).