Eigenvectors and grouping

To group pages by concepts, Smart Wiki Search uses the eigendecomposition of the Wikipedia link transition matrix. Practically, the grouping is performed on an NVidia CUDA card. Eigendecomposition is given by a set of special vectors, called eigenvectors, and their corresponding eigenvalues. Eigenvectors υ are "almost invariant" under the linear transformation described by their parent matrix A; they merely get multiplied by constants - their eigenvalues λ:

A complete set of (complex) eigenvectors and eigenvalues fully describes the matrix. Eigenvectors are not created equal, however. Their relative importance for the forward transformation is given by the absolute magnitudes of their eigenvalues. Because of this, a relatively small number of the eigenvectors corresponding to the eigenvalues with the largest magnitudes can capture the most important properties of the link transition structure.

Google uses the principal eigenvector (the eigenvector with the largest magnitude) to rank the pages in their search results. Due to the Perron-Frobenius theorem, the eigenvalue of the principal eigenvector is 1 (given that the transition matrix is irreducible). All other eigenvectors have lower eigenvalue magnitudes. The components of the principal eigenvector correspond to the stationary probabilities of the user being present at a particular page after randomly following a large number of links. All other eigenvectors do not have this probability meaning and cannot be used for ranking pages.

However many of these other eigenvectors have eigenvalue magnitudes only slightly below 1, which means that they carry almost as much information about the transition matrix. Therefore, it would be natural to try to use them in search algorithms. The most obvious application is the grouping of similar pages, which is what Smart Wiki Search does.

It turns out that page grouping in the vector space spanned by the eigenvectors themselves is not the most computationally efficient method. Eigenvectors of a real non-symmetric matrix like the transition matrix are either real or come in complex-conjugate pairs. To avoid increased storage requirements due the imaginary parts of the eigenvectors, Smart Wiki Search performs page grouping in the so-called Arnoldi basis vectors instead. The Arnoldi basis is a real orthonomal basis found as an intermediate step in the computation of eigenvectors. The significance of the Arnoldi basis is that each eigenvector can be represented as a linear combination of Arnoldi basis vectors with complex coefficients.

In the Smart Wiki Search algorithm, the ~1,600 or so Arnoldi vectors corresponding to the eigenvectors with the largest eigenvalue magnitudes are calculated by the Arnoldi method. A practically usable implementation of this method for large sparse matrices, called the Implicitly Restarted Arnoldi algorithm, was developed by R. Lehoucq et al. in the 1990s and distributed as a free numerical library called ARPACK. All major numerical software packages (Mathematica, Matlab etc.) later incorporated ARPACK for their sparse matrix eigenvalue problems.

To be on the safe side of Google's (or rather Stanford University's) patents, Smart Wiki Search does not use the first Arnoldi vector, which is numerically identical to the principal eigenvector. Discarding the principal eigenvector does not have a significant effect on the quality of grouping. This is understandable, since the sum of eigenvalue magnitudes of the remaining 1,599 vectors is hundreds of times greater than the principal eigenvalue.

Distance calculation: using NVidia CUDA card

When the user submits one or more Wikipedia pages in a query, the distances between these pages and every other page in the included subset are calculated, and closest pages are returned to the in the results. The distance measure is a combination of the direct product and cosine measure:

That is, the distance between pages a and b is the cosine measure multiplied by the absolute value of the direct product to the power of 0.2 (note that taking it to the power of zero would mean not including the direct product at all). a and b in this case are vectors containing the coordinates of the corresponding pages in the Arnoldi basis.

The calculation of direct products can, in practice, be sped up significantly by running it as a matrix multiply operation: the huge Arnoldi basis set needs to be multiplied by a long, thin matrix corresponding to the pages in the user's query. The cosine measures can then be computed from the direct products by dividing them by the absolute magnitudes of vectors a and b.

The matrix multiply operation is by far the most time-consuming part of the algorithm. To give rough numbers, there are n = 300,000 pages in the searched subset (these are pages having links from at least 14 other Wikipedia pages), k = 1,600 Arnoldi basis vectors, and, let's say, there are i = 2 pages in the user's query. The Arnoldi basis is n-by-k in size, and the query matrix is k-by-i. That means that during the matrix multiply operation there will be nki multiply-add operations performed. To speed up the computation, it is run on an NVidia CUDA card (GeForce GTX285). The matrix-matrix multiply function is found in the CUBLAS library (cublasSgemm). With CUDA, a factor of 5-8 speedup is achieved compared to running the same multiplication on the CPU. CUDA cards have an enormous memory bandwidth and hundreds of processing units, which explains their advantage over the CPU on parallelizable tasks such as matrix multiplication. Without CUDA, the computation would take too long to be practical on the current set-up.

In the current configuration, the matrix multiplication for a user request containing one page takes 0.141 s on GPU versus 0.95 s on CPU. For two pages in the query, it takes only 0.143 s on the GPU versus 1.20 s on the CPU. For three pages, it is 0.145 on the GPU and 1.50 s on the CPU. As you can see, for some reason increasing the number of pages in the query (that is, increasing i) carries a negligible additional cost for the GPU but adds significantly to the CPU workload. The CPU in this case is an Intel Core i7 running Intel Math Kernel Library BLAS (sgemm).

Automatic disambiguation

Sometimes the user may encounter disambiguation pages in the results. This usually happens after the user enters words with multiple meanings, for example, "Lock" - it is a disambiguation page in Wikipedia. In cases like these, Smart Wiki Search can provide automatic disambiguation. For disambiguation to work, the user must add the hint operator, *, and a hint word from the relevant field, for example "Lock *river" or "Lock *door".

Disambiguation works in the same way as the general grouping algorithm: it measures the Arnoldi subspace distances from the different meanings of "lock" to the hint word (in the above examples, "river" or "door"), and chooses the meaning with the lowest distance.

Query parser

The parser extracts the names of Wikipedia pages, redirects or disambiguation pages from the search query. It does this by trying to match every substring of the query to the list of page names. The parser then discards the matches that are themselves substrings of other matches. That is, the query "US president" will only match the redirect with the same name, not "US" or "President" separately. If the parser encounters a redirect, it replaces it with the name of the page to which the redirect points. In the case of a disambiguation page, the parser will try to guess the intention of the user if a hint word is provided after the hint operator, *.

Differences with Latent Semantic Indexing (LSI)

The Smart Wiki Search algorithm groups pages in the Arnoldi basis corresponding to the dominant eigenvectors of the Wikipedia link transition matrix. SLI, instead, uses singular value decomposition (SVD). SVD produces two vector subspaces, one for "input" vectors and one for "output" vectors. Neither of these subspaces will work for page grouping as well as the eigenspace (or Arnoldi basis). The fundamental reason for this has to do with the fact that eigendecomposition treats the input space and output space as one, while SVD does not. Loosely speaking, eigendecomposition (and not SVD) is a natural fit for matrices that project a vector space onto itself, as is the case with the link transition matrix.


Questions or comments? Send a message to author{at}dizzylogic.com.