Singular Value Decomposition Example In Python

August 05, 2019


Singular Value Decomposition, or SVD, has a wide array of applications. These include dimensionality reduction, image compression, and denoising data. In essence, SVD states that a matrix can be represented as the product of three other matrices. In mathematical terms, SVD can be written as follows:

A = U S Vᵀ

where A is an n x p matrix, n is the number of rows (i.e. samples), and p represents the number of dimensions. U is an n x n orthogonal matrix, S is an n x p diagonal matrix of singular values, and V is a p x p orthogonal matrix.
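As a quick sanity check on those shapes, here's a minimal NumPy sketch (the 5 x 3 random matrix is made up purely for illustration):

import numpy as np

# Arbitrary 5 x 3 matrix, i.e. n = 5 samples and p = 3 dimensions
A = np.random.rand(5, 3)

# With full_matrices=True (the default), U is n x n and VT is p x p;
# s is returned as a 1-D array of singular values
U, s, VT = np.linalg.svd(A)

print(U.shape, s.shape, VT.shape)  # (5, 5) (3,) (3, 3)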

Suppose we had a matrix A.

In order to determine the associated U matrix, we must first find the eigenvectors of the matrix A multiplied by the transpose of the matrix A, since

A Aᵀ = (U S Vᵀ)(V Sᵀ Uᵀ) = U (S Sᵀ) Uᵀ

which is precisely an eigendecomposition of A Aᵀ whose eigenvectors are the columns of U.

Recall the definition of eigenvectors and eigenvalues:

A x = λ x

The latter can also be expressed as:

(A − λ I) x = 0
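To make this concrete, here's a small sketch (the 2 x 2 matrix is made up for illustration) verifying that the eigenpairs returned by NumPy satisfy the definition:

import numpy as np

# Made-up symmetric 2 x 2 matrix
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and the eigenvectors (as columns)
eigenvalues, eigenvectors = np.linalg.eig(M)

# Check that M @ x equals lambda * x for each eigenpair
for i in range(len(eigenvalues)):
    x = eigenvectors[:, i]
    print(np.allclose(M @ x, eigenvalues[i] * x))  # True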

Thus, going back to our example, we set the determinant of A Aᵀ − λ I to zero:

det(A Aᵀ − λ I) = 0

We end up with a fourth degree polynomial.

After solving, we replace lambda with one of the eigenvalues.

After multiplying the matrix by the vector x, we obtain the following:

In solving the equations, we get:

We plug the eigenvectors into the columns of U.

Then, we repeat the same process for the transpose of the matrix A multiplied by the matrix A.

After solving, we obtain the expression for V.

Finally, S is a diagonal matrix whose values are the square roots of the eigenvalues of either A Aᵀ or Aᵀ A (the two share the same nonzero eigenvalues), arranged along the diagonal in descending order.

Here’s where things get interesting. We know that the product of all three matrices is equivalent to the matrix A on the left-hand side.
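Since the figures for the worked example aren't reproduced here, the following is a hedged NumPy sketch of the same procedure on a made-up matrix: take the eigendecomposition of Aᵀ A to get V and the squared singular values, then recover U and confirm that U S Vᵀ reproduces A. In practice you would simply call np.linalg.svd, which handles all of this for you.

import numpy as np

# Made-up 4 x 3 matrix standing in for the example above
A = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 3.0],
              [2.0, 2.0, 2.0]])

# Eigendecomposition of A^T A yields V and the squared singular values.
# np.linalg.eigh returns eigenvalues in ascending order, so reverse them.
eigenvalues, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

# The singular values are the square roots of the eigenvalues
s = np.sqrt(np.clip(eigenvalues, 0, None))

# Each column of U follows from u_i = A v_i / sigma_i; deriving U this
# way keeps its signs consistent with V (computing U independently from
# the eigenvectors of A A^T can leave mismatched signs)
U = A @ V / s

# S is the diagonal matrix of singular values
S = np.diag(s)

print(np.allclose(U @ S @ V.T, A))  # True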

We can exclude features and still maintain an approximation of the original matrix. Say the matrix A is a dataset of rows and columns, or the pixels that make up an image; we can, in theory, train models using the newly formed matrix and achieve comparable, if not better (due to the curse of dimensionality), accuracy. The sketch below illustrates the idea.
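As a rough illustration (again on a made-up random matrix rather than a real dataset), truncating the SVD to its top k singular values gives a rank-k approximation whose reconstruction error shrinks as k grows:

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 10))

U, s, VT = np.linalg.svd(A, full_matrices=False)

for k in (1, 2, 5, 10):
    # Keep only the top k singular values/vectors
    A_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]
    error = np.linalg.norm(A - A_k)  # Frobenius norm of the residual
    print(f"k={k:2d}  reconstruction error: {error:.4f}")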

Code

Let’s take a look at how we could go about applying Singular Value Decomposition in Python. To begin, import the following libraries.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

# Print floats with two decimal places for readability
float_formatter = lambda x: "%.2f" % x
np.set_printoptions(formatter={'float_kind': float_formatter})

In the following tutorial, we’ll attempt to classify handwritten digits. Fortunately, the scikit-learn library provides a wrapper function for importing the dataset into our program.

X, y = load_digits(return_X_y=True)

The dataset contains 1797 8x8 images. If you specify return_X_y=True, the function will return each image’s pixels flattened into a one-dimensional array of length 64.

X.shape

y contains the labels for every digit.

y

Let’s take a look at the first digit. As we can see, it’s simply an array of length 64 containing the pixel intensities.

image = X[0]

If we want to view the image using matplotlib, we must first reshape the array.

image = image.reshape((8, 8))
plt.matshow(image, cmap = 'gray')

Next, we’ll use Singular Value Decomposition to see whether we are able to reconstruct the image using only 2 features for each row. The s returned by the function is a one-dimensional array of singular values, which must be converted into a diagonal matrix using the diag method. By default, diag will create a square matrix whose side equals the length of s. This causes a problem, as the sizes of the matrices no longer satisfy the rule of matrix multiplication, which requires the number of columns in one matrix to match the number of rows in the next. Therefore, we create a new m x n matrix of zeros and populate its upper-left block with the diagonal matrix.

U, s, VT = np.linalg.svd(image)
# Build an m x n matrix and place the singular values on its diagonal
S = np.zeros((image.shape[0], image.shape[1]))
S[:len(s), :len(s)] = np.diag(s)
# Keep only the top 2 singular values/vectors
n_component = 2
S = S[:, :n_component]
VT = VT[:n_component, :]
# Reconstruct an approximation of the original image
A = U.dot(S.dot(VT))
print(A)

plt.matshow(A, cmap='gray')

We can get the reduced feature space by taking the dot product of the U and S matrices.

U.dot(S)
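As a side note (continuing with the U, S, VT, and image variables from the blocks above), this 8 x 2 representation is exactly what you get by projecting the rows of the original image onto the top 2 right singular vectors:

# U @ S equals image @ VT.T because A = U S V^T and V is orthogonal
print(np.allclose(U.dot(S), image.dot(VT.T)))  # True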

Original vs Reduced Feature Space

Let’s compare the accuracy of a Random Forest model when it’s trained using the original handwritten digits and when it’s trained using the reduced feature space obtained from Singular Value Decomposition.

We can gauge the accuracy of the model by taking a look at the Out-Of-Bag score. If you’re unfamiliar with the concept of OOB, I encourage you to check out this post on Random Forest.

rf_original = RandomForestClassifier(oob_score=True)
rf_original.fit(X, y)
rf_original.oob_score_

Next, we create and fit an instance of the TruncatedSVD class with 2 components. It’s worth mentioning that, unlike the previous example, we’re using only 2 of the 64 features.

svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X)

Every image (i.e. row) in the reduced dataset contains 2 features.

X_reduced[0]

Taking a look at the image, it’s difficult to distinguish what digit it represents; it could very well be a 5 and not a 0.

image_reduced = svd.inverse_transform(X_reduced[0].reshape(1,-1))
image_reduced = image_reduced.reshape((8,8))
plt.matshow(image_reduced, cmap = 'gray')

After training a Random Forest Classifier on the reduced dataset, we obtain a meager accuracy of 36.7%.

rf_reduced = RandomForestClassifier(oob_score=True)
rf_reduced.fit(X_reduced, y)
rf_reduced.oob_score_

We can get the total variance explained by taking the sum of the explained_variance_ratio_ property. We generally want to aim for 80 to 90 percent.

svd.explained_variance_ratio_.sum()
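If you’d rather pick the number of components systematically (a hedged sketch rather than part of the original walkthrough), fit TruncatedSVD with the maximum number of components, take the cumulative sum of explained_variance_ratio_, and keep the smallest count that crosses your target:

# TruncatedSVD supports at most n_features - 1 components (63 here)
svd_full = TruncatedSVD(n_components=63)
svd_full.fit(X)

cumulative = np.cumsum(svd_full.explained_variance_ratio_)
# Smallest number of components explaining at least 90% of the variance
n_components = int(np.argmax(cumulative >= 0.90)) + 1
print(n_components)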

Let’s try again, only this time we use 16 components. We check how much of the original variance is retained by the 16 features.

svd = TruncatedSVD(n_components=16)
X_reduced = svd.fit_transform(X)
svd.explained_variance_ratio_.sum()

We obtain an accuracy comparable to that of the model trained using the original images, despite using only a quarter (16/64) of the data.

rf_reduced = RandomForestClassifier(oob_score=True)
rf_reduced.fit(X_reduced, y)
rf_reduced.oob_score_


Written by Cory Maklin