Principal Component Analysis Example In Python

May 04, 2019

Principal Component Analysis (PCA) is used to reduce the number of features without losing too much information. Having too many dimensions makes the data difficult to visualize and makes training models more computationally expensive.

To give us an intuitive understanding of PCA, suppose that we wanted to build a model to predict house prices. We’d start off by collecting data on the houses in the area. Say that we amassed thousands of samples, where each sample contained information on a given house’s properties. These properties include the number of bedrooms, the number of bathrooms and the square footage. Common sense would lead us to believe that there is some kind of relationship between the number of bedrooms, the number of bathrooms and the square footage of the house. In other words, we suspect that the higher the square footage of a house, the more bedrooms and bathrooms it contains. If the variables are highly correlated, is it really necessary that we have three separate variables for the same underlying feature (i.e. size)? Principal component analysis looks for combinations of the correlated variables (e.g. number of bathrooms and square footage) that account for the most variance in the data, and merges them into a smaller number of new features. Note that it does this using only the features themselves, not the house price.

Say that we plotted three samples of an arbitrary variable.

The mean is equal to the sum of all the data points divided by the total number of samples.

The deviation of a single data point is then its signed distance from the mean.

The variance of the entire dataset is the sum of the squares of these deviations, divided by the total number of samples. We square the values because, in a coordinate system, deviations to the left of the mean are negative and would otherwise cancel out those to the right of the mean.
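To make the arithmetic concrete, here is a minimal sketch of the mean and variance calculation. The three sample values are made up for illustration and are not taken from the plots in this post.

```python
import numpy as np

# Three made-up samples of an arbitrary variable (illustrative values only).
samples = np.array([2.0, 4.0, 9.0])

# Mean: sum of all data points divided by the number of samples.
mean = samples.sum() / len(samples)

# Signed distance of each point from the mean; these always sum to zero.
deviations = samples - mean

# Variance: average of the squared distances from the mean.
variance = (deviations ** 2).sum() / len(samples)

print(mean, deviations, variance)

# np.var divides by the number of samples by default, so it should agree.
print(np.isclose(variance, np.var(samples)))
```

Note that the unsquared distances sum to zero, which is exactly why they are squared before averaging.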

In two dimensions, to calculate the variance of one of the variables, we project the data points onto its axis and then follow the same procedure as before. The mean and variance of a feature (e.g. salary) are the same regardless of the other feature it’s plotted against.

Unfortunately, the x and y variances on their own do not tell us how the two variables relate to each other.

Despite the clear difference in the two plots, they result in the same x and y variances.

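Therefore, we make use of another property called covariance. When calculating covariance, rather than squaring each point’s distance from the mean, we multiply its x and y distances from their respective means (for centered data, these are simply the x and y coordinates) and then average the products over all samples.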
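As a small sketch of why covariance is needed, the two made-up datasets below have identical x and y variances but opposite covariances. The `covariance` helper and the data values are assumptions introduced purely for illustration.

```python
import numpy as np

# Two made-up datasets sharing the same x values (illustrative only).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_up = x.copy()   # points rise from left to right
y_down = -x       # points fall from left to right


def covariance(a, b):
    """Average product of each variable's distances from its own mean."""
    return ((a - a.mean()) * (b - b.mean())).mean()


# The variances cannot tell the two datasets apart...
print(np.var(y_up), np.var(y_down))

# ...but the sign of the covariance can.
print(covariance(x, y_up))
print(covariance(x, y_down))
```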

Covariance and correlation are distinct concepts, but both describe the relationship between two variables. To be more specific, the correlation between two variables is the covariance divided by the square root of the product of their variances (equivalently, divided by the product of their standard deviations). When the correlation or covariance is negative, an increase in x tends to go with a decrease in y, and when it is positive, an increase in x tends to go with an increase in y.
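The relationship between the two quantities can be checked numerically. The following is a minimal sketch on made-up values, compared against NumPy's own `np.corrcoef`.

```python
import numpy as np

# Made-up paired measurements (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Covariance: average product of the distances from each mean.
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()

# Correlation: covariance divided by the square root of the product
# of the two variances.
corr_xy = cov_xy / np.sqrt(np.var(x) * np.var(y))

print(cov_xy, corr_xy)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
print(np.isclose(corr_xy, np.corrcoef(x, y)[0, 1]))
```

Unlike covariance, the correlation is unitless and always falls between -1 and 1, which makes it easier to compare across pairs of variables.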

The covariance matrix of two variables consists of the variance of the first variable in the top left corner, the variance of the second in the bottom right corner and the covariance in the remaining two off-diagonal positions.
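As a quick sketch of that layout, NumPy's `np.cov` builds the matrix directly. The data values are made up, and `bias=True` is chosen here only so that the result divides by the number of samples, matching the variance definition used earlier.

```python
import numpy as np

# Made-up paired measurements (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# bias=True divides by n rather than n - 1.
cov_matrix = np.cov(x, y, bias=True)
print(cov_matrix)

# Top left: variance of x. Bottom right: variance of y.
print(np.isclose(cov_matrix[0, 0], np.var(x)))
print(np.isclose(cov_matrix[1, 1], np.var(y)))

# The two off-diagonal entries both hold the covariance of x and y.
print(np.isclose(cov_matrix[0, 1], cov_matrix[1, 0]))
```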

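As part of the PCA algorithm, the covariance matrix is used to calculate the eigenvalues and eigenvectors. The eigenvectors give the directions of the principal components, and each eigenvalue gives the variance of the data along the corresponding direction.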

Say that we plotted the relationship between two arbitrary variables.

First, we center the data.

Next, we draw two vectors in the directions of the eigenvectors, with magnitudes equal to the corresponding eigenvalues.

We then keep the eigenvector with the largest eigenvalue, i.e. the direction of highest variance, as it will result in the least amount of information loss when we drop the other dimension.

Afterwards, the data points are projected onto the line defined by that eigenvector.

The positions of the projected points along that line are then used as a one-dimensional plot of a new feature.
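The steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data (the random seed, slope and noise level are arbitrary assumptions), not necessarily how a library such as scikit-learn computes PCA internally.

```python
import numpy as np

# Synthetic, correlated two-dimensional data (arbitrary illustrative values).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)
data = np.column_stack([x, y])

# 1. Center the data so that each variable has a mean of zero.
centered = data - data.mean(axis=0)

# 2. Compute the covariance matrix and its eigenvalues and eigenvectors.
#    np.linalg.eigh is meant for symmetric matrices and returns the
#    eigenvalues in ascending order, with eigenvectors as columns.
cov_matrix = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# 3. Keep the eigenvector with the largest eigenvalue.
first_pc = eigenvectors[:, np.argmax(eigenvalues)]

# 4. Project every point onto that direction to get one new feature.
projected = centered @ first_pc

print(eigenvalues)
print(projected.shape)
```

The variance of the projected points equals the largest eigenvalue, which is why that direction loses the least information.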

Code

Let’s take a look at how we could go about implementing principal component analysis in Python. To begin, import all the necessary libraries.

import pandas as pd  
import numpy as np  
from matplotlib import pyplot as plt  
from sklearn.decomposition import PCA  
from sklearn.preprocessing import StandardScaler  
from sklearn.datasets import load_iris

In this example, we’ll be using the iris dataset which can easily be imported using the sklearn API.

iris = load_iris()

X = pd.DataFrame(iris.data, columns=iris.feature_names)  
y = pd.Categorical.from_codes(iris.target, iris.target_names)

The dataset has 4 features. Intuitively, we might predict that there is a strong correlation between sepal length and sepal width, and a strong correlation between petal length and petal width. If that holds, we should be able to reduce the number of dimensions from 4 down to 2.

X.head()
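Rather than relying on intuition alone, we can check the pairwise correlations directly. The snippet below is a small sketch using pandas' `DataFrame.corr`; the `corr` variable name is just an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pearson correlation between every pair of features.
corr = X.corr()
print(corr.round(2))
```

On the copy of the dataset that ships with scikit-learn, the petal measurements should come out strongly correlated, whereas the link between sepal length and sepal width is much weaker, so the intuition above only partly holds.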

As we saw earlier, the variance is calculated by taking the average of the squared distances from the mean. As a consequence, if a feature is on a much larger scale than another, it will have a much higher variance even though its relative dispersion might be smaller. Therefore, it’s imperative that we scale the data. In scaling the data, each feature’s mean is set to 0 and its standard deviation is set to 1.

scaler = StandardScaler()  
X = scaler.fit_transform(X)
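To see what the scaler does, here is a short sketch comparing it with the manual calculation. The names `X_raw`, `X_scaled` and `manual` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_scaled = StandardScaler().fit_transform(X_raw)

# Before scaling, the features have noticeably different variances.
print(X_raw.var(axis=0))

# After scaling, every feature has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0))

# Equivalent manual calculation: subtract the mean and divide by the
# (population) standard deviation of each column.
manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
print(np.allclose(manual, X_scaled))
```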

Next, we’ll use PCA to reduce the number of dimensions from 4 to 2.

pca = PCA(n_components=2)

principal_components = pca.fit_transform(X)

new_X = pd.DataFrame(data = principal_components, columns = ['PC1', 'PC2'])

Let’s take a look at the two new features.

new_X.head()
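Each new feature is a weighted combination of the original (scaled) features. As a sketch, those weights can be read from the fitted model's `components_` attribute; the `loadings` table below is just one illustrative way of presenting them.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2)
pca.fit(X_scaled)

# components_ has one row per principal component and one column per
# original feature; each row is a unit-length direction vector.
loadings = pd.DataFrame(
    pca.components_,
    columns=iris.feature_names,
    index=['PC1', 'PC2'],
)
print(loadings.round(2))
```

Features with large absolute weights in a row contribute the most to that principal component.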

Finally, we can use a scree plot to visualize the percentage of variance explained by each principal component.

per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)

labels = ['PC' + str(x) for x in range(1, len(per_var) + 1)]

plt.bar(x=range(1, len(per_var)+1), height=per_var, tick_label=labels)  
plt.ylabel('percentage of explained variance')  
plt.xlabel('principal component')  
plt.title('scree plot')  
plt.show()
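A complementary way to read the scree plot is to look at the cumulative share of variance retained as components are added. The following is a small sketch that fits PCA with all components kept; `pca_full` and `cumulative` are names introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# With no n_components argument, PCA keeps every component.
pca_full = PCA()
pca_full.fit(X_scaled)

# Running total of the explained variance ratios.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for i, total in enumerate(cumulative, start=1):
    print(f"first {i} component(s): {total:.1%}")
```

This makes it easier to judge how many components are worth keeping for a given amount of retained variance.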

Final Thoughts

To summarize, we take a dataset with multiple dimensions and plot it (although we can’t visualize anything greater than 3 dimensions, the math still works out). Then we calculate the covariance matrix and its eigenvectors and eigenvalues. These tell us the directions along which the data varies the most, and therefore which correlated features can be combined without too great a loss of information. The resulting features can be used to train our model.