K-Fold Cross Validation Example Using Sklearn

May 03, 2019

At the end of the day, machine learning models are used to make predictions on data for which we don’t already have the answer. For example, this could take the form of a recommender system that tries to predict whether a user will like a song or product.

When developing a model, we have to be very cautious not to overfit to our training data. In other words, we have to ensure that the model is capturing the underlying pattern as opposed to simply memorizing the data. Ergo, before using a model in production, it’s imperative that we check how it handles unseen data. This is typically done by splitting the data into two subsets, one for training the model and the other for testing its accuracy.
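As a rough illustration of why the held-out subset matters, here is a minimal sketch on synthetic data. The data, the variable names and the choice of an unconstrained decision tree are all assumptions made for this example. The tree can memorize the training samples, so its training score looks better than its score on data it has never seen.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): a noisy linear trend
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3 * X_demo.ravel() + rng.normal(0, 2, size=200)

# Hold out 20% of the samples; the model never sees them during training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# An unconstrained decision tree is free to memorize the training samples
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)  # R^2 on data the model has seen
test_r2 = tree.score(X_te, y_te)   # R^2 on held-out data
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

The gap between the two scores is the signal to look for: a model that only does well on its own training data has not captured the underlying pattern.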

Certain machine learning algorithms rely on hyperparameters. In essence, a hyperparameter is a variable set by the user that dictates how the algorithm behaves. Some examples of hyperparameters are the step size in gradient descent and alpha in ridge regression. There is no one-size-fits-all value when it comes to hyperparameters. A data scientist must try to determine the optimal hyperparameter values through trial and error. We call this process hyperparameter tuning.
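To make the idea of a hyperparameter concrete, the sketch below fits ridge regression with a few alpha values on synthetic data. The data and the particular alpha values are assumptions chosen for illustration. Alpha is not learned from the data: we set it, and it controls how strongly the fitted coefficient is shrunk toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumed for illustration): y is roughly 2 * x
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 1))
y_demo = 2.0 * X_demo.ravel() + rng.normal(scale=0.5, size=50)

# Fit the same model with different user-chosen alpha values
coefs = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coefs[alpha] = model.coef_[0]
    print(f"alpha={alpha:>8}: coefficient={coefs[alpha]:.4f}")
```

Larger alpha values give a smaller coefficient, so the same algorithm can behave quite differently depending on this one setting.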

Unfortunately, if we constantly use the test set to measure the performance of our model for different hyperparameter values, our model will develop an affinity for the data inside of the test set. In other words, knowledge about the test set can leak into the model and evaluation metrics no longer reflect generalized performance.

To solve this problem, we can break up the data further into three subsets: training, validation and test sets. Training proceeds on the training set, after which evaluation is done on the validation set, and when we are satisfied with the results, the final evaluation can be performed on the test set.
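One way to sketch a three-way split is to call train_test_split twice. The 60/20/20 proportions and the toy arrays below are assumptions for illustration, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 samples with one feature
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# First set aside 20% of all samples as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Then take 25% of the remainder (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Hyperparameters would be compared on the validation set, and the test set would be touched only once, at the very end.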

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for training the model. In addition, the results can depend on a particular random choice of samples. For instance, if we built a model to classify handwritten digits, we could end up with a training set that contained very few samples of the number 7.

A solution to these issues is a procedure called cross-validation. In cross validation, a test set is still put off to the side for final evaluation, but the validation set is no longer needed. There are multiple kinds of cross validation, the most common of which is called k-fold cross validation. In k-fold cross validation, the training set is split into k smaller sets (or folds). The model is then trained using k-1 of the folds and the remaining one is used as the validation set to compute a performance measure such as accuracy. This is repeated k times so that each fold serves as the validation set exactly once, and the k scores are averaged.
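The procedure can be sketched by hand with sklearn’s KFold class. The synthetic data, the choice of k=5, the shuffling and the fixed alpha are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic data (assumed for illustration): a noisy linear trend
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3 * X_demo.ravel() + rng.normal(0, 1, size=100)

# Split the sample indices into 5 folds
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, val_idx in kfold.split(X_demo):
    # Train on k-1 folds ...
    model = Ridge(alpha=1.0).fit(X_demo[train_idx], y_demo[train_idx])
    # ... and validate on the fold that was left out
    pred = model.predict(X_demo[val_idx])
    fold_mse.append(np.mean((y_demo[val_idx] - pred) ** 2))

print("MSE per fold:", np.round(fold_mse, 3))
print("mean MSE:", np.mean(fold_mse))
```

Every sample is used for validation exactly once, so the averaged score depends less on one particular random split.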

Let’s take a look at an example. For the following example, we’ll be using the Boston house prices dataset.

To start, import all the necessary libraries.

# Note: load_boston was removed in scikit-learn 1.2, so this example needs an earlier version
from sklearn.datasets import load_boston
from sklearn.linear_model import RidgeCV  
from sklearn.model_selection import train_test_split  
import numpy as np  
import pandas as pd  
from matplotlib import pyplot as plt

Next, we’ll use sklearn to import the features and labels for our data.

boston = load_boston()

boston_features = pd.DataFrame(boston.data, columns=boston.feature_names)

# Use a single feature, RM (average number of rooms per dwelling), as a 2D column
X = boston_features['RM'].values.reshape(-1, 1)

y = boston.target

We’ll use matplotlib to plot the relationship between the house prices and the average number of rooms per dwelling.

plt.scatter(X, y)
plt.title('Boston house prices')
plt.xlabel('average number of rooms per dwelling')  
plt.ylabel('house prices')  
plt.show()

As mentioned previously, we’ll want to put a portion of the data aside for the final evaluation.

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

We’ll use cross validation to determine the optimal alpha value. By default, the ridge regression cross validation class (RidgeCV) uses an efficient form of the Leave One Out strategy, which is the special case of k-fold where k equals the number of training samples. We can compare the performance of our model with different alpha values by taking a look at the mean squared error.

regressor = RidgeCV(alphas=[1, 1e3, 1e6], store_cv_values=True)

regressor.fit(train_X, train_y)

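# cv_values_ holds the leave-one-out squared errors: one row per sample, one column per alpha
cv_mse = np.mean(regressor.cv_values_, axis=0)

# alphas is stored on the regressor; a bare name `alphas` is not defined here
print(regressor.alphas)
print(cv_mse)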
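To see how Leave One Out relates to k-fold, the small sketch below compares the splits produced by sklearn’s LeaveOneOut with those of KFold when the number of folds equals the number of samples. The six-sample toy array is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data (assumed for illustration): six samples
X_demo = np.arange(6).reshape(-1, 1)

# Each split is a (training indices, validation indices) pair
loo_splits = [
    (tr.tolist(), va.tolist()) for tr, va in LeaveOneOut().split(X_demo)
]
kfold_splits = [
    (tr.tolist(), va.tolist())
    for tr, va in KFold(n_splits=len(X_demo)).split(X_demo)
]

for tr, va in loo_splits:
    print("train:", tr, "validate:", va)

print("same splits:", loo_splits == kfold_splits)
```

Each sample is held out once while the model trains on all the others, which is why the strategy can be described as k-fold with one sample per fold.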

The RidgeCV class will automatically select the best alpha value. We can view it by accessing the following attribute.

# Best alpha  
print(regressor.alpha_)

We can use the model to predict the house prices for the test set.

predict_y = regressor.predict(test_X)
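Besides plotting, the held-out predictions can be summarized with a single number. The sketch below is self-contained and uses synthetic data in place of the Boston dataset, so the data and the demo_ variable names are assumptions. It fits RidgeCV on a training split and reports the mean squared error on the test split using sklearn’s mean_squared_error.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(3, 9, size=(200, 1))
y_demo = 9 * X_demo.ravel() - 34 + rng.normal(0, 5, size=200)

demo_train_X, demo_test_X, demo_train_y, demo_test_y = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Cross validation on the training split picks alpha from the candidates
demo_regressor = RidgeCV(alphas=[1, 1e3, 1e6]).fit(demo_train_X, demo_train_y)

# Final evaluation on data that played no part in choosing alpha
demo_predict_y = demo_regressor.predict(demo_test_X)
test_mse = mean_squared_error(demo_test_y, demo_predict_y)

print("chosen alpha:", demo_regressor.alpha_)
print("test MSE:", test_mse)
```

Because the test split was not used while selecting alpha, this error is an estimate of how the model would do on new data.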

Finally, we plot the data in the test set and the line determined during the training phase.

plt.scatter(test_X, test_y)
plt.plot(test_X, predict_y, color='red')
plt.title('Boston house prices')
plt.xlabel('average number of rooms per dwelling')  
plt.ylabel('house prices')  
plt.show()