Linear Regression In Python
So you’ve decided to learn about machine learning. Whether you’re doing it for career reasons or strictly out of curiosity, you’ve come to the right place. In this article, we’ll take a look at the “Hello World” of machine learning: linear regression.
In the context of machine learning, when people speak of a model, they are referring to the function used to predict a dependent variable (y) given a set of inputs (x1, x2, x3…). In the case of linear regression with a single input, the model takes the form y = wx + b, where w is the slope and b is the y-intercept.
Suppose we plotted the relationship between the number of hours studied and the grade on the final exam.
Next, say we arbitrarily picked y = 3x + 2 for our model. If we were to draw the corresponding line, it might look as follows.
The goal of linear regression is to find the best-fitting line, defined as the line with the minimum possible loss. One of the most commonly used loss functions is the Mean Square Error (MSE). The equation for it is written as follows.
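In its standard form, with n as the number of samples, y_i as the actual value and \hat{y}_i as the value predicted by the line (notation assumed here), the MSE reads:

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```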
We calculate the distance from the line to a given data point by subtracting the predicted value from the actual value. We square the error because we don’t want the errors of predictions below the actual values to cancel out those above the actual values. In other words, we want to remove the sign. You’ll also see people using the Mean Absolute Error (MAE). Either works for comparing models as long as you stick with the same one throughout, though MSE penalizes large errors more heavily. We sum the squared distances to get the total error across the entire dataset. Then we divide by the total number of samples because, when you compare models, a dataset with more samples would otherwise tend to have a higher total error simply because it has more samples.
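To make the cancellation argument concrete, here is a minimal sketch with hypothetical grades. The function names and the numbers are made up for illustration and are separate from the implementation later in the article.

```python
def mean_signed_error(y_actual, y_pred):
    """Average of the raw differences; positive and negative errors cancel."""
    return sum(a - p for a, p in zip(y_actual, y_pred)) / len(y_actual)


def mean_absolute_error(y_actual, y_pred):
    """Average of the absolute differences (MAE)."""
    return sum(abs(a - p) for a, p in zip(y_actual, y_pred)) / len(y_actual)


def mean_squared_error(y_actual, y_pred):
    """Average of the squared differences (MSE)."""
    return sum((a - p) ** 2 for a, p in zip(y_actual, y_pred)) / len(y_actual)


# Hypothetical grades: two predictions are off by 2, in opposite directions.
y_actual = [70, 80, 90, 60]
y_pred = [72, 78, 90, 60]

signed = mean_signed_error(y_actual, y_pred)     # the two errors cancel out
mae = mean_absolute_error(y_actual, y_pred)      # (2 + 2) / 4
mse_value = mean_squared_error(y_actual, y_pred)  # (4 + 4) / 4
print(signed, mae, mse_value)
```

The signed average reports no error at all even though two predictions are wrong, which is why we square (or take the absolute value of) each difference first.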
In the following image, the distance from the line to each point is drawn as a red arrow.
If we drew a plot describing the relationship between the slope w and the total loss, it would look as follows.
It takes on the shape of a parabola because the loss is a quadratic function of w: as the slope w approaches either positive or negative infinity, the loss tends towards infinity.
The slope for the best fitting line will be equal to the value of w when the loss is at a minimum.
Rather than picking a value for the slope by eye (i.e. looking at the graph and taking an educated guess), we can use the gradient descent algorithm to converge towards the global minimum.
The gradient is a well-known concept in calculus. It is a vector of partial derivatives that always points in the direction of steepest ascent (the direction of the greatest increase in f(x)).
However, in our case we’re looking to go in the direction that minimizes our loss function, thus the “Descent” in “Gradient Descent”.
To determine the next point along the loss function curve, the gradient descent algorithm moves from the starting point in the direction opposite to the gradient, by some fraction of the gradient’s magnitude, as shown in the following figure.
The magnitude of the gradient is multiplied by something called the learning rate. The learning rate determines how big a step the algorithm takes.
For example, if the magnitude of the gradient (partial derivative) is 10 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.1 away from the previous point.
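The arithmetic in that example can be sketched in a few lines. The gradient and learning rate come from the text, while the starting value of w is a hypothetical number picked only for illustration.

```python
gradient = 10.0        # value taken from the example in the text
learning_rate = 0.01   # value taken from the example in the text

# The size of the step is the learning rate times the gradient's magnitude.
step_size = learning_rate * abs(gradient)

# Hypothetical starting point, chosen only for illustration.
current_w = 2.0

# Move against the gradient so that the loss decreases.
next_w = current_w - learning_rate * gradient

print(step_size, next_w)
```

Because the gradient is positive here, the loss increases as w increases, so the update moves w to a smaller value.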
If we’re working with a substantial amount of data and select a learning rate that is too small, it can take a long time to train our model. On the surface, that may not sound so bad, but a model that takes several days to train is very difficult to tune and experiment with.
In the event the learning rate is too large, we can overshoot the global minimum and fail to converge.
For every regression problem, there’s a Goldilocks learning rate that’s neither too big nor too small. Part of a data scientist’s job is to tune the hyperparameters (e.g. the learning rate and the number of iterations); otherwise, the learning process can take too long or produce poor results.
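The effect of the learning rate is easy to see on a toy problem. The sketch below is not the regression loss itself: it assumes a made-up one-dimensional loss, (w - 3) ** 2, whose minimum we know is at w = 3, and the three learning rates are arbitrary picks for illustration.

```python
def gradient_descent_1d(learning_rate, iterations, start=0.0):
    """Minimize the toy loss (w - 3) ** 2, whose minimum is at w = 3."""
    w = start
    for _ in range(iterations):
        gradient = 2 * (w - 3)           # derivative of (w - 3) ** 2
        w = w - learning_rate * gradient
    return w


# Same number of iterations, three different learning rates.
too_small = gradient_descent_1d(learning_rate=0.001, iterations=50)
reasonable = gradient_descent_1d(learning_rate=0.1, iterations=50)
too_large = gradient_descent_1d(learning_rate=1.1, iterations=50)

print(too_small, reasonable, too_large)
```

With the smallest rate, w has barely moved away from its starting point after 50 steps. With the middle rate, it lands very close to 3. With the largest rate, every step overshoots the minimum by more than the previous one, so w moves further and further away.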
Let’s have a look at the math underlying gradient descent. I promise it’s not that bad. First, we replace the predicted value in the loss function with the equation for a line.
Then we calculate the partial derivative with respect to the slope and the y intercept.
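Written out with the notation assumed here (n samples, slope w, intercept b), and consistent with the Python implementation later in the article, the substituted loss and its two partial derivatives are:

```latex
L(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (w x_i + b) \right)^2

\frac{\partial L}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - (w x_i + b) \right)

\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - (w x_i + b) \right)
```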
We repeat the process at every iteration in order to determine the new value for each variable. As we mentioned previously, we multiply the gradient by a constant called the learning rate. It’s worth mentioning that the equations are written individually here, but more often than not you’ll see them written as a single vector equation.
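Using \alpha for the learning rate (a symbol assumed here), the per-variable updates and an equivalent vector form can be sketched as:

```latex
w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \alpha \frac{\partial L}{\partial b}

\begin{bmatrix} w \\ b \end{bmatrix}
\leftarrow
\begin{bmatrix} w \\ b \end{bmatrix}
- \alpha \, \nabla L(w, b),
\qquad
\nabla L(w, b) =
\begin{bmatrix} \partial L / \partial w \\ \partial L / \partial b \end{bmatrix}
```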
After having trained our linear regression model (y = wx + b), we obtain the best fitting line for our data set.
Then, we can use our model to make predictions as to the grade of a student given the number of hours they studied.
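Making a prediction is just a matter of evaluating the line. Here is a minimal sketch in which the function name, the slope and the intercept are all hypothetical. A real model would get its parameters from training.

```python
def predict_grade(hours_studied, slope, intercept):
    """Evaluate the fitted line y = w * x + b for one input value."""
    return slope * hours_studied + intercept


# Hypothetical parameters, chosen only for illustration.
example_slope = 5.0
example_intercept = 40.0

predicted = predict_grade(6, example_slope, example_intercept)
print(predicted)  # 5.0 * 6 + 40.0 = 70.0
```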
Let’s see how we could go about implementing linear regression from scratch using Python. To start, import the following libraries.
from sklearn.datasets import make_regression
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns

sns.set()
We can use the scikit-learn library to generate sample data that is well suited for regression.
x, y = make_regression(n_samples=50, n_features=1, n_informative=1, n_targets=1, noise=5)
You’ll often want some kind of benchmark to measure the performance of your model. For regression problems, we typically use the mean of the target values.
starting_slope = 0
starting_intercept = float(sum(y)) / len(y)
We use matplotlib to plot our data as well as the mean.
plt.scatter(x, y)
plt.plot(x, starting_slope * x + starting_intercept, c='red')
Next, we write a function to calculate the mean square error.
def mse(y_actual, y_pred):
    error = 0
    for y, y_prime in zip(y_actual, y_pred):
        error += (y - y_prime) ** 2
    # Divide by the number of samples to turn the sum into a mean.
    return error / len(y_actual)
The higher the mean square error, the worse the model. Think of the arrows drawn from the line to each individual data point.
mse(y, starting_slope * x + starting_intercept)
Next, let’s see how we could go about implementing gradient descent from scratch.
We create a function to calculate the partial derivatives with respect to the slope and intercept.
def calculate_partial_derivatives(x, y, intercept, slope):
    partial_derivative_slope = 0
    partial_derivative_intercept = 0
    n = len(x)

    for i in range(n):
        xi = x[i]
        yi = y[i]

        # Accumulate each sample's contribution to the two partial derivatives.
        partial_derivative_intercept += - (2/n) * (yi - ((slope * xi) + intercept))
        partial_derivative_slope += - (2/n) * xi * (yi - ((slope * xi) + intercept))

    return partial_derivative_intercept, partial_derivative_slope
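One way to gain confidence in these formulas is to compare them against a numerical approximation. The following self-contained sketch does that with a central difference. The function names, the tiny data set and the step size eps are all assumptions made for this check and are not part of the implementation above.

```python
def loss(x, y, intercept, slope):
    """Mean square error of the line y = slope * x + intercept."""
    n = len(x)
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)) / n


def analytic_gradient(x, y, intercept, slope):
    """Partial derivatives of the loss with respect to intercept and slope."""
    n = len(x)
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    d_intercept = -(2 / n) * sum(residuals)
    d_slope = -(2 / n) * sum(xi * r for xi, r in zip(x, residuals))
    return d_intercept, d_slope


def numerical_gradient(x, y, intercept, slope, eps=1e-6):
    """Central-difference approximation of the same two derivatives."""
    d_intercept = (loss(x, y, intercept + eps, slope)
                   - loss(x, y, intercept - eps, slope)) / (2 * eps)
    d_slope = (loss(x, y, intercept, slope + eps)
               - loss(x, y, intercept, slope - eps)) / (2 * eps)
    return d_intercept, d_slope


# Small hand-made data set, used only for this check.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.5, 9.0]

analytic = analytic_gradient(xs, ys, intercept=0.5, slope=1.0)
numeric = numerical_gradient(xs, ys, intercept=0.5, slope=1.0)
print(analytic, numeric)
```

If the two pairs of numbers agree to several decimal places, the derivative formulas are very likely correct.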
Then, we define a function which will iteratively improve our model by taking small steps towards the solution.
def train(x, y, learning_rate, iterations, intercept, slope):
    for i in range(iterations):
        partial_derivative_intercept, partial_derivative_slope = calculate_partial_derivatives(x, y, intercept, slope)
        intercept = intercept - (learning_rate * partial_derivative_intercept)
        slope = slope - (learning_rate * partial_derivative_slope)

    return intercept, slope
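As a quick sanity check of this kind of update loop, we can run it on data where we already know the answer. The sketch below is a self-contained, pure-Python variant under stated assumptions: the names partial_derivatives and fit_line are hypothetical, the data is generated without noise from y = 3x + 2, and the learning rate and iteration count are arbitrary picks that happen to work for this data.

```python
def partial_derivatives(x, y, intercept, slope):
    """Partial derivatives of the MSE with respect to intercept and slope."""
    n = len(x)
    d_intercept = 0.0
    d_slope = 0.0
    for xi, yi in zip(x, y):
        residual = yi - (slope * xi + intercept)
        d_intercept += -(2 / n) * residual
        d_slope += -(2 / n) * xi * residual
    return d_intercept, d_slope


def fit_line(x, y, learning_rate, iterations, intercept=0.0, slope=0.0):
    """Repeatedly step against the gradient, starting from the given line."""
    for _ in range(iterations):
        d_intercept, d_slope = partial_derivatives(x, y, intercept, slope)
        intercept -= learning_rate * d_intercept
        slope -= learning_rate * d_slope
    return intercept, slope


# Noise-free points on the line y = 3x + 2, so the true answer is known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3 * xi + 2 for xi in xs]

fitted_intercept, fitted_slope = fit_line(xs, ys, learning_rate=0.05, iterations=1000)
print(round(fitted_intercept, 4), round(fitted_slope, 4))
```

The recovered intercept and slope should come out very close to 2 and 3, which suggests the update rule is doing what we expect.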
We pick the learning rate and the number of iterations arbitrarily. For extra practice, I suggest that you try substituting some other values and see how the model behaves.
learning_rate = 0.01
iterations = 300
Next, we train the model and obtain the values for the intercept and slope of the best fitting line.
intercept, slope = train(x, y, learning_rate, iterations, starting_intercept, starting_slope)
We use a list comprehension to obtain the corresponding y value for each x value along our line.
linear_regression_line = [slope * xi + intercept for xi in x]
We plot our data to see how we did.
plt.scatter(x, y)
plt.plot(x, linear_regression_line, c='red')
It looks much better than the mean. However, it helps to use something a little more concrete when comparing models, so we compute the mean square error of the fitted line.
mse(y, linear_regression_line)
As we can see, the mean square error is much lower than that produced by the mean.
Rather than implement the gradient descent algorithm from scratch every time, we can use the predefined classes made available to us by the scikit-learn library. First, we create an instance of the LinearRegression class.
lr = LinearRegression()
Then, we train the model by calling the fit method.
lr.fit(x, y)
To obtain the values along the line, we call the predict method.
y_pred = lr.predict(x)
As we can see, we get more or less the same result.
plt.scatter(x, y)
plt.plot(x, y_pred, c='red')
The mean square error is slightly better than that of our implementation. I encourage you to check out the source code to find out why.