Introduction To Machine Learning: Reducing Loss With Gradient Descent
In the following article, we’ll look at how to train a machine learning model — in other words, how to minimize loss.
In the context of machine learning, when people speak about a model, they are referring to the function used to predict a dependent variable (y) given a set of inputs (x1, x2, x3…). In the case of simple linear regression, the model takes the form y = wx + b. If you remember the slope-intercept form y = mx + b from algebra, note that in machine learning we write w (for weight) instead of m.
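To make this concrete, here is a minimal sketch of the simple linear regression model as a Python function (the weight and bias values are just illustrative):

```python
# A linear model is just a function of its input: for simple linear
# regression it has one weight (slope) and one bias (intercept).
def predict(x, w, b):
    """Return the model's prediction y = w*x + b."""
    return w * x + b

# With w = 3 and b = 2, studying for 4 hours predicts a grade of 14.
print(predict(4, w=3, b=2))  # → 14
```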
Suppose we plotted the relationship between the number of hours studied and the grade on the final exam.
Say we arbitrarily picked y = 3x + 2 for our model. If we were to draw the corresponding line, it might look as follows.
The goal of linear regression is to find the best fitting line — the line with the minimum possible loss. Loss measures how far the line’s predictions are from the actual data points. In the following image, the loss is drawn as red arrows.
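A common way to turn those distances into a single number is the mean squared error: average the squared vertical gaps between the line and the data points. A minimal sketch, using made-up hours/grades data:

```python
def mse_loss(xs, ys, w, b):
    """Mean squared error between predictions w*x + b and targets y."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

# Made-up data: hours studied vs. final exam grade.
hours = [1, 2, 3, 4]
grades = [52, 60, 71, 79]

# The loss for the arbitrary model y = 3x + 2 on this data.
print(mse_loss(hours, grades, w=3, b=2))
```

Squaring the distances penalizes large errors more heavily and keeps the loss positive regardless of whether the line is above or below a point.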
If we graphed w against the loss, it would look as follows.
As w approaches infinity, the loss tends towards infinity.
As w approaches negative infinity, the loss tends towards infinity.
The slope for the best fitting line will be equal to the value of w when the loss is at a minimum.
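As a quick illustration of that curve, we can sweep w over a range and confirm the loss bottoms out at the true slope (here, made-up data generated by y = 2x, with b fixed at 0):

```python
# Evaluate the loss at many candidate slopes and keep the best one.
def loss(w, xs, ys, b=0.0):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data generated by y = 2x, so the loss is minimized at w = 2.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
best_w = min((i / 10 for i in range(0, 51)), key=lambda w: loss(w, xs, ys))
print(best_w)  # → 2.0
```

This brute-force sweep works for one weight, but it scales poorly — which is exactly why the article turns to gradient descent next.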
Rather than picking values for the weight at random, we can make use of the gradient descent algorithm to converge towards the global minimum.
The gradient is a well-known concept in calculus. The gradient always points in the direction of the greatest increase of f(x), that is, the direction of steepest ascent.
In our case, the gradient always points in the direction of steepest increase in loss. Ergo, the gradient descent algorithm takes a step in the opposite direction to reduce loss as quickly as possible.
To determine the next point along the loss function curve, the gradient descent algorithm moves the starting point by some fraction of the gradient’s magnitude, in the direction opposite the gradient, as shown in the following figure.
The magnitude of the gradient is multiplied by something called the learning rate. The learning rate determines how big of a step the algorithm takes in the next iteration.
For example, if the gradient magnitude is 10 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.1 away from the previous point.
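The update rule itself is one line of arithmetic. A minimal sketch (the starting weight of 5.0 is just illustrative):

```python
# One gradient descent update: move the weight a small step
# against the gradient, scaled by the learning rate.
def step(w, gradient, learning_rate):
    return w - learning_rate * gradient

# A gradient of magnitude 10 with a learning rate of 0.01
# moves the weight by 0.1 — here, from 5.0 down to about 4.9.
print(step(5.0, 10.0, 0.01))
```

Note the minus sign: since the gradient points uphill, we subtract to step downhill toward lower loss.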
For every regression problem, there’s a Goldilocks learning rate that’s neither too big nor too small. If the learning rate is too small, training will take too long; if it’s too big, the algorithm may overshoot the minimum and fail to converge. Part of a data scientist’s job is to tune hyperparameters like the learning rate.
After training, our model (y = wx + b) becomes the best fitting line for our data set.
We can then use our model to make predictions as to the grade of a student given the number of hours they studied.
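Putting the pieces together, here is a minimal sketch of training simple linear regression by gradient descent, assuming a mean-squared-error loss and made-up hours/grades data:

```python
# Fit y = w*x + b by gradient descent on the mean squared error.
def train(xs, ys, learning_rate=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Analytic gradients of the mean squared error
        # with respect to the weight and the bias.
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

# Made-up data following grade = 10 * hours + 40.
hours = [1, 2, 3, 4, 5]
grades = [50, 60, 70, 80, 90]
w, b = train(hours, grades)
print(round(w * 6 + b))  # predicted grade after 6 hours of study
```

After training, predicting is just plugging a new x into the fitted line.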
In the real world, you might end up working with massive sets of data. In that case, the gradient can take a very long time to compute. Not to mention, a large data set with randomly sampled examples probably contains redundant data. In other words, enormous batches don’t necessarily carry much more predictive value than large batches. Therefore, rather than using the entire data set to calculate the gradient, we can use a subset called a batch.
Stochastic gradient descent is gradient descent with a batch size of 1, whereas mini-batch stochastic gradient descent uses a batch size typically between 10 and 1,000 observations.
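A minimal sketch of estimating the gradient from a random mini-batch, again assuming a mean-squared-error loss (the function name is illustrative):

```python
import random

# Estimate the gradient of the mean squared error from a random
# mini-batch of examples instead of the full data set.
def minibatch_gradient(xs, ys, w, b, batch_size):
    batch = random.sample(range(len(xs)), batch_size)
    dw = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / batch_size
    db = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / batch_size
    return dw, db
```

With batch_size = 1 this is stochastic gradient descent; with batch_size equal to the full data set it recovers ordinary (batch) gradient descent. Each mini-batch gradient is a noisy but much cheaper estimate of the full gradient, and the noise averages out over many steps.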