Logistic Regression In Python

August 03, 2019

Despite the word Regression in its name, Logistic Regression is a supervised machine learning algorithm used for binary classification. I say binary because, in its basic form, Logistic Regression can only categorize data into two distinct classes. At a high level, Logistic Regression fits a line to a dataset and then returns the probability that a new sample belongs to one of the two classes according to its location with respect to the line.

Odds vs Probability

Before diving into the nitty-gritty of Logistic Regression, it’s important that we understand the difference between probability and odds. Odds are calculated by taking the number of events where something happened and dividing by the number of events where that same something didn’t happen. For example, if the odds of winning a game are 5 to 2, we calculate the ratio as 5/2=2.5. On the other hand, probability is calculated by taking the number of events where something happened and dividing by the total number of events (including events when that same something did and didn’t happen). For example, the probability of winning a game with the same odds is 5/(5+2)=0.714.
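
As a quick check of the arithmetic above, here is a minimal sketch. The function names odds and probability are our own, chosen only for illustration.

```python
def odds(wins, losses):
    """Odds: events where it happened divided by events where it didn't."""
    return wins / losses


def probability(wins, losses):
    """Probability: events where it happened divided by all events."""
    return wins / (wins + losses)


game_odds = odds(5, 2)
game_probability = probability(5, 2)

print(game_odds)                    # 2.5
print(round(game_probability, 3))   # 0.714
```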

One important distinction between odds and probabilities, which will come into play when we train the model, is that probabilities range from 0 to 1 whereas the log of the odds can range from negative to positive infinity. We take the log of the odds because raw odds are asymmetric. When we calculate the odds of some event occurring (i.e. winning a game), if the denominator is larger than the numerator, the odds will range from 0 to 1. However, when the numerator is larger than the denominator, the odds will range from 1 to infinity. For example, suppose that we compare the odds of winning a game for two different teams. Team A is composed of all-stars, so their odds of winning a game are 5 to 1.

On the other hand, the odds of Team B winning a game are 1 to 5.

In taking the log of the odds, the distance from the origin (0) is the same for both teams.
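
The symmetry can be checked numerically. This sketch assumes the natural logarithm (the choice of base only rescales the result) and uses variable names of our own.

```python
import math

odds_team_a = 5 / 1   # Team A: 5 to 1
odds_team_b = 1 / 5   # Team B: 1 to 5

# The raw odds sit at very different distances from 1 ...
print(odds_team_a, odds_team_b)   # 5.0 0.2

# ... but their logs are the same distance from 0, with opposite signs.
log_odds_a = math.log(odds_team_a)
log_odds_b = math.log(odds_team_b)
print(round(log_odds_a, 3))   # 1.609
print(round(log_odds_b, 3))   # -1.609
```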

We can go from probability to odds by dividing the probability that an event occurs by the probability that it doesn’t occur.

We write the general formula as follows: odds = p / (1 - p), where p is the probability that the event occurs.

As we’re about to see, we need to go back and forth between probabilities and odds when determining the optimal fit for our model.

Algorithm

Suppose we wanted to build a Logistic Regression model to predict whether a student would pass or fail given certain variables such as the number of hours studied. To be exact, we want a model that outputs the probability (a number between 0 and 1) that a student passes. A value of 1 implies that the student is guaranteed to pass whereas a value of 0 implies that the student will fail.

In mathematics, we call the following equation a Sigmoid function.

p = 1 / (1 + e^(-y))

Where y is the equation for a line.

No matter what value we have for y, a Sigmoid function ranges from 0 to 1. For instance, when y tends towards negative infinity, the probability approaches zero.

When y tends towards positive infinity, the probability approaches one.
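
To see this behaviour concretely, here is a small sketch that assumes the standard definition of the Sigmoid function, 1 / (1 + e^(-y)). The helper name sigmoid and the sample values are ours.

```python
import numpy as np


def sigmoid(y):
    """Map any real number y to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-y))


# A few values of y, from very negative to very positive.
y_values = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probabilities = sigmoid(y_values)

# Values near 0 for very negative y, 0.5 at y = 0, near 1 for very positive y.
print(np.round(probabilities, 4))
```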

In Logistic Regression, we use the Sigmoid function to describe the probability that a sample belongs to one of the two classes. The shape of the Sigmoid function determines the probabilities predicted by our model. When we train our model, we are in fact attempting to select the Sigmoid function whose shape best fits our data. The actual way we go about choosing the optimal curve involves lots of math. As we saw in Linear Regression, we can use Gradient Descent or some other technique to converge towards a solution. However, reasoning about the fit of an S-shaped curve is less intuitive than reasoning about the fit of a straight line.

What if we could optimize the equation of a line instead?

As we mentioned previously, we can go from probabilities (a function that ranges from 0 to 1) to log(odds) (a function that ranges from negative to positive infinity).

For example, suppose that the probability that a student passes is 0.8 or 80%. We can find the corresponding position on the y-axis of the new graph by dividing the probability that they pass by the probability that they fail and then taking the log of the result.
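
A minimal sketch of that conversion, assuming the natural logarithm. The helper names log_odds and to_probability are ours; the second one simply applies the Sigmoid function to go back the other way.

```python
import math


def log_odds(p):
    """Convert a probability p (strictly between 0 and 1) to log(odds)."""
    return math.log(p / (1 - p))


def to_probability(z):
    """Convert log(odds) z back to a probability with the Sigmoid function."""
    return 1 / (1 + math.exp(-z))


pass_probability = 0.8
z = log_odds(pass_probability)       # log(0.8 / 0.2) = log(4)

print(round(z, 3))                   # 1.386
print(round(to_probability(z), 3))   # 0.8
```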

We would then repeat the process for each data point.

Once we’ve plotted every data point on the new y-axis, just like Linear Regression, we can use an optimizer to determine the y-intercept and slope of the best fitting line.

In this example, we’ll cover how to optimize the function using maximum likelihood.

First, we generate a candidate line, and then project the original data points onto it.

We look at the y value of each data point along the line and convert it from the log of the odds to a probability.

After repeating the process for each data point, we end up with the following function.

The likelihood that a student passes is the value on the y-axis at that point along the curve. The likelihood of observing the students who passed, given the shape of the Sigmoid, is the product of the likelihoods of each of those students passing individually.

Next, we extend the equation for the overall likelihood to include the students who did not pass. Since the Sigmoid function represents the probability that a student passes, the likelihood that a student fails is 1 (the total probability) minus the y value at that point along the curve.

You’ll typically see the log of the likelihood being used instead. The log of a product of two numbers is equal to the sum of their logs, so the product of individual likelihoods becomes a sum of their logs.

We end up with the following likelihood.

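We then repeat the entire process for a different line and compare the likelihoods. We choose the line with the maximum likelihood. Because each individual likelihood is a probability between 0 and 1, the log of the likelihood is negative, so the maximum is the value closest to zero.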
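
The comparison of candidate lines can be sketched as follows. This is only an illustration under stated assumptions: the toy data (hours studied, pass/fail), the two candidate lines and the helper name log_likelihood are all made up, and a real optimizer would search over many lines rather than two.

```python
import numpy as np


def log_likelihood(slope, intercept, hours, passed):
    """Log of the likelihood of the pass/fail labels under a candidate line."""
    log_odds = slope * hours + intercept        # candidate line in log(odds) space
    p_pass = 1.0 / (1.0 + np.exp(-log_odds))    # log(odds) -> probability (Sigmoid)
    # Students who passed contribute log(p); students who failed, log(1 - p).
    return np.sum(passed * np.log(p_pass) + (1 - passed) * np.log(1 - p_pass))


# Hypothetical toy data: hours studied and whether the student passed (1) or failed (0).
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Two hypothetical candidate lines as (slope, intercept) pairs.
candidates = [(0.5, -1.0), (2.0, -4.5)]
scores = [log_likelihood(m, b, hours, passed) for m, b in candidates]

# Keep the candidate whose log-likelihood is largest (closest to zero).
best_slope, best_intercept = candidates[int(np.argmax(scores))]
print(scores)
print(best_slope, best_intercept)
```

Both scores are negative, and the steeper line fits this toy data better because it assigns higher probabilities to the students who actually passed.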

Code

Let’s take a look at how we could go about implementing Logistic Regression in Python. To begin, import the following libraries.

import pandas as pd  
import seaborn as sns  
from matplotlib import pyplot as plt  
from sklearn.datasets import make_classification  
from sklearn.linear_model import LogisticRegression  
from sklearn.metrics import confusion_matrix  
from sklearn.model_selection import train_test_split  
sns.set()  # apply seaborn's default plot styling

Next, we’ll take advantage of the make_classification function from the scikit-learn library to generate data. As we mentioned previously, we’re using Logistic Regression for binary classification. Thus, the data points are composed of two classes.

x, y = make_classification(  
    n_samples=100,  
    n_features=1,  
    n_classes=2,  
    n_clusters_per_class=1,  
    flip_y=0.03,  
    n_informative=1,  
    n_redundant=0,  
    n_repeated=0  
)

We plot the relationship between the feature and classes.

plt.scatter(x, y, c=y, cmap='rainbow')

Prior to training our model, we’ll set aside a portion of our data in order to evaluate its performance.

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
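
When no test_size is given, train_test_split holds out 25% of the samples by default. The sketch below checks the resulting shapes on freshly generated data; the random_state passed to make_classification is our own addition for repeatability, so the data will differ from the data used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same kind of data as above; random_state=0 is an assumption added for repeatability.
x, y = make_classification(
    n_samples=100,
    n_features=1,
    n_informative=1,
    n_redundant=0,
    n_repeated=0,
    n_clusters_per_class=1,
    random_state=0,
)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

print(x_train.shape, x_test.shape)   # (75, 1) (25, 1)
```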

We create an instance of the LogisticRegression class and call the fit function with the features and the labels (since Logistic Regression is a supervised machine learning algorithm) as arguments.

lr = LogisticRegression()

lr.fit(x_train, y_train)

We can access the following properties to view the slope (coefficient) and y-intercept of the best fitting line.

print(lr.coef_)  
print(lr.intercept_)

Let’s see how the model performs against data that it hasn’t been trained on.

y_pred = lr.predict(x_test)

Since this is a classification problem, we use a confusion matrix to evaluate the accuracy of our model.

confusion_matrix(y_test, y_pred)

From our confusion matrix we conclude that:

True positive: 7 (We predicted a positive result and it was positive)
True negative: 12 (We predicted a negative result and it was negative)
False positive: 4 (We predicted a positive result and it was negative)
False negative: 2 (We predicted a negative result and it was positive)
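
The four counts can also be read off programmatically. In the sketch below, the label arrays are reconstructed by hand purely to reproduce the counts listed above (they are not the actual test data), and y_hat is a name of our choosing.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels built to match the counts above: 7 TP, 12 TN, 4 FP, 2 FN.
y_true = np.array([1] * 7 + [0] * 12 + [0] * 4 + [1] * 2)
y_hat = np.array([1] * 7 + [0] * 12 + [1] * 4 + [0] * 2)

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Accuracy is the share of predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)   # 12 4 2 7
print(accuracy)         # 0.76
```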

If for whatever reason we’d like to check the actual probability that a data point belongs to a given class, we can use the predict_proba function.

lr.predict_proba(x_test)

The first column corresponds to the probability that the sample belongs to the first class and the second column corresponds to the probability that the sample belongs to the second class.
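
For a two-class problem, these probabilities tie back to the Sigmoid function: the second column should equal the Sigmoid of slope * x + intercept. The sketch below checks this on freshly generated data (the random_state is our addition, so the data differs from the data used above) and prints classes_ to show which class each column refers to.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative data; random_state=0 is an assumption added for repeatability.
x, y = make_classification(
    n_samples=100,
    n_features=1,
    n_informative=1,
    n_redundant=0,
    n_repeated=0,
    n_clusters_per_class=1,
    random_state=0,
)
lr = LogisticRegression().fit(x, y)

proba = lr.predict_proba(x)

# Recompute the second column by hand: Sigmoid of (slope * x + intercept).
manual = expit(x[:, 0] * lr.coef_[0][0] + lr.intercept_[0])

print(lr.classes_)                          # column order of predict_proba
print(np.allclose(proba[:, 1], manual))     # True
print(np.allclose(proba.sum(axis=1), 1.0))  # True: each row sums to 1
```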

Before attempting to plot the Sigmoid function, we create and sort a DataFrame containing our test data.

df = pd.DataFrame({'x': x_test[:,0], 'y': y_test})  
df = df.sort_values(by='x')

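We then use expit, SciPy's implementation of the Sigmoid function, to turn the fitted line into predicted probabilities.

from scipy.special import expit

sigmoid_function = expit(df['x'].to_numpy() * lr.coef_[0][0] + lr.intercept_[0])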

plt.plot(df['x'], sigmoid_function)

plt.scatter(df['x'], df['y'], c=df['y'], cmap='rainbow', edgecolors='b')