Machine Learning Algorithms Part 10: Logistic Regression Example In Python

December 30, 2018


Logistic Regression is a supervised machine learning algorithm used to classify data. For example, suppose we want to predict, from a customer’s income, whether they will buy a product. In other words, we want to sort customers into two categories: those we expect to purchase the product and those we do not.

By fitting a linear function to our data, we can express the likelihood of a customer making a purchase. With a threshold of 0.5 (or 50%), a customer whose predicted probability is above the threshold is classified as a buyer, and everyone else as a non-buyer.

The probability of an event can never be below 0 or exceed 1 (or 100%). We therefore pass our linear function through a sigmoid function, which has horizontal asymptotes at 0 and 1.
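As a minimal sketch of the idea, the snippet below passes a linear function through a sigmoid and applies the 0.5 threshold. The weight `w`, the bias `b`, and the scaled income values are hypothetical numbers chosen only for illustration; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight and bias for a single scaled feature (illustrative only)
w, b = 1.5, -0.5
income_scaled = np.array([-2.0, 0.0, 2.0])

# Linear function of the feature, squashed into a probability
probabilities = sigmoid(w * income_scaled + b)

# Apply the 0.5 threshold: 1 = predicted to buy, 0 = predicted not to buy
predictions = (probabilities >= 0.5).astype(int)
```

Only the last customer ends up above the threshold, because only there is the linear term positive.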

Some pros and cons of Logistic Regression

Pros:

Simple and efficient
Low variance
Models can be updated

Cons:

Doesn’t handle a large number of categorical variables well
Requires transformation of non-linear features

Code

Let’s take a look at how we could go about classifying data using Logistic Regression in Python.

import pandas as pd  
import numpy as np  
from matplotlib import pyplot as plt  
from sklearn.model_selection import train_test_split  
from sklearn.linear_model import LogisticRegression  
from sklearn.preprocessing import LabelEncoder  
from sklearn.metrics import confusion_matrix  
from sklearn.preprocessing import StandardScaler

Most machine learning models expect numeric input, so categorical data must first be encoded as numbers (e.g. male=0, female=1).
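Here is a small sketch of that encoding step on a hypothetical four-row frame (the `toy` DataFrame and the `GenderEncoded` column are made up for illustration). Note that `LabelEncoder` assigns integers to the labels in sorted order, so the actual mapping depends on the label strings in the data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not the article's data.csv
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})

encoder = LabelEncoder()
toy['GenderEncoded'] = encoder.fit_transform(toy['Gender'])

# classes_ lists the original labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
```

Inspecting `mapping` after fitting is a quick way to check which integer each category received.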

We can help gradient descent converge by ensuring that the mean of each of our features is close to 0. This is achieved by standardizing the data, that is, by applying the following formula to every value of each feature:

z = (x - μ) / σ

where μ is the mean of the feature and σ is its standard deviation.
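Standardization subtracts each feature’s mean and divides by its standard deviation. The sketch below does this by hand on a hypothetical two-column array (age and salary values invented for illustration) and checks that the result agrees with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: columns are age and estimated salary
X_demo = np.array([
    [19.0, 19000.0],
    [35.0, 20000.0],
    [26.0, 43000.0],
    [27.0, 57000.0],
])

# Manual standardization, column by column (np.std defaults to the
# population standard deviation, which is what StandardScaler uses)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same transformation via scikit-learn
scaled = StandardScaler().fit_transform(X_demo)
```

After scaling, each column has a mean of 0 and a standard deviation of 1, so age and salary contribute on comparable scales.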

```
# Load the dataset
dataset = pd.read_csv('./data.csv')

# Encode the categorical Gender column as integers
encoder = LabelEncoder()
dataset['Gender'] = encoder.fit_transform(dataset['Gender'])

# Select the features and the target (a 1-D target avoids a shape warning in fit)
X = dataset[['Gender', 'Age', 'EstimatedSalary']]
y = dataset['Purchased']

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Hold out 20% of the samples for testing
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
```

By using the LogisticRegression class from scikit-learn (imported as sklearn), we can train our model and have it classify the customers in the test set.

classifier = LogisticRegression(random_state=0)  
classifier.fit(train_X, train_y)  
pred_y = classifier.predict(test_X)
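To connect `predict` back to the 0.5 threshold, here is a minimal sketch on synthetic data. The data comes from `make_classification` and every `demo_` name is an assumption introduced for illustration; this is not the article’s data.csv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer data (hypothetical)
demo_X, demo_y = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
demo_train_X, demo_test_X, demo_train_y, demo_test_y = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)

demo_clf = LogisticRegression(random_state=0)
demo_clf.fit(demo_train_X, demo_train_y)

# Column 1 of predict_proba holds the estimated probability of class 1
demo_proba = demo_clf.predict_proba(demo_test_X)[:, 1]
demo_pred = demo_clf.predict(demo_test_X)

# For a binary problem, thresholding that probability at 0.5 reproduces predict()
demo_thresholded = (demo_proba > 0.5).astype(int)
```

Working with `predict_proba` directly is useful when a threshold other than 0.5 suits the problem better.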

We can compare the predictions made by our model to the actual values with the use of a confusion matrix. The numbers on the diagonal correspond to correct predictions.

confusion_matrix(test_y, pred_y)

The accuracy of our model is 63/80 = 0.7875 (or 78.75%), meaning 63 of the 80 test samples were classified correctly.
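Accuracy is the sum of the diagonal of the confusion matrix divided by the total number of samples. The sketch below shows the calculation on ten hypothetical labels (invented for illustration, unrelated to the result above) and compares it with scikit-learn’s `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# scikit-learn's helper computes the same quantity directly from the labels
accuracy_check = accuracy_score(y_true, y_pred)
```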

Cory Maklin