Machine Learning Algorithms Part 10: Logistic Regression Example In Python
Logistic Regression is a supervised machine learning algorithm used in the classification of data. For example, suppose we want to predict, from a customer's income, whether they will buy a product. In other words, we want to classify customers into two categories: those we think will purchase the product and those we think will not.
By fitting a regression line to our data, we can express the likelihood of a customer making a purchase as a function of their income. We then classify a customer as a buyer whenever the predicted likelihood is at least 0.5 (or 50%), a threshold that gives reasonably accurate results.
The probability of an event occurring can never be below 0 or exceed 1 (or 100%). We therefore pass our linear function through a sigmoid function, which has horizontal asymptotes at 0 and 1.
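To make the transformation concrete, here is a minimal sketch of a sigmoid applied to a linear function of income, followed by the 0.5 threshold. The intercept, slope, and income values are made up for illustration and do not come from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for a one-feature model (illustrative values only).
intercept = -4.0
slope = 0.08  # change in log-odds per unit of income (in thousands)

income = np.array([20.0, 45.0, 80.0])
linear_score = intercept + slope * income  # unbounded linear function
probability = sigmoid(linear_score)        # squeezed into (0, 1)
will_buy = probability >= 0.5              # apply the 0.5 threshold

print(probability)
print(will_buy)  # [False False  True]
```

A linear score of exactly 0 maps to a probability of 0.5, so the threshold on the probability is equivalent to asking whether the linear score is positive.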
Some pros and cons of Logistic Regression
- Simple and efficient
- Low variance
- Models can be updated
- Doesn’t handle a large number of categorical variables well
- Requires transformation of non-linear features
Let’s take a look at how we could go about classifying data using Logistic Regression in Python.
```
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
```
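Most machine learning models, including scikit-learn's LogisticRegression, expect numeric input, so categorical data has to be encoded as numbers. LabelEncoder assigns integers to the labels in sorted order, so a column with the labels Female and Male is encoded as Female=0, Male=1.

We can help gradient-based optimization converge by ensuring that the mean of each of our features is close to 0 and that the features share a comparable scale. This is achieved by standardizing, which applies the following formula to every value of a feature, where $\mu$ is that feature's mean and $\sigma$ its standard deviation.

$$z = \frac{x - \mu}{\sigma}$$

```
dataset = pd.read_csv('./data.csv')
```

```
# Encode the categorical Gender column as integers
encoder = LabelEncoder()
dataset['Gender'] = encoder.fit_transform(dataset['Gender'])
```

```
X = dataset[['Gender', 'Age', 'EstimatedSalary']]
# Select the target as a 1-D Series; a one-column DataFrame triggers a shape warning in fit()
y = dataset['Purchased']
```

```
# Standardize every feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)
```

```
# Hold out 20% of the samples for testing
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
```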
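The two preprocessing steps can be checked on a small example. The sketch below uses a toy DataFrame with made-up values (only the column names mirror the ones used in this article) to show which integer each label receives and to confirm that StandardScaler matches the standardization formula computed by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data with made-up values; only the column names mirror the article.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [19, 35, 26, 47],
    'EstimatedSalary': [19000, 20000, 43000, 25000],
})

encoder = LabelEncoder()
toy['Gender'] = encoder.fit_transform(toy['Gender'])
# classes_ holds the labels in sorted order; a label's position is its integer code.
print(encoder.classes_.tolist())  # ['Female', 'Male']
print(toy['Gender'].tolist())     # [1, 0, 0, 1]

scaler = StandardScaler()
scaled = scaler.fit_transform(toy[['Age', 'EstimatedSalary']])

# The same result by hand: z = (x - mean) / std, column by column.
values = toy[['Age', 'EstimatedSalary']].to_numpy(dtype=float)
manual = (values - values.mean(axis=0)) / values.std(axis=0)  # population std (ddof=0)

print(np.allclose(scaled, manual))  # True
```

After scaling, each column has a mean of approximately 0 and a standard deviation of 1, which puts Age and EstimatedSalary on the same footing despite their very different raw ranges.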
By using the LogisticRegression class from the sklearn.linear_model module, we can train our model and have it classify the customers in the test set.
```
classifier = LogisticRegression(random_state=0)
classifier.fit(train_X, train_y)
pred_y = classifier.predict(test_X)
```
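The link between the predicted classes and the 0.5 threshold can be made visible with predict_proba. The following is a rough sketch on synthetic stand-in data (not the dataset used in this article); the names demo_X, demo_y, and demo_clf are placeholders chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the article's dataset): one standardized
# feature, with larger values more likely to belong to class 1.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(200, 1))
demo_y = (demo_X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

demo_clf = LogisticRegression(random_state=0)
demo_clf.fit(demo_X, demo_y)

# predict_proba returns one column per class; column 1 is P(class == 1).
proba_positive = demo_clf.predict_proba(demo_X)[:, 1]
labels = demo_clf.predict(demo_X)

# For a binary problem, predict() agrees with thresholding that probability at 0.5.
print(np.array_equal(labels, (proba_positive > 0.5).astype(int)))
```

Working with the probabilities directly is useful when a threshold other than 0.5 suits the problem better, for example when false positives and false negatives have different costs.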
We can compare the predictions made by our model to the actual values with the use of a confusion matrix. The numbers on the diagonal correspond to correct predictions.
Our model classified 63 of the 80 test samples correctly, so its accuracy is 63/80 = 0.7875 (or 78.75%).
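As a sketch of how such a figure is derived, the snippet below builds a confusion matrix and computes accuracy as the sum of the diagonal divided by the total number of samples. The labels are made up for illustration; with the variables from this article, the call would presumably be confusion_matrix(test_y, pred_y).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration; these stand in for test_y and pred_y.
actual = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
predicted = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(actual, predicted)
print(cm)
# Rows are actual classes, columns are predicted classes:
# [[5 1]
#  [1 3]]

# Correct predictions sit on the diagonal, so accuracy = trace / total.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.8

# accuracy_score computes the same quantity directly from the labels.
print(accuracy_score(actual, predicted))  # 0.8
```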