Metrics For Evaluating Machine Learning Classification Models

June 07, 2019

In the realm of machine learning there are three main kinds of problems: regression, classification and clustering. Depending on the kind of problem you’re working with, you’ll want to use a specific set of metrics to gauge the performance of your model. This is best illustrated with an example. Suppose a company claims to have developed a facial detection algorithm that can recognize terrorists with 99.9% accuracy. Ethical implications aside, this should immediately set off a red flag. Terrorists represent a minute percentage of the population (I couldn’t find the actual statistic but let’s assume it’s 0.001%). Thus, by assuming that no one is a terrorist (i.e. writing a program that returns false all the time), we can achieve an accuracy of 99.999%, comfortably above the claimed 99.9%. Accuracy is, therefore, not a good metric for assessing the model’s performance: a model that misses every single terrorist still obtains a very high score.
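To make the arithmetic concrete, here is a minimal sketch of that always-false program. The population of 100,000 people with a single terrorist is an assumption chosen to match the hypothetical 0.001% figure, and the name always_negative is made up for illustration.

```python
# Hypothetical population: 100,000 people, 1 of whom is a terrorist (0.001%).
population = 100_000
terrorists = 1

# Ground-truth labels: True marks a terrorist.
y_true = [True] * terrorists + [False] * (population - terrorists)


def always_negative(_person):
    """A 'model' that never flags anyone."""
    return False


y_pred = [always_negative(person) for person in y_true]

# Accuracy: the share of predictions that match the truth.
correct = sum(actual == predicted for actual, predicted in zip(y_true, y_pred))
accuracy = correct / population

# Number of terrorists the model actually caught.
caught = sum(actual and predicted for actual, predicted in zip(y_true, y_pred))

print(f"accuracy: {accuracy:.3%}")
print(f"terrorists caught: {caught} of {terrorists}")
```

The accuracy comes out at 99,999 correct answers out of 100,000 even though the model catches nobody, which is exactly why accuracy alone is a poor guide on heavily imbalanced data.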

Confusion Matrix

A confusion matrix, on the other hand, distinguishes between the samples that were correctly classified as non-terrorists and those that were correctly classified as terrorists, and likewise between the two kinds of mistakes. A confusion matrix is split into 4 quadrants.

True Positive:

Interpretation: You predicted positive and it’s true.

You predicted that the person is a terrorist and they actually are.

True Negative:

Interpretation: You predicted negative and it’s true.

You predicted that the person is not a terrorist and they actually are not.

False Positive: (Type 1 Error)

Interpretation: You predicted positive and it’s false.

You predicted that the person is a terrorist but they actually are not.

False Negative: (Type 2 Error)

Interpretation: You predicted negative and it’s false.

You predicted that the person is not a terrorist but they actually are.
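As a rough sketch, the four quadrants can be counted directly from a list of true labels and a list of predictions. The ten labels below are made-up toy data (1 = positive, 0 = negative), and confusion_counts is a hypothetical helper, not a library function.

```python
def confusion_counts(y_true, y_pred):
    """Count the four quadrants for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            counts["TP"] += 1  # predicted positive and it's true
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1  # predicted negative and it's true
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1  # predicted positive and it's false (Type 1 error)
        else:
            counts["FN"] += 1  # predicted negative and it's false (Type 2 error)
    return counts


# Toy data: 3 actual positives followed by 7 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```

In practice, sklearn.metrics.confusion_matrix returns the same four counts, laid out as [[TN, FP], [FN, TP]] when the labels are 0 and 1.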

Precision / Recall

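Sometimes, it’s easier to evaluate a model’s performance using numbers rather than relying on a library to visualize a confusion matrix. Precision and recall are two such numbers:

  • Precision = true positive / (true positive + false positive)
  • Recall = true positive / (true positive + false negative)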

In the real world, you’ll encounter classification problems where the decision threshold forces you to choose between a high precision and a high recall. In certain circumstances, it’s better to have a high recall. For example, a medical diagnosis might be better off with a few false positives than letting anyone with the actual disease slip through the cracks and go untreated. Other times, it’s better to have a high precision, as is the case with spam filters: it’s more acceptable to have a few spam emails in the user’s inbox than it is to classify important emails as junk. We can plot the tradeoff between precision and recall in order to form a better judgement.
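The tradeoff can be sketched by scoring the same predictions at two different thresholds. The scores and labels below are invented toy values (think of each score as a predicted spam probability), and returning 0.0 when a ratio would divide by zero is a convention assumed here for simplicity.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if p and y == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if p and y == 0)
    fn = sum(1 for y, p in zip(labels, predicted) if not p and y == 1)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 4 positives and 4 negatives, sorted by descending score.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

strict = precision_recall(labels, scores, threshold=0.85)
lenient = precision_recall(labels, scores, threshold=0.35)

print("strict threshold  (precision, recall):", strict)
print("lenient threshold (precision, recall):", lenient)
```

With the strict threshold everything flagged is a true positive but half of the positives are missed; with the lenient threshold every positive is found at the cost of two false alarms.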

Rather than measure recall and then precision every time, it would be easier if we could use a single score. At first, we might try taking the average of the two results. For example, say a spam detector had a precision of 80% and a recall of 37%; then the average would be 58.5%. Now, say we built a spam detector that treated almost no emails as spam (analogous to the terrorist example). In the event there are significantly more emails that aren’t spam than spam, our model will be interpreted as having an OK performance. To elaborate, if 300,000 emails are ham (not spam) and 500 are spam, then a model that flagged a single email as spam, and got it right, would obtain a precision of 100% since everything it flagged really was spam, and a recall of 0.2% (effectively 0%) since it missed 499 of the 500 spam emails. (A model that flagged nothing at all would have an undefined precision of 0/0.) If we took the average, we’d still get about 50%, which is misleading since the entire purpose of the spam detector is to detect spam.

It’s for the preceding reason that we make use of the harmonic mean instead of the arithmetic mean. The harmonic mean of two different positive numbers is always closer to the smaller one, so a single poor score drags the result down.

Going back to our example of the spam detector: if the precision is equal to 100% and the recall is effectively 0%, then the harmonic mean is also effectively 0%. We call this value the F1 score.

F1 score = 2 × (precision × recall) / (precision + recall)
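A quick way to see the difference is to compute both averages side by side. This is a minimal sketch using the two precision/recall pairs from the text; returning 0.0 when both inputs are zero is an assumed convention to avoid dividing by zero.

```python
def arithmetic_mean(precision, recall):
    """Plain average of the two scores."""
    return (precision + recall) / 2


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention: avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the examples in the text.
examples = [(0.80, 0.37), (1.00, 0.00)]

for precision, recall in examples:
    print(
        f"precision={precision:.2f} recall={recall:.2f} "
        f"average={arithmetic_mean(precision, recall):.3f} "
        f"f1={f1(precision, recall):.3f}"
    )
```

For the second pair the plain average still reports 0.5, while the F1 score drops to 0, which matches the intuition that a detector that finds no spam is useless.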


Similarly to the precision/recall curve, the Receiver Operating Characteristic (ROC) curve provides an elegant way of summarizing the confusion matrices produced at different thresholds. An ROC curve plots the true positive rate against the false positive rate.

  • True positive rate = Recall = Sensitivity = true positive / (true positive + false negative)
  • False positive rate = 1 – specificity = false positive / (false positive + true negative)

[Figure: ROC curves (red and blue) and the area under the curve (AUC)]

It’s important to note that the false positive rate is 1 minus the specificity, which means that the closer the false positive rate is to 0, the higher the specificity. The true positive rate, for its part, is the sensitivity (recall). Therefore, to obtain the best values for both specificity and sensitivity, we’ll want to select a point as close as possible to the top left corner.
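Each point on an ROC curve is simply the pair of rates obtained at one threshold. The sketch below computes three such points by hand; the scores and labels are invented toy data and tpr_fpr is a hypothetical helper.

```python
def tpr_fpr(labels, scores, threshold):
    """True positive rate and false positive rate at a given threshold."""
    tp = fp = fn = tn = 0
    for y, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
        else:
            tn += 1
    # TPR = TP / (TP + FN), FPR = FP / (FP + TN)
    return tp / (tp + fn), fp / (fp + tn)


# Toy data: 4 positives and 4 negatives, sorted by descending score.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

roc_points = {t: tpr_fpr(labels, scores, t) for t in (0.85, 0.50, 0.25)}
for threshold, (tpr, fpr) in roc_points.items():
    print(f"threshold={threshold:.2f} TPR={tpr:.2f} FPR={fpr:.2f}")
```

Lowering the threshold moves the point up and to the right: more positives are caught, but more negatives are flagged as well.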

The Area Under the Curve (AUC) makes it easy to compare one ROC curve to another. For example, in the figure the AUC for the red ROC curve is greater than the AUC for the blue ROC curve. Wherever the red curve lies above the blue one, the model associated with the red curve achieves a higher sensitivity for the same amount of specificity.
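The area itself can be approximated with the trapezoidal rule over the (false positive rate, true positive rate) points. This is a sketch under the assumption that the points are sorted by increasing false positive rate; both curves below are invented for illustration and trapezoid_auc is a hypothetical helper.

```python
def trapezoid_auc(points):
    """Area under a curve given (fpr, tpr) points sorted by increasing fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Width of the slice times its average height.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ROC curves as lists of (fpr, tpr) points.
curve_a = [(0.0, 0.0), (0.0, 0.5), (0.25, 1.0), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]  # the diagonal: random guessing

auc_a = trapezoid_auc(curve_a)
auc_b = trapezoid_auc(curve_b)

print(f"AUC of curve A: {auc_a:.4f}")
print(f"AUC of curve B: {auc_b:.4f}")
```

A single number is convenient for ranking models, but two curves can cross, so a higher AUC does not by itself guarantee a higher sensitivity at every level of specificity.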


In the following example, we’ll see all of the preceding metrics in action.

from inspect import signature
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
# Use matplotlib's built-in dark theme for the plots.
plt.style.use('dark_background')

For simplicity, we’ll be using one of the datasets provided by sklearn.

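breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
# Map the integer targets to their names ('malignant' / 'benign') ...
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
# ... then re-encode them alphabetically so that benign = 0 and malignant = 1.
encoder = LabelEncoder()
y = pd.Series(encoder.fit_transform(y))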

The metrics will be used to measure the difference between the predictions made by our model and the labels contained in the testing set, so we hold out part of the data for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

We’ll be using the random forest classifier but any classification algorithm will do.

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

We call the predict_proba method rather than predict in order to obtain, for each sample, the estimated probability that it falls under each category. This is analogous to the output of the softmax activation function commonly used in deep learning.

probs = rf.predict_proba(X_test)

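The roc_curve function expects a single score per sample: the score of the positive class. Therefore, we take the predicted probabilities that a tumor is malignant, which sit in column 1 because the label encoder maps malignant to 1.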

malignant_probs = probs[:,1]
fpr, tpr, thresholds = roc_curve(y_test, malignant_probs)
roc_auc = auc(fpr, tpr)

It’s difficult to see in this example, but we’d typically select a point in the top left corner which would yield the best sensitivity and specificity. An AUC of 0.98 implies that there is very little tradeoff between specificity and sensitivity.

plt.title('Receiver Operating Characteristic')  
plt.plot(fpr, tpr, 'y', label = 'AUC = %0.2f' % roc_auc)  
plt.legend(loc = 'lower right')  
plt.plot([0, 1], [0, 1],'r--')  
plt.xlim([0, 1])  
plt.ylim([0, 1])  
plt.ylabel('True Positive Rate')  
plt.xlabel('False Positive Rate')
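One way to make "a point in the top left corner" concrete is to pick the threshold whose ROC point is nearest to the corner (0, 1). The sketch below assumes three parallel lists in the same fpr, tpr, thresholds order used above, but the values are hypothetical rather than taken from the model, and closest_to_top_left is a made-up helper. Distance to the corner is only one heuristic; others exist.

```python
import math

# Hypothetical ROC points (not the output of the model above).
fpr = [0.0, 0.25, 0.75, 1.0]
tpr = [0.5, 1.0, 1.0, 1.0]
thresholds = [0.90, 0.60, 0.30, 0.20]


def closest_to_top_left(fpr, tpr, thresholds):
    """Return the threshold whose (fpr, tpr) point is nearest to (0, 1)."""
    distances = [math.hypot(f, 1 - t) for f, t in zip(fpr, tpr)]
    best = min(range(len(distances)), key=distances.__getitem__)
    return thresholds[best], distances[best]


best_threshold, distance = closest_to_top_left(fpr, tpr, thresholds)
print(f"best threshold: {best_threshold}, distance to corner: {distance:.2f}")
```

The same function should work on the arrays returned by roc_curve, since it only relies on the three sequences having equal length.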

Next, let’s take a look at a few other metrics to evaluate the performance of our model. First, we use our model to assign each sample a class label; predict picks the class with the highest predicted probability.

y_pred = rf.predict(X_test)

As a quick reminder, precision measures true positives over true positives plus false positives.

precision_score(y_test, y_pred)

Recall measures true positives over true positives plus false negatives.

recall_score(y_test, y_pred)

The F1 score combines precision and recall using the harmonic mean.

f1_score(y_test, y_pred)

Selecting the threshold that corresponds to the point closest to the top right corner will result in the best combination of precision and recall.

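# Use the predicted probabilities, not the hard labels from predict,
# so that the curve is traced across many thresholds.
precision, recall, threshold = precision_recall_curve(y_test, malignant_probs)
average_precision = average_precision_score(y_test, malignant_probs)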
step_kwargs = ({'step': 'post'} if 'step' in signature(plt.fill_between).parameters else {})  
plt.step(recall, precision, color='r', alpha=0.2, where='post')  
plt.fill_between(recall, precision, alpha=0.2, color='r', **step_kwargs)  
plt.ylim([0.0, 1.0])  
plt.xlim([0.0, 1.0])  
plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
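If "best combination" is taken to mean the highest F1 score (an assumption, since other criteria are possible), the threshold can be found by a simple sweep. The sketch below uses invented toy scores and labels rather than the model's output, and f1_at is a hypothetical helper.

```python
def f1_at(labels, scores, threshold):
    """F1 score when every score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    # 2TP / (2TP + FP + FN) is the harmonic mean of precision and recall.
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy data: 4 positives and 4 negatives, sorted by descending score.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

# Try every distinct score as a candidate threshold and keep the best one.
candidates = sorted(set(scores))
best_threshold = max(candidates, key=lambda t: f1_at(labels, scores, t))
best_f1 = f1_at(labels, scores, best_threshold)

print(f"best threshold: {best_threshold}, F1: {best_f1:.3f}")
```

On this toy data the sweep settles on the threshold that catches every positive while letting through a single false positive.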

Final Thoughts

The choice of metrics with which we evaluate the performance of our model varies depending on the nature of the problem. For classification models, we can use precision, recall, the F1 score or the ROC curve to measure performance.

Written by Cory Maklin