Pretraining Data Bias

May 15, 2022

Photo by Dimitri Karastelev on Unsplash

If you’re like me, then, whenever you hear talk of artificial intelligence ethics, you can’t help but think of a professor in a philosophy department contemplating whether robots should be given the same rights as sentient beings as opposed to a data scientist writing code. My purpose in writing this article is to show you that AI ethics (at least some subset of it) is still relevant to us in our day jobs. Hopefully, by the end, you will have a greater understanding of how to pick the right bias metrics for your use case and explain these to your stakeholders.

Increasingly, machine learning models are used to make decisions in place of humans. These decisions include whether an applicant should be granted a loan, the insurance quote someone receives based on how risky they are perceived to be, and even how long an offender should be incarcerated given how likely they are to commit another crime.

The features used to train these models include sensitive information such as race, gender and age. If we’re not careful, our model could become pseudo-racist, unfairly attributing a label to someone based on demographic features because it learned from a biased dataset.

You may be asking yourself: if that’s the case, why not remove the columns altogether? The reason is that these features contain useful information that aids in making predictions. For instance, we can safely say that young males are more likely to get into car accidents than any other demographic group. Therefore, we’d expect the model to attribute a higher risk to them and consequently charge them a higher rate for car insurance.

So then how does one go about avoiding bias? Fortunately for us, bias in a dataset can be measured. This means we can take steps to mitigate it, such as gathering more data, and then confirm whether the actions we’ve taken have had the desired effect.

I strongly recommend you read the following documentation, as it covers the different metrics used by the Amazon SageMaker Clarify service when gauging bias, as well as the notation we’ll be using throughout the remainder of this article.

Measure Pretraining Bias
Measuring bias in ML models is a first step to mitigating bias. Each measure of bias corresponds to a different notion…

Before continuing, it’s really important that you understand what a facet is. A facet is a column or feature that contains the attributes with respect to which bias is measured. For example, Sex=Male could be the advantaged facet and Sex=Female could be the disadvantaged facet.

It’s worth noting that class imbalance measures are usually applied to binary facets. To generalize to a facet with more than two distinct values, you can assign each value, one at a time, to be the disadvantaged facet and work out the respective pre-training metrics for each one (done by default with the AWS Clarify package).
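The one-at-a-time procedure can be sketched as follows. This is a minimal illustration rather than the Clarify implementation: the helper name `class_imbalance_per_value` and the toy column are assumptions made for the example, and each value in turn is treated as the disadvantaged facet against all remaining rows.

```python
import pandas as pd


def class_imbalance_per_value(facet: pd.Series) -> dict:
    """Class imbalance with each value in turn as the disadvantaged facet (hypothetical helper)."""
    counts = facet.value_counts()
    total = counts.sum()
    result = {}
    for value, n_d in counts.items():
        n_a = total - n_d  # every other row plays the advantaged role
        result[value] = float((n_a - n_d) / (n_a + n_d))
    return result


# Toy facet column with three distinct values (made-up data).
toy_facet = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1)
ci_per_value = class_imbalance_per_value(toy_facet)
print(ci_per_value)
```

The rarer a value is, the closer its score gets to 1, which flags it as under-represented relative to the rest of the dataset.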

Dataset bias in Python

To begin, we will import the required libraries

import pandas as pd  
from urllib.request import urlretrieve

We’ll be using the Adult Data Set from the UCI Machine Learning Repository. We then download the files as follows:

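base_url = ""  # set to the directory hosting the Adult Data Set files in the UCI repository
for file_name in ["adult.data", "adult.names", "adult.test"]:  # "adult.data" is assumed
    urlretrieve(f"{base_url}/{file_name}", file_name)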

We cannot easily parse the file containing the column names. Therefore, we explicitly create a list.

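# Only "Age", "Sex" and "Target" are referenced later; the other names are
# assumed labels that follow the attribute order listed in adult.names.
adult_columns = [
    "Age", "Workclass", "fnlwgt", "Education", "Education-Num",
    "Marital Status", "Occupation", "Relationship", "Ethnic group", "Sex",
    "Capital Gain", "Capital Loss", "Hours per week", "Country", "Target",
]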

We read the data into a Pandas DataFrame.

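df = pd.read_csv(
    "adult.data", names=adult_columns, sep=r"\s*,\s*", engine="python", na_values="?"  # file name assumed
).dropna()  # dropping rows with missing values leaves the 30,162 rows counted below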

Our label is the Target column, our facet is the Sex column and we’re concerned with people’s income.

predicted_column = "Target"  
label_values_or_threshold = ">50K"  
facet_name = "Sex"

Class Imbalance (CI)

As the name implies, Class Imbalance, or CI for short, attempts to answer the question “Could there be any demographic-based biases due to not having enough data for a given subgroup?” In other words, do I have more observations with Sex=Male than Sex=Female?

The example given in the AWS documentation to highlight the effect of Class Imbalance goes something like:

We have a dataset consisting of 1000 samples where men comprise 90% of the samples and women make up only 10%. This may occur because historically, women have not started as many small businesses that required loans as men have. This imbalance can lead a model to learn that women should not be granted loans.

The class imbalance doesn’t necessarily have to do with historical trends. It could simply be that we didn’t do a good job of acquiring data out in the field. For example, let’s assume that we are trying to train a model to determine whether a person will default on their loan. In our dataset, we have more examples of women defaulting on their loans than men. The model therefore ends up learning that men are more likely to pay back their loans than women when, in reality, gender has no bearing on whether a loan will be paid back; the model was simply shown fewer examples of men defaulting on their loans.

We calculate Class Imbalance using the following formula:

$$CI = \frac{n_a - n_d}{n_a + n_d}$$

where n_a is the number of members of the advantaged facet a (e.g. male) and n_d is the number of members of the disadvantaged facet d (e.g. female).

A positive value indicates that facet a has more training samples in the dataset than facet d, whereas a negative value implies it has fewer. Ideally, the value would be near zero, implying the facets are balanced.
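As a quick worked check on the loan example quoted earlier (1,000 samples, of which 900 are men and 100 are women), and taking men as facet a:

```latex
CI = \frac{n_a - n_d}{n_a + n_d} = \frac{900 - 100}{900 + 100} = 0.8
```

A perfectly balanced dataset would give 0, and a dataset containing only facet a would give 1.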

CI in Python

We begin by computing the number of rows for each distinct facet.

num_facet = df[facet_name].value_counts()
Male      20380  
Female     9782  
Name: Sex, dtype: int64

As we can see, there are more than twice as many males as females in the dataset. We use Male for the advantaged class.

num_facet_adv = num_facet["Male"]  
num_facet_disadv = num_facet["Female"]

We define a function to calculate the class imbalance based on the formula above.

def class_imbalance(n_a, n_d):  
    return (n_a - n_d) / (n_a + n_d)

We pass the number of males and the number of females to the function.

class_imbalance(num_facet_adv, num_facet_disadv)
Out[]: 0.3513692725946555

As we can see, the value is greater than zero which implies that there are disproportionately more males than females in the dataset.

Difference in Positive Proportions in Labels (DPPL)

Difference in Positive Proportions in Labels, or DPPL for short, attempts to answer the question “Could there be demographic-based biases due to a disproportionate number of positive outcomes for a given subgroup?” In other words, DPPL looks at the proportion of positive outcomes and not just the number of rows present in the dataset.

Going back to our example, our dataset may have just as many rows for women as for men. However, if the rows for women all have a negative label (e.g. defaulted, denied), then the model will learn this bias during training. It is therefore important to ensure not only that you have roughly the same number of observations for each facet, but also that the proportion of positive labels is roughly the same across facets.

We calculate DPPL using the following formula:

$$DPPL = q_a - q_d$$

where q_a is the proportion of facet a that has an observed label value of 1 and q_d is the proportion of facet d that has an observed label value of 1. A DPPL value of zero indicates there is an equal proportion of positive outcomes for both facets. A positive value indicates the advantaged facet has a higher proportion of positive outcomes than the disadvantaged facet.

DPPL in Python

We start off by obtaining the number of rows with a positive label.

num_facet_and_pos_label = df[facet_name].where(df[predicted_column] == label_values_or_threshold).value_counts()

Then, we split the array into advantaged and disadvantaged facets.

num_facet_and_pos_label_adv = num_facet_and_pos_label["Male"]  
num_facet_and_pos_label_disadv = num_facet_and_pos_label["Female"]

We define a function that computes the DPPL.

def difference_in_positive_proportions_of_labels(q_a, q_d):  
    return q_a - q_d

We calculate q by dividing the number of rows with positive outcomes by the total number of rows for both males and females.

q_a = num_facet_and_pos_label_adv / num_facet_adv  
q_d = num_facet_and_pos_label_disadv / num_facet_disadv

Finally, we call the function we defined earlier.

difference_in_positive_proportions_of_labels(q_a, q_d)  
Out[]: 0.20015891077100018

As we can see, the value is above zero, meaning that in our dataset a larger proportion of males than females have an income above 50K.
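The same quantity can also be computed in one pass with a grouped mean. The sketch below is an alternative formulation on a made-up eight-row DataFrame, not the code used above; the column names mirror the ones in this article and the data values are invented for illustration.

```python
import pandas as pd

# Made-up data: 2 of 4 males and 1 of 4 females have the positive label.
toy = pd.DataFrame(
    {
        "Sex": ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"],
        "Target": [">50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
    }
)

# The mean of a boolean column is the proportion of True values,
# so grouping by facet yields q for every facet value at once.
positive_rate = (toy["Target"] == ">50K").groupby(toy["Sex"]).mean()

dppl = positive_rate["Male"] - positive_rate["Female"]
print(dppl)
```

Because every facet value gets its own proportion, the same Series can be reused when the facet has more than two values.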

Kullback-Leibler Divergence (KL)

The Kullback-Leibler Divergence attempts to answer the question “How different are the distributions for positive outcomes for different demographic groups?”.

For example, let’s assume we’re dealing with college admissions where an applicant may be assigned by a model to one of three categories: rejected, wait-listed or accepted. We compute the Kullback-Leibler Divergence to gauge how different the distribution is for the advantaged versus the disadvantaged facet across all three categories.

We calculate KL using the following formula:

$$KL(P_a \parallel P_d) = \sum_y P_a(y) \log \frac{P_a(y)}{P_d(y)}$$

The first term, P_a, refers to the label distribution of the advantaged group, while P_d refers to the label distribution of the disadvantaged group. A value near zero indicates the labels are similarly distributed, whereas a positive value means the label distributions diverge; the more positive the value, the larger the divergence.
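To get a feel for the numbers, here is a small sketch using the three admissions categories. The shares are made up for illustration, and the helper assumes both distributions have no zero entries. It also shows that the KL divergence is not symmetric: swapping the two facets changes the result.

```python
import numpy as np


def kl(p, q) -> float:
    """KL(p || q) for discrete distributions without zero entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


# Made-up shares of (rejected, wait-listed, accepted) for each facet.
p_a = [0.5, 0.2, 0.3]
p_d = [0.7, 0.2, 0.1]

kl_a_to_d = kl(p_a, p_d)  # advantaged facet as the reference distribution
kl_d_to_a = kl(p_d, p_a)  # roles swapped
print(kl_a_to_d, kl_d_to_a)
```

Identical distributions give exactly zero, which is why values near zero are read as "similarly distributed labels".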

KL in Python

We borrow some code from the AWS Clarify GitHub repository to help in calculating the KL divergence.

import numpy as np
from functional import seq
from typing import List


def pdf(xs) -> dict:
    """
    Probability distribution function
    :param xs: input sequence
    :return: sequence of tuples as (value, frequency)
    """
    counts = seq(xs).map(lambda x: (x, 1)).reduce_by_key(lambda x, y: x + y)
    total = counts.map(lambda x: x[1]).sum()
    result_pdf = counts.map(lambda x: (x[0], x[1] / total)).sorted().list()
    return result_pdf


def pdfs_aligned_nonzero(*args) -> List[np.ndarray]:
    """
    Convert a list of discrete pdfs / freq counts to aligned numpy arrays of the same size for common non-zero elements
    :return: pair of numpy arrays of the same size with the aligned pdfs
    """
    num_pdfs = len(args)
    pdfs = []
    for x in args:
        pdfs.append(pdf(x))

    def keys(_xs):
        return seq(_xs).map(lambda x: x[0])

    # Extract union of keys
    all_keys = seq(pdfs).flat_map(keys).distinct().sorted()

    # Index all pdfs by value
    dict_pdfs = seq(pdfs).map(dict).list()

    # result aligned lists
    aligned_lists: List[List] = [[] for x in range(num_pdfs)]

    # fill keys present in all pdfs
    for i, key in enumerate(all_keys):
        for j, d in enumerate(dict_pdfs):
            if d.get(key, 0) == 0:
                break
        else:
            # All keys exist and are != 0
            for j, d in enumerate(dict_pdfs):
                aligned_lists[j].append(d[key])
    np_arrays = seq(aligned_lists).map(np.array).list()
    return np_arrays

We define a function to calculate the KL divergence.

def kl_divergence(p, q):  
    return np.sum(p * np.log(p / q))

We obtain the probability distributions for the advantaged and disadvantaged facets.

label = df['Target']  
sensitive_facet_index = df["Sex"] == "Female"
(Pa, Pd) = pdfs_aligned_nonzero(label[~sensitive_facet_index], label[sensitive_facet_index])

Finally, we compute the KL divergence.

kl_divergence(Pa, Pd)
Out[]: 0.14306865156306434
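If you would rather not depend on the `functional` package, the aligned distributions can also be built with pandas alone. The following is a minimal sketch under that assumption, run on made-up data; `label_distributions` is a hypothetical helper name, and like the borrowed code it keeps only the labels that occur in both facets.

```python
import numpy as np
import pandas as pd


def label_distributions(label: pd.Series, disadvantaged: pd.Series):
    """Aligned label distributions (P_a, P_d), restricted to labels present in both facets."""
    p_a = label[~disadvantaged].value_counts(normalize=True)
    p_d = label[disadvantaged].value_counts(normalize=True)
    common = p_a.index.intersection(p_d.index).sort_values()
    return p_a[common].to_numpy(), p_d[common].to_numpy()


# Made-up labels: the first four rows are the advantaged facet, the last four the disadvantaged one.
toy_label = pd.Series([">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K"])
toy_is_female = pd.Series([False, False, False, False, True, True, True, True])

p_a, p_d = label_distributions(toy_label, toy_is_female)
toy_kl = float(np.sum(p_a * np.log(p_a / p_d)))
print(p_a, p_d, toy_kl)
```

`value_counts(normalize=True)` only returns labels that actually occur, so every entry is non-zero and the logarithm is always defined.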

Conditional Demographic Disparity in Labels (CDDL)

Demographic disparity checks whether a facet makes up a larger proportion of the negative (rejected) outcomes than of the positive (accepted) outcomes.

For example, in the case of college admissions, if women applicants comprised 46% of the rejected applicants and comprised only 32% of the accepted applicants, we say that there is demographic disparity because the rate at which women were rejected exceeds the rate at which they are accepted. [1]

Conditional Demographic Disparity in Labels, or CDDL for short, builds on Demographic Disparity to avoid Simpson’s paradox.

The textbook example of Simpson’s paradox arose in the case of Berkeley admissions, where men were accepted at a higher rate overall than women. Initially, it was thought that men were favoured relative to women. However, when departmental subgroups were examined, women were shown to have higher admission rates than men when conditioned by department. The explanation was that women had applied to departments with lower acceptance rates than men had. Examining the subgrouped acceptance rates revealed that women were actually accepted at a higher rate than men for the departments with lower acceptance rates. [1]
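The reversal is easier to see with numbers. The sketch below uses made-up admissions counts for two hypothetical departments, chosen so that the paradox appears; they are not the Berkeley figures.

```python
import pandas as pd

# Made-up counts: women mostly apply to the department with the lower acceptance rate.
admissions = pd.DataFrame(
    {
        "department": ["Easy", "Easy", "Hard", "Hard"],
        "sex": ["Male", "Female", "Male", "Female"],
        "applied": [80, 20, 20, 80],
        "accepted": [48, 14, 4, 24],
    }
)

# Overall acceptance rate per sex, with departments pooled together.
totals = admissions.groupby("sex")[["applied", "accepted"]].sum()
overall_rate = totals["accepted"] / totals["applied"]

# Acceptance rate per sex within each department.
by_department = admissions.set_index(["department", "sex"])
department_rate = by_department["accepted"] / by_department["applied"]

print(overall_rate)
print(department_rate)
```

In this toy data men have the higher pooled acceptance rate, yet women have the higher rate inside each department, which is exactly the situation that conditioning on a group variable is meant to expose.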

Going back to our example, we know that income is highly correlated with age. Therefore, it’s possible that the discrepancy in income might be due to the fact that our dataset has a higher proportion of older men than women. We could verify this assumption using CDDL.

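We calculate CDDL using the following formulas. The demographic disparity for facet d is the share of facet d among the rejected outcomes minus its share among the accepted outcomes:

$$DD_d = \frac{n_d^{(0)}}{n^{(0)}} - \frac{n_d^{(1)}}{n^{(1)}} = P_d^R - P_d^A$$

The conditional demographic disparity is the average of the demographic disparities of the subgroups, each weighted by its number of observations n_i:

$$CDD = \frac{1}{n} \sum_i n_i \cdot DD_i$$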

A positive value indicates there is a demographic disparity as facet d has a greater proportion of the rejected outcomes in the dataset than of the accepted outcomes.
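As a worked instance, take the admissions figures quoted above, where women make up 46% of the rejected applicants and 32% of the accepted applicants. Writing P_d^R and P_d^A for those two shares (notation assumed here for illustration), the demographic disparity for facet d is:

```latex
DD_d = P_d^R - P_d^A = 0.46 - 0.32 = 0.14 > 0
```

The value is positive, so by this measure women are over-represented among the rejections. CDDL repeats the calculation inside each subgroup and averages the results, weighting each subgroup by its size.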

CDDL in Python

We again borrow some code from the AWS Clarify GitHub repository to help in calculating CDDL.

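INFINITY = float("inf")  # constant used by divide() below


def divide(a, b):
    # Division that returns 0 for 0/0 and a signed infinity for x/0.
    if b == 0 and a == 0:
        return 0.0
    if b == 0:
        if a < 0:
            return -INFINITY
        return INFINITY
    return a / b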

We define a function that computes CDDL.

def CDDL(feature, sensitive_facet_index, positive_label_index, group_variable):
    unique_groups = np.unique(group_variable)

    CDD = np.array([])
    counts = np.array([])
    for subgroup_variable in unique_groups:
        in_subgroup = group_variable == subgroup_variable
        counts = np.append(counts, len(group_variable[in_subgroup]))
        # A: share of the disadvantaged facet among accepted outcomes in this subgroup
        numA = len(feature[positive_label_index & sensitive_facet_index & in_subgroup])
        denomA = len(feature[positive_label_index & in_subgroup])
        A = numA / denomA if denomA != 0 else 0
        # D: share of the disadvantaged facet among rejected outcomes in this subgroup
        numD = len(feature[(~positive_label_index) & sensitive_facet_index & in_subgroup])
        denomD = len(feature[(~positive_label_index) & in_subgroup])
        D = numD / denomD if denomD != 0 else 0
        CDD = np.append(CDD, D - A)

    # Average of the subgroup disparities, weighted by subgroup size
    return divide(np.sum(counts * CDD), np.sum(counts))

Finally, we call the function using Age for the subgroups.

feature = df["Sex"]  
sensitive_facet_index = df["Sex"] == "Female"  
positive_label_index = df["Target"] == ">50K"  
group_variable = df["Age"]
CDDL(feature, sensitive_facet_index, positive_label_index, group_variable)
Out[]: 0.214915908649356

The value is greater than zero, meaning that there is still a discrepancy in income between males and females despite taking differences in age into account.
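Because Age is passed as-is, every distinct age forms its own subgroup, and some of those subgroups may hold only a handful of rows. One option, sketched below on made-up data, is to group ages into bands before conditioning. The `cddl` helper, the band edges and the data are all illustrative assumptions rather than part of the analysis above.

```python
import pandas as pd


def cddl(disadvantaged: pd.Series, positive: pd.Series, group: pd.Series) -> float:
    """Count-weighted average of the per-subgroup demographic disparity (sketch)."""
    weighted_sum, total = 0.0, 0
    for value in group.dropna().unique():
        mask = group == value
        d, pos = disadvantaged[mask], positive[mask]
        n_rejected, n_accepted = int((~pos).sum()), int(pos.sum())
        # Share of the disadvantaged facet among rejected and among accepted rows.
        share_rejected = (d & ~pos).sum() / n_rejected if n_rejected else 0.0
        share_accepted = (d & pos).sum() / n_accepted if n_accepted else 0.0
        weighted_sum += int(mask.sum()) * (share_rejected - share_accepted)
        total += int(mask.sum())
    return float(weighted_sum / total) if total else 0.0


# Made-up rows: age, membership of the disadvantaged facet, and the label.
age = pd.Series([22, 25, 28, 29, 45, 47, 52, 58])
is_female = pd.Series([True, True, False, False, True, False, True, False])
is_positive = pd.Series([False, False, False, True, False, True, True, True])

# Hypothetical age bands; the cut points are arbitrary.
age_band = pd.cut(age, bins=[0, 30, 50, 120], labels=["<=30", "31-50", ">50"])

cddl_raw_age = cddl(is_female, is_positive, age)
cddl_age_band = cddl(is_female, is_positive, age_band)
print(cddl_raw_age, cddl_age_band)
```

The two results differ on this toy data, which is a reminder that the value of CDDL depends on how the subgroups are defined and should be reported together with that choice.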

Amazon SageMaker Clarify

SageMaker also offers Clarify as a standalone open source Python library, which means that it can be used outside of AWS!

To install the package, you can simply do:

pip install smclarify

We import the library within our notebook as follows.

from smclarify.bias import report

Next, we specify the facet and label columns.

facet_column = report.FacetColumn(name="Sex")
label_column = report.LabelColumn(  

The group variable is required to form subgroups for the measurement of Conditional Demographic Disparity in Labels (CDDL).

group_variable = df["Age"]

We set the stage type to pre-training since we’re concerned with the metrics that can be calculated prior to training the model.

bias_report = report.bias_report(  

The report will contain the metrics for each distinct value in the facet column. We will select the second element which contains the metrics for when Female is the disadvantaged facet value.


As we can see, the values match those we calculated ourselves.

{'value_or_threshold': 'Female',  
 'metrics': [{'name': 'CDDL',  
   'description': 'Conditional Demographic Disparity in Labels (CDDL)',  
   'value': 0.214915908649356},  
  {'name': 'CI',  
   'description': 'Class Imbalance (CI)',  
   'value': 0.3513692725946555},  
  {'name': 'DPL',  
   'description': 'Difference in Positive Proportions in Labels (DPL)',  
   'value': 0.20015891077100018},  
  {'name': 'JS',  
   'description': 'Jensen-Shannon Divergence (JS)',  
   'value': 0.03075614465977302},  
  {'name': 'KL',  
   'description': 'Kullback-Liebler Divergence (KL)',  
   'value': 0.14306865156306434},  
  {'name': 'KS',  
   'description': 'Kolmogorov-Smirnov Distance (KS)',  
   'value': 0.20015891077100018},  
  {'name': 'LP', 'description': 'L-p Norm (LP)', 'value': 0.2830674462421746},  
  {'name': 'TVD',  
   'description': 'Total Variation Distance (TVD)',  
   'value': 0.20015891077100015}]}

It’s important to note that the metrics we did not cover in depth (e.g. TVD, LP, KS and JS) all measure, like KL, whether the distribution of outcomes in the dataset differs across facets.
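For reference, here is a rough sketch of how these four distances could be computed from two aligned label distributions. The formulas reflect my reading of the standard definitions (half the L1 distance for TVD, the largest absolute difference for KS, the Euclidean norm for LP, and the average KL divergence to the midpoint distribution for JS); they are assumptions for illustration, not code taken from Clarify, and the input shares are made up.

```python
import numpy as np


def kl(p, q) -> float:
    """KL(p || q) for discrete distributions without zero entries."""
    return float(np.sum(p * np.log(p / q)))


def distribution_distances(p_a, p_d) -> dict:
    """TVD, KS, LP and JS between two aligned discrete distributions (sketch)."""
    p_a = np.asarray(p_a, dtype=float)
    p_d = np.asarray(p_d, dtype=float)
    diff = p_a - p_d
    midpoint = (p_a + p_d) / 2
    return {
        "TVD": 0.5 * float(np.sum(np.abs(diff))),
        "KS": float(np.max(np.abs(diff))),
        "LP": float(np.linalg.norm(diff, ord=2)),
        "JS": 0.5 * (kl(p_a, midpoint) + kl(p_d, midpoint)),
    }


# Made-up binary label distributions: (negative share, positive share).
distances = distribution_distances([0.7, 0.3], [0.9, 0.1])
print(distances)
```

With a binary label, TVD and KS both reduce to the absolute difference in positive proportions and LP is that difference times the square root of two. This is consistent with the report above, where TVD, KS and DPL agree and LP is about 1.414 times larger.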


It’s important to take steps to mitigate bias in the context of machine learning to avoid unfair treatment of others based on demographic features. As data practitioners, we should strive to check for bias in our workflow/MLOps pipelines. Using the metrics contained in the AWS Clarify package, we can measure the bias in our data prior to training a machine learning model and take steps to ensure it stays below a certain threshold.
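A threshold check of this kind might look like the sketch below. The metric values are the ones from the report above, while the function name and the limits are hypothetical; suitable limits depend on the use case and need to be agreed with your stakeholders.

```python
def metrics_over_threshold(metrics: dict, thresholds: dict) -> list:
    """Return the names of the metrics whose absolute value exceeds its limit."""
    return [name for name, limit in thresholds.items() if abs(metrics[name]) > limit]


# Values taken from the Clarify report shown earlier.
metrics = {
    "CI": 0.3513692725946555,
    "DPL": 0.20015891077100018,
    "KL": 0.14306865156306434,
}

# Hypothetical limits, chosen only to illustrate the check.
thresholds = {"CI": 0.4, "DPL": 0.1, "KL": 0.2}

violations = metrics_over_threshold(metrics, thresholds)
print(violations)
```

In a pipeline, a non-empty list could fail the run before any model is trained.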


[1] AWS | Conditional Demographic Disparity (CDD)

Written by Cory Maklin