You can bet that you will be asked what kind of data issues you might encounter in your day job during one of your data engineer or data scientist interviews.
Data quality will do more for model performance than any other technique. You could train a complicated deep learning model on massive amounts of data, but if the underlying data is bad, so too will the model’s inference. In this article, we will attempt to address common data quality issues.
Completeness answers the question:
- Is the data all there?
As mentioned in the Kaggle tutorial on handling missing values, we need to distinguish between values that are missing because they were not recorded and values that are missing because they don’t exist.
In the case of the former, we should try to fill them in rather than just leaving them as NA’s. When dealing with missing categorical variables, we can replace the missing value by the most frequent value(s) for that particular column. When dealing with missing numerical variables, we can replace the missing value(s) by the mean/median value for that particular column. A more sophisticated way of replacing missing values consists of training a classifier (e.g. k-nearest neighbors, neural networks) to predict the values of the columns for which there are missing values and then replacing the missing values with the predictions prior to training the actual model.
In the event the values are missing because they actually don’t exist (e.g. the field is missing because if there was no penalty then it doesn’t make sense to say which team was penalized), we should either leave them alone (decision tree based algorithms can handle missing values, for example) or replace them with a value like “neither”.
We also need to account for data that is missing from the dataset in its entirety. For example, suppose that you are reading reports as .csv files from a data lake and writing them to an Iceberg/Delta table. For one reason or another, a handful of reports were never sent to the data lake. As a result, the table is missing rows. The only way to know that the data is indeed missing is reach out to the data owner(s)/analyst(s). It’s for this reason that it’s important to perform data quality checks using a framework such as great expectations to ensure that all the data is present.
If no business stakeholders are available, we can quickly identify whether data is missing by running a left join with a known parent table and checking whether the number of rows make sense.
When training a forecasting model, it’s vitally important to check whether there are any gaps in the data (i.e. no data for month of March) or changes to volumes of data after a known modification to the upstream logic (i.e. filtering).
If completeness refers to missing values, correctness refers to wrong values. Incorrect values could be caused by typos or a misunderstanding by the person manually entering the data. Maybe there was a change to the schema and some old data found its way into the pipeline. Correctness is more difficult to detect than missing values. We need to communicate with the stakeholders and data analysts to determine what values are acceptable for each column. Then, we could check for the presence of outliers or values not in that set of acceptable values using a framework such as great expectations.
Correctness should also address data consistency. In other words, does everyone follow the same conventions? Are the fields named differently in one data source than another? For example, suppose the data dictionary provided by the business specifies the currency attribute as US$, however, the data has been set to use a local currency in CAD$.
Uniqueness is straightforward. Uniqueness attempts to answer the questions:
- Are there any duplicates in the data?
- What column values uniquely identify every row?
Any duplicates should be removed from the dataset used to train the model.
Is the available data at a detailed enough level? For example, we might require that the time column be in days or hours instead of aggregated by month.
Manual Data Profiling
Suppose that you receive a new dataset. You should check the following:
- Number and/or percentage of missing values
Categorical & Discrete
- Number of distinct values
- Number and/or percentage of most frequent values
We should verify that all the distinct values listed within the profiling output are documented in the dictionary. If not, we need to add the new value along with its meaning to the Data Dictionary (sometimes we need to talk to the data owner/SME to clarify the meaning of this new value) or the lack thereof.
For example, suppose that the data dictionary provided by the business says possible order statuses are: “A” — Approved, “P” — Pending, but our profiling shows 10% of records have the status “C”. After talking to the subject matter expert, we found “C” means canceled and needs to be removed from our analysis.
- Quantile statistics: minimum, maximum, median, percentiles
- Extreme values (top 10, bottom 10)
We should specifically validate if minimum/maximum values are consistent with the definition and that histogram distribution makes sense.
- Number of duplicates
We should raise concerns regarding duplicates as they waste storage space and computation resources.