## MLOps

August 22, 2022

At the time of this writing, organizations are still putting notebooks into production! Fortunately, the machine learning space is slowly beginning to adopt software engineering best practices. Among…

Written by **Cory Maklin**

"Genius is making complex ideas simple, not making simple ideas complex" - Albert Einstein

## Data Governance Checklist

August 22, 2022

I’ve heard a lot about data governance, but I still didn’t understand what it meant to implement it on a practical level. According to Google, data governance is defined as: I found that the best way…

## Transformers Explained

August 22, 2022

Since their introduction in 2017, transformers have revolutionized the world of natural language processing. Prior to Transformers, LSTMs and RNNs were the state of the art. The reason Transformers…

## Transfer Learning

August 15, 2022

Long gone are the days in which data practitioners trained machine learning models from scratch themselves. Unless you have a very specific use case, you’re better off leveraging the pre-trained…

## Fine Tuning Machine Learning Models

August 14, 2022

Transformers have revolutionized the way data practitioners build models for natural language processing. In a similar vein, the advent of transfer learning has changed the game. Rather than training…

## Word2Vec — Skip-Gram

August 11, 2022

With a few exceptions, machine learning models do not accept raw text as input. The sequences of words must first be encoded in some fashion. We could represent each sentence as a Bag of Words (BOW)…

## Memory Based Collaborative Filtering — User Based

August 10, 2022

In the early 90s, recommendation systems, particularly automated collaborative filtering, started seeing more widespread use. Fast forward to today, recommendation systems are at the core of the…

## Model Based Collaborative Filtering — SVD

August 09, 2022

Back in 2006, Netflix announced the Netflix Prize, a machine learning competition for predicting movie ratings. They offered a one million dollar prize to whoever improved the accuracy of their…

## SHAP (SHapley Additive exPlanations)

August 03, 2022

In recent years, there have been multiple scandals involving a machine learning model that made an unjust decision on the basis of gender or race. The EU is seeking to pass legislation requiring AI…

## Data Vault

August 02, 2022

Data Vault modelling is used to build data warehouses while addressing the drawbacks of 3NF (Bill Inmon) and dimensional (Ralph Kimball) modelling. Data Vault, originally conceived by Daniel…

## Data Quality

August 01, 2022

You can bet that you will be asked what kind of data issues you might encounter in your day job during one of your data engineer or data scientist interviews. Data quality will do more for model…

## Latent Dirichlet Allocation

August 01, 2022

Latent Dirichlet Allocation, or LDA for short, is an unsupervised machine learning algorithm. Similar to the clustering algorithm K-means, LDA will attempt to group words and documents into a…

## What’s the difference between a junior and a senior engineer

August 01, 2022

I’ve always wondered whether it is possible to be a senior engineer with a junior engineer title. If so, what’s the difference between the two? There have been instances in my career where, although I…

## Data Mesh Architecture

July 17, 2022

The data mesh architecture is all the rage nowadays, and for good reason. The data mesh brings to the data lakehouse what microservices brought to monolithic applications: decoupling. Allow me…

## Production Machine Learning Code

July 17, 2022

In my previous role, we had written transformations using Spark Structured Streaming in notebooks and scheduled them in Airflow using the Papermill operator. We lacked the internal expertise and…

## Isolation Forest

July 15, 2022

Isolation Forest is an unsupervised machine learning algorithm for anomaly detection. As the name implies, Isolation Forest is an ensemble method (similar to random forest). In other words, it use…

## DeepAR Forecasting Algorithm

July 15, 2022

To this day, forecasting remains one of the most valuable applications of machine learning. For instance, we could use a model to predict the demand for a product. This information could then be used…

## Pretraining Data Bias

May 15, 2022

If you’re like me, then, whenever you hear talk of artificial intelligence ethics, you can’t help but think of a professor in a philosophy department contemplating whether robots should be given the…

## Synthetic Minority Over-sampling TEchnique (SMOTE)

May 14, 2022

Synthetic Minority Over-sampling TEchnique, or SMOTE for short, is a preprocessing technique used to address a class imbalance in a dataset. In the real world, oftentimes we end up trying to train a…

## Data Lakehouses

February 22, 2022

In the previous article, we discussed why the data warehouse architecture came to prominence. We also saw how it was unsuited for unstructured data and the volumes of data inherent in Big Data. We…

## Data Warehouses

February 21, 2022

The term Data Warehouse was first coined in the 1970s. In essence, a data warehouse is a database management system (DBMS) that houses all of the enterprise’s data. The data warehouse serves as a…

## Data Lakehouse Time Travel

February 15, 2022

It’s Tuesday afternoon, you’re sitting at your cubicle, and you’re typing away at your keyboard. Earlier in the day, you volunteered to pick up the ticket to modify the ingestion pipeline, but now…

## OLTP vs OLAP

February 13, 2022

Let’s say you decide to build a Facebook clone. You and your roommate grind away for a few weeks to get the application up and running. Everything looks great, you’ve got over 100 users (including…

## Breadth First Search In Python

June 16, 2021

Breadth First Search (or BFS for short) is a graph traversal algorithm. In BFS, we visit all of the neighboring nodes at the present depth prior to moving on to the nodes at the next depth. Breadth…

## Quicksort In Python

June 15, 2021

We’ve all been guilty of it. Whenever we come across a problem that requires us to sort an array, we default to implementing bubble sort. I…

## Monte Carlo Integration

October 03, 2020

Oftentimes, we can’t solve integrals analytically and must resort to numerical methods. Among these is Monte Carlo integration. As you may remember, the integral of a function can be…

## Gibbs Sampling

October 02, 2020

Like other MCMC methods, the Gibbs sampler constructs a Markov Chain whose values converge towards a target distribution. Gibbs Sampling is in fact a specific case of the Metropolis-Hastings…

## Monte Carlo Markov Chain

August 24, 2020

A Monte Carlo Markov Chain (MCMC) is a model describing a sequence of possible events where the probability of each event depends only on the state attained in the previous event. MCMC have a wide…

## AES Encryption 256 Bit

August 20, 2020

AES (Advanced Encryption Standard) is the most widely used symmetric encryption algorithm. AES is used in a wide array of applications that include the encryption of data at rest, and secure file…

## Diffie Hellman Key Exchange

August 17, 2020

In short, the Diffie Hellman key exchange is a widely used technique for securely sending a symmetric encryption key to another party. Before proceeding, let’s discuss why we’d want to use something like the…

## XGBoost Python Example

May 09, 2020

XGBoost is short for Extreme Gradient Boost (I wrote an article that provides the gist of gradient boost here). Unlike Gradient Boost, XGBoost makes use of regularization parameters that help…

## Generative Adversarial Networks

May 06, 2020

Generative Adversarial Networks, or GANs for short, are a type of neural network that can be used to generate data rather than attempt to classify it. Although slightly disturbing, the following site…

## Fast Fourier Transform

December 29, 2019

If you have a background in electrical engineering, you will, in all probability, have heard of the Fourier Transform. In layman's terms, the Fourier Transform is a mathematical operation that…

## Independent Component Analysis (ICA) In Python

August 22, 2019

Suppose that you’re at a house party and you’re talking to some cute girl. As you listen, your ears are being bombarded by the sound coming from the conversations going on between different groups…

## Random Forest In Python

August 21, 2019

Random forest is one of the most popular machine learning algorithms out there. Like decision trees, random forest can be applied to both…

## KL Divergence Python Example

August 20, 2019

We can think of the KL divergence as a distance metric (although it isn’t symmetric) that quantifies the difference between two probability distributions.

## Ridge Regression Python Example

August 19, 2019

A tutorial on how to implement Ridge Regression from scratch in Python using NumPy.

## Least Squares Linear Regression In Python

August 16, 2019

As the name implies, the method of Least Squares minimizes the sum of the squares of the residuals between the observed targets in the…

## Support Vector Machine Python Example

August 12, 2019

Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing classification, regression and even outlier detection. The linear SVM classifier works by drawing a…

## t-SNE Python Example

August 10, 2019

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to represent a high-dimensional dataset in a low-dimensional space of two or three dimensions so that…

## Singular Value Decomposition Example In Python

August 05, 2019

Singular Value Decomposition, or SVD, has a wide array of applications. These include dimensionality reduction, image compression, and denoising data. In essence, SVD states that a matrix can be…

## Linear Discriminant Analysis In Python

August 04, 2019

Linear Discriminant Analysis (LDA) is a dimensionality reduction technique which minimizes the variance and maximizes the distance between…

## Logistic Regression In Python

August 03, 2019

An explanation of the Logistic Regression algorithm with an example of how to implement it in Python.

## Random Forest In R

July 30, 2019

A tutorial on how to implement the random forest algorithm in R.

## Decision Tree In Python

July 27, 2019

An example of how to implement a decision tree classifier in Python.

## Linear Regression In Python

July 26, 2019

An example of how to implement linear regression in Python.

## MNIST Dataset Python Example Using CNN

July 22, 2019

A tutorial on how to perform image classification on the MNIST dataset using convolutional neural networks (CNN) and Python.

## K Nearest Neighbor Algorithm In Python

July 22, 2019

A tutorial on how to use the k nearest neighbor algorithm to classify data in Python.

## Gaussian Mixture Models Clustering Algorithm Explained

July 15, 2019

Gaussian mixture models can be used to cluster unlabeled data in much the same way as k-means. There are, however, a couple of advantages to using Gaussian mixture models over k-means. First and…

## Spectral Clustering Algorithm Implemented From Scratch

July 14, 2019

Spectral clustering is a popular unsupervised machine learning algorithm which often outperforms other approaches. In addition, spectral clustering is very simple to implement and can be solved…

## Affinity Propagation Algorithm Explained

July 02, 2019

Affinity Propagation was first published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you to…

## Affinity Propagation Algorithm Explained

July 01, 2019

Affinity Propagation was first published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you to…

## Unsupervised Machine Learning: Affinity Propagation Algorithm Explained

July 01, 2019

The Affinity Propagation algorithm was published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you…

## BIRCH Clustering Algorithm Example In Python

July 01, 2019

Existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (i.e. memory and CPU cycles). In consequence, as the dataset…

## Machine Learning: BIRCH Clustering Algorithm Clearly Explained

July 01, 2019

Existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (i.e. memory and CPU cycles). In consequence, as the dataset…

## DBSCAN Python Example: The Optimal Value For Epsilon (EPS)

June 30, 2019

DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to classify unlabeled data…

## Machine Learning Clustering: DBSCAN Determine The Optimal Value For Epsilon (EPS) Python Example

June 30, 2019

Density-Based Spatial Clustering of Applications with Noise, or DBSCAN for short, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to classify…

## Machine Learning At Scale With Apache Spark MLlib Python Example

June 30, 2019

For most of their history, computer processors became faster every year. Unfortunately, this trend in hardware stopped around 2005. Due to limits in heat dissipation, hardware developers stopped…

## Spark MLlib Python Example — Machine Learning At Scale

June 30, 2019

An example of how to train a logistic regression model at scale using Apache Spark MLlib and Python.

## LSTM Recurrent Neural Network Keras Example

June 14, 2019

A step-by-step tutorial on how to perform sentiment analysis using an LSTM recurrent neural network implemented with Keras.

## Word Embeddings Python Example - Sentiment Analysis

June 10, 2019

Word embeddings are used to reduce the number of features in NLP (Natural Language Processing) based problems such as sentiment analysis.

## Batch Normalization Tensorflow Keras Example

June 08, 2019

An example of how to implement batch normalization using TensorFlow Keras in order to prevent overfitting.

## Metrics For Evaluating Machine Learning Classification Models

June 07, 2019

A tutorial on the various methods for evaluating a classification model's performance.

## ARIMA Model Python Example - Time Series Forecasting

May 25, 2019

An example of how to perform time series forecasting by building an ARIMA model in Python.

## Conv net - Image Classification Tensorflow Keras Example

May 23, 2019

A tutorial on how to perform image classification using a conv net and TensorFlow Keras.

## Gradient Boosting Decision Tree Algorithm Explained

May 17, 2019

An in-depth explanation of the gradient boosting decision tree algorithm.

## Apache NiFi And Kafka Docker Example

May 12, 2019

An example of how to publish data to a Kafka Docker container using a NiFi processor.

## Big Data: Apache Kafka, Schema Registry And Avro Records

May 11, 2019

The most common ways to store data are CSV, XML and JSON. JSON is less verbose than XML, but both still use a lot of space compared to binary formats. In JSON, you repeat every field name with every…

## Naive Bayes Classifier Algorithm And Assumption Explained

May 06, 2019

An explanation of the Naive Bayes classifier algorithm and assumption.

## TF IDF | TFIDF Python Example

May 05, 2019

An example of how to implement TFIDF (TF IDF) from scratch with Python.

## Principal Component Analysis Example In Python

May 04, 2019

Principal Component Analysis, or PCA, is used to reduce the number of features without the loss of too much information. The problem with having too many dimensions is that it makes it difficult to…

## K-Fold Cross Validation Example Using Sklearn

May 03, 2019

An example of how to use k-fold cross validation with sklearn to estimate hyperparameters.

## R Squared Interpretation | R Squared Linear Regression

April 30, 2019

How to calculate and interpret R Squared. An example which covers the meaning of the R Squared score in relation to linear regression.

## Apache Hadoop - What Is YARN | HDFS | MapReduce

April 24, 2019

A brief explanation of Apache Hadoop. A high-level overview of YARN, HDFS and MapReduce.

## Understanding Stream Processing And Apache Kafka

April 17, 2019

A high-level overview of stream processing and how it relates to Apache Kafka.

## An Overview Of Monolithic, Service Oriented And Event Driven Architectures

April 04, 2019

People rightly believed (although maybe a bit too optimistically at first) that the advent of the Internet would bring about a revolution in the realm of commerce. Following the dotcom boom, a new…

## Big Data Analytics: Getting Started With Elasticsearch

March 31, 2019

The Elastic Stack has recently risen to fame in the realm of Big Data analytics and machine learning. The Elastic Stack is a suite of tools (i.e. Elasticsearch, Logstash, Kibana and Beats) for…

## Tech Quickie — Connecting To A Virtual Machine From The Host Using SSH

March 31, 2019

For a GUI-less server, the shared clipboard functionality of VirtualBox Guest Additions does not work, as a text-based server does not have a clipboard. Therefore, if you want to use copy and paste…

## RAM Specs Explained

March 30, 2019

A brief overview of the meaning behind RAM specs.

## CPU Specs Explained

March 29, 2019

A brief overview of the meaning behind CPU specs.

## Local Storage In JavaScript / HTML5 Tutorial

March 20, 2019

A tutorial on how to access data from Local Storage, Session Storage and IndexedDB using the JavaScript API.

## Compiler vs Interpreter | Why C Is More Efficient Than Python

March 19, 2019

An explanation of one of the reasons why compiled languages like C are more efficient than interpreted ones such as Python.

## Public Key vs Private Key | Asymmetric vs Symmetric Encryption

March 10, 2019

An explanation of the differences between public and private keys. A comparison of asymmetric and symmetric encryption.

## Hierarchical Agglomerative Clustering Algorithm Example In Python

December 31, 2018

Hierarchical clustering algorithms group similar objects into groups called clusters. Learn how to implement hierarchical clustering in Python.

## Mean Shift Clustering Algorithm Example In Python

December 31, 2018

Mean Shift is a hierarchical clustering algorithm. In contrast to supervised machine learning algorithms, clustering attempts to group data without having first been trained on labeled data. Clustering…

## Machine Learning Algorithms Part 11: Ridge Regression, Lasso Regression And Elastic-Net Regression

December 30, 2018

Supervised learning problems can be further grouped into Classification and Regression problems. As opposed to classification problems, regression has the task of predicting a continuous quantity…

## Machine Learning Algorithms Part 10: Logistic Regression Example In Python

December 30, 2018

Logistic Regression is a supervised machine learning algorithm used in the classification of data. For example, suppose that given their income, we wanted to predict whether a customer would buy a…

## K-means Clustering Python Example

December 28, 2018

K-Means Clustering is an unsupervised machine learning algorithm. In contrast to traditional supervised machine learning algorithms, K-Means attempts to classify data without having first been…

## Machine Learning Algorithms Part 7: Linear Support Vector Machine In Python

December 27, 2018

Linear Support Vector Machine (or LSVM) is a supervised learning method that looks at data and sorts it into one of two categories. LSVM works by drawing a line between two classes. All the data…

## Machine Learning Algorithms Part 6: K-Nearest Neighbors In Python

December 26, 2018

K-Nearest Neighbors (or KNN) is one of the simplest machine learning algorithms and is used in a wide array of institutions. KNN is a non-parametric, lazy learning algorithm. When we say a technique…

## Machine Learning Algorithms Part 5: Random Forest Classification In Python

December 26, 2018

The random forest algorithm makes use of multiple decision trees. It can solve both regression and classification problems. With Random Forest, however, learning may be slow (depending on the…

## AWS Amplify, Cognito And React Example

December 14, 2018

A tutorial on how to create a sign up form using AWS Amplify, Cognito and React.

## Lambda AWS Example | DynamoDB, API Gateway S3 FullStack App

December 12, 2018

A tutorial on how to build a fullstack application that leverages AWS Lambda, DynamoDB, API Gateway and S3.

## AWS CLI S3 Static Website Hosting

December 04, 2018

A tutorial on how to host a static website from an S3 bucket using the AWS CLI.

## AWS CLI DynamoDB Query Example

December 03, 2018

A tutorial on how to create and query a DynamoDB table using the AWS CLI.

## AWS CLI Lambda Function Example

December 01, 2018

A tutorial on how to create Lambda functions using the AWS CLI.

## Microsoft Azure CLI Commands | Cloud Init

November 19, 2018

A tutorial on how to create an NGINX server in the cloud using the Azure CLI and Cloud Init.

## AWS CLI EC2 Tutorial

November 19, 2018

A tutorial on how to create EC2 instances using the AWS CLI.

## Microsoft Azure CLI Commands | Virtual Machines

November 17, 2018

A tutorial on how to create virtual machines in the cloud using the Azure CLI.

## Machine Learning: Convolutional Neural Networks With TensorFlow In Python

November 11, 2018

It’s only a matter of time before self-driving cars become widespread. This tremendous feat of engineering wouldn’t be possible without convolutional neural networks. The algorithm used by…

## Machine Learning: Linear Regression Example With TensorFlow In Python

November 10, 2018

Linear regression is the most basic form of machine learning. In linear regression, we attempt to determine the best fitting line for our data. In the following article, we’ll go through a simple…

## Introduction To Machine Learning: Reducing Loss With Gradient Descent

November 09, 2018

In the following article, we’ll delve into how to train our machine learning models or, in other words, how to minimize loss. In the context of machine learning, when people are speaking about a…

## Introduction To Machine Learning: An Overview Of Deep Neural Networks

October 28, 2018

The MNIST dataset is often referred to as the “Hello World” of machine learning programs for computer vision. The MNIST dataset is composed of 28x28 pixel images of handwritten digits (0, 1, 2…