All posts

Written by Cory Maklin. "Genius is making complex ideas simple, not making simple ideas complex" - Albert Einstein

MLOps
August 22, 2022
At the time of this writing, organizations are still putting notebooks into production! Fortunately, the machine learning space is slowly beginning to adopt software engineering best practices. Among…
Data Governance Checklist
August 22, 2022
I’ve heard a lot about data governance, but I still didn’t understand what it meant to implement it on a practical level. According to Google, data governance is defined as: I found that the best way…
Transformers Explained
August 22, 2022
Since their introduction in 2017, transformers have revolutionized the world of natural language processing. Prior to Transformers, LSTMs and RNNs were the state of the art. The reason Transformers…
Transfer Learning
August 15, 2022
Long gone are the days in which data practitioners trained machine learning models from scratch themselves. Unless you have a very specific use case, you’re better off leveraging the pre-trained…
Fine Tuning Machine Learning Models
August 14, 2022
Transformers have revolutionized the way data practitioners build models for natural language processing. In a similar vein, the advent of transfer learning has changed the game. Rather than training…
Word2Vec — Skip-Gram
August 11, 2022
With a few exceptions, machine learning models do not accept raw text as input. The sequences of words must first be encoded in some fashion. We could represent each sentence as a Bag of Words (BOW)…
Memory Based Collaborative Filtering — User Based
August 10, 2022
In the early 90s, recommendation systems, particularly automated collaborative filtering, started seeing more widespread use. Fast forward to today, recommendation systems are at the core of the…
Model Based Collaborative Filtering — SVD
August 09, 2022
Back in 2006, Netflix announced the Netflix Prize, a machine learning competition for predicting movie ratings. They offered a one million dollar prize to whoever improved the accuracy of their…
SHAP (SHapley Additive exPlanations)
August 03, 2022
In recent years, there have been multiple scandals involving a machine learning model that made an unjust decision on the basis of gender or race. The EU is seeking to pass legislation requiring AI…
Data Vault
August 02, 2022
Data Vault modelling is used to build data warehouses while addressing the drawbacks of 3NF (Bill Inmon) and dimensional (Ralph Kimball) modelling. Data Vault, originally conceived by Daniel…
Data Quality
August 01, 2022
You can bet that you will be asked what kind of data issues you might encounter in your day job during one of your data engineer or data scientist interviews. Data quality will do more for model…
Latent Dirichlet Allocation
August 01, 2022
Latent Dirichlet Allocation, or LDA for short, is an unsupervised machine learning algorithm. Similar to the clustering algorithm K-means, LDA will attempt to group words and documents into a…
What’s the difference between a junior and a senior engineer
August 01, 2022
I’ve always wondered if it is possible to be a senior engineer with a junior engineer title. If so, what’s the difference between the two? There have been instances in my career where, although I…
Data Mesh Architecture
July 17, 2022
The data mesh architecture is all the rage nowadays, and for good reason. The data mesh brings to the data lakehouse what microservices brought to monolithic applications, that is, decoupling. Allow me…
Production Machine Learning Code
July 17, 2022
In my previous role, we had written transformations using Spark Structured Streaming in notebooks and scheduled them in Airflow using the Papermill operator. We lacked the internal expertise and…
Isolation Forest
July 15, 2022
Isolation Forest is an unsupervised machine learning algorithm for anomaly detection. As the name implies, Isolation Forest is an ensemble method (similar to random forest). In other words, it use…
DeepAR Forecasting Algorithm
July 15, 2022
To this day, forecasting remains one of the most valuable applications of machine learning. For instance, we could use a model to predict the demand of a product. This information could then be used…
Pretraining Data Bias
May 15, 2022
If you’re like me, then, whenever you hear talk of artificial intelligence ethics, you can’t help but think of a professor in a philosophy department contemplating whether robots should be given the…
Synthetic Minority Over-sampling TEchnique (SMOTE)
May 14, 2022
Synthetic Minority Over-sampling TEchnique, or SMOTE for short, is a preprocessing technique used to address a class imbalance in a dataset. In the real world, oftentimes we end up trying to train a…
Data Lakehouses
February 22, 2022
In the previous article, we discussed why the data warehouse architecture came to prominence. We also saw how it was unsuited for unstructured data and the volumes of data inherent in Big Data. We…
Data Warehouses
February 21, 2022
The term Data Warehouse was first coined in the 1970s. In essence, a data warehouse is a database management system (DBMS) that houses all of the enterprise’s data. The data warehouse serves as a…
Data Lakehouse Time Travel
February 15, 2022
It’s Tuesday afternoon, you’re sitting at your cubicle, and you’re typing away at your keyboard. Earlier in the day, you volunteered to pick up the ticket to modify the ingestion pipeline, but now…
OLTP vs OLAP
February 13, 2022
Let’s say you decide to build a Facebook clone. You and your roommate grind away for a few weeks to get the application up and running. Everything looks great, you’ve got over 100 users (including…
Breadth First Search In Python
June 16, 2021
Breadth First Search (or BFS for short) is a graph traversal algorithm. In BFS, we visit all of the neighboring nodes at the present depth prior to moving on to the nodes at the next depth. Breadth…
Quicksort In Python
June 15, 2021
We’ve all been guilty of it. Whenever we come across a problem that requires us to sort an array, we default to implementing bubble sort. I…
Monte Carlo Integration
October 03, 2020
Oftentimes, we can’t solve integrals analytically and must resort to numerical methods. Among these is Monte Carlo integration. As you may remember, the integral of a function can be…
Gibbs Sampling
October 02, 2020
Like other MCMC methods, the Gibbs sampler constructs a Markov Chain whose values converge towards a target distribution. Gibbs Sampling is in fact a specific case of the Metropolis-Hastings…
Monte Carlo Markov Chain
August 24, 2020
Markov Chain Monte Carlo (MCMC) methods build on the Markov chain, a model describing a sequence of possible events where the probability of each event depends only on the state attained in the previous event. MCMC methods have a wide…
AES Encryption 256 Bit
August 20, 2020
AES (Advanced Encryption Standard) is the most widely used symmetric encryption algorithm. AES is used in a wide array of applications that include the encryption of data at rest, and secure file…
Diffie Hellman Key Exchange
August 17, 2020
In short, the Diffie Hellman key exchange is a widely used technique for securely establishing a shared symmetric encryption key with another party. Before proceeding, let’s discuss why we’d want to use something like the…
XGBoost Python Example
May 09, 2020
XGBoost is short for Extreme Gradient Boosting (I wrote an article that provides the gist of gradient boost here). Unlike Gradient Boost, XGBoost makes use of regularization parameters that help…
Generative Adversarial Networks
May 06, 2020
Generative Adversarial Networks, or GANs for short, are a type of neural network that can be used to generate data rather than attempt to classify it. Although slightly disturbing, the following site…
Fast Fourier Transform
December 29, 2019
If you have a background in electrical engineering, you will, in all probability, have heard of the Fourier Transform. In layman's terms, the Fourier Transform is a mathematical operation that…
Independent Component Analysis (ICA) In Python
August 22, 2019
Suppose that you’re at a house party and you’re talking to some cute girl. As you listen, your ears are being bombarded by the sound coming from the conversations going on between different groups…
Random Forest In Python
August 21, 2019
Random forest is one of the most popular machine learning algorithms out there. Like decision trees, random forest can be applied to both…
KL Divergence Python Example
August 20, 2019
We can think of the KL divergence as a distance-like measure (although it isn’t symmetric, so it is not a true metric) that quantifies the difference between two probability distributions.
Ridge Regression Python Example
August 19, 2019
A tutorial on how to implement Ridge Regression from scratch in Python using Numpy.
Least Squares Linear Regression In Python
August 16, 2019
As the name implies, the method of Least Squares minimizes the sum of the squares of the residuals between the observed targets in the…
Support Vector Machine Python Example
August 12, 2019
Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing classification, regression and even outlier detection. The linear SVM classifier works by drawing a…
t-SNE Python Example
August 10, 2019
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to represent a high-dimensional dataset in a low-dimensional space of two or three dimensions so that…
Singular Value Decomposition Example In Python
August 05, 2019
Singular Value Decomposition, or SVD, has a wide array of applications. These include dimensionality reduction, image compression, and denoising data. In essence, SVD states that a matrix can be…
Linear Discriminant Analysis In Python
August 04, 2019
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique which minimizes the variance and maximizes the distance between…
Logistic Regression In Python
August 03, 2019
An explanation of the Logistic Regression algorithm with an example of how to implement it in Python.
Random Forest In R
July 30, 2019
A tutorial on how to implement the random forest algorithm in R.
Decision Tree In Python
July 27, 2019
An example of how to implement a decision tree classifier in Python.
Linear Regression In Python
July 26, 2019
An example of how to implement linear regression in Python.
MNIST Dataset Python Example Using CNN
July 22, 2019
A tutorial on how to perform image classification on the MNIST dataset using convolutional neural networks (CNN) and Python.
K Nearest Neighbor Algorithm In Python
July 22, 2019
A tutorial on how to use the k nearest neighbor algorithm to classify data in Python.
Gaussian Mixture Models Clustering Algorithm Explained
July 15, 2019
Gaussian mixture models can be used to cluster unlabeled data in much the same way as k-means. There are, however, a couple of advantages to using Gaussian mixture models over k-means. First and…
Spectral Clustering Algorithm Implemented From Scratch
July 14, 2019
Spectral clustering is a popular unsupervised machine learning algorithm which often outperforms other approaches. In addition, spectral clustering is very simple to implement and can be solved…
Affinity Propagation Algorithm Explained
July 02, 2019
Affinity Propagation was first published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you to…
Affinity Propagation Algorithm Explained
July 01, 2019
Affinity Propagation was first published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you to…
Unsupervised Machine Learning: Affinity Propagation Algorithm Explained
July 01, 2019
The Affinity Propagation algorithm was published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you…
BIRCH Clustering Algorithm Example In Python
July 01, 2019
Existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (i.e. memory and cpu cycles). In consequence, as the dataset…
Machine Learning: BIRCH Clustering Algorithm Clearly Explained
July 01, 2019
Existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (i.e. memory and cpu cycles). In consequence, as the dataset…
DBSCAN Python Example: The Optimal Value For Epsilon (EPS)
June 30, 2019
DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to classify unlabeled data…
Machine Learning Clustering: DBSCAN Determine The Optimal Value For Epsilon (EPS) Python Example
June 30, 2019
Density-Based Spatial Clustering of Applications with Noise, or DBSCAN for short, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to classify…
Machine Learning At Scale With Apache Spark MLlib Python Example
June 30, 2019
For most of their history, computer processors became faster every year. Unfortunately, this trend in hardware stopped around 2005. Due to limits in heat dissipation, hardware developers stopped…
Spark MLlib Python Example — Machine Learning At Scale
June 30, 2019
An example of how to train a logistic regression model at scale using Apache Spark MLlib and Python.
LSTM Recurrent Neural Network Keras Example
June 14, 2019
A step-by-step tutorial on how to perform sentiment analysis using an LSTM recurrent neural network implemented with Keras.
Word Embeddings Python Example - Sentiment Analysis
June 10, 2019
Word embeddings are used to reduce the number of features in NLP (Natural Language Processing) based problems such as sentiment analysis.
Batch Normalization Tensorflow Keras Example
June 08, 2019
An example of how to implement batch normalization using TensorFlow Keras in order to prevent overfitting.
Metrics For Evaluating Machine Learning Classification Models
June 07, 2019
A tutorial on the various methods for evaluating a classification model's performance.
ARIMA Model Python Example - Time Series Forecasting
May 25, 2019
An example of how to perform time series forecasting by building an ARIMA model in Python.
Conv net - Image Classification Tensorflow Keras Example
May 23, 2019
A tutorial on how to perform image classification using a conv net and TensorFlow Keras.
Gradient Boosting Decision Tree Algorithm Explained
May 17, 2019
An in-depth explanation of the gradient boosting decision tree algorithm.
Apache NiFi And Kafka Docker Example
May 12, 2019
An example of how to publish data to a Kafka Docker container using a NiFi processor.
Big Data: Apache Kafka, Schema Registry And Avro Records
May 11, 2019
The most common ways to store data are CSV, XML and JSON. JSON is less verbose than XML, but both still use a lot of space compared to binary formats. In JSON, you repeat every field name with every…
Naive Bayes Classifier Algorithm And Assumption Explained
May 06, 2019
An explanation of the Naive Bayes classifier algorithm and its assumption.
TF IDF | TFIDF Python Example
May 05, 2019
An example of how to implement TFIDF (TF IDF) from scratch with Python.
Principal Component Analysis Example In Python
May 04, 2019
Principal Component Analysis or PCA is used to reduce the number of features without the loss of too much information. The problem with having too many dimensions is that it makes it difficult to…
K-Fold Cross Validation Example Using Sklearn
May 03, 2019
An example of how to use k-fold cross validation with sklearn to tune hyperparameters.
R Squared Interpretation | R Squared Linear Regression
April 30, 2019
How to calculate and interpret R Squared. An example which covers the meaning of the R Squared score in relation to linear regression.
Apache Hadoop - What Is YARN | HDFS | MapReduce
April 24, 2019
A brief explanation of Apache Hadoop. A high-level overview of YARN, HDFS and MapReduce.
Understanding Stream Processing And Apache Kafka
April 17, 2019
A high-level overview of stream processing and how it relates to Apache Kafka.
An Overview Of Monolithic, Service Oriented And Event Driven Architectures
April 04, 2019
People rightly believed (although maybe a bit too optimistically at first) that the advent of the Internet would bring about a revolution in the realm of commerce. Following the dotcom boom, a new…
Big Data Analytics —Getting Started With Elasticsearch
March 31, 2019
The Elastic Stack has recently risen to fame in the realm of Big Data analytics and machine learning. The Elastic Stack is a suite of tools (i.e. Elasticsearch, Logstash, Kibana and Beats) for…
Tech Quickie — Connecting To A Virtual Machine From The Host Using SSH
March 31, 2019
For a GUI-less server, the shared clipboard functionality of VirtualBox Guest Additions does not work, as a text-based server does not have a clipboard. Therefore, if you want to use copy and paste…
RAM Specs Explained
March 30, 2019
A brief overview of the meaning behind RAM specs.
CPU Specs Explained
March 29, 2019
A brief overview of the meaning behind CPU specs.
Local Storage In JavaScript / HTML5 Tutorial
March 20, 2019
A tutorial on how to access data from Local Storage, Session Storage and IndexedDB using the JavaScript API.
Compiler vs Interpreter | Why C Is More Efficient Than Python
March 19, 2019
An explanation of one of the reasons why compiled languages like C are more efficient than interpreted ones such as Python.
Public Key vs Private Key | Asymmetric vs Symmetric Encryption
March 10, 2019
An explanation of the differences between public and private keys. A comparison of asymmetric and symmetric encryption.
Hierarchical Agglomerative Clustering Algorithm Example In Python
December 31, 2018
Hierarchical clustering algorithms group similar objects into groups called clusters. Learn how to implement hierarchical clustering in Python.
Mean Shift Clustering Algorithm Example In Python
December 31, 2018
Mean Shift is a centroid-based clustering algorithm. In contrast to supervised machine learning algorithms, clustering attempts to group data without having first been trained on labeled data. Clustering…
Machine Learning Algorithms Part 11: Ridge Regression, Lasso Regression And Elastic-Net Regression
December 30, 2018
Supervised learning problems can be further grouped into Classification and Regression problems. As opposed to classification problems, regression has the task of predicting a continuous quantity…
Machine Learning Algorithms Part 10: Logistic Regression Example In Python
December 30, 2018
Logistic Regression is a supervised machine learning algorithm used in the classification of data. For example, suppose that given their income, we wanted to predict whether a customer would buy a…
K-means Clustering Python Example
December 28, 2018
K-Means Clustering is an unsupervised machine learning algorithm. In contrast to traditional supervised machine learning algorithms, K-Means attempts to classify data without having first been…
Machine Learning Algorithms Part 7: Linear Support Vector Machine In Python
December 27, 2018
Linear Support Vector Machine (or LSVM) is a supervised learning method that looks at data and sorts it into one of two categories. LSVM works by drawing a line between two classes. All the data…
Machine Learning Algorithms Part 6: K-Nearest Neighbors In Python
December 26, 2018
K-Nearest Neighbors (or KNN) is one of the simplest machine learning algorithms and is used in a wide array of institutions. KNN is a non-parametric, lazy learning algorithm. When we say a technique…
Machine Learning Algorithms Part 5: Random Forest Classification In Python
December 26, 2018
The random forest algorithm makes use of multiple decision trees. It can solve both regression and classification problems. With Random Forest, however, learning may be slow (depending on the…
AWS Amplify, Cognito And React Example
December 14, 2018
A tutorial on how to create a sign up form using AWS Amplify, Cognito and React.
Lambda AWS Example | DynamoDB, API Gateway S3 FullStack App
December 12, 2018
A tutorial on how to build a fullstack application that leverages AWS Lambda, DynamoDB, API Gateway and S3.
AWS CLI S3 Static Website Hosting
December 04, 2018
A tutorial on how to host a static website from an S3 bucket using the AWS CLI.
AWS CLI DynamoDB Query Example
December 03, 2018
A tutorial on how to create and query a DynamoDB table using the AWS CLI.
AWS CLI Lambda Function Example
December 01, 2018
A tutorial on how to create Lambda functions using the AWS CLI.
Microsoft Azure CLI Commands | Cloud Init
November 19, 2018
A tutorial on how to create an NGINX server in the cloud using the Azure CLI and Cloud Init.
AWS CLI EC2 Tutorial
November 19, 2018
A tutorial on how to create EC2 instances using the AWS CLI.
Microsoft Azure CLI Commands | Virtual Machines
November 17, 2018
A tutorial on how to create virtual machines in the cloud using the Azure CLI.
Machine Learning: Convolutional Neural Networks With TensorFlow In Python
November 11, 2018
It’s only a matter of time before self-driving cars become widespread. This tremendous feat of engineering wouldn’t be possible without convolutional neural networks. The algorithm used by…
Machine Learning: Linear Regression Example With TensorFlow In Python
November 10, 2018
Linear regression is one of the most basic forms of machine learning. In linear regression, we attempt to determine the best fitting line for our data. In the following article, we’ll go through a simple…
Introduction To Machine Learning: Reducing Loss With Gradient Descent
November 09, 2018
In the following article, we’ll delve into how to train our machine learning models, or in other words, how to minimize loss. In the context of machine learning, when people are speaking about a…
Introduction To Machine Learning: An Overview Of Deep Neural Networks
October 28, 2018
The MNIST dataset is often referred to as the “Hello World” of machine learning programs for computer vision. The MNIST dataset is composed of 28x28 pixel images of handwritten digits (0, 1, 2…