Written by Cory Maklin. "Genius is making complex ideas simple, not making simple ideas complex" - Albert Einstein

  1. August 22, 2022

    At the time of this writing, organizations are still putting notebooks into production! Fortunately, the machine learning space is slowly beginning to adopt software engineering best practices. Among…

  2. August 22, 2022

    I’ve heard a lot about data governance, but I still didn’t understand what it meant to implement it on a practical level. According to Google, data governance is defined as: I found that the best way…

  3. August 22, 2022

    Since their introduction in 2017, Transformers have revolutionized the world of natural language processing. Prior to Transformers, LSTMs and RNNs were the state of the art. The reason Transformers…

  4. August 15, 2022

    Long gone are the days in which data practitioners trained machine learning models from scratch themselves. Unless you have a very specific use case, you’re better off leveraging the pre-trained…

  5. August 14, 2022

    Transformers have revolutionized the way data practitioners build models for natural language processing. In a similar vein, the advent of transfer learning has changed the game. Rather than training…

  6. August 11, 2022

    With a few exceptions, machine learning models do not accept raw text as input. The sequences of words must first be encoded in some fashion. We could represent each sentence as a Bag of Words (BOW)…

  7. August 09, 2022

    Back in 2006, Netflix announced the Netflix Prize, a machine learning competition for predicting movie ratings. They offered a one million dollar prize to whoever improved the accuracy of their…

  8. August 03, 2022

    In recent years, there have been multiple scandals involving a machine learning model that made an unjust decision on the basis of gender or race. The EU is seeking to pass legislation requiring AI…

  9. August 02, 2022

    Data Vault modelling is used to build data warehouses while addressing the drawbacks of 3NF (Bill Inmon) and dimensional (Ralph Kimball) modelling. Data Vault, originally conceived by Daniel…

  10. August 01, 2022

    You can bet that you will be asked what kind of data issues you might encounter in your day job during one of your data engineer or data scientist interviews. Data quality will do more for model…

  11. August 01, 2022

    Latent Dirichlet Allocation, or LDA for short, is an unsupervised machine learning algorithm. Similar to the clustering algorithm K-means, LDA will attempt to group words and documents into a…

  12. July 17, 2022

    The data mesh architecture is all the rage nowadays, and for good reason. The data mesh brings to the data lakehouse what microservices brought to monolithic applications, that is, decoupling. Allow me…

  13. July 17, 2022

    In my previous role, we had written transformations using Spark Structured Streaming in notebooks and scheduled them in Airflow using the Papermill operator. We lacked the internal expertise and…

  14. July 15, 2022

    Isolation Forest is an unsupervised machine learning algorithm for anomaly detection. As the name implies, Isolation Forest is an ensemble method (similar to random forest). In other words, it use…

  15. July 15, 2022

    To this day, forecasting remains one of the most valuable applications of machine learning. For instance, we could use a model to predict the demand for a product. This information could then be used…

  16. May 15, 2022

    If you’re like me, then, whenever you hear talk of artificial intelligence ethics, you can’t help but think of a professor in a philosophy department contemplating whether robots should be given the…

  17. May 14, 2022

    Synthetic Minority Over-sampling TEchnique, or SMOTE for short, is a preprocessing technique used to address a class imbalance in a dataset. In the real world, oftentimes we end up trying to train a…

  18. February 22, 2022

    In the previous article, we discussed why the data warehouse architecture came to prominence. We also saw how it was unsuited for unstructured data and the volumes of data inherent in Big Data. We…

  19. February 21, 2022

    The term Data Warehouse was first coined in the 1970s. In essence, a data warehouse is a database management system (DBMS) that houses all of the enterprise’s data. The data warehouse serves as a…

  20. February 15, 2022

    It’s Tuesday afternoon, you’re sitting at your cubicle, and you’re typing away at your keyboard. Earlier in the day, you volunteered to pick up the ticket to modify the ingestion pipeline, but now…

  21. February 13, 2022

    Let’s say you decide to build a Facebook clone. You and your roommate grind away for a few weeks to get the application up and running. Everything looks great, you’ve got over 100 users (including…

  22. June 16, 2021

    Breadth First Search (or BFS for short) is a graph traversal algorithm. In BFS, we visit all of the neighboring nodes at the present depth prior to moving on to the nodes at the next depth. Breadth…

  23. June 15, 2021

    Quicksort In Python. We’ve all been guilty of it. Whenever we come across a problem that requires us to sort an array, we default to implementing bubble sort. I…

  24. October 03, 2020

    Oftentimes, we can’t solve integrals analytically and must resort to numerical methods. Among these is Monte Carlo integration. As you may remember, the integral of a function can be…

  25. October 02, 2020

    Like other MCMC methods, the Gibbs sampler constructs a Markov Chain whose values converge towards a target distribution. Gibbs Sampling is in fact a specific case of the Metropolis-Hastings…

  26. August 24, 2020

    Markov Chain Monte Carlo (MCMC) methods are built on Markov chains, models describing a sequence of possible events where the probability of each event depends only on the state attained in the previous event. MCMC methods have a wide…

  27. August 20, 2020

    AES (Advanced Encryption Standard) is the most widely used symmetric encryption algorithm. AES is used in a wide array of applications that include the encryption of data at rest, and secure file…

  28. August 17, 2020

    In short, Diffie-Hellman is a widely used technique for securely establishing a shared symmetric encryption key with another party. Before proceeding, let’s discuss why we’d want to use something like the…

  29. May 09, 2020

    XGBoost is short for eXtreme Gradient Boosting (I wrote an article that provides the gist of gradient boost here). Unlike Gradient Boost, XGBoost makes use of regularization parameters that help…

  30. May 06, 2020

    Generative Adversarial Networks, or GANs for short, are a type of neural network that can be used to generate data rather than attempt to classify it. Although slightly disturbing, the following site…

  31. December 29, 2019

    If you have a background in electrical engineering, you will, in all probability, have heard of the Fourier Transform. In layman's terms, the Fourier Transform is a mathematical operation that…

  32. August 22, 2019

    Suppose that you’re at a house party and you’re talking to some cute girl. As you listen, your ears are being bombarded by the sound coming from the conversations going on between different groups…

  33. August 21, 2019

    Random forest is one of the most popular machine learning algorithms out there. Like decision trees, random forest can be applied to both…

  34. August 20, 2019

    We can think of the KL divergence as a distance measure (although it isn’t symmetric) that quantifies the difference between two probability distributions.

  35. August 12, 2019

    Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing classification, regression and even outlier detection. The linear SVM classifier works by drawing a…

  36. August 10, 2019

    t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to represent a high-dimensional dataset in a low-dimensional space of two or three dimensions so that…

  37. August 05, 2019

    Singular Value Decomposition, or SVD, has a wide array of applications. These include dimensionality reduction, image compression, and denoising data. In essence, SVD states that a matrix can be…

  38. August 04, 2019

    Linear Discriminant Analysis (LDA) is a dimensionality reduction technique which minimizes the variance and maximizes the distance between…

  39. August 03, 2019

    An explanation of the Logistic Regression algorithm with an example of how to implement it in Python.

  40. July 30, 2019

    A tutorial on how to implement the random forest algorithm in R.

  41. July 02, 2019

    Affinity Propagation was first published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you to…

  42. July 01, 2019

    Affinity Propagation was first published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you to…

  43. July 01, 2019

    Existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (i.e., memory and CPU cycles). Consequently, as the dataset…

  44. May 04, 2019

    Principal Component Analysis, or PCA, is used to reduce the number of features without losing too much information. The problem with having too many dimensions is that it makes it difficult to…

  45. December 31, 2018

    Mean Shift is a density-based (mode-seeking) clustering algorithm. In contrast to supervised machine learning algorithms, clustering attempts to group data without having first been trained on labeled data. Clustering…

  46. December 28, 2018

    K-Means Clustering is an unsupervised machine learning algorithm. In contrast to traditional supervised machine learning algorithms, K-Means attempts to classify data without having first been…