Random Forest In Python
Random forest is one of the most popular machine learning algorithms out there. Like decision trees, random forests can be applied to both regression and classification problems. They are also relatively interpretable: in domains like lending and insurance, there are laws demanding that the decisions made by a model be explainable, and this interpretability is one of the reasons we see random forest models used so heavily in industry.
The random forest algorithm works by aggregating the predictions made by multiple decision trees of varying depth. Every decision tree in the forest is trained on a random sample of the dataset, drawn with replacement, called the bootstrapped dataset.
The portion of samples left out during the construction of a given decision tree is referred to as that tree's Out-Of-Bag (OOB) dataset. As we'll see later, the model can evaluate its own performance by running each sample through the trees that never saw it during training.
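As a minimal sketch (using a made-up toy dataset of 10 rows), bootstrapping and the resulting OOB set can be illustrated with NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 10  # a toy dataset of 10 rows

# A bootstrapped dataset is drawn by sampling rows *with replacement*,
# so some rows appear more than once and others not at all.
bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)

# The rows that were never drawn form this tree's Out-Of-Bag set.
oob_indices = np.setdiff1d(np.arange(n_samples), bootstrap_indices)
```

On average, roughly a third of the rows end up out of bag for any given tree.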
Recall how, when deciding on the criterion with which to split a decision tree, we measured the impurity produced by each feature using the Gini index or entropy. In a random forest, however, at each split we randomly select a predefined number of features as candidates. This increases the variance between the trees, which would otherwise all tend to split on the same features (i.e. those most highly correlated with the target label) and produce very similar predictions.
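A minimal sketch of this feature subsampling, using the Iris feature names and the common sqrt(n_features) default for classification:

```python
import numpy as np

rng = np.random.default_rng(1)
features = np.array(['sepal length', 'sepal width', 'petal length', 'petal width'])

# At each split, only a random subset of the features is considered
# as candidates; a common default for classification is sqrt(n_features).
max_features = int(np.sqrt(len(features)))
candidates = rng.choice(features, size=max_features, replace=False)
```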
When the random forest is used for classification and is presented with a new sample, the final prediction is made by taking the majority vote of the predictions made by each individual decision tree in the forest. When it is used for regression, the final prediction is instead the average of the predictions made by each individual decision tree.
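The two aggregation rules can be sketched as follows (the per-tree outputs here are made up for illustration):

```python
import numpy as np

# Hypothetical per-tree outputs for a single new sample.
class_votes = np.array([0, 2, 2, 1, 2])       # classification: one label per tree
regression_preds = np.array([4.1, 3.9, 4.3])  # regression: one value per tree

# Classification: the forest predicts the majority class.
majority = np.bincount(class_votes).argmax()

# Regression: the forest predicts the mean of the tree outputs.
average = regression_preds.mean()
```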
To begin, we import the following libraries.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
from pydot import graph_from_dot_data
In the following section, we'll attempt to classify different species of Iris. Fortunately, the scikit-learn library provides a wrapper function for importing the dataset into our program.
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)
RandomForestClassifier can't handle categorical data directly. Thus, we one-hot encode the species, giving each its own binary column.
y = pd.get_dummies(y)
We set a portion of the total data aside for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Next, we create an instance of the RandomForestClassifier class.
rf = RandomForestClassifier(criterion='entropy', oob_score=True, random_state=1)
To spice things up, we're going to use entropy as the decision criterion this time around. The process is similar to the one used in the previous post, except that impurity is measured with the entropy equation:

E = −Σᵢ pᵢ log₂(pᵢ)

where pᵢ is the fraction of samples in the node belonging to class i.

The impurity of a split is the weighted sum of the impurities of its children, where each child's entropy is weighted by the fraction of the parent's samples it receives.

To calculate the impurity of the left leaf, we plug into the equation the fractions of people who are and aren't married among those with an income less than 50,000. We follow the same process for the impurity of the right leaf.

The information gain (with entropy) is the entropy of the parent minus the weighted entropy of the children.

The process is then repeated for both income and sex. We select the split with the largest information gain.
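A quick numeric sketch of these steps (the class counts below are made up purely for illustration):

```python
import numpy as np

def entropy(counts):
    """Entropy of a node: -sum(p * log2(p)) over its class proportions."""
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log2(0) is taken to be 0
    return -(p * np.log2(p)).sum()

# Hypothetical split: the parent node holds 10 samples (6 positive, 4 negative);
# the left child receives (5, 1) and the right child receives (1, 3).
parent = entropy([6, 4])
left, right = entropy([5, 1]), entropy([1, 3])

# Weighted child entropy uses each child's fraction of the parent's samples.
weighted = (6 / 10) * left + (4 / 10) * right
info_gain = parent - weighted
```

A perfectly mixed node has entropy 1 bit, a pure node has entropy 0, and a useful split always yields a positive information gain.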
Next, we train our model.
rf.fit(X_train, y_train)
The estimators_ property contains the array of DecisionTreeClassifier objects that make up the forest. Just like before, we can run the following code block to visualize a given decision tree.
dt = rf.estimators_[0]
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, feature_names=iris.feature_names)
(graph, ) = graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
We can evaluate the accuracy of our random forest model by taking a look at the Out-Of-Bag score.
rf.oob_score_
We can also check how well our model performs on the testing set. Since this is a classification problem, we make use of a confusion matrix.
y_pred = rf.predict(X_test)
species = np.array(y_test).argmax(axis=1)
predictions = np.array(y_pred).argmax(axis=1)
confusion_matrix(species, predictions)