Random Forest In Python
Random forest is one of the most popular machine learning algorithms out there. Like decision trees, random forests can be applied to both regression and classification problems. They are also relatively interpretable: in domains like lending and insurance, there are laws demanding that the decisions made by a model be explainable, and this interpretability is one of the reasons we see random forest models used so heavily in industry.
The random forest algorithm works by aggregating the predictions made by multiple decision trees of varying depth. Every decision tree in the forest is trained on a random sample of the dataset, drawn with replacement, called the bootstrapped dataset.
The portion of samples left out during the construction of a given decision tree is referred to as that tree's Out-Of-Bag (OOB) dataset. As we'll see later, the model can evaluate its own performance by running each sample through the trees that never saw it during training.
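As a minimal sketch (using a made-up toy dataset of 10 rows), bootstrapping and the resulting OOB set can be illustrated with NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 10  # a toy dataset of 10 rows

# A bootstrapped dataset is drawn by sampling rows *with replacement*,
# so some rows appear more than once and others not at all.
bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)

# The rows that were never drawn form this tree's Out-Of-Bag set.
oob_indices = np.setdiff1d(np.arange(n_samples), bootstrap_indices)
```

On average, roughly a third of the rows end up out of bag for any given tree.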
Recall how, when deciding on the criterion with which to split a decision tree, we measured the impurity produced by each feature using the Gini index or entropy. In a random forest, however, at each split we randomly select a predefined number of features as candidates. This increases the variance between the trees, which would otherwise all tend to split on the same features (i.e. those most highly correlated with the target label) and produce very similar predictions.
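A minimal sketch of this feature subsampling, using the Iris feature names and the common sqrt(n_features) default for classification:

```python
import numpy as np

rng = np.random.default_rng(1)
features = np.array(['sepal length', 'sepal width', 'petal length', 'petal width'])

# At each split, only a random subset of the features is considered
# as candidates; a common default for classification is sqrt(n_features).
max_features = int(np.sqrt(len(features)))
candidates = rng.choice(features, size=max_features, replace=False)
```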
When the random forest is used for classification and is presented with a new sample, the final prediction is made by taking the majority vote of the predictions made by each individual decision tree in the forest. When it is used for regression, the final prediction is instead the average of the predictions made by each individual decision tree.
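The two aggregation rules can be sketched as follows (the per-tree outputs here are made up for illustration):

```python
import numpy as np

# Hypothetical per-tree outputs for a single new sample.
class_votes = np.array([0, 2, 2, 1, 2])       # classification: one label per tree
regression_preds = np.array([4.1, 3.9, 4.3])  # regression: one value per tree

# Classification: the forest predicts the majority class.
majority = np.bincount(class_votes).argmax()

# Regression: the forest predicts the mean of the tree outputs.
average = regression_preds.mean()
```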
To begin, we import the following libraries.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
from pydot import graph_from_dot_data
In the following section, we'll attempt to classify different species of Iris. Fortunately, the scikit-learn library provides a wrapper function for importing the dataset into our program.
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)
RandomForestClassifier can't handle categorical data directly. Thus, we one-hot encode the species, giving each its own binary column.
y = pd.get_dummies(y)
We set a portion of the total data aside for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Next, we create an instance of the RandomForestClassifier class.
rf = RandomForestClassifier(criterion='entropy', oob_score=True, random_state=1)
To spice things up, we're going to use entropy as the decision criterion this time around. The process is similar to the one used in the previous post, except that impurity is measured with the entropy equation:

E = −Σᵢ pᵢ log₂(pᵢ)

where pᵢ is the fraction of samples in the node belonging to class i.

The impurity of a split is the weighted sum of the impurities of its children, where each child's entropy is weighted by the fraction of the parent's samples it receives.

To calculate the impurity of the left leaf, we plug into the equation the fractions of people who are and aren't married among those with an income less than 50,000. We follow the same process for the impurity of the right leaf.

The information gain (with entropy) is the entropy of the parent minus the weighted entropy of the children.

The process is then repeated for both income and sex. We select the split with the largest information gain.
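A quick numeric sketch of these steps (the class counts below are made up purely for illustration):

```python
import numpy as np

def entropy(counts):
    """Entropy of a node: -sum(p * log2(p)) over its class proportions."""
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log2(0) is taken to be 0
    return -(p * np.log2(p)).sum()

# Hypothetical split: the parent node holds 10 samples (6 positive, 4 negative);
# the left child receives (5, 1) and the right child receives (1, 3).
parent = entropy([6, 4])
left, right = entropy([5, 1]), entropy([1, 3])

# Weighted child entropy uses each child's fraction of the parent's samples.
weighted = (6 / 10) * left + (4 / 10) * right
info_gain = parent - weighted
```

A perfectly mixed node has entropy 1 bit, a pure node has entropy 0, and a useful split always yields a positive information gain.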
Next, we train our model.
rf.fit(X_train, y_train)
The estimators_ property contains the array of DecisionTreeClassifier objects that make up the forest. Just like before, we can run the following code block to visualize a given decision tree.
dt = rf.estimators_[0]
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, feature_names=iris.feature_names)
(graph, ) = graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
We can evaluate the accuracy of our random forest model by taking a look at the Out-Of-Bag score.
rf.oob_score_
We can also check how well our model performs on the testing set. Since this is a classification problem, we make use of a confusion matrix.
y_pred = rf.predict(X_test)
species = np.array(y_test).argmax(axis=1)
predictions = np.array(y_pred).argmax(axis=1)
confusion_matrix(species, predictions)