Gradient Boosting Decision Tree Algorithm Explained

May 17, 2019

Gradient Boosting Decision Tree Algorithm Explained

In the proceeding article, we’ll take a look at how we can go about implementing Gradient Boost in Python. Gradient Boosting is similar to AdaBoost in that they both use an ensemble of decision trees to predict a target label. However, unlike AdaBoost, the Gradient Boost trees have a depth larger than 1. In practice, you’ll typically see Gradient Boost being used with a maximum number of leaves of between 8 and 32.

Algorithm

Before we dive into the code, it’s important that we grasp how the Gradient Boost algorithm is implemented under the hood. Suppose, we were trying to predict the price of a house given their age, square footage and location.

Step 1: Calculate the average of the target label

When tackling regression problems, we start with a leaf that is the average value of the variable we want to predict. This leaf will be used as a baseline to approach the correct solution in the proceeding steps.

Step 2: Calculate the residuals

For every sample, we calculate the residual with the proceeding formula.

residual = actual value – predicted value

In our example, the predicted value is the equal to the mean calculated in the previous step and the actual value can be found in the price column of each sample. After computing the residuals, we get the following table.

Step 3: Construct a decision tree

Next, we build a tree with the goal of predicting the residuals. In other words, every leaf will contain a prediction as to the value of the residual (not the desired label).

In the event there are more residuals than leaves, some residuals will end up inside the same leaf. When this happens, we compute their average and place that inside the leaf.

Thus, the tree becomes:

Step 4: Predict the target label using all of the trees within the ensemble

Each sample passes through the decision nodes of the newly formed tree until it reaches a given lead. The residual in said leaf is used to predict the house price.

It’s been shown through experimentation that taking small incremental steps towards the solution achieves a comparable bias with a lower overall vatiance (a lower variance leads to better accuracy on samples outside of the training data). Thus, to prevent overfitting, we introduce a hyperparameter called learning rate. When we make a prediction, each residual is multiplied by the learning rate. This forces us to use more decision trees, each taking a small step towards the final solution.

Step 5: Compute the new residuals

We calculate a new set of residuals by subtracting the actual house prices from the predictions made in the previous step. The residuals will then be used for the leaves of the next decision tree as described in step 3.

Step 6: Repeat steps 3 to 5 until the number of iterations matches the number specified by the hyperparameter (i.e. number of estimators)

Step 7: Once trained, use all of the trees in the ensemble to make a final prediction as to the value of the target variable

The final prediction will be equal to the mean we computed in the first step, plus all of the residuals predicted by the trees that make up the forest multiplied by the learning rate.

Code

In this tutorial, we’ll make use of the GradientBoostingRegressor class from the scikit-learn library.

from sklearn.ensemble import GradientBoostingRegressor  
import numpy as np  
import pandas as pd  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_squared_error  
from sklearn.datasets import load_boston  
from sklearn.metrics import mean_absolute_error

For the proceeding example, we’ll be using the Boston house prices dataset.

boston = load_boston()  
X = pd.DataFrame(boston.data, columns=boston.feature_names)  
y = pd.Series(boston.target)

In order to evaluate the performance of our model, we split the data into training and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y)

Next, we construct and fit our model. max_depth refers to the number of leaves of each tree (i.e. 4) whereas n_estimators refers to the total number of trees in the ensemble. As mentioned previously, the learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, you will need more trees in the ensemble to fit the training set, but the overall variance will be lower.

regressor = GradientBoostingRegressor(  
    max_depth=2,  
    n_estimators=3,  
    learning_rate=1.0  
)  
regressor.fit(X_train, y_train)

Thestaged_predict() method measures the validation error at each stage of training (i.e. with one tree, with two trees…) to find the optimal number of trees.

errors = [mean_squared_error(y_test, y_pred) for y_pred in regressor.staged_predict(X_test)]

best_n_estimators = np.argmin(errors)

Now, we can build and fit our model using the optimal number of trees.

best_regressor = GradientBoostingRegressor(  
    max_depth=2,  
    n_estimators=best_n_estimators,  
    learning_rate=1.0  
)  
best_regressor.fit(X_train, y_train)

Sklearn provides numerous metrics to evaluate the performance of our machine learning models. What I found particularly useful, it that they categorize the each metric according to the problem domain which they’re applicable. For example, precision only makes sense in the context of classification.

https://scikit-learn.org/stable/modules/model_evaluation.html

We use the mean absolute error which can be interpreted as the average distance from our predictions and the actual values.

y_pred = best_regressor.predict(X_test)

mean_absolute_error(y_test, y_pred)