Introduction

If you’re like me - a DS/MLE with practical experience - you may be surprised by the title of this post. At first, I assumed the answer was a clear “yes.” However, when someone with more experience asked me this question, I realized that the answer might not be as obvious as I thought. And after some googling I found these quotes by Leo Breiman, the creator of the Random Forest algorithm

Random forests does not overfit. You can run as many trees as you want. (website)

and

Random forests do not overfit as more trees are added, but produce a limiting value of the generalization error. (paper)

Can random forests really not overfit? It seems counterintuitive, but in this post I will explain with math and experiments why it is possible under certain conditions.

To write this post I’ve used this PhD dissertation by Gilles Loupe (one of the creators of sklearn), this blog post that studies the same question, and the original paper by Leo Breiman.

Maths

Model definition

Let’s start by defining mathematically what we mean by a random forest. Let’s assume that we already know how to create single classification trees. Let’s call $D$ the dataset, a decision tree $T$, and $\Theta$ the hyperparameters of the tree. The model generated by these pieces is then $\Psi_{D, \Theta} = T(D, \Theta)$. For the sake of simplicity, let’s assume that all hyperparameters are fixed and that the only free hyperparameter is the random seed, ie: the values max_depth , min_sample_split , min_sample_leaf , etc. are fixed.

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Therefore, for a set of $M$ randomized trees $\{ \Psi_{D, \Theta_i} | i = 1, …, M \}$ that learn from the same dataset $D$we define a random forest model as 1

$\Phi_{D, \Theta_1, ..., \Theta_M}(x) = \frac{1}{M} \sum_i \Psi_{D, \Theta_i}(x)$

Bias-Variance decomposition

Now that we know how to build a random forest let’s study its bias and variance, which are the key ingredients of overfiting. For a pair $x$ and $y$, where $x$ is the vector of features, and $y$ is the target value, let’s define for a single tree $\Psi_{D, \Theta}$

$\mu_{D, \Theta_i}(x) = \mathbb{E}_{D, \Theta_i} \left[\Psi_{D, \Theta_i}(x)\right]$

and

$\sigma^2_{D, \Theta_i}(x) = \mathbb{V}_{D, \Theta_i} \left[\Psi_{D, \Theta_i}(x)\right]$

Also, we know that the expected generalization error of a model decomposes as

$\text{error}(x) = \text{bias}^2(x) + \text{variance}(x) + \text{noise}(x)$

We’re interested in the generalization error that our random forest has depending on its hyperparameters. Let’s start with the bias, which is defined as the difference between the actual value and the model’s expected value

$\mathbb{E}_{D, \Theta_1, ..., \Theta_M}\left[ \Phi_{D, \Theta_1, ..., \Theta_M}(x) \right] = \mathbb{E} \left[ \frac{1}{M} \sum_i \Psi_{D, \Theta_i}(x) \right] = \frac{1}{M} \sum_i \mathbb{E} \left[\Psi_{D, \Theta_i}(x)\right] = \mu_{D, \Theta}(x)$

because random variables $\Theta_i$ are independent and follow the same distribution. Therefore

$\text{bias}(x) = y - \mu_{D, \Theta}(x)$

which means that the bias of the random forest is the same as the bias of a single tree, so combining decision trees has no effect on the bias.

Let’s look now at what happens with the variance. After some maths (page 66 of this paper) it can be shown that the variance of a random forest of $M$ trees is

$\text{variance}(x) = \rho(x) \sigma^2_{D, \Theta}(x) + \frac{1 - \rho(x)}{M} \sigma^2_{D, \Theta}(x)$

where $\rho(x)$ is the Pearson correlation between the predictions of two decision trees.

Therefore, the generalization error is given by

$\text{error}(x, y; M) = (y - \mu_{D, \Theta}(x))^2 + \rho(x) \sigma^2_{D, \Theta}(x) + \frac{1 - \rho(x)}{M} \sigma^2_{D, \Theta}(x) + \text{noise}(x)$

Overfitting

Now that we know how to compute the generalization error of our model we can study how it behaves when changing the hyperparameters. Since each tree is built using only a bootstrap of the full dataset we have $\rho(x) < 1$, ie: different trees in the random forest give different predictions for the same input. Therefore, if $K > L$ we have

$\text{error}(K) < \text{error}(L)$

Also, when $M \to \infty$ we have

$\text{error}(x, y) = (y - \mu_{D, \Theta}(x))^2 + \rho(x) \sigma^2_{D, \Theta}(x) + \text{noise}(x)$

As a result, the expected generalization error of the random forest is smaller than the expected error of a single decision tree. This is, as we add more trees to the random forest the generalization error reduces.

Therefore, adding more trees to the model doesn’t make the model more prone to overfitting, in fact it reduces the generalization error! But it doesn’t mean that a random forest can’t overfit. For example, imagine the simplest case of a random forest with only one tree. If we allow the tree to have an infinite depth it can learn almost perfectly the training data (up to collisions), but then it’ll fail to generalize to other data.

Experiment

In the above section, we’ve seen mathematically that adding more trees to the model shouldn’t make the model overfit. In this section, we are going to take a more practical approach and show it directly with some experiments.

To study the effect of the number of trees on overfitting we’ll generate a synthetic dataset using the sklearn.datasets.make_regression method.

1 2 3 4 5 6 7 8 9 10 from sklearn.datasets import make_regression from sklearn.model_selection import train_test_split X, y = make_regression(n_samples=10_000, n_features=50, n_informative=30, noise=10) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) 

Now we can train multiple random forests, each one with a different number of trees.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 from sklearn.ensemble import RandomForestRegressor from sklearn.metrics import mean_squared_error rf = RandomForestRegressor() n_estimators = [] train_mse = [] test_mse = [] for n in range(1, 50, 1): rf.n_estimators = n rf.fit(X_train, y_train) y_train_predicted = rf.predict(X_train) y_test_predicted = rf.predict(X_test) mse_train = mean_squared_error(y_train, y_train_predicted) mse_test = mean_squared_error(y_test, y_test_predicted) n_estimators.append(n) train_mse.append(mse_train) test_mse.append(mse_test) 

And we can finally plot the MSE for both train and test datasets.

As you can see, there’s a gap between the train and test MSE which means that there’s some overfitting. However, as we add more trees the gap between the train and test MSE doesn’t increase, which means that adding more trees doesn’t make the model more prone to overfitting. In fact, as we add more trees to the model the gap between the curves reduces. In fact, in the next plot we see how Gap = Test MSE - Train MSE reduces as n_estimators increases

Notice also that the gap between the train and test curves — aka overfitting — can be changed by tuning other hyperparameters, such as max_depth . In the next plot I show how the gap changes depending on max_depth and n_estimators.

Conclusions

• Yes, random forests can overfit since a single tree can overfit.
• The bias of random forests is the same as the bias of a single tree, however, the variance decreases as we add more trees to the model, and this is where the power of random forests comes from.
• Overfitting in a random forest model can be tuned using other hyperparameters such as max_depth , but increasing n_estimators doesn’t increase the gap between train and test performance.

1. In this blog post we’ll consider only the case of regression. The results are easily extended to classification.