Bagging vs Boosting for tree-based models
Comparing two of the most powerful techniques for tabular supervised learning!
Hello fellow machine learners,
Eid Mubarak to all those who are celebrating!
In the past two weeks, we have discussed bagging and boosting algorithms. These are incredibly powerful algorithms that have many use cases in industry at the moment, so be sure to have a read of the relevant articles if you haven't already:
This week, we're keeping things light by drawing some comparisons between the two techniques, with an emphasis on random forests vs gradient-boosted trees. That is to say, we're focussing on tree-based models in this discussion.
Let's get to unpacking!
Sequential or parallel?
In the random forest algorithm, each tree is grown independently of the others. Each tree is trained on its own bootstrap sample (drawn with replacement from the training data) and does not depend on any other tree, so the order in which the trees are grown doesn't matter. Thus, each constituent tree can be built in parallel.
Conversely, with the gradient boosting technique, we begin with a decision tree stump and evaluate the errors it makes. From there, we train the next learner on these errors, and we repeat this process until we end up with a robust ensemble. Hence, the trees must be built sequentially.
I couldn't really think of a new animation to make for this article, so for now I have just pasted in the bagging and boosting animations from the past two weeks to really highlight the difference in the processes:
Bagging:
Boosting:
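To make the parallel-vs-sequential contrast concrete, here's a minimal code sketch of both loops. It uses scikit-learn decision trees on a synthetic regression problem; the dataset, number of trees, and learning rate are illustrative choices of mine rather than anything from the earlier articles.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

# Bagging: each tree sees its own bootstrap sample and ignores the other trees,
# so the loop iterations are independent and could run in parallel.
bagged_trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    bagged_trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
bagging_prediction = np.mean([tree.predict(X) for tree in bagged_trees], axis=0)

# Boosting: each stump is fit to the residual errors of the ensemble built so far,
# so iteration t cannot start until iteration t-1 has finished.
learning_rate = 0.1
boosting_prediction = np.zeros_like(y)
for _ in range(50):
    residuals = y - boosting_prediction  # errors made by the current ensemble
    stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)
    boosting_prediction += learning_rate * stump.predict(X)
```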
Bias-variance perspective
I plan on addressing the bias-variance tradeoff in a future article. For now though, the main ideas are the following:
Bias relates to model underfitting. ML models that are not “strong” enough have high bias.
Variance is related to model overfitting. In general, a model that fails to generalise to unseen data is said to suffer from high variance.
Given the above descriptions, the best case scenario is for our ML models to have both low bias and low variance.
Since the decision tree can capture complex relationships in data, we consider fully grown trees to be low-bias. However, as discussed in the article below, these trees can easily overfit the data, demonstrating their high-variance tendencies:
This highlights another area in which our two ensemble techniques differ.
The random forest aims to reduce the variance of the decision tree algorithm by aggregating the predictions of a group of trees.
How does the gradient boosting technique differ? Well, the main idea is to start with a “weak learner” and iteratively correct its prediction errors through the construction of the ensemble. We usually instantiate gradient-boosted trees with a decision tree stump, which is just a decision tree with a single split. The stump is precisely a high-bias, low-variance model.
Thus, the boosting technique aims to reduce the bias of the stump.
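If you'd like to see this bias-variance story play out in code, here's a rough sketch comparing a fully grown tree, a lone stump, a random forest, and gradient boosting on stumps. The synthetic dataset and hyperparameters are illustrative choices of mine; on data like this you'd typically see the forest improve on the deep tree (variance reduction) and the boosted stumps improve on the lone stump (bias reduction), though the exact numbers will vary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "Fully grown tree (low bias, high variance)": DecisionTreeClassifier(random_state=0),
    "Single stump (high bias, low variance)": DecisionTreeClassifier(max_depth=1, random_state=0),
    "Random forest (aims to reduce variance)": RandomForestClassifier(n_estimators=200, random_state=0),
    "Boosted stumps (aim to reduce bias)": GradientBoostingClassifier(max_depth=1, n_estimators=200, random_state=0),
}

# Cross-validated accuracy gives a feel for how well each model generalises.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```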
Packing it all up
We'll wrap today's article up with a pros and cons list for the two ensemble techniques:
Random forest:
✔️ Can train the constituent trees in parallel, which can help reduce the algorithm's runtime in comparison to the gradient-boosted tree (see the code sketch after these lists).
✔️ Not as susceptible to overfitting as decision trees, because the ensemble is designed to reduce variance.
✔️ Well-suited to handling outliers: each tree is trained on a bootstrapped sample, so any individual outlier contributes less to skewing the ensemble.
❌ Not as interpretable as a standard decision tree: each tree has its own feature importances, which cannot be intuitively aggregated.
❌ It can take the model a long time to make predictions, because data must be run through all the trees in the forest.
Gradient boosted tree:
✔️ Can outperform random forests when carefully tuned.
✔️ Better than bagging when dealing with imbalanced datasets. This is because subsequent learners are trained on the errors of previous learners, and those errors often come from the minority classes.
❌ More sensitive to hyperparameters than the random forest, meaning that hyperparameter tuning is more important for the gradient-boosted tree than for the random forest.
❌ More prone to overfitting than the random forest: the more boosting iterations you allow, the more likely the ensemble is to fit the noise in the data, since each iteration aims to correct the errors of the previous model.
❌ Learners must be built sequentially, which can lengthen the algorithm's runtime.
Hopefully these lists help give you a sense of which to try first for your modelling situations.
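As a practical footnote to the lists above, here's a hedged scikit-learn sketch of two of those points: parallel tree-building for the random forest, and early stopping to rein in the gradient-boosted tree's tendency to overfit as iterations pile up. The dataset and hyperparameter values are placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Random forest: n_jobs=-1 builds the trees on all available CPU cores,
# which is possible precisely because the trees don't depend on each other.
forest = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
forest.fit(X, y)

# Gradient boosting: trees are built one after another, so the boosting loop
# itself can't be parallelised. Early stopping (n_iter_no_change) guards
# against the overfitting that comes with adding ever more iterations.
booster = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
)
booster.fit(X, y)
print("Boosting stopped after", booster.n_estimators_, "trees")
```

With early stopping enabled, you can set n_estimators generously and let the held-out validation score decide when to stop adding trees.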
Training complete!
I hope you enjoyed reading as much as I enjoyed writing!
Do leave a comment if you're unsure about anything, if you think I've made a mistake somewhere, or if you have a suggestion for what we should learn about next.
Until next Sunday,
Ameer
PS… like what you read? If so, feel free to subscribe so that you're notified about future newsletter releases:
Sources
“The Elements of Statistical Learning”, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman: https://link.springer.com/book/10.1007/978-0-387-84858-7
“Popular Ensemble Methods: An Empirical Study”, by David Opitz and Richard Maclin: https://www.d.umn.edu/~rmaclin/cs5751/notes/opitz-jair99.pdf




