Bagging vs Boosting for tree-based models
Comparing two of the most powerful techniques for tabular supervised learning!
Hello fellow machine learners,
Eid Mubarak to all those who are celebrating!
In the past two weeks, we have discussed bagging and boosting algorithms. These are incredibly powerful algorithms that have many use cases in industry at the moment, so be sure to have a read of the relevant articles if you haven't already:
This week, we're keeping things light by drawing some comparisons between the two techniques, with an emphasis on random forests vs gradient-boosted trees. That is to say, we're focussing on tree-based models in this discussion.
Let's get to unpacking!
Sequential or parallel?
In the random forest algorithm, each tree is grown independently of the others. Each tree is trained on its own bootstrap sample (drawn with replacement from the training data) and does not depend on any other tree, so the order in which the trees are grown doesn't matter. Thus, each constituent tree can be built in parallel.
Conversely, with the gradient boosting technique, we begin with a decision tree stump and evaluate the errors it makes. From there, we train the next learner on these errors, and we repeat this process until we end up with a robust ensemble. Hence, the trees must be built sequentially.
I couldn't really think of a new animation to make for this article, so for now I have just pasted in the bagging and boosting animations from the past two weeks to really highlight the difference in the processes:
Bagging:
Boosting:
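To make the parallel-vs-sequential contrast concrete, here's a minimal code sketch of both loops. It uses scikit-learn decision trees on a synthetic regression problem; the dataset, number of trees, and learning rate are illustrative choices of mine rather than anything from the earlier articles.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

# Bagging: each tree sees its own bootstrap sample and ignores the other trees,
# so the loop iterations are independent and could run in parallel.
bagged_trees = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    bagged_trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
bagging_prediction = np.mean([tree.predict(X) for tree in bagged_trees], axis=0)

# Boosting: each stump is fit to the residual errors of the ensemble built so far,
# so iteration t cannot start until iteration t-1 has finished.
learning_rate = 0.1
boosting_prediction = np.zeros_like(y)
for _ in range(50):
    residuals = y - boosting_prediction  # errors made by the current ensemble
    stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)
    boosting_prediction += learning_rate * stump.predict(X)
```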
Bias-variance perspective
I plan on addressing the bias-variance tradeoff in a future article. For now though, the main ideas are the following:
Bias relates to model underfitting. ML models that are not “strong” enough have high bias.
Variance is related to model overfitting. In general, a model that fails to generalise to unseen data is said to suffer from high variance.
Given the above descriptions, the best case scenario is for our ML models to have both low bias and low variance.
Since the decision tree can capture complex relationships in data, we consider fully grown trees to be low-bias. However, as discussed in the article below, these trees can easily overfit the data, demonstrating their high-variance tendencies:
This highlights another area in which our two ensemble techniques differ.
The random forest aims to reduce the variance of the decision tree algorithm by aggregating the predictions of a group of trees.
How does the gradient boosting technique differ? Well, the main idea is to start with a “weak learner” and iteratively correct its prediction errors through the construction of the ensemble. We usually instantiate gradient-boosted trees with a decision tree stump, which is just a decision tree with a single split. The stump is precisely a high-bias, low-variance model.
Thus, the boosting technique aims to reduce the bias of the stump.
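If you'd like to see this bias-variance story play out in code, here's a rough sketch comparing a fully grown tree, a lone stump, a random forest, and gradient boosting on stumps. The synthetic dataset and hyperparameters are illustrative choices of mine; on data like this you'd typically see the forest improve on the deep tree (variance reduction) and the boosted stumps improve on the lone stump (bias reduction), though the exact numbers will vary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "Fully grown tree (low bias, high variance)": DecisionTreeClassifier(random_state=0),
    "Single stump (high bias, low variance)": DecisionTreeClassifier(max_depth=1, random_state=0),
    "Random forest (aims to reduce variance)": RandomForestClassifier(n_estimators=200, random_state=0),
    "Boosted stumps (aim to reduce bias)": GradientBoostingClassifier(max_depth=1, n_estimators=200, random_state=0),
}

# Cross-validated accuracy gives a feel for how well each model generalises.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```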
Packing it all up
We'll wrap today's article up with a pros and cons list for the two ensemble techniques:
Random forest:
✔️ Can train the constituent trees in parallel, which can help reduce the algorithm's runtime in comparison to the gradient-boosted tree (see the code sketch after these lists).
✔️ Not as susceptible to overfitting as decision trees, because the ensemble is designed to reduce variance.
✔️ Well-suited to handling outliers: each tree is trained on a bootstrapped sample, so any individual outlier contributes less to skewing the ensemble.
❌ Not as interpretable as a standard decision tree: each tree has its own feature importances, which cannot be intuitively aggregated.
❌ It can take the model a long time to make predictions, because data must be run through all the trees in the forest.
Gradient boosted tree:
✔️ Can outperform random forests when carefully tuned.
✔️ Better than bagging when dealing with imbalanced datasets. This is because subsequent learners are trained on the errors of previous learners, and those errors often come from the minority classes.
❌ More sensitive to hyperparameters than the random forest, meaning that hyperparameter tuning is more important for the gradient-boosted tree than for the random forest.
❌ More prone to overfitting than the random forest: the more boosting iterations you allow, the more likely the ensemble is to fit the noise in the data, since each iteration aims to correct the errors of the previous model.
❌ Learners must be built sequentially, which can lengthen the algorithm's runtime.
Hopefully these lists help give you a sense of which to try first for your modelling situations.
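As a practical footnote to the lists above, here's a hedged scikit-learn sketch of two of those points: parallel tree-building for the random forest, and early stopping to rein in the gradient-boosted tree's tendency to overfit as iterations pile up. The dataset and hyperparameter values are placeholders, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Random forest: n_jobs=-1 builds the trees on all available CPU cores,
# which is possible precisely because the trees don't depend on each other.
forest = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
forest.fit(X, y)

# Gradient boosting: trees are built one after another, so the boosting loop
# itself can't be parallelised. Early stopping (n_iter_no_change) guards
# against the overfitting that comes with adding ever more iterations.
booster = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
)
booster.fit(X, y)
print("Boosting stopped after", booster.n_estimators_, "trees")
```

With early stopping enabled, you can set n_estimators generously and let the held-out validation score decide when to stop adding trees.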
Training complete!
I hope you enjoyed reading as much as I enjoyed writing!
Do leave a comment if you're unsure about anything, if you think I've made a mistake somewhere, or if you have a suggestion for what we should learn about next.
Until next Sunday,
Ameer
PS… like what you read? If so, feel free to subscribe so that you're notified about future newsletter releases:
Sources
“The Elements of Statistical Learning”, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman: https://link.springer.com/book/10.1007/978-0-387-84858-7
“Popular Ensemble Methods: An Empirical Study”, by David Opitz and Richard Maclin: https://www.d.umn.edu/~rmaclin/cs5751/notes/opitz-jair99.pdf




