Tree-based machine learning methods for multilevel data: The impact of predictor levels and clustering on prediction and inference
Abstract
Machine learning methods such as decision trees and random forests allow researchers to investigate complex non-linear and interaction effects, making them valuable tools for exploring complex psychological processes. Recently, decision tree methods have been extended (multilevel trees) for application to multilevel data, i.e., data with level-1 units (e.g., students) nested within level-2 units (e.g., classes). While these adaptations add random effects to address the lack of independence between observations, they do not consider the level at which predictor variables operate. We investigate how predictor level (level-1 vs. level-2) and clustering (intraclass correlation coefficient (ICC), number of clusters, cluster size) affect inference and prediction in six tree-based methods: rpart, ctree, REEMtree, REEMctree, MERT, and lmertree. Using simulation studies, we evaluate variable selection, predictive performance (PMSE, R²), and the predictive contribution of individual predictor variables. Our results show that, in both prediction and inference, standard and multilevel tree methods are seriously affected by the level of the predictor variables and by the clustering in the data. In particular, the risk of falsely selecting non-informative level-2 predictors is substantial, especially when the ICC is high. Ignoring the multilevel structure of the data may thus lead to erroneous research conclusions even when multilevel tree methods are applied.
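To make the simulation factors concrete, the following is a minimal Python sketch (not the authors' code; the paper's methods are R packages) of a multilevel data-generating process with one level-1 and one level-2 predictor, where the random-intercept variance is set so that the residual ICC matches a target value. All function names, effect sizes, and design values here are illustrative assumptions.

```python
import numpy as np

def simulate_multilevel(n_clusters=50, cluster_size=20, icc=0.3, seed=0):
    """Simulate a clustered outcome with one level-1 and one level-2 predictor.

    The residual variance is fixed at 1 and split so that the random-intercept
    variance divided by the total residual variance equals the target ICC.
    Effect sizes (0.5) and design values are illustrative, not from the paper.
    """
    rng = np.random.default_rng(seed)
    tau2 = icc            # between-cluster (level-2) residual variance
    sigma2 = 1.0 - icc    # within-cluster (level-1) residual variance
    n = n_clusters * cluster_size
    cluster = np.repeat(np.arange(n_clusters), cluster_size)
    x1 = rng.normal(size=n)                        # level-1 predictor (varies within clusters)
    z1 = rng.normal(size=n_clusters)[cluster]      # level-2 predictor (constant within a cluster)
    u = rng.normal(scale=np.sqrt(tau2), size=n_clusters)[cluster]  # random intercepts
    e = rng.normal(scale=np.sqrt(sigma2), size=n)  # level-1 residuals
    y = 0.5 * x1 + 0.5 * z1 + u + e
    return cluster, x1, z1, y

def empirical_icc(cluster, y):
    """One-way ANOVA estimator of the ICC for a balanced design."""
    groups = [y[cluster == c] for c in np.unique(cluster)]
    n = len(groups[0])                 # common cluster size
    grand = y.mean()
    msb = n * sum((g.mean() - grand) ** 2 for g in groups) / (len(groups) - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (len(y) - len(groups))
    return (msb - msw) / (msb + (n - 1) * msw)
```

Note that the ICC of the outcome itself exceeds the residual ICC here, because the level-2 predictor `z1` contributes additional between-cluster variance; this is exactly the kind of confounding between predictor level and clustering that the simulations vary.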