Decision Trees for Big Data Analytics: Foundations, Complexity, and Applied Evaluation

Abstract

Decision trees remain a core method in big data analytics and applied statistics because they train efficiently, handle heterogeneous feature types, and yield interpretable decision rules. Despite this practical success, globally optimizing decision-tree structure is computationally intractable: Hyafil and Rivest proved that constructing optimal binary decision trees is NP-complete. This paper bridges (i) the mathematical foundations and complexity barriers behind optimal trees and (ii) the scalable induction strategies used in modern pipelines. We present a unified formal framework for classification and identification trees (covering misclassification cost and average external path length objectives), a structured account of the reduction from Exact Cover by 3-Sets (EC3) to decision-tree construction that clarifies why globally optimal tree induction does not scale, and an empirical evaluation on the Titanic dataset showing how depth constraints and minimum leaf sizes improve generalization and interpretability. We also summarize key scalability practices (histogram-based split search, parallelizable counting, and complexity control) that enable tree learning on large datasets.
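The abstract's evaluation relies on bounding tree depth and minimum leaf size to curb overfitting. The sketch below is not the paper's code; it is a minimal illustration, assuming scikit-learn and a hypothetical local "titanic.csv" with illustrative column names, of how such complexity controls are typically imposed and compared against an unconstrained tree.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical local copy of the Titanic data; column names are illustrative.
df = pd.read_csv("titanic.csv")
X = pd.get_dummies(df[["Pclass", "Sex", "Age", "Fare"]], drop_first=True)
X = X.fillna(X.median())  # simple imputation for missing ages/fares
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# An unconstrained tree can grow until leaves are nearly pure and overfit;
# bounding depth and minimum leaf size trades training accuracy for
# better generalization and a smaller, more interpretable rule set.
models = {
    "unconstrained": DecisionTreeClassifier(random_state=0),
    "constrained": DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                          random_state=0),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name,
          "train:", round(clf.score(X_train, y_train), 3),
          "test:", round(clf.score(X_test, y_test), 3))

The specific values (max_depth=4, min_samples_leaf=20) are placeholders; the paper's reported settings should be taken from its experimental section.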
