Decision Trees for Big Data Analytics: Foundations, Complexity, and Applied Evaluation

Abstract

Decision trees remain a core method in big data analytics and applied statistics because they train efficiently, handle heterogeneous feature types, and yield interpretable decision rules. Despite this practical success, globally optimizing decision-tree structure is computationally intractable: Hyafil and Rivest proved that constructing optimal binary decision trees is NP-complete. This paper bridges (i) the mathematical foundations and complexity barriers behind optimal trees and (ii) the scalable induction strategies used in modern pipelines. We present a unified formal framework for classification and identification trees (covering misclassification cost and average external path length objectives), a structured account of the reduction from Exact Cover by 3-Sets (EC3) to decision-tree construction that clarifies why globally optimal tree induction does not scale, and an empirical evaluation on the Titanic dataset showing how depth constraints and minimum leaf sizes improve generalization and interpretability. We also summarize key scalability practices (histogram-based split search, parallelizable counting, and complexity control) that enable tree learning on large datasets.
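The abstract's evaluation relies on bounding tree depth and minimum leaf size to curb overfitting. The sketch below is not the paper's code; it is a minimal illustration, assuming scikit-learn and a hypothetical local "titanic.csv" with illustrative column names, of how such complexity controls are typically imposed and compared against an unconstrained tree.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical local copy of the Titanic data; column names are illustrative.
df = pd.read_csv("titanic.csv")
X = pd.get_dummies(df[["Pclass", "Sex", "Age", "Fare"]], drop_first=True)
X = X.fillna(X.median())  # simple imputation for missing ages/fares
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# An unconstrained tree can grow until leaves are nearly pure and overfit;
# bounding depth and minimum leaf size trades training accuracy for
# better generalization and a smaller, more interpretable rule set.
models = {
    "unconstrained": DecisionTreeClassifier(random_state=0),
    "constrained": DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                                          random_state=0),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name,
          "train:", round(clf.score(X_train, y_train), 3),
          "test:", round(clf.score(X_test, y_test), 3))

The specific values (max_depth=4, min_samples_leaf=20) are placeholders; the paper's reported settings should be taken from its experimental section.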
