A Comparative Study of Explainable Machine Learning and Multivariate Regression for Predicting Fuel Consumption and CO2 Emissions in Multi-Brand Passenger Vehicles
Abstract
This paper develops and benchmarks a unified, explainable modeling framework for predicting passenger-vehicle fuel consumption and CO$_2$ emissions, and for interpreting the drivers of these outcomes. Using a multi-brand dataset spanning model years 2019–2023, we formulate the prediction tasks as supervised learning problems and compare two regression baselines (multiple linear regression and ridge regression) with two nonlinear ensemble learners (random forest and gradient boosting). Across both targets, ensemble models consistently deliver higher out-of-sample accuracy than linear methods, indicating the presence of nonlinearity and feature interactions that are not well captured by purely additive specifications. To ensure interpretability alongside performance, we employ SHAP-based attribution to decompose predictions into feature-level contributions and to provide both global and instance-wise explanations. The explanation results robustly rank engine size, vehicle class, and fuel type among the most influential predictors, and the inferred effects are directionally consistent with engineering intuition. Overall, the study demonstrates that explainable machine learning can simultaneously improve predictive fidelity and provide transparent, decision-relevant insights, supporting applications in vehicle design optimization and evidence-based environmental policy.
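To make the attribution idea concrete, the following is a minimal sketch of how a single prediction can be decomposed into feature-level contributions. It is not the paper's model or pipeline. The function `toy_co2_model`, its coefficients, the feature names `engine_size`, `is_suv`, and `is_diesel`, and the baseline vehicle are all hypothetical, chosen only for illustration. The sketch computes exact Shapley values against a single baseline point by enumerating feature orderings, which is feasible only for a handful of features. The SHAP library instead uses background data and faster model-specific algorithms.

```python
from itertools import permutations
from math import factorial


def toy_co2_model(x):
    """Hypothetical CO2 model in g/km (illustrative coefficients, not fitted)."""
    engine = x["engine_size"]
    is_suv = x["is_suv"]
    is_diesel = x["is_diesel"]
    # The engine * is_suv term is an interaction that a purely additive
    # specification would not capture.
    return (100.0 + 40.0 * engine + 20.0 * is_suv + 10.0 * is_diesel
            + 15.0 * engine * is_suv)


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to one `baseline` point.

    For every ordering of the features, switch them one at a time from the
    baseline value to the instance value and credit each feature with the
    change in the prediction it causes. Average over all orderings.
    """
    features = list(instance)
    totals = {name: 0.0 for name in features}
    for order in permutations(features):
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    n_orders = factorial(len(features))
    return {name: total / n_orders for name, total in totals.items()}


# Hypothetical reference vehicle and vehicle to explain.
baseline = {"engine_size": 2.0, "is_suv": 0, "is_diesel": 0}
instance = {"engine_size": 3.0, "is_suv": 1, "is_diesel": 0}

phi = shapley_values(toy_co2_model, instance, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.1f} g/km")
```

In this toy setting the contributions are 47.5 g/km for engine size, 57.5 g/km for vehicle class, and 0 for fuel type. They sum to 105 g/km, the gap between the explained prediction (285) and the baseline prediction (180). The interaction term is split evenly between the two features that take part in it.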