Comparative Analysis of Machine Learning Models for House Price Prediction: From Linear Regression to Boosted Trees
Abstract
This paper presents a rigorous, like-for-like comparison of modern tree-ensemble learners against regularized linear baselines for house price prediction on the Kaggle “House Prices: Advanced Regression Techniques” dataset (1,460 homes, 79 predictors). A single, leakage-safe pipeline standardizes preprocessing: median/most-frequent imputation, ordinal/one-hot encoding, and numeric scaling for the linear models. The pipeline also log-transforms SalePrice and adds domain-informed engineered features, including TotalSF, TotalBaths, Age/RemodAge, PorchSF, and selected size×quality interactions. The models compared are OLS, Ridge, Lasso, CART, Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost, with hyperparameters tuned by randomized search under repeated cross-validation. Performance is reported on an 80/20 split, using RMSE (log) as the primary metric and MAE (log) and \(R^{2}\) as secondary metrics. CatBoost attains the best validation performance (RMSE (log) = 0.1275, MAE (log) = 0.0813, \(R^{2} = 0.9129\)), with LightGBM/XGBoost close behind and Random Forest trailing slightly; Ridge/Lasso outperform OLS but underfit nonlinearities. To reconcile accuracy with transparency, we provide permutation importance and SHAP analyses, which consistently identify OverallQual, TotalSF/GrLivArea, garage capacity, bathroom aggregates, and recency (YearBuilt/YearRemodAdd) as the dominant drivers, with clear diminishing returns to size and size×quality interactions. An ablation shows that feature engineering materially improves accuracy across model families. The study contributes a reproducible benchmark, a feature-engineering ablation, and a unified interpretability analysis, which together form a deployment-ready blueprint for real-estate valuation and decision support.
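To make the reported metrics concrete, the following is a minimal sketch of how RMSE (log), MAE (log), and \(R^{2}\) can be computed on log-transformed prices. It is an illustration under stated assumptions, not the authors' code: the function name `log_metrics` is hypothetical, the natural logarithm is assumed (the abstract says only "log transformation"), \(R^{2}\) is assumed to be computed on the log scale, and the prices are made-up values rather than dataset records.

```python
import math


def log_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 computed on log-transformed prices.

    Assumption: natural log of the raw price; a log1p variant would
    differ only negligibly at house-price magnitudes.
    """
    log_true = [math.log(v) for v in y_true]
    log_pred = [math.log(v) for v in y_pred]
    n = len(log_true)

    # Residuals on the log scale: roughly relative (percentage) errors.
    residuals = [t - p for t, p in zip(log_true, log_pred)]

    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n

    # R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean log price.
    mean_true = sum(log_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in log_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"rmse_log": rmse, "mae_log": mae, "r2": r2}


# Made-up prices for illustration only (not taken from the dataset).
actual = [200000.0, 150000.0, 300000.0, 120000.0]
predicted = [210000.0, 140000.0, 290000.0, 130000.0]

metrics = log_metrics(actual, predicted)
print(metrics)
```

Because errors are measured on the log scale, a fixed dollar error on an inexpensive home is penalized more than the same dollar error on an expensive one, which is the usual motivation for evaluating price models this way.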