Comparative Analysis of Machine Learning Models for House Price Prediction: From Linear Regression to Boosted Trees


Abstract

This paper presents a rigorous, apples-to-apples comparison of modern tree-ensemble learners against regularized linear baselines for house price prediction on the Kaggle “House Prices: Advanced Regression Techniques” dataset (1,460 homes, 79 predictors). A single, leakage-safe pipeline is implemented, standardizing preprocessing steps such as median/most-frequent imputation, ordinal/one-hot encoding, and numeric scaling for linear models. This pipeline additionally incorporates a log transformation of SalePrice and introduces domain-informed feature engineering, including variables such as TotalSF, TotalBaths, Age/RemodAge, PorchSF, and selected size×quality interactions. The models employed encompass OLS, Ridge, Lasso, CART, Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost, with hyperparameter tuning conducted through repeated cross-validation and randomized search. Performance is reported on an 80/20 split using RMSE (log) as the primary metric and MAE (log) and \(R^{2}\) as secondary metrics. CatBoost attains the best validation performance—RMSE (log) = 0.1275, MAE (log) = 0.0813, \(R^{2}=0.9129\)—with LightGBM/XGBoost close behind and Random Forest trailing slightly; Ridge/Lasso outperform OLS but underfit nonlinearities. To reconcile accuracy with transparency, we provide permutation importance and SHAP analyses, which consistently identify OverallQual, TotalSF/GrLivArea, garage capacity, bathroom aggregates, and recency (YearBuilt/YearRemodAdd) as dominant drivers, with clear nonlinear diminishing returns to size and size×quality interactions. An ablation demonstrates that feature engineering materially improves accuracy across model families. This study contributes a reproducible benchmark, a feature-engineering ablation, and unified interpretability that together form a deployment-ready blueprint for real-estate valuation and decision support.
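To make the abstract's pipeline description concrete, the following is a minimal, hypothetical sketch of a leakage-safe scikit-learn pipeline of the kind described: median/most-frequent imputation, one-hot encoding, numeric scaling for a linear baseline, a log transform of SalePrice, and RMSE evaluated on the log scale over an 80/20 split. The data here is synthetic (column names mirror the Kaggle dataset, but the values are illustrative), and Ridge stands in for the full model roster; this is not the paper's exact implementation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the Kaggle data: a few numeric and one categorical column.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "GrLivArea": rng.normal(1500, 400, n).clip(400),
    "TotalBsmtSF": rng.normal(1000, 300, n).clip(0),
    "Neighborhood": rng.choice(["A", "B", "C"], n),
})
# Inject missing values so imputation has something to do.
df.loc[df.sample(frac=0.05, random_state=0).index, "TotalBsmtSF"] = np.nan

# Synthetic SalePrice with a size x quality interaction plus noise.
price = (20000 + 9000 * df["OverallQual"] + 60 * df["GrLivArea"]
         + 15 * df["OverallQual"] * df["GrLivArea"] / 100
         + rng.normal(0, 10000, n))
y = np.log1p(price.clip(1))  # model log(SalePrice), as in the paper

num_cols = ["OverallQual", "GrLivArea", "TotalBsmtSF"]
cat_cols = ["Neighborhood"]

# Preprocessing lives inside the Pipeline, so imputer/scaler statistics are
# fit on the training fold only -- the "leakage-safe" property.
pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
model = Pipeline([("pre", pre), ("ridge", Ridge(alpha=1.0))])

X_tr, X_va, y_tr, y_va = train_test_split(df, y, test_size=0.2, random_state=42)
model.fit(X_tr, y_tr)
rmse_log = mean_squared_error(y_va, model.predict(X_va)) ** 0.5
print(f"RMSE (log) = {rmse_log:.4f}")
```

Swapping the Ridge step for a tree-ensemble estimator (and dropping the scaler, which trees do not need) reproduces the same evaluation protocol across model families.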
