Comparative Analysis of Machine Learning Models for House Price Prediction: From Linear Regression to Boosted Trees


Abstract

This paper presents a rigorous, apples-to-apples comparison of modern tree-ensemble learners against regularized linear baselines for house price prediction on the Kaggle “House Prices: Advanced Regression Techniques” dataset (1,460 homes, 79 predictors). A single, leakage-safe pipeline is implemented, standardizing preprocessing steps such as median/most-frequent imputation, ordinal/one-hot encoding, and numeric scaling for linear models. This pipeline additionally incorporates a log transformation of SalePrice and introduces domain-informed feature engineering, including variables such as TotalSF, TotalBaths, Age/RemodAge, PorchSF, and selected size×quality interactions. The models employed encompass OLS, Ridge, Lasso, CART, Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost, with hyperparameter tuning conducted through repeated cross-validation and randomized search. Performance is reported on an 80/20 split using RMSE (log) as the primary metric and MAE (log) and \(R^{2}\) as secondary metrics. CatBoost attains the best validation performance—RMSE (log) = 0.1275, MAE (log) = 0.0813, \(R^{2}=0.9129\)—with LightGBM/XGBoost close behind and Random Forest trailing slightly; Ridge/Lasso outperform OLS but underfit nonlinearities. To reconcile accuracy with transparency, we provide permutation importance and SHAP analyses, which consistently identify OverallQual, TotalSF/GrLivArea, garage capacity, bathroom aggregates, and recency (YearBuilt/YearRemodAdd) as dominant drivers, with clear nonlinear diminishing returns to size and size×quality interactions. An ablation demonstrates that feature engineering materially improves accuracy across model families. This study contributes a reproducible benchmark, a feature-engineering ablation, and unified interpretability that together form a deployment-ready blueprint for real-estate valuation and decision support.
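To make the abstract's pipeline description concrete, the following is a minimal, hypothetical sketch of a leakage-safe scikit-learn pipeline of the kind described: median/most-frequent imputation, one-hot encoding, numeric scaling for a linear baseline, a log transform of SalePrice, and RMSE evaluated on the log scale over an 80/20 split. The data here is synthetic (column names mirror the Kaggle dataset, but the values are illustrative), and Ridge stands in for the full model roster; this is not the paper's exact implementation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the Kaggle data: a few numeric and one categorical column.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n).astype(float),
    "GrLivArea": rng.normal(1500, 400, n).clip(400),
    "TotalBsmtSF": rng.normal(1000, 300, n).clip(0),
    "Neighborhood": rng.choice(["A", "B", "C"], n),
})
# Inject missing values so imputation has something to do.
df.loc[df.sample(frac=0.05, random_state=0).index, "TotalBsmtSF"] = np.nan

# Synthetic SalePrice with a size x quality interaction plus noise.
price = (20000 + 9000 * df["OverallQual"] + 60 * df["GrLivArea"]
         + 15 * df["OverallQual"] * df["GrLivArea"] / 100
         + rng.normal(0, 10000, n))
y = np.log1p(price.clip(1))  # model log(SalePrice), as in the paper

num_cols = ["OverallQual", "GrLivArea", "TotalBsmtSF"]
cat_cols = ["Neighborhood"]

# Preprocessing lives inside the Pipeline, so imputer/scaler statistics are
# fit on the training fold only -- the "leakage-safe" property.
pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
model = Pipeline([("pre", pre), ("ridge", Ridge(alpha=1.0))])

X_tr, X_va, y_tr, y_va = train_test_split(df, y, test_size=0.2, random_state=42)
model.fit(X_tr, y_tr)
rmse_log = mean_squared_error(y_va, model.predict(X_va)) ** 0.5
print(f"RMSE (log) = {rmse_log:.4f}")
```

Swapping the Ridge step for a tree-ensemble estimator (and dropping the scaler, which trees do not need) reproduces the same evaluation protocol across model families.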
