Ensemble Machine Learning for CO₂ Corrosion Rate Prediction with Heterogeneous Datasets
Abstract
Corrosion accounts for billions of dollars in financial losses across the energy industry. However, limited access to high-quality, publicly available pipeline corrosion data significantly hinders accurate prediction, prevention, and the development of effective, data-driven maintenance strategies. This study develops an ensemble machine-learning framework to predict CO₂ corrosion rates in carbon-steel pipelines. The models are trained on a heterogeneous dataset that integrates simulated data, experimental results, and field measurements. Data preprocessing involved removing outliers and imputing missing values using a simple imputer combined with Gaussian Mixture Model Expectation–Maximization, which preserves multivariate dependencies. To improve sensitivity to rare, high-consequence corrosion events, the Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (SMOGN) was used to address imbalance in the target variable. Feature selection identified CO₂ partial pressure, pH, flow velocity, and temperature as the dominant predictors. Hyperparameters of four ensemble regressors (Extra Trees, Gradient Boosting Regressor, Random Forest, XGBoost) were tuned using grid search with 3-fold cross-validation. The Gradient Boosting Regressor outperformed the other models in accuracy and generalization on the test set (test R² = 0.70; MSE = 6.43 mm/yr). Model validation yielded R² = 0.82 across the 0–22 mm/yr range, with median absolute percentage errors below 50% in operationally critical regimes (≥ 1 mm/yr). The proposed machine learning framework offers a cost-effective, data-driven approach to improving pipeline integrity management with heterogeneous datasets.
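
The tuning step described above (grid search with 3-fold cross-validation over several ensemble regressors) could look roughly like the following minimal sketch. It is not the authors' code. The data are synthetic and the response formula is an arbitrary placeholder, not a corrosion model; the hyperparameter grids and the helper names `make_synthetic_data`, `tune_models`, and `evaluate` are assumptions made for illustration. XGBoost is left out to keep the example dependent on scikit-learn only, and the imputation and SMOGN steps are not shown.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split


def make_synthetic_data(n_samples=300, seed=0):
    """Hypothetical stand-in data: four features named after the abstract's predictors."""
    rng = np.random.default_rng(seed)
    p_co2 = rng.uniform(0.1, 10.0, n_samples)        # CO2 partial pressure (arbitrary units)
    ph = rng.uniform(3.5, 7.0, n_samples)            # pH
    velocity = rng.uniform(0.1, 10.0, n_samples)     # flow velocity (arbitrary units)
    temperature = rng.uniform(20.0, 90.0, n_samples) # temperature (arbitrary units)
    X = np.column_stack([p_co2, ph, velocity, temperature])
    # Made-up response used only so the example runs; NOT a physical corrosion model.
    y = (0.8 * p_co2 + 0.3 * velocity + 0.05 * temperature
         - 1.0 * (ph - 3.5) + rng.normal(0.0, 0.5, n_samples))
    return X, np.clip(y, 0.0, None)


def tune_models(X_train, y_train, seed=0):
    """Grid-search each candidate regressor with 3-fold cross-validation."""
    # Illustrative grids only; the grids used in the study are not given in the abstract.
    candidates = {
        "ExtraTrees": (
            ExtraTreesRegressor(random_state=seed),
            {"n_estimators": [50, 100], "max_depth": [None, 5]},
        ),
        "GradientBoosting": (
            GradientBoostingRegressor(random_state=seed),
            {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
        ),
        "RandomForest": (
            RandomForestRegressor(random_state=seed),
            {"n_estimators": [50, 100], "max_depth": [None, 5]},
        ),
    }
    fitted = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, cv=3, scoring="r2")
        search.fit(X_train, y_train)  # refits the best configuration on all training data
        fitted[name] = search
    return fitted


def evaluate(fitted, X_test, y_test):
    """Score each tuned model on held-out data."""
    results = {}
    for name, search in fitted.items():
        pred = search.predict(X_test)
        results[name] = {
            "r2": r2_score(y_test, pred),
            "mse": mean_squared_error(y_test, pred),
            "best_params": search.best_params_,
        }
    return results


if __name__ == "__main__":
    X, y = make_synthetic_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    for name, scores in evaluate(tune_models(X_train, y_train), X_test, y_test).items():
        print(name, scores)
```

Scoring every tuned model on the same held-out split keeps the comparison between regressors consistent, which is the kind of comparison the test-set R² and MSE figures above summarize.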