Ensemble Machine Learning for CO₂ Corrosion Rate Prediction with Heterogeneous Datasets

Joan Ejeta
Tolu Emiola-Sadiq
Robert Eshun
Kristen Rhinehardt

Abstract

Corrosion accounts for billions of dollars in financial losses across the energy industry. However, limited access to high-quality, publicly available pipeline corrosion data significantly hinders accurate prediction, prevention, and the development of effective, data-driven maintenance strategies. This study develops an ensemble machine-learning framework to predict CO₂ corrosion rates in carbon-steel pipelines. The models are trained on a heterogeneous dataset that integrates simulated data, experimental results, and field measurements. Data preprocessing involved removing outliers and imputing missing values using a simple imputer combined with Gaussian Mixture Model Expectation–Maximization, which preserves multivariate dependencies. To improve sensitivity to rare, high-consequence corrosion events, the Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (SMOGN) was used to address target imbalance. Feature selection identified CO₂ partial pressure, pH, flow velocity, and temperature as the dominant predictors. Hyperparameters of four ensemble regressors (Extra Trees, Gradient Boosting Regressor, Random Forest, XGBoost) were tuned using grid search with 3-fold cross-validation. The Gradient Boosting Regressor outperformed the other models in accuracy and generalization on the test set (test R² = 0.70; MSE = 6.43 mm/yr). Model validation yielded R² = 0.82 across 0–22 mm/yr and median absolute percentage errors below 50% in operationally critical regimes (≥ 1 mm/yr). The proposed machine-learning framework offers a cost-effective, data-driven approach to improving pipeline integrity management with heterogeneous datasets.
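The abstract reports three kinds of metrics: R², MSE, and a median absolute percentage error restricted to corrosion rates of at least 1 mm/yr. The following is a minimal sketch of how such metrics could be computed, not the authors' implementation. The function names, the `threshold` parameter, and the sample values are illustrative assumptions and do not come from the study.

```python
from statistics import mean, median


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mse(y_true, y_pred):
    """Mean squared error (carries the squared units of the target)."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def median_ape(y_true, y_pred, threshold=1.0):
    """Median absolute percentage error over samples with y_true >= threshold.

    The 1.0 mm/yr default mirrors the 'operationally critical' regime named
    in the abstract; treating it as a simple filter is an assumption.
    """
    errors = [
        abs(t - p) / t * 100.0
        for t, p in zip(y_true, y_pred)
        if t >= threshold
    ]
    return median(errors)


if __name__ == "__main__":
    # Made-up corrosion rates in mm/yr, for illustration only.
    y_true = [0.2, 1.0, 2.0, 5.0, 10.0]
    y_pred = [0.5, 1.5, 1.0, 4.0, 12.0]
    print("R2:", round(r2_score(y_true, y_pred), 3))
    print("MSE:", round(mse(y_true, y_pred), 3))
    print("MdAPE (>= 1 mm/yr):", round(median_ape(y_true, y_pred), 1))
```

Restricting the percentage error to rates at or above the threshold avoids dividing by near-zero targets, which would otherwise dominate the statistic.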

Version published to 10.21203/rs.3.rs-8896642/v1 on Research Square
Mar 5, 2026

Related articles

Comparison of Liquefaction Potential Prediction Results Using Machine Learning Methods Based on CPT Data

This article has 2 authors:
1. Sercan Tekeoğlu
2. Ender Başarı
This article has no evaluations
Latest version Apr 14, 2026

A Photovoltaic Power Forecasting Method Integrating Physical Mechanisms and Deep Learning

This article has 5 authors:
1. Hongyu Gao
2. Yuan Lu
3. Lizhong Tang
4. Jinrui Cai
5. Fei Han
This article has no evaluations
Latest version Apr 17, 2026

An Energy-Efficient Cascaded Machine Learning Framework for Predictive Network Anomaly Detection

This article has 1 author:
1. Dileesh Chandra Bikkasani
This article has no evaluations
Latest version Apr 17, 2026
