Novel binning-based methods for model fitting and data splitting improved machine learning imbalanced data

Husam Abdulnabi
J. Timothy Westwood

Abstract

Machine Learning (ML) models may perform inconsistently across the individual classes of a nominal output or across ranges of a continuous output, collectively referred to here as bins. Models should therefore be assessed with metrics that consider each bin individually, called bin metrics. Inconsistent model performance is often due to fitting models on imbalanced data. To improve the modelling of imbalanced data, novel model fitting methods are proposed, including the use of bin metrics as loss functions and the use of Epoch sampling. Imbalanced data also poses a challenge for appropriate data splitting. Akin split is a novel method proposed to objectively yield the most appropriate data split(s).

Existing and novel model fitting and data splitting methods were assessed in two case studies. The first case study used synthetically generated datasets with different levels of noise and imbalance. On datasets with noise and greater levels of imbalance, Epoch sampling significantly improved model performance by up to 23.6% while using significantly fewer resources (computation and time), by up to 57.7%, compared to a standard model fitting method. The second case study used protein-genome interaction data, which are often severely right-skewed. Akin split was used to split the data more appropriately than traditional methods. Model fitting methods were tried on two model configurations. The effects of the model fitting methods varied by model configuration, but all models were significantly improved, by up to 35.3%, compared to standard model fitting.
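The idea of a bin metric can be illustrated with a small sketch. The abstract does not specify how the authors compute their bin metrics, so everything below is an assumption made for illustration: the function name `per_bin_mae`, the choice of mean absolute error as the metric, and the single bin edge at 1.0 are all hypothetical. The sketch only shows why a per-bin view can reveal what an overall score hides on right-skewed data.

```python
from bisect import bisect_right
from statistics import mean


def per_bin_mae(y_true, y_pred, bin_edges):
    """Mean absolute error computed separately for each range (bin) of y_true.

    Hypothetical helper, not the authors' implementation. Bin b holds the
    true values t with edges[b-1] <= t < edges[b]; empty bins map to None.
    """
    edges = sorted(bin_edges)
    errors = {b: [] for b in range(len(edges) + 1)}
    for t, p in zip(y_true, y_pred):
        # bisect_right returns the index of the bin that contains t
        errors[bisect_right(edges, t)].append(abs(t - p))
    return {b: (mean(errs) if errs else None) for b, errs in errors.items()}


# Toy right-skewed target: many small values, few large ones (made-up data)
y_true = [0.1, 0.2, 0.3, 0.4, 5.0, 9.0]
y_pred = [0.1, 0.2, 0.3, 0.4, 2.0, 4.0]

bins = per_bin_mae(y_true, y_pred, bin_edges=[1.0])
overall = mean(abs(t - p) for t, p in zip(y_true, y_pred))

print(bins)     # error per bin: the sparse high-value bin is much worse
print(overall)  # the single overall score averages that difference away
```

In this toy example the model is exact on the well-populated low bin and poor on the sparse high bin, which the overall error alone would not show.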

Version published to 10.1101/2025.06.26.661884v1 on bioRxiv
Jul 1, 2025

Related articles

Introducing DART: A Novel Deep Adaptive Upsampling Technique for Handling Class Imbalance

This article has 1 author:
1. Mark Lokanan
This article has no evaluations. Latest version: Jun 18, 2025

Quadratic Surface Twin Support Vector Machine for Imbalanced Data

This article has 5 authors:
1. Hossein Moosaei
2. Milan Hladik
3. Ahmad Mousavi
4. Zheming Gao
5. Haojie Fu
This article has no evaluations. Latest version: Jun 13, 2025

An Empowered Transfer Learning Model for Predictive Classification of Lung Cancer

This article has 6 authors:
1. Syed Thouheed Ahmed
2. Satheesha Tumakur Yoga
3. Lakshmi Hassan Nagaraja
4. Sandeep Kumar Mathivanan
5. R. Sangeetha
6. Saurav Mallik
This article has no evaluations. Latest version: May 21, 2025