Data Representation Bias and Conditional Distribution Shift Drive Predictive Performance Disparities in Multi-Population Machine Learning


Abstract

Machine learning frequently encounters challenges when applied to population-stratified datasets, where data representation bias and data distribution shifts can substantially impact model performance and generalizability across different subpopulations. These challenges are well illustrated in the context of polygenic prediction for diverse ancestry groups, though the underlying mechanisms are broadly applicable to machine learning with population-stratified data. Using synthetic genotype-phenotype datasets representing five continental populations, we evaluate three approaches for utilizing population-stratified data (mixture learning, independent learning, and transfer learning) to systematically investigate how data representation bias and distribution shifts influence polygenic prediction performance across ancestry groups. Our results show that conditional distribution shifts, in combination with data representation bias, significantly influence machine learning model performance disparities across diverse populations and the effectiveness of transfer learning as a disparity mitigation strategy, whereas marginal distribution shifts have a limited effect. The joint effects of data representation bias and distribution shifts show distinct patterns under the different multi-population machine learning approaches, providing insights to inform the development of effective and equitable machine learning models for population-stratified data.
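The three multi-population strategies named in the abstract can be made concrete with a small illustration. The sketch below is not the authors' code; it uses ridge regression as a stand-in for a polygenic prediction model, assumed population labels and sample sizes, and a simple synthetic genotype-phenotype generator. The imbalanced sample sizes mimic data representation bias, and the `shift` parameter perturbs effect sizes to mimic a conditional distribution shift; the particular transfer-learning variant (shrinking a population-specific fit toward the pooled model) is one of several possible formulations.

```python
# Illustrative sketch of mixture, independent, and transfer learning on
# population-stratified data. All names, sizes, and the simulator are
# assumptions for demonstration, not the study's actual pipeline.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_snps = 200

def simulate_population(n_samples, effect_sizes, shift=0.0):
    """Synthetic genotypes (0/1/2 allele counts) and phenotypes.
    `shift` perturbs effect sizes to mimic a conditional distribution shift."""
    X = rng.binomial(2, 0.3, size=(n_samples, n_snps)).astype(float)
    y = X @ (effect_sizes + shift * rng.normal(size=n_snps)) + rng.normal(size=n_samples)
    return X, y

base_effects = rng.normal(scale=0.05, size=n_snps)
# Representation bias: one population dominates the pooled training data.
pops = {
    "POP_A": simulate_population(5000, base_effects, shift=0.00),
    "POP_B": simulate_population(500,  base_effects, shift=0.02),
    "POP_C": simulate_population(500,  base_effects, shift=0.02),
}
splits = {p: (X[: len(X) // 2], y[: len(y) // 2], X[len(X) // 2 :], y[len(y) // 2 :])
          for p, (X, y) in pops.items()}

# 1) Mixture learning: a single model trained on the pooled, imbalanced data.
X_mix = np.vstack([s[0] for s in splits.values()])
y_mix = np.concatenate([s[1] for s in splits.values()])
mixture_model = Ridge(alpha=1.0).fit(X_mix, y_mix)

for pop, (X_tr, y_tr, X_te, y_te) in splits.items():
    # 2) Independent learning: a separate model fit to each population alone.
    independent_model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    # 3) Transfer learning (one simple variant): blend the pooled model's
    #    coefficients with the population-specific fit.
    transfer_coef = 0.5 * mixture_model.coef_ + 0.5 * independent_model.coef_
    y_transfer = X_te @ transfer_coef + mixture_model.intercept_
    print(pop,
          f"mixture R2={r2_score(y_te, mixture_model.predict(X_te)):.2f}",
          f"independent R2={r2_score(y_te, independent_model.predict(X_te)):.2f}",
          f"transfer R2={r2_score(y_te, y_transfer):.2f}")
```

Under these assumptions, the underrepresented populations typically fare worst under mixture learning when the conditional shift is nonzero, which is the kind of performance disparity pattern the abstract describes.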
