Large Impact of Genetic Data Processing Steps on Stability and Reproducibility of Set-Based Analyses in Genome-Wide Association Studies

Naishu Kui
Yao Yu
Jaihee Choi
Zachary R. McCaw
Xihao Li
Chad Huff
Ryan Sun

Abstract

Genome-wide association studies (GWAS) are crucial to human genetics research, yet their stability and reproducibility are often questioned. This work describes, analyzes, and provides tools for overcoming reproducibility challenges in two highly popular components of GWAS: set-based (a) hypothesis testing and (b) effect size estimation. Specifically, we focus on how the set-based nature of (a) and (b) often fuels non-reproducible results when data processing pipelines differ in ways that are rarely discussed. First, we describe the processing challenges in a statistical model misspecification framework. Second, we analytically calculate the differences in power and the amounts of bias that can arise in (a) and (b), respectively, due to small data processing choices. Third, we provide tools for quantifying and avoiding the data processing obstacles in GWAS. We validate our analytical calculations through a simulation study, and we demonstrate these challenges empirically through analysis of a whole-exome sequencing study of pancreatic cancer.

Author Summary

The lack of reproducibility and stability in genome-wide association studies (GWAS) has been widely reported. Here, we demonstrate how such reproducibility challenges arise in a common component of GWAS: set-based hypothesis testing and estimation. Specifically, we show how minor, seemingly harmless decisions in how scientists prepare their data can lead to major differences in the final conclusions. These data processing steps are rarely reported in detail, further obscuring their importance. Our work precisely measures the impact of data processing steps on the power and bias of common modeling approaches. As a partial solution for future GWAS, we also provide an R software package for interacting with our results, which can be used to assess the impact of data processing choices at the design stage of a study. We further analyze a pancreatic cancer dataset using two modern pipelines and show how the pipelines produce very different results for ATM, a gene that has been previously linked with pancreatic cancer. Using the tools provided by this work can help significantly improve the reproducibility and stability of set-based results, enhancing the translational potential of GWAS investigations.

Version published to bioRxiv on Jul 22, 2025 (DOI: 10.1101/2025.07.21.665850)

