Large Impact of Genetic Data Processing Steps on Stability and Reproducibility of Set-Based Analyses in Genome-Wide Association Studies


Abstract

Genome-wide association studies (GWAS) are crucial to human genetics research, yet their stability and reproducibility are often questioned. This work describes, analyzes, and provides tools for overcoming reproducibility challenges in two highly popular components of GWAS: set-based (a) hypothesis testing and (b) effect size estimation. Specifically, we focus on how the set-based nature of (a) and (b) often fuels non-reproducible results because of rarely discussed differences between data processing pipelines. First, we describe the processing challenges in a statistical model misspecification framework. Second, we analytically calculate the differences in power and the amounts of bias that can arise in (a) and (b), respectively, from seemingly minor data processing choices. Third, we provide tools for quantifying and avoiding the data processing obstacles in GWAS. We validate our analytical calculations through a simulation study, and we demonstrate these challenges empirically through analysis of a whole-exome sequencing study of pancreatic cancer.

Author Summary

The lack of reproducibility and stability in genome-wide association studies (GWAS) has been widely reported. Here, we demonstrate how such reproducibility challenges arise in a common component of GWAS: set-based hypothesis testing and estimation. Specifically, we show how minor, seemingly harmless decisions in how scientists prepare their data can lead to major differences in the final conclusions. These data processing steps are rarely reported in detail, further obscuring their importance. Our work precisely measures the impact of data processing steps on the power and bias of common modeling approaches. As a partial solution for future GWAS, we also provide an R software package to interact with our results, which can be used to assess the impact of choices at the design stage of a GWAS. We further analyze a pancreatic cancer dataset using two modern pipelines and show how the pipelines produce very different results for ATM, a gene that has been previously linked with pancreatic cancer. Using the tools provided by this work can help significantly improve the reproducibility and stability of set-based results, enhancing the translational potential of GWAS investigations.
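To make the idea concrete, the following toy sketch shows how one processing choice can change the input to a set-based analysis. It is not the authors' method or their R package. The genotype values are made up, and both the minor allele frequency (MAF) cutoff and the simple burden score (a sum of minor-allele counts over the retained variants) are assumptions chosen only for illustration, since the summary does not name specific processing steps.

```python
# Toy genotype matrix: rows are individuals, columns are variants in one gene set.
# Entries are minor-allele counts (0, 1, or 2). All values are made up.
GENOTYPES = [
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]


def minor_allele_freqs(genotypes):
    """Per-variant minor allele frequency: allele count / (2 * individuals)."""
    n = len(genotypes)
    n_variants = len(genotypes[0])
    return [sum(row[j] for row in genotypes) / (2 * n) for j in range(n_variants)]


def kept_variants(genotypes, max_maf):
    """Indices of variants a hypothetical pipeline keeps under a MAF cutoff."""
    freqs = minor_allele_freqs(genotypes)
    return [j for j, f in enumerate(freqs) if 0 < f <= max_maf]


def burden_scores(genotypes, max_maf):
    """Per-individual burden: sum of minor-allele counts over kept variants."""
    keep = kept_variants(genotypes, max_maf)
    return [sum(row[j] for j in keep) for row in genotypes]


# Two hypothetical pipelines that differ only in the MAF cutoff.
strict = burden_scores(GENOTYPES, max_maf=0.1)
lenient = burden_scores(GENOTYPES, max_maf=0.5)
print("kept (strict): ", kept_variants(GENOTYPES, 0.1))
print("kept (lenient):", kept_variants(GENOTYPES, 0.5))
print("burden (strict): ", strict)
print("burden (lenient):", lenient)
```

In this sketch the two cutoffs retain different variants, so every downstream quantity computed from the burden scores (a test statistic or an effect size estimate) is computed from different inputs, even though the raw data are identical.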
