Public RNA-seq data are not representative of global human diversity

Irene Gallego Romero
Grace Rodenberg
Audrey M. Arner
Lani Li
Isobel J. Beasley
Ryan Rossow
Nicholas Ryan
Selina Wang
Amanda J. Lea

Abstract

The field of human genetics has reached a consensus that it is important to work with diverse and globally representative participant groups. This diverse sampling is required to build a robust understanding of the genomic basis of complex traits and diseases as well as human evolution, and to ensure that all people benefit from downstream scientific discoveries. While previous work has characterized compositional biases and disparities for public genome-wide association (GWAS), microbiome, and epigenomic studies, we currently lack a comprehensive understanding of the degree of bias for transcriptomic studies. To address this gap, we analyzed the metadata for RNA-seq studies from two public databases: the Sequence Read Archive (SRA), representing 795,071 samples from 21,209 studies, and the Database of Genotypes and Phenotypes (dbGaP), representing 167,389 samples from 649 studies. We also randomly selected 620 studies from SRA for detailed, manual evaluation. We found that 3% of samples in SRA and 21% of individuals described in the literature had population descriptors (race, ethnicity, or ancestry); 28% of samples in dbGaP had paired genotype data that were used to empirically infer ancestry. In SRA, dbGaP, and the literature, race, ethnicity, and ancestry terms were frequently conflated and difficult to disambiguate. After standardizing population descriptors, we observed many clear biases: for example, among samples in SRA that were coded using US Census terms, 69.0% came from white donors, corresponding to a 1.2x overrepresentation of this group relative to the US population. Among samples in SRA coded using continental ancestry labels, 55.6% came from European ancestry donors, a 4.1x overrepresentation of this group relative to the global population. These biases were generally similar across datasets (SRA, dbGaP, literature review), and were comparable to previous reports for other ’omics data types. However, we note that, relative to other ’omics data subsets like GWAS, there is considerably less information, of arguably worse quality, about who is participating in RNA-seq studies. Together, these results demonstrate a critical need to improve our thoughtfulness, consistency, and effort around reporting population descriptors in RNA-seq studies, and to more generally strive for greater diversity in this important data type.
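
The overrepresentation figures quoted in the abstract can be read as a simple ratio: a group's share of samples divided by its share of a reference population. The sketch below is illustrative only and is not the authors' pipeline; the function names are hypothetical. The abstract does not state the reference shares, so the code back-calculates them from the reported, rounded figures, and the results should be treated as approximate.

```python
def representation_ratio(sample_share: float, reference_share: float) -> float:
    """Share of samples divided by share of the reference population.

    A value above 1 indicates overrepresentation; below 1, underrepresentation.
    """
    if not 0 < reference_share <= 1:
        raise ValueError("reference_share must be in (0, 1]")
    return sample_share / reference_share


def implied_reference_share(sample_share: float, ratio: float) -> float:
    """Back-calculate the reference share implied by a reported ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return sample_share / ratio


# Figures quoted in the abstract: (sample share, reported overrepresentation).
# The implied reference shares are derived here, not reported by the authors.
REPORTED = {
    "SRA, US Census terms, white donors": (0.690, 1.2),
    "SRA, continental ancestry, European donors": (0.556, 4.1),
}

for label, (share, ratio) in REPORTED.items():
    reference = implied_reference_share(share, ratio)
    print(f"{label}: sample share {share:.1%}, "
          f"implied reference share ~{reference:.1%}")
```

Because the published ratios are rounded to one decimal place, the implied reference shares carry a margin of a few percentage points.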

Version published to 10.1101/2024.10.11.617967 on bioRxiv
Oct 12, 2024

Related articles

Benchmarking RNA-seq Tools for Real-World Diagnostic Applications

This article has 15 authors:
1. Sarah Silverstein
2. Kaushik Ganapathy
3. Sandra Donkervoort
4. Veronique Bolduc
5. Ying Hu
6. Justin Moy
7. Prech Uapinyoying
8. Svetlana Gorokhova
9. Vijay Ganesh
10. Ben Weisburd
11. Rotem OrBach
12. A. Reghan Foley
13. Pejman Mohammadi
14. David Adams
15. Carsten Bonnemann
This article has no evaluations
Latest version Jan 29, 2026

Understanding Pathways in Bioinformatics, Genomics, and Health Applications

This article has 1 author:
1. Diptarup Mallick
This article has no evaluations
Latest version Jan 19, 2026

Quantitative evaluation of microbiome sequencing resolution under varying experimental conditions using defined mock communities

This article has 5 authors:
1. Songhee Lee
2. Hyeonah Lee
3. Jung Wook Kim
4. Hyeon-Jin Kim
5. Kwang Jun Lee
This article has no evaluations
Latest version Dec 30, 2025
