Uncovering Functional Sequence Gaps in Human Reference Genomes using African Pan Genome Contig Sequences


Abstract

Widely used human reference genomes, including GRCh38 and the gapless T2T-CHM13 reference, remain limited in their representation of African genomic diversity. We analyzed African Pan-Genome (APG) contig sequences, which represent 296.5 Mb of African-ancestry-specific sequence not represented in GRCh38, to assess their representation and functional potential in newer long-read assemblies. Alignments to T2T-CHM13 and to the 47 Human Pangenome Reference Consortium (HPRC) linear assemblies positioned 40% and 83% of APG contigs, respectively, with high identity and coverage. Most T2T-CHM13 placements corresponded to sequences absent from GRCh38 (94.5%) and were enriched in centromeric and satellite repeats (94.2%). Functional overlap included annotated genes (2.6%) and CpG islands (3.6%), with enrichment in immune, synaptic and intracellular signaling pathways. HPRC alignments revealed ancestry-associated patterns, with African and Admixed American genomes showing the highest numbers of unique contig placements and shared alignments. A subset of 742 APG contigs showed weak or no mapping to any of the reference assemblies. These contigs were not enriched for repeat elements, and ∼58% showed predicted gene content. These findings highlight persistent gaps in even the most complete reference genomes and underscore the importance of incorporating ancestry-enriched sequences into future genome frameworks to reduce reference bias and advance equitable discovery in genomics.
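
As an illustration of what "positioned with high identity and coverage" could mean in practice, the following is a minimal sketch that counts contigs as placed when at least one alignment record passes identity and query-coverage cutoffs. It assumes alignments in PAF format (the tab-separated layout written by aligners such as minimap2); the abstract does not name the aligner or the thresholds, so the cutoff values and the function names `placed_contigs` and `placement_fraction` are hypothetical.

```python
from typing import Iterable, Set

# Hypothetical cutoffs: the abstract does not state the thresholds used.
MIN_IDENTITY = 0.90
MIN_COVERAGE = 0.80


def placed_contigs(paf_lines: Iterable[str],
                   min_identity: float = MIN_IDENTITY,
                   min_coverage: float = MIN_COVERAGE) -> Set[str]:
    """Return names of query contigs with at least one passing alignment."""
    placed: Set[str] = set()
    for line in paf_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        # PAF columns: 1 query name, 2 query length, 3 query start,
        # 4 query end, 10 matching bases, 11 alignment block length.
        qname = fields[0]
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        n_match, aln_len = int(fields[9]), int(fields[10])
        if qlen == 0 or aln_len == 0:
            continue  # avoid division by zero on malformed records
        identity = n_match / aln_len        # fraction of matching bases
        coverage = (qend - qstart) / qlen   # fraction of the contig aligned
        if identity >= min_identity and coverage >= min_coverage:
            placed.add(qname)
    return placed


def placement_fraction(paf_lines: Iterable[str], total_contigs: int) -> float:
    """Fraction of all contigs that were placed under the default cutoffs."""
    return len(placed_contigs(paf_lines)) / total_contigs
```

This sketch judges each alignment record on its own; a contig whose coverage is split across several records would need those records merged before applying the coverage cutoff.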
