Rare k-mers reveal centromere haplogroups underlying human diversity and cancer translocations
Abstract
Centromeres are among the most diverse and dynamically evolving regions of the human genome and are commonly affected in various human cancers. However, because they are organized into highly repetitive α-satellite higher-order repeats (HORs), human centromere sequences have long resisted detailed genomic analysis. Although long-read sequencing platforms have enabled the analysis of complete centromere sequences, their application to large sample sets remains limited. This has hindered our understanding of centromere variation and haplotype structures across large human populations and of the structural basis of centromere-involving translocations in cancer. Here we show that rare k-mers present in centromeric regions can serve as effective markers for dissecting the complexity of centromere structure, particularly that of active α-satellite HOR arrays (aHOR arrays), across human populations and for understanding centromere-involving abnormalities in cancer. Rare k-mer-based clustering groups centromere aHOR arrays into discrete haplogroups (aHOR-HGs) with distinct structural features. These k-mers were also used to develop a framework, ascairn, that infers the haplogroups in a given sample from short-read whole-genome sequencing (WGS) data. By applying ascairn to large-scale human population datasets (n > 3,300), we revealed the diversity of aHOR-HGs and their geographic histories across populations. The rare k-mer-based approach was also applied to investigate the structure of 1p/19q co-deletion, a highly recurrent centromere-involving translocation in IDH-mutated oligodendrogliomas. Analyzing short-read WGS data from 142 cases with 1p/19q co-deletion using rare k-mers, we showed that the breakpoints of 1p/19q co-deletion mapped to aHOR arrays in chromosomes 1 (D1Z7) and 19 (D19Z3), which was validated by long-read sequencing of two 1p/19q co-deletion-positive cases. Notably, the translocation preferentially involved haplogroups composed of haplotypes containing larger regions susceptible to rearrangement. These results highlight the role of rare k-mers in dissecting the complexity of centromere sequences and their evolutionary history, as well as in understanding centromere-involving abnormalities associated with human diseases.