Genotyping TOMM40’523 Poly-T Polymorphisms Using Whole-Genome Sequencing
Abstract
The TOMM40’523 poly-T repeat polymorphism (rs10524523), located in the TOMM40 gene and in linkage disequilibrium with APOE, has been associated with cognitive decline and Alzheimer’s disease (AD) progression. Accurate genotyping of this polymorphism is crucial for understanding its role in neurodegeneration. Challenges in processing whole-genome sequencing (WGS) data traditionally require additional PCR and targeted sequencing assays to genotype these polymorphisms. Here, we introduce a novel computational pipeline that integrates multiple short tandem repeat (STR) detection tools in an ensemble machine learning model using XGBoost. This approach leverages STR tool predictions, k-mer counts, and related features to enhance poly-T repeat length estimation. Using a sample of 1,202 participants from four cohort studies, we benchmarked our method against PCR-based measures. Our ensemble model outperformed individual STR tools, improving repeat length estimation accuracy (R² = 0.92) and achieving an accuracy rate of 93.2% with PCR-derived genotypes as the gold standard. Additionally, we validated our WGS-derived genotypes by replicating previously reported associations between TOMM40’523 variants and cognitive decline, demonstrating consistency with prior findings. Our results suggest that computational genotyping from WGS data is a scalable and reliable alternative to PCR-based assays, enabling broader investigations of TOMM40 variation in studies where WGS data is available.
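To make the k-mer feature and the PCR benchmark more concrete, the following is a minimal Python sketch and not the authors' pipeline. It counts reads containing poly-T runs of at least k bases (one plausible form of a "k-mer count" feature) and scores WGS-derived repeat lengths against PCR lengths. All function names are hypothetical. The allele bins (Short ≤ 19, Long 20 to 29, Very Long ≥ 30 T residues) follow a convention commonly used in the TOMM40’523 literature and are an assumption here, since the abstract does not state them. R² is computed as the coefficient of determination, which may differ from the definition the authors used.

```python
import re

def poly_t_kmer_counts(reads, k_values=(10, 20, 30)):
    """For each k, count the reads that contain a run of at least k consecutive T's."""
    counts = {k: 0 for k in k_values}
    for read in reads:
        # Length of the longest uninterrupted T run in this read (0 if none).
        longest = max((len(run) for run in re.findall(r"T+", read.upper())), default=0)
        for k in k_values:
            if longest >= k:
                counts[k] += 1
    return counts

def allele_class(length):
    """Bin a repeat length into S / L / VL (assumed thresholds, see lead-in)."""
    if length <= 19:
        return "S"
    if length <= 29:
        return "L"
    return "VL"

def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against reference values."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def class_accuracy(pcr_lengths, wgs_lengths):
    """Fraction of alleles whose WGS-derived class matches the PCR-derived class."""
    matches = sum(
        allele_class(p) == allele_class(w) for p, w in zip(pcr_lengths, wgs_lengths)
    )
    return matches / len(pcr_lengths)

if __name__ == "__main__":
    # Toy reads and lengths, invented purely for illustration.
    reads = ["ACG" + "T" * 12 + "GCA", "AC" + "T" * 25 + "G", "ACGTACGT"]
    print(poly_t_kmer_counts(reads))
    pcr = [16, 34, 21, 35]
    wgs = [16, 33, 19, 36]
    print(r_squared(pcr, wgs), class_accuracy(pcr, wgs))
```

In an ensemble of the kind the abstract describes, counts like these would sit alongside the per-tool STR length estimates as input features to the regression model, with the PCR length as the training target.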