AI-Driven Synthetic Cohorts to Explore Genetic Associations: Lessons from Testicular Cancer with Relevance to Rare Conditions

Abstract

Objective: Rare cancers often involve small patient cohorts, which limit statistical power and hinder biomarker discovery. This proof-of-concept study evaluates the feasibility of repurposing the Synthpop R package (originally designed for data anonymization) to generate synthetic datasets and improve statistical inference in rare cancer studies. We demonstrate this approach using the association between the MC4R rs79783591 variant and clinical outcomes in testicular germ cell tumors (TGCT).

Materials and Methods: A retrospective cohort of 220 Mexican TGCT patients was analyzed, identifying 9 heterozygous carriers of the MC4R variant. To overcome limited sample size, we used Synthpop to generate 234 synthetic carriers, maintaining the original data structure and statistical distributions. Dataset fidelity was validated using machine learning models, principal component analysis, and structural similarity metrics. Cox proportional hazards models and Kaplan–Meier survival analyses assessed associations with overall survival.

Results: Real carriers were diagnosed at a younger median age (22 vs. 26 years). In the synthetic cohort, the MC4R variant was associated with a threefold increased mortality risk (HR = 3.15; 95% CI: 2.06–4.82; p < 0.001), supporting findings in the real cohort (HR = 5.69; 95% CI: 1.56–20.7; p = 0.008). Synthetic data narrowed confidence intervals and improved effect size estimation.

Conclusion: Repurposing the Synthpop R package provides a novel approach to enhance statistical power in studies with small sample sizes. This strategy can improve inference reliability and accelerate biomarker discovery in rare cancer research; however, further validation in independent cohorts is required to confirm these findings.
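The narrowing of the confidence intervals can be made concrete with a short calculation on the figures reported above. The sketch below is illustrative only: it assumes the intervals are Wald-type, that is, symmetric around log(HR) on the log scale, which is the usual default for Cox models but is not stated in the abstract. The function names `log_hr_se` and `wald_p_value` are hypothetical helpers, not part of Synthpop or of the authors' pipeline. It recovers the standard error of log(HR) from each reported interval and the two-sided Wald p-value implied by it.

```python
import math

# Two-sided 95% quantile of the standard normal distribution.
Z_95 = 1.959964


def log_hr_se(lower, upper, z=Z_95):
    """Standard error of log(HR) implied by a Wald-type confidence interval.

    Assumes the interval was built as exp(log(HR) +/- z * SE).
    """
    return (math.log(upper) - math.log(lower)) / (2 * z)


def wald_p_value(hr, se):
    """Two-sided p-value for H0: HR = 1 under a normal approximation."""
    z_stat = math.log(hr) / se
    return math.erfc(abs(z_stat) / math.sqrt(2))


# Hazard ratios and 95% CIs exactly as reported in the abstract.
real = {"hr": 5.69, "lower": 1.56, "upper": 20.7}
synthetic = {"hr": 3.15, "lower": 2.06, "upper": 4.82}

for label, est in (("real", real), ("synthetic", synthetic)):
    se = log_hr_se(est["lower"], est["upper"])
    p = wald_p_value(est["hr"], se)
    print(f"{label}: SE(log HR) = {se:.3f}, Wald p = {p:.2g}")
```

Under this assumption, the implied standard error of log(HR) is about 0.66 in the real cohort and about 0.22 in the synthetic cohort, and the recovered p-value for the real cohort is consistent with the reported p = 0.008.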
