Predicting the Onset of Type 2 Diabetes (T2D) Based on Genetic and Clinical Risk Factors Using XGBoost ML Model
Abstract
Both hereditary and clinical risk factors influence the development of type 2 diabetes (T2D). A rich body of research exists on the effect of clinical factors on T2D, but less is known about how genetic factors influence its development. We therefore used a machine learning (ML) algorithm to better understand how genetic variants influence the development of T2D in the presence of high-, moderate-, and low-risk clinical factors.

We collected genetic and clinical risk factor datasets from publicly available sources. We probabilistically assigned genetic variants from the genetic dataset to the individuals in the clinical dataset to form a single dataset containing both clinical and genetic risk factors. An XGBoost XGBClassifier model was then trained on the combined dataset, and SHAP summary plots were generated for each risk group after model training.

ROC curves were used to assess the model's ability to predict diabetes cases across risk groups. Predictive performance, measured by AUC, was highest for the low-risk group (AUC 0.9116) and lower for the moderate-risk group (AUC 0.7372) and the high-risk group (AUC 0.7366). All models performed above the 0.7 AUC threshold, indicating that they are clinically applicable and not affected by the assignment of genetic variables. According to the SHAP plots, BMI and family history are key predictors of T2D across all risk groups. SNP effect sizes were more influential than the remaining clinical risk factors, indicating that genetic contributions, while secondary to BMI and family history, were still relevant.

While genetic treatments for diabetes remain experimental, our work supports emerging advances in pharmacogenomics and gene-based therapies by helping to identify which patients may benefit from specific drug regimens, including gene-based interventions.
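
The abstract does not specify how variants were assigned or how AUC was computed, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes that each individual's risk-allele count is drawn independently from a per-SNP allele frequency (Hardy-Weinberg proportions), and it computes AUC as the probability that a randomly chosen case scores higher than a randomly chosen control. The SNP names, frequencies, and function names are hypothetical and are not taken from the study.

```python
import random

# Hypothetical risk-allele frequencies; SNP names and values are illustrative only.
RISK_ALLELE_FREQ = {"snp_a": 0.30, "snp_b": 0.12}


def assign_genotypes(n_individuals, allele_freqs, seed=0):
    """Draw a risk-allele count (0, 1, or 2) per SNP for each individual.

    Assumption (not stated in the source): the two alleles are drawn
    independently with the given frequency, i.e. Hardy-Weinberg proportions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_individuals):
        row = {
            snp: sum(rng.random() < freq for _ in range(2))
            for snp, freq in allele_freqs.items()
        }
        rows.append(row)
    return rows


def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case score and a control score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

In practice the assigned genotype columns would be joined to the clinical features before fitting a classifier, and the pairwise AUC above would be replaced by a library routine for large datasets, since it compares every case with every control.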