Search AI: a Machine Learning algorithm for chronic kidney disease risk detection using eight readily available clinical features

Abstract

Background: Chronic kidney disease (CKD) is a leading global cause of morbidity and mortality, particularly in low- and middle-income countries (LMICs), where access to specialized laboratory tests is limited. Early detection is essential but is often delayed by reliance on the serum creatinine-based estimated glomerular filtration rate (eGFR). Artificial intelligence (AI) offers opportunities for simple, sensitive screening models built on routinely available variables.
Methods: We trained and tested a low-cost machine learning algorithm on a multicenter Latin American dataset of 203,067 anonymized records to identify patients at risk of CKD, defined as an eGFR < 60 mL/min/1.73 m² (CKD-EPI 2021). Eight routinely available, non-invasive variables were used: age, sex, systolic blood pressure, diastolic blood pressure, body mass index, hypertension, presence of type 2 diabetes (T2D), and diabetes duration (T2DD). To address the imbalance between CKD-positive and CKD-negative cases, oversampling techniques were applied before splitting the dataset into training (70%), validation (12%), and testing (18%) sets. Using the Arkangel AutoML platform, 424 candidate models were generated, including decision trees, random forests, support vector machines, XGBoost, and deep neural networks. Models were prioritized according to predefined criteria: sensitivity > 90%, followed by AUC, precision, specificity, and F1 score.
Results: The final model was a decision tree trained on a non-stratified sample augmented with the SMOTE technique. Sensitivity was 90.2%, specificity 92.7%, precision (PPV) 89%, and AUC 91.4%. Binary regression demonstrated the statistical relevance of all the model’s features in predicting CKD risk in our sample. SHAP analysis identified age and diabetes duration as the most influential features in the final ML model.
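
The CKD label described in the Methods rests on the CKD-EPI 2021 equation and a cutoff of 60 mL/min/1.73 m². As a minimal sketch, assuming the creatinine-only form of the 2021 equation and serum creatinine in mg/dL, the label could be computed as follows. The function and variable names are illustrative and are not taken from the study's code.

```python
# Sketch of the CKD-EPI 2021 creatinine equation (assumed: creatinine-only form,
# serum creatinine in mg/dL). Names are hypothetical, not from the study.

CKD_EGFR_THRESHOLD = 60.0  # mL/min/1.73 m², the cutoff stated in the abstract


def egfr_ckd_epi_2021(serum_creatinine_mg_dl: float, age_years: float, is_female: bool) -> float:
    """Return estimated GFR in mL/min/1.73 m²."""
    kappa = 0.7 if is_female else 0.9          # sex-specific creatinine scaling
    alpha = -0.241 if is_female else -0.302    # sex-specific exponent below kappa
    ratio = serum_creatinine_mg_dl / kappa
    egfr = (
        142.0
        * min(ratio, 1.0) ** alpha
        * max(ratio, 1.0) ** -1.200
        * 0.9938 ** age_years
    )
    if is_female:
        egfr *= 1.012
    return egfr


def ckd_label(serum_creatinine_mg_dl: float, age_years: float, is_female: bool) -> int:
    """Return 1 when eGFR falls below the 60 mL/min/1.73 m² cutoff, else 0."""
    egfr = egfr_ckd_epi_2021(serum_creatinine_mg_dl, age_years, is_female)
    return int(egfr < CKD_EGFR_THRESHOLD)
```

This label is only the training target: the screening model itself takes the eight non-laboratory variables as input and does not require creatinine at prediction time.
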
Conclusions: A decision tree model trained on eight routine clinical variables accurately identified individuals at risk of CKD, achieving high sensitivity and balanced performance without requiring specialized tests. This approach is feasible for large-scale screening in low-resource settings and can be integrated into electronic health records to prioritize confirmatory diagnostics and timely care. It also represents one of the first approaches to CKD diagnosis using ML models trained exclusively on Latin American data.
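
The model prioritization rule in the Methods (sensitivity above 90%, followed by AUC, precision, specificity, and F1 score) can be illustrated with a short sketch. It assumes that "followed by" means a lexicographic ranking among the models that pass the sensitivity filter; the abstract does not specify this, and the class name, function name, and all metric values below are hypothetical rather than results from the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateMetrics:
    """Held-out metrics for one candidate model (all values in [0, 1])."""
    name: str
    sensitivity: float
    auc: float
    precision: float
    specificity: float
    f1: float


def prioritize(candidates: List[CandidateMetrics], min_sensitivity: float = 0.90) -> List[CandidateMetrics]:
    """Keep models with sensitivity above the threshold, then rank the rest.

    Assumed ranking: AUC first, with precision, specificity, and F1 used
    only to break ties, in that order.
    """
    eligible = [c for c in candidates if c.sensitivity > min_sensitivity]
    return sorted(
        eligible,
        key=lambda c: (c.auc, c.precision, c.specificity, c.f1),
        reverse=True,
    )


# Hypothetical candidates, for illustration only.
example_candidates = [
    CandidateMetrics("tree_a", sensitivity=0.95, auc=0.88, precision=0.80, specificity=0.85, f1=0.87),
    CandidateMetrics("tree_b", sensitivity=0.91, auc=0.92, precision=0.88, specificity=0.90, f1=0.89),
    CandidateMetrics("svm_c", sensitivity=0.85, auc=0.97, precision=0.93, specificity=0.96, f1=0.89),
]
ranked = prioritize(example_candidates)
```

In this example `svm_c` is excluded despite having the highest AUC, because it fails the sensitivity filter. That ordering of priorities reflects a screening tool's goal of missing as few at-risk patients as possible.
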
