Prediction of diabetic kidney disease risk using machine learning models: A population-based cohort study of Asian adults

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    There is an urgent need to improve prognostication of diabetic kidney disease in diverse populations, so this study is valuable in identifying specific predictive factors in a cohort of South East Asian populations whose baseline risk is higher. There are some limitations: the assumptions the authors make and the methods would benefit from further investigation and validation.

Abstract

Background:

Machine learning (ML) techniques improve disease prediction by identifying the most relevant features in multidimensional data. We compared the accuracy of ML algorithms for predicting incident diabetic kidney disease (DKD).

Methods:

We utilized longitudinal data from 1365 Chinese, Malay, and Indian participants aged 40–80 years with diabetes but free of DKD who participated in the baseline and 6-year follow-up visits of the Singapore Epidemiology of Eye Diseases Study (2004–2017). Incident DKD (11.9%) was defined as an estimated glomerular filtration rate (eGFR) <60 mL/min/1.73 m² with at least a 25% decrease in eGFR at follow-up from baseline. A total of 339 features, including participant characteristics, retinal imaging, and genetic and blood metabolites, were used as predictors. The performance of several ML models was compared with each other and with a logistic regression (LR) model based on established features of DKD (age, sex, ethnicity, duration of diabetes, systolic blood pressure, HbA1c, and body mass index) using the area under the receiver operating characteristic curve (AUC).
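The outcome definition above combines two conditions, which is easy to misread as either one alone. A minimal sketch of that rule follows; the function and parameter names are illustrative assumptions and are not taken from the study's code.

```python
def is_incident_dkd(egfr_baseline: float, egfr_followup: float,
                    threshold: float = 60.0, min_decline: float = 0.25) -> bool:
    """Apply the two-part rule stated in the abstract.

    A participant counts as an incident case only if follow-up eGFR is
    below `threshold` (mL/min/1.73 m^2) AND eGFR has fallen by at least
    `min_decline` (as a fraction) relative to baseline.
    Names and defaults are hypothetical, chosen to mirror the text.
    """
    if egfr_baseline <= 0:
        raise ValueError("baseline eGFR must be positive")
    relative_decline = (egfr_baseline - egfr_followup) / egfr_baseline
    return egfr_followup < threshold and relative_decline >= min_decline


# Follow-up below 60 with a ~39% decline: counted as incident DKD.
case_a = is_incident_dkd(90.0, 55.0)
# Follow-up below 60 but only an ~11% decline: not counted.
case_b = is_incident_dkd(65.0, 58.0)
# A 30% decline, but follow-up still at or above 60: not counted.
case_c = is_incident_dkd(100.0, 70.0)
```

Requiring both conditions keeps a participant who starts just above 60 and drifts slightly below it from being labelled as an incident case.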

Results:

The Elastic Net (EN) model had the best AUC (95% CI), 0.851 (0.847–0.856), a 7.0% relative improvement over LR at 0.795 (0.790–0.801). Sensitivity and specificity were 88.2% and 65.9% for EN vs. 73.0% and 72.8% for LR. The top 15 predictors included age, ethnicity, antidiabetic medication, hypertension, diabetic retinopathy, systolic blood pressure, HbA1c, eGFR, and metabolites related to lipids, lipoproteins, fatty acids, and ketone bodies.
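The reported 7.0% figure is a relative difference in AUC, not an absolute one, and the two models trade sensitivity against specificity differently. The short sketch below reproduces the arithmetic from the numbers quoted above. Youden's J is added here only as one common single-number summary of that trade-off; it is an assumption of this example and is not reported in the abstract.

```python
def relative_improvement_pct(new: float, reference: float) -> float:
    """Relative change of `new` over `reference`, in percent."""
    return (new - reference) / reference * 100.0


def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden's J = sensitivity + specificity - 1 (inputs as fractions)."""
    return sensitivity + specificity - 1.0


# Values quoted in the Results paragraph.
auc_gain = relative_improvement_pct(0.851, 0.795)
j_en = youden_j(0.882, 0.659)
j_lr = youden_j(0.730, 0.728)

print(f"Relative AUC gain of EN over LR: {auc_gain:.1f}%")
print(f"Youden's J, EN: {j_en:.3f}; LR: {j_lr:.3f}")
```

The absolute AUC difference is 0.056. On these figures, EN gains sensitivity at the cost of specificity, and J comes out at about 0.54 for EN against about 0.46 for LR.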

Conclusions:

Our results showed that ML, together with feature selection, improves prediction accuracy of DKD risk in an asymptomatic stable population and identifies novel risk factors, including metabolites.

Funding:

This study was supported by the Singapore Ministry of Health’s National Medical Research Council, NMRC/OFLCG/MOH-001327-03 and NMRC/HCSAINV/MOH-001019-00. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article activity feed

  1. eLife assessment

    There is an urgent need to improve prognostication of diabetic kidney disease in diverse populations, so this study is valuable in identifying specific predictive factors in a cohort of South East Asian populations whose baseline risk is higher. There are some limitations: the assumptions the authors make and the methods would benefit from further investigation and validation.

  2. Reviewer #1 (Public Review):

    In this valuable study, authors Sabanayagam and colleagues used multiple ML models on longitudinal data from a cohort of Chinese, Malay and Indian participants with diabetes to identify predictors for incident DKD.

    The study involves a large multi-ethnic cohort of Asian patients with diabetes and uses machine learning methods to predict 6-year CKD incidence risk in these patients. The final study cohort comprised 1365 patients and 339 features. The authors tested multiple ML methods to identify which provided the best prediction accuracy based on a select set of features.

    Strengths:

    The study is very interesting and timely, as efforts are needed to develop prognostic methods for the incidence of chronic kidney disease in patients with diabetes. The strength of the study is the diversity of its cohort and the impressive breadth of associated covariates, ranging from demographic, lifestyle, socioeconomic, physical, laboratory, retinal imaging, and genetic data to blood metabolomics profiles for patients.
    When assessing predictive risk for the progression of a disease, it is important to consider all possible risk components, ranging from environmental, metabolic, and physiological factors to social determinants of health, which the authors have done very well.

    The authors also did not restrict their analysis by selecting a single algorithm upfront, which strengthens the scientific process and avoids biasing the outcome.

    The authors take a data-driven approach, recursively eliminating features that do not contribute significantly to model performance. With a data set of this size, this is a logical way to go about the analysis.
    The authors do acknowledge the limitation of not having a validation dataset, which is important to address in the scientific process.
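    The recursive elimination step mentioned above can be sketched generically as follows. This is a minimal illustration and not the authors' pipeline: the importance function, feature names, and weights are all hypothetical, and the toy importance function returns fixed weights where a real one would refit the model on the remaining features at each round.

```python
from typing import Callable, Dict, List, Sequence


def recursive_feature_elimination(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    n_keep: int,
) -> List[str]:
    """Drop the least important feature one at a time until n_keep remain.

    `importance_fn` receives the features still in play and returns an
    importance score for each of them (in practice, by refitting a model).
    """
    remaining = list(features)
    if not 1 <= n_keep <= len(remaining):
        raise ValueError("n_keep must be between 1 and the number of features")
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)
        weakest = min(remaining, key=lambda name: scores[name])
        remaining.remove(weakest)
    return remaining


# Hypothetical fixed weights standing in for refitted model importances.
TOY_WEIGHTS = {"age": 0.9, "hba1c": 0.7, "feature_x": 0.1, "feature_y": 0.05}


def toy_importance(active: List[str]) -> Dict[str, float]:
    return {name: TOY_WEIGHTS[name] for name in active}


selected = recursive_feature_elimination(list(TOY_WEIGHTS), toy_importance, n_keep=2)
```

    Because features are ranked by their contribution to the fit, a feature can be eliminated for statistical reasons even when it is clinically meaningful, which is the concern raised under point 1 below.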

    Shortcomings:

    However, the study does have a few shortcomings which, when addressed or clarified, can hopefully help strengthen and streamline the analysis.

    1. Statistical significance versus clinical significance:
    The authors use recursive feature elimination to arrive at a set of top features for each ML algorithm, selecting from a varied feature set. However, they may need to pay closer attention to what the features that emerge as significant are pointing to. For example, the authors appear to have dropped the datasets containing the genetic and imaging parameters: D = B + Genetic parameters and F = B + Imaging parameters + Blood metabolites + Genetic parameters.
    They cite the low performance of the ML models as the reason for dropping these features but do not say whether they investigated why performance dropped.
    They state this in the manuscript with no citation:
    (line 82) "Similarly, genetic abnormalities in diabetes have also been shown to increase the risk of DKD."
    ... which makes it difficult to assess which of the 76 SNPs were associated with CKD, in which population, and to what extent.

    Similarly, the authors state that they have previously found features in imaging data to be associated with CKD:

    We and several others have previously shown that retinal microvascular changes including retinopathy, vessel narrowing, or dilation, and vessel tortuosity were associated with CKD [6, 7].

    However, they also drop the dataset that includes the imaging features, citing poor model performance, with no investigation beyond that.

    2. The authors speak about the advantage of using ML approaches to overcome the shortcomings of traditional assumptions from linear models. However, in considering their covariates, they might also want to understand the clinical association between some of their selected features. For example, BMI, HbA1c, duration of diabetes, and systolic BP may not be entirely independent of each other (especially in the context of influencing one another and driving diabetes), and multi-collinearity may need to be looked into.
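    One common way to screen for the multi-collinearity raised here is the variance inflation factor (VIF). The sketch below covers only the simplest case of two predictors, where VIF reduces to 1 / (1 - r²) with r the Pearson correlation. With more predictors, r² is replaced by the R² from regressing each predictor on all the others. The data values are made-up toy numbers, not study data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def pairwise_vif(x: Sequence[float], y: Sequence[float]) -> float:
    """VIF for a two-predictor model: 1 / (1 - r^2)."""
    r_squared = pearson_r(x, y) ** 2
    if r_squared >= 1.0:
        return math.inf  # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)


# Toy values for two hypothetical predictors (not study data).
predictor_a = [1.0, 2.0, 3.0, 4.0, 5.0]
predictor_b = [2.0, 4.0, 5.0, 4.0, 5.0]
vif_ab = pairwise_vif(predictor_a, predictor_b)
```

    A VIF of 1 means the predictors are uncorrelated. Rules of thumb often flag values above 5 or 10, though the appropriate cut-off is a judgment call.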

    3. The following sections seem to require citation:
    no citation:
    59: As CKD is asymptomatic till more than 50% of kidney function decline, early detection of individuals with diabetes who are at risk of developing DKD may facilitate prevention and appropriate intervention for DKD.

    Elaborate on rationale (what is challenging?) and citation needed:
    62 Early identification of individuals at risk of developing CKD in type 2 diabetes is challenging. Therefore, characterization of new biomarkers is urgently needed for identifying individuals at risk of progressive decline of eGFR and timely intervention for improving outcomes in DKD.

    Citation needed or rationale needs to be backed up:
    Machine learning methods using 'Big data', or multi-dimensional data may improve prediction as they have less restrictive statistical assumptions compared to traditional regression models which assume linear relationships between risk factors and the logit of the outcomes and absence of multi-collinearity among explanatory variables.

    Citation:
    81 Similarly, genetic abnormalities in diabetes have also been shown to increase the risk of DKD.

    Citation:
    The detailed methodology of the SEED has been published elsewhere.

    Citation:
    Malay ethnicity has been identified to be a high-risk group for CKD by several studies conducted in Singapore.

    4. The authors are attempting to rationalize the outcome of their findings rather than challenge them to improve the robustness of their analysis. In this section, it would help strengthen their analysis if they could find ways to eliminate reasons other than the one they provided or perform additional analysis that could show proof of their claim:
    While black ethnicity was a risk factor for CKD in the meta-analysis, in our study, we found Chinese and Malay ethnicity to be at higher risk of developing incident DKD compared to Indian ethnicity. One reason for the Indian ethnicity to be at lower risk of developing DKD could be Indian ethnicity being a high-risk group for diabetes, they may be well aware of the risk, and comply with screening, medication, etc. that could reduce their risk of developing DKD.

    5. Following up on the above point, the authors have decided to use SDOH (social determinants of health) to identify prognostic risk factors for the incidence of CKD in diabetic patients without considering what the model may be trying to say regarding ethnicity vs. socioeconomic status. It would be good to look at the association of SDOH metrics with ethnicity to see whether the ethnic populations at higher risk for CKD could be disadvantaged by socioeconomic factors; if so, this needs to be mentioned in the analysis.
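    The association check suggested in this point could start from a simple cross-tabulation of ethnicity against a categorical SDOH measure. The sketch below computes the Pearson chi-square statistic and Cramér's V for such a table. The counts are invented toy numbers with hypothetical row and column meanings, used only to show the calculation.

```python
import math
from typing import List, Sequence


def chi_square_statistic(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table: Sequence[Sequence[float]]) -> float:
    """Cramér's V effect size: sqrt(chi2 / (n * (min(r, c) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    if k < 1:
        raise ValueError("table needs at least two rows and two columns")
    return math.sqrt(chi_square_statistic(table) / (n * k))


# Toy counts: rows = three hypothetical groups,
# columns = (lower, higher) on some hypothetical socioeconomic measure.
toy_table: List[List[int]] = [[30, 20], [20, 30], [25, 25]]
chi2 = chi_square_statistic(toy_table)
v = cramers_v(toy_table)
```

    A strong association would suggest that the ethnicity term in the model may partly be standing in for socioeconomic differences, which is the possibility the point above asks the authors to examine.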

    6. EN vs other models: The authors claim that EN gives much better results than the other models in a study where the entire cohort consists of patients with diabetes who may be progressing towards CKD. Risk models usually assume that disease progresses along a certain trajectory. However, multiple trajectories may exist because of disease heterogeneity, and non-linear relationships between features and disease outcome might also influence this. This is what ML models can specifically address over traditional linear models. However, the pathophysiological progression from diabetes to CKD is not as non-linear as assumed, since heterogeneity in disease at that stage (~CKD stage 4) is primarily low and non-additive effects are most likely negligible, which also explains why EN and then LASSO perform so much better than the other models. This needs to be addressed by the authors in the paper.
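    For context on why EN and LASSO behave similarly: both fit a model that is linear in the features and differ only in the penalty placed on the coefficients. The sketch below uses the common parametrization penalty = λ · (α · ‖β‖₁ + (1 − α)/2 · ‖β‖₂²), where α = 1 gives the LASSO penalty and α = 0 the ridge penalty. The manuscript's exact parametrization and tuning values are not given here, so the names and numbers are assumptions for illustration.

```python
from typing import Sequence


def elastic_net_penalty(coefs: Sequence[float], lam: float, alpha: float) -> float:
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).

    alpha = 1.0 reduces to the LASSO penalty, alpha = 0.0 to the ridge penalty.
    """
    if lam < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("lam must be >= 0 and alpha must lie in [0, 1]")
    l1_norm = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1_norm + 0.5 * (1.0 - alpha) * l2_squared)


# Hypothetical coefficient vector and tuning values.
beta = [1.0, -2.0, 0.0]
lasso_penalty = elastic_net_penalty(beta, lam=0.5, alpha=1.0)
ridge_penalty = elastic_net_penalty(beta, lam=0.5, alpha=0.0)
mixed_penalty = elastic_net_penalty(beta, lam=0.5, alpha=0.5)
```

    Since the penalty only shrinks or zeroes coefficients, a penalized linear model outperforming more flexible learners is consistent with the reviewer's reading that the signal is largely additive.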

    I hope that addressing these points will help strengthen the paper and streamline it while also making the analysis and the outcomes clinically and statistically significant.

  3. Reviewer #2 (Public Review):

    In this study, the authors have successfully utilized and compared various supervised machine-learning techniques to identify the risk for the development of diabetic kidney disease. The study was further able to identify some potential novel risk factors for the development of diabetic kidney disease.

    The heterogeneous population and the identification of novel risk factors for diabetic kidney disease are some of the strengths of this study. Their definition of diabetic kidney disease, however, relies only on the decline in eGFR and lacks details of any other major events that may have affected the decline in kidney function during the follow-up period.

    Overall, it is an interesting study that advances the field of kidney disease, though its results need to be interpreted with caution due to significant limitations in the study design.

  4. Reviewer #3 (Public Review):

    In this manuscript, the authors compared the accuracy of 3 machine learning (ML) algorithms for predicting incident diabetic kidney disease (DKD) by using longitudinal data from 1,365 Chinese, Malay, and Indian participants from the Singapore Epidemiology of Eye Diseases (SEED) study cohort (median follow up 6 years). They report that their ML model "Elastic Net" had the highest AUC (0.85) of the 3 ML models, compared to a logistic regression model (AUC 0.79). The LR model was based on age, sex, ethnicity, duration of diabetes, systolic blood pressure, HbA1c, and body mass index. In 3 ML models, the authors included a range of variables including > 200 blood metabolites, single nucleotide polymorphisms, and eye imaging parameters.

    A major weakness of this study is the definition of incident DKD and the lack of albuminuria data: the authors define incident DKD as eGFR < 60 cc/min/1.73 m². This may underestimate the incidence of DKD, and may also label non-DKD as DKD (e.g. in an individual who experiences acute kidney injury without full recovery). Another major weakness is the treatment of ethnicity as a biological variable: in the strongest prediction model, Chinese vs Malay vs Indian ethnicity was one of the top 15 variables in the ML model. More explanation is needed of why ethnicity was included in both the ML models and the LR model. Further, a subgroup analysis of each of these groups was not performed. Finally, the rationale for the selection of the >200 metabolites is unclear. Several of the top 15 variables in all 3 models are these metabolites. Another top-15 variable in one of the models was noted to be "anti-diabetes medications", though the authors do not separate insulin vs non-insulin medications.