Uncovering Latent Cognitive Subgroups via MMSE Scores in a Bangladeshi Dementia Cohort: An Integration of Data Coresets and Ranked Set Sampling in Gaussian Mixture Modeling

Abstract

Background: Alzheimer's disease (AD) affects 57 million people worldwide, a number expected to rise further in low- and middle-income countries (LMICs); in Bangladesh, the reported prevalence of dementia among persons aged ≥ 60 years is 8.0%. Although widely used in cognitive screening, the Mini-Mental State Examination (MMSE) exhibits variable sensitivity (23–76% for mild cognitive impairment) and significant age and education effects, which preclude the use of fixed cutoff strategies to define cognitive subgroups. New evidence suggests that AD comprises heterogeneous subtypes with different cognitive pathways rather than a single homogeneous disease progression.

Methods: This study used Gaussian Mixture Models (GMMs) to identify latent cognitive subgroups in a Bangladeshi AD cohort (N = 663) from the National Institute of Neurosciences and Hospital, Dhaka (January 2019 to August 2024). MMSE scores were divided into severe dementia (MMSE 0–10), moderate dementia (MMSE 11–20), mild cognitive impairment (MMSE 21–25), and normal cognition (MMSE > 25). Two computationally efficient subsampling methods, Ranked Set Sampling (RSS; N = 320) and coreset construction (N = 198), were compared with respect to overlap minimization and distributional fidelity. Four-component GMMs were estimated using the Expectation-Maximization algorithm on the entire dataset and on the two subsamples. The fitted models were evaluated based on log-likelihood values, convergence behavior (ε = 10⁻⁸), and pairwise overlap percentages between components.

Results: MMSE scores showed substantial pairwise overlap between neighboring severity components (46.5% for severe-moderate), reflecting linear cognitive decline. The RSS subsample (N = 320) was concentrated in mild impairment (58.28%), with low overlap (1.2%) but a serious underrepresentation of severe cases (7.32%). Coreset sampling, which retained 30% of the original data, gave a balanced severity representation (severe: 11.61%) and moderate overlap (18.23%), and its weighted mean (19.79) was closest to the population mean (19.74).

Conclusions: Coreset sampling outperformed Ranked Set Sampling because it maintained a balanced severity representation across all levels of cognitive impairment while using only 30% of the original data. This scalable method facilitates effective CD (cognitive disease) subtype identification in resource-limited environments, supporting enhanced clinical trial design and individualized risk assessment.
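
The fitting step described in the Methods can be illustrated with a minimal sketch of a four-component, one-dimensional GMM estimated by Expectation-Maximization, stopping once the log-likelihood changes by less than ε = 10⁻⁸. This is not the authors' implementation. The data are synthetic, the function names (`normal_pdf`, `draw_truncated`, `fit_gmm_1d`), the variance floor, the iteration cap, and the choice of starting means at the midpoints of the four MMSE bands are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def draw_truncated(rng, mean, sd, low=0.0, high=30.0):
    """Draw a normal value by rejection so it stays in the MMSE range 0-30."""
    while True:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x


def fit_gmm_1d(data, init_means, tol=1e-8, max_iter=500, min_var=1e-3):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means."""
    k, n = len(init_means), len(data)
    means = list(init_means)
    weights = [1.0 / k] * k
    grand_mean = sum(data) / n
    variances = [sum((x - grand_mean) ** 2 for x in data) / n] * k
    prev_ll = -math.inf
    for iteration in range(1, max_iter + 1):
        # E-step: responsibilities and log-likelihood under current parameters.
        resp, ll = [], 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances from responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # assumed floor against collapse
        # Stop when the log-likelihood change falls below the tolerance.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances, ll, iteration


# Synthetic scores only (NOT the cohort data): (mean, sd, count) per group.
rng = random.Random(0)
groups = [(5.0, 2.5, 50), (16.0, 3.0, 150), (23.0, 1.5, 120), (28.0, 1.0, 80)]
scores = [draw_truncated(rng, m, s) for m, s, n in groups for _ in range(n)]

# Assumed starting means: midpoints of the bands 0-10, 11-20, 21-25, 26-30.
weights, means, variances, ll, iterations = fit_gmm_1d(
    scores, init_means=[5.0, 15.5, 23.0, 28.0]
)
for w, m, v in zip(weights, means, variances):
    print(f"weight={w:.3f}  mean={m:.2f}  sd={math.sqrt(v):.2f}")
print(f"log-likelihood={ll:.3f} after {iterations} iterations")
```

The same function could be run on a full sample and on a subsample to compare log-likelihoods and component estimates, which is the kind of comparison the abstract reports. How the pairwise overlap percentages were computed is not stated in the abstract, so that step is left out here.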
