Discriminating the prodromal stage of multiple sclerosis using longitudinal health administrative claims data and machine learning–based sequence analysis
Abstract
Background
Multiple sclerosis (MS) is a chronic autoimmune disease of the central nervous system. Early detection of the prodromal phase could enable timely interventions that may modify disease progression. This study leverages longitudinal health administrative claims (HAC) data to identify patterns distinguishing the prodromal stage of MS from other neurological conditions.
Methods
HAC data from the Czech Health Insurance Bureau (2017–2022) were analyzed across three cohorts: a target MS cohort with confirmed diagnoses, a control cohort with inconsistent MS suspicions, and a cohort with related disorders. We represented healthcare utilization and diagnostic code data in two ways: a temporal analysis using various time windows relative to the index date (including pre- and post-index date comparisons) and a separate segment-based analysis. Features were extracted using token frequencies and word embeddings. Random forest models were evaluated using the area under the receiver operating characteristic curve (AUC).
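To make the windowing and token-frequency steps concrete, the following is a minimal sketch, not the study's actual pipeline. The record layout, patient identifiers, index dates, diagnostic codes, and the helper names `pre_index_tokens` and `tf_idf` are all illustrative assumptions, as is the particular smoothed IDF formula. Each patient's diagnostic codes within a fixed window before the index date are treated as a "document" and weighted by TF-IDF.

```python
from collections import Counter
from datetime import date, timedelta
import math

# Hypothetical claim records: (patient_id, service_date, diagnostic_code).
# The ICD-10-style codes are used purely for illustration.
claims = [
    ("p1", date(2019, 1, 10), "H46"),
    ("p1", date(2019, 6, 2), "R20"),
    ("p1", date(2020, 3, 1), "G35"),
    ("p2", date(2018, 11, 5), "M54"),
    ("p2", date(2019, 2, 20), "R20"),
    ("p2", date(2019, 9, 9), "M54"),
]
# Hypothetical index dates (assumed here to be the first confirmed diagnosis).
index_dates = {"p1": date(2020, 3, 1), "p2": date(2019, 12, 1)}


def pre_index_tokens(claims, index_dates, window_days):
    """Collect each patient's codes from the window_days before the index date."""
    docs = {pid: [] for pid in index_dates}
    # Sort by patient and date so each token list is in chronological order.
    for pid, day, code in sorted(claims, key=lambda c: (c[0], c[1])):
        idx = index_dates[pid]
        # Keep only claims strictly before the index date and inside the window.
        if idx - timedelta(days=window_days) <= day < idx:
            docs[pid].append(code)
    return docs


def tf_idf(docs):
    """Return {patient: {code: weight}} with weight = tf * (ln((1+N)/(1+df)) + 1)."""
    n_docs = len(docs)
    # Document frequency: number of patients whose window contains the code.
    df = Counter(code for tokens in docs.values() for code in set(tokens))
    weights = {}
    for pid, tokens in docs.items():
        counts = Counter(tokens)
        weights[pid] = {
            code: count * (math.log((1 + n_docs) / (1 + df[code])) + 1.0)
            for code, count in counts.items()
        }
    return weights


docs = pre_index_tokens(claims, index_dates, window_days=730)
features = tf_idf(docs)
```

Changing `window_days` corresponds to varying the observation window; the resulting per-patient weight dictionaries could then be turned into a feature matrix for a classifier such as a random forest.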
Results
Each cohort included several hundred to over a thousand individuals. The models achieved AUCs of approximately 0.9 for distinguishing the target cohort from controls, with even higher performance in differentiating pre- and post-diagnosis phases. Longer observation windows improved predictive accuracy, and feature extraction methods such as TF-IDF and word2vec yielded the most consistent results. Segment-based analysis identified a subset of individuals for potential diagnostic reclassification. Interpretable machine learning techniques were integrated into the analysis pipeline.
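As a reading aid for the AUC values reported above, the AUC can be understood as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name `roc_auc` and the labels and scores are made-up illustrations, not study data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 1 = target cohort, 0 = control cohort (hypothetical scores).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

In this toy example, eight of the nine positive/negative pairs are ordered correctly, so `auc` equals 8/9. An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.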
Conclusions
This study highlights the potential of HAC data for detecting early prodromal indicators of MS. Unlike previous research, which often focused on the volume of healthcare utilization, this work explores the informational content within diagnostic codes and healthcare utilization patterns. The findings align with existing research on early neurological condition detection, demonstrating that administrative data could support early identification and intervention in MS and possibly other diseases.