Disentangling Confounders from Pathology in Long-COVID Trajectory Prediction for Women: An Interpretable Large-Language-Model Approach

Abstract

Objective

Post-acute sequelae of SARS-CoV-2 infection (PASC, “Long COVID”) disproportionately affects women, in whom hallmark symptoms—insomnia, fatigue, palpitations, cognitive difficulty—overlap with comorbidities and hormonal transitions such as menopause. This diagnostic overlap is a confounding problem: models that forecast future symptom severity risk attributing baseline physiological noise to viral pathology. We ask whether an interpretable, causally disentangled language model can separate true pathological signal from such confounders while remaining competitive with strong predictors of future PASC severity.

Materials and methods

We studied a retrospective cohort of 1,155 adult women (median age 61) from the NIH RECOVER program, combining static clinical profiles, longitudinal symptom surveys, and four weeks of mean-aggregated consumer-wearable physiology (heart rate, sleep, activity). We render each patient as a natural-language clinical narrative and fine-tune a small open-weight language model (Qwen2.5-0.5B, LoRA) with an attention-based disentanglement layer that gates the latent state into a causal component and a confounder component, trained with an environment-mixing InfoNCE objective. We predict the PASC index at 3-, 6-, and 9-month horizons and benchmark against last-value carry-forward, Lasso, Ridge, gradient-boosted trees (XGBoost), a deep MLP, a tabular ResNet, and a self-attention network, over 20 stratified resamples with paired significance tests. We further stratify results by trajectory phenotype (Protected / Responder / Refractory).
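The gating and contrastive objective can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not specify these details: the element-wise sigmoid gate, the cosine similarity, the temperature of 0.1, the particular mixing scheme (one patient's causal part combined with another patient's confounder part), and all function names are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_split(h, gate_logits):
    """Softly split latent vector h into causal and confounder parts.

    Assumption: an element-wise sigmoid gate g, so causal = g * h and
    confounder = (1 - g) * h, and the two parts sum back to h.
    """
    g = [sigmoid(a) for a in gate_logits]
    causal = [gi * hi for gi, hi in zip(g, h)]
    confounder = [(1.0 - gi) * hi for gi, hi in zip(g, h)]
    return causal, confounder

def mix_environments(causal_i, confounder_j):
    """Hypothetical 'environment mixing': keep patient i's causal part,
    swap in patient j's confounder part."""
    return [c + k for c, k in zip(causal_i, confounder_j)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

# Toy latent states for two hypothetical patients (illustrative values only).
h_i, h_j = [1.0, -2.0, 0.5], [0.3, 0.8, -1.1]
gate_logits = [2.0, 0.0, -2.0]
causal_i, _ = gate_split(h_i, gate_logits)
causal_j, confounder_j = gate_split(h_j, gate_logits)

# Positive view: same causal content as patient i, different confounder context.
positive = mix_environments(causal_i, confounder_j)
loss = info_nce(causal_i, positive, negatives=[causal_j])
```

Under this reading, minimizing the loss pulls the causal component toward views that share its causal content regardless of the confounder context, and pushes it away from other patients' causal components.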

Results

Long-COVID severity is strongly autocorrelated, so last-value carry-forward is a hard reference and is the most accurate method on the full cohort (MAE 3.02 / 1.99 / 1.52 at 3/6/9 months). Among learned models, the LLM regressor had the lowest MAE at every horizon (e.g. 3.11 vs. XGBoost 3.57 at 3 months; paired p ≤ 0.01). In the Responder phenotype (patients whose trajectories actually move), the LLM was the most accurate method overall at 3 and 6 months (MAE 4.72 and 4.06), though its advantage over carry-forward was not statistically significant (p = 0.70 and 0.29). The disentanglement layer assigned maximal saliency to direct pathology tokens (breathlessness, malaise; 1.00) while suppressing confounders (menopause, diabetes; < 0.27) and linguistic filler (< 0.17).
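To make the carry-forward reference concrete, the following sketch computes its MAE and a paired statistic across resamples. All numbers are toy values and are not study data. The abstract does not name the paired test that was used, so the paired t-statistic here is an assumption, as are the function names.

```python
import math
from statistics import mean, stdev

def carry_forward(last_observed):
    """Last-value carry-forward: predict that each future score equals the last one."""
    return list(last_observed)

def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted scores."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def paired_t(errors_a, errors_b):
    """Paired t-statistic on per-resample errors (negative means A has lower error)."""
    d = [a - b for a, b in zip(errors_a, errors_b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Toy PASC index values for four hypothetical patients.
last_scores = [10.0, 4.0, 7.0, 12.0]
future_scores = [9.0, 4.0, 10.0, 12.0]
baseline_mae = mae(future_scores, carry_forward(last_scores))

# Toy per-resample MAEs for two hypothetical models evaluated on the same splits.
model_a = [2.0, 2.5, 2.2, 2.4]
model_b = [2.6, 2.9, 2.5, 3.0]
t_stat = paired_t(model_a, model_b)
```

Because the errors are paired by resample, the statistic is computed on per-split differences rather than on the two sets of errors separately.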

Conclusion

For static, slowly evolving patients a simple carry-forward forecast is hard to beat and should be the reference any PASC model is judged against. The value of a learned, disentangled model is (i) better accuracy where trajectories are dynamic and (ii) an interpretable, “clinically honest” attribution that down-weights confounders such as menopause—reducing the risk of misattributing baseline physiology to Long COVID.

Author summary

Long COVID is more common and often more severe in women, but many of its symptoms look like other common conditions or like the normal changes of menopause. When a computer model tries to predict how a patient's symptoms will evolve, it can be fooled into blaming Long COVID for what is really background physiology. We built a model based on a small language model that reads a written summary of each patient (their history, comorbidities, and a month of wearable-device data) and is explicitly trained to separate "true disease signal" from "background noise." We tested it against standard predictors at 3, 6, and 9 months. We report a finding that is easy to overlook: because Long COVID severity changes slowly, simply assuming a patient's next score equals their last score is very accurate and hard to beat for stable patients. Our model's advantage appears where it matters clinically, in patients whose symptoms are actually changing. Importantly, the model also shows which words drove its prediction, correctly emphasizing symptoms like breathlessness while down-weighting confounders like menopause. We argue this interpretability, not a small accuracy gain, is the real contribution.
