A Simulation Study to Advance Human-Centred Artificial Intelligence via Digital Citizen Science: Can Large Language Models Transform Current Approaches to Missing Data Imputation?

Abstract

Background

Missing data is a persistent challenge in digital health research, and traditional approaches like Multiple Imputation by Chained Equations (MICE) may not capture complex patterns. While large language models (LLMs) could offer a viable alternative, their use in this context remains understudied. Moreover, a critical gap remains in embedding human-centred artificial intelligence (AI) approaches that integrate equity, transparency, and stakeholder participation. Digital citizen science, which leverages citizen-owned devices for ethical, participatory big data collection, offers a foundation to advance such approaches in digital health.

Objective

To evaluate and compare the imputation accuracy of MICE with the OpenAI o3 model for categorical variables in a simulated digital health dataset under different missingness mechanisms and levels, while situating this evaluation within the broader vision of human-centred AI enabled by digital citizen science.

Methods

A complete digital health dataset collected through a digital citizen science platform was used to simulate missingness under Missing at Random (MAR) and Missing Completely at Random (MCAR) mechanisms at 10%, 25%, and 50% missingness levels. MICE used logistic regression with five imputations and ten iterations per chain. For the o3 model, a structured prompt was generated for each missing entry using all available non-missing variables from the same record. Both methods were evaluated on each simulated dataset using classification accuracy and a closeness metric quantifying similarity to the original data. Statistical differences were tested with a two-sample Z-test, and misclassification patterns were examined by variable type and category frequency.
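The MICE side of this design can be illustrated with a minimal sketch: simulate a complete dataset, induce MCAR missingness in a binary categorical variable, then fill each gap with stochastic draws from a logistic regression fitted on the observed cases, repeated five times to mirror the study's five imputations. All variable names and the data-generating process here are illustrative assumptions, not the study's actual data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative "complete" dataset: two numeric predictors and one
# binary categorical outcome (names and signal are assumptions).
n = 500
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
y_true = (rng.random(n) < p).astype(int)

# Induce 25% MCAR missingness in the categorical variable.
miss = rng.random(n) < 0.25

# One chained-equation step: fit logistic regression on complete
# cases, then draw m=5 stochastic imputations from the predicted
# class probabilities (echoing MICE's five imputations).
model = LogisticRegression().fit(X[~miss], y_true[~miss])
probs = model.predict_proba(X[miss])[:, 1]
imputations = [(rng.random(miss.sum()) < probs).astype(int)
               for _ in range(5)]

# Pooled point prediction: majority vote across the 5 imputations,
# scored against the held-out true values as classification accuracy.
pooled = (np.mean(imputations, axis=0) >= 0.5).astype(int)
accuracy = (pooled == y_true[miss]).mean()
print(f"imputation accuracy: {accuracy:.2f}")
```

A full MICE implementation would cycle these regression steps across every incomplete variable for ten iterations per chain; the single pass above only shows the core fit-then-draw mechanic.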

Results

Under MAR conditions, MICE and o3 performed similarly, with average accuracies of 0.60 and 0.59 and closeness metrics of 0.83 and 0.85, respectively. Under MCAR, both methods achieved 0.59 accuracy, with closeness metrics of 0.84 and 0.85. No statistically significant differences were found across conditions (all p > 0.05).
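The reported comparison can be reproduced in outline with a standard two-proportion Z-test on the accuracy figures. The sample size below (200 imputed cells per method) is an assumption for illustration, since the abstract does not report the number of imputed entries.

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Two-sample Z-test for a difference in proportions,
    using the pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Reported MAR accuracies (0.60 for MICE vs 0.59 for o3);
# n = 200 per method is a hypothetical count, not from the study.
z, p = two_proportion_z(0.60, 0.59, 200, 200)
print(f"z = {z:.3f}, p = {p:.3f}")
```

With these assumed sample sizes the p-value comes out well above 0.05, consistent with the abstract's finding of no significant difference.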

Conclusion

While MICE remains preferred for continuous data, the o3 model shows promise as a complementary tool for categorical imputation in smaller datasets. Beyond methodological comparability, this study demonstrates how digital citizen science can serve as an ethical foundation for embedding human-centred AI into digital health research, positioning large language models not only as technical tools but also as vehicles for advancing equity, transparency, and participatory innovation in healthcare.
