Evaluating the accuracy and reliability of large language models in assisting with pediatric differential diagnoses: A multicenter diagnostic study
This article has been reviewed by the following groups
Listed in
- Evaluated articles (PREreview)
Abstract
Importance
Large language models, such as GPT-3, have shown potential in assisting with clinical decision-making, but their accuracy and reliability in pediatric differential diagnosis in rural healthcare settings remain underexplored.
Objective
To evaluate the performance of a fine-tuned GPT-3 model in assisting with pediatric differential diagnosis in rural healthcare settings and to compare its accuracy with that of human physicians.
Methods
Retrospective cohort study using data from a multicenter rural pediatric healthcare organization in Central Louisiana serving approximately 15,000 patients. Data from 500 pediatric patient encounters (age range: 0-18 years) between March 2023 and January 2024 were collected and split into training (70%, n=350) and testing (30%, n=150) sets.
Interventions
GPT-3 model (DaVinci version) fine-tuned using OpenAI API on training data for ten epochs.
Main Outcomes and Measures
Accuracy of fine-tuned GPT-3 model in generating differential diagnoses, evaluated using sensitivity, specificity, precision, F1 score, and overall accuracy. The model’s performance was compared to human physicians on the testing set.
Results
The fine-tuned GPT-3 model achieved an accuracy of 87% (131/150) on the testing set, with a sensitivity of 85%, specificity of 90%, precision of 88%, and F1 score of 0.87. The model’s performance was comparable to that of human physicians (accuracy 91%; P = .47).
Conclusions and Relevance
The fine-tuned GPT-3 model demonstrated high accuracy and reliability in assisting with pediatric differential diagnosis, with performance comparable to human physicians. Large language models could be valuable tools for supporting clinical decision-making in resource-constrained environments. Further research should explore implementation in various clinical workflows.
Article activity feed
This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/14057627.
This review is the result of a virtual, collaborative Live Review organized and hosted by PREreview and JMIR Publications on October 25, 2024. The discussion was joined by 21 people: 2 facilitators, 1 member of the JMIR Publications team, and 18 live review participants including 3 who agreed to be named here but did not contribute to writing this review: Nour Shaballout, Randa Salah Gomaa Mahmoud, Samaila Jackson Yaga. The authors of this review have dedicated additional asynchronous time over the course of two weeks to help compose this final report using the notes from the Live Review. We thank all participants who contributed to the discussion and made it possible for us to provide feedback on this preprint.
Summary
The study seeks to determine how accurately and reliably a fine-tuned GPT-3 model can assist with differential diagnosis in pediatric cases within rural healthcare environments. Specifically, it examines whether the AI model can match or approach the diagnostic accuracy of human physicians. By evaluating the model's diagnostic performance, the research aims to explore AI's potential to improve pediatric healthcare quality, reduce misdiagnosis, and support providers in underserved regions where accurate, timely diagnosis is critical for patient outcomes.
To address the research questions, the authors conducted a retrospective study using data from 500 pediatric cases from a multicenter rural pediatric healthcare organization in Central Louisiana, United States. The GPT-3 model was trained on 70% of the data, including symptoms and physician-provided differential diagnoses, and tested on the remaining 30%, achieving an accuracy of 87%, with sensitivity at 85% and specificity at 90%. These results were statistically comparable to human physicians, who had an accuracy of 91%. The findings suggest that AI can support clinical decision-making in pediatric care, especially in resource-constrained environments where access to specialists is limited.
The research addresses critical gaps in pediatric care by exploring AI's potential to support clinical decision-making, particularly in resource-limited settings. It presents thorough methodological details that enhance reproducibility and offer insights into AI applications in healthcare. The authors' transparency about limitations reflects research integrity, establishing a strong base for future studies. Furthermore, the focus on integrating AI into clinical workflows shows an understanding of practical challenges and underscores opportunities for advancing healthcare delivery through technology. However, the study presents some notable weaknesses, including a lack of assessment of patient outcomes and insufficient clarity in its methodology, indicating areas for future research and improvement. Below, we list specific concerns and recommendations on how to address them.
List of major concerns and feedback
1. Concerns with Techniques and Analyses
Model Choice: It is unclear why a specific Generative AI model (i.e., GPT-3, DaVinci version) was chosen for this study. Was the GPT-3 model (DaVinci version) selected due to its extensive use in medical AI research? Or was it chosen to facilitate comparison with previous studies? A statement explaining the choice of the AI model would significantly improve the reader's understanding of the study's context and its relationship to previous research.
Normality Test: The study does not address whether data normality was assessed before statistical analysis. Determining the distribution of the data is key to selecting the appropriate statistical test. The Kolmogorov-Smirnov test could aid in understanding the data distribution, specifically testing for normality. If the data do not meet normality criteria, non-parametric methods should be applied. Including a data normality assessment and explaining the choice of a particular statistical test would significantly strengthen the reliability of the study.
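As an illustration of this recommendation, normality could be checked before choosing between parametric and non-parametric comparisons. The sketch below is a minimal, hypothetical example in Python (the per-encounter score arrays are simulated and are not the study's data):

```python
# Minimal sketch: test normality, then choose a parametric or non-parametric comparison.
# The two arrays of per-encounter correctness scores are simulated, hypothetical data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
model_scores = rng.binomial(1, 0.87, size=150).astype(float)      # hypothetical
physician_scores = rng.binomial(1, 0.91, size=150).astype(float)  # hypothetical

# Kolmogorov-Smirnov test against a normal distribution fitted to the sample.
ks_stat, ks_p = stats.kstest(
    model_scores, "norm", args=(model_scores.mean(), model_scores.std(ddof=1))
)

if ks_p > 0.05:
    # Data consistent with normality: a parametric test may be appropriate.
    test_stat, p_value = stats.ttest_ind(model_scores, physician_scores)
else:
    # Non-normal data: fall back to a non-parametric alternative.
    test_stat, p_value = stats.mannwhitneyu(model_scores, physician_scores)

print(f"KS p={ks_p:.3f}, comparison p={p_value:.3f}")
```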
Evaluation Metrics: The study primarily uses specificity and sensitivity for evaluating LLM-generated responses, which may not capture the full quality of the outputs. Incorporating natural language processing metrics such as ROUGE and BLEU could help assess the quality of generated responses more comprehensively; ROUGE, for example, measures the overlap between an automatically generated response and a human-written reference. There are also issues associated with LLM-generated responses, such as hallucination and lack of attribution. Please specify or comment on how these and other issues were measured.
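For instance, overlap-based text metrics could complement the classification metrics. The following is a minimal sketch assuming the rouge-score and nltk packages are available; the reference and model-generated differential diagnosis strings are hypothetical examples:

```python
# Minimal sketch: ROUGE-L and BLEU between a physician-written reference and a model output.
# The example strings are hypothetical and only illustrate the metric calls.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "acute otitis media, viral upper respiratory infection, allergic rhinitis"
generated = "viral upper respiratory infection, acute otitis media, sinusitis"

# ROUGE-L: longest-common-subsequence overlap between the generated text and the reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure

# BLEU: n-gram precision of the generated text against the reference,
# smoothed because the strings are short.
bleu = sentence_bleu(
    [reference.split()], generated.split(),
    smoothing_function=SmoothingFunction().method1,
)

print(f"ROUGE-L F1: {rouge_l:.2f}, BLEU: {bleu:.2f}")
```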
Power Analysis Assumptions: The assumptions underlying the power analysis are unclear, particularly regarding how specific diagnoses affect this analysis. It is advised to elaborate on the power analysis methodology, including the rationale behind sample size choices and their implications for diagnosis variability.
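As one way to make these assumptions explicit, the authors could report a power calculation for comparing two diagnostic accuracies along the following lines. This is a sketch using statsmodels; the alpha and power values are assumptions, and the 87% vs. 91% figures simply mirror the reported point estimates:

```python
# Minimal sketch: sample size needed to detect a difference between two accuracies.
# Alpha and power are assumed values; 0.91 vs. 0.87 mirror the reported point estimates.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.91, 0.87)  # Cohen's h for two proportions
analysis = NormalIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Encounters needed per group: {n_per_group:.0f}")
```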
Sample Size and Generalizability: The sample size of 500 encounters may not adequately represent the broader pediatric population, particularly in diverse settings. Furthermore, utilizing data from a single healthcare organization limits the applicability of findings to other settings. These limitations should be discussed, particularly how the validity of the results might change when the model is tested with data from other healthcare centers. If possible, the authors should mention and cite studies that reported on this effect. Additionally, future studies should consider expanding the sample size through multicenter collaborations and/or including data from patients with more diverse demographics to validate results across different healthcare environments, thereby enhancing generalizability.
2. Details for Reproducibility of the Study
Software and Tools Documentation: The authors describe using both Python (with scikit-learn) and IBM SPSS Statistics, but it is unclear what the software's sources are. Specifying sources for Python and scikit-learn (e.g., "Python 3.8 [Python Software Foundation, Delaware, USA]") and clarifying the respective roles of Python and SPSS in the analyses would enhance transparency and allow for the reproducibility of the study.
Detailed Group Descriptions: The demographics, specifically age-group case counts, are under-specified, limiting the reader's understanding of the study sample. Adding a table or descriptive text detailing subgroup demographics, including age and case counts, would improve the study's interpretability and allow readers to better contextualize the findings.
Cross-validation Across Organizations: The model's reproducibility across various healthcare settings is not demonstrated. Evidence shows that models often underperform on data from different sources. Including cross-organization validation, and clearly acknowledging this limitation in the discussion with citations to relevant studies, would enhance robustness. Furthermore, addressing this limitation in future work could pave the way for broader adoption and application of the model.
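For illustration, site-aware validation can be expressed with scikit-learn's grouped splitters. The sketch below assumes each encounter carries a hypothetical site identifier, with a generic classifier standing in for the diagnostic model; the features, labels, and site IDs are simulated:

```python
# Minimal sketch: leave-one-site-out validation so each healthcare organization
# is held out in turn. Features, labels, and site IDs are hypothetical, simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # hypothetical encounter features
y = rng.integers(0, 2, size=500)        # hypothetical diagnosis labels
sites = rng.integers(0, 5, size=500)    # hypothetical site/organization IDs

logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=sites, cv=logo, scoring="accuracy")
print("Per-site held-out accuracy:", np.round(scores, 2))
```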
Data and Model Specifics for Replicability: The study would benefit from more thorough descriptions of dataset characteristics, fine-tuning model parameters, and preprocessing methods. For validation, consider adding multi-center dataset details. Adding this information would enable other researchers to replicate and build upon the study's findings, thereby enhancing its scientific contribution.
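For example, reporting the exact fine-tuning call and its hyperparameters would aid replication. The sketch below assumes the legacy openai Python package (v0.x), which supported davinci fine-tuning; the training file name and all hyperparameters other than the ten epochs stated in the preprint are assumptions:

```python
# Minimal sketch: legacy OpenAI fine-tuning call (openai-python v0.x API).
# The training file and any hyperparameters other than n_epochs are assumptions.
import openai

openai.api_key = "YOUR_API_KEY"

training_file = openai.File.create(
    file=open("pediatric_train.jsonl", "rb"),  # hypothetical prompt/completion pairs
    purpose="fine-tune",
)

job = openai.FineTune.create(
    training_file=training_file["id"],
    model="davinci",
    n_epochs=10,  # matches the ten epochs stated in the preprint
)
print(job["id"])
```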
Diagnostic Exclusion or Inclusion Clarification: The preprocessing section does not clarify if physician diagnostics were included or excluded, leading to potential confusion for readers and impacting reproducibility. It would be helpful to know whether physician diagnostics were included in training and why. Clarifying this aspect would help standardize study replication and improve the study's transparency.
3. Figures and Tables
Figure 1 is mentioned but not included in the article, which affects comprehension of the study design and findings. Please include Figure 1 or provide an alternative reference to explain the content of the missing figure. Figures are helpful for readers to quickly grasp complex methodologies and findings.
4. Ethics
Data Privacy: It is unclear whether a private or public instance of GPT-3 was used; if the latter, this raises potential HIPAA concerns. As pointed out above, we recommend specifying the version of GPT-3 used, with additional clarification of data privacy practices if a public model was used. Adding HIPAA considerations will enhance readers' confidence in the study's privacy protocols.
Discussion of Diagnostic Risk: The discussion would benefit from a deeper exploration of diagnostic risks associated with the use of AI in healthcare and clinical decision-making settings. One example is the potential of AI models to perpetuate and affirm existing human biases, thereby further exacerbating health disparities (one relevant citation could be: Mittermaier, M., Raza, M.M. & Kvedar, J.C. Bias in AI-based models for medical applications: challenges and mitigation strategies. npj Digit. Med. 6, 113 (2023). https://doi.org/10.1038/s41746-023-00858-z). The study also raises important social considerations, such as respecting human agency, particularly for vulnerable populations. Addressing parental concerns about deferring decision-making to AI is crucial, as is ensuring a socially attuned approach to build trust and understanding.
Lack of Clarity on Potential Implementation in Rural Healthcare Settings: The study could be strengthened by detailing how the AI model might be implemented in rural healthcare settings, including the specific challenges involved. Key considerations include the need for sufficient infrastructure (e.g., electricity, internet) and the necessity of training healthcare providers unfamiliar with AI tools. Additionally, discussing both the potential impact (e.g., improved diagnostic efficiency) and limitations (e.g., handling incomplete data or over-reliance on AI) would provide a more comprehensive roadmap for deployment in rural environments.
List of minor concerns and feedback
Data Distribution Gaps: No comparison of racial identity distribution between training and testing sets. Please consider adding a table or section on these demographic comparisons to ensure representation across subgroups.
Data description and context: It would be helpful to have more information about how the physicians were selected and their specific roles in the study.
Departmental Affiliations: The authors' affiliations lack department details, which limits transparency. Including departmental affiliations would increase transparency and traceability and provide context on the authors' expertise and institutional support.
Funding Transparency: The funding statement does not clearly specify whether the study was internally or externally funded. Explicitly state funding details, clarifying internal/external sources as applicable. Clear funding information will enhance transparency and address potential conflicts of interest.
Approval Number: While an ethical approval statement is present, it lacks the approval number, which is critical for ethical transparency. Please, include the ethics approval number/code to ensure proper documentation and strengthen the study's validity and trustworthiness.
Inconsistent data collection dates between the abstract and the data collection section (Lines 19 and 82).
Missing figure (Line 104)
Need for more descriptive statistics (mean, median, quartiles, standard deviation)
Data Distribution: Lack of comparison for racial/Hispanic identity distribution between training and testing sets. There's insufficient detail on age subgroup distribution.
Clarification Needed: The authors should provide a deeper discussion of the power analysis methodology.
The authors assessed that the distribution of age, gender, and chief complaints was similar between the training and testing sets; we suggest citing Table 5 to support this.
Table 1: The abbreviations in the formula column should be defined in the table legend (FN: false negative; FP: false positive; TN: true negative; TP: true positive); see the sketch after this list.
Please clarify why GPT-3.5 or GPT-4 was not used instead of GPT-3, despite being available at the time of the study.
Line 103 states that physicians were instructed to generate differential diagnoses; however, the data were described as being collected retrospectively. Please clarify.
Line 152: (Table 4) should be corrected to (Table 3)
Line 154: (Table 5) should be corrected to (Table 4)
Line 200 contains a typo: "may limit the of the finding".
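Relating to the Table 1 point above, the relationships behind the formula column could also be illustrated as follows. This is a minimal sketch with assumed confusion-matrix counts, not the study's values:

```python
# Minimal sketch: metrics derived from Table 1's abbreviations, using assumed counts.
TP, FP, TN, FN = 80, 11, 51, 8  # hypothetical values, not the study's

sensitivity = TP / (TP + FN)            # also called recall
specificity = TN / (TN + FP)
precision = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
accuracy = (TP + TN) / (TP + FP + TN + FN)

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"precision={precision:.2f}, F1={f1:.2f}, accuracy={accuracy:.2f}")
```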
Concluding remarks
We thank the authors of the preprint for posting their work openly for feedback. We also thank all participants of the Live Review call for their time and for engaging in the lively discussion that generated this review.
Competing interests
Daniela Saderi contributed to writing this review and was a facilitator of this call and one of the organizers. No other competing interests were declared by other reviewers who participated in discussing the preprint during the Live Review.