All You Need Is Context: Clinician Evaluations of various iterations of a Large Language Model-Based First Aid Decision Support Tool in Ghana

Abstract

As advancements in research and development expand the capabilities of Large Language Models (LLMs), there is a growing focus on their applications within the healthcare sector, driven by the large volume of data generated in healthcare. There are a few medicine-oriented evaluation datasets and benchmarks for assessing the performance of various LLMs in clinical scenarios; however, there is a paucity of information on the real-world usefulness of LLMs in context-specific scenarios in resource-constrained settings. In this work, 5 iterations of a decision support tool for medical emergencies using 5 distinct generalized LLMs were constructed, alongside a combination of Prompt Engineering and Retrieval Augmented Generation techniques. 50 responses were generated from the LLMs. Quantitative and qualitative evaluations of the LLM responses were provided by 13 physicians (general practitioners) with an average of 3 years of practice experience managing medical emergencies in resource-constrained settings in Ghana. Machine evaluations of the LLM responses were also computed and compared with the expert evaluations.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/13274625.

    This review is the result of a virtual, collaborative live review discussion organized and hosted by PREreview and JMIR Publications on June 20, 2024. The discussion was joined by 15 people: 2 facilitators, 2 members of the JMIR Publications team, 2 authors, and 9 live review participants, including 3 who agreed to be named (Aswathi Surendran, Khushboo Thaker, Arya Rahgozar, and Emmanuel Adamolekun) but did not contribute to the final composition of this review. The authors of this review dedicated additional asynchronous time over the course of two weeks to help compose this final report using the notes from the Live Review. We thank all participants who contributed to the discussion and made it possible for us to provide feedback on this preprint.

    Summary

    This study investigates the performance and application of Large Language Models (LLMs) as clinical decision support tools during medical emergencies in the resource-constrained settings of Low-and-Middle-Income Countries (LMICs) such as Ghana. The research aims to provide a premise for future research and development of LLM-based clinical decision support tools by assessing the suitability and effectiveness of five selected generalized LLMs using context-specific prompts. Thirteen medical experts, with an average of three years of experience working in limited-resource environments, evaluated the outputs of these models quantitatively using mean ranking scores and qualitatively using thematic analysis.

    The authors used off-the-shelf pre-trained LLMs (GPT-4 Turbo, Gemini 1.5 Pro, and Claude Sonnet) with Prompt Engineering and Retrieval Augmented Generation (RAG) techniques to develop five iterations of a decision support tool. Fifty responses were generated and evaluated. Machine evaluations were also performed, using conventional machine learning metrics such as BLEU and ROUGE, and compared with the clinicians' evaluations.
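    To make the machine-evaluation idea concrete, here is a minimal sketch of a ROUGE-1 style unigram overlap score between a model response and a reference answer. It is an illustration only and not the authors' evaluation pipeline: the function name `rouge1` and the example sentences are hypothetical, and tokenization is simplified to lowercase whitespace splitting.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """ROUGE-1 style precision, recall, and F1 from clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical sentences, for illustration only (not taken from the preprint).
reference = "apply firm pressure to the wound and transfer the patient quickly"
candidate = "apply pressure to the wound and call for help"
scores = rouge1(candidate, reference)
print(scores)
```

    Overlap metrics of this kind reward shared wording, which may be one reason they can diverge from clinicians' judgments of a response.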

    Their findings showed that Gemini 1.5 Pro + Prompt Engineering outperformed the other LLMs used in their research, while adjusting the other LLMs with suitable parameters improved their overall performance. This may imply that LLM-based first aid assistants could provide useful instructions for the management and treatment of medical conditions, especially in resource-constrained settings. The practitioners were generally satisfied with the diagnoses and instructions from these LLMs, demonstrating their potential and importance in managing medical emergencies. Future research should involve larger datasets, additional metrics, and more detailed evaluations to refine and enhance the use of LLMs in real-world medical emergencies.

    The discussion from participants of this live review is summarized below.

    List of major concerns and feedback

    Statistical Significance of Differences in Mean Ranking Scores

    • Concern: The paper does not assess whether the difference in mean ranking scores with a change in RAG approach (result in Table 2) is statistically significant.

    • Feedback: Perform statistical tests such as t-tests or the Kruskal–Wallis test by ranks to determine whether the differences in mean ranking scores are statistically significant. This will add robustness to the findings.
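    As a rough illustration of the suggested test, the sketch below computes the Kruskal–Wallis H statistic in plain Python and estimates a p-value by permutation rather than by the usual chi-square approximation. All function names and the scores are hypothetical and are not data from the preprint; in practice a library routine such as `scipy.stats.kruskal` would normally be used. The tie correction is omitted because it is constant across permutations and so does not change the permutation p-value.

```python
import random
from itertools import chain


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_p_value(groups, n_perm=5000, seed=0):
    """Return (observed H, share of random relabelings with H >= observed)."""
    rng = random.Random(seed)
    observed = kruskal_h(groups)
    pooled = list(chain.from_iterable(groups))
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if kruskal_h(shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical 0-10 scores for three model variants (NOT data from the preprint).
scores = [
    [7, 8, 6, 9, 7, 8],  # variant A
    [6, 5, 7, 6, 5, 6],  # variant B
    [8, 7, 9, 8, 7, 9],  # variant C
]
h_stat, p_value = permutation_p_value(scores)
print(f"H = {h_stat:.2f}, permutation p = {p_value:.4f}")
```

    A small p-value would suggest that at least one variant's scores differ from the others; it would not by itself say which one.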

    Incomplete Figures

    • Concern: The Figure 2 image is incomplete, with the right side cut off, and the Figure 1 legend is incomplete. In Figure 3, the data are not presented clearly enough to assess the correlation.

    • Feedback: Revise the figures to ensure they are complete and clearly labeled. This will improve the clarity and comprehensibility of the visual data.

    Availability of Google Form Reference

    • Concern: The Google form (reference 15) is not available.

    • Feedback: Ensure the Google form is accessible in the supplementary files. This is crucial for transparency and reproducibility.

    List of minor concerns and feedback

    • It would be helpful for the reader to see the aim of the work, the main results and the conclusion mentioned in the abstract.

    • Participants were a bit confused about Reference 1 in the authors section and wondered if that was the most appropriate place to cite the project involved with this study.

    • It is unclear if Claude 3.5 Sonnet or Claude 3 Opus was used. Please clarify.

    • It is unclear what "Low-and Low-Middle-Income countries (LMICs)" refers to. Is it Low Income Countries (LICs) or Lower Middle Income Countries (LMICs), the forms more commonly used, as defined by the World Bank?

    • In Section E of the Methodology it would be helpful to mention the total number of clinicians involved in the study. In Section G the text says "The first group of 30 responses were evaluated by all 13 physicians. The second group of 20 responses was evaluated by 8 of the physicians." It would be helpful to know why and how these 8 were selected out of the total 13.

    • In Section F of the Methodology section, the text presents a quote by one of the clinicians involved. It would be helpful to understand why this quote is presented in the text.

    • It would be helpful to have more information about the statistical tests used for the quantitative analysis and why they were chosen.

    • In the Results section there seems to be inconsistency in the labeling style of tables: Roman numerals in the text versus Arabic numerals in the figure label. It would be helpful to choose one style and be consistent throughout the manuscript so that the reader can better follow the results.

    • In the Results section, under the qualitative analysis section, the sentence "Table 3 shows the 8 codes and their descriptions." should refer to Table 4, not Table 3.

    • Figure 1 is a bit hard to read and understand. A bigger font and an explanation of what is plotted in the figure legend would significantly enhance comprehension.

    • In the second paragraph on page 6 the abbreviation EMS is first mentioned and it should be spelled out as the Emergency Medical Services (EMS).

    • It was expected that the RAG-based approach would perform better than the approach based solely on the LLM. It would be helpful if the authors discussed the results in the context of these expectations, highlighting potential limitations of the study.

    Concluding remarks

    We thank the authors of the preprint for posting their work openly for feedback. We also thank all participants of the Live Review call for their time and for engaging in the lively discussion that generated this review.

    Competing interests

    The authors declare that they have no competing interests.

  2. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/12958884.

    RESEARCH QUESTIONS 

    This study attempts to answer the following research questions:

    1. How well do five (5) distinct generalised Large Language Models (LLMs) perform, when combined with Prompt Engineering and Retrieval Augmented Generation (RAG) techniques, in providing first aid advice for medical emergencies in resource-constrained settings?

    2. Can context-specific prompts improve the relevance and suitability of LLM responses in clinical scenarios in Low-and Middle-Income Countries (LMICs) like Ghana?

    3. How do the evaluations of LLM-generated medical advice by clinicians (experts in managing medical emergencies in resource-constrained settings) compare with machine evaluations?

    RESEARCH MAIN GOAL AND ITS IMPORTANCE

    The main goal of this study was to evaluate the suitability and usefulness of distinct generalised LLMs, in combination with Prompt Engineering and RAG techniques, for clinical decision support in medical emergencies in resource-constrained settings in Ghana.

    This study provides insights on the potential usefulness of LLMs in improving healthcare delivery systems by augmenting the limited financial, logistical, and human resources available in LMICs. It also helps us to understand how simple, context-specific prompts can affect the performance of generalised LLMs, ensuring that the medical advice generated is not only accurate but also practical and relevant to the specific circumstances of patients. The study also highlights discrepancies between human and machine evaluations, emphasising the need for human input in assessment and for more sophisticated prompts in developing generalised LLM tools that can outperform state-of-the-art medical LLMs.

    RESEARCH MAIN APPROACH AND WHAT THE AUTHORS DID TO ADDRESS RESEARCH QUESTIONS

    The authors selected and tested top-performing LLMs (OpenAI's GPT-4 Turbo Preview, accessed through both the Assistants Application Programming Interface (API) and the Chat Completions API; Gemini 1.5 Pro; and Claude Sonnet) based on their rankings on the LMSYS Chatbot Arena Leaderboard, combined with Prompt Engineering and Retrieval Augmented Generation (RAG) techniques, to produce first aid responses for medical emergencies in resource-constrained settings in Ghana. They tuned the model parameters for quick responses and designed context-specific prompts using two chunking approaches. To divide the text into chunks, they used both the "CharacterTextSplitter" tool from LangChain and the "all-mpnet-base-v2" transformer model from the HuggingFace model hub. Thirteen clinician evaluators from Ghana, with an average of 3 years of working experience in resource-constrained settings, then rated the responses, and both quantitative and qualitative analyses were performed on the collected data. In total, 50 responses generated from the 5 LLMs using 10 clinical scenarios were evaluated and ranked in two groups, one per RAG approach, by the same pool of physicians. Approach 1: the 30 responses generated by RAG in group 1 were ranked by all 13 physicians and given an "Overall score" on a 10-point Likert scale, with *0* representing "Totally Unsatisfactory" and *10* "Totally Satisfactory". Approach 2: the other 20 responses, generated by RAG in group 2, were evaluated by 8 of the 13 physicians using a more robust approach based on accuracy, conciseness, safety, and helpfulness, in addition to the 10-point Likert scale.
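    The chunk-then-retrieve step described above can be sketched roughly as follows. This is an assumption-laden illustration and not the authors' code: it does not call LangChain's splitter or the all-mpnet-base-v2 embedding model, but stands in for them with fixed-size character chunks and a bag-of-words cosine similarity. All function names, the guideline text, and the prompt wording are hypothetical.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size=80, overlap=20):
    """Split text into fixed-size character chunks; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a learned sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = bag_of_words(query)
    return sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)[:top_k]


def build_prompt(question, context_chunks):
    """Assemble a context-specific prompt (the wording here is hypothetical)."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Setting: first aid in a resource-constrained setting.\n"
        f"Reference material:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical guideline text, for illustration only.
guideline = (
    "Bleeding: apply firm pressure to the wound. "
    "Burns: cool the burn with clean running water. "
    "Arrange transfer to the nearest hospital."
)
chunks = chunk_text(guideline, chunk_size=60, overlap=15)
best = retrieve("burn treatment steps", chunks, top_k=1)
print(build_prompt("burn treatment steps", best))
```

    Character-based chunks can cut words and sentences at arbitrary points, so the choice of chunk size and overlap may affect what the retriever finds.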

    RESEARCH MAIN FINDINGS

    The results of the quantitative analysis from this study show that Gemini 1.5 Pro combined with Prompt Engineering (Response B) outperformed the other Large Language Models (LLMs) and their various combinations in terms of accuracy, safety, and helpfulness, as well as having the overall best score of 7.8 on the 10-point Likert scale and a mean ranking of 7.4. However, the use of a more sophisticated RAG approach (a combination of Prompt Engineering and RAG) improved the performance of the other two models (GPT-4 Turbo and Claude Sonnet), although they were better than Gemini 1.5 Pro only in terms of conciseness.

    The qualitative analysis indicates that clinicians mostly valued Large Language Model (LLM) responses that were considered "satisfactory", while "concise" and "QuickTransfer" responses also had significant occurrences, suggesting their importance. There were, however, occasional concerns about the accuracy of diagnosis in some areas, as well as about responses that were considered not concise enough. In general, these results emphasise the importance of developing LLM-based tools that can provide first aid advice to physicians in order to effectively communicate and respond to clinical scenarios such as medical emergencies, especially in resource-constrained settings of Low-and-Middle-Income Countries.

    WHAT IS MOST INTERESTING ABOUT THE RESEARCH

    What stands out as most interesting from this research is that it encourages and sensitises application developers in resource-constrained settings of Low-and-Middle-Income Countries (LMICs) to develop cost-effective generalised LLMs, which are often more accessible to people in these settings and can perform on par with or better than specialised medical LLMs using simpler techniques. An example provided in the study is the SnooCODE Red application being developed in Ghana.

    RELATIONSHIP OF THE MANUSCRIPT TO PUBLISHED LITERATURE AND FUTURE RESEARCH

    The study builds on existing literature that demonstrates the potential and applications of generalised LLMs in improving healthcare delivery, supporting clinical decision-making, and acting as virtual health assistants, among other roles. The results of this study also pave the way for several future research directions:

    1. It shows that while RAG can enhance the performance of generalised LLMs, improper implementation of it can nullify its benefits. Future research could explore advanced RAG techniques and more sophisticated Prompt Engineering to further improve the accuracy and usefulness of generalised LLMs.

    2. It also paves the way for the development of cost-effective LLMs for LMICs which are designed to meet the specific needs of people living in these settings.

    3. The integration of LLMs into healthcare delivery systems is undoubtedly crucial; however, this study reveals discrepancies between human and machine evaluations of the performance of generalised LLMs, especially in light of contextual scenarios. For example, clinicians familiar with working in rural settings were not satisfied with LLM responses that did not demonstrate a higher sense of urgency in the quick transfer of patients to nearby hospitals. Therefore, future research could develop better evaluation metrics that will improve the ability of LLMs to capture those context-specific scenarios which are applicable to rural areas with limited resources.

    RESEARCH MAIN STRENGTH AND WEAKNESS

    The main strength of this study is found in the context-specific evaluation of generalised LLMs in resource-constrained settings. Also, by using physicians familiar with local medical challenges in rural areas, this study provides a realistic assessment of the suitability and effectiveness of generalised LLMs in practical scenarios.

    The main weakness is the limited scope and relatively small sample size, especially in terms of the number of physicians (13) involved and the number of clinical scenarios (10) evaluated. This may limit the generalisability of the findings, because a larger sample size and a more diverse group of evaluators might yield different results.

    MAJOR ISSUES

    Research Methodology

    1. In the abstract, the number of distinct generalised LLMs used could simply have been stated as 3 (GPT-4 Turbo, Gemini 1.5 Pro, and Claude Sonnet), alongside moderate Prompt Engineering and RAG, instead of the 5 mentioned, to align with the research methodology, results, and interpretation.

    2. Although the study clearly states that the data generated from Gemini 1.5 Pro is not compatible with RAG for retrievals, due to the particular model tools and RAG approaches used, I have reason to believe that there are specific RAG models on Google's website for AI developers which are compatible with Gemini tools for retrievals and which were not considered in this study (https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding-and-embedding). I think adopting suitable approaches for the implementation of RAG techniques that are designed for specific and individual LLMs should be paramount in this type of research. Testing Gemini 1.5 Pro using a suitable RAG model would have provided more insight into why and how the model outperformed the other LLMs. Additionally, doing this could have substantiated why RAG is being touted as a highly promising approach to improving the factuality, reasoning, and interpretability of LLM outputs, as clearly opined in this study.

    3. Response evaluation and ranking by physicians could have been more evenly distributed across all experimental data without favouring one RAG approach and LLM over another; the explanations provided by the authors for their choice of methodology appeared biased, favouring certain LLMs over others. This is also possibly the reason why RAG approach 2 had a higher mean ranking score than RAG approach 1. This approach does not address one of the questions this research is trying to answer, and at the same time it influences the outcome of the study. To eliminate this kind of bias, I would suggest that, since 50 responses were generated for the two RAG approaches, the responses be divided equally into two groups of 25 for each RAG approach, instead of 30 and 20. Also, all 13 physicians should have been involved in evaluating the data from both RAG approaches to provide a more robust and unbiased output. Finally, the evaluators should rank both RAG approaches using the Likert scale and the other parameters.

    4. In the methodology, it was mentioned that 20 responses were ranked based on "accuracy", "conciseness", and "helpfulness", whereas an additional parameter, "safety", appears in the results. The authors should clarify this.

    5. A link to the source code for the analyses should be provided, to ensure the reproducibility and validity of this study.

    6. The presentation and interpretation of results need to be checked and properly screened for errors to avoid misconceptions, wrong conclusions, and breach of research integrity.

    MINOR ISSUES

    1. In the introduction, "Low-and Low-Middle-Income countries (LMICs)" [2] should be Low-and-Middle-Income Countries (LMICs).

    Results

    1. In the results, under quantitative analysis, "Table 3 shows the 8 codes and their descriptions." Table 3 should have been Table 4.

    2. Can specific names or examples of such resource-constrained settings in Ghana where these LLMs were tested be mentioned? This may add a little bit of credibility and context to this research.

    3. The title of Figure 3 has typographical errors, and the results were difficult to interpret.

    4. In Table 1, the overall mean ranking score and standard deviation across the 3 models, when rounded to the nearest whole number, were 7.0 and 1.0 respectively, as opposed to the 7.1 and 1.4 stated above the table. Also, leaving the scores in decimals, as they are in the table, shows a level of significance among the models.

    5. One parameter "Overall Score" is missing in the title of Table 3.

    6. Use Arabic numerals, not Roman numerals, to name tables. For example, Table 4, not Table IV.

    7. Data in Table 3 are not recorded to the same number of decimal places. The majority were recorded to 1 decimal place, while others were whole numbers.

    8. The way the data in Figure 1 are presented makes the figure difficult to understand.

    9. From the methodology, section G, under response evaluation and ranking, there is a typographical error in the determination of a 10-point Likert scale. 0 should represent "totally unsatisfactory" and 10 "totally satisfactory".

    RESEARCH LIMITATIONS

    1. Limited availability of Application Programming Interface (API) and difficulty of access in selecting suitable LLMs.

    2. Lack of computational resources such as advanced GPUs to run high-ranking open source medical LLMs.

    3. Failure to use an appropriate RAG model and approach to potentially maximise the performance of the best-performing Gemini 1.5 Pro model.

    4. The authors acknowledged that they could have evaluated a larger cohort of responses and utilised a comprehensive evaluation framework.

    RECOMMENDATION

    I would recommend this interesting manuscript for publication and for others to read, provided that most, if not all, of the major issues and the minor issues raised are addressed. The authors may find the references listed below helpful.

    No conflict of interest.

    REFERENCES

    • Fogel, A.L., Kvedar, J.C. Artificial intelligence powers digital medicine. npj Digital Med 1, 5 (2018). https://doi.org/10.1038/s41746-017-0012-2

    • Topol, E.J. High-performance medicine: the convergence of human and artificial intelligence. Nat Med 25, 44–56 (2019). https://doi.org/10.1038/s41591-018-0300-7

    • Esteva, A., Robicquet, A., Ramsundar, B. et al. A guide to deep learning in healthcare. Nat Med 25, 24–29 (2019). https://doi.org/10.1038/s41591-018-0316-z

    • https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding-and-embedding

    • Bajwa, J., Munir, U., Nori, A., Williams, B. Artificial intelligence in healthcare: transforming the practice of medicine. Future Healthc J 8(2), e188–e194 (2021). https://doi.org/10.7861/fhj.2021-0095

    Competing interests

    The author declares that they have no competing interests.