Pyramid Framework: Leveraging Large Language Model Randomness for Enhanced Complex Diagnosis
Abstract
The uncertainty in large language model (LLM) responses to clinical diagnostic questions presents both a challenge and an opportunity. We exploit the randomness and diversity of LLM responses to develop the Pyramid Framework, which enhances performance in complex diagnosis. Using GPT-4o, Gemini-1.5-Pro, and Claude 3 Opus as sampling models, we evaluated the framework with Claude 3.5 Sonnet as the backbone LLM on 170 challenging cases from NEJM and 67 challenging offline cases. With Claude 3.5 Sonnet as the backbone, the Pyramid Framework achieved 46.1% accuracy and 79.0% coverage on the NEJM dataset, significantly outperforming Chain-of-Thought approaches (35.7% and 67.5%, respectively). Similar improvements were observed on the offline dataset. With o1-mini and o3-mini as the backbone LLM, the framework delivered accuracy improvements of 5.5–24.9% and coverage improvements of 11.9–28.9% across datasets. The framework significantly enhances LLMs' diagnostic performance in complex cases without additional expert-designed prompts, though further validation through prospective diagnostic trials is warranted.
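As a reading aid, the NEJM figures reported above can be turned into explicit gains over Chain-of-Thought. The short sketch below computes both the absolute gain (percentage points) and the relative gain (percent of the baseline); the helper name `gains` and the dictionary layout are illustrative assumptions, and only the four input percentages come from the abstract. The abstract does not state whether the 5.5–24.9% and 11.9–28.9% ranges for o1-mini and o3-mini are absolute or relative, so those are not recomputed here.

```python
def gains(baseline: float, framework: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = framework - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# NEJM figures quoted in the abstract, as (Chain-of-Thought, Pyramid) in percent.
nejm = {
    "accuracy": (35.7, 46.1),
    "coverage": (67.5, 79.0),
}

results = {metric: gains(cot, pyramid) for metric, (cot, pyramid) in nejm.items()}

for metric, (abs_gain, rel_gain) in results.items():
    print(f"{metric}: +{abs_gain} points, +{rel_gain}% relative")
# accuracy: +10.4 points, +29.1% relative
# coverage: +11.5 points, +17.0% relative
```

Reporting both forms side by side avoids ambiguity: a 10.4-point accuracy gain on a 35.7% baseline is a much larger relative change than an 11.5-point coverage gain on a 67.5% baseline.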