Prompt engineering for accurate statistical reasoning with large language models in medical research

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The integration of generative artificial intelligence (AI), particularly large language models (LLMs), into medical statistics offers transformative potential. However, it also introduces risks of erroneous responses, especially in tasks requiring statistical rigor.

Objective

To evaluate the effectiveness of various prompt engineering strategies in guiding LLMs toward accurate and interpretable statistical reasoning in biomedical research.

Methods

Four prompting strategies: zero-shot, explicit instruction, chain-of-thought, and hybrid were assessed using artificial datasets involving descriptive and inferential statistical tasks. Outputs from GPT-4.1 and Claude 3.7 Sonnet were evaluated using Microsoft Copilot as an LLM-as-a-judge, with human oversight.

Results

Zero-shot prompting was sufficient for basic descriptive tasks but failed in inferential contexts due to lack of assumption checking. Hybrid prompting, which combines explicit instructions, reasoning scaffolds, and format constraints, consistently produced the most accurate and interpretable results. Evaluation scores across four criteria–assumption checking, test selection, output completeness, and interpretive quality confirmed the superiority of structured prompts.

Conclusion

Prompt design is a critical determinant of output quality in AI-assisted statistical analysis. Hybrid prompting strategies should be adopted as best practice in medical research to ensure methodological rigor and reproducibility. Additional testing with newer models, including Claude 4 Sonnet, Claude 4 Opus, o3 mini, and o4 mini, confirmed the consistency of results, supporting the generalizability of findings across both Anthropic and OpenAI model families. This study highlights prompt engineering as a core competency in AI-assisted medical research and calls for the development of standardized prompt templates, evaluation rubrics, and further studies across diverse statistical domains to support robust and reproducible scientific inquiry.

Article activity feed