The Reality of Prompt Engineering: Simplicity Often Outperforms Sophistication in Reasoning Tasks
Abstract
Despite the proliferation of prompt engineering techniques for large language models, the research literature presents contradictory findings about optimal strategies, with some studies showing substantial improvements while others demonstrate minimal effects. We conducted a comprehensive two-phase evaluation to systematically assess prompt effectiveness across challenging reasoning tasks. Our primary investigation tested five prominent prompt engineering techniques (zero-shot, few-shot, chain-of-thought, role-playing, and deliberation) across 27 reasoning tasks from BIG-Bench Hard using GPT-4o-mini, chosen as a cost-efficient and accessible model for practical deployment scenarios. We selected reasoning tasks because they enable direct comparison with human cognitive performance and provide objective evaluation metrics. The initial phase tested 13,500 queries (2,700 questions across 5 prompt types). Building on these findings, our second phase introduced task-specific variants (specialized role-play prompts and task-specific few-shot examples), testing an additional 5,400 queries across the same tasks. Overall performance rankings revealed that task-specific role-play achieved the highest accuracy (87.78%), followed closely by chain-of-thought (87.63%) and role-playing (87.15%). Surprisingly, zero-shot prompting (84.85%) significantly outperformed few-shot prompting (80.70%), a gap of 4.15 percentage points. A critical discovery emerged from response analysis: responses under all prompt types spontaneously exhibit chain-of-thought-style reasoning, even when it is not explicitly requested, suggesting that step-by-step reasoning is an internalized model behavior rather than a prompt-dependent phenomenon. Statistical analysis across both phases (18,900 total queries) revealed 14 significant performance differences (p < 0.05), with few-shot approaches consistently underperforming. Notably, role-play and zero-shot prompting proved most efficient in token usage and response time while maintaining competitive accuracy. The results demonstrate that prompt complexity does not guarantee superior performance and that task-domain alignment matters more than prompt sophistication. Our findings provide evidence-based guidance that challenges conventional assumptions about prompt engineering effectiveness.
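To make the five phase-one conditions concrete, the sketch below shows one way the prompt styles could be instantiated for a single BIG-Bench Hard style question before being sent to GPT-4o-mini. The template wording, the example question, and the `build_prompt` helper are illustrative assumptions for this article, not the exact prompts used in the study.

```python
# Illustrative sketch (not the authors' code): one possible instantiation of the
# five phase-one prompt styles for a BIG-Bench Hard style question.

QUESTION = (
    "Today is the first day of 2007. What is the date one week from today "
    "in MM/DD/YYYY?"
)

# A single solved example used only by the few-shot template (assumed wording).
FEW_SHOT_EXAMPLE = (
    "Q: Today is 12/31/2020. What is the date tomorrow in MM/DD/YYYY?\n"
    "A: 01/01/2021\n"
)

PROMPT_TEMPLATES = {
    # Plain question with no extra instruction.
    "zero_shot": "{q}",
    # Solved example(s) precede the target question.
    "few_shot": FEW_SHOT_EXAMPLE + "Q: {q}\nA:",
    # Explicit instruction to reason step by step.
    "chain_of_thought": "{q}\nLet's think step by step.",
    # The model is assigned a persona before answering.
    "role_play": "You are an expert in logical reasoning. {q}",
    # The model is asked to weigh alternatives before committing to an answer.
    "deliberation": (
        "{q}\nConsider the problem carefully, weigh alternative answers, "
        "then state your final answer."
    ),
}


def build_prompt(technique: str, question: str) -> str:
    """Fill the chosen template with the question text."""
    return PROMPT_TEMPLATES[technique].format(q=question)


if __name__ == "__main__":
    for name in PROMPT_TEMPLATES:
        print(f"--- {name} ---")
        print(build_prompt(name, QUESTION))
        print()
    # In the study, each filled prompt would be sent to GPT-4o-mini and the
    # response scored against the benchmark's gold answer; repeating this over
    # 2,700 questions and 5 prompt types yields the 13,500 phase-one queries.
```

The phase-two conditions follow the same pattern, except that the role-play persona and the few-shot examples are drawn from the specific task's domain rather than kept generic.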