Prompt Engineering for Structured Data: A Comparative Evaluation of Styles and LLM Performance



Abstract

The effectiveness of prompt engineering strategies for structured data generation remains an open challenge, especially as the capabilities and architectures of large language models (LLMs) continue to evolve. While prior research has examined a limited set of prompt styles using GPT-4o [8], this study expands the scope significantly by comparing six prompt styles (JSON, YAML, CSV, function-calling APIs, simple prefixes, and a hybrid CSV/prefix format) across three state-of-the-art LLMs: ChatGPT-4o, Claude, and Gemini. Building on our earlier work, which evaluated three prompt styles within GPT-4o, this study introduces a broader prompt set and performs a comparative analysis across multiple models to generalize and extend prior conclusions. The datasets generated under each prompt style are evaluated for each LLM across three critical metrics: accuracy in reproducing the expected data attributes, token cost for API usage, and time needed to generate the data. Our methodology incorporates structured data validation and analysis through Python utilities that ensure precise comparison and document each style's performance. We visualize the results via Technique vs. Accuracy, Technique vs. Token Cost, and Technique vs. Time graphs. Our results reveal trade-offs between prompt complexity and performance: simpler formats can provide efficiency benefits with minimal loss in accuracy, while more flexible formats offer greater versatility for handling complex data structures. Our extended findings demonstrate that prompt selection substantially affects both output quality and resource efficiency. Claude consistently yields the highest accuracy, ChatGPT-4o is the most token- and time-efficient, and Gemini offers a balanced trade-off across metrics. These results extend earlier single-model evaluations and provide practical guidelines for choosing prompt styles based on model capabilities and application-specific constraints. This work advances the field of prompt engineering by offering a comprehensive, multi-model framework for optimizing structured data generation.
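
To make the comparison concrete, the minimal sketch below illustrates the kind of Python harness the abstract describes: it issues two of the six prompt styles (a JSON request and a simple-prefix request), checks how many of the expected attributes the response reproduces, and records token count and generation time. The schema, prompt wording, and the call_model adapter are hypothetical placeholders for illustration, not the authors' evaluation code.

```python
import json
import time

# Assumed example schema; the study's actual attribute sets are not given in the abstract.
EXPECTED_ATTRIBUTES = {"name", "email", "age", "country"}

# Two of the six prompt styles compared in the study, with hypothetical wording.
PROMPT_STYLES = {
    "json": ("Return 5 records as a JSON array of objects with the keys "
             "name, email, age, country. Output JSON only."),
    "prefix": ("Generate 5 records, one per line, with each field prefixed: "
               "name=..., email=..., age=..., country=..."),
}

def call_model(prompt: str) -> tuple[str, int]:
    """Placeholder adapter: replace with a real API call (e.g. via the OpenAI,
    Anthropic, or Gemini Python client). Returns (response_text, tokens_used);
    here it emits a canned JSON reply so the harness runs end to end."""
    canned = json.dumps([{"name": "Ada", "email": "ada@example.com",
                          "age": 36, "country": "UK"}])
    return canned, 42

def attribute_accuracy(text: str, style: str) -> float:
    """Share of expected attributes the response reproduces, per prompt style."""
    if style == "json":
        try:
            records = json.loads(text)
        except json.JSONDecodeError:
            return 0.0
        records = records if isinstance(records, list) else [records]
        if not records:
            return 0.0
        hits = sum(len(EXPECTED_ATTRIBUTES & set(r)) for r in records)
        return hits / (len(EXPECTED_ATTRIBUTES) * len(records))
    # Prefix style: count expected "field=" markers in the raw text.
    found = {a for a in EXPECTED_ATTRIBUTES if f"{a}=" in text}
    return len(found) / len(EXPECTED_ATTRIBUTES)

results = {}
for style, prompt in PROMPT_STYLES.items():
    start = time.perf_counter()
    text, tokens = call_model(prompt)
    elapsed = time.perf_counter() - start
    results[style] = {
        "accuracy": round(attribute_accuracy(text, style), 3),
        "tokens": tokens,
        "seconds": round(elapsed, 4),
    }
print(json.dumps(results, indent=2))
```

Running the same harness against each provider's API would yield the per-style accuracy, token, and latency figures that the study aggregates into its Technique vs. Accuracy, Token Cost, and Time graphs.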
