Prompt Engineering for Structured Data: A Comparative Evaluation of Styles and LLM Performance



Abstract

The effectiveness of prompt engineering strategies for structured data generation remains an open challenge, especially as the capabilities and architectures of large language models (LLMs) continue to evolve. While prior research has examined a limited set of prompt styles using GPT-4o [8], this study expands the scope significantly by comparing six prompt styles (JSON, YAML, CSV, function-calling APIs, simple prefixes, and a hybrid CSV/prefix format) across three state-of-the-art LLMs: ChatGPT-4o, Claude, and Gemini. Building on our earlier work, which evaluated three prompt styles within GPT-4o, this study introduces a broader prompt set and performs a comparative analysis across multiple models to generalize and extend prior conclusions. The datasets generated under each prompt style are evaluated for each LLM across three critical metrics: accuracy in reproducing the expected data attributes, token cost for API usage, and time needed to generate the data. Our methodology incorporates structured data validation and analysis through Python utilities that ensure precise comparison and document each style's performance. We visualize the results via Technique vs. Accuracy, Technique vs. Token Cost, and Technique vs. Time graphs. Our results reveal trade-offs between prompt complexity and performance: simpler formats can provide efficiency benefits with minimal loss in accuracy, while more flexible formats offer greater versatility for handling complex data structures. Our extended findings demonstrate that prompt selection substantially affects both output quality and resource efficiency. Claude consistently yields the highest accuracy, ChatGPT-4o is the most token- and time-efficient, and Gemini offers a balanced trade-off across metrics. These results extend earlier single-model evaluations and provide practical guidelines for choosing prompt styles based on model capabilities and application-specific constraints. This work advances the field of prompt engineering by offering a comprehensive, multi-model framework for optimizing structured data generation.
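
To make the comparison concrete, the minimal sketch below illustrates the kind of Python harness the abstract describes: it issues two of the six prompt styles (a JSON request and a simple-prefix request), checks how many of the expected attributes the response reproduces, and records token count and generation time. The schema, prompt wording, and the call_model adapter are hypothetical placeholders for illustration, not the authors' evaluation code.

```python
import json
import time

# Assumed example schema; the study's actual attribute sets are not given in the abstract.
EXPECTED_ATTRIBUTES = {"name", "email", "age", "country"}

# Two of the six prompt styles compared in the study, with hypothetical wording.
PROMPT_STYLES = {
    "json": ("Return 5 records as a JSON array of objects with the keys "
             "name, email, age, country. Output JSON only."),
    "prefix": ("Generate 5 records, one per line, with each field prefixed: "
               "name=..., email=..., age=..., country=..."),
}

def call_model(prompt: str) -> tuple[str, int]:
    """Placeholder adapter: replace with a real API call (e.g. via the OpenAI,
    Anthropic, or Gemini Python client). Returns (response_text, tokens_used);
    here it emits a canned JSON reply so the harness runs end to end."""
    canned = json.dumps([{"name": "Ada", "email": "ada@example.com",
                          "age": 36, "country": "UK"}])
    return canned, 42

def attribute_accuracy(text: str, style: str) -> float:
    """Share of expected attributes the response reproduces, per prompt style."""
    if style == "json":
        try:
            records = json.loads(text)
        except json.JSONDecodeError:
            return 0.0
        records = records if isinstance(records, list) else [records]
        if not records:
            return 0.0
        hits = sum(len(EXPECTED_ATTRIBUTES & set(r)) for r in records)
        return hits / (len(EXPECTED_ATTRIBUTES) * len(records))
    # Prefix style: count expected "field=" markers in the raw text.
    found = {a for a in EXPECTED_ATTRIBUTES if f"{a}=" in text}
    return len(found) / len(EXPECTED_ATTRIBUTES)

results = {}
for style, prompt in PROMPT_STYLES.items():
    start = time.perf_counter()
    text, tokens = call_model(prompt)
    elapsed = time.perf_counter() - start
    results[style] = {
        "accuracy": round(attribute_accuracy(text, style), 3),
        "tokens": tokens,
        "seconds": round(elapsed, 4),
    }
print(json.dumps(results, indent=2))
```

Running the same harness against each provider's API would yield the per-style accuracy, token, and latency figures that the study aggregates into its Technique vs. Accuracy, Token Cost, and Time graphs.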
