An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics

Abstract

This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models—ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V4.2, and Grok 4—using three structured prompt frameworks: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, the Electromagnetic Spectrum, and analyzed with four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility and complexity, with DeepSeek producing the most readable lesson plan (FKGL ≈ 8.6) and Claude generating the densest language (FKGL ≈ 21). However, Claude’s elevated grade level primarily reflected its tabular formatting and condensed information density rather than excessive lexical difficulty. Prompt framework structure most strongly affected factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index (≈ 2.0) and the highest incidental alignment with NGSS curriculum standards (≈ 0.082). Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom’s taxonomy, with few higher-order verbs appearing in the extracted objectives. Overall, the findings suggest that readability is governed primarily by model design, while instructional reliability and curricular alignment depend more on the prompt framework.

The most effective configuration identified in the results combined a readability-optimized model with the RACE framework and an explicit checklist of physics concepts and higher-order objectives. This research presents the first multi-model, multi-framework evaluation of AI-generated lesson plans, using empirically validated pedagogical metrics drawn from previous studies.
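For reference, the FKGL scores cited above come from the standard Flesch–Kincaid Grade Level formula. A minimal sketch of that computation (the syllable counter here is a crude vowel-group heuristic, not the study's actual implementation):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(fkgl("The cat sat on the mat."), 2))
```

A score near 8.6 corresponds roughly to eighth-to-ninth-grade reading level, while a score near 21 indicates graduate-level density.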
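The cognitive-demand analysis described above rests on mapping objective verbs to Bloom's taxonomy tiers. A minimal sketch of such verb-based tagging, assuming a small hand-curated verb lexicon (the study's actual lexicon is not specified here):

```python
# Hypothetical verb lexicon; a real analysis would use a validated list.
BLOOM_VERBS = {
    "create":     {"design", "construct", "formulate"},
    "evaluate":   {"justify", "critique", "assess"},
    "analyze":    {"compare", "differentiate", "examine"},
    "apply":      {"calculate", "solve", "demonstrate"},
    "understand": {"explain", "describe", "summarize", "classify"},
    "remember":   {"define", "list", "identify", "recall", "state"},
}

def bloom_tier(objective: str) -> str:
    """Return the highest Bloom tier whose verb appears in the objective."""
    words = {w.strip(".,;:").lower() for w in objective.split()}
    for tier, verbs in BLOOM_VERBS.items():  # ordered highest to lowest
        if verbs & words:
            return tier
    return "unclassified"

print(bloom_tier("Students will define the electromagnetic spectrum."))
```

Objectives clustering in the "remember" and "understand" tiers, as reported here, would indicate predominantly lower-order cognitive demand.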
