An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics

Abstract

This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models—ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V4.2, and Grok 4—using three structured prompt frameworks: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, the Electromagnetic Spectrum, and analyzed with four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility and complexity, with DeepSeek producing the most readable lesson plan (FKGL ≈ 8.6) and Claude generating the densest language (FKGL ≈ 21). However, Claude’s elevated grade level primarily reflected its tabular formatting and condensed information density rather than excessive lexical difficulty. Prompt framework structure most strongly affected factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index (≈ 2.0) and the highest incidental alignment with NGSS curriculum standards (≈ 0.082). Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom’s taxonomy, with few higher-order verbs appearing in the extracted objectives. Overall, the findings suggest that readability is governed primarily by model design, while instructional reliability and curricular alignment depend more on the prompt framework.

The most effective configuration identified in the results combined a readability-optimized model with the RACE framework and an explicit checklist of physics concepts and higher-order objectives. This research presents the first multi-model, multi-framework evaluation of AI-generated lesson plans, using empirically validated pedagogical metrics drawn from previous studies.
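For reference, the FKGL scores cited above come from the standard Flesch–Kincaid Grade Level formula. A minimal sketch of that computation (the syllable counter here is a crude vowel-group heuristic, not the study's actual implementation):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(round(fkgl("The cat sat on the mat."), 2))
```

A score near 8.6 corresponds roughly to eighth-to-ninth-grade reading level, while a score near 21 indicates graduate-level density.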
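The cognitive-demand analysis described above rests on mapping objective verbs to Bloom's taxonomy tiers. A minimal sketch of such verb-based tagging, assuming a small hand-curated verb lexicon (the study's actual lexicon is not specified here):

```python
# Hypothetical verb lexicon; a real analysis would use a validated list.
BLOOM_VERBS = {
    "create":     {"design", "construct", "formulate"},
    "evaluate":   {"justify", "critique", "assess"},
    "analyze":    {"compare", "differentiate", "examine"},
    "apply":      {"calculate", "solve", "demonstrate"},
    "understand": {"explain", "describe", "summarize", "classify"},
    "remember":   {"define", "list", "identify", "recall", "state"},
}

def bloom_tier(objective: str) -> str:
    """Return the highest Bloom tier whose verb appears in the objective."""
    words = {w.strip(".,;:").lower() for w in objective.split()}
    for tier, verbs in BLOOM_VERBS.items():  # ordered highest to lowest
        if verbs & words:
            return tier
    return "unclassified"

print(bloom_tier("Students will define the electromagnetic spectrum."))
```

Objectives clustering in the "remember" and "understand" tiers, as reported here, would indicate predominantly lower-order cognitive demand.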
