Evidence of Impact and Interpretational Limits of Generative AI in STEM education - A Systematic Review and Meta-Analysis on Cognitive Learning Outcomes

Abstract

This systematic review and meta-analysis examines the impact of generative artificial intelligence (GAI) on cognitive learning outcomes in STEM education. Prior research is growing but remains fragmented, often focusing on usability or single tools like ChatGPT rather than domain-specific cognitive effects. We therefore address these gaps by examining (1) the extent to which interactions with GAI enhance learning effectiveness and possible moderators, (2) what challenges learners face when interacting with GAI systems, and (3) which interventions support successful learner-GAI interaction. We meta-analyzed externally assessed cognitive outcomes (RQ1) and, when quantitative pooling was not feasible, narratively synthesized reported learner challenges and supportive instructional interventions (RQ2-RQ3). A systematic search (ERIC, PsycINFO, Web of Science) and citation tracking yielded 52 eligible studies (N = 4906), of which 33 (N = 3153) met meta-analytic criteria. Overall, learner-GAI interaction yielded a moderate-to-large effect on cognitive learning outcomes (Hedges' g = 0.739, 95% CI [0.325, 1.15], p(BH) = 0.007), though heterogeneity was substantial (I² = 96.6%). Funnel plot asymmetry suggested potential publication bias; however, a fail-safe test indicated that the effect remained robust. Unlike prior ChatGPT-focused meta-analyses, the moderators tested in this work were non-significant after post hoc Benjamini-Hochberg correction. A meta-regression indicated that learning outcome type and intervention duration explain about 22% of the remaining variance, with learning outcome (knowledge vs. skills) significantly predicting effect sizes. A coincidence analysis suggested that no combination of learning outcome type and intervention duration fulfills a sufficient condition for large effect sizes (g > 0.6).
Additionally, knowledge as a learning outcome type was found to be an almost, but not strictly, necessary condition for large effects when learning with generative AI. Certainty of evidence was rated low due to heterogeneity and reporting limitations. Evidence for RQ2-RQ3 was limited and inconsistently reported; hence, these findings are presented as transparent, caveated qualitative insights rather than generalizable effect estimates. Overall, GAI appears promising for cognitive learning in STEM, particularly for knowledge acquisition and interventions lasting longer than 4 weeks. However, substantial unexplained heterogeneity and systematic underreporting of learner-level variables (AI literacy, metacognitive skills) and process-level mechanisms (task delegation, prompt quality, verification behaviors) indicate that the field has yet to measure the factors most likely to drive effectiveness. We propose six testable hypotheses and an integrative theoretical framework to guide future research toward understanding how, for whom, and under what conditions GAI supports STEM learning.
