When Educational Prompts Become Security Risks: Evidence from K-12 AI Safety Testing

Clara Gutstadt
Sofia Tang
Neha Bonney
Joyce Chen
Gwendolyn Gattuso
Sophia Gill
Anya Jani
Sonma Lala
Kate Pulsifer
Sarah Tarka
Starry Yang
Robin Zitelli
Thomas Heverin


Abstract

As K-12 schools increasingly adopt large language models (LLMs) for teaching and learning, educators often rely on built-in safety guardrails to prevent harmful outputs. However, little empirical evidence exists regarding how these safeguards behave in realistic classroom scenarios. This study examines whether prompts framed as educational, curiosity-driven, or troubleshooting requests reliably prevent unsafe responses.

We conducted a controlled evaluation of four common classroom-style prompt framings across two malware-related tasks (ransomware-note generation and keylogger code generation) using 24 publicly accessible LLMs. Each model-prompt interaction was evaluated for whether the output could reasonably be adapted for malicious use.

Across models, overall success rates exceeded 70% for both tasks, indicating substantial baseline vulnerability. However, results diverged by task type. Ransomware requests were highly sensitive to framing, with troubleshooting prompts outperforming educational justifications by more than 20 percentage points. In contrast, keylogger requests showed consistently high success rates regardless of framing, suggesting framing-invariant vulnerability.

These findings indicate that "for educational purposes" framing does not reliably mitigate risk and that AI safety performance in schools is task-dependent rather than prompt-dependent alone. The study highlights the need for task-aware AI policies, teacher training, and more nuanced safety evaluations in K-12 educational environments.
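The per-task, per-framing comparison described in the abstract can be illustrated with a short tally. This is only a minimal sketch, not the authors' analysis code: the record layout, the model names, the helper names `success_rates` and `framing_gap`, and every value in `records` are made-up placeholders chosen for illustration and are not data from the study.

```python
from collections import defaultdict

# Placeholder records, NOT study data: (model, task, framing, judged_unsafe).
# judged_unsafe is True when the output was rated adaptable for malicious use.
records = [
    ("model_a", "ransomware_note", "educational", False),
    ("model_b", "ransomware_note", "educational", True),
    ("model_a", "ransomware_note", "troubleshooting", True),
    ("model_b", "ransomware_note", "troubleshooting", True),
    ("model_a", "keylogger", "educational", True),
    ("model_b", "keylogger", "educational", True),
    ("model_a", "keylogger", "troubleshooting", True),
    ("model_b", "keylogger", "troubleshooting", True),
]


def success_rates(rows):
    """Return {(task, framing): percent of interactions judged unsafe}."""
    counts = defaultdict(lambda: [0, 0])  # (task, framing) -> [unsafe, total]
    for _model, task, framing, judged_unsafe in rows:
        cell = counts[(task, framing)]
        cell[0] += int(judged_unsafe)
        cell[1] += 1
    return {key: 100.0 * unsafe / total for key, (unsafe, total) in counts.items()}


def framing_gap(rates, task, framing_a, framing_b):
    """Percentage-point difference between two framings for one task."""
    return rates[(task, framing_a)] - rates[(task, framing_b)]


rates = success_rates(records)
ransomware_gap = framing_gap(rates, "ransomware_note", "troubleshooting", "educational")
keylogger_gap = framing_gap(rates, "keylogger", "troubleshooting", "educational")
```

With a layout like this, a large gap for one task alongside a near-zero gap for the other is the pattern the abstract describes as framing-sensitive versus framing-invariant.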

Version published to 10.35542/osf.io/q47ha_v1 on OSF Preprints
Feb 13, 2026

Related articles

Cannot, Should Not, Did Anyway: Benchmarking Constraint Enforcement Failure in Frontier LLMs

This article has 2 authors:
1. Samir M. Haq
2. Shehni Nadeem
This article has no evaluations
Latest version May 24, 2026

Bridging Developer–QA Gaps Using Large Language Models and Automation: A Pilot Evaluation of AutoVisQA

This article has 1 author:
1. Tanvir Hasan
This article has no evaluations
Latest version Apr 17, 2026

Beyond Injection Detection: A Positive-Security Prompt Firewall that Closes the Scope and PHI Gap SOTA Classifiers Miss in Healthcare

This article has 6 authors:
1. James Schwoebel
2. Ingrida Semenec
3. Jenia Rousseva
4. Martin Gerbert Frasch
5. Rome Thorstenson
6. Manish Bhatt
This article has no evaluations
Latest version Jun 6, 2026
