Reducing MISRA violations in LLM-generated code by 83%: An empirical study with static analysis verification
Abstract
Large Language Models (LLMs) are increasingly used for C++ code generation, yet their ability to satisfy Motor Industry Software Reliability Association (MISRA) C++:2023 guidelines at scale remains unclear. We conduct a controlled before–after, repeated-measures study on 26 C++ tasks, evaluating four models with 20 runs per condition. Compiled outputs are checked with a complete MISRA C++:2023 ruleset. Verbose rule texts are distilled into compact, actionable Top-k instruction packs (k=3, 5, 10) targeting each model's most frequent violations. Primary outcomes are violations per thousand lines of code (KLOC), compile rate, and pass rate. At baseline, models cluster at 23–29 violations/KLOC, dominated by an advisory rule discouraging standard integer type names. Adding Top-k instructions reduces violations by 44–83% across models (paired permutation tests, all p < 0.01); GPT-5 and o3 reach 3.9–4.5 violations/KLOC. Functional impacts are small overall; two conditions show significant pass-rate declines (GPT-4.1/Top-3, o3/Top-10). Improvements spill over to non-targeted rules. Compact, model-aware MISRA prompts therefore offer a practical path to safer C++ generation with limited functional cost when scoped appropriately. However, full verification still requires dedicated compliance tooling to detect residual issues, quantify results for certification, and produce auditable evidence for regulators. Practitioners should adopt a step-up strategy: start with Top-3 or Top-5 rules, monitor compile and pass rates, and expand only when results are stable. Study artifacts are released to enable replication and reuse.