Comparative Accuracy of Large Language Models for CPT Coding Assignments from Surgical Procedure Notes

Abstract

Background: Medical procedure coding is time-intensive and error-prone, with direct implications for reimbursement accuracy and operational efficiency. Large Language Models (LLMs) show promise for automating Current Procedural Terminology (CPT) code assignment, yet their accuracy on surgical procedure notes relative to physician-defined benchmarks remains understudied.

Objective: To evaluate and compare the CPT code assignment performance of three widely used reasoning-capable LLMs (Anthropic Claude Opus 4.5, OpenAI GPT-5.2, and Google Gemini 3 Pro) against a surgeon-labeled benchmark for orthopedic procedure notes.

Methods: Thirty-three publicly available, de-identified orthopedic procedure notes were obtained from MTSamples and Medical Transcription Sample Reports. Two surgeons, blinded to AI outputs, independently assigned benchmark CPT codes to notes within their specialty scope (28/33 notes labeled). Three frontier-class LLMs (Claude Opus 4.5, GPT-5.2, and Gemini 3 Pro) were selected based on LMArena performance and configured with extended reasoning at maximum settings. Each model was queried three times per note using identical prompts (n=297 total queries). A code was considered "predicted" if it appeared in at least 2 of 3 runs. Predicted codes were validated against the 2025 CMS HCPCS/CPT database. Performance metrics included precision, recall, F1 score, hallucination rate, invalid code rate, and consistency rate.

Results: Of the 33 orthopedic procedure notes evaluated (28 with valid benchmark labels), Claude Opus 4.5 achieved the highest accuracy (F1: 65.9%; precision: 66.7%; recall: 65.2%), followed by Gemini 3 Pro (F1: 62.1%) and GPT-5.2 (F1: 56.8%). Consistency did not correlate with accuracy: Gemini demonstrated the highest run-to-run consistency (72.7% identical outputs across runs) despite lower benchmark alignment, while Claude showed lower consistency (63.6%) yet superior accuracy. No model produced hallucinated or invalidly formatted codes (0% hallucination rate; 0% invalid rate). Performance varied substantially with procedural complexity: simple single-code procedures achieved near-perfect consistency across models, while complex multi-component procedures were more likely to show F1 scores below 40% and greater inter-run variance.

Conclusion: Current frontier LLMs demonstrate moderate accuracy in CPT code assignment for orthopedic procedures but are not yet suitable for autonomous clinical use. These models may offer value as first-pass tools within human-in-the-loop workflows, particularly for straightforward procedures. Future research should evaluate prompt optimization, modifier assignment, and prospective human-AI collaborative coding in real billing environments.
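As a minimal illustration of the aggregation and scoring described in the Methods, the Python sketch below applies the 2-of-3 majority-vote rule to per-run code lists and computes set-based precision, recall, and F1 against a benchmark code set. This is not the authors' actual pipeline; the function names, variable names, and sample note data are hypothetical, chosen only to make the metric definitions concrete.

```python
from collections import Counter

def majority_vote(runs, threshold=2):
    """Keep codes appearing in at least `threshold` of the runs (the 2-of-3 rule)."""
    counts = Counter(code for run in runs for code in set(run))
    return {code for code, n in counts.items() if n >= threshold}

def precision_recall_f1(predicted, benchmark):
    """Set-based precision, recall, and F1 against surgeon-assigned benchmark codes."""
    tp = len(predicted & benchmark)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(benchmark) if benchmark else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical example: three runs of one model on a single procedure note.
runs = [
    ["29881", "29877"],  # run 1
    ["29881"],           # run 2
    ["29881", "29877"],  # run 3
]
benchmark = {"29881", "29877"}  # illustrative surgeon-assigned codes

predicted = majority_vote(runs)  # {'29881', '29877'}: both codes appear in >= 2 runs
p, r, f1 = precision_recall_f1(predicted, benchmark)
print(f"predicted={sorted(predicted)} precision={p:.1%} recall={r:.1%} F1={f1:.1%}")
```

Under this scheme, per-note F1 values would then be averaged (or micro-averaged over pooled codes) to yield the model-level figures reported in the Results; the abstract does not specify which averaging the study used.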
