Large Language Models for Zero-Shot Procedure Extraction in Orthopedic Surgery: A Comparative Evaluation

Ashton Williamson
Nazgol Tavabi
Nishita Kalepalli
Ophelie Lavoie-Gagne
Andre Weiss
Shefali R. Bijwadia
Andrew Sibley
Benjamin Owens
Rafael A. García Andújar
Harsev Singh
Alexandra Santos
Alex Kim
Joseph Murray
Ariana Goli
Leili Sarmadi
Mahad M Hassan
Ata M. Kiapour

Abstract

Background

Operative notes in electronic health records contain critical information for understanding surgical care, yet manual coding is time-consuming, costly, and inconsistent. Large language models (LLMs) promise to transform this process by automatically extracting detailed procedure information — a capability with significant implications for scaling clinical registries and advancing surgical research.

Methods

We conducted a large-scale evaluation of state-of-the-art LLMs for zero-shot structured information extraction from orthopedic clinical notes. Fourteen open-source and proprietary models were tested on 800 real operative notes, annotated by both an orthopedic surgeon and an administrator using a curated list of 74 procedure classes. We compared model outputs to human annotations, assessing accuracy and exploring the effects of model scale, reasoning capabilities, and prompt design.

Results

Across models, LLMs consistently outperformed administrator-assigned labels, achieving macro-F1 scores above 0.6 and improving over administrative coding by up to 10 points. Larger models and reasoning capabilities further boosted performance, though gains plateaued beyond 30 billion parameters. Performance varied by procedure frequency, revealing clear strengths and persistent challenges for rare or complex cases.
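As an illustration of the metric reported above, the following minimal sketch computes macro-F1 for a set of predicted procedure labels against reference labels. The abstract does not state how the score was computed, so several things here are assumptions: that each note can carry more than one procedure class, that the surgeon's annotations serve as the reference, and that a class with no reference and no predicted labels scores 0. The function name `macro_f1`, the class names, and the toy data are all hypothetical.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], classes: List[str]) -> float:
    """Unweighted mean of per-class F1 across the given classes."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have one entry per note")
    scores = []
    for c in classes:
        # Count per-class agreement and disagreement across all notes.
        tp = sum(1 for g, p in zip(gold, pred) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, pred) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, pred) if c in g and c not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class never labeled or predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data invented for illustration; not taken from the study.
classes = ["ACL reconstruction", "meniscus repair", "rotator cuff repair"]
surgeon = [{"ACL reconstruction", "meniscus repair"}, {"rotator cuff repair"}, {"meniscus repair"}]
model = [{"ACL reconstruction"}, {"rotator cuff repair"}, {"meniscus repair", "ACL reconstruction"}]
admin = [{"ACL reconstruction"}, set(), {"meniscus repair"}]

model_score = macro_f1(surgeon, model, classes)
admin_score = macro_f1(surgeon, admin, classes)
print(f"model: {model_score:.3f}, administrator: {admin_score:.3f}")
```

Because every class contributes equally to the average regardless of how often it occurs, a rare procedure that is missed entirely pulls the score down as much as a common one, which is one reason macro-F1 is sensitive to performance on infrequent classes.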

Conclusion

Modern LLMs can already outperform routine administrative coding in extracting detailed surgical procedure data, pointing to a future where registry curation could be faster, cheaper, and more consistent. Yet full alignment with surgical experts remains an open challenge, especially for rare procedures, which underscores the need for domain adaptation and thoughtful deployment. Our findings illustrate how general-purpose LLMs can advance automated clinical data curation and inform the next generation of surgical informatics.

Version published to 10.1101/2025.08.19.25333995 on medRxiv
Aug 24, 2025

Related articles

Automated Prediction of Radiological Protocols Using Retrieval Augmented Generation

This article has 10 authors:
1. Conrad T. Testagrose
2. Panagiotis Korfiatis
3. Timothy L. Kline
4. Justin D. Benfield
5. Cole J. Cook
6. Peggy S. Merkel
7. Mutlu Demirer
8. Richard D. White
9. Candice W. Bolan
10. Barbaros S. Erdal
This article has no evaluations. Latest version: Sep 17, 2025

AI-Supported Extraction of Functional Tissue Unit Properties for Human Reference Atlas Construction

This article has 2 authors:
1. Yongxin Kong
2. Katy Börner
This article has no evaluations. Latest version: Sep 13, 2025

Can Large Language Models Reliably Interpret Radiology Reports? A Systematic Evaluation for Tumor Progression Classification

This article has 8 authors:
1. Valentin POHYER
2. Constance de Margerie-Mellon
3. Laetitia PERRONNE
4. Loïc DURON
5. Constance THIBAULT
6. Stéphane Oudard
7. Laure FOURNIER
8. Bastien Rance
This article has no evaluations. Latest version: Sep 23, 2025