A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

Abstract

Recent Artificial Intelligence (AI) models have matched or exceeded human experts on several benchmarks of biomedical task performance, but multimodal benchmarks involving surgery (specifically, those requiring visual recognition beyond text question-answering) are often missing from prominent medical benchmark suites. Since surgery requires coordinating disparate tasks, including multimodal data integration, human interaction, and physical effects, generally capable AI models could be particularly attractive as collaborative tools if their performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time leads only to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot simply be “scaled away” with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

Results Summary

We present findings from six experiments.

(1) We evaluate zero-shot surgical tool detection performance across 20 open-weight Vision Language Models (VLMs) released from 2023 to 2026 on SDSC-EEA, a video dataset of endoscopic endonasal approach (EEA) neurosurgical procedures. Despite dramatic increases in model scale and benchmark scores, only one model marginally exceeds the 13.41% majority class baseline on the validation set.

(2) We fine-tune Gemma 3 27B with LoRA adapters to generate structured JSON predictions. The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.

(3) We replace off-the-shelf JSON generation with a specialized classification head. This approach achieves 51.08% exact match accuracy.

(4) To assess the potential of increasing computational resources, we gradually increase the effective number of trainable parameters (by increasing LoRA rank) by nearly three orders of magnitude. While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.

(5) We compare zero-shot and fine-tuned VLM performance against YOLOv12-m, a specialized 26M-parameter object detection model. YOLOv12-m achieves 54.73% exact match accuracy, outperforming all VLM-based methods while using roughly 1,000× fewer parameters.

(6) We demonstrate that these findings generalize to three independent public datasets (CholecT50, PitVis-2023, and SurgVU), with additional comparisons on five proprietary frontier VLMs. On CholecT50, a dataset of laparoscopic cholecystectomy procedures, the fine-tuned open-weight model and YOLOv12-m outperform all zero-shot VLM methods, including those using proprietary frontier VLMs. On PitVis-2023, a public endoscopic pituitary neurosurgery benchmark with 18 instrument classes, the fine-tuned open-weight model again leads (84.77% exact match accuracy), followed by YOLOv12-m (82.78%); the best closed-weight frontier model, Gemini 3.1 Pro Preview, reaches 57.65%. On SurgVU, a public benchmark of robotic-assisted surgery training sessions on porcine tissue with 17 released instrument classes, zero-shot Gemma 3 27B achieves only 2.90% exact match accuracy, well below the 16.94% majority class baseline; only two of the five frontier closed-weight models clearly exceed that baseline (Claude Sonnet 4.6 at 23.05%, Gemini 3.1 Pro Preview at 22.46%), while the remaining three sit at or below it. LoRA fine-tuning of Gemma 3 27B reaches 50.61% and YOLOv12-m reaches 51.75%, both more than 27 percentage points above every frontier model. As on SDSC-EEA, the train-validation gap on CholecT50, PitVis-2023, and SurgVU widens with LoRA rank, confirming the same pattern across four surgical domains.
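
The two quantities that recur throughout this summary, exact match accuracy and the majority class baseline, can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the summary: each frame is labeled with the set of tools visible in it, a prediction counts as correct only if the predicted set equals the labeled set, and the baseline always predicts the most frequent label set in the evaluated split. The function names and the tool names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import FrozenSet, Sequence

# Assumption: one frame's label is the set of tool names visible in it.
ToolSet = FrozenSet[str]


def exact_match_accuracy(predictions: Sequence[ToolSet],
                         targets: Sequence[ToolSet]) -> float:
    """Fraction of frames whose predicted tool set equals the labeled set."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one frame is required")
    hits = sum(pred == target for pred, target in zip(predictions, targets))
    return hits / len(targets)


def majority_class_baseline(targets: Sequence[ToolSet]) -> float:
    """Exact match accuracy of always predicting the most frequent label set.

    Assumption: the most frequent set is taken from the same split that is
    being scored; the paper may define its baseline differently.
    """
    if not targets:
        raise ValueError("at least one frame is required")
    most_common_set, _count = Counter(targets).most_common(1)[0]
    return exact_match_accuracy([most_common_set] * len(targets), targets)


if __name__ == "__main__":
    # Hypothetical toy labels; the tool names are illustrative only.
    labels = [
        frozenset({"suction"}),
        frozenset({"suction"}),
        frozenset({"suction", "drill"}),
        frozenset(),
        frozenset({"curette"}),
    ]
    model_output = [
        frozenset({"suction"}),
        frozenset({"drill"}),
        frozenset({"suction", "drill"}),
        frozenset(),
        frozenset({"suction"}),
    ]
    print("exact match:", exact_match_accuracy(model_output, labels))
    print("majority baseline:", majority_class_baseline(labels))
```

Under this definition, a frame with one missed or one spurious tool scores zero. A model can therefore recognize many individual tools correctly and still fall below a baseline that always outputs the single most common tool combination.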
