Identifying the Presence of Disc Herniations in Lumbar Spine MRI Using Gemini 2.5 Pro
Abstract
Purpose
Lumbar disc herniation is associated with substantial morbidity, including low back pain, radicular leg pain (sciatica), sensory disturbances, and possible motor deficits. Magnetic resonance imaging (MRI) is valuable for confirming the clinical diagnosis in patients with severe symptoms and when surgical or interventional management is being considered. Recent advances in artificial intelligence (AI) offer promise for automating image-based diagnosis to enhance accuracy and reduce radiologist workload. This study evaluates the performance of Gemini 2.5 Pro, a large multimodal model, in identifying herniations on lumbar spine MRI.

Methods
We hypothesized that, in a zero-shot setting, Gemini 2.5 Pro would show high sensitivity but limited specificity, and that T1 + T2 input would outperform T1 alone. We used SPIDER, a public multi-center dataset of 447 sagittal T1- and T2-weighted MRI series from 218 patients with low back pain, with expert labels. Three trials were conducted with gemini-2.5-pro-preview-06-05: Trial 1 analyzed 20 paired T1/T2 mid-sagittal slices; Trial 2 included 248 T1 + T2 series; Trial 3 included 196 T1-only series. A standardized prompt forced a binary classification (herniation present vs absent). Performance was assessed by sensitivity, specificity, accuracy, and F1.

Results
Trial 1 (N = 20, paired T1/T2 slices): sensitivity 0.714, specificity 0.167, accuracy 0.550, F1 0.182. Trial 2 (N = 248, T1 + T2): sensitivity 0.768, specificity 0.313, accuracy 0.645, F1 0.323. Trial 3 (N = 196, T1-only): sensitivity 0.786, specificity 0.161, accuracy 0.607, F1 0.190. Sensitivity remained high across trials, while specificity and F1 were consistently low, reflecting frequent false positives.

Conclusion
Gemini 2.5 Pro shows strong sensitivity for detecting lumbar disc herniations on MRI, suggesting potential utility as a triage or screening aid, but low specificity and modest accuracy limit its current clinical applicability. Limitations include dataset heterogeneity and the binary framing of the task. Future work should explore model fine-tuning, inclusion of 3D volumetric data, and expanded training on negative cases to improve specificity.