Domain-specific models outperform large vision-language models on cytomorphology tasks
Abstract
Large vision-language models (LVLMs) show impressive capabilities in image understanding across domains. However, their suitability for high-risk medical diagnostics remains unclear. We systematically evaluate four state-of-the-art LVLMs and three domain-specific models on key cytomorphological benchmarks: peripheral blood cell classification, morphology assessment, bone marrow cell classification, and cervical smear malignancy detection. Performance is assessed under zero-shot, few-shot, and fine-tuned conditions. LVLMs underperform significantly: the best LVLM achieves a zero-shot F1 score of 0.057 ± 0.008 for malignancy detection—near the random-chance level of 0.039—and only 0.15 ± 0.01 in the few-shot setting. In contrast, domain-specific models reach accuracies of up to 0.83. Even after fine-tuning, a dedicated hematology model outperforms GPT-4o. While LVLMs offer explainability via text, we find their visual-language grounding unreliable, and the morphological features mentioned by the models often do not match the properties of the individual cells. Our findings suggest that LVLMs require substantial improvements before use in high-stakes diagnostic settings.
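As a rough illustration of how a chance-level reference score like the one quoted above can be estimated, the sketch below scores uniformly random predictions with macro-averaged F1. The class count and label distribution are placeholders chosen for the example, not the paper's benchmark.

```python
# Illustrative sketch (not from the paper): estimate a chance-level macro-F1
# baseline by scoring uniformly random predictions against a hypothetical,
# imbalanced ground-truth label distribution.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

n_classes = 15                                    # assumption, not the paper's class count
class_probs = rng.dirichlet(np.ones(n_classes))   # hypothetical imbalanced label mix
y_true = rng.choice(n_classes, size=10_000, p=class_probs)

# A "model" that guesses uniformly at random, independent of the image.
y_rand = rng.integers(0, n_classes, size=y_true.shape)

baseline = f1_score(y_true, y_rand, average="macro")
print(f"chance-level macro F1 ≈ {baseline:.3f}")
```

A model whose score sits close to this baseline is effectively ignoring the image content, which is the comparison the abstract draws for the zero-shot LVLM results.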
Key findings
- LVLMs perform poorly on cytomorphology tasks, often near chance level and far below domain-specific models.
- Even after fine-tuning, LVLMs lag behind domain-specific models.
- While LVLMs provide textual justifications, these often reflect generic descriptions rather than image-specific morphological features.