ALPaCA: Adapting Llama for Pathology Context Analysis to enable slide-level question answering
Abstract
Large Vision Language Models (LVLMs) have recently revolutionized computational pathology. LVLMs transform pathology image embeddings into tokens recognizable by large language models, facilitating zero-shot image classification, description generation, question answering, and interactive diagnostics. In clinical practice, pathological assessment often requires analysis of entire tissue slides, integrating information from multiple sub-regions and magnification levels. However, existing LVLM frameworks have been restricted to small, predefined regions of interest and cannot analyze pyramidal, gigapixel-scale whole-slide images (WSIs). In this work, we introduce ALPaCA (Adapting Llama for Pathology Context Analysis) and train the first general-purpose slide-level LVLM, leveraging 35,913 WSIs with curated descriptions alongside 341,051 question-and-answer pairs spanning diverse diagnoses, procedures, and tissue types. We develop LongFormer, a vision-text interactive slide-level adaptor, integrate it with a Gaussian mixture model-based prototyping adaptor, and train the combined model with Llama3.1. ALPaCA achieves superior performance in slide-level question answering, reaching over 90% accuracy on closed-ended tests and high accuracy on open-ended questions as judged by expert pathologists, highlighting its potential for slide-level computer-aided diagnosis systems. Additionally, we show that ALPaCA can be readily fine-tuned on in-depth, organ-specific, or disease-specific datasets, underscoring its adaptability and utility for specialized pathology tasks.
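As a rough illustration of the kind of component the abstract calls a "Gaussian mixture model-based prototyping adaptor" (this is a hedged sketch, not the authors' implementation, and the helper name is hypothetical): one plausible reading is that the thousands of patch embeddings extracted from a WSI are compressed into a small set of prototype vectors by fitting a GMM and keeping the component means, which a downstream adaptor could then project into the LLM's token space.

```python
# Minimal sketch, assuming patch-level embeddings are already extracted from a WSI.
# Not the paper's code; gmm_prototypes is a hypothetical helper for illustration only.
import numpy as np
from sklearn.mixture import GaussianMixture


def gmm_prototypes(patch_embeddings: np.ndarray, n_prototypes: int = 16) -> np.ndarray:
    """Fit a GMM over patch embeddings and return the component means as prototypes.

    patch_embeddings: array of shape (num_patches, embed_dim) with tile-level features.
    Returns an (n_prototypes, embed_dim) array of prototype vectors.
    """
    gmm = GaussianMixture(
        n_components=n_prototypes,
        covariance_type="diag",  # diagonal covariances keep fitting cheap at slide scale
        random_state=0,
    )
    gmm.fit(patch_embeddings)
    return gmm.means_


# Example: 5,000 patch embeddings of dimension 1024 reduced to 16 prototype vectors.
prototypes = gmm_prototypes(np.random.randn(5000, 1024).astype(np.float32))
print(prototypes.shape)  # (16, 1024)
```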