To RAG, or Not to RAG? A Comparative Evaluation of Retrieval-Augmented Generation for ICD Coding of German Tumor Diagnoses

Fatma Alickovic
Stefan Lenz
Arsenij Ustjanzew
Lakisha Ortiz Rosario
Georg Vollmar
Thomas Kindler
Torsten Panholzer

Abstract

Introduction

Coding tumor diagnoses from free-text clinical documentation currently requires substantial manual effort. Promising approaches for automating this process include large language models (LLMs), embedding models, and retrieval-augmented generation (RAG). While previous studies often focus on a single method, we directly compare these approaches on a real-world dataset of tumor diagnosis descriptions to assess their strengths and limitations.

Methods

We evaluated nine different embedding models using similarity search and embedding-based classification, as well as LLM-based coding, with and without RAG, on a real-world dataset of 2,024 unique German tumor diagnosis descriptions labeled with ICD-10 and ICD-O topography codes. The retrieval knowledge base was constructed exclusively from standardized Alpha-ID, ICD-10-GM, and ICD-O-3 classifications. Performance was assessed for exact (full-code) and partial (three-character) code prediction. For RAG, we evaluated base and fine-tuned versions of Llama 3.1 8B and Llama 3.3 70B.
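
The distinction between exact (full-code) and partial (three-character) matching can be illustrated with a minimal sketch. It assumes that a partial match means the first three characters of the predicted and reference codes agree; the function names and the toy codes below are hypothetical and are not taken from the study's data or implementation.

```python
def normalize(code: str) -> str:
    """Strip whitespace and upper-case a code so comparisons are consistent."""
    return code.strip().upper()


def three_char(code: str) -> str:
    """Reduce a code to its first three characters, e.g. 'C50.4' -> 'C50'.

    Assumption: this is what 'partial (three-character)' matching refers to.
    """
    return normalize(code)[:3]


def accuracy(preds, golds, partial=False):
    """Share of predictions that match the reference codes.

    partial=False compares full codes, partial=True compares only the
    three-character prefix.
    """
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and of equal length")
    key = three_char if partial else normalize
    hits = sum(key(p) == key(g) for p, g in zip(preds, golds))
    return hits / len(golds)


# Hypothetical toy example, not from the study's dataset.
golds = ["C50.4", "C34.1", "C18.7", "C61"]
preds = ["C50.4", "C34.9", "C20", "C61"]

exact_acc = accuracy(preds, golds)                  # 2 of 4 full codes agree
partial_acc = accuracy(preds, golds, partial=True)  # 3 of 4 prefixes agree
print(exact_acc, partial_acc)
```

In this toy case the prediction C34.9 for a reference of C34.1 counts as a partial match but not an exact match, which is why partial-match accuracy is never lower than exact-match accuracy under this definition.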

Results

Among the embedding models, Qwen3-Embedding-8B, the largest one evaluated, yielded the best results. It achieved 47.8% exact-match and 72.1% partial-match accuracy for ICD-10 coding with classification, and 42.7% exact-match and 73.5% partial-match accuracy for ICD-O coding with similarity search. The other embedding models, including medically specialized ones, showed varied but lower performance. RAG improved base LLM performance and outperformed embedding-based approaches on partial-match accuracy (80.6% for ICD-10 and 75.0% for ICD-O with Llama 3.3 70B), but not on exact-match accuracy.
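
As a rough illustration of coding by similarity search, the sketch below assigns the code of the knowledge-base entry whose embedding is closest to the query embedding under cosine similarity. The use of cosine similarity, the three-dimensional vectors, and the entries in `KNOWLEDGE_BASE` are assumptions made for illustration; in practice the vectors would come from an embedding model, and the study's actual pipeline may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def nearest_code(query_vec, knowledge_base):
    """Return the code of the entry most similar to the query vector."""
    best_code, _ = max(knowledge_base, key=lambda entry: cosine(query_vec, entry[1]))
    return best_code


# Hypothetical toy knowledge base: (code, embedding) pairs.
# Real embeddings would be produced by an embedding model.
KNOWLEDGE_BASE = [
    ("C50.9", [0.9, 0.1, 0.0]),
    ("C34.9", [0.1, 0.9, 0.1]),
    ("C61", [0.0, 0.2, 0.9]),
]

# Hypothetical embedding of a free-text diagnosis description.
query = [0.8, 0.2, 0.1]
predicted = nearest_code(query, KNOWLEDGE_BASE)
print(predicted)
```

A linear scan like this is only practical for small knowledge bases; larger collections would typically use a vector index, which is outside the scope of this sketch.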

Conclusion

A direct comparison with embedding-based approaches is essential to determine whether the additional effort of RAG is justified. The strong variation in performance also highlights the importance of model selection. Further advances in embedding-based methods, potentially supported by larger and more diverse training data, may offer a promising direction for future work.

Version published to 10.64898/2026.05.27.26353695 on medRxiv, Jun 3, 2026

Related articles

Augmenting Structured Diagnoses through Effective Use of Pre-trained Large Language Models on Clinical Notes

This article has 6 authors:
1. Hanieh Razzaghi
2. Nhat Nguyen
3. Mohan Pargi
4. Kaleigh Wieand
5. H. Timothy Bunnell
6. L. Charles Bailey
This article has no evaluations. Latest version: Jun 2, 2026

Automatic Classification of Medical Artificial Intelligence Articles by Their Level of Translational Maturity: An Interpretable Supervised Text-Classification Approach

This article has 2 authors:
1. Sandeep Reddy
2. Alix Héritier
This article has no evaluations. Latest version: Jul 13, 2026

Performance of Google NotebookLM for AI-assisted data extraction and consensus statement generation in a heterogenous systematic review on inflammatory bowel disease, obesity, and cardiometabolic comorbidities: A Methodological Report

This article has 11 authors:
1. Sami Samaan
2. Jalpa Devi
3. Matthew Vincent
4. Shannon Coombs
5. Priya Sehgal
6. Mouhand Mouhamed
7. Victoria Rai
8. Amanda M. Johnson
9. Andres J. Yarur
10. Edward L. Barnes
11. Parakkal Deepak
This article has no evaluations. Latest version: Jun 26, 2026
