To RAG, or Not to RAG? A Comparative Evaluation of Retrieval-Augmented Generation for ICD Coding of German Tumor Diagnoses

Abstract

Introduction

Coding tumor diagnoses from free-text clinical documentation currently requires substantial manual effort. Promising approaches for automating this process include large language models (LLMs), embedding models, and retrieval-augmented generation (RAG). While previous studies often focus on a single method, we directly compare these approaches on a real-world dataset of tumor diagnosis descriptions to assess their strengths and limitations.

Methods

We evaluated nine different embedding models using similarity search and embedding-based classification, as well as LLM-based coding, with and without RAG, on a real-world dataset of 2,024 unique German tumor diagnosis descriptions labeled with ICD-10 and ICD-O topography codes. The retrieval knowledge base was constructed exclusively from standardized Alpha-ID, ICD-10-GM, and ICD-O-3 classifications. Performance was assessed for exact (full-code) and partial (three-character) code prediction. For RAG, we evaluated base and fine-tuned versions of Llama 3.1 8B and Llama 3.3 70B.
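
The two evaluation criteria named above can be made concrete with a minimal sketch. It assumes that partial match means agreement on the first three characters of the code, as the abstract states; the helper names and the toy gold/predicted lists are hypothetical and are not taken from the study's data or code.

```python
from typing import Optional, Sequence


def normalize(code: str) -> str:
    """Trim whitespace and upper-case a code so 'c50.4 ' equals 'C50.4'."""
    return code.strip().upper()


def match_accuracy(
    gold: Sequence[str],
    predicted: Sequence[str],
    prefix_len: Optional[int] = None,
) -> float:
    """Fraction of predictions that agree with the gold codes.

    prefix_len=None compares full codes (exact match);
    prefix_len=3 compares only the first three characters (partial match).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    hits = 0
    for g, p in zip(gold, predicted):
        g, p = normalize(g), normalize(p)
        if prefix_len is not None:
            g, p = g[:prefix_len], p[:prefix_len]
        hits += g == p
    return hits / len(gold)


# Illustrative ICD-10 codes only, not study data.
gold = ["C50.4", "C34.1", "C18.7", "C61"]
pred = ["C50.9", "C34.1", "C20", "C61"]

exact = match_accuracy(gold, pred)                   # full-code agreement
partial = match_accuracy(gold, pred, prefix_len=3)   # three-character agreement
print(f"exact={exact:.2f} partial={partial:.2f}")
```

In this toy example the first prediction has the wrong subcategory but the right three-character category, so it counts toward partial-match but not exact-match accuracy.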

Results

Qwen3-Embedding-8B, the largest embedding model, yielded the best results among the embedding models. It achieved 47.8% exact-match and 72.1% partial-match accuracy for ICD-10 coding with classification, and 42.7% exact-match and 73.5% partial-match accuracy for ICD-O coding with similarity search. The other embedding models, including medically specialized ones, showed varied but lower performance. RAG improved base LLM performance and, with Llama 3.3 70B, outperformed the embedding-based approaches on partial-match accuracy (80.6% for ICD-10 and 75.0% for ICD-O), but not on exact-match accuracy.
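
As a rough illustration of the similarity-search setting, the sketch below assigns a diagnosis the code of its nearest knowledge-base entry under cosine similarity. This is an assumption about the general technique, not the authors' implementation: the three-dimensional vectors are hand-written stand-ins for real model embeddings, and the names `cosine`, `nearest_code`, and `knowledge_base` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_code(
    query_vec: Sequence[float],
    knowledge_base: List[Tuple[str, Sequence[float]]],
) -> str:
    """Return the code whose entry vector is most similar to the query."""
    best_code, _ = max(knowledge_base, key=lambda entry: cosine(query_vec, entry[1]))
    return best_code


# Toy knowledge base: (code, stand-in embedding of the catalogue text).
# In practice the vectors would come from an embedding model.
knowledge_base = [
    ("C50.9", [0.9, 0.1, 0.0]),
    ("C34.9", [0.1, 0.9, 0.1]),
    ("C61", [0.0, 0.2, 0.9]),
]

query = [0.8, 0.2, 0.1]  # stand-in embedding of a free-text diagnosis
predicted = nearest_code(query, knowledge_base)
print(predicted)
```

A RAG pipeline would typically pass the top-ranked entries to the LLM as context instead of returning the nearest code directly.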

Conclusion

A direct comparison with embedding-based approaches is essential to determine whether the additional effort of RAG is justified. The strong variation in performance also highlights the importance of model selection. Further advances in embedding-based methods, potentially supported by larger and more diverse training data, may offer a promising direction for future work.
