MIRACLE - Medical Information Retrieval using Clinical Language Embeddings for Retrieval Augmented Generation at the point of care

Kamyar Arzideh
Henning Schäfer
Ahmad Idrissi-Yaghi
Bahadır Eryılmaz
Mikel Bahn
Cynthia Sabrina Schmidt
Olivia Barbara Pollok
Eva Hartmann
Philipp Winnekens
Katarzyna Borys
Johannes Haubold
Felix Nensa
René Hosch

Abstract

Most sentence transformer models have been trained in English on publicly accessible datasets. When integrated into Retrieval Augmented Generation systems, these models are limited in their ability to retrieve relevant patient-related information. In this study, multiple embedding models were fine-tuned on approximately eleven million question and chunk pairs drawn from 400,000 documents spanning diverse medical categories. The questions and corresponding answers were generated by prompting a large language model. The fine-tuned model outperformed the state-of-the-art multilingual-e5-large model on real-world German and translated English evaluation datasets. Furthermore, models were trained on a pseudonymized dataset and made publicly available for other healthcare institutions to use.
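To make the retrieval step concrete, the following is a minimal sketch of how an embedding model is typically used inside a Retrieval Augmented Generation pipeline: a question vector is compared against chunk vectors and the most similar chunks are returned. It is not the authors' implementation. The three-dimensional vectors, the chunk names, and the `top_k` helper are hypothetical and chosen for illustration only; in practice the vectors would come from an embedding model, and cosine similarity is assumed here as the scoring function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Hypothetical toy embeddings; a real system would obtain these from an embedding model.
chunks = {
    "medication_plan": [0.9, 0.1, 0.0],
    "lab_results": [0.1, 0.9, 0.1],
    "discharge_summary": [0.5, 0.4, 0.3],
}
question = [0.8, 0.2, 0.1]

ranked = top_k(question, chunks, k=2)
print(ranked)
```

Fine-tuning on question and chunk pairs aims to move each question's vector closer to the vector of its matching chunk, so that a ranking like this one surfaces the relevant chunk first.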

Version published to 10.21203/rs.3.rs-5453999/v1 on Research Square
Dec 18, 2024

Related articles

MultiMed-ST Datasets for Machine Translation in Medical Applications

This article has 2 authors:
1. Giridhar Gowda
2. Suma R
This article has no evaluations. Latest version: Jan 9, 2026

Prompt-Orchestrated Large Language Models for Clinical Information Extraction

This article has 13 authors:
1. Livia Lilli
2. Andrea Rosati
3. Giovanni Paolo Tobia
4. Massimo Criscione
5. Federica Tomassini
6. Chiara Dachena
7. Alice Luraschi
8. Chiara Cantarini
9. Carolina De Maria
10. Luigi Congedo
11. Massimo Bernaschi
12. Stefano Patarnello
13. Anna Fagotti
This article has no evaluations. Latest version: Jan 16, 2026

Understanding the Impact of Dataset Characteristics on RAG-based Multi-hop QA Performance

This article has 3 authors:
1. Nimet Aksoy
2. Zekeriya Anıl Güven
3. Murat Osman Ünalır
This article has no evaluations. Latest version: Dec 12, 2025
