Federated Knowledge Retrieval Elevates Large Language Model Performance on Biomedical Benchmarks

Abstract

Background

Large language models (LLMs) have substantially advanced natural language processing in biomedical research; however, their reliance on implicit, statistical representations often produces factual inaccuracies or hallucinations, a serious concern in high-stakes biomedical contexts.

Results

To overcome these limitations, we developed BTE-RAG, a retrieval-augmented generation framework that integrates the reasoning capabilities of advanced language models with explicit mechanistic evidence retrieved from BioThings Explorer, an API federation of more than sixty authoritative biomedical knowledge sources. We systematically evaluated BTE-RAG against LLM-only baselines on three benchmark datasets that we derived from DrugMechDB, targeting gene-centric mechanisms (798 questions), metabolite effects (201 questions), and drug–biological process relationships (842 questions). On the gene-centric task, BTE-RAG increased accuracy from 51% to 75.8% for GPT-4o mini and from 69.8% to 78.6% for GPT-4o. On metabolite-focused questions, the proportion of responses with cosine similarity scores of at least 0.90 rose by 82% for GPT-4o mini and by 77% for GPT-4o. On the drug–biological process benchmark, overall accuracy was comparable between conditions, but retrieval improved response concordance, yielding a more than 10% increase in high-agreement answers (from 129 to 144) with GPT-4o.

Conclusion

Federated knowledge retrieval provides transparent accuracy gains for large language models, establishing BTE-RAG as a practical tool for mechanistic exploration and translational biomedical research.
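
To make the retrieval-then-generate pattern described in the Results concrete, the sketch below shows one way a single benchmark question could be grounded in BioThings Explorer (BTE) evidence before prompting a model. It is a minimal illustration rather than the authors' implementation: the TRAPI endpoint URL, the one-hop query-graph shape, and the prompt wording are assumptions, and error handling is omitted.

```python
# Minimal sketch of a retrieve-then-generate loop over BioThings Explorer (BTE).
# The endpoint URL, query-graph shape, and prompt template are illustrative
# assumptions, not the paper's implementation.
import requests
from openai import OpenAI

BTE_URL = "https://bte.transltr.io/v1/query"  # assumed TRAPI endpoint


def retrieve_edges(drug_curie: str, max_edges: int = 20) -> list[str]:
    """Ask BTE which genes are linked to a drug; return 'subject predicate object' strings."""
    query = {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [drug_curie], "categories": ["biolink:SmallMolecule"]},
                    "n1": {"categories": ["biolink:Gene"]},
                },
                "edges": {"e0": {"subject": "n0", "object": "n1"}},
            }
        }
    }
    resp = requests.post(BTE_URL, json=query, timeout=300)
    resp.raise_for_status()
    kg = resp.json()["message"]["knowledge_graph"]
    names = {nid: node.get("name", nid) for nid, node in kg["nodes"].items()}
    triples = [
        f"{names[e['subject']]} {e['predicate'].removeprefix('biolink:')} {names[e['object']]}"
        for e in kg["edges"].values()
    ]
    return triples[:max_edges]


def answer_with_context(question: str, drug_curie: str, model: str = "gpt-4o-mini") -> str:
    """Prepend retrieved triples to the question and ask the LLM (the RAG condition)."""
    context = "\n".join(retrieve_edges(drug_curie))
    prompt = (
        "Use only the evidence below to answer the question.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}"
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content


# Example (hypothetical question and identifier):
# answer_with_context("Which gene does imatinib inhibit to treat CML?", "CHEBI:45783")
```

In the LLM-only baseline, the same question is sent without the Evidence block; the benchmarks quantify the difference between the two conditions.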

Article activity feed

  1. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag007), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Sajib Acharjee Dip

    This paper introduces BTE-RAG, a system that combines large language models with biomedical knowledge from BioThings Explorer. Tested on three benchmarks built from DrugMechDB (genes, metabolites, and drug-process links), it shows clear accuracy gains compared to using LLMs alone.

    Strengths: The work demonstrates that retrieval improves both small and large models, suggesting cost-efficiency and scalability. The multi-scale QA datasets (gene, metabolite, drug) curated from DrugMechDB also provide a structured, reproducible evaluation.

    Weaknesses:

    1. The dual-route design is conceptually sound, but the baseline is too narrow. A stronger evaluation would compare against other RAG systems (PubMed-based retrieval, BiomedRAG, SPOKE-RAG) rather than against "LLM-only" alone.
    2. For the entity recognition step, using pre-annotated entities in the benchmarks artificially simplifies the problem. In real-world biomedical QA, entity recognition is itself a major challenge (e.g., ambiguous drug synonyms, rare disease names). In addition, the zero-shot extraction module is described but not evaluated. The paper should report precision/recall of entity recognition to show feasibility beyond curated inputs (a minimal scoring sketch is given after this list).
    3. No error analysis of BTE retrieval quality is provided. If BTE returns wrong or noisy triples, how often does this mislead the LLM? Adding an experiment to quantify this would strengthen the study.
    4. Although the authors used SOTA LLMs, restricting the evaluation to the OpenAI GPT-4o family is narrow. There is no comparison with open-source biomedical LLMs (e.g., BioGPT, Meditron, PubMedBERT-RAG); including such models would increase the generalizability of the findings.
    5. Reliance on a single source (DrugMechDB) makes the evaluation narrow. The authors should demonstrate performance on at least one independent dataset (e.g., BioASQ, PubMedQA, SPOKE-based tasks) to show broader utility.
    6. The cosine-similarity threshold of ≥0.90 is arbitrary; the authors should provide ROC/AUC or a threshold-sensitivity analysis.
    7. Benchmarks enforce exactly one correct gene, metabolite, or drug per question. Real mechanisms often involve multiple parallel or interacting entities. The single-answer design hides biological complexity and creates an artificial task.
    8. Ground truth relies on exact HGNC, CHEBI, or DrugBank IDs. Why are ambiguities (synonyms, deprecated IDs, overlapping terms) filtered out rather than addressed? This may bias the dataset toward easier, cleaner cases.
    9. The paper cites recent biomedical RAG systems such as BiomedRAG and GeneTuring but does not compare against them; BioRAG (2024) is also highly relevant. These works are natural baselines that retrieve from knowledge graphs, APIs, or literature, and including them in the comparison would better position BTE-RAG within the current state of the art and highlight its unique contributions.
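
    Regarding point 2 above, a lightweight way to report entity-recognition performance would be to score the zero-shot extractor's normalized identifiers against the benchmarks' curated annotations. The sketch below is illustrative only: the per-question format of gold and predicted CURIE sets is an assumption about how such an evaluation could be organized, not a description of the authors' pipeline.

```python
# Illustrative micro-averaged precision/recall/F1 for entity recognition.
# Assumption: each question has a set of gold entity CURIEs, and the zero-shot
# extractor emits a set of predicted CURIEs after normalization.

def entity_prf(gold_by_question: dict[str, set[str]],
               pred_by_question: dict[str, set[str]]) -> dict[str, float]:
    tp = fp = fn = 0
    for qid, gold in gold_by_question.items():
        pred = pred_by_question.get(qid, set())
        tp += len(gold & pred)   # correctly extracted identifiers
        fp += len(pred - gold)   # spurious extractions
        fn += len(gold - pred)   # missed gold entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: precision ≈ 0.67, recall = 1.0, F1 = 0.8
gold = {"q1": {"HGNC:1097"}, "q2": {"CHEBI:15365"}}
pred = {"q1": {"HGNC:1097"}, "q2": {"CHEBI:15365", "CHEBI:27732"}}
print(entity_prf(gold, pred))
```
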
  2. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag007), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Christopher Tabone

    Dear Authors,

    Thank you for the opportunity to review "Federated Knowledge Retrieval Elevates Large Language Model Performance on Biomedical Benchmarks." The paper tackles a timely and important problem: grounding large language models in mechanistic evidence to reduce unsupported claims. It does so with a thoughtful design that layers BTE-RAG over a federation of approximately 60 biomedical APIs and evaluates it on three complementary DrugMechDB-derived benchmarks (gene, metabolite, drug-to-process). The manuscript is clearly written, the technical contribution is meaningful, and the experimental results are promising.

    Recommendation: Major revision.

    Below are concrete, actionable changes that would bring the work in line with GigaScience's standards for FAIR availability, licensing, documentation, testing, and reproducibility. Many are straightforward, but together they matter for long-term reuse and auditability.

    1. Statistical rigor: paired inference, uncertainty, variance. The manuscript reports compelling descriptive gains. Because each benchmark item is answered under both conditions (LLM-only and BTE-RAG), the study is a paired design. In paired settings, descriptive plots and point estimates are not sufficient to establish that improvements exceed sampling noise or threshold tuning. Please add paired statistical evidence that quantifies: (i) whether the gains are reliable, (ii) how large they are in practical terms, and (iii) how stable they are under repeated runs or under a fully deterministic pipeline. (A minimal sketch of these paired analyses is given after this list.)

    Gene task (binary): Report McNemar's test on the existing 2×2 tables, along with 95 percent Wilson confidence intervals for each condition and a Newcombe confidence interval for the accuracy difference. Keep the flip counts in the text.

    Metabolite and drug-to-process tasks (similarity): Report paired bootstrap confidence intervals or Wilcoxon signed-rank tests on per-item similarity differences (BTE-RAG minus baseline). Include a nonparametric effect size such as Cliff's delta with its confidence interval.

    Threshold validation: Treat the ≥0.90 "high-fidelity" threshold as a choice that should be validated. Show sensitivity across nearby cutoffs such as 0.85, 0.90, and 0.95, and add a small blinded expert adjudication (about 50 to 100 items) to confirm that the high-cosine band corresponds to acceptable correctness.

    Variance or determinism: Either document end-to-end determinism (frozen retrieval caches, fixed ordering, pinned embeddings) or run at least three replicates and report mean and standard deviation.

    These additions convert the current descriptive story into paired inference with uncertainty and effect sizes and clarify robustness around thresholding and reproducibility.

    2. Benchmark scope and generalizability. All three evaluations are derived from DrugMechDB, which makes the study internally consistent but also couples the tasks to a single curation philosophy and evidence distribution. Please acknowledge this limitation explicitly in the Discussion and, ideally, add an external validation on at least one independent source to demonstrate generalizability. Options include CTD (drug-gene-process links), Reactome or GO (pathway and process grounding), DisGeNET (gene-disease associations), or a lightweight question answering set sourced outside DrugMechDB. Even a modest external set of about 100 to 200 items, evaluated with the same paired protocols and identifier-based scoring, would strengthen the claim. If full external validation is not feasible for this revision, please include robustness checks such as a date-based split, entity-family holdouts, and per-source ablations.

    3. Licensing, attribution, and persistent identifiers. The project is MIT-licensed and adapts components from BaranziniLab/KG_RAG (Apache-2.0) and SuLab/DrugMechDB (CC0-1.0). To meet license obligations and align with FAIR and the Joint Declaration of Data Citation Principles, please: (i) keep Apache-licensed code under Apache with the upstream LICENSE and NOTICE files, noting any modifications; (ii) include the CC0 dedication text for any DrugMechDB artifacts and note that CC0 provides no patent grant; (iii) archive with DOIs (GigaDB preferred?) the three benchmarks, the exact evaluation caches used in the paper, and a tagged software release of the repository; (iv) license datasets under CC0 or CC BY while keeping the code MIT; (v) add a short Data and Software Availability table listing artifact, DOI or URL, license, and version or date.

    4. Error analysis and degradation cases. Please add a brief failure analysis focused on where BTE-RAG reduces accuracy relative to LLM-only. At minimum, report the total number and percent of right-to-wrong flips per task and include a small set of representative cases. For each example, show the input, expected and predicted outputs, the top retrieved evidence with identifiers and timestamps, and a one-line diagnosis of the likely cause (for example normalization mismatch, retrieval coverage gap, ranking or filtering that hid relevant context, or long-context truncation). A short summary that groups the main causes into two or three buckets will make the results more interpretable and point to practical fixes.

    5. Methodological transparency: embedding and scoring models. Please add two or three sentences in Methods explaining why S-PubMedBERT-MS-MARCO is used for filtering retrieved context while a BioBERT-based model is used for semantic similarity scoring, and what advantages each provides over plausible alternatives. A brief rationale will strengthen methodological transparency.

    6. Reproducibility workflow and archived caches. Because BTE federates live APIs, results can drift as sources update. Please archive the exact retrieval caches used in evaluation with DOIs and, if at all possible, minimal provenance (query identifier, subject and object identifiers, predicate, source name and version or access date, any confidence score, and a retrieval timestamp); an illustrative cache-record sketch is given after this list.
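
    To illustrate the paired analyses requested in point 1 and the cutoff sweep from the threshold-validation paragraph, the sketch below shows how those statistics could be computed from per-item results. The input arrays (per-item correctness flags for the gene task, per-item cosine similarities for the similarity tasks) are hypothetical, and the code is a sketch built on standard NumPy/SciPy routines, not the authors' analysis pipeline.

```python
# Sketch of the paired statistics requested in point 1, plus the cutoff sweep from
# the threshold-validation paragraph. All inputs are hypothetical per-item arrays.
import numpy as np
from scipy.stats import binomtest, wilcoxon


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (per-condition accuracy)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def mcnemar_exact(baseline_correct: np.ndarray, rag_correct: np.ndarray) -> float:
    """Exact McNemar p-value from the discordant pairs of a paired binary outcome."""
    b = int(np.sum(baseline_correct & ~rag_correct))  # right -> wrong flips
    c = int(np.sum(~baseline_correct & rag_correct))  # wrong -> right flips
    return binomtest(b, n=b + c, p=0.5).pvalue


def paired_bootstrap_ci(deltas: np.ndarray, n_boot: int = 10_000, seed: int = 0):
    """Percentile bootstrap CI for the mean per-item difference (BTE-RAG minus baseline)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    return np.percentile(deltas[idx].mean(axis=1), [2.5, 97.5])


def threshold_sweep(scores: np.ndarray, cutoffs=(0.85, 0.90, 0.95)) -> dict:
    """Fraction of items at or above each candidate 'high-fidelity' cosine cutoff."""
    return {t: float(np.mean(scores >= t)) for t in cutoffs}


# Hypothetical usage, gene task (boolean correctness per item, both conditions):
#   print(mcnemar_exact(base, rag), wilson_ci(int(rag.sum()), rag.size))
# Hypothetical usage, similarity tasks (per-item cosine scores, both conditions):
#   d = rag_scores - base_scores
#   print(wilcoxon(d).pvalue, paired_bootstrap_ci(d), threshold_sweep(rag_scores))
```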
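
    For the archived caches in point 6, a single cached retrieval could be serialized roughly as shown below. The field names are only suggestions drawn from the provenance elements listed in that point; they do not describe a schema that exists in the paper.

```python
# Illustrative provenance record for one cached BTE retrieval, covering the fields
# suggested in point 6. Field names and values are assumptions, not an existing schema.
import json

cache_record = {
    "query_id": "gene-task-0042",              # benchmark question identifier
    "subject_id": "CHEBI:45783",               # e.g., imatinib
    "predicate": "biolink:affects",
    "object_id": "HGNC:76",                    # e.g., ABL1
    "source_name": "BioThings Explorer",
    "source_version_or_access_date": "2024-11-01",
    "confidence_score": 0.92,                  # if the upstream source provides one
    "retrieval_timestamp": "2024-11-01T14:32:05Z",
}

print(json.dumps(cache_record, indent=2))
```

    Archiving one such record per retrieved triple, alongside the query graphs, would let readers re-score the benchmarks without re-querying live APIs.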

    In summary, this is a promising and well-motivated study that could make a useful contribution once the statistical evidence, FAIR availability, and reproducibility pieces are tightened as outlined above. I recommend Major Revision and am happy to re-review a revised version.