Benchmarking MeSH-Augmented Embeddings for Biomedical Document Similarity
Abstract
Background: The extensive volume of biomedical scientific literature requires efficient methods for retrieving relevant documents based on semantic technologies and biomedical concepts. While embedding-based methods have shown improvements over traditional keyword-based methods, the integration of domain-specific terminologies such as Medical Subject Headings (MeSH) into these models remains underexplored.

Methods: This study compares three hybrid methods that integrate MeSH-based annotations with document embeddings, called "pre-annotation", "post-annotation" and "post-reduction". We benchmark these hybrid methods against the following traditional methods: TF-IDF, standard neural embeddings (Word2Vec, fastText, Doc2Vec) and publicly available transformer-based models (BioBERT, SciBERT, SPECTER), using cosine similarity and Word Mover's Distance (WMD) as evaluation metrics. The benchmark experiments are based on the RELISH corpus, a manually curated dataset of PubMed articles in which experts have labeled pairs of documents with regard to their relevance to each other, providing both a 2-class (relevant vs. non-relevant) and a 3-class (relevant, partially relevant, non-relevant) judgment.

Results: Transformer-based models, particularly BioBERT and SciBERT, align best with the expert judgments after fine-tuning. Among non-transformer methods, Doc2Vec and the MeSH-based hybrid methods also perform well, demonstrating the benefit of combining structured biomedical vocabularies with embedding methods. Our experiments deliver extensive results showing that a baseline performance of 76-78% precision at position 5 is achieved by almost all approaches, that MeSH concepts yield improvements of 2-4%, and that performance of up to 90% is reached only by the fine-tuned large-scale public models.
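The evaluation protocol described above — ranking candidate documents by embedding similarity and scoring the ranking against expert relevance labels — can be sketched as follows. This is an illustrative outline only, not the paper's implementation; the function names and the toy vectors are assumptions, and the precision-at-5 metric matches the one reported in the Results.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two document embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query_vec: np.ndarray,
                       candidates: dict[str, np.ndarray]) -> list[str]:
    """Return candidate document IDs sorted by descending cosine
    similarity to the query document's embedding."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in candidates.items()]
    return [doc_id for doc_id, _ in sorted(scored, key=lambda x: -x[1])]

def precision_at_k(ranked_ids: list[str],
                   relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that experts
    judged relevant (precision at position k)."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k
```

In a RELISH-style benchmark, `query_vec` and the candidate vectors would come from one of the compared models (TF-IDF, Doc2Vec, a MeSH-augmented hybrid, or a transformer), and `relevant_ids` from the 2-class expert judgments; averaging `precision_at_k` over all query documents gives the reported precision at position 5.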
Conclusion: Although the performance gains from concept integration are modest, the benefits lie in the successful integration and benchmarking of structured vocabularies with embedding methods, in the applicability of these techniques to aligning literature with other data sources via a controlled vocabulary, and in the potential for stronger performance on tasks and corpora where concept-based resources are better suited. All experiments are preserved as a Dockerized pipeline, making the full benchmarking workflow reproducible and supporting future research in biomedical document retrieval.