Benchmarking MeSH-Augmented Embeddings for Biomedical Document Similarity
Abstract
Background: The extensive volume of biomedical scientific literature requires efficient methods for retrieving relevant documents based on semantic technologies and biomedical concepts. While embedding-based methods have shown improvements over traditional keyword-based methods, the integration of domain-specific terminologies such as Medical Subject Headings (MeSH) into these models remains underexplored.

Methods: This study compares three hybrid methods that integrate MeSH-based annotations with document embeddings, called "pre-annotation", "post-annotation" and "post-reduction". We benchmark these hybrid methods against the following traditional methods: TF-IDF, standard neural embeddings (Word2Vec, fastText, Doc2Vec) and publicly available transformer-based models (BioBERT, SciBERT, SPECTER), using cosine similarity and Word Mover's Distance (WMD) as evaluation metrics. The benchmark experiments are based on the RELISH corpus, a manually curated dataset of PubMed articles in which experts have labeled pairs of documents with regard to their relevance to each other, providing both a 2-class (relevant vs. non-relevant) and a 3-class (relevant, partially relevant, non-relevant) judgment.

Results: Transformer-based models, particularly BioBERT and SciBERT, align best with the expert judgments after fine-tuning. Among non-transformer methods, Doc2Vec and the MeSH-based hybrid methods also perform well, demonstrating the benefit of combining structured biomedical vocabularies with embedding methods. Our experiments deliver extensive results showing that a baseline performance of 76-78% precision at position 5 is achieved by almost all approaches, that MeSH concepts yield improvements of 2-4%, and that performance of up to 90% is reached only by the fine-tuned large-scale public models.
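The evaluation protocol described above — ranking candidate documents by embedding similarity and scoring the ranking against expert relevance labels — can be sketched as follows. This is an illustrative outline only, not the paper's implementation; the function names and the toy vectors are assumptions, and the precision-at-5 metric matches the one reported in the Results.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two document embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_similarity(query_vec: np.ndarray,
                       candidates: dict[str, np.ndarray]) -> list[str]:
    """Return candidate document IDs sorted by descending cosine
    similarity to the query document's embedding."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in candidates.items()]
    return [doc_id for doc_id, _ in sorted(scored, key=lambda x: -x[1])]

def precision_at_k(ranked_ids: list[str],
                   relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that experts
    judged relevant (precision at position k)."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k
```

In a RELISH-style benchmark, `query_vec` and the candidate vectors would come from one of the compared models (TF-IDF, Doc2Vec, a MeSH-augmented hybrid, or a transformer), and `relevant_ids` from the 2-class expert judgments; averaging `precision_at_k` over all query documents gives the reported precision at position 5.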
Conclusion: Although the performance gains from concept integration are modest, the benefits lie in the successful integration and benchmarking of structured vocabularies with embedding methods, in the applicability of these techniques to aligning literature with other data sources via a controlled vocabulary, and in the potential for stronger performance on tasks and corpora where concept-based resources are better suited. All experiments are preserved as a Dockerized pipeline, making the full benchmarking workflow reproducible and supporting future research in biomedical document retrieval.