cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature

Jamie Campbell
Antoine D. Lain
T. Ian Simpson

Abstract

cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It uses programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain-text files, standardising them for downstream biomedical text mining. During retrieval of a Developmental Disorders Corpus (204,043 publications), it achieved an 85.2% full-text retrieval rate with institutional subscriptions and 54.4% without. To test the fidelity of the retrieved full texts, we used ScispaCy to infer the similarity of paired documents from 44,264 open-access PubMed Central files and the corresponding files retrieved by cadmus, resulting in an average cosine similarity score of 0.98. Rarefaction analyses demonstrated that full-text corpora double the coverage of unique biomedical concepts relative to abstracts, giving better access to the depth of the available biomedical information.
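The two evaluation ideas in the abstract, paired-document cosine similarity and rarefaction of unique concepts, can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the cadmus API: it assumes simple word-count vectors in place of ScispaCy embeddings, and the function names (`bag_of_words`, `cosine_similarity`, `rarefaction_curve`) and the toy documents and concept sets are hypothetical, chosen only for illustration.

```python
import math
import random
from collections import Counter


def bag_of_words(text):
    """Word-count vector for a text (an assumed stand-in for ScispaCy embeddings)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def rarefaction_curve(docs_concepts, seed=0):
    """Cumulative number of unique concepts as documents are added in random order."""
    order = list(docs_concepts)
    random.Random(seed).shuffle(order)
    seen = set()
    curve = []
    for concepts in order:
        seen.update(concepts)
        curve.append(len(seen))
    return curve


# Toy paired documents (hypothetical): a reference copy and a retrieved copy.
reference = "Mutations in SHANK3 cause developmental delay"
retrieved = "mutations in SHANK3 cause developmental delay"
score = cosine_similarity(bag_of_words(reference), bag_of_words(retrieved))

# Toy concept sets (hypothetical): the same two papers, as abstracts and as full texts.
abstract_concepts = [{"SHANK3"}, {"autism"}]
fulltext_concepts = [{"SHANK3", "synapse", "mouse model"}, {"autism", "SHANK3", "EEG"}]
abstract_curve = rarefaction_curve(abstract_concepts)
fulltext_curve = rarefaction_curve(fulltext_concepts)
```

In this toy setting the paired texts differ only in capitalisation, so their similarity is close to 1, and the full-text curve ends at a higher count of unique concepts than the abstract curve. A real analysis would replace the word counts with document embeddings and the hand-written sets with concepts extracted by a named-entity recogniser.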

Availability and implementation

cadmus is freely available for non-commercial research at https://github.com/biomedicalinformaticsgroup/cadmus and is released under the MIT License.

Version published to 10.64898/2026.05.16.725623 on bioRxiv, May 19, 2026

Related articles

Benchmarking MeSH-Augmented Embeddings for Biomedical Document Similarity

This article has 6 authors:
1. Rohitha Ravinder
2. Lukas Geist
3. Nelson Quiñones
4. Suhasini Venkatesh
5. Leyla Jael Castro
6. Dietrich Rebholz-Schuhmann
This article has no evaluations
Latest version Apr 13, 2026

To RAG, or Not to RAG? A Comparative Evaluation of Retrieval-Augmented Generation for ICD Coding of German Tumor Diagnoses

This article has 7 authors:
1. Fatma Alickovic
2. Stefan Lenz
3. Arsenij Ustjanzew
4. Lakisha Ortiz Rosario
5. Georg Vollmar
6. Thomas Kindler
7. Torsten Panholzer
This article has no evaluations
Latest version Jun 3, 2026

All Models are Wrong, Some are Annotated: Automating Metadata in Biomedical Repositories

This article has 3 authors:
1. Inessa Cohen
2. Hongyi Yu
3. Robert A. McDougal
This article has no evaluations
Latest version Apr 27, 2026
