Study Design Indexing in Transition: A Focused Comparison of Manual NLM Indexing vs. Transformer-Based Automated Models

Puranjani Das
Jodi Schneider
Evan Mayo-Wilson
Halil Kilicoglu
Joe D. Menke
Dongin Nam
Kiran Ninan
Jean-Pierre Oberste
Ang Michael Troy
Xiangji Ying
Arthur W. Holt
Neil R. Smalheiser

Abstract

Objectives

Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. Such a comparison is challenging for at least three reasons. First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, which are themselves subject to errors. Second, TM’s probabilistic predictive scores reflect uncertainty and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, whereas NLM labels are categorical. Third, our goal (to tag only articles that exhibit a given design) differs from that of NLM, which tags articles that discuss a design as well as those that exhibit it.
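One way a probabilistic score might be turned into a TRUE/FALSE assignment is a simple cutoff, sketched below. The function name `to_labels`, the example scores, and the two cutoffs (0.9 and 0.3) are illustrative assumptions and are not taken from the study; they only show how the same scores can yield different labels for users with different needs.

```python
def to_labels(scores, threshold):
    """Convert probabilistic scores to TRUE/FALSE labels using a cutoff.

    `threshold` is a hypothetical, user-chosen value, not one reported
    in the study.
    """
    return [score >= threshold for score in scores]


# Illustrative scores for four articles (made-up values).
scores = [0.97, 0.62, 0.40, 0.03]

# A strict cutoff favors precision: only very confident predictions are TRUE.
strict_labels = to_labels(scores, threshold=0.9)

# A lenient cutoff favors recall: more articles are tagged TRUE.
lenient_labels = to_labels(scores, threshold=0.3)
```

A user screening for a systematic review might prefer the lenient cutoff, while a user who wants few false positives might prefer the strict one.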

Materials and Methods

Therefore, we carried out a limited evaluation of TM that focuses only on the articles that received its most confident predictions but disagreed with NLM assignments: the highest scores, which are almost certainly TRUE, and the lowest scores, which are almost certainly FALSE. The evaluation covered articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, two annotators independently indexed the articles, following written definitions, for four prominent study designs: cohort, case-control, cross-sectional, and case report.
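The selection of confident disagreements described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Article` record, its field names, and the function `confident_disagreements` are hypothetical, and the default `k=100` mirrors the "top 100" and "lowest 100" sets reported in the Results.

```python
from typing import NamedTuple


class Article(NamedTuple):
    pmid: str         # hypothetical identifier field
    tm_score: float   # assumed model score in [0, 1] for one study design
    nlm_label: bool   # True if NLM assigned that study design


def confident_disagreements(articles, k=100):
    """Return the two sets of articles where TM most confidently disagrees with NLM.

    top:    the k highest-scoring articles that NLM did NOT tag with the design
    bottom: the k lowest-scoring articles that NLM DID tag with the design
    """
    not_assigned = [a for a in articles if not a.nlm_label]
    assigned = [a for a in articles if a.nlm_label]
    top = sorted(not_assigned, key=lambda a: a.tm_score, reverse=True)[:k]
    bottom = sorted(assigned, key=lambda a: a.tm_score)[:k]
    return top, bottom
```

Both returned sets would then go to human annotators, whose judgments serve as the ground truth against which TM and NLM are compared.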

Results

For three designs (case-control, case report, cross-sectional), the 100 articles with the highest TM predictive scores among those to which NLM had not assigned that design were judged to exhibit the design in the great majority of cases (86-100%). Conversely, the 100 articles with the lowest TM predictive scores among those to which NLM had assigned the design exhibited it in relatively few cases (0-21%). The most confident TM predictions were highly accurate and not redundant with automated NLM indexing; the exception was cohort study articles, for which both TM and NLM labels showed high rates of errors of both omission and commission.

Discussion and Conclusion

TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.

Version published to 10.64898/2026.06.03.26354854 on medRxiv
Jun 4, 2026
