Enhancing automated indexing of publication types and study designs in biomedical literature using full-text features

Abstract

Objective

Searching for biomedical articles by publication type or study design is essential for tasks like evidence synthesis. Prior work has relied solely on PubMed information or addressed a limited set of types (e.g., randomized controlled trials). In this study, we build on previous work by leveraging full-text features, enriched text representations, and advanced optimization techniques for comprehensive indexing.

Methods

Using a dataset of PubMed articles published between 1987 and 2023 with human-annotated indexing terms, we fine-tuned BERT-based encoders (PubMedBERT, BioLinkBERT, SPECTER, SPECTER2-Base, SPECTER2-Clf) to investigate whether text representations based on different pre-training objectives could benefit the task. We incorporated textual and verbalized metadata features, full-text extraction (rule-based, extractive, and abstractive summarization), and additional topical information about the articles. To mitigate potential label noise and improve calibration, we used asymmetric loss and label smoothing. We also explored contrastive learning approaches (SimCSE, ADNCE, HeroCon, WeighCon). Models were evaluated using precision, recall, F1 score (both micro- and macro-), and area under ROC curve (AUC).
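
As a rough illustration of the loss-side techniques named above, the following Python sketch combines label smoothing with an asymmetric loss for a multi-label prediction. It follows the commonly published form of asymmetric loss (separate focusing exponents for positive and negative labels, plus a probability margin for negatives). The function names, the smoothing variant, and all hyperparameter values (eps, gamma_pos, gamma_neg, margin) are assumptions chosen for illustration and are not taken from this study.

```python
import math


def smooth_label(y, eps=0.1):
    """Binary label smoothing: pull a hard 0/1 target toward 0.5 by eps/2."""
    return y * (1.0 - eps) + 0.5 * eps


def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, tiny=1e-8):
    """Asymmetric loss for one label.

    p: predicted probability for the label, y: target in [0, 1].
    All default hyperparameters are illustrative assumptions.
    """
    # Probability shifting: negatives predicted below the margin contribute nothing.
    p_neg = max(p - margin, 0.0)
    # Positive part: focal-style down-weighting of easy positives.
    pos_term = y * (1.0 - p) ** gamma_pos * math.log(max(p, tiny))
    # Negative part: stronger down-weighting of easy negatives (gamma_neg > gamma_pos).
    neg_term = (1.0 - y) * p_neg ** gamma_neg * math.log(max(1.0 - p_neg, tiny))
    return -(pos_term + neg_term)


def multilabel_loss(probs, targets, eps=0.1, **kw):
    """Sum the asymmetric loss over all labels after smoothing the targets."""
    return sum(
        asymmetric_loss(p, smooth_label(y, eps), **kw)
        for p, y in zip(probs, targets)
    )
```

With these assumed settings, a missed positive (p = 0.1, y = 1) is penalized more heavily than an equally confident false positive (p = 0.9, y = 0), which is the intended effect when most of the labels are negative for any given article.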

Results

Fine-tuning SPECTER2-Base with asymmetric loss, label smoothing, and contrastive learning (ADNCE and HeroCon) improved performance significantly over the previous best model (micro-F1: 0.658 → 0.670 [+1.8% relative]; macro-F1: 0.643 → 0.677 [+5.3% relative]; p < 0.001). Asymmetric loss and using SPECTER2-Base instead of PubMedBERT contributed most to this gain, while contrastive learning provided more moderate gains. Full-text features gave relative improvements of 2.4% (micro-F1) and 0.8% (macro-F1) over the baseline (micro-F1: 0.656 → 0.672; macro-F1: 0.595 → 0.600; p < 0.001).
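
To make the reported metrics easier to interpret, the short Python sketch below shows how micro- and macro-averaged F1 can diverge when labels differ in frequency, and how the percentage gains quoted above follow from the F1 values as relative changes. The per-label counts and the helper names are hypothetical and serve only as an illustration; they are not data from this study.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps each label to a (tp, fp, fn) tuple."""
    # Micro-F1 pools the counts, so frequent labels dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    # Macro-F1 averages per-label F1, so rare labels weigh as much as frequent ones.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


def relative_gain(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Hypothetical counts: one frequent, well-predicted label and one rare, harder label.
example_counts = {
    "frequent_label": (90, 10, 10),
    "rare_label": (2, 3, 5),
}
micro, macro = micro_macro_f1(example_counts)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")

# Relative gains recomputed from the F1 values reported in the Results.
for old, new in [(0.658, 0.670), (0.643, 0.677), (0.656, 0.672), (0.595, 0.600)]:
    print(f"{old:.3f} -> {new:.3f}: {relative_gain(old, new):+.1f}%")
```

In the hypothetical example the rare label pulls macro-F1 well below micro-F1, which is why a larger macro-F1 gain suggests improvement on less frequent publication types.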

Conclusion

Full-text features, citation-aware encoders, and fine-tuning optimizations significantly improve publication type and study design indexing. Future work should refine label accuracy, better distill relevant full-text information, and expand label sets to meet the needs of the research community. Data, code, and models are available at https://github.com/ScienceNLP-Lab/MultiTagger-v2.

Highlights

  • We trained and validated Transformer-based models for automatic indexing of publication types and study designs in biomedical articles, using a dataset with 61 labels derived primarily from expert-assigned PubMed indexing terms.

  • We investigated whether enriched article representations, advanced optimization techniques, and fine-grained labels could enhance model performance.

  • The largest performance improvement came from using citation-aware article representations and asymmetric loss.

  • Models trained using full-text features outperformed models trained using PubMed-only features, demonstrating the utility of full-text content for this task.
