Enhancing automated indexing of publication types and study designs in biomedical literature using full-text features
Abstract
Objective
Searching for biomedical articles by publication type or study design is essential for tasks like evidence synthesis. Prior work has relied solely on PubMed information or a limited set of types (e.g., randomized controlled trials). This study builds on our previous work by leveraging full-text features, alternative text representations, and advanced optimization techniques.
Methods
Using a dataset of PubMed articles published between 1987 and 2023 with human-curated indexing terms, we fine-tuned BERT-based encoders (PubMedBERT, BioLinkBERT, SPECTER, SPECTER2, SPECTER2-Clf) to investigate whether text representations based on different pre-training objectives could benefit the task. We incorporated textual and verbalized metadata features, full-text extraction (rule-based, extractive, and abstractive summarization), and additional topical information about the articles. To improve calibration and mitigate label noise, we used asymmetric loss and label smoothing. We also explored contrastive learning approaches (SimCSE, ADNCE, HeroCon, WeighCon). Models were evaluated using precision, recall, F1 score (both micro- and macro-), and area under the ROC curve (AUC).
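To make the loss choice concrete, the following is a minimal sketch of asymmetric loss (in the style of Ridnik et al.) combined with label smoothing for multi-label classification, as PyTorch code. The function name and all hyperparameter values (`gamma_pos`, `gamma_neg`, `clip`, `smoothing`) are illustrative assumptions, not the paper's actual configuration.

```python
import torch

def asymmetric_loss_with_smoothing(logits, targets,
                                   gamma_pos=0.0, gamma_neg=4.0,
                                   clip=0.05, smoothing=0.1):
    """Asymmetric multi-label loss with label smoothing (illustrative sketch).

    logits:  raw model outputs, shape (batch, n_labels)
    targets: 0/1 label matrix, same shape
    """
    # Label smoothing softens hard 0/1 targets to reduce the impact
    # of noisy human-curated indexing labels.
    targets = targets * (1 - smoothing) + 0.5 * smoothing
    p = torch.sigmoid(logits)
    # Probability shift: easy negatives below `clip` contribute no gradient.
    p_neg = (p - clip).clamp(min=0)
    # Separate focusing exponents let negatives be down-weighted more
    # aggressively than positives (gamma_neg > gamma_pos).
    loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=1e-8))
    loss_neg = (1 - targets) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=1e-8))
    return -(loss_pos + loss_neg).mean()
```

The asymmetry (a large `gamma_neg`) matters in this setting because most publication-type labels are absent for any given article, so easy negatives would otherwise dominate the gradient.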
Results
Fine-tuning SPECTER2-base with the added MeSH term “Animals”, asymmetric loss with label smoothing, and WeighCon contrastive loss significantly improved performance over the previous best architecture (micro-F1: 0.664 → 0.679 [+2.2%]; macro-F1: 0.663 → 0.690 [+4.1%]; p < 0.0001). Asymmetric loss and using SPECTER2-base instead of PubMedBERT contributed most to this gain. Full-text features boosted performance by 2.4% (micro-F1) and 1.8% (macro-F1) over the baseline (micro-F1: 0.616 → 0.631; macro-F1: 0.556 → 0.566; p < 0.0001). Topical label splitting and contrastive learning provided minor, non-significant improvements.
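Because the results report both micro- and macro-F1, it is worth recalling how the two aggregations differ: micro-F1 pools true/false positives and negatives across all labels (so frequent publication types dominate), while macro-F1 averages per-label F1 (so rare study designs count equally). A minimal sketch of both, with illustrative helper names:

```python
def f1_scores(y_true, y_pred):
    """Micro- and macro-averaged F1 for multi-label 0/1 matrices.

    y_true, y_pred: lists of equal-length 0/1 lists
    (rows = articles, columns = publication-type labels).
    """
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if p and t:
                tp[j] += 1      # correctly assigned label
            elif p and not t:
                fp[j] += 1      # spurious label
            elif t and not p:
                fn[j] += 1      # missed label
    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0
    # Micro: pool counts across labels, then compute one F1.
    micro = f1(sum(tp), sum(fp), sum(fn))
    # Macro: compute F1 per label, then average (rare labels weigh equally).
    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    return micro, macro
```

A macro-F1 gain larger than the micro-F1 gain, as reported above, suggests the improvements are disproportionately helping less frequent label classes.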
Conclusion
Full-text features, enhanced document representations, and fine-tuning optimizations improve publication type and study design indexing. Future work should refine label accuracy, better distill relevant article information, and expand label sets to meet the needs of the research community. Data, code, and models are available at https://github.com/ScienceNLPLab/MultiTagger-v2.