Characterization and automated classification of sentences in the biomedical literature: a case study for biocuration of gene expression and protein kinase activity

Daniela Raciti
Kimberly M. Van Auken
Valerio Arnaboldi
Christopher J. Tabone
Hans-Michael Muller
Paul W. Sternberg

Abstract

Biological knowledgebases are essential resources for biomedical researchers, providing ready access to gene function and genomic data. Professional, manual curation of knowledgebases, however, is labor-intensive, so high-performing machine learning methods that improve biocuration efficiency are needed. Here we report on sentence-level classification to identify biocuration-relevant sentences in the full text of published references for two gene function data types: gene expression and protein kinase activity. We performed a detailed characterization of sentences from references in the WormBase bibliography and used this characterization to define three tasks for classifying sentences as either 1) fully curatable, 2) fully and partially curatable, or 3) all language-related. We evaluated various machine learning (ML) models on these tasks and found that GPT and BioBERT achieve the highest average performance, with F1 scores ranging from 0.89 to 0.99 depending on the task. Our findings demonstrate the feasibility of extracting biocuration-relevant sentences from full text. Integrating these models into professional biocuration workflows, such as those used by the Alliance of Genome Resources and the ACKnowledge community curation platform, could facilitate efficient and accurate annotation of the biomedical literature.
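As a rough illustration of how the three classification tasks and the F1 metric relate, the sketch below turns one set of sentence categories into three binary labelings and scores predictions with F1. The category names, the `TASK_POSITIVES` mapping, and both helper functions are assumptions made for this example; the abstract does not give the exact label set or the evaluation code used by the authors.

```python
from typing import Dict, List, Set

# Hypothetical category names; the abstract does not list the exact labels.
# Each task treats a progressively broader set of categories as positive.
TASK_POSITIVES: Dict[str, Set[str]] = {
    "fully_curatable": {"fully_curatable"},
    "fully_and_partially_curatable": {"fully_curatable", "partially_curatable"},
    "all_language_related": {
        "fully_curatable",
        "partially_curatable",
        "language_related",
    },
}


def binarize(categories: List[str], task: str) -> List[int]:
    """Map per-sentence categories to 0/1 labels for the given task."""
    positives = TASK_POSITIVES[task]
    return [1 if category in positives else 0 for category in categories]


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1: harmonic mean of precision and recall on the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy annotations for four sentences (invented for illustration only).
gold = ["fully_curatable", "partially_curatable", "language_related", "not_relevant"]
y_true = binarize(gold, "fully_and_partially_curatable")
y_pred = [1, 0, 0, 0]  # a made-up classifier output
score = f1_score(y_true, y_pred)
```

Framing the tasks as nested positive sets means the same annotated sentences can be reused for all three evaluations; only the binarization step changes.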

Version published to 10.1101/2025.01.06.631539 on bioRxiv
Jan 8, 2025

Related articles

PubMind: Literature-Based Genetic Variant Extraction and Functional Annotation Using Large Language Models

This article has 2 authors:
1. Peng Wang
2. Kai Wang
This article has no evaluations. Latest version: Oct 15, 2025

Ontology pre-training improves machine learning-based predictions for metabolites

This article has 7 authors:
1. Charlotte Tumescheit
2. Martin Glauer
3. Simon Flügel
4. Martin Larralde
5. Fabian Neuhaus
6. Till Mossakowski
7. Janna Hastings
This article has no evaluations. Latest version: Oct 2, 2025

Microbial Named Entity Recognition and Normalisation for AI-assisted Literature Review and Meta-Analysis

This article has 8 authors:
1. Dhylan Patel
2. Antoine D. Lain
3. Avish Vijayaraghavan
4. Nazanin Faghih Mirzaei
5. Monica N. Mweetwa
6. Meiqi Wang
7. Tim Beck
8. Joram M. Posma
This article has no evaluations. Latest version: Aug 30, 2025
