High-precision Biomedical Text Corpora for Multi-Entity Recognition: A CoDiet study

Antoine D. Lain
Stephanie Go
Alisha Mahmud
Shruti Rajendra
Ainara Cano San José
Katerina Loupasaki
Georgios Theodoridis
Maider Bizkarguenaga Uribiarte
Yajie Gu
Olga Deda
Ricardo Diogo Alves Conde
Nieves Embade
Ángela de Diego Rodríguez
Nerea Burguera
Danai Rossiou
Rubén Gil Redondo
Domniki Gallou
Itziar Tueros
Rakesh Velmurugan
Vasiliki Gkanali
Mercedes Caro Burgos
Petros Pousinis
George Alektoridis
Sara Arranz
Nasos Nikolopoulos
Xingchen Yan
Rebeca Fernández Carrión
Thomas Rowlands
Donghee Choi
Marek Rei
Chris Cave-Ayland
Adrian D’Alessandro
The CoDiet consortium
Tim Beck
Joram M. Posma


Abstract

We present four biomedical, multi-entity corpora, targeted at literature on metabolic syndrome, that can be used as benchmarks for named-entity recognition (NER). The CoDiet-Gold corpus (348,413 annotations) contains 500 redistributable full-text publications, each independently annotated by two human experts, with all disagreements adjudicated by a third expert. The CoDiet-Electrum corpus (2,349,499 annotations) contains 3,688 publications that were annotated using the entities from CoDiet-Gold. Finally, for the same 3,688 documents, two fully machine-annotated corpora, CoDiet-Bronze (2,399,647 annotations) and CoDiet-Silver (1,868,422 annotations), were created using existing NER algorithms. These corpora contain categories (organisms, diseases, genes, proteins, metabolites) that add depth to existing corpora, as well as new categories for which no other corpora exist (food, dietary methods, sample types, computational methods, study methodology, population characteristics, data types, and microbiome).

Version published to 10.1101/2025.09.04.673740 on bioRxiv
Sep 9, 2025

Related articles

CCF Database: A Machine-Learning-Annotated Corpus of 266,271 Canadian Climate Articles (1978–2024)

This article has 3 authors:
1. Antoine Claude Lemor
2. Alizée Pillod
3. Matthew Taylor
This article has no evaluations. Latest version: Jan 27, 2026

MultiMed-ST Datasets for Machine Translation in Medical Applications

This article has 2 authors:
1. Giridhar Gowda
2. Suma R
This article has no evaluations. Latest version: Jan 9, 2026

ASRD: Development and Validation of a Large-Scale Arabic Semantic Relation Dataset

This article has 6 authors:
1. Randah Alharbi
2. Tarek Helmy
3. Atika Al-Saghyir
4. Safa Aglan
5. Abdulrahman Alosaimy
6. Husni Al-Muhtaseb
This article has no evaluations. Latest version: Dec 10, 2025
