From manual entry to machine precision: challenges and evolution of metadata schema development in collaborative research centers

Felix Engel
Claudia Giuliani
Manuel Watter
Aref Kalantari
Karin Schuller
Harald Binder
Klaus Kaier


Abstract

Objective: Metadata standardization often struggles to balance broad interoperability with the specific needs of diverse research domains. This study demonstrates a "parent template" approach to schema development within large-scale German Collaborative Research Centers. We describe the iterative adaptation of a baseline schema developed for a nephrology-focused consortium (NephGen) to meet the distinct data management requirements of subsequent initiatives in tumor immunology (OncoEscape) and perinatal immune development (Pilot).

Results: The derivation process produced three structurally compatible schemas with varying vocabularies. The Pilot schema required the highest granularity (342 levels), followed by NephGen and OncoEscape (283). Despite the shared template, quantitative overlap was limited. Only 47% of OncoEscape levels and less than 34% of the Pilot vocabulary were retained from the NephGen baseline. Consequently, the Pilot schema had to introduce unique items like "Timeline" to capture immune development, while OncoEscape focused on "Oncogenes". While this strategy balanced standardization with specificity, limiting schema complexity just to suit manual entry is becoming obsolete. Since Large Language Models (LLMs) don't suffer from human fatigue, we argue that future schemas should stop being static forms. Instead, they need to become flexible, high-dimensional "instruction sets" tailored for AI extraction.
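
As an illustration of how a retention figure such as "47% of OncoEscape levels" might be computed, the following is a minimal sketch, assuming each schema's vocabulary can be represented as a flat set of level names. The function name `retained_fraction` and the toy vocabularies are hypothetical; they are not taken from the actual schemas, and the abstract does not state how the authors measured overlap.

```python
def retained_fraction(parent_levels: set[str], derived_levels: set[str]) -> float:
    """Share of the derived schema's levels that also occur in the parent schema."""
    if not derived_levels:
        return 0.0
    return len(parent_levels & derived_levels) / len(derived_levels)


# Hypothetical toy vocabularies for illustration only (not the real schemas).
parent = {"Species", "Tissue", "Sex", "Kidney region"}
derived = {"Species", "Tissue", "Sex", "Timeline", "Gestational age"}

# 3 of the 5 derived levels also appear in the parent.
print(f"{retained_fraction(parent, derived):.0%}")  # prints "60%"
```

Under this assumed definition the measure is asymmetric: it is normalized by the size of the derived vocabulary, so two derived schemas can share the same parent and still report different retention percentages.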

Version published to 10.21203/rs.3.rs-8711752/v1 on Research Square
Feb 10, 2026

Related articles

Accelerating metadata annotation in collaborative research centers: A hybrid AI workflow for biomedical entities

This article has 9 authors:
1. Manuel Watter
2. Felix Engel
3. Aref Kalantari
4. Claudia Giuliani
5. Karin Schuller
6. Claus-Werner Franzke
7. Markus Sperandio
8. Harald Binder
9. Klaus Kaier
This article has no evaluations
Latest version Mar 30, 2026

Leveraging Unsupervised Learning for Automated Schema Matching and Data Harmonization in Multi-Source Electronic Health Record Integration

This article has 3 authors:
1. Faith Harris
2. James Mcburnie
3. Mike Edwards
This article has no evaluations
Latest version Mar 6, 2026

Developing a scalable pipeline for data extraction from clinical letters through resource-efficient prompt engineering

This article has 14 authors:
1. Ariel Yuhan Ong
2. Quang Nguyen
3. Ishani Barai
4. Justin Engelmann
5. Fares Antaki
6. Mertcan Sevgi
7. David A Merle
8. Lie Ju
9. Eliot Dow
10. Yukun Zhou
11. Gregory Maniatopoulos
12. Yemisi Takwoingi
13. Alastair K Denniston
14. Pearse A Keane
This article has no evaluations
Latest version Mar 10, 2026
