Development and automated deployment of a specialised machine learning schema within a collaborative research centre: an explorative approach using large language models
Abstract
Achieving interoperability in machine learning (ML) workflows remains a significant challenge due to the heterogeneity of data types, algorithms, and application domains, as well as the lack of standardized metadata. In this study, we present the development of a specialized ML metadata schema within the context of the Small Data Initiative, a Collaborative Research Center characterized by diverse scientific approaches. We employed an interdisciplinary process combining expert input, iterative refinement, and schema validation using large language models (LLMs). A two-step LLM-based annotation methodology was applied to 14 representative scientific publications, using six different LLMs to identify both predefined metadata elements (step 1) and additional ML-related metadata elements (step 2). Manual validation through face-to-face interviews with the publications' main authors confirmed high precision rates of 70%–85% in the first step and 86%–98% in the second step, with notable performance variation across models. This approach enabled both the identification of schema inconsistencies and the integration of previously overlooked concepts, leading to the refinement of the metadata schema. The process supports an "AI-by-design" paradigm, ensuring that metadata schemas and annotation workflows are optimized from the outset for downstream AI/ML applications. Our findings also highlight the value of LLM benchmarking in selecting suitable models for domain-specific tasks. Overall, the proposed methodology enhances metadata quality, fosters reproducibility, and contributes to making research data more AI-ready.