Metappuccino: Large Language Model-driven Reconstruction of Sequence Read Archive Metadata for Cancer Research

Fiona Hak
Camille Marchet
Daniel Gautheret
Mélina Gallopin

Abstract

High-throughput RNA sequencing has significantly advanced transcriptomic profiling in oncology. Millions of RNA-seq datasets have accumulated in public databases such as the Sequence Read Archive (SRA). However, fragmented, ambiguous or missing metadata can severely limit accurate cohort selection, introduce bias and delay discoveries.

Results

To address these issues, we introduce Metappuccino: a metadata enrichment tool based on a Mistral-7B-Instruct large language model fine-tuned with low-rank adaptation (LoRA). Metappuccino can extract or infer 19 key metadata classes (e.g. organ, disease, cell type) from unstructured text. Fine-tuning was conducted with careful partitioning and training design to preserve the model’s generalisation capacity, reduce data leakage, and ensure robust, context-aware inference rather than memorisation. When possible, the inferred outputs are mapped to standardised ontologies, such as Cellosaurus, Disease Ontology and Uberon, to produce consistent metadata. As a result, the fine-tuned model achieves significantly improved class prediction accuracy over the base model and performs at least as well as recent large open-source models. Furthermore, it reduces inference time by a factor of at least two compared to the baseline models. As a pipeline, Metappuccino complements the LLM with well-established Natural Language Processing techniques from the literature to further improve performance. By enriching the metadata of under-annotated sequences, Metappuccino creates greater value from public RNA-seq datasets, with potential applications extending beyond oncology transcriptomics.
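The ontology-mapping step described above can be illustrated with a minimal sketch. It is not Metappuccino's actual implementation: the function name `map_to_ontology`, the four-entry lookup table and the similarity cutoff are assumptions made for illustration, and fuzzy string matching from the Python standard library stands in for whatever matching strategy the pipeline really uses. The sketch normalises a free-text organ label predicted by a model and returns the closest Uberon term, or nothing when no label is close enough.

```python
import difflib

# Illustrative subset of Uberon labels (hypothetical table for this sketch;
# a real pipeline would load the full ontology from its release files).
UBERON_LABELS = {
    "lung": "UBERON:0002048",
    "liver": "UBERON:0002107",
    "brain": "UBERON:0000955",
    "kidney": "UBERON:0002113",
}


def map_to_ontology(predicted, labels=UBERON_LABELS, cutoff=0.8):
    """Return (label, term_id) for the closest ontology label, or None.

    `cutoff` is an assumed similarity threshold between 0 and 1.
    """
    query = predicted.strip().lower()
    # An exact match after normalisation is accepted directly.
    if query in labels:
        return query, labels[query]
    # Otherwise fall back to fuzzy matching on the label strings.
    matches = difflib.get_close_matches(query, list(labels), n=1, cutoff=cutoff)
    if matches:
        return matches[0], labels[matches[0]]
    # No sufficiently similar label: leave the value unmapped.
    return None


print(map_to_ontology(" Lung "))            # ('lung', 'UBERON:0002048')
print(map_to_ontology("kidny"))             # ('kidney', 'UBERON:0002113')
print(map_to_ontology("peripheral blood"))  # None
```

Returning `None` instead of forcing a match mirrors the "when possible" wording above: a value that cannot be mapped confidently is better left as free text than assigned to the wrong term.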

Availability and Implementation

The source code of Metappuccino is available on GitHub: github.com/chumphati/Metappuccino. The fine-tuned LLM, MetappuccinoLLModel, is available on Hugging Face: huggingface.co/chumphati/MetappuccinoLLModel. Both repositories are released under the Apache-2.0 license.

Contact

fiona.hak@i2bc.paris-saclay.fr, daniel.gautheret@universite-paris-saclay.fr, melina.gallopin@i2bc.paris-saclay.fr

Version published to 10.1101/2025.10.31.685769 on bioRxiv
Nov 1, 2025

