Metappuccino: Large Language Model-driven Reconstruction of Sequence Read Archive Metadata for Cancer Research

Abstract

High-throughput RNA sequencing has significantly advanced transcriptomic profiling in oncology. Millions of RNA-seq datasets have accumulated in public databases such as the Sequence Read Archive (SRA). However, fragmented, ambiguous or missing metadata can severely limit accurate cohort selection, introduce bias and delay discoveries.

Results

To address these issues, we introduce Metappuccino: a metadata enrichment tool based on a Mistral-7B-Instruct large language model fine-tuned with low-rank adaptation (LoRA). Metappuccino can extract or infer 19 key metadata classes (e.g. organ, disease, cell type) from unstructured text. Fine-tuning was conducted with careful partitioning and training design to preserve the model’s generalisation capacity, reduce data leakage, and ensure robust, context-aware inference rather than memorisation. When possible, the inferred outputs are mapped to standardised ontologies, such as Cellosaurus, Disease Ontology and Uberon, to produce consistent metadata. As a result, the fine-tuned model achieves significantly higher class prediction accuracy than the base model and performs at least as well as recent large open-source models. Furthermore, it reduces inference time by a factor of at least two compared to the baseline models. As a pipeline, Metappuccino complements the LLM with well-established Natural Language Processing techniques from the literature to further improve performance. By enriching the metadata of under-annotated sequences, Metappuccino creates greater value from public RNA-seq datasets, with potential applications extending beyond oncology transcriptomics.
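
The ontology-mapping step described above can be illustrated with a minimal sketch. It is not Metappuccino's implementation: the function names, the three-entry lexicon and the similarity cutoff are assumptions chosen for illustration, and the matching uses Python's standard `difflib` rather than whatever method the pipeline actually applies. A free-text prediction is normalised, looked up exactly, then matched approximately, and left unmapped when nothing is close enough.

```python
import difflib
import re

# Toy lexicon (assumption): a real pipeline would load the full ontology
# (e.g. Uberon) rather than three hand-picked terms.
TOY_UBERON = {
    "liver": "UBERON:0002107",
    "lung": "UBERON:0002048",
    "brain": "UBERON:0000955",
}


def normalise(text):
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def map_to_ontology(predicted, lexicon, cutoff=0.8):
    """Return (label, term_id) for the closest lexicon entry, or None.

    `cutoff` is a hypothetical similarity threshold; below it the
    prediction is left unmapped instead of being forced onto a term.
    """
    key = normalise(predicted)
    if key in lexicon:  # exact match after normalisation
        return key, lexicon[key]
    close = difflib.get_close_matches(key, list(lexicon), n=1, cutoff=cutoff)
    if close:  # approximate match, e.g. a plural or a small typo
        return close[0], lexicon[close[0]]
    return None


if __name__ == "__main__":
    for raw in ["Liver", "lungs", "unknown tissue"]:
        print(raw, "->", map_to_ontology(raw, TOY_UBERON))
```

Returning `None` for unmatched predictions mirrors the "when possible" wording of the abstract: a prediction with no sufficiently close term stays as free text.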

Availability and Implementation

The source code of Metappuccino is available on GitHub: github.com/chumphati/Metappuccino. The fine-tuned LLM, MetappuccinoLLModel, is available on Hugging Face: huggingface.co/chumphati/MetappuccinoLLModel. Both repositories are released under the Apache-2.0 license.

Contact

fiona.hak@i2bc.paris-saclay.fr, daniel.gautheret@universite-paris-saclay.fr, melina.gallopin@i2bc.paris-saclay.fr
