Metappuccino: Large Language Model-driven Reconstruction of Sequence Read Archive Metadata for Cancer Research
Abstract
High-throughput RNA sequencing has significantly advanced transcriptomic profiling in oncology. Millions of RNA-seq datasets have accumulated in public databases such as the Sequence Read Archive (SRA). However, fragmented, ambiguous or missing metadata can severely limit accurate cohort selection, introduce bias and delay discoveries.
Results
To address these issues, we introduce Metappuccino: a metadata enrichment tool based on a Mistral-7B-Instruct large language model fine-tuned with low-rank adaptation (LoRA). Metappuccino can extract or infer 19 key metadata classes (e.g. organ, disease, cell type) from unstructured text. Fine-tuning was conducted with careful data partitioning and training design to preserve the model’s generalisation capacity, reduce data leakage, and ensure robust, context-aware inference rather than memorisation. When possible, the inferred outputs are mapped to standardised ontologies, such as Cellosaurus, Disease Ontology and Uberon, to produce consistent metadata. As a result, the fine-tuned model achieves significantly improved class prediction accuracy over the base model and performs at least as well as recent large open-source models. Furthermore, it reduces inference time by a factor of at least two compared to the baseline models. As a pipeline, Metappuccino complements the LLM with well-established Natural Language Processing techniques from the literature to further improve performance. By enriching the metadata of under-annotated sequences, Metappuccino creates greater value from public RNA-seq datasets, with potential applications extending beyond oncology transcriptomics.
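The ontology-mapping step mentioned above can be illustrated with a minimal sketch. It is not Metappuccino's implementation: the function name `map_to_ontology`, the three-entry `ORGAN_TERMS` vocabulary and the 0.8 similarity cutoff are assumptions made for illustration, and fuzzy string matching from the Python standard library stands in for whatever matching strategy the pipeline actually uses.

```python
import difflib

# Hypothetical toy vocabulary: a few Uberon organ labels and their identifiers.
# A real pipeline would load the full ontology rather than hard-code entries.
ORGAN_TERMS = {
    "liver": "UBERON:0002107",
    "lung": "UBERON:0002048",
    "brain": "UBERON:0000955",
}


def map_to_ontology(value, vocabulary, cutoff=0.8):
    """Return (label, term_id) for the closest vocabulary label, or None.

    `cutoff` is an assumed similarity threshold between 0 and 1; values
    inferred by the model that match no label closely enough stay unmapped.
    """
    normalised = value.strip().lower()
    if not normalised:
        return None
    matches = difflib.get_close_matches(
        normalised, list(vocabulary), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    label = matches[0]
    return label, vocabulary[label]


if __name__ == "__main__":
    for raw in [" Liver ", "lungs", "kidney"]:
        print(repr(raw), "->", map_to_ontology(raw, ORGAN_TERMS))
```

Returning `None` for values that match no term keeps unmapped predictions visible instead of forcing them onto the nearest label, which is consistent with the "when possible" qualifier above.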
Availability and Implementation
The source code of Metappuccino is available on GitHub: github.com/chumphati/Metappuccino. The fine-tuned LLM, MetappuccinoLLModel, is available on Hugging Face: huggingface.co/chumphati/MetappuccinoLLModel. Both repositories are released under the Apache-2.0 license.
Contact
fiona.hak@i2bc.paris-saclay.fr, daniel.gautheret@universite-paris-saclay.fr, melina.gallopin@i2bc.paris-saclay.fr