Extraction of biological terms using large language models enhances the usability of metadata in the BioSample database

Abstract

BioSample is a comprehensive repository of experimental sample metadata that serves as a central archive and enables experiment searches regardless of experiment type. However, the rules for describing metadata are difficult to define comprehensively, and submitters have limited awareness of metadata best practices; as a result, the metadata vary substantially from submitter to submitter. This inconsistency poses significant challenges to the findability and reusability of the data. Given the vast scale of BioSample, which hosts over 40 million records, manual curation is impractical. Rule-based automatic ontology mapping methods have been proposed to address this issue, but their effectiveness is limited by the heterogeneity of BioSample metadata. Recently, large language models (LLMs) have gained attention in natural language processing and are regarded as promising tools for automating metadata curation. In this study, we evaluated the performance of LLMs in extracting cell line names from BioSample descriptions using a gold-standard dataset derived from ChIP-Atlas, a secondary database of epigenomics experiment data whose samples are manually curated. Our results demonstrated that LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage. We further extended this approach to extracting information about experimentally manipulated genes from metadata, an attribute to which manual curation had not yet been applied in ChIP-Atlas. This extraction was also successful and benefits use of the database: it enables more precise filtering of data and prevents misinterpretation caused by the inclusion of unintended data. These findings underscore the potential of LLMs to improve the findability and reusability of experimental data in general, significantly reducing user workload and enabling more effective scientific data management.
