Extraction of biological terms using large language models enhances the usability of metadata in the BioSample database

Abstract

BioSample is a comprehensive repository of experimental sample metadata, playing a crucial role as an archive that enables experiment searches regardless of experiment type. However, the difficulty of comprehensively defining rules for describing metadata, together with limited user awareness of metadata best practices, has resulted in substantial variability between submitters. This inconsistency poses significant challenges to the findability and reusability of the data. Given the vast scale of BioSample, which hosts over 40 million records, manual curation is impractical. Rule-based automatic ontology-mapping methods have been proposed to address this issue, but their effectiveness is limited by the heterogeneity of BioSample metadata. Recently, large language models (LLMs) have gained attention in natural language processing and have emerged as promising tools for automating metadata curation. In this study, we evaluated the performance of LLMs in extracting cell line names from BioSample descriptions, using a gold-standard dataset derived from ChIP-Atlas, a secondary database of epigenomics experiment data that manually curates samples. Our results demonstrated that LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage. We further extended this approach to the extraction of information about experimentally manipulated genes, metadata to which manual curation had not yet been applied in ChIP-Atlas. This extension also proved successful, facilitating more precise filtering of data and preventing misinterpretation caused by the inclusion of unintended data. These findings underscore the potential of LLMs to improve the findability and reusability of experimental data in general, significantly reducing user workload and enabling more effective scientific data management.

Article activity feed

  1. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf070 ), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    **Reviewer: Christopher Tabone**

    This manuscript evaluates the use of large language models (LLMs) to improve the consistency and usefulness of BioSample metadata. The authors focus on extracting specific biological terms from free-text sample descriptions: first, identifying cell line names (using a curated gold standard for evaluation), and second, identifying experimentally modulated gene names (in a scenario without prior manual curation). An open-source 70B LLM (Llama 3.1) was used, and its performance was compared against a conventional ontology-mapping pipeline (MetaSRA). Overall, the study is well motivated - addressing the challenge of heterogeneous metadata - and the approach is generally sound and well documented. Below, I address specific aspects of the work in detail.

    **Methodological Appropriateness and Controls:** The methods are appropriate to the study's aims and are described in detail. The two-part evaluation (cell line extraction and gene name extraction without prior curation) aligns well with the goal of demonstrating LLM utility in metadata curation. The authors took care to construct a gold-standard dataset for cell line extraction by leveraging ChIP-Atlas's manually curated sample annotations. This approach avoids starting from scratch and ensures the evaluation is grounded in real experimental metadata. The sample selection strategy is well justified: using equal numbers of ChIP-seq and ATAC-seq samples to control for the presence/absence of protein names (a potential confounder for detecting cell lines), avoiding duplicate projects and identical terms, and restricting to human samples to leverage the Cellosaurus ontology. These controls strengthen the evaluation by preventing bias (e.g. one project dominating results or trivial cases duplicating answers). The LLM pipeline is clearly outlined (Figure 2): the model is prompted with BioSample attributes to extract a representative cell line term. Importantly, the authors compare this LLM-assisted pipeline against an existing rule-based method (the MetaSRA ontology-mapping pipeline). This serves as an essential control/baseline to quantify the improvement gained by using an LLM. For the second task (extracting modulated gene names), where no curated baseline exists, the authors sample thousands of BioSample entries and perform manual evaluation of the LLM's outputs. While manual checking is necessary here, the manuscript could clarify the evaluation procedure (e.g. how many evaluators or what criteria were used) to assure readers of consistency. Overall, the experimental design is solid. The necessary details (model used, prompt design, parameter settings such as temperature=0 for reproducibility) are all provided, and the authors have made their code publicly available, which aids reproducibility. The methodology is transparent and should allow others to replicate or build upon the work.
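    To make this concrete for readers, the following is a minimal sketch of the kind of zero-shot extraction step that Figure 2 describes. This is my illustrative reconstruction, not the authors' code: the endpoint, model identifier, prompt wording, and attribute fields are all assumptions.

    ```python
    # Illustrative sketch only: the endpoint, model identifier, prompt wording,
    # and attribute fields are assumptions, not the authors' implementation.
    from openai import OpenAI

    # A locally served Llama 3.1 model exposed via an OpenAI-compatible API
    # (e.g. a vLLM or llama.cpp server); URL and key are placeholders.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    def extract_cell_line(attributes: dict) -> str:
        """Ask the model for a single representative cell line term, or 'NA'."""
        metadata = "\n".join(f"{key}: {value}" for key, value in attributes.items())
        prompt = (
            "From the BioSample attributes below, extract the name of the cell "
            "line used in the experiment. Answer with the cell line name only, "
            "or 'NA' if no cell line is mentioned.\n\n" + metadata
        )
        response = client.chat.completions.create(
            model="llama-3.1-70b-instruct",  # hypothetical model identifier
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic output, matching the paper's setting
        )
        return response.choices[0].message.content.strip()

    # Invented BioSample-style attributes, for illustration:
    print(extract_cell_line({"cell type": "HeLa-S3", "assay": "ChIP-seq"}))
    ```

    In the actual pipeline, the returned term would then be mapped to a Cellosaurus entry; that ontology-grounding step is what makes the extraction useful for search.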
    **Support for Conclusions by Data:** The conclusions are, for the most part, well supported by the data presented. In the cell line extraction task, the LLM-based method clearly outperforms the traditional MetaSRA pipeline in both accuracy and coverage (Table 4). For example, the LLM pipeline achieved substantially higher coverage (93.0% vs 72.1% for MetaSRA) without sacrificing accuracy (~92.3% vs 90.3%), and it also showed improved precision in identifying non-cell line samples. These results validate the authors' claim that LLMs can more flexibly and comprehensively interpret metadata, mapping many more actual cell line samples to ontology terms while maintaining low false-positive rates. The data support the conclusion that the LLM approach enhances metadata findability (since far more samples get correctly annotated) and does so with high reliability. The authors appropriately note that the conventional method's conservative strategy yields high precision at the cost of leaving many samples unmapped, whereas the LLM can confidently map a greater portion of samples. This finding is well substantiated by the numbers and by the error analysis in Table 5 (which categorizes the few failure cases of the LLM, such as confusion with derivative cell lines or missing a cell line when certain keywords were absent).

    In the gene name extraction task, the authors report that the LLM identified at least one gene in 600 out of 3,723 tested samples, with an overall accuracy of ~80.3% for those outputs (about 91.6% accuracy on gene names themselves, and 84.7% on the associated modulation method). This demonstrates that the LLM can successfully parse complex descriptions to find gene perturbations in a majority of cases. While there is no baseline for direct comparison here, these results are consistent with the idea that LLMs can extend curation to new information types not yet curated (in this case, finding manipulated genes where no ontology or curated list existed). The authors' conclusions about the utility of this - for example, that it could allow users to filter out experiments with gene knockouts/knockdowns to avoid confounding effects - are reasonable extrapolations from the data. The discussion correctly notes that coverage for this gene task was not evaluated (since no gold standard exists) and acknowledges that some fraction of relevant cases might be missed. All major conclusions (the LLM outperforms rule-based methods; LLM extraction of new metadata is feasible and useful) are backed by the evidence provided. The authors also contextualize their findings by noting limitations and practical considerations (e.g. the processing throughput of ~400 samples/hour and the challenge of scaling to 40 million records). This adds credibility to their interpretation that LLM-based curation will need further resources or model improvements to handle the entire database. In summary, the data presented are analyzed in depth (with relevant tables, figures, and a breakdown of error types), and they support the paper's conclusions well. I have no concerns that the authors are overstating their results.

    **Language Clarity and Quality:** The manuscript is written in generally clear and professional English. The authors note that they translated the draft from Japanese with assistance from ChatGPT, and the result is readable and scientifically appropriate. The overall clarity is good: important terms are defined, and the narrative flows logically from the motivation to methods, results, and discussion. I did not encounter ambiguities that impede understanding of the science. There are only a few minor issues in language usage and grammar that require attention. For example, there is a small typo in the description of gene overexpression ("achieved by trasfection of a plasmid…" on page 19): "trasfection" should be "transfection" (unless this typo was carried over from the original prompt).
    Another example is the sentence "the outcomes of this study can handle these errors to rescue the affected published data for further use," which is awkwardly phrased; perhaps reword it to clarify that the methods developed can help correct metadata errors in submitted data. These are relatively minor edits; the manuscript does not require heavy language revision, just light editing for a few misspellings and some stylistic smoothing. The structure of the paper is appropriate, with a clear Introduction and well-labeled sections (Methods, Results/Discussion, Limitations, etc.). Data presentation is also clear: figures and tables are easy to interpret, and captions are explanatory. For example, the flowchart in Figure 2 and the definitions in Figure 3 clearly aid understanding of the pipeline and the metrics. In summary, with minor editorial changes, the quality of language and presentation will be suitable for publication.

    **Statistical Analysis and Data Presentation:** I am able to assess all the statistics and quantitative analyses in the manuscript, and they appear appropriate. The study primarily uses descriptive performance metrics (accuracy, coverage, precision, recall) to evaluate the extraction tasks; these are standard and well defined (the text and Figure 3 provide clear definitions of each metric in the context of the task). The comparisons between the LLM pipeline and the MetaSRA pipeline are straightforward to interpret. The authors did not perform complex statistical tests (e.g., no p-values are reported), which can be justified given that the magnitude and consistency of the improvements are evident and the evaluation emphasizes practical performance metrics rather than hypothesis testing. However, the manuscript states in Supplementary Table 1 that "no significant differences were observed" between the ChIP-seq and ATAC-seq subsets. If the authors intend "significant" to indicate statistical significance, they should include the specific statistical test used along with the associated test statistics and p-values to substantiate this claim. If no formal statistical testing was conducted, it would be more accurate to rephrase this as a qualitative observation rather than implying formal statistical support. All underlying data needed to interpret the results are provided either in the main figures/tables or in the supplementary material. The presentation of results is clear and transparent: Table 4 quantitatively summarizes the performance of each pipeline, and Table 5 qualitatively categorizes the errors made by the LLM. I have no other concerns about the appropriateness of the statistical methods used: the evaluation metrics are suitable for information extraction tasks, and the sample sizes (600 samples for the cell line task, and thousands scanned for the gene task) are adequate to support the conclusions. In terms of data transparency, the manuscript indicates that outputs and code are available (with a GitHub repository provided), which will allow others to reproduce the analysis.
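    For readers who want these metrics operationally, here is a small sketch of how accuracy and coverage could be computed from curated labels. The definitions paraphrase my reading of Figure 3 and may not match the authors' exact conventions; the sample IDs and terms are invented.

    ```python
    # Sketch of the accuracy/coverage computation; the definitions paraphrase
    # Figure 3 as I read it and may not match the authors' exact conventions.
    def evaluate(predictions: dict, gold: dict) -> dict:
        """predictions maps sample ID -> extracted term, or None if unmapped;
        gold maps sample ID -> the curated cell line term."""
        mapped = {s: term for s, term in predictions.items() if term is not None}
        correct = sum(1 for s, term in mapped.items() if gold.get(s) == term)
        return {
            # accuracy: of the samples the pipeline mapped, the fraction correct
            "accuracy": correct / len(mapped) if mapped else 0.0,
            # coverage: of all curated samples, the fraction mapped at all
            "coverage": len(mapped) / len(gold) if gold else 0.0,
        }

    gold = {"S1": "HeLa", "S2": "K-562", "S3": "GM12878"}
    predictions = {"S1": "HeLa", "S2": "K-562", "S3": None}
    print(evaluate(predictions, gold))  # accuracy 1.0 (2/2 mapped), coverage ~0.67
    ```

    Framed this way, the trade-off the manuscript describes is easy to see: a conservative pipeline keeps accuracy high by returning None more often, which is exactly what depresses coverage.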
    **Additional comments and suggestions:** Beyond the points above, I have a few minor suggestions to further strengthen the manuscript. First, it would be helpful if the authors could clarify in the Methods how the manual evaluation of gene name extraction was performed, for example whether multiple curators independently reviewed the outputs or whether any consensus procedure was employed to resolve ambiguous cases. Providing this detail would add transparency to the accuracy figures reported, although the existing explanation about handling ambiguous cases (e.g., fusion genes) is already helpful. Second, given the manuscript's emphasis on a zero-shot LLM approach, it would be beneficial for the authors to briefly discuss whether alternative strategies, such as fine-tuning smaller language models, were considered. This would more clearly position the study within the broader landscape of metadata curation techniques. Third, the authors describe the use of the locally deployed Llama 3.1 model and emphasize its advantages for data privacy and scalability. Since these benefits are significant for practical adoption, the manuscript would be further strengthened if the authors explicitly highlighted practical considerations such as specific hardware requirements (in addition to the graphics card usage already included) and runtime performance benchmarks. Finally, as mentioned earlier, the authors state in Supplementary Table 1 that "no significant differences were observed" between ChIP-seq and ATAC-seq samples. If "significant" is meant to indicate statistical significance, please include details of the specific statistical test and associated values (e.g., test statistics and p-values); if no formal testing was performed, please rephrase the statement as a qualitative observation rather than implying statistical testing. These points are relatively minor and do not indicate fundamental issues with the manuscript.

    **Recommendation:** In summary, this is a strong manuscript that addresses a pertinent problem in biological data management using modern LLM tools. The methods are sound and well controlled, the results are convincing, and the authors have been appropriately cautious and thorough in their analysis. I recommend minor revisions for this manuscript. The revisions needed are primarily editorial (minor language fixes and clarifications), with one note about statistics, and do not require additional experiments. With those addressed, the work should be suitable for publication in GigaScience.

  2. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf070 ), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    **Reviewer: Sajib Acharjee Dip**

    1. The gold-standard dataset constructed for evaluation, though carefully validated by experts, was limited to 600 samples (300 ChIP-seq and 300 ATAC-seq). Such a limited scope may introduce selection bias or fail to capture the full variability present across the entire BioSample database (>40 million records). It is unclear how representative these samples are of real-world metadata submissions. The authors should clearly demonstrate the representativeness of the sample selection or increase the sample size to better represent BioSample's diversity.

    2. The manuscript predominantly compares the proposed LLM-based approach to the MetaSRA pipeline. While MetaSRA is a relevant baseline, the omission of comparisons with other contemporary methods such as ChIP-GPT and Bioformer is a notable oversight. These tools represent significant advancements in the field and have demonstrated efficacy in tasks closely related to the study's objectives. A comprehensive evaluation against these methods, or at least a comparative discussion, would provide a clearer understanding of the proposed approach's relative performance and contributions. https://academic.oup.com/bib/article/25/2/bbad535/7600389 https://pmc.ncbi.nlm.nih.gov/articles/PMC10029052/

    3. "LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage." While the study reports improved performance over MetaSRA, the absence of comparisons with other SOTA methods renders this assertion less robust. Without such comparative analyses, it's challenging to attribute the observed improvements solely to the proposed approach.​ Rephrasing claims to accurately reflect the scope of the comparisons made would strengthen clarity.

    4. Despite the high accuracy, complex cases (fusion proteins, inhibitors mentioned indirectly, ambiguous terminology) were recognized as difficult yet excluded from the primary accuracy evaluations. By excluding these ambiguous cases from the performance metrics, the reported accuracy may be artificially inflated. The authors should provide additional metrics that include these complex or ambiguous cases, clearly quantifying any performance drop. This would offer more realistic insight into real-world applicability.

    5. The error categorization provided (derivation issues, overlooked terms, selection failures, etc.) is helpful but somewhat superficial. The deeper root causes, such as the LLM's lack of biological context knowledge, tokenization errors, or prompt ambiguity, were not thoroughly explored. I recommend a deeper qualitative analysis of specific error instances, highlighting precisely why the LLM made incorrect decisions (e.g., lack of biological understanding, misinterpretation of abbreviations, or limitations of the prompt wording).

    6. Temperature settings were fixed at zero for deterministic outputs. While deterministic settings are valuable for reproducibility, exploring or reporting the effect of temperature variation on accuracy and robustness would have strengthened the justification for this methodological choice (a sketch of such a sweep follows after this list).

    7. The authors have not sufficiently explored or justified their prompt-engineering choices, which are critical for reproducibility and optimization. I recommend providing additional experiments or discussion of alternative prompting strategies tested, including prompt variants that failed and the reasons why particular prompts were selected (an illustrative ablation also follows below).
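    To illustrate the temperature exploration suggested in point 6, here is a hypothetical sweep that repeats the same query at several temperatures and reports how often the modal answer recurs. The endpoint, model identifier, and prompt are invented for illustration and are not settings reported in the manuscript.

    ```python
    # Hypothetical temperature sweep; endpoint, model identifier, and prompt
    # are invented for illustration, not settings from the manuscript.
    from collections import Counter
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    PROMPT = ("Extract the cell line name from these BioSample attributes, "
              "or answer 'NA':\n\ncell type: HeLa-S3\nassay: ChIP-seq")

    def sample_answers(temperature: float, n: int = 10) -> Counter:
        """Query the model n times and tally the distinct answers returned."""
        answers = Counter()
        for _ in range(n):
            response = client.chat.completions.create(
                model="llama-3.1-70b-instruct",  # hypothetical identifier
                messages=[{"role": "user", "content": PROMPT}],
                temperature=temperature,
            )
            answers[response.choices[0].message.content.strip()] += 1
        return answers

    for t in (0.0, 0.3, 0.7, 1.0):
        tally = sample_answers(t)
        stability = tally.most_common(1)[0][1] / sum(tally.values())
        print(f"temperature={t}: modal-answer stability {stability:.0%} {dict(tally)}")
    ```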
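    And as a concrete form of the prompt ablation requested in point 7, a hypothetical comparison of prompt variants scored against a tiny curated set. The variant wordings, the toy gold set, the endpoint, and the model identifier are all invented; a real ablation would use a held-out slice of the gold-standard data.

    ```python
    # Hypothetical prompt-variant ablation; variant wordings, toy gold set,
    # endpoint, and model identifier are all invented for illustration.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    VARIANTS = {
        "terse": "Extract the cell line name, or answer 'NA':\n\n{meta}",
        "role-based": (
            "You are a biocurator. From the BioSample attributes below, report "
            "the cell line name only, or 'NA' if absent:\n\n{meta}"
        ),
        "stepwise": (
            "List any candidate cell line mentions in the attributes below, then "
            "output the single best term (or 'NA') alone on the last line:\n\n{meta}"
        ),
    }

    # Tiny curated set, invented for illustration.
    GOLD = [({"cell type": "HeLa-S3"}, "HeLa-S3"),
            ({"tissue": "peripheral blood"}, "NA")]

    def ask(prompt: str) -> str:
        response = client.chat.completions.create(
            model="llama-3.1-70b-instruct",  # hypothetical identifier
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # Take the last line so the 'stepwise' variant's final answer is scored.
        return response.choices[0].message.content.strip().splitlines()[-1]

    for name, template in VARIANTS.items():
        hits = sum(
            ask(template.format(meta="\n".join(f"{k}: {v}" for k, v in attrs.items()))) == answer
            for attrs, answer in GOLD
        )
        print(f"{name}: {hits}/{len(GOLD)} correct on the toy set")
    ```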