Enhanced semantic classification of microbiome sample origins using large language models (LLMs)

Abstract

Background

Over the past decade, central sequence repositories have expanded significantly in size. This vast accumulation of data holds value and enables further studies, provided that the data entries are well annotated. However, the submitter-provided metadata of sequencing records can be of heterogeneous quality, presenting significant challenges for re-use. Here, we test to what extent large language models (LLMs) can be used to cost-effectively automate the re-annotation of sequencing records against a simplified classification scheme of broad ecological environments with relevance to microbiome studies, without fine-tuning. This effort directly contributes to improving the FAIRness—findability, accessibility, interoperability, and reusability—of microbiome sequencing metadata, thereby enhancing their “AI readiness” for downstream computational analyses.

Results

We focused on sequencing samples taken from the environment, for which metadata is important. We employed OpenAI Generative Pre-trained Transformer models, and assessed scalability, time- and cost-effectiveness, as well as performance against a diverse, hand-curated benchmark with 1,000 examples that span a wide range of complexity in metadata interpretation. Annotation performance markedly outperformed that of a baseline, manually curated, non-ML keyword-based approach. Changing models (or model parameters) has only minor effects on performance, but prompts need to be carefully designed to match the task. Furthermore, when we compared proprietary OpenAI models with open-weight alternatives (e.g., Qwen, meta-Llama, and Microsoft-Phi-4), we found comparable accuracy for both biome and sub-biome classification, indicating that open-weight architectures can match the performance of proprietary models for large-scale ecological metadata re-annotation. We validated the pipeline with 1,000 hand-curated samples, and we applied the optimized pipeline to 2 million sequencing records from the environment, providing coarse-grained yet standardized sample origin annotations covering the globe.

Conclusions

Our work demonstrates the effective use of LLMs to simplify and standardize annotation from complex biological metadata.

Article activity feed

  1. Abstract: Over the past decade, central sequence repositories have expanded significantly in size. This vast accumulation of data holds value and enables further studies, provided that the data entries are well annotated. However, the submitter-provided metadata of sequencing records can be of heterogeneous quality, presenting significant challenges for re-use. Here, we test to what extent large language models (LLMs) can be used to cost-effectively automate the re-annotation of sequencing records against a simplified classification scheme of broad ecological environments with relevance to microbiome studies, without retraining. We focused on sequencing samples taken from the environment, for which metadata is important. We employed OpenAI Generative Pretrained Transformer (GPT) models, and assessed scalability, time and cost-effectiveness, as well as performance against a diverse, hand-curated ground-truth benchmark with 1000 examples that span a wide range of complexity in metadata interpretation. We observed that annotation performance markedly outperforms that of a baseline, manually curated, non-machine-learning keyword-based approach. Changing models (or model parameters) has only minor effects on performance, but prompts need to be carefully designed to match the task. We applied the optimized pipeline to more than 3.8 million sequencing records from the environment, providing coarse-grained yet standardized sampling site annotations covering the globe. Our work demonstrates the effective use of LLMs to simplify and standardize annotation from complex biological metadata.

    This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag015), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 3:

    The ability to reuse scientific data for secondary analysis is an extremely important topic. Since the promotion of the FAIR Guiding Principles a decade ago, the central importance of standards-adherent metadata has received considerable attention. Although this paper surprisingly doesn't mention the FAIR principles, the work is important in understanding what it takes to make datasets FAIR and "AI ready."

    A core problem with the paper is that it is unsure who its audience is. The paper is motivated by the needs of scientists to search for and reuse online datasets for secondary analysis, but much of the paper concerns highly technical issues that are related to fine-tuning LLM performance. It is laudable that the manuscript annotates its discussion of the authors' methods with pointers to actual Python scripts that would allow third parties to replicate the authors' work. The detailed presentation, however, may make it hard for many readers to understand the computational strategy that all the scripts are implementing. The organization of the paper weaves from discussions of the ability of LLMs to extract scientific standards from "legacy" experimental metadata to details of how to enhance computational efficiency to make the use of LLMs more cost-effective. The title and abstract of the paper suggest that the authors are aiming for a more scientific audience, but much of the manuscript focuses on arcane implementation details that will be less important to such readers.

    Missing from the paper is a detailed discussion of what the metadata in SRA are really like. The reader never sees complete examples of the metadata that are processed in the authors' work, and thus it is hard to have intuition about the problem that the authors are trying to solve. In particular, the paper doesn't present information about the range of attributes in user-defined metadata fields in SRA. The paper would benefit from a discussion of the structure of scientific metadata in general, and of how the authors' work fits into the larger effort in the research community to make datasets FAIR. (Full disclosure: My own laboratory is involved in such activity. See https://arxiv.org/abs/2504.05307v2)

    The abstract of the paper states that the authors "test to what extent LLMs can be used to cost-effectively automate the re-annotation of sequencing records." Alas, the paper really examines re-annotation of only the fields for "biome" and "location." A weakness of the paper is that the reader doesn't learn what other fields may be relevant in these metadata records, and why the authors chose to focus on the particular fields that they studied. Overall, much more attention should be placed on discussion of the limitations of the work and how well the results might scale to more general problems in standardization of scientific metadata.

    Minor comments:

    The term "biome" is never well defined.

    Frequently, parenthetical remarks begin with "e.g." and end with "etc." This style is redundant; you need only one of these abbreviations in each instance.

    Page 6, para 1: "last three digits" or "last three characters"? What is the motivation for consolidating reference ontologies into a single dictionary?

    Page 13, para 2: The notion of "lenient matches" requires much more discussion. If the goal is to make the legacy metadata standards-adherent, then a "lenient" match would not seem to be valid. The operative question is, "What metadata terms will users invoke to search for datasets?", and presumably users will be searching for standard terms only.

    Page 17, para 3: It's unclear what is meant by "most samples have fewer misclassifications." Fewer than what?

    Page 24, para 2: It's not clear what you mean when you say, "In half of the cases GPT correctly predicted the location, while the lat/lon coordinates parsed from the metadata were incorrect." Are you saying that GPT gave correct results when the lat/lon data were incorrect?

    Figure 1 is very busy and the tiny font is hard to read.

  2. Reviewer 2:

    Reproducibility report for: Enhanced semantic classification of microbiome sample origins using Large Language Models

    Journal: GigaScience

    ID number/DOI: GIGA-D-25-00316

    Reviewer(s): Laura Caquelin, Department of Clinical Neuroscience, Karolinska Institutet, Sweden [Wrote the report and reproduced the results]


    1. Summary of the Study

    This study evaluates whether Large Language Models (LLMs) can help re-annotate sequencing records. Using GPT models, the authors tested scalability, time, cost, and performance against a benchmark of 1,000 hand-curated examples. They then applied this approach to millions of environmental sequencing records, producing standardized annotations.


    2. Scope of reproducibility

    According to our assessment, the primary objective is to evaluate how closely GPT's annotation performance approached that of a human expert when classifying environmental sequencing samples into biomes and sub-biomes.

    • Outcome: Accuracy of biome and sub-biome classification compared against a hand-curated benchmark dataset.

    • Analysis method outcome: As described to validate biome classifications: "For paired comparisons of repeated sample IDs, we use the McNemar test, which is appropriate for paired binary outcomes (True/False)", while "for comparisons across different sample sets, we employ the t-test for independent samples. In both scenarios, a Bonferroni correction is applied to adjust for multiple comparisons".

    For sub-biomes, comparisons across different sets were performed with independent t-tests, while "for runs involving the same sample IDs, comparisons are performed using the paired t-test". Section "Validation statistics", pages 13-14.

    • Main result: "The improvement in accuracy between GPT's initial classification and the human's performance with the improved prompt was statistically significant (adj p-value=0.031), but also between the human's attempt first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). No significantly different performances were detected for the sub-biome classification, neither between GPT and the human, nor between prompt versions (adjp-value=1)." Section "Human versus GPT classification accuracy" pages 18-19.
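The paired comparison quoted under "Analysis method outcome" can be illustrated with a minimal sketch. It assumes the exact (binomial) variant of the McNemar test on per-sample True/False correctness; the manuscript does not state which variant was used, and the function names `mcnemar_exact` and `bonferroni` are hypothetical, not taken from the authors' scripts.

```python
from math import comb


def mcnemar_exact(run_a, run_b):
    """Two-sided exact McNemar test for paired True/False outcomes."""
    if len(run_a) != len(run_b):
        raise ValueError("paired runs must have the same length")
    # Only discordant pairs (one run correct, the other wrong) are informative.
    b = sum(1 for x, y in zip(run_a, run_b) if x and not y)
    c = sum(1 for x, y in zip(run_a, run_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the two runs agree on every sample
    # Under the null hypothesis, discordant pairs split 50/50 (binomial tail).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical correctness flags for the same ten sample IDs in two runs.
run_a = [True] * 8 + [False] * 2
run_b = [False] * 8 + [True] * 2
p_raw = mcnemar_exact(run_a, run_b)
p_adj = bonferroni([p_raw, 0.5, 0.04])
```

Because the correction multiplies each raw p-value by the number of comparisons, a result that looks significant in isolation can lose significance once every pairwise comparison in the output file is counted.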

    3. Availability of Materials

    a. Data

    • Data availability: Open
    • Data completeness: Complete = all data necessary to reproduce main results are available
    • Access Method: Repository
    • Repository: https://zenodo.org/records/16100607
    • Data quality: Complete but no metadata associated with the file

    b. Code


    4. Computational environment of reproduction analysis
    • Operating system for reproduction: MacOS 15.6.1
    • Programming Language(s): Python
    • Code implementation approach: Using shared code
    • Version environment for reproduction: Python 3.13.7

    5. Results

    5.1 Original study results

    • Results:

    "The improvement in accuracy between GPT's initial classification and the human's performance with the improved prompt was statistically significant (adj p-value=0.031), but also between the human's attempt first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). No significantly different performances were detected for the sub-biome classification, neither between GPT and the human, nor between prompt versions (adjp-value=1)."

    (The authors identified an error in the manuscript text during the review. The passage quoted above therefore needs to be updated; see the email exchange with the authors below.)

    5.2 Steps for reproduction

    -> Run the first two scripts of container 4 in the GitHub repository: validate_biomes_subbiomes.py and overall_analysis.py

    • Issue 1: The README instructions for setting up the ~/MicrobeAtlasProject directory can lead to a nested folder structure (~/MicrobeAtlasProject/MicrobeAtlasProject) if followed literally. This causes the Docker container to fail when attempting to access required files like gpt_file_label_map.tsv, since they are not found at the expected path /MicrobeAtlasProject/.

    -- Resolved: The issue was resolved by manually renaming and flattening the directory structure after extraction, ensuring that the contents of MicrobeAtlasProject_Zenodo are directly placed inside ~/MicrobeAtlasProject/. However, the current instructions can mislead users, so a clarification in the README would be helpful.

    • Issue 2: During the execution of the overall_analysis.py script, multiple files with the same label were found, requiring manual selection of the file to use for the analysis.

    -- Resolved: The manuscript does not specify which file should be selected to reproduce the results, leading to potential ambiguity. By default, I chose the most recent file among the options, assuming it reflects the final data version used in the manuscript. It would be helpful if the documentation or manuscript explicitly stated this to ensure exact reproducibility.
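The manual flattening described for Issue 1 could also be scripted. The following is a minimal sketch, assuming the nested folder has the same name as its parent (as in ~/MicrobeAtlasProject/MicrobeAtlasProject); the helper `flatten_nested` is hypothetical and not part of the authors' repository.

```python
import shutil
from pathlib import Path


def flatten_nested(root: Path) -> bool:
    """Move the contents of root/<root.name>/ up into root.

    Returns True if a nested folder was found and flattened, False otherwise.
    """
    nested = root / root.name
    if not nested.is_dir():
        return False
    for child in list(nested.iterdir()):
        target = root / child.name
        if target.exists():
            # Never overwrite files that already sit at the expected path.
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.move(str(child), str(target))
    nested.rmdir()  # the nested folder is empty at this point
    return True


if __name__ == "__main__":
    flatten_nested(Path.home() / "MicrobeAtlasProject")
```

Running it a second time is harmless, since it returns False once no nested folder remains.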

    -> Compare the results reproduced to the results presented in the manuscript

    • Issue 3: The results obtained by running validate_biomes_subbiomes.py are two files: biome_subbiome_results.csv and biome_subbiome_stats.csv, which contain a large amount of output (1,284 and 48,197 rows respectively). The script overall_analysis.py provides overall performance metrics in the terminal output, but does not produce the adjusted p-values relevant to the scope of this review.

    -- Unresolved: It was difficult to identify where to find the results presented in the manuscript, so an email was sent to the authors.

    Message sent by the authors

    Dear reviewer,

    By running validate_biomes_subbiomes.py as described (using gpt_file_label_map.tsv as --map_file), the output will be two .csv files named biome_subbiome_results.csv and biome_subbiome_stats.csv.

    The latter file will contain the stats (hence the adjusted p-values). Were you able to reproduce such files?

    We did notice there are a few mistakes.

    Mistake 1.

    In the manuscript it says:

    " A trained molecular biologist, with no prior exposure to the project, was given the same prompt instructions as GPT and was asked to classify sample biomes and sub-biomes. While against the benchmark dataset, GPT achieved an accuracy of 79.76% (n=499; SD=40.0), the human annotator reached 78.0% (n=250; SD=33.0). "

    The second standard deviation should be replaced with SD=42.0.

    Mistake 2.

    In the manuscript it says:

    " The improvement in accuracy between GPT's initial classification and the human's performance with the improved prompt was statistically significant (adj p-value=0.031), but also between the human's attempt first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). "

    The first adjusted p-value should not be 0.031 but 0.134 hence not significant so this sentence should be adjusted to:

    " There was an improvement in accuracy between the human's first attempt (with the initial prompt) and the human's second attempt (better prompt) (adj p-value≤0.001). "

    Mistake 3.

    We built on biome_subbiome_results.csv and biome_subbiome_stats.csv further than necessary so the two files on Zenodo should be "cut" earlier to avoid confusion. This was a problem of the script validate_biomes_subbiomes.py which concatenates on existing files (e.g.: biome_subbiome_results.csv and biome_subbiome_stats.csv) instead of creating new ones. We should probably proceed by replacing these two files with the files without repetitions.

    We thank you for your work and please do let us know if everything works out now.

    Thank you and kind regards, ####

    The authors confirm that the results presented in the manuscript can be found in the biome_subbiome_stats.csv file. This file contains 48,197 rows. According to the authors, it combines previously existing output with newly generated output, so it is difficult to determine which rows were reproduced. Even a Ctrl+F search in Excel for a reported p-value (e.g., 0.134) does not help: the value appears in several rows labeled with different configurations (label1/label2), such as:

    --- chunk_size3000/sync_chunkN_presp0.0
    --- chunk_size3000/sync_chunkN_temp1.5
    --- chunk_size5000/gpt4-0613
    --- machine/ sync_chunkY_topp0.0
    etc.
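Searching by configuration labels rather than by p-value would narrow the file down more reliably. The sketch below assumes the stats file has columns named `label1` and `label2`; these column names, and the helper `rows_for_comparison`, are assumptions for illustration and should be adapted to the actual header of biome_subbiome_stats.csv.

```python
import csv


def rows_for_comparison(stats_path, label1, label2):
    """Return the rows of a stats CSV whose label pair matches, in either order."""
    wanted = {label1, label2}
    with open(stats_path, newline="") as handle:
        return [
            row
            for row in csv.DictReader(handle)
            # Compare as sets so that A/B and B/A are treated as the same pair.
            if {row["label1"], row["label2"]} == wanted
        ]
```

If the file has been appended to across several runs, the same pair can still match more than one row, which is exactly the ambiguity described above.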

    5.3 Statistical comparison Original vs Reproduced results

    • Results: The biome_subbiome_stats.csv file was reproduced, but it is difficult to distinguish between the newly reproduced data and the existing data already present in the file. Additionally, the data presented in the manuscript are also hard to identify due to the size of the file. No comparison was performed.
    • Comments: -
    • Errors detected: During the review, the authors identified an error in the manuscript text: the first adjusted p-value is 0.134, not 0.031.
    • Statistical Consistency: No comparison was performed.

    6. Conclusion
    • Summary of the computational reproducibility review

    The main scripts needed to reproduce the results were successfully executed and the output files were generated. However, due to the size of the output files and the lack of precise references in the manuscript, it was difficult to identify which parts of the output correspond to the results presented in the paper. Moreover, the authors mentioned that the script adds data to existing output files rather than generating new ones, making it hard to distinguish between old and new data. This led to confusion when trying to compare the reproduced results with those in the manuscript. Consequently, a comparison of statistical values was not possible.

    • Recommendations for authors

    To improve the reproducibility of the manuscript, we recommend that the authors:

    -- Clarify the instructions in the README about the MicrobeAtlasProject folder.
    -- Ensure that scripts generate new outputs, or clarify which data are new vs. existing in the files, for example with a column indicating the origin (e.g., "new" or "existing").
    -- Link results in the manuscript to specific rows/sections in the output files so that the exact data used can be located easily. Alternatively, consider including a smaller or filtered version of the output files containing only the rows used for key results, figures, or tables, to make checking results easier and avoid errors.
    -- Metadata: for the data used or generated by the scripts, it would be helpful to include accompanying metadata files that explain:
    --- The definition of each variable name.
    --- The origin of each dataset (raw, processed, etc.).
    --- Any preprocessing steps applied before analysis.
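One way to follow the recommendation about new versus existing output is to write each run to its own file and tag every row with a run identifier. This is a minimal sketch under that assumption; `write_run_stats`, the `run_id` column, and the file-naming pattern are hypothetical and do not describe how validate_biomes_subbiomes.py currently works.

```python
import csv
from pathlib import Path


def write_run_stats(rows, out_dir, run_id):
    """Write one run's statistics to its own CSV file, tagged with run_id."""
    if not rows:
        raise ValueError("no rows to write")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"biome_subbiome_stats_{run_id}.csv"
    fieldnames = ["run_id", *rows[0].keys()]
    # Mode "x" raises FileExistsError instead of appending to an earlier run.
    with open(out_path, "x", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({"run_id": run_id, **row})
    return out_path
```

Opening the file in exclusive-creation mode makes an accidental re-run fail loudly, so old and new rows can never end up mixed in one file.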

  3. Reviewer 1:

    The manuscript presents a carefully executed study using non-finetuned GPT models to classify microbiome sample metadata. It is very well written, and both the analyses and the interpretations are generally sound. I found the evaluation thorough and the presentation clear.

    1. The study provides a detailed evaluation of LLM-based metadata curation, clearly advancing over keyword-based approaches. However, it is surprising that recent related studies using LLMs for metadata curation are not cited. For completeness, I suggest including references such as:

    2. The scale of the processed data is impressive. However, there appears to be a discrepancy: the Zenodo repository file metadata.out contains 2,254,619 accession IDs (presumably the input), while the GPT output files (gpt_clean_output) include only around 1,000 samples (presumably the benchmark dataset), whereas the manuscript states that 3.8 million samples were processed. It would be helpful to clarify these numbers and, if applicable, explain why fewer outputs are provided. I also recommend reorganizing the Zenodo repository so that readers can download individual files rather than the entire large archive.

    3. The data processing pipeline on GitHub is very useful. The repository currently indicates a CC0 (public domain) license. Since CC0 is typically intended for datasets rather than source code, please clarify whether this was intentional or if a software-specific license (e.g., MIT, Apache 2.0) would be more appropriate.

    4. A different typeface appears in some paragraphs (e.g., pp. 19 and 21). Please check whether this was intentional.

    5. The finding that grouping 5-17 samples per request does not substantially affect accuracy is interesting. Given that GPT models often fail with counting or item listing, the observed quality decline with larger chunk sizes seems reasonable and aligns with expectations.

    6. On p. 26, the observed variability in field usage may be linked to the BioSample package system used for submission (see: https://www.ncbi.nlm.nih.gov/biosample/docs/packages/). Some fields, such as env_biome and env_feature, were once mandatory for environmental samples but are currently optional, I suppose. Such historical changes may partly explain biases in field usage.

    7. The manuscript appropriately highlights the presence of ambiguous or unresolvable sample descriptions. We reached a similar conclusion in our own work with local LLMs: in many cases, even expert curators cannot determine a "correct" label, and the right answer may depend on context or application.

    8. The observation that JSON output significantly improves sub-biome classification accuracy is intriguing and consistent with our internal experience with local LLMs. Since output format may also affect processing speed, it would be useful to report whether response times differed between JSON and inline formats.

    9. One major limitation of the study is the dependence on proprietary GPT models accessible only via OpenAI's API. This constrains reproducibility and long-term availability. Indeed, the recent release of GPT-5 already renders some of the reported results outdated. While the present study remains highly valuable, it would be worthwhile to also evaluate local or open-source LLMs to ensure future reproducibility.
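The reviewer's remarks on chunk size and JSON output suggest a simple safeguard: group samples into fixed-size chunks and check that each JSON reply contains exactly the sample IDs that were sent. The sketch below is model-agnostic and makes no API call; the reply schema (a JSON object mapping sample ID to label) and the helper names `chunked` and `parse_reply` are assumptions, not the authors' actual format.

```python
import json


def chunked(sample_ids, size):
    """Yield consecutive groups of at most `size` sample IDs."""
    for start in range(0, len(sample_ids), size):
        yield sample_ids[start:start + size]


def parse_reply(reply_text, expected_ids):
    """Parse a JSON object {sample_id: label} and verify no ID was dropped or added."""
    labels = json.loads(reply_text)
    if not isinstance(labels, dict):
        raise ValueError("expected a JSON object mapping sample ID to label")
    expected = set(expected_ids)
    returned = set(labels)
    if expected != returned:
        # Larger chunks make dropped or invented items more likely, so fail loudly.
        raise ValueError(
            f"missing IDs: {sorted(expected - returned)}, "
            f"unexpected IDs: {sorted(returned - expected)}"
        )
    return labels
```

A chunk that fails this check can be resubmitted on its own or split into smaller groups, which keeps a single malformed reply from silently shifting labels across samples.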