Enhanced semantic classification of microbiome sample origins using Large Language Models (LLMs)
Abstract
Over the past decade, central sequence repositories have expanded significantly in size. This vast accumulation of data holds value and enables further studies, provided that the data entries are well annotated. However, the submitter-provided metadata of sequencing records can be of heterogeneous quality, presenting significant challenges for re-use. Here, we test to what extent large language models (LLMs) can be used to cost-effectively automate the re-annotation of sequencing records against a simplified classification scheme of broad ecological environments with relevance to microbiome studies, without retraining.
We focused on sequencing samples taken directly from the environment, for which accurate metadata is particularly important. We employed OpenAI Generative Pretrained Transformer (GPT) models and assessed scalability, time- and cost-effectiveness, and performance against a diverse, hand-curated ground-truth benchmark of 1000 examples spanning a wide range of complexity in metadata interpretation. We observed that annotation performance markedly exceeds that of a baseline, manually curated, non-machine-learning keyword-based approach. Changing models (or model parameters) has only minor effects on performance, but prompts need to be carefully designed to match the task.
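The prompt-based re-annotation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the category names, prompt wording, and helper functions (`build_prompt`, `parse_label`) are hypothetical stand-ins for the paper's actual classification scheme and prompts, and the call to the OpenAI API is omitted.

```python
# Hypothetical simplified environment scheme; the paper's actual
# classification categories may differ.
ENVIRONMENTS = ["soil", "marine", "freshwater", "host-associated", "other"]

def build_prompt(metadata: dict) -> str:
    """Render submitter-provided metadata fields into a classification
    prompt that could be sent to a GPT model (illustrative wording)."""
    fields = "\n".join(f"{k}: {v}" for k, v in metadata.items() if v)
    return (
        "Classify the sampling environment of this sequencing record "
        f"into exactly one of: {', '.join(ENVIRONMENTS)}.\n"
        f"Metadata:\n{fields}\n"
        "Answer with the category name only."
    )

def parse_label(response_text: str) -> str:
    """Map a model's free-text reply onto the controlled vocabulary,
    falling back to 'other' for unparseable answers."""
    answer = response_text.strip().lower()
    return answer if answer in ENVIRONMENTS else "other"
```

Constraining the model to answer with a category name only, and normalizing its reply against the controlled vocabulary, is one way such a pipeline can produce standardized labels without any retraining.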
We applied the optimized pipeline to more than 3.8 million sequencing records from the environment, providing coarse-grained yet standardized sampling site annotations covering the globe. Our work demonstrates the effective use of LLMs to simplify and standardize annotation from complex biological metadata.