Sidero-Mining: Systematic Extraction of Siderophore Biosynthetic Information Using Large Language Models


Abstract

Siderophores are essential secondary metabolites widely distributed across microorganisms, displaying remarkable diversity. Despite extensive research, public databases contain limited information on siderophore biosynthetic gene clusters (BGCs) and, in particular, lack cross-species distribution and biosynthetic substrate annotations. Systematically collecting and organizing siderophore BGC synthesis data on a large scale would significantly enhance the use of domain knowledge and support data-driven research.

Large language models (LLMs) now offer a practical and scalable approach for mining and curating biological data, especially for converting literature insights into structured datasets. In this work, we developed the Sidero-Mining pipeline, which uses LLMs to efficiently extract siderophore BGC synthesis information. By employing LLMs to screen over 10,000 publications, we identified 1,843 high-quality articles, which were then mined using the Sidero-Mining framework, followed by manual validation and data integration.

This effort culminated in the creation of the most comprehensive siderophore BGC dataset to date, containing 728 BGCs and 325 NRPS A domain substrate entries across various species. Our results highlight LLMs’ potential to accelerate secondary metabolite dataset construction, and our methodological framework can be adapted for systematically exploring other secondary metabolites.
