Interpreting biochemical text with language models: a machine learning framework for reaction extraction and cheminformatic validation
Abstract
Recent advances in large language models (LLMs) offer new opportunities for automating the manual curation of biochemical reaction databases from scientific literature. In this study, we present an integrated pipeline that enhances LLM-based extraction of enzymatic reactions with machine learning and cheminformatics-informed validation. Using BRENDA-linked PubMed articles, we evaluate GPT-4’s ability to extract reactions and infer missing chemical entities in textual descriptions of enzymatic reactions. Extracted reactions are converted to SMILES and InChI notations, from which molecular fingerprint similarity scores and atom-mapping metrics are computed. These cheminformatics metrics are then used to train machine learning classifiers that validate GPT extractions. We employ a Positive-Unlabeled learning approach with synthetic invalid reactions to train various classifiers and assess model performance. The best classifier is then benchmarked on GPT extractions. Our findings show that GPT can accurately infer incomplete reactions and that cheminformatics tools can serve as effective predictors of reaction validity. This work demonstrates a scalable framework for automated and reliable curation of enzymatic reaction databases, highlighting the potential of combining LLMs with cheminformatics and machine learning for reliable scientific knowledge extraction.
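The fingerprint similarity scores described above are typically Tanimoto (Jaccard) coefficients computed over molecular fingerprint bit vectors. As a minimal illustrative sketch (the bit sets below are toy placeholders, not real fingerprints, and the function names are our own):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two fingerprints,
    each represented as the set of its 'on' bit positions."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Toy bit sets standing in for substrate/product fingerprints (hypothetical)
substrate_fp = {1, 4, 7, 9, 12}
product_fp = {1, 4, 7, 13, 15}

score = tanimoto(substrate_fp, product_fp)  # 3 shared bits / 7 total = ~0.429
```

In practice such fingerprints would be generated with a cheminformatics toolkit (e.g., Morgan fingerprints in RDKit) from the SMILES strings of substrates and products, and the resulting scores fed to the classifier as features.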
Author Summary
Curating databases of biochemical reactions is a time-consuming and manual task, yet it plays a vital role in advancing research in biology and chemistry. Many scientific articles describe important enzymatic reactions, but often do so in incomplete ways—such as mentioning only the starting molecule or the enzyme, and leaving out the rest. In this work, we explore how recent advancements in artificial intelligence, specifically large language models like GPT, can help extract such information automatically from scientific literature. We show that these models can not only find reactions in text, but also infer missing parts of reactions based on the surrounding context. To make sure these inferred reactions are chemically plausible, we use computational chemistry tools that analyze the structure of the molecules involved. We then train a machine learning model to help us automatically detect which reactions are likely to be valid. This combination of tools offers a new way to speed up and improve how biochemical knowledge is extracted from the growing body of scientific literature. Our study suggests that this kind of automation could help scientists keep biological databases up to date and reduce the burden of manual data entry.