Interpreting biochemical text with language models: a machine learning framework for reaction extraction and cheminformatic validation
Abstract
Recent advances in large language models (LLMs) offer new opportunities for automating the manual curation of biochemical reaction databases from scientific literature. In this study, we present an integrated pipeline that enhances LLM-based extraction of enzymatic reactions with machine learning and cheminformatics-informed validation. Using BRENDA-linked PubMed articles, we evaluate GPT-4’s ability to extract reactions and infer missing chemical entities in textual descriptions of enzymatic reactions. Extracted reactions are converted to SMILES and InChI notations, from which molecular fingerprint similarity scores and atom-mapping metrics are computed. These cheminformatics metrics are then used to train machine learning classifiers that validate GPT extractions. We employ a Positive-Unlabeled learning approach with synthetic invalid reactions to train various classifiers and assess model performance. The best classifier is then benchmarked on GPT extractions. Our findings show that GPT can accurately infer incomplete reactions and that cheminformatics tools can serve as effective predictors of reaction validity. This work demonstrates a scalable framework for automated and reliable curation of enzymatic reaction databases, highlighting the potential of combining LLMs with cheminformatics and machine learning for reliable scientific knowledge extraction.
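The fingerprint similarity scores described above are typically Tanimoto (Jaccard) coefficients computed over molecular fingerprint bit vectors. As a minimal illustrative sketch (the bit sets below are toy placeholders, not real fingerprints, and the function names are our own):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two fingerprints,
    each represented as the set of its 'on' bit positions."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Toy bit sets standing in for substrate/product fingerprints (hypothetical)
substrate_fp = {1, 4, 7, 9, 12}
product_fp = {1, 4, 7, 13, 15}

score = tanimoto(substrate_fp, product_fp)  # 3 shared bits / 7 total = ~0.429
```

In practice such fingerprints would be generated with a cheminformatics toolkit (e.g., Morgan fingerprints in RDKit) from the SMILES strings of substrates and products, and the resulting scores fed to the classifier as features.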
Author Summary
Curating databases of biochemical reactions is a time-consuming and manual task, yet it plays a vital role in advancing research in biology and chemistry. Many scientific articles describe important enzymatic reactions, but often do so in incomplete ways—such as mentioning only the starting molecule or the enzyme, and leaving out the rest. In this work, we explore how recent advancements in artificial intelligence, specifically large language models like GPT, can help extract such information automatically from scientific literature. We show that these models can not only find reactions in text, but also infer missing parts of reactions based on the surrounding context. To make sure these inferred reactions are chemically plausible, we use computational chemistry tools that analyze the structure of the molecules involved. We then train a machine learning model to help us automatically detect which reactions are likely to be valid. This combination of tools offers a new way to speed up and improve how biochemical knowledge is extracted from the growing body of scientific literature. Our study suggests that this kind of automation could help scientists keep biological databases up to date and reduce the burden of manual data entry.