Assessing the ability of ChatGPT to extract natural product bioactivity and biosynthesis data from publications

Abstract

Natural products are an excellent source of therapeutics and are often discovered through the process of genome mining, where genomes are analyzed by bioinformatic tools to determine if they have the biosynthetic capacity to produce novel or active compounds. Recently, several tools have been reported for predicting natural product bioactivities from the sequences of the biosynthetic gene clusters that produce them. These tools have the potential to accelerate the rate of natural product drug discovery by enabling the prioritization of novel biosynthetic gene clusters that are more likely to produce compounds with therapeutically relevant bioactivities. However, these tools are severely limited by a lack of training data, specifically data pairing biosynthetic gene clusters with activity labels for their products. There are many reports of natural product biosynthetic gene clusters and bioactivities in the literature that are not included in existing databases. Manual curation of these data is time-consuming and inefficient. Recent developments in large language models and the chatbot interfaces built on top of them have enabled automatic data extraction from text, including scientific publications. We investigated how accurately ChatGPT extracts the data needed to train models that predict natural product activity from biosynthetic gene clusters. We found that ChatGPT did well at determining whether a paper described the discovery of a natural product and at extracting information about the product’s bioactivity. ChatGPT did not perform as well at extracting accession numbers for the biosynthetic gene cluster or the producer’s genome, although using an altered prompt improved accuracy.

Article activity feed

  1. I really appreciated how this manuscript takes advantage of one of the tasks LLMs are really good at, text summarization, and applies it to a scientific problem. Overall, the model seems quite accurate for most questions (>90% in many cases). I imagine that additional prompt engineering and repeated prompting might continue to improve performance, as demonstrated by your updates to question 5. Thanks for sharing these results!

  2. This result demonstrates that changes made in prompting strategies designed to avoid common errors can greatly improve accuracy of responses.

    OpenAI has a really helpful overview of additional strategies you could use to improve the prompts, such as asking the model if it missed any information in previous responses. This might allow you to increase the accuracy of responses where the model told you that no information was found, or prompt the model to correct itself when it made a mistake; a minimal sketch of this kind of follow-up query is shown below.
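    As a rough sketch of that kind of follow-up (assuming the OpenAI Python client, v1.x, and the `gpt-4` model; the questions and file name below are placeholders, not the prompts used in the manuscript), one could append the model's first answer to the conversation and then ask it to re-check:

    ```python
    # Hypothetical follow-up query asking the model to re-check its own answer.
    # Assumes the OpenAI Python client (>= 1.0) and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()
    paper_text = open("paper_fulltext.txt").read()  # text of the publication being queried

    messages = [{
        "role": "user",
        "content": "Does this paper report an accession number for the biosynthetic "
                   "gene cluster?\n\n" + paper_text,
    }]
    first = client.chat.completions.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant", "content": first.choices[0].message.content})

    # Second turn: ask whether anything was missed in the previous response.
    messages.append({
        "role": "user",
        "content": "Did you miss any accession numbers or other relevant information "
                   "in your previous response?",
    })
    second = client.chat.completions.create(model="gpt-4", messages=messages)
    print(second.choices[0].message.content)
    ```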

  3. The exact text used in the prompts is provided in the methods section.

    How reproducible were the model's outputs? If you queried it repeatedly, did you observe different responses?

  4. No original software or code was created for this publication.

    Have you considered using OpenAI’s API to automate the submission of these queries? You can try out a GUI version of the API in OpenAI’s Playground. This interface has several advanced features that could improve your prompt responses. For example, you can set a System prompt to refine the behavior of ChatGPT, such as telling it to respond only in exact quotations. You can also adjust the Temperature parameter of the model, which lets you control how variable the responses are. A minimal sketch of such an API call is included below.
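    As a minimal sketch of what such an automated call could look like (assuming the OpenAI Python client, v1.x, and the `gpt-4` model; the system prompt, question, and file name are illustrative placeholders, not the prompts used in the manuscript):

    ```python
    # Hypothetical automated query via the OpenAI API with a system prompt and a
    # fixed temperature. Assumes the OpenAI Python client (>= 1.0) and OPENAI_API_KEY set.
    from openai import OpenAI

    client = OpenAI()
    paper_text = open("paper_fulltext.txt").read()  # text of the publication to query

    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # 0 = most deterministic; higher values give more variable answers
        messages=[
            # The system prompt constrains the model's behavior, e.g. quoting the source verbatim.
            {"role": "system",
             "content": "Answer only with exact quotations from the provided text."},
            {"role": "user",
             "content": "What bioactivity is reported for the natural product described "
                        "in this paper?\n\n" + paper_text},
        ],
    )
    print(response.choices[0].message.content)
    ```

    Looping a call like this over a list of publications would automate query submission end to end.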