Assessing the ability of ChatGPT to extract natural product bioactivity and biosynthesis data from publications

Abstract

Natural products are an excellent source of therapeutics and are often discovered through the process of genome mining, where genomes are analyzed by bioinformatic tools to determine if they have the biosynthetic capacity to produce novel or active compounds. Recently, several tools have been reported for predicting natural product bioactivities from the sequences of the biosynthetic gene clusters that produce them. These tools have the potential to accelerate the rate of natural product drug discovery by enabling the prioritization of novel biosynthetic gene clusters that are more likely to produce compounds with therapeutically relevant bioactivities. However, these tools are severely limited by a lack of training data, specifically data pairing biosynthetic gene clusters with activity labels for their products. There are many reports of natural product biosynthetic gene clusters and bioactivities in the literature that are not included in existing databases. Manual curation of these data is time-consuming and inefficient. Recent developments in large language models and the chatbot interfaces built on top of them have enabled automatic data extraction from text, including scientific publications. We investigated how accurately ChatGPT extracts the data needed to train models that predict natural product activity from biosynthetic gene clusters. We found that ChatGPT did well at determining whether a paper described the discovery of a natural product and at extracting information about the product’s bioactivity. ChatGPT did not perform as well at extracting accession numbers for the biosynthetic gene cluster or the producer’s genome, although using an altered prompt improved accuracy.

Article activity feed

  1. I really appreciated how this manuscript takes advantage of one of the tasks LLMs are really good at, text summarization, and applies it to a scientific problem. Overall, the model seems quite accurate for most questions (>90% in many cases). I imagine that additional prompt engineering and repeated prompting might continue to improve performance, as demonstrated by your updates to question 5. Thanks for sharing these results!

  2. This result demonstrates that changes made in prompting strategies designed to avoid common errors can greatly improve accuracy of responses.

    OpenAI has a really helpful overview of additional strategies you could use to improve the prompts, such as asking the model if it missed any information in previous responses. This might allow you to increase the accuracy of responses where the model told you that no information was found, or prompt the model to correct itself when it made a mistake; a minimal sketch of this kind of follow-up query is shown below.
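    As a rough sketch of that kind of follow-up (assuming the OpenAI Python client, v1.x, and the `gpt-4` model; the questions and file name below are placeholders, not the prompts used in the manuscript), one could append the model's first answer to the conversation and then ask it to re-check:

    ```python
    # Hypothetical follow-up query asking the model to re-check its own answer.
    # Assumes the OpenAI Python client (>= 1.0) and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()
    paper_text = open("paper_fulltext.txt").read()  # text of the publication being queried

    messages = [{
        "role": "user",
        "content": "Does this paper report an accession number for the biosynthetic "
                   "gene cluster?\n\n" + paper_text,
    }]
    first = client.chat.completions.create(model="gpt-4", messages=messages)
    messages.append({"role": "assistant", "content": first.choices[0].message.content})

    # Second turn: ask whether anything was missed in the previous response.
    messages.append({
        "role": "user",
        "content": "Did you miss any accession numbers or other relevant information "
                   "in your previous response?",
    })
    second = client.chat.completions.create(model="gpt-4", messages=messages)
    print(second.choices[0].message.content)
    ```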

  3. The exact text used in the prompts is provided in the methods section.

    How reproducible were the model's outputs? If you queried it repeatedly, did you observe different responses?

  4. No original software or code was created for this publication.

    Have you considered using OpenAI’s API to automate the submission of these queries? You can try out a GUI version of the API in OpenAI’s Playground. This interface has several advanced features that could improve your prompt responses. For example, you can set a System prompt to refine the behavior of ChatGPT, such as telling it to respond only in exact quotations. You can also adjust the Temperature parameter of the model, which lets you control how variable the responses are. A minimal sketch of such an API call is included below.
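    As a minimal sketch of what such an automated call could look like (assuming the OpenAI Python client, v1.x, and the `gpt-4` model; the system prompt, question, and file name are illustrative placeholders, not the prompts used in the manuscript):

    ```python
    # Hypothetical automated query via the OpenAI API with a system prompt and a
    # fixed temperature. Assumes the OpenAI Python client (>= 1.0) and OPENAI_API_KEY set.
    from openai import OpenAI

    client = OpenAI()
    paper_text = open("paper_fulltext.txt").read()  # text of the publication to query

    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # 0 = most deterministic; higher values give more variable answers
        messages=[
            # The system prompt constrains the model's behavior, e.g. quoting the source verbatim.
            {"role": "system",
             "content": "Answer only with exact quotations from the provided text."},
            {"role": "user",
             "content": "What bioactivity is reported for the natural product described "
                        "in this paper?\n\n" + paper_text},
        ],
    )
    print(response.choices[0].message.content)
    ```

    Looping a call like this over a list of publications would automate query submission end to end.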