An evaluation of ChatGPT and Bard (Gemini) in the context of biological knowledge retrieval
Abstract
ChatGPT and Bard (now called Gemini), two conversational AI models developed by OpenAI and Google AI, respectively, have garnered considerable attention for their ability to engage in natural language conversations and perform various language-related tasks. While the versatility of these chatbots in generating text and simulating human-like conversation is undeniable, we wanted to evaluate their effectiveness in retrieving biological knowledge for curation and research purposes. To do so, we asked each chatbot a series of questions and scored the answers based on their quality. Out of a maximal score of 24, ChatGPT scored 5 and Bard scored 13. The issues we encountered included missing information, incorrect answers, and responses that combine accurate and inaccurate details. Notably, both tools tend to fabricate references to scientific papers, undermining their usability. In light of these findings, we recommend that biologists continue to rely on traditional sources while periodically reassessing the reliability of ChatGPT and Bard. As ChatGPT aptly suggested, for specific and up-to-date scientific information, established scientific journals, databases, and subject-matter experts remain the preferred avenues for trustworthy data.
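As a rough aid to reading these scores, the short Python sketch below shows the arithmetic they imply. It assumes (this is not stated in the abstract itself) that each of the eight case scenarios mentioned in the reviews was scored on a 0-3 scale, so that the maximum is 8 x 3 = 24; the scale and the code are illustrative, not the authors' actual method.

    # Illustrative sketch only (not the authors' code): the scoring
    # arithmetic implied by the abstract, assuming each of the 8 case
    # scenarios was scored on a 0-3 scale (an assumption, not stated).
    NUM_CASES = 8
    MAX_PER_CASE = 3
    MAX_SCORE = NUM_CASES * MAX_PER_CASE  # 24, matching the abstract

    def percent(score: int, max_score: int = MAX_SCORE) -> float:
        """Express a raw total as a percentage of the maximum."""
        return 100 * score / max_score

    print(percent(5))   # ChatGPT's total of 5  -> ~20.8%
    print(percent(13))  # Bard's total of 13    -> ~54.2%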
Article activity feed
-
I am pleased to tell you that your article has now been accepted for publication in Access Microbiology.
-
Comments to Author
All my comments have been answered satisfactorily.
Please rate the manuscript for methodological rigour
Good
Please rate the quality of the presentation and structure of the manuscript
Very good
To what extent are the conclusions supported by the data?
Strongly support
Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?
No
Is there a potential financial or other conflict of interest between yourself and the author(s)?
No
If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?
Yes
-
The reviewers have highlighted major concerns with the work presented. Please ensure that you address their comments.
-
Comments to Author
The respected authors presented their manuscript titled "An Evaluation of ChatGPT and Bard in the Context of Biological Knowledge Retrieval". Specifically, they introduced 8 case scenarios in which they turn to Google Bard and ChatGPT to answer biological questions related to each case. They used prompt engineering to ask the questions and retrieve the answers. While such evaluations are valuable and necessary for the scientific community to learn about such AI tools, the paper has many flaws:
[1] Missing a definition of the ground-truth notion, where the questions and answers are commonly known.
[2] Missing summary statistics presented in tables and plotted in figures.
[3] Missing a scoring mechanism for questions that are correct, partially correct, and incorrect.
[4] Missing details of how the prompt engineering was created.
[5] Missing a motivating and exciting introduction backed up by relevant references.
Please rate the manuscript for methodological rigour
Poor
Please rate the quality of the presentation and structure of the manuscript
Poor
To what extent are the conclusions supported by the data?
Partially support
Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?
No
Is there a potential financial or other conflict of interest between yourself and the author(s)?
No
If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?
No: I don't have concerns about animal/human work
-
Comments to Author
Caspi and Karp have performed a test of the ability of the AI models ChatGPT and Bard (now called Gemini). They have posed a range of questions covering biological subjects to the two models, and in this manuscript they report and score the answers the two models give.
Methodological rigour, reproducibility and availability of underlying data
When reading the manuscript I wondered about the following things:
1. Had the authors any expectations about the answers they would receive?
2. How were the topics for the questions selected?
3. Does this manuscript report all the questions asked, or is this a selection of the worst possible answers?
4. To what extent are the results reproducible? Would you get the same answers if you repeated the experiment today, or would you get new (better or worse) answers (and newly invented references)?
5. Why were the interactions with ChatGPT spread over 8 months? Would the answers have been different if they had all been asked in one session (cf. question 4)?
6. Did the interactions with ChatGPT and Bard for each case follow a fixed pattern, or were they dependent on what happened in each case? How were the follow-up questions chosen? (See also the comment below for case 7.)
Presentation of results and how the style and organization of the paper communicate and represent key findings
The results are presented as the conversations unfolded, which gives an anecdotal appearance. It would be nice if the results were summarized to give a better overview of the performance of the two AI models.
Literature analysis or discussion
It would improve the paper if the authors related their findings to other articles about the usage of ChatGPT and Bard (see for example: https://www.nature.com/articles/d41586-023-04071-6 and https://www.calcalistech.com/ctechnews/article/b1ensqkih).
Any other relevant comments
In case 7 (Line 352) Bard provides a very short answer about the function of RbcX. The authors score the result based on their prior knowledge of the function of this protein. Had they not known the answer and instead asked for a reference, would it have changed the scoring if such a reference had been wrong or non-existent?
A couple of other minor points:
Line 193: There is a duplication of the text: "[10] (2009),".
Line 337: I am not sure what the authors mean by "a different team". If they mean two completely different groups, this is not correct. To clarify, the article cited by Bard does share an author with ref. 17, as Staffan Normark is an author of both papers. But because Bard invented the first names of the authors, this was not evident. The true names of the authors of the Bard-cited article are Mårten Hammar, Zhao Bian, and Staffan Normark. This fact is missing from the manuscript.
Line 349: To be precise, it has been known since at least 2004. The authors are not citing the earliest articles about RbcX function; see, e.g., PMID: 15564522, Onizuka et al. 2004: "The rbcX gene product promotes the production and assembly of ribulose-1,5-bisphosphate carboxylase/oxygenase of Synechococcus sp. PCC7002 in Escherichia coli."
Line 434: The authors write "Based on my experience" as if there is only one author on the manuscript.
Please rate the manuscript for methodological rigour
Poor
Please rate the quality of the presentation and structure of the manuscript
Good
To what extent are the conclusions supported by the data?
Strongly support
Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?
No
Is there a potential financial or other conflict of interest between yourself and the author(s)?
No
If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?
Yes
