An evaluation of ChatGPT and Bard (Gemini) in the context of biological knowledge retrieval
Abstract
ChatGPT and Bard (now called Gemini), two conversational AI models developed by OpenAI and Google AI, respectively, have garnered considerable attention for their ability to engage in natural language conversations and perform various language-related tasks. While the versatility of these chatbots in generating text and simulating human-like conversation is undeniable, we wanted to evaluate their effectiveness in retrieving biological knowledge for curation and research purposes. To do so, we asked each chatbot a series of questions and scored the answers based on their quality. Out of a maximal score of 24, ChatGPT scored 5 and Bard scored 13. The issues we encountered included missing information, incorrect answers, and responses that combine accurate and inaccurate details. Notably, both tools tend to fabricate references to scientific papers, undermining their usability. In light of these findings, we recommend that biologists continue to rely on traditional sources while periodically reassessing the reliability of ChatGPT and Bard. As ChatGPT aptly suggested, for specific and up-to-date scientific information, established scientific journals, databases, and subject-matter experts remain the preferred avenues for trustworthy data.
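As a rough aid to reading these scores, the short Python sketch below shows the arithmetic they imply. It assumes (this is not stated in the abstract itself) that each of the eight case scenarios mentioned in the reviews was scored on a 0-3 scale, so that the maximum is 8 x 3 = 24; the scale and the code are illustrative, not the authors' actual method.

    # Illustrative sketch only (not the authors' code): the scoring
    # arithmetic implied by the abstract, assuming each of the 8 case
    # scenarios was scored on a 0-3 scale (an assumption, not stated).
    NUM_CASES = 8
    MAX_PER_CASE = 3
    MAX_SCORE = NUM_CASES * MAX_PER_CASE  # 24, matching the abstract

    def percent(score: int, max_score: int = MAX_SCORE) -> float:
        """Express a raw total as a percentage of the maximum."""
        return 100 * score / max_score

    print(percent(5))   # ChatGPT's total of 5  -> ~20.8%
    print(percent(13))  # Bard's total of 13    -> ~54.2%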
Article activity feed
-
I am pleased to tell you that your article has now been accepted for publication in Access Microbiology.
-
Comments to Author
All my comments have been answered satisfactorily.
Please rate the manuscript for methodological rigour
Good
Please rate the quality of the presentation and structure of the manuscript
Very good
To what extent are the conclusions supported by the data?
Strongly support
Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?
No
Is there a potential financial or other conflict of interest between yourself and the author(s)?
No
If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?
Yes
-
The reviewers have highlighted major concerns with the work presented. Please ensure that you address their comments.
-
Comments to Author
The respected authors presented their manuscript titled "An Evaluation of ChatGPT and Bard in the Context of Biological Knowledge Retrieval". Specifically, they introduced 8 case scenarios in which they turn to Google Bard and ChatGPT to answer biological questions related to each case. They used prompt engineering to ask the questions and retrieve the answers. While such evaluations are valuable and necessary for the scientific community to learn about such AI tools, the paper has many flaws:
[1] Missing a definition of the ground-truth notion, where the questions and answers are commonly known.
[2] Missing summary statistics presented in tables and plotted in figures.
[3] Missing a scoring mechanism for questions that are correct, partially correct, and incorrect.
[4] Missing details of how the prompt engineering was created.
[5] Missing a motivating and exciting introduction backed up by relevant references.
Please rate the manuscript for methodological rigour
Poor
Please rate the quality of the presentation and structure of the manuscript
Poor
To what extent are the conclusions supported by the data?
Partially support
Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?
No
Is there a potential financial or other conflict of interest between yourself and the author(s)?
No
If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?
No: I don't have concerns about animal/human work
-
Comments to Author
Caspi and Karp have performed a test of the ability of the AI models ChatGPT and Bard (now called Gemini). They have posed a range of questions covering biological subjects to the two models, and in this manuscript they report and score the answers the two models give.
Methodological rigour, reproducibility and availability of underlying data
When reading the manuscript I wondered about the following things:
1. Had the authors any expectations about the answers they would receive?
2. How were the topics for the questions selected?
3. Does this manuscript report all the questions asked, or is this a selection of the worst possible answers?
4. To what extent are the results reproducible? Would you get the same answers if you repeated the experiment today, or would you get new (better or worse) answers (and newly invented references)?
5. Why were the interactions with ChatGPT spread over 8 months? Would the answers have been different if they had all been asked in one session (cf. question 4)?
6. Did the interactions with ChatGPT and Bard for each case follow a fixed pattern, or were they dependent on what happened in each case? How were the follow-up questions chosen? (See also the comment below for case 7.)
Presentation of results and how the style and organization of the paper communicate and represent key findings
The results are presented as the conversations unfolded, which gives an anecdotal appearance. It would be nice if the results were summarized to give a better overview of the performance of the two AI models.
Literature analysis or discussion
It would improve the paper if the authors related their findings to other articles about the usage of ChatGPT and Bard (see for example: https://www.nature.com/articles/d41586-023-04071-6 and https://www.calcalistech.com/ctechnews/article/b1ensqkih).
Any other relevant comments
In case 7 (Line 352) Bard provides a very short answer about the function of RbcX. The authors score the result based on their prior knowledge of the function of this protein. Had they not known the answer and instead asked for a reference, would it have changed the scoring if such a reference had been wrong or non-existent?
A couple of other minor points:
Line 193: There is a duplication of the text: "[10] (2009),".
Line 337: I am not sure what the authors mean by "a different team". If they mean two completely different groups, this is not correct. To clarify, the article cited by Bard does share an author with ref. 17, as Staffan Normark is an author of both papers. But because Bard invented the first names of the authors, this was not evident. The true names of the authors of the Bard-cited article are Mårten Hammar, Zhao Bian, and Staffan Normark. This fact is missing from the manuscript.
Line 349: To be precise, it has been known since at least 2004. The authors are not citing the earliest articles about RbcX function; see, e.g., PMID: 15564522, Onizuka et al. 2004: "The rbcX gene product promotes the production and assembly of ribulose-1,5-bisphosphate carboxylase/oxygenase of Synechococcus sp. PCC7002 in Escherichia coli."
Line 434: The authors write "Based on my experience" as if there is only one author on the manuscript.
Please rate the manuscript for methodological rigour
Poor
Please rate the quality of the presentation and structure of the manuscript
Good
To what extent are the conclusions supported by the data?
Strongly support
Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?
No
Is there a potential financial or other conflict of interest between yourself and the author(s)?
No
If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?
Yes
