BioLLMBench: A Comprehensive Benchmarking of Large Language Models in Bioinformatics

Abstract

Large Language Models (LLMs) have shown great promise in their knowledge integration and problem-solving capabilities, but their ability to assist in bioinformatics research has not been systematically evaluated. To bridge this gap, we present BioLLMBench, a novel benchmarking framework coupled with a scoring metric scheme for comprehensively evaluating LLMs on bioinformatics tasks. Using BioLLMBench, we conducted a thorough evaluation, comprising 2,160 experimental runs, of the three most widely used models, GPT-4, Bard and LLaMA, across 36 distinct bioinformatics tasks. The tasks come from six key areas of bioinformatics that directly reflect the daily challenges practitioners face: domain expertise, mathematical problem-solving, coding proficiency, data visualization, research paper summarization, and machine learning model development. The tasks also span varying levels of complexity, from fundamental concepts to expert-level challenges. Each key area was evaluated using seven specifically designed task metrics, which were then used to conduct an overall evaluation of each LLM's response. To better understand model behavior under varying conditions, we also implemented a Contextual Response Variability Analysis. Our results reveal a diverse spectrum of model performance, with GPT-4 leading in all tasks except mathematical problem-solving. GPT-4 achieved an overall proficiency score of 91.3% in domain knowledge tasks, while Bard excelled in mathematical problem-solving with a 97.5% success rate. While GPT-4 led in machine learning model development tasks with an average accuracy of 65.32%, both Bard and LLaMA were unable to generate executable end-to-end code. All models struggled with research paper summarization, with none exceeding a 40% score in our evaluation using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric, highlighting a significant area for future improvement. We observed an increase in the variance of model performance when using a new chat window compared to reusing the same chat, although the average scores between the two contextual environments remained similar. Lastly, we discuss various limitations of these models and acknowledge the risks associated with their potential misuse.
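For readers unfamiliar with how ROUGE scores are derived, the minimal sketch below shows one common way to compute them with the open-source rouge_score package. The reference and candidate texts are made-up placeholders, and this is only an illustration of the metric, not the paper's actual evaluation pipeline.

```python
# Minimal sketch of ROUGE-based summary scoring, assuming the rouge_score
# package (pip install rouge-score). Texts here are placeholders, not data
# from the BioLLMBench study.
from rouge_score import rouge_scorer

reference = "BLAST finds regions of local similarity between nucleotide or protein sequences."
candidate = "The paper introduces BLAST, a tool for local sequence alignment."

# ROUGE-1/2 compare unigram/bigram overlap; ROUGE-L uses the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # signature: score(target, prediction)

for name, score in scores.items():
    # Each entry carries precision, recall and F1 (fmeasure).
    print(f"{name}: recall={score.recall:.3f}, f1={score.fmeasure:.3f}")
```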

Article activity feed

  1. include additional bioinformatics tasks, in order to obtain a more comprehensive understanding of the strengths and weaknesses of LLMs in this field.

    It would be great to analyze how the different models handle conversions between common bioinformatics file formats, e.g., whether a one-sentence request to convert between formats is enough, or whether you have to spell out exactly what the output file should look like. This is a common and fairly tedious bioinformatics task that one inevitably has to deal with (see the sketch after this list for a typical example).

  2. When GPT-4 received feedback that its response was incorrect, it exhibited the tendency to modify its subsequent response, even for initially correct answers. This behavior could potentially be problematic for users without comprehensive domain knowledge.

    Ah cool, this is what I was wondering about! How often, in this paper, was each model given feedback that its response was wrong?

  3. For this challenge, we provided the top 10 most cited bioinformatics papers to the 3 LLMs, and asked them to generate a summary.

    Since the top 10 most cited bioinformatics papers probably have quite a few summaries in subsequent papers or news/perspective articles that these LLMs could have been trained on, could you also include newer bioinformatics papers to see how well each model summarizes them?

  4. with 10 runs of asking the model the same question in the same search window, and 10 runs using a new search window

    Only an anecdotal note, but since LLM chatbots like ChatGPT seem to improve over the course of a conversation: when a model couldn't get a particular question correct even after 10 attempts at the same question, did you evaluate whether pointing out the problem it was having and clarifying the question helped? I know this might muddy the analysis and benchmark, but it could be an interesting analysis to provide: which models improve the most with a little extra prompting, versus those that still never arrive at the correct answer.
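As a concrete illustration of the kind of format-conversion task raised in comment 1, the sketch below converts FASTQ reads to FASTA with Biopython's SeqIO. The file names are hypothetical placeholders, and this is simply an example of the task class a benchmark prompt might target, not part of the BioLLMBench evaluation itself.

```python
# Sketch of a routine bioinformatics format conversion (FASTQ -> FASTA)
# using Biopython. File names are hypothetical placeholders.
from Bio import SeqIO

# SeqIO.convert reads records in one format and writes them in another,
# returning the number of records converted.
count = SeqIO.convert("reads.fastq", "fastq", "reads.fasta", "fasta")
print(f"Converted {count} records")
```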