A benchmark for large language models in bioinformatics
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
The rapid advancements in artificial intelligence, particularly in Large Language Models (LLMs) such as GPT-4, Gemini, and LLaMA, have opened new avenues for computational biology and bioinformatics. We report the development of BioLLMBench, a novel framework designed to evaluate LLMs in bioinformatics tasks. This study assessed GPT-4, Gemini, and LLaMA through 2,160 experimental runs, focusing on 24 distinct tasks across six key areas: domain expertise, mathematical problem-solving, coding proficiency, data visualization, research paper summarization, and machine learning model development. Tasks ranged from fundamental to expert-level challenges, and each area was evaluated using seven specific metrics. A Contextual Response Variability Analysis was implemented to understand how model responses varied under different conditions. Results showed diverse performance: GPT-4 led in most tasks, achieving 91.3% proficiency in domain knowledge, while Gemini excelled in mathematical problem-solving with a 97.5% proficiency score. GPT-4 also outperformed the other models in machine learning model development, though Gemini and LLaMA struggled to generate executable code. All models faced challenges in research paper summarization, scoring below 40% on the ROUGE metric. Model performance variance increased when using a new chat window, though average scores remained similar. The study also discusses the limitations and potential misuse risks of these models in bioinformatics.
Article activity feed
-
Poviding
Typo: should read "Providing".
-
include additional bioinformatics tasks, in order to obtain a more comprehensive understanding of the strengths and weaknesses of LLMs in this field.
It would be great to do an analysis of how the different models handle conversions between common bioinformatics file formats: whether they can work from a one-sentence request to convert between formats, or whether you have to spell out exactly what the output file should look like. Format conversion is a task one inevitably has to deal with and is one of the more tedious parts of bioinformatics work.
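For concreteness, here is a minimal sketch of the kind of conversion task I have in mind, assuming Biopython is installed; the file names are placeholders:

```python
# FASTQ -> FASTA conversion using Biopython's SeqIO.
# "reads.fastq" and "reads.fasta" are placeholder file names.
from Bio import SeqIO

count = SeqIO.convert("reads.fastq", "fastq", "reads.fasta", "fasta")
print(f"Converted {count} records")
```

A useful benchmark axis would be whether a model produces something this direct from a one-sentence prompt, or only after the input and output layouts are described in detail.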
-
When GPT-4 received feedback that its response was incorrect, it exhibited the tendency to modify its subsequent response, even for initially correct answers. This behavior could potentially be problematic for users without comprehensive domain knowledge.
Ah cool, this is what I was wondering about! How often in this study was each model given feedback that its response was wrong?
-
For this challenge, we provided the top 10 most cited bioinformatics papers to the 3 LLMs, and asked them to generate a summary.
Since the top 10 most cited bioinformatics papers have probably been summarized many times in subsequent papers and news/perspective articles that these LLMs could have been trained on, could you also include newer bioinformatics papers to see how well each model summarizes material it is unlikely to have seen before?
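For reference, summary quality in the paper is reported with ROUGE; below is a minimal sketch of how such a score can be computed with the rouge-score package (the paper does not specify which implementation it used, and the reference and candidate texts are placeholders):

```python
# Score a model-generated summary against a reference using ROUGE-1 and ROUGE-L.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "Reference summary of the paper goes here."   # placeholder
candidate = "LLM-generated summary goes here."             # placeholder

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```

Running the same scoring on summaries of papers published after each model's training cutoff would help separate genuine summarization ability from recall of existing summaries.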
-
with 10 runs of asking the model the same question in the same search window, and 10 runs using a new search window
Only an anecdotal note, but since LLM chatbots like ChatGPT seem to improve over the course of a conversation: if a model couldn't get a particular question correct even after 10 attempts in a row, was it evaluated whether pointing out the problem it was having and clarifying the question helped? I know this might muddy the analysis and the benchmark, but it could be an interesting analysis to provide: which models improve the most when prompted a little, versus those that still never arrive at the correct answer.
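To make the two run conditions concrete, here is a rough sketch of how the protocol quoted above could be scripted, using the OpenAI Python client as a stand-in (the paper's actual harness is not shown; the model name and question are placeholders):

```python
# Re-ask the same question 10 times within one conversation (history accumulates)
# versus 10 times in fresh conversations (empty history each run).
from openai import OpenAI

client = OpenAI()
QUESTION = "What is the reverse complement of ATGC?"  # placeholder question

def ask(history):
    """Send the running conversation and return the assistant's reply."""
    resp = client.chat.completions.create(model="gpt-4", messages=history)
    return resp.choices[0].message.content

# Condition 1: same chat window.
history, same_window = [], []
for _ in range(10):
    history.append({"role": "user", "content": QUESTION})
    answer = ask(history)
    history.append({"role": "assistant", "content": answer})
    same_window.append(answer)

# Condition 2: new chat window.
new_window = [ask([{"role": "user", "content": QUESTION}]) for _ in range(10)]
```

The follow-up analysis suggested above could reuse the same loop, appending a clarifying user message after an incorrect answer and checking whether the next reply improves.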