What Large Language Models Know About Plant Molecular Biology

Manuel Fernandez Burda
Lucia Ferrero
Nicolás Gaggion
Camille Fonouni-Farde
The MoBiPlant Consortium
Martín Crespi
Federico Ariel
Enzo Ferrante


Abstract

Large language models (LLMs) are rapidly permeating scientific research, yet their capabilities in plant molecular biology remain largely uncharacterized. Here, we present MoBiPlant, the first comprehensive benchmark for evaluating LLMs in this domain, developed by a consortium of 112 plant scientists across 19 countries. MoBiPlant comprises 565 expert-curated multiple-choice questions and 1,075 synthetically generated questions, spanning core topics from gene regulation to plant-environment interactions. We benchmarked seven leading chat-based LLMs using both automated scoring and human evaluation of open-ended answers. Models performed well on multiple-choice tasks (exceeding 75% accuracy), although most of them exhibited a consistent bias towards option A. In contrast, expert reviews exposed persistent limitations, including factual misalignment, hallucinations, and low self-awareness. Critically, we found that model performance strongly correlated with the citation frequency of source literature, suggesting that LLMs do not simply encode plant biology knowledge uniformly, but are instead shaped by the visibility and frequency of information in their training corpora. This understanding is key to guiding both the development of next-generation models and the informed use of current tools in the everyday work of plant researchers. MoBiPlant is publicly available online.
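
The two multiple-choice measurements mentioned in the abstract, accuracy and a bias towards option A, can be illustrated with a minimal sketch. This is not the authors' scoring code: the function name `score_multiple_choice`, the four-option A-D format, and the toy predictions are all assumptions made for illustration. The idea is to compare how often a model picks each option with how often that option is actually correct; a large gap for option A would indicate positional bias.

```python
from collections import Counter


def score_multiple_choice(predictions, answers, options="ABCD"):
    """Return accuracy plus per-option pick and gold rates.

    Hypothetical helper: assumes one letter per question and a fixed
    option set (A-D here), neither of which is specified in the abstract.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")

    n = len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    picked = Counter(predictions)  # how often the model chose each option
    gold = Counter(answers)        # how often each option was the right one

    return {
        "accuracy": correct / n,
        "picked_rate": {o: picked[o] / n for o in options},
        "gold_rate": {o: gold[o] / n for o in options},
    }


# Toy data (made up): the model over-selects "A" relative to the answer key.
predictions = ["A", "A", "B", "A", "C", "A", "D", "A"]
answers = ["A", "B", "B", "C", "C", "A", "D", "D"]

report = score_multiple_choice(predictions, answers)
excess_a = report["picked_rate"]["A"] - report["gold_rate"]["A"]
print(report["accuracy"], excess_a)
```

On real benchmark data, the same comparison would normally be repeated with answer options shuffled, so that a preference for a position can be separated from a preference for the content of the option.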

Version published to bioRxiv on Sep 4, 2025 (DOI: 10.1101/2025.08.31.672925)

