Benchmarking Agentic Large Language Models for Complex Protein-Set Functional Annotation
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Large language model (LLM) agents are increasingly used to synthesize heterogeneous bioinformatics evidence, but their reliability for high-volume biological annotation remains poorly characterized. We evaluated three agent configurations on a controlled protein annotation task: Claude App with Claude Opus 4.7, Claude Code CLI with Claude Opus 4.7 and Claude Scientific Skills, and Codex App with GPT-5.4 and Claude Scientific Skills. Each configuration was run three times on the same verbatim prompt, the same 73 selected orthogroup FASTA files (1,705 protein sequences), and the same local evidence: Swiss-Prot BLASTP output, Pfam/HMMER domain hits, DeepTMHMM topology predictions, and SignalP secretion predictions. We audited the nine outputs for coverage, biological correctness, missing evidence, hallucinated or over-specific annotations, and within-method consistency, then merged the best-supported evidence into a final orthogroup annotation table. All nine runs covered all 73 orthogroups, indicating that the agents could retrieve and organize the complete input set. However, normalized calcification-relevance calls were only moderately reproducible: within-method exact tier agreement ranged from 0.397 to 0.685 for Claude App (mean 0.562), 0.342 to 0.740 for Claude Code (mean 0.516), and 0.411 to 0.630 for Codex App (mean 0.539), and the per-run number of high-confidence calls varied from 0 to 12 across the nine runs. The final curated table retained 3 high-confidence, 9 moderate, 18 watchlist, and 43 low-relevance orthogroups. The most robust direct candidates were sulfatase (OG0017138) and sulfotransferase (OG0020703) families and an FG-GAP/integrin-like surface protein family (OG0018986), whereas common error modes included elevating pentapeptide-repeat orthogroups on motif evidence alone, treating weakly secreted housekeeping enzymes as matrix proteins, and taking low-complexity BLAST labels at face value. Skill-enabled agents improved file handling, evidence traceability, and reproducibility of computational checking, but they did not eliminate biological overinterpretation. These results support a best-practice workflow in which LLM agents draft annotations only after deterministic evidence tables are generated, with explicit scoring rules, provenance columns, run-to-run replication, and expert review of high-impact claims.
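The within-method exact tier agreement reported above can be reproduced with a simple pairwise comparison of normalized relevance labels. The sketch below is illustrative only and assumes each run's output has already been normalized into a table with one row per orthogroup and a `tier` column (e.g. high, moderate, watchlist, low); the file and column names are hypothetical, not the authors' actual pipeline.

```python
"""Minimal sketch of within-method exact tier agreement between repeated agent runs.

Assumes per-run CSVs with 'orthogroup' and 'tier' columns; file names are illustrative.
"""
from itertools import combinations
import pandas as pd

def exact_tier_agreement(run_a: pd.DataFrame, run_b: pd.DataFrame) -> float:
    """Fraction of shared orthogroups assigned the identical normalized tier in both runs."""
    merged = run_a.merge(run_b, on="orthogroup", suffixes=("_a", "_b"))
    return float((merged["tier_a"] == merged["tier_b"]).mean())

# Example: mean pairwise agreement across the three runs of one configuration.
runs = [pd.read_csv(f"claude_app_run{i}_tiers.csv") for i in (1, 2, 3)]  # hypothetical paths
pairwise = [exact_tier_agreement(a, b) for a, b in combinations(runs, 2)]
print(f"mean within-method exact agreement: {sum(pairwise) / len(pairwise):.3f}")
```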
Article activity feed
-
Within-method exact agreement on normalized relevance labels was modest (Figure 3; Table 2). The best agreement was between Claude Code runs 2 and 3 (54/73 orthogroups; 0.740), while the lowest was between Claude Code runs 1 and 3 (25/73; 0.342). Mean within-method agreement was in the same range for all three configurations (0.516–0.562), so no configuration was dramatically more reproducible than the others at the tier-label level. These results argue against relying on a single stochastic agent run for final biological claims, even when the input files and prompt are identical.
Is within-method exact agreement really the best metric? Recommending not to rely on a single stochastic agent run is fine, but what is the delta? Running many replicates costs more; what is the benefit?
-
Although coverage was complete, calibration differed strongly across runs (Figure 2; Table 1). Claude App run 2 was highly conservative, assigning 67 of 73 orthogroups to a low or background tier and only one high call. Claude App runs 1 and 3 were less conservative, with 11 and 8 high calls, respectively. Claude Code with scientific skills produced fewer high calls overall (1, 3, and 2), but shifted substantially between low and watchlist labels across runs. Codex App with scientific skills showed the widest high-call range, from no high calls in run 2 to 12 high calls in run 3.
How does temperature/nucleus sampling/effort affect these results? Did you control for potential variation in these parameters?
-
Here we use a controlled, repeated-run comparison to evaluate three agent configurations as they were used on the same orthogroup annotation prompt. The goal is not to rank proprietary foundation models in general. Instead, we ask a practical question relevant to bioinformatics groups: when agents are asked to retrieve, integrate, and interpret a large set of complex protein annotations, where do they help, where do they fail, how consistent are repeated runs, and how should their outputs be merged into a defensible final annotation table?
Great experiment! I wonder which metrics are reported and how representative and relevant those metrics are for real-life tasks.
-
Combining these evidence streams is routine, but the final biological interpretation is still difficult because many protein families are multidomain, repetitive, lineage-specific, or only indirectly connected to the process of interest.
Absolutely!! A challenge of great importance.
-