GAME: Genomic API for Model Evaluation

Abstract

The rapid expansion of genomics datasets and the application of machine learning have produced sequence-to-activity genomics models with ever-expanding capabilities. However, benchmarking these models on practical applications has been challenging because individual projects evaluate their models in ad hoc ways, and there is substantial heterogeneity in both model architectures and benchmarking tasks. To address this challenge, we have created GAME, a system for large-scale, community-led, standardized model benchmarking on user-defined evaluation tasks. We borrow concepts from the Application Programming Interface (API) paradigm to allow seamless communication between pre-trained models and benchmarking tasks, ensuring consistent evaluation protocols. Because all models and benchmarks are inherently compatible in this framework, new models and new benchmarks can be added continually with little effort. We also developed a Matcher module powered by a large language model (LLM) to automate ambiguous task alignment between benchmarks and models. Containerization of these modules enhances reproducibility and facilitates the deployment of models and benchmarks across computing platforms. By focusing on predicting underlying biochemical phenomena (e.g., gene expression, open chromatin, DNA binding), we ensure that tasks remain technology-independent. We provide examples of benchmarks and models implementing this framework, and we anticipate that the community will contribute their own, leading to an ever-expanding and evolving set of models and evaluation tasks. This resource will accelerate genomics research by illuminating the best models for a given task, motivating novel functional genomic benchmarks, and providing a more nuanced understanding of model abilities.
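To make the API paradigm concrete, the sketch below shows one way a shared model/benchmark interface could look in Python. It is a minimal illustration only: the names (SequenceModel, Benchmark, run_benchmark, predict, evaluate) and the scalar-prediction signature are assumptions for exposition, not the actual GAME interface.

```python
# Minimal sketch of the shared model/benchmark interface idea described in the abstract.
# All class and method names are illustrative assumptions, not the real GAME API.
from abc import ABC, abstractmethod
from typing import Dict, List


class SequenceModel(ABC):
    """A pre-trained sequence-to-activity model wrapped behind a common interface."""

    @abstractmethod
    def supported_tasks(self) -> List[str]:
        """Biochemical readouts the model can predict, e.g. 'gene_expression', 'open_chromatin'."""

    @abstractmethod
    def predict(self, sequences: List[str], task: str, cell_type: str) -> List[float]:
        """Return one activity prediction per input DNA sequence."""


class Benchmark(ABC):
    """A user-defined evaluation task expressed against the same interface."""

    @abstractmethod
    def required_task(self) -> str:
        """The biochemical phenomenon this benchmark measures."""

    @abstractmethod
    def evaluate(self, model: SequenceModel) -> Dict[str, float]:
        """Query the model on the benchmark's sequences and return metric name -> score."""


def run_benchmark(model: SequenceModel, benchmark: Benchmark) -> Dict[str, float]:
    """Any wrapped model can be scored on any compatible benchmark through the shared interface."""
    if benchmark.required_task() not in model.supported_tasks():
        # In GAME, an LLM-powered Matcher module is described as resolving ambiguous
        # task alignment; here we simply reject incompatible pairs.
        raise ValueError("Model does not support the benchmark's task.")
    return benchmark.evaluate(model)
```

In this kind of design, the interface (rather than any particular architecture or assay) is the contract, which is what makes continually adding new models and benchmarks straightforward.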
