DARKIN: A zero-shot benchmark for phosphosite–dark kinase association using protein language models

Emine Ayşe Sunar
Zeynep Işık
Mert Pekey
Ramazan Gökberk Cinbiş
Oznur Tastan

Abstract

Motivation

Protein language models (pLMs) have emerged as powerful tools for capturing the information encoded in protein sequences, and they facilitate a range of downstream protein prediction tasks. With numerous pLMs available, diverse benchmarks are needed to evaluate their performance systematically across biologically relevant tasks. Here, we introduce DARKIN, a zero-shot classification benchmark designed to assign phosphosites to understudied kinases, termed dark kinases. Kinases, which catalyze phosphorylation, are central to cellular signaling pathways. While phosphoproteomics enables the large-scale identification of phosphosites, determining the cognate kinase responsible for a given phosphorylation event remains an experimental challenge.

Results

In DARKIN, we prepared training, validation, and test folds that respect the zero-shot nature of this classification problem, incorporating stratification based on kinase groups and sequence similarity. We evaluated multiple pLMs using two zero-shot classifiers: a novel, training-free k-NN-based method, and a bilinear classifier. Our findings indicate that ESM, ProtT5-XL, and SaProt exhibit superior performance on this task. DARKIN provides a challenging benchmark for assessing pLM efficacy and fosters deeper exploration of under-characterized (dark) kinases by offering a biologically relevant test bed.
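
The abstract does not spell out how the two zero-shot classifiers work, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that phosphosites and kinases are each represented by fixed pLM embeddings, that a training-free k-NN classifier transfers labels by averaging the kinase embeddings of the nearest training phosphosites and matching that average to the unseen (dark) kinases, and that a bilinear classifier scores a site-kinase pair as x^T W k. All function names, the toy vectors, and the matrix W are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def knn_zero_shot_predict(query_site, train_sites, train_kinase_ids,
                          seen_kinase_emb, dark_kinase_emb, k=3):
    """Hypothetical training-free k-NN zero-shot classifier.

    1. Find the k training phosphosites most similar to the query.
    2. Average the embeddings of their (seen) kinases, weighted by similarity.
    3. Return the dark kinase whose embedding is closest to that average.
    """
    q = l2_normalize(query_site)
    sims = l2_normalize(train_sites) @ q            # cosine similarity to each training site
    top = np.argsort(-sims)[:k]                     # indices of the k nearest sites
    weights = np.clip(sims[top], 0.0, None) + 1e-12 # non-negative neighbour weights
    neighbour_kinases = l2_normalize(seen_kinase_emb)[np.asarray(train_kinase_ids)[top]]
    prototype = (weights[:, None] * neighbour_kinases).sum(axis=0)
    scores = l2_normalize(dark_kinase_emb) @ l2_normalize(prototype)
    return int(np.argmax(scores)), scores

def bilinear_scores(site_emb, W, kinase_emb):
    """Hypothetical bilinear compatibility: score_j = site_emb^T W kinase_emb[j]."""
    return np.asarray(site_emb) @ np.asarray(W) @ np.asarray(kinase_emb).T

# Toy data (made up for illustration): 2-D site embeddings, 3-D kinase embeddings.
train_sites = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_kinase_ids = np.array([0, 0, 1, 1])           # index into seen_kinase_emb
seen_kinase_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
dark_kinase_emb = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]])  # unseen classes

pred, knn_scores = knn_zero_shot_predict(
    np.array([1.0, 0.05]), train_sites, train_kinase_ids,
    seen_kinase_emb, dark_kinase_emb, k=2)

W = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])    # would be learned on seen kinases
bl_scores = bilinear_scores(np.array([1.0, 0.0]), W, dark_kinase_emb)
print(pred, int(np.argmax(bl_scores)))
```

In both cases the dark kinases never appear as training labels; they enter only through their embeddings at prediction time, which is what makes the setting zero-shot.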

Implementation

The DARKIN benchmark data and the scripts for generating additional splits are publicly available at: https://github.com/tastanlab/darkin

Contact

otastan@sabanciuniv.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

Version published to 10.1101/2025.08.27.672558 on bioRxiv, Sep 1, 2025

