AINN-P1: A Compact Sequence-Only Protein Language Model Achieves Competitive Fitness Prediction on ProteinGym

Roger Wang
Kevin Jin
Lurong Pan

Abstract

Protein language models (PLMs) are increasingly central to protein engineering and drug discovery. Many high-performing systems, however, rely on large parameter counts, multiple sequence alignments (MSAs), explicit structural inputs, or computationally intensive attention mechanisms, limiting their accessibility and throughput. Here we present AINN-P1, a 167M-parameter protein language model trained exclusively on raw UniRef amino-acid sequences using an autoregressive next-token prediction objective. AINN-P1 employs a multiplicative LSTM (mLSTM) architecture—an attention-free, recurrent design that scales linearly with sequence length and avoids growing key–value caches during inference.
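To make the fixed-state property concrete, the following is a minimal NumPy sketch of a multiplicative LSTM cell in one commonly cited formulation. It is not the AINN-P1 implementation: the class `MLSTMCell`, the helper names, the one-hot input encoding, the 20-letter alphabet, and the toy dimensions are all assumptions made for illustration. The point it shows is that the recurrent state `(h, c)` has the same size regardless of sequence length, and that the cost per residue is constant.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet, illustration only


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def tokenize(sequence):
    """Map an amino-acid string to integer token ids (hypothetical helper)."""
    return [AMINO_ACIDS.index(residue) for residue in sequence]


class MLSTMCell:
    """Toy multiplicative LSTM cell with random weights (not trained)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden_dim)
        self.hidden_dim = hidden_dim
        # Projections that form the multiplicative intermediate state m_t.
        self.W_mx = rng.normal(0.0, scale, (hidden_dim, input_dim))
        self.W_mh = rng.normal(0.0, scale, (hidden_dim, hidden_dim))
        # Stacked weights for the candidate and the input/forget/output gates.
        self.W_x = rng.normal(0.0, scale, (4 * hidden_dim, input_dim))
        self.W_m = rng.normal(0.0, scale, (4 * hidden_dim, hidden_dim))
        self.b = np.zeros(4 * hidden_dim)

    def step(self, x, h, c):
        # m_t replaces h_{t-1} in the usual LSTM gate equations, so the
        # effective recurrent transition depends on the current input.
        m = (self.W_mx @ x) * (self.W_mh @ h)
        z = self.W_x @ x + self.W_m @ m + self.b
        cand, i, f, o = np.split(z, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(cand)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new


def encode(cell, token_ids, vocab_size=len(AMINO_ACIDS)):
    """Run the cell over a sequence; one step per residue, fixed-size state."""
    h = np.zeros(cell.hidden_dim)
    c = np.zeros(cell.hidden_dim)
    for token in token_ids:
        x = np.zeros(vocab_size)
        x[token] = 1.0  # one-hot input, an assumption of this sketch
        h, c = cell.step(x, h, c)
    return h, c


if __name__ == "__main__":
    cell = MLSTMCell(input_dim=len(AMINO_ACIDS), hidden_dim=8)
    h_short, _ = encode(cell, tokenize("MKT"))
    h_long, _ = encode(cell, tokenize("MKT" * 200))
    # Both states have the same shape even though the inputs differ 200-fold.
    print(h_short.shape, h_long.shape)
```

In contrast, a standard attention layer keeps keys and values for every previous position during autoregressive inference, so its memory grows with the number of residues processed.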

We evaluate AINN-P1 on ProteinGym fitness prediction tasks spanning activity, binding, expression, and stability using a frozen-encoder protocol with lightweight few-shot regression heads. Under this protocol, AINN-P1 achieves an average Spearman ρ of 0.441 across four task categories and a Spearman ρ of 0.625 on stability—the highest among sequence-only models in our comparison set. Because our evaluation uses few-shot supervised regression rather than the zero-shot scoring employed by most ProteinGym leaderboard baselines, direct numerical comparison requires caution; we discuss this methodological distinction throughout.
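As a rough illustration of what a frozen-encoder, few-shot protocol involves, the sketch below fits a small ridge regression head on fixed embeddings and scores held-out variants with Spearman ρ. Everything here is assumed for the example: the embeddings are synthetic stand-ins for encoder outputs, the ridge head, the regularization strength `alpha`, and the 32-example training split are hypothetical choices, and the abstract does not specify the regression head or shot count actually used.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    X_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - X_mean
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y - y_mean))
    b = y_mean - X_mean @ w
    return w, b


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def demo(n_variants=200, embed_dim=16, n_shots=32, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic stand-ins for frozen encoder embeddings and measured fitness.
    embeddings = rng.normal(size=(n_variants, embed_dim))
    true_w = rng.normal(size=embed_dim)
    fitness = embeddings @ true_w + rng.normal(scale=0.5, size=n_variants)

    # Few-shot split: only the head sees labels; the encoder is never updated.
    train_idx = np.arange(n_shots)
    test_idx = np.arange(n_shots, n_variants)
    w, b = ridge_fit(embeddings[train_idx], fitness[train_idx], alpha=1.0)
    predictions = embeddings[test_idx] @ w + b
    return spearman_rho(predictions, fitness[test_idx])


if __name__ == "__main__":
    print(f"held-out Spearman rho: {demo():.3f}")
```

Because the head is fit on labeled variants from the assay, a score obtained this way is not directly comparable to a zero-shot score, which uses no assay labels at all.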

Beyond benchmark performance, AINN-P1 emphasizes practical deployability: its recurrent architecture avoids quadratic memory scaling, supports fixed-state inference on long sequences, and enables rapid adaptation through frozen embeddings rather than costly end-to-end fine-tuning. We discuss when sequence-only models are sufficient, when structural information remains beneficial, and how compact foundation models can serve as efficient front-end filters in drug discovery workflows.

Version published to 10.64898/2026.03.26.714619 on bioRxiv
Mar 30, 2026