ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Protein language models have become essential tools for engineering novel functional proteins. The emerging paradigm of family-based language models makes use of homologous sequences to steer protein design and enhance zero-shot fitness prediction, by imbuing models with an ability to explicitly reason over evolutionary context. To provide an open foundation for this modelling approach, we introduce ProFam-1, a 251M-parameter autoregressive protein family language model (pfLM) trained with next-token prediction on millions of protein families represented as concatenated, unaligned sets of sequences. ProFam-1 is competitive with state-of-the-art models on the ProteinGym zero-shot fitness prediction benchmark, achieving Spearman correlations of 0.47 for substitutions and 0.53 for indels. For homology-guided generation, ProFam-1 generates diverse sequences with predicted structural similarity, while preserving residue conservation and covariance patterns. All of ProFam’s training and inference pipelines, together with our curated, large-scale training dataset ProFam Atlas, are released fully open source, lowering the barrier to future method development.
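As a concrete illustration of the family-prompting setup described in the abstract, the sketch below shows how a set of unaligned homologs might be concatenated into a single prompt and a variant scored by its conditional log-likelihood. The separator token and the `score_log_likelihood` callable are hypothetical placeholders for illustration, not ProFam-1's actual tokenizer or API.

```python
# Minimal sketch of family-based prompting for zero-shot fitness prediction.
# The separator token and the score_log_likelihood interface are assumptions
# for illustration; they are not ProFam-1's actual tokenizer or API.
import random
from typing import Callable, Sequence

SEP = "|"  # hypothetical token separating unaligned homologs in the prompt

def build_family_prompt(homologs: Sequence[str], rng: random.Random) -> str:
    """Concatenate a set of unaligned homologs in a random order."""
    shuffled = list(homologs)
    rng.shuffle(shuffled)
    return SEP.join(shuffled) + SEP

def zero_shot_fitness(
    variant: str,
    homologs: Sequence[str],
    score_log_likelihood: Callable[[str, str], float],  # (prompt, sequence) -> log p(sequence | prompt)
    seed: int = 0,
) -> float:
    """Score a variant by its conditional log-likelihood given the family prompt."""
    prompt = build_family_prompt(homologs, random.Random(seed))
    return score_log_likelihood(prompt, variant)
```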
Article activity feed
- First, the attainable Spearman correlation varies widely across prompts for the same assay: the gap between the best and worst prompt commonly exceeds 0.3. Second, the average variant log-likelihood also spans a broad range, and the optimal likelihood differs by assay.
Is this a good candidate for distillation? It seems like it could lock in these performance gains without the heavy inference cost, and it might naturally solve the prompt sensitivity issues that warrant ensembling in the first place. Curious to hear your thoughts.
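For reference, a minimal sketch of the kind of prompt ensembling under discussion, assuming the same hypothetical `score_log_likelihood(prompt, sequence)` interface as above; the separator token, subset size, and number of prompts are illustrative assumptions, not the paper's settings.

```python
# Sketch of prompt ensembling: average a variant's conditional log-likelihood
# over several randomly sampled homolog prompts to smooth prompt-to-prompt variance.
import random
from statistics import mean
from typing import Callable, Sequence

def ensemble_score(
    variant: str,
    homologs: Sequence[str],
    score_log_likelihood: Callable[[str, str], float],
    n_prompts: int = 8,
    homologs_per_prompt: int = 32,
    seed: int = 0,
) -> float:
    """Average a variant's conditional log-likelihood over several random prompts."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_prompts):
        # Each prompt draws a different random subset and ordering of homologs.
        subset = rng.sample(list(homologs), k=min(homologs_per_prompt, len(homologs)))
        prompt = "|".join(subset) + "|"  # hypothetical separator token
        scores.append(score_log_likelihood(prompt, variant))
    return mean(scores)
```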
- "During training, we randomized the order of sequences within each document to encourage invariance with respect to sequence order"
When creating the prompt for a given homolog set {H_i, ...}, the order of concatenation is randomized to promote homolog order invariance. But was invariance ever tested post-training? Specifically, did you quantify the variance in model output when the exact same set of homologs is simply re-ordered? Establishing this baseline seems critical to determine whether the performance gains from ensembling truly derive from aggregating diverse evolutionary information, or are partially an artifact of smoothing out the model's sensitivity to arbitrary input ordering.
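Concretely, such an order-sensitivity check could look roughly like the sketch below: hold the homolog set fixed, permute its order several times, and measure the spread of each variant's score across permutations. The separator token and `score_log_likelihood` interface are hypothetical placeholders, as above.

```python
# Sketch of an order-invariance check: same homolog set, different orderings,
# and the spread of per-variant scores across those orderings. Illustrative only.
import random
from statistics import pstdev
from typing import Callable, Dict, Sequence

def order_sensitivity(
    variants: Sequence[str],
    homologs: Sequence[str],
    score_log_likelihood: Callable[[str, str], float],
    n_permutations: int = 10,
    seed: int = 0,
) -> Dict[str, float]:
    """Standard deviation of each variant's score across re-orderings of one homolog set."""
    rng = random.Random(seed)
    scores: Dict[str, list] = {v: [] for v in variants}
    for _ in range(n_permutations):
        order = list(homologs)
        rng.shuffle(order)
        prompt = "|".join(order) + "|"  # identical set, permuted order
        for v in variants:
            scores[v].append(score_log_likelihood(prompt, v))
    return {v: pstdev(s) for v, s in scores.items()}
```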