ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Protein language models have become essential tools for engineering novel functional proteins. The emerging paradigm of family-based language models makes use of homologous sequences to steer protein design and enhance zero-shot fitness prediction, by imbuing models with an ability to explicitly reason over evolutionary context. To provide an open foundation for this modelling approach, we introduce ProFam-1, a 251M-parameter autoregressive protein family language model (pfLM) trained with next-token prediction on millions of protein families represented as concatenated, unaligned sets of sequences. ProFam-1 is competitive with state-of-the-art models on the ProteinGym zero-shot fitness prediction benchmark, achieving Spearman correlations of 0.47 for substitutions and 0.53 for indels. For homology-guided generation, ProFam-1 generates diverse sequences with predicted structural similarity, while preserving residue conservation and covariance patterns. All of ProFam’s training and inference pipelines, together with our curated, large-scale training dataset ProFam Atlas, are released fully open source, lowering the barrier to future method development.
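As a concrete illustration of the family-prompting setup described in the abstract, the sketch below shows how a set of unaligned homologs might be concatenated into a single prompt and a variant scored by its conditional log-likelihood. The separator token and the `score_log_likelihood` callable are hypothetical placeholders for illustration, not ProFam-1's actual tokenizer or API.

```python
# Minimal sketch of family-based prompting for zero-shot fitness prediction.
# The separator token and the score_log_likelihood interface are assumptions
# for illustration; they are not ProFam-1's actual tokenizer or API.
import random
from typing import Callable, Sequence

SEP = "|"  # hypothetical token separating unaligned homologs in the prompt

def build_family_prompt(homologs: Sequence[str], rng: random.Random) -> str:
    """Concatenate a set of unaligned homologs in a random order."""
    shuffled = list(homologs)
    rng.shuffle(shuffled)
    return SEP.join(shuffled) + SEP

def zero_shot_fitness(
    variant: str,
    homologs: Sequence[str],
    score_log_likelihood: Callable[[str, str], float],  # (prompt, sequence) -> log p(sequence | prompt)
    seed: int = 0,
) -> float:
    """Score a variant by its conditional log-likelihood given the family prompt."""
    prompt = build_family_prompt(homologs, random.Random(seed))
    return score_log_likelihood(prompt, variant)
```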
Article activity feed
- First, the attainable Spearman correlation varies widely across prompts for the same assay: the gap between the best and worst prompt commonly exceeds 0.3. Second, the average variant log-likelihood also spans a broad range, and the optimal likelihood differs by assay.
Is this a good candidate for distillation? It seems like it could lock in these performance gains without the heavy inference cost, and it might naturally solve the prompt sensitivity issues that warrant ensembling in the first place. Curious to hear your thoughts.
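For reference, a minimal sketch of the kind of prompt ensembling under discussion, assuming the same hypothetical `score_log_likelihood(prompt, sequence)` interface as above; the separator token, subset size, and number of prompts are illustrative assumptions, not the paper's settings.

```python
# Sketch of prompt ensembling: average a variant's conditional log-likelihood
# over several randomly sampled homolog prompts to smooth prompt-to-prompt variance.
import random
from statistics import mean
from typing import Callable, Sequence

def ensemble_score(
    variant: str,
    homologs: Sequence[str],
    score_log_likelihood: Callable[[str, str], float],
    n_prompts: int = 8,
    homologs_per_prompt: int = 32,
    seed: int = 0,
) -> float:
    """Average a variant's conditional log-likelihood over several random prompts."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_prompts):
        # Each prompt draws a different random subset and ordering of homologs.
        subset = rng.sample(list(homologs), k=min(homologs_per_prompt, len(homologs)))
        prompt = "|".join(subset) + "|"  # hypothetical separator token
        scores.append(score_log_likelihood(prompt, variant))
    return mean(scores)
```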
- "During training, we randomized the order of sequences within each document to encourage invariance with respect to sequence order"
When creating the prompt for a given homolog set {H_i, ...}, the order of concatenation is randomized to promote homolog order invariance. But was invariance ever tested post-training? Specifically, did you quantify the variance in model output when the exact same set of homologs is simply re-ordered? Establishing this baseline seems critical to determine whether the performance gains from ensembling truly derive from aggregating diverse evolutionary information, or are partially an artifact of smoothing out the model's sensitivity to arbitrary input ordering.
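Concretely, such an order-sensitivity check could look roughly like the sketch below: hold the homolog set fixed, permute its order several times, and measure the spread of each variant's score across permutations. The separator token and `score_log_likelihood` interface are hypothetical placeholders, as above.

```python
# Sketch of an order-invariance check: same homolog set, different orderings,
# and the spread of per-variant scores across those orderings. Illustrative only.
import random
from statistics import pstdev
from typing import Callable, Dict, Sequence

def order_sensitivity(
    variants: Sequence[str],
    homologs: Sequence[str],
    score_log_likelihood: Callable[[str, str], float],
    n_permutations: int = 10,
    seed: int = 0,
) -> Dict[str, float]:
    """Standard deviation of each variant's score across re-orderings of one homolog set."""
    rng = random.Random(seed)
    scores: Dict[str, list] = {v: [] for v in variants}
    for _ in range(n_permutations):
        order = list(homologs)
        rng.shuffle(order)
        prompt = "|".join(order) + "|"  # identical set, permuted order
        for v in variants:
            scores[v].append(score_log_likelihood(prompt, v))
    return {v: pstdev(s) for v, s in scores.items()}
```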