Scaling down for efficiency: Medium-sized protein language models perform well at transfer learning on realistic datasets


Abstract

Protein language models (pLMs) can offer deep insights into evolutionary and structural properties of proteins. While larger models, such as the 15 billion parameter model ESM-2, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of various pLMs across multiple biological datasets to assess the impact of model size on transfer learning. Surprisingly, we found that larger models do not necessarily outperform smaller ones, in particular when data is limited. Medium-sized models, such as ESM-2 650M and ESM C 600M, demonstrated consistently good performance, falling only slightly behind their larger counterparts, ESM-2 15B and ESM C 6B, despite being many times smaller. Additionally, we compared various methods of compressing embeddings prior to transfer learning, and we found that mean embeddings consistently outperformed other compression methods. In summary, ESM C 600M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in realistic biological applications.

This work challenges the common belief that larger language models always yield better results, here in the context of protein biochemistry. By systematically comparing transformer models of different sizes in transfer learning tasks, we demonstrate that medium-sized models, such as ESM C 600M, frequently perform as well as or better than larger variants, especially when data is limited. These findings provide an efficient strategy for machine learning-based protein analysis. Smaller and more efficient models help democratize cutting-edge AI approaches, making them more accessible to researchers with limited computational resources.

Article activity feed

  1. We have evaluated the performance of ESM2 embeddings across various model sizes (from 8 million to 15 billion parameters) in transfer learning tasks on a wide range of different biological datasets

    I think the diversity of regression tasks lends a lot of robustness to your conclusions. However, I think you're using the term "transfer learning" rather narrowly, specifically referring to prediction tasks where either a value or a vector is predicted for each sequence.

    There are many classes of transfer learning tasks, like sequence labeling, token classification, and sequence-to-sequence tasks. I think being more specific about the type of transfer learning you're making claims about would make your conclusions more accurate.

  2. Scaling Down for Efficiency: Medium-Sized Transformer Models for Protein Sequence Transfer Learning

    Thanks for this insightful piece. I've left some food for thought below.

  3. Even though these models were also pretrained with a maximum sequence length

    Technically, ESM2 is trained using sequences longer than 1022 residues, but a length-1022 subsequence is sampled whenever such a sequence is selected for a training batch.
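    For concreteness, the sampling behavior described above can be sketched roughly as follows (a minimal illustration; the `sample_subsequence` helper and its exact interface are hypothetical, not the actual ESM-2 data pipeline):

    ```python
    import random

    def sample_subsequence(seq: str, max_len: int = 1022) -> str:
        """Return the sequence unchanged if it fits within max_len;
        otherwise sample a random contiguous window of max_len residues,
        as described for long sequences in ESM-2 training batches."""
        if len(seq) <= max_len:
            return seq
        start = random.randrange(len(seq) - max_len + 1)
        return seq[start:start + max_len]
    ```

    Under this scheme the model still sees residues from all positions of long proteins over many epochs, even though any single training example is capped at 1022 tokens.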

  4. Mean reduction in R² when embeddings are compressed with methods other than mean pooling. A) Results for DMS data. B) Results for diverse protein sequences (PISCES data). In all cases, the y-axis represents different compression methods and the x-axis shows the resulting difference in R². Dots represent the fixed effects estimates from mixed-effects modeling, and error bars represent 95% confidence intervals.

    This analysis comparing pooling methods was very informative, but it left me wondering to what extent mean pooling compares to no pooling at all. Is this something y'all considered? It would be interesting to compare the R² of a more sophisticated transfer learning model that ingests the raw embeddings (like a basic FCN). Though an apples-to-apples comparison might be hard to create, it would be useful to know the "cost" of mean pooling by observing the extent to which raw embeddings outperform mean pooling (if at all?).
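    For readers unfamiliar with the operation being discussed: mean pooling collapses a per-residue embedding matrix of shape (L, D) into a single length-D vector, which is what makes the downstream regression cheap but also discards positional information. A minimal sketch (the `mean_pool` helper is illustrative, not code from the paper):

    ```python
    import numpy as np

    def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
        """Compress per-residue embeddings of shape (L, D) into a single
        (D,) vector by averaging over the sequence (residue) dimension."""
        return token_embeddings.mean(axis=0)

    # Toy example: 3 residues, embedding dimension 2.
    emb = np.array([[0.0, 1.0],
                    [2.0, 3.0],
                    [4.0, 5.0]])
    pooled = mean_pool(emb)  # shape (2,)
    ```

    An unpooled baseline, as suggested above, would instead feed the full (L, D) matrix (padded or flattened) into the downstream model, trading compute and parameter count for access to per-position information.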

  5. In most scenarios

    I really don't think this is true. Many transfer learning tasks involve token-level predictions, and in those scenarios the per-residue embeddings cannot be compressed into a single vector.