Protein Language Models Expose Viral Mimicry and Immune Escape

Abstract

Motivation

Viruses elude the immune system through molecular mimicry, adopting biophysical characteristics of their host. We adapt protein language models (PLMs) to differentiate between human and viral proteins. Understanding where the immune system and our models make mistakes could reveal viral immune escape mechanisms.

Results

We applied pretrained deep-learning PLMs to distinguish viral from human proteins. Our predictors achieve state-of-the-art results, with an AUC of 99.7%. We use interpretable error-analysis models to characterize viral escapers. Altogether, mistakes account for 3.9% of the sequences, with viral proteins being disproportionately misclassified. Analysis of external variables, including taxonomy and functional annotations, indicated that errors typically involve proteins with low immunogenic potential, viruses specific to human hosts, and viruses that use reverse transcriptase for replication. Viral families causing chronic infection and immune evasion are further enriched, and their protein mimicry potential is discussed. We provide insights into viral adaptation strategies and highlight the combined potential of PLMs and explainable AI in uncovering mechanisms of viral immune escape, contributing to vaccine design and antiviral research.

Availability and implementation

Data and results are available at https://github.com/ddofer/ProteinHumVir.

Contact

michall@cc.huji.ac.il

Article activity feed

  1. The observed distributions suggest differential immune recognition for viral versus human proteins

    Did you do any sort of statistical test to see if these distributions are actually different?
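
    For concreteness, a nonparametric two-sample test would be a natural way to check this; a minimal sketch (the score arrays below are simulated placeholders, not the paper's actual outputs):

    ```python
    import numpy as np
    from scipy.stats import ks_2samp, mannwhitneyu

    # Simulated stand-ins for per-protein prediction scores; in practice
    # these would be the model outputs for human vs. viral test sequences.
    rng = np.random.default_rng(0)
    human_scores = rng.beta(2, 8, size=18_418)
    viral_scores = rng.beta(5, 5, size=6_699)

    # Mann-Whitney U: are the two distributions stochastically different?
    _, u_p = mannwhitneyu(human_scores, viral_scores, alternative="two-sided")

    # Kolmogorov-Smirnov: maximum distance between the empirical CDFs.
    ks_stat, ks_p = ks_2samp(human_scores, viral_scores)
    print(f"Mann-Whitney p={u_p:.3g}; KS D={ks_stat:.3f}, p={ks_p:.3g}")
    ```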

  2. We found that the fraction of mistakes is higher for genera of human-specific viruses when compared to genera of viruses that infect vertebrates

    Was the human dataset larger for training? I think we have more examples of viruses that infect humans than any other species.

  3. Overall, models more frequently misclassify viruses as human proteins (abbreviated V4H) than the other way around (H4V). Altogether, 9.48% of viral proteins are misclassified (635/6,699), as opposed to only 1.87% of human proteins (H4V, 345/18,418)

    These set sizes are pretty different, so just on average the model will be more accurate if it always predicts human than virus. Did you do anything to balance this out during your training?
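
    One standard remedy, if it wasn't already used, is class weighting during training, paired with a metric that is insensitive to the imbalance; a minimal scikit-learn sketch (the data here are synthetic placeholders with the paper's class sizes):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in: 18,418 "human" vs 6,699 "viral" proteins,
    # each a 64-d feature vector (placeholders, not real embeddings).
    rng = np.random.default_rng(0)
    y = np.array([0] * 18_418 + [1] * 6_699)
    X = rng.normal(size=(len(y), 64)) + y[:, None] * 0.5

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    # class_weight="balanced" upweights the minority (viral) class, so the
    # model is not rewarded for defaulting to the majority (human) label.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_tr, y_tr)

    # Balanced accuracy averages per-class recall: a trivial
    # "always predict human" classifier scores only 0.5 here.
    print(balanced_accuracy_score(y_te, clf.predict(X_te)))
    ```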

  4. While the Amino Acid n-grams model, using only sequence length and amino acid combinations, achieves good separation (91.9% AUC), the PLM-based models outperform it, reaching an AUC of 99.7% and ∼97% accuracy.

    Is this just on the test set or the combined training and test? I would expect it to be just on the test but there was a sentence in the methods about how you re-combined these two sets
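
    For what it's worth, reporting strictly on the held-out split is straightforward; continuing the classifier sketch above (clf, X_tr/X_te, y_tr/y_te as defined there):

    ```python
    from sklearn.metrics import roc_auc_score

    # Fit on the training split only, score on the test split only;
    # re-combining the two sets before evaluation would inflate the AUC.
    clf.fit(X_tr, y_tr)
    print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    ```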

  5. Models were trained to predict if a protein is from a human or a virus. Performance was evaluated on the test set.

    Can you add the size of each set, as well as the taxonomic diversity of sequences represented?
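
    A small metadata summary would answer both questions; a sketch assuming a per-sequence table with (hypothetical) split, label, family, and genus columns:

    ```python
    import pandas as pd

    # Hypothetical metadata file: one row per sequence.
    df = pd.read_csv("sequences_metadata.csv")

    # Size and class balance of each split.
    print(df.groupby(["split", "label"]).size().unstack(fill_value=0))

    # Taxonomic diversity: distinct viral families and genera per split.
    viral = df[df["label"] == "virus"]
    print(viral.groupby("split")[["family", "genus"]].nunique())
    ```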

  6. These embeddings were derived from “prottrans_T5_xl_u50”, a T5 Transformer PLM with 3 billion parameters (Elnaggar et al., 2022). These are used as input features for training downstream, non-deep ML models.

    I think I'm confused, so you used ESM2 or you used T5 transformer embeddings?
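
    For reference, per-protein ProtT5 embeddings are typically extracted along these lines; a sketch using the public Hugging Face encoder checkpoint, which may differ in detail from the authors' exact pipeline:

    ```python
    import re
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    # Encoder-only ProtT5-XL-UniRef50 checkpoint (loads in fp32 by default;
    # use .half().cuda() on a GPU for speed). Requires sentencepiece.
    name = "Rostlab/prot_t5_xl_half_uniref50-enc"
    tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(name).eval()

    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
    # ProtT5 expects space-separated residues; map rare residues to X.
    seq = " ".join(re.sub(r"[UZOB]", "X", seq))

    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, L+1, 1024)

    # Mean-pool over residues (dropping the trailing </s> token) to get
    # one fixed-length 1024-d vector per protein for downstream models.
    embedding = hidden[0, :-1].mean(dim=0)
    print(embedding.shape)  # torch.Size([1024])
    ```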

  7. Virus family, genus, and Baltimore classification were downloaded from ViralZone

    Can you add the link that this was downloaded from? I've historically had trouble downloading directly from ViralZone

  8. Duplicate sequences at the UniRef90 level were dropped to reduce redundancy

    What does this mean? I thought that UniRef90 was already non-redundant at the level of 90% sequence similarity
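
    If I read it right, "dropping duplicates at the UniRef90 level" usually means keeping one representative per UniRef90 cluster ID: UniRef90 itself is non-redundant at 90% identity, but a custom human+viral dataset can still contain several members of the same cluster. A sketch with a hypothetical mapping column:

    ```python
    import pandas as pd

    # Hypothetical table: one row per protein, with the UniRef90 cluster
    # it maps to (e.g. from the UniProt "UniRef90" cross-reference).
    df = pd.read_csv("proteins.csv")

    # Keep a single sequence per UniRef90 cluster: afterwards no two
    # retained sequences share >=90% identity (same-cluster members).
    dedup = df.drop_duplicates(subset="uniref90_id", keep="first")
    print(len(df), "->", len(dedup))
    ```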

  9. adept at mimicking host proteins.

    What type of mimicry? This could be:

    • Codon usage
    • Structural mimicry without high sequence similarity
    • Structural mimicry with high sequence similarity
    • Epitope mimicry
  10. These mechanisms are evolutionarily optimized through sequence adaptation

    Which mechanisms, and what do you mean by evolutionarily optimized? The abstract of that citation (Bahir et al. 2009) suggests that, "In contrast, proteins that are known to participate in host‐specific recognition do not necessarily adapt to their respective hosts," meaning that the proteins you discuss here might not have the same adapted codon usage.
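
    As an aside, codon-usage adaptation of the kind Bahir et al. measure boils down to comparing codon frequency profiles; a toy sketch (the coding sequence is a made-up placeholder):

    ```python
    from collections import Counter

    # Toy coding sequence; a real analysis would use each gene's annotated CDS.
    cds = "ATGGCTGCAGCGGCTAAAGCTTAA"

    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    freqs = Counter(codons)

    # Relative codon usage; comparing such profiles between a virus and
    # its host underlies adaptation measures like the Codon Adaptation Index.
    total = sum(freqs.values())
    for codon, n in sorted(freqs.items()):
        print(codon, n / total)
    ```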

  11. Gp120 structurally mimics the host cell receptor CD4

    It would be very helpful if you could provide a citation for this, and potentially the TM-score or some other measurement of structural similarity. I'm curious to know how large the structural mimic is (the whole host or virus protein? over how many amino acid residues?) and how much the viral protein mimics the host.
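
    For reference, a residue-level structural-similarity measurement like the one requested is typically produced with TM-align; a minimal wrapper sketch (the structure file names are placeholders):

    ```python
    import re
    import subprocess

    # Placeholder inputs, e.g. a gp120 structure and human CD4 from the PDB.
    out = subprocess.run(
        ["TMalign", "gp120.pdb", "cd4.pdb"],
        capture_output=True, text=True, check=True,
    ).stdout

    # TM-align reports a TM-score normalized by each chain's length, plus
    # the aligned length; TM-score > 0.5 usually indicates the same fold.
    print(re.findall(r"TM-score=\s*([\d.]+)", out))
    print(re.search(r"Aligned length=\s*(\d+)", out).group(1))
    ```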