Protein Language Models Expose Viral Mimicry and Immune Escape
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Motivation
Viruses elude the immune system through molecular mimicry, adopting biophysical characteristics of their host. We adapt protein language models (PLMs) to differentiate between human and viral proteins. Understanding where the immune system and our models make mistakes could reveal viral immune escape mechanisms.
Results
We applied pretrained deep-learning PLMs to distinguish viral from human proteins. Our predictors show state-of-the-art results with an AUC of 99.7%. We use interpretable error analysis models to characterize viral escapers. Altogether, mistakes account for 3.9% of the sequences, with viral proteins being disproportionately misclassified. Analysis of external variables, including taxonomy and functional annotations, indicated that errors typically involve proteins with low immunogenic potential, viruses specific to human hosts, and viruses that use reverse-transcriptase enzymes for their replication. Viral families causing chronic infections and immune evasion are further enriched, and their protein mimicry potential is discussed. We provide insights into viral adaptation strategies and highlight the combined potential of PLMs and explainable AI in uncovering mechanisms of viral immune escape, contributing to vaccine design and antiviral research.
Availability and implementation
Data and results are available at https://github.com/ddofer/ProteinHumVir.
Contact
michall@cc.huji.ac.il
Article activity feed
-
The observed distributions suggest differential immune recognition for viral versus human proteins
Did you do any sort of statistical test to see if these distributions are actually different?
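Something like a nonparametric two-sample test on the per-protein prediction scores is what I have in mind; a minimal SciPy sketch (the score arrays below are placeholders, not your data):

```python
# Minimal sketch: compare score distributions for viral vs. human proteins
# with two-sample nonparametric tests (placeholder arrays stand in for model scores).
import numpy as np
from scipy.stats import mannwhitneyu, ks_2samp

rng = np.random.default_rng(0)
scores_viral = rng.beta(2, 5, size=500)    # placeholder: model scores for viral proteins
scores_human = rng.beta(5, 2, size=1500)   # placeholder: model scores for human proteins

u_stat, u_p = mannwhitneyu(scores_viral, scores_human, alternative="two-sided")
ks_stat, ks_p = ks_2samp(scores_viral, scores_human)
print(f"Mann-Whitney U p={u_p:.3g}, Kolmogorov-Smirnov p={ks_p:.3g}")
```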
-
A full list of families and genera is provided in our repository.
Can you link to where? This was somewhat hard for me to find.
-
We found that the fraction of mistakes is higher for genera of human-specific viruses when compared to genera of viruses that infect vertebrates
Was the human dataset larger for training? I think we have more examples of viruses that infect humans than of viruses that infect any other species.
-
also extremely elusive
What does this mean? That they are misclassified more often by your model?
-
mis takes
Typo
-
Overall, models more frequently misclassify viruses as human proteins (abbreviated V4H) than the other way around (H4V). Altogether, 9.48% of viral proteins are misclassified (635/6,699), as opposed to only 1.87% of human proteins (H4V, 345/18,418)
These set sizes are pretty different, so on average the model will be more accurate if it always predicts human rather than virus. Did you do anything to balance this out during your training?
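For example, reweighting classes inversely to their frequency and reporting balanced metrics would address this; a minimal scikit-learn sketch (feature and label arrays are placeholders, not your actual pipeline):

```python
# Sketch: class-weighted training and balanced evaluation for an imbalanced
# human-vs-viral classifier (the arrays below are random placeholders).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 64)); y_train = rng.integers(0, 2, 1000)
X_test  = rng.normal(size=(200, 64));  y_test  = rng.integers(0, 2, 200)

# class_weight="balanced" reweights each class by the inverse of its frequency
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
print("balanced accuracy:", balanced_accuracy_score(y_test, clf.predict(X_test)))
```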
-
While the Amino Acid n-grams model, using only sequence length and amino acid combinations, achieves good separation (91.9% AUC), the PLM-based models outperform it and reach an AUC of 99.7% and ∼97% accuracy.
Is this just on the test set or on the combined training and test sets? I would expect it to be just on the test set, but there was a sentence in the methods about how you re-combined these two sets.
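To be concrete, I'd expect the reported numbers to come from something like this, computed on the held-out test split only (arrays below are placeholders):

```python
# Sketch: report AUC and accuracy on the held-out test split only,
# never on recombined train+test data (placeholder labels and scores).
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(1)
y_test = rng.integers(0, 2, 500)                                    # true labels (1 = viral)
p_test = np.clip(y_test * 0.7 + rng.normal(0.2, 0.2, 500), 0, 1)    # placeholder scores

print("test AUC:", roc_auc_score(y_test, p_test))
print("test accuracy:", accuracy_score(y_test, (p_test >= 0.5).astype(int)))
```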
-
Models were trained to predict if a protein is from a human or a virus. Performance was evaluated on the test set.
Can you add the size of each set, as well as the taxonomic diversity of sequences represented?
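Even a short per-split summary would help, e.g. (the dataframe schema here is my guess, not your actual file):

```python
# Sketch: summarize set sizes and taxonomic diversity per split
# (the columns below are an assumed schema, not the paper's actual data file).
import pandas as pd

df = pd.DataFrame({
    "split": ["train", "train", "test", "test"],
    "label": ["human", "virus", "human", "virus"],
    "virus_family": [None, "Herpesviridae", None, "Retroviridae"],
})

sizes = df.groupby(["split", "label"]).size().unstack(fill_value=0)
families = df.dropna(subset=["virus_family"]).groupby("split")["virus_family"].nunique()
print(sizes, "\n\nviral families per split:\n", families)
```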
-
These embeddings were derived from “prottrans_T5_xl_u50”, a PLM with a T5 Transformer architecture and 3 billion parameters (Elnaggar et al., 2022). These embeddings are used as input features for training downstream, non-deep ML models.
I think I'm confused: did you use ESM2 or T5 transformer embeddings?
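My reading is that per-protein embeddings were mean-pooled from the ProtT5 encoder; for reference, this is roughly how the public ProtTrans checkpoint is used (model name and preprocessing follow the Hugging Face ProtTrans examples, so treat the details as assumptions about your exact pipeline):

```python
# Sketch: per-protein ProtT5 embeddings, mean-pooled over residue embeddings.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

name = "Rostlab/prot_t5_xl_half_uniref50-enc"   # public encoder-only ProtT5-XL-U50 release
tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(name).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"          # placeholder sequence
seq = " ".join(re.sub(r"[UZOB]", "X", seq))         # ProtT5 expects space-separated residues
inputs = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    residue_emb = model(**inputs).last_hidden_state  # shape (1, L+1, 1024)

protein_emb = residue_emb[0, :-1].mean(dim=0)        # drop the </s> token, mean-pool -> (1024,)
print(protein_emb.shape)
```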
-
Virus family, genus, and Baltimore classification were downloaded from ViralZone
Can you add the link that this was downloaded from? I've historically had trouble downloading directly from ViralZone
-
Protein-level embeddings were downloaded from UniProt.
What information is this? How can it be downloaded? Embeddings from what model?
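If these are UniProt's precomputed per-protein ProtT5 embeddings (distributed as HDF5 files keyed by accession), something like this reads them; the file name and layout are my assumptions:

```python
# Sketch: reading a UniProt precomputed per-protein embeddings file
# (assumed to be HDF5 with one vector per UniProt accession).
import h5py
import numpy as np

with h5py.File("per-protein.h5", "r") as f:
    accessions = list(f.keys())
    emb = np.asarray(f[accessions[0]])   # one fixed-length vector per protein
    print(accessions[0], emb.shape)
```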
-
always disjoint
This isn't super clear. Did you always have them either all in train or all in test, or did you split them between the two?
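If "always disjoint" means every UniRef cluster goes entirely to train or entirely to test, a group-aware split makes that explicit; a sketch with scikit-learn, assuming a `uniref90_id` column (the column name is hypothetical):

```python
# Sketch: keep every UniRef90 cluster entirely in train OR test, never both.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "accession":   ["P1", "P2", "P3", "P4"],
    "uniref90_id": ["UniRef90_A", "UniRef90_A", "UniRef90_B", "UniRef90_C"],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["uniref90_id"]))

# No cluster appears in both splits
assert set(df.loc[train_idx, "uniref90_id"]).isdisjoint(df.loc[test_idx, "uniref90_id"])
```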
-
1,600
Why this length? I thought the ESM2 length cutoff was shorter.
-
Duplicate sequences at the UniRef90 level were dropped to reduce redundancy
What does this mean? I thought that UniRef90 was already non-redundant at the level of 90% sequence similarity
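My best guess is that each entry was mapped to its UniRef90 cluster and one representative per cluster was kept, along the lines of (column names are hypothetical):

```python
# Sketch: keep one representative sequence per UniRef90 cluster.
import pandas as pd

df = pd.DataFrame({
    "accession":   ["P1", "P2", "P3"],
    "uniref90_id": ["UniRef90_A", "UniRef90_A", "UniRef90_B"],
    "sequence":    ["MKT...", "MKT...", "MAL..."],
})

dedup = df.drop_duplicates(subset="uniref90_id", keep="first")
print(len(df), "->", len(dedup), "sequences after UniRef90-level deduplication")
```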
-
a known vertebrate host were downloaded
How did you define "known vertebrate host"? Did you use UniProt annotations or something else?
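If it came from UniProt's "Virus hosts" annotation, the REST API can filter on it; a sketch assuming the `virus_host_id` query field (the field name and whether it expands over the taxonomy tree are assumptions to check against the current UniProt documentation; 9606 = Homo sapiens):

```python
# Sketch: query UniProt's REST API for reviewed viral proteins annotated
# with a human host. Query/return field names below are assumptions.
import requests

url = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "taxonomy_id:10239 AND virus_host_id:9606 AND reviewed:true",
    "format": "tsv",
    "fields": "accession,organism_name,virus_hosts",
    "size": 5,
}
r = requests.get(url, params=params, timeout=30)
r.raise_for_status()
print(r.text)
```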
-
ength
length
-
adept at mimicking host proteins.
What type of mimicry? This could be:
- Codon usage
- Structural mimicry without high sequence similarity
- Structural mimicry with high sequence similarity
- Epitope mimicry
-
These mechanisms are evolutionarily optimized through sequence adaptation
Which mechanisms, and what do you mean by evolutionarily optimized? The abstract of that citation (Bahir et al. 2009) suggests that, "In contrast, proteins that are known to participate in host‐specific recognition do not necessarily adapt to their respective hosts," meaning that the proteins you discuss here might not have the same adapted codon usage.
-
mimic host proteins
Which host proteins? Does it mimic more than one simultaneously?
-
Gp120 structurally mimics the host cell receptor CD4
It would be very helpful if you could provide a citation for this, and potentially the TM-score or some other measurement of structural similarity. I'm curious how large the structural mimic is (the whole host or virus protein? over how many amino acid residues?) and how much the viral protein mimics the host.
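For quantifying it, a TM-score between the viral and host structures (e.g. PDB or AlphaFold models) would be ideal; a minimal sketch with the tmtools wrapper around TM-align (file paths and chain choices are placeholders):

```python
# Sketch: TM-score between two structures via the tmtools TM-align wrapper
# (the PDB file names below are placeholders).
from tmtools import tm_align
from tmtools.io import get_structure, get_residue_data

chain_a = next(get_structure("viral_protein.pdb").get_chains())
chain_b = next(get_structure("host_protein.pdb").get_chains())

coords_a, seq_a = get_residue_data(chain_a)
coords_b, seq_b = get_residue_data(chain_b)

result = tm_align(coords_a, coords_b, seq_a, seq_b)
# TM-scores normalized by each chain's length; values above ~0.5 suggest the same fold
print(result.tm_norm_chain1, result.tm_norm_chain2, result.rmsd)
```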
-