Protein Language Models Expose Viral Mimicry and Immune Escape

Abstract

Motivation

Viruses elude the immune system through molecular mimicry, adopting biophysical characteristics of their host. We adapt protein language models (PLMs) to differentiate between human and viral proteins. Understanding where the immune system and our models make mistakes could reveal viral immune escape mechanisms.

Results

We applied pretrained deep-learning PLMs to distinguish viral from human proteins. Our predictors achieve state-of-the-art results, with an AUC of 99.7%. We use interpretable error-analysis models to characterize viral escapers. Altogether, mistakes account for 3.9% of the sequences, with viral proteins being disproportionately misclassified. Analysis of external variables, including taxonomy and functional annotations, indicated that errors typically involve proteins with low immunogenic potential, viruses specific to human hosts, and viruses that use reverse transcriptase for replication. Viral families causing chronic infection and immune evasion are further enriched, and their protein mimicry potential is discussed. We provide insights into viral adaptation strategies and highlight the combined potential of PLMs and explainable AI in uncovering mechanisms of viral immune escape, contributing to vaccine design and antiviral research.

Availability and implementation

Data and results are available at https://github.com/ddofer/ProteinHumVir.

Contact

michall@cc.huji.ac.il

Article activity feed

  1. The observed distributions suggest differential immune recognition for viral versus human proteins

    Did you do any sort of statistical test to see if these distributions are actually different?
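
    For concreteness, a nonparametric two-sample test would be a natural way to check this; a minimal sketch (the score arrays below are simulated placeholders, not the paper's actual outputs):

    ```python
    import numpy as np
    from scipy.stats import ks_2samp, mannwhitneyu

    # Simulated stand-ins for per-protein prediction scores; in practice
    # these would be the model outputs for human vs. viral test sequences.
    rng = np.random.default_rng(0)
    human_scores = rng.beta(2, 8, size=18_418)
    viral_scores = rng.beta(5, 5, size=6_699)

    # Mann-Whitney U: are the two distributions stochastically different?
    _, u_p = mannwhitneyu(human_scores, viral_scores, alternative="two-sided")

    # Kolmogorov-Smirnov: maximum distance between the empirical CDFs.
    ks_stat, ks_p = ks_2samp(human_scores, viral_scores)
    print(f"Mann-Whitney p={u_p:.3g}; KS D={ks_stat:.3f}, p={ks_p:.3g}")
    ```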

  2. We found that the fraction of mistakes is higher for genera of human-specific viruses when compared to genera of viruses that infect vertebrates

    Was the human dataset larger for training? I think we have more examples of viruses that infect humans than any other species.

  3. Overall, models more frequently misclassify viruses as human proteins (abbreviated V4H) than the other way around (H4V). Altogether, 9.48% of viral proteins are misclassified (635/6,699), as opposed to only 1.87% of human proteins (H4V, 345/18,418)

    These set sizes are pretty different, so just on average the model will be more accurate if it always predicts human than virus. Did you do anything to balance this out during your training?
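
    One standard remedy, if it wasn't already used, is class weighting during training, paired with a metric that is insensitive to the imbalance; a minimal scikit-learn sketch (the data here are synthetic placeholders with the paper's class sizes):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in: 18,418 "human" vs 6,699 "viral" proteins,
    # each a 64-d feature vector (placeholders, not real embeddings).
    rng = np.random.default_rng(0)
    y = np.array([0] * 18_418 + [1] * 6_699)
    X = rng.normal(size=(len(y), 64)) + y[:, None] * 0.5

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )

    # class_weight="balanced" upweights the minority (viral) class, so the
    # model is not rewarded for defaulting to the majority (human) label.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_tr, y_tr)

    # Balanced accuracy averages per-class recall: a trivial
    # "always predict human" classifier scores only 0.5 here.
    print(balanced_accuracy_score(y_te, clf.predict(X_te)))
    ```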

  4. While the Amino Acid n-grams model, using only sequence length and amino acid combinations, achieves good separation (91.9% AUC), the PLM-based models outperform it, reaching an AUC of 99.7% and ∼97% accuracy.

    Is this just on the test set or the combined training and test? I would expect it to be just on the test but there was a sentence in the methods about how you re-combined these two sets
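
    For what it's worth, reporting strictly on the held-out split is straightforward; continuing the classifier sketch above (clf, X_tr/X_te, y_tr/y_te as defined there):

    ```python
    from sklearn.metrics import roc_auc_score

    # Fit on the training split only, score on the test split only;
    # re-combining the two sets before evaluation would inflate the AUC.
    clf.fit(X_tr, y_tr)
    print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    ```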

  5. Models were trained to predict if a protein is from a human or a virus. Performance was evaluated on the test set.

    Can you add the size of each set, as well as the taxonomic diversity of sequences represented?
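
    A small metadata summary would answer both questions; a sketch assuming a per-sequence table with (hypothetical) split, label, family, and genus columns:

    ```python
    import pandas as pd

    # Hypothetical metadata file: one row per sequence.
    df = pd.read_csv("sequences_metadata.csv")

    # Size and class balance of each split.
    print(df.groupby(["split", "label"]).size().unstack(fill_value=0))

    # Taxonomic diversity: distinct viral families and genera per split.
    viral = df[df["label"] == "virus"]
    print(viral.groupby("split")[["family", "genus"]].nunique())
    ```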

  6. These embeddings were derived from “prottrans_T5_xl_u50”, a T5 Transformer PLM with 3 billion parameters (Elnaggar et al., 2022). These are used as input features for training downstream, non-deep ML models.

    I think I'm confused, so you used ESM2 or you used T5 transformer embeddings?
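
    For reference, per-protein ProtT5 embeddings are typically extracted along these lines; a sketch using the public Hugging Face encoder checkpoint, which may differ in detail from the authors' exact pipeline:

    ```python
    import re
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    # Encoder-only ProtT5-XL-UniRef50 checkpoint (loads in fp32 by default;
    # use .half().cuda() on a GPU for speed). Requires sentencepiece.
    name = "Rostlab/prot_t5_xl_half_uniref50-enc"
    tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(name).eval()

    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
    # ProtT5 expects space-separated residues; map rare residues to X.
    seq = " ".join(re.sub(r"[UZOB]", "X", seq))

    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, L+1, 1024)

    # Mean-pool over residues (dropping the trailing </s> token) to get
    # one fixed-length 1024-d vector per protein for downstream models.
    embedding = hidden[0, :-1].mean(dim=0)
    print(embedding.shape)  # torch.Size([1024])
    ```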

  7. Virus family, genus, and Baltimore classification were downloaded from ViralZone

    Can you add the link that this was downloaded from? I've historically had trouble downloading directly from ViralZone

  8. Duplicate sequences at the UniRef90 level were dropped to reduce redundancy

    What does this mean? I thought that UniRef90 was already non-redundant at the level of 90% sequence similarity
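
    If I read it right, "dropping duplicates at the UniRef90 level" usually means keeping one representative per UniRef90 cluster ID: UniRef90 itself is non-redundant at 90% identity, but a custom human+viral dataset can still contain several members of the same cluster. A sketch with a hypothetical mapping column:

    ```python
    import pandas as pd

    # Hypothetical table: one row per protein, with the UniRef90 cluster
    # it maps to (e.g. from the UniProt "UniRef90" cross-reference).
    df = pd.read_csv("proteins.csv")

    # Keep a single sequence per UniRef90 cluster: afterwards no two
    # retained sequences share >=90% identity (same-cluster members).
    dedup = df.drop_duplicates(subset="uniref90_id", keep="first")
    print(len(df), "->", len(dedup))
    ```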

  9. adept at mimicking host proteins.

    What type of mimicry? This could be:

    • Codon usage
    • Structural mimicry without high sequence similarity
    • Structural mimicry with high sequence similarity
    • Epitope mimicry
  10. These mechanisms are evolutionarily optimized through sequence adaptation

    Which mechanisms, and what do you mean by evolutionarily optimized? The abstract of that citation (Bahir et al. 2009) suggests that, "In contrast, proteins that are known to participate in host‐specific recognition do not necessarily adapt to their respective hosts," meaning that the proteins you discuss here might not have the same adapted codon usage.
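
    As an aside, codon-usage adaptation of the kind Bahir et al. measure boils down to comparing codon frequency profiles; a toy sketch (the coding sequence is a made-up placeholder):

    ```python
    from collections import Counter

    # Toy coding sequence; a real analysis would use each gene's annotated CDS.
    cds = "ATGGCTGCAGCGGCTAAAGCTTAA"

    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    freqs = Counter(codons)

    # Relative codon usage; comparing such profiles between a virus and
    # its host underlies adaptation measures like the Codon Adaptation Index.
    total = sum(freqs.values())
    for codon, n in sorted(freqs.items()):
        print(codon, n / total)
    ```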

  11. Gp120 structurally mimics the host cell receptor CD4

    It would be very helpful if you could provide a citation for this, and potentially the TM-score or some other measurement of structural similarity. I'm curious to know how large the structural mimic is (the whole host or virus protein? over how many amino acid residues?) and how much the viral protein mimics the host.
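
    For reference, a residue-level structural-similarity measurement like the one requested is typically produced with TM-align; a minimal wrapper sketch (the structure file names are placeholders):

    ```python
    import re
    import subprocess

    # Placeholder inputs, e.g. a gp120 structure and human CD4 from the PDB.
    out = subprocess.run(
        ["TMalign", "gp120.pdb", "cd4.pdb"],
        capture_output=True, text=True, check=True,
    ).stdout

    # TM-align reports a TM-score normalized by each chain's length, plus
    # the aligned length; TM-score > 0.5 usually indicates the same fold.
    print(re.findall(r"TM-score=\s*([\d.]+)", out))
    print(re.search(r"Aligned length=\s*(\d+)", out).group(1))
    ```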