Hit or Miss: Understanding Emergence and Absence of Homo-oligomeric Contacts in Protein Language Models
Abstract
Many proteins function not as isolated molecules but as symmetrical assemblies of identical subunits, from ion channels that gate cellular signals to metabolic enzymes that catalyze life’s essential reactions. Here, we show that single-sequence protein language models (pLMs), trained solely on individual protein sequences, implicitly learn the interface contacts of homo-oligomeric assemblies. As models grow larger, their ability to predict inter-subunit contacts continues to improve, whereas the accuracy of single-chain predictions shows only minimal gains. The largest ESM2 model can accurately distinguish genuine biological interfaces from crystallographic artifacts. MSA Pairformer and ESM2-15B perform comparably when broad sets of homologs are used, but restricting alignments to closer evolutionary neighbors reveals a clear difference: MSA Pairformer reaches an interface contact recovery rate of 0.44, compared with 0.33 for ESM2-15B. We hypothesized that one contributing factor is how models implicitly cluster homologous proteins. Comparing evolutionary constraints extracted from pLMs, we find that correlations between homologs decrease as model size increases, consistent with larger models partitioning families into finer-grained subclusters and better separating distinct oligomeric interfaces. When models fail to detect known interfaces, these discrepancies may reflect annotation errors, proteins that adopt multiple assembly conformations, or intrinsic model limitations. Overall, our findings show that statistical patterns learned by pLMs encode key aspects of homo-oligomeric assembly organization and provide a basis for understanding how such interfaces diversify across evolution: although structure is conserved over large evolutionary distances, oligomeric assembly may not be.
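The abstract reports interface contact recovery rates without defining the metric in this excerpt. As a rough illustration only, the sketch below assumes one plausible definition: take the k highest-scoring residue pairs from a predicted contact map, after excluding pairs already explained by intra-chain contacts, where k is the number of true interface-only contacts, and report the fraction of true interface contacts found among them. The function name `interface_contact_recovery`, the `min_separation` filter, and the choice of k are all assumptions for illustration and may differ from the authors' procedure.

```python
from typing import List, Set, Tuple

Pair = Tuple[int, int]


def _norm(pair: Pair) -> Pair:
    """Order a residue pair so that (i, j) and (j, i) compare equal."""
    return (min(pair), max(pair))


def interface_contact_recovery(
    scores: List[List[float]],
    intra_contacts: Set[Pair],
    inter_contacts: Set[Pair],
    min_separation: int = 6,
) -> float:
    """Fraction of interface-only contacts among the top-k predicted pairs.

    Hypothetical definition, not taken from the paper:
    - scores[i][j] is a predicted contact score for residues i and j;
    - interface-only contacts are inter-subunit contacts that are not
      also intra-chain contacts;
    - k equals the number of interface-only contacts.
    """
    intra = {_norm(p) for p in intra_contacts}
    # Keep inter-subunit contacts not explained by the monomer fold,
    # applying the same sequence-separation filter as for candidates.
    target = {
        q
        for q in (_norm(p) for p in inter_contacts)
        if q not in intra and q[1] - q[0] >= min_separation
    }
    if not target:
        return 0.0

    n = len(scores)
    # Candidate pairs: upper triangle, far enough apart in sequence,
    # and not already a known intra-chain contact.
    candidates = [
        (scores[i][j], (i, j))
        for i in range(n)
        for j in range(i + min_separation, n)
        if (i, j) not in intra
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = {pair for _, pair in candidates[: len(target)]}
    return len(top & target) / len(target)
```

Masking intra-chain contacts before ranking reflects the idea that a single-sequence model emits one contact map for both kinds of contact, so strong predictions that the monomer structure does not explain are the candidates for inter-subunit contacts.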