Fine-tuning protein language models on human spatial constraint yields state-of-the-art variant effect prediction
Abstract
Millions of missense variants are present in human genomes, yet the functional consequences of most remain unknown. Here, we introduce Human Spatial Constraint (HSC), a framework for quantifying intraspecies constraint on missense variants that integrates population-scale human genetic variation with 3D protein structures. HSC models the expected frequency of missense variation under neutral evolution and compares it to observed variation, accounting for both variation in mutational processes and structural context. HSC outperforms traditional inter- and intraspecies conservation metrics, as well as unsupervised protein language models (PLMs) such as ESM1b, in predicting pathogenic variants, achieving performance comparable to AlphaMissense. Fine-tuning PLMs on HSC scores improves the prediction of variant fitness across diverse taxa and deep mutational scanning (DMS) functional assay types. Together, these results demonstrate that combining intraspecies constraint with cross-species PLMs improves variant effect interpretation and understanding of protein function.
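As a rough illustration of the observed-versus-expected comparison described above, the sketch below pools missense counts over a 3D neighborhood around each residue and reports the ratio of observed to expected variation. It is not the HSC method itself: the function name `spatial_obs_exp`, the distance cutoff, the simple sum-and-divide scoring, and all input values are assumptions made for illustration. The expected counts are taken as given, whereas the abstract describes them as coming from a model of neutral evolution that accounts for mutational processes.

```python
import math

def spatial_obs_exp(coords, observed, expected, radius=8.0):
    """Per-residue observed/expected missense ratio, pooled over all residues
    whose coordinates lie within `radius` angstroms (hypothetical cutoff)."""
    ratios = []
    for center in coords:
        # Indices of residues in the 3D neighborhood (includes the residue itself).
        nbrs = [j for j, c in enumerate(coords) if math.dist(center, c) <= radius]
        obs = sum(observed[j] for j in nbrs)
        exp = sum(expected[j] for j in nbrs)
        # A ratio below 1 means fewer variants than expected, i.e. constraint.
        ratios.append(obs / exp if exp > 0 else float("nan"))
    return ratios

# Toy input: three residues on a line, 5 angstroms apart. All values are made up.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
observed = [0, 1, 4]          # observed missense variant counts per residue
expected = [2.0, 2.0, 2.0]    # assumed neutral expectation per residue
ratios = spatial_obs_exp(coords, observed, expected, radius=6.0)
```

In this toy case the first residue's neighborhood has a ratio well below 1, which under this simplified scoring would be read as depletion of missense variation relative to the neutral expectation. A real analysis would also need a statistical treatment of sparse counts, which this sketch omits.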