Quantifying Uncertainty in Protein Representations Across Models and Tasks
Abstract
Embeddings derived from language models are widely used as numeric proxies for human language sentences and structured data. In the realm of biomolecules, embeddings serve as efficient sequence and/or structure representations, enabling similarity searches, structure and function prediction, and estimation of biophysical and biological properties. However, relying on embeddings without assessing the model's confidence in its ability to accurately represent molecular properties is a critical flaw, akin to using a scalpel in surgery without verifying its sharpness.
In this study, we propose a means to evaluate how well protein language models represent proteins, assessing their capacity to encode biologically relevant information. Our findings reveal that low-quality embeddings often fail to capture meaningful biology, displaying vector properties indistinguishable from those of randomly generated sequences. A key contributor to this performance issue is the models' failure to learn the underlying biology when training sequences are unevenly distributed across sequence space.
Our novel, model-agnostic scoring framework is, to the best of our knowledge, the first to quantify protein sequence embedding reliability. We believe that our robust approach to screening embeddings prior to making biological inferences stands to significantly enhance the reliability of downstream applications.
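To make the idea of a model-agnostic reliability check concrete, the sketch below illustrates one simple way such a score could work: comparing embeddings of sequences of interest against embeddings of randomly generated sequences, and measuring how separable the two groups are. This is an illustrative toy with synthetic NumPy vectors, not the scoring framework proposed in the paper; the function name `embedding_reliability_score` and the centroid-separation statistic are our own assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def embedding_reliability_score(real_emb, random_emb):
    """Separation between embeddings of real and of randomly generated
    sequences: distance between group centroids divided by the pooled
    within-group spread. A score near zero suggests the embeddings are
    indistinguishable from those of random sequences."""
    mu_real = real_emb.mean(axis=0)
    mu_rand = random_emb.mean(axis=0)
    spread = 0.5 * (real_emb.std(axis=0).mean() + random_emb.std(axis=0).mean())
    return float(np.linalg.norm(mu_real - mu_rand) / (spread + 1e-12))

# Synthetic stand-ins for model embeddings (dimension 64, 100 sequences each).
noise = rng.normal(size=(100, 64))                  # random-sequence embeddings
informative = rng.normal(loc=1.0, size=(100, 64))   # embeddings carrying signal
uninformative = rng.normal(size=(100, 64))          # low-quality embeddings

good_score = embedding_reliability_score(informative, noise)
bad_score = embedding_reliability_score(uninformative, noise)
print(good_score > bad_score)  # informative embeddings separate from noise
```

In this toy setting, embeddings that carry signal score well above those drawn from the same distribution as the random baseline, mirroring the paper's observation that low-quality embeddings have vector properties indistinguishable from random sequences.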