Unbiased learning of protein conformational representation via unsupervised random forest
Abstract
Accurate data representation is paramount in biophysics to capture the functionally relevant motions of biomolecules. Traditional feature selection methods, while effective, often rely on labeled data based on prior knowledge and user supervision, limiting their applicability to novel systems. Here, we present unsupervised random forest (URF), a self-supervised adaptation of traditional random forests that identifies functionally critical features of biomolecules without requiring prior labels. By devising a memory-efficient implementation, we first demonstrate URF’s capability to learn important sets of inter-residue features of a protein and subsequently to resolve its complex conformational landscape, performing on par with or surpassing its traditional supervised counterpart and 15 other leading baseline methods. Crucially, URF is supplemented by an internal metric, the learning coefficient, which automates the process of hyperparameter optimization, making the method robust and user-friendly. URF’s ability to learn important protein features in an unbiased fashion was validated against 10 independent protein systems, including both folded and intrinsically disordered states. In particular, benchmarking investigations showed that the protein representations identified by URF are functionally meaningful when compared with those from current state-of-the-art deep learning methods. As an application, we show that URF can be seamlessly integrated with downstream analysis pipelines such as Markov state models to attain better-resolved outputs. The investigation presented here establishes URF as a leading tool for unsupervised representation learning in protein biophysics.
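To give a concrete sense of how a random forest can rank features without labels, the sketch below uses one well-known generic construction: the real data are contrasted against a synthetic copy whose columns are shuffled independently, and a classifier is trained to tell the two apart. This is an assumption chosen for illustration, not the URF algorithm described in the abstract, and it reproduces neither the memory-efficient implementation nor the learning coefficient. The function names, the toy data, and all parameter values are hypothetical; the code assumes NumPy and scikit-learn are available.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_synthetic_contrast(X, rng):
    """Shuffle each column independently.

    Per-feature marginals are preserved, but correlations between
    features are destroyed.
    """
    return np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )


def unsupervised_rf_importance(X, n_trees=100, seed=0):
    """Rank features by how much they help separate real from shuffled data.

    Generic illustration only (an assumption), not the authors' URF.
    """
    rng = np.random.default_rng(seed)
    X_synth = make_synthetic_contrast(X, rng)

    # Pseudo-labels: 1 for real frames, 0 for shuffled frames.
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.ones(len(X)), np.zeros(len(X_synth))])

    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X_all, y_all)

    # Impurity-based importances, normalized to sum to 1.
    return forest.feature_importances_


def toy_features(n_frames=400, n_features=6, seed=1):
    """Hypothetical stand-in for inter-residue features.

    Columns 0 and 1 are strongly coupled; the rest are independent noise.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_frames, n_features))
    X[:, 1] = X[:, 0] + rng.normal(scale=0.02, size=n_frames)
    return X


if __name__ == "__main__":
    X = toy_features()
    importances = unsupervised_rf_importance(X)
    for j, value in enumerate(importances):
        print(f"feature {j}: importance {value:.3f}")
```

In this toy setup only the two coupled columns carry structure that shuffling destroys, so they are the ones the forest can exploit. With real trajectory data, the rows would be simulation frames and the columns would be features such as inter-residue distances.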