ProteoKnight: Convolution-based phage virion protein classification and uncertainty analysis
Abstract
Introduction: Accurate prediction of phage virion proteins (PVPs) is essential for genomic studies because of their crucial role as structural elements in bacteriophages. Computational tools, particularly machine learning, have emerged to replace tedious traditional methods for annotating phage protein sequences obtained via high-throughput sequencing. However, effective annotation requires specialized sequence encodings that capture distinguishing sequence characteristics. Our paper introduces ProteoKnight, a new image-based encoding method that addresses spatial constraints inherent in existing techniques and yields competitive performance in PVP classification using pre-trained convolutional neural networks (CNNs). Additionally, our study bridges a gap in uncertainty analysis for protein sequence classification by evaluating prediction uncertainty in binary PVP classification with the Monte Carlo Dropout (MCD) technique.

Methods: Our encoding method, ProteoKnight, adapts the classical DNA-Walk algorithm for protein sequences. We enhanced the encoding process by incorporating pixel colors and adjusting walk distances to capture intricate protein features. Encoded sequences were classified using multiple pre-trained CNNs and assessed with standard evaluation metrics. Additionally, variance and entropy measures were used to assess prediction uncertainty across proteins of various classes and lengths, forming the foundation of our investigation into protein classification and uncertainty quantification.

Results: We encoded a benchmark PVP dataset using ProteoKnight and employed pre-trained CNNs for classification. Our experiments highlight the efficacy of our approach in binary classification, achieving prediction performance (90.8% accuracy) comparable to state-of-the-art methods. Nevertheless, multi-class classification accuracy remains suboptimal. Furthermore, our uncertainty analysis reveals variability in prediction confidence influenced by protein class and sequence length, contributing novel insights to protein classification research.

Conclusions: Our study surpasses the sole existing PVP image encoding method, frequency chaos game representation (FCGR), by introducing a novel image encoding that mitigates FCGR's loss of spatial information. Leveraging parameter-efficient CNNs, our classification technique yields accurate and robust PVP predictions. Moreover, our uncertainty investigation identifies data points associated with low-confidence predictions, enhancing the comprehensiveness of our analysis. The research code is available at: https://github.com/eniac00/ProteoKnight
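As an illustration of the uncertainty measures mentioned in the Methods, the following is a minimal sketch of Monte Carlo Dropout for a binary classifier. It is not the authors' implementation: a toy two-layer NumPy network stands in for the pre-trained CNNs, and the function names (`mc_dropout_predict`, `uncertainty_summary`), the dropout rate, and the number of passes are all assumptions chosen for illustration. It also assumes one common formulation, in which variance is taken across the stochastic passes and entropy is computed from the mean predicted probability.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, n_passes=50, drop_rate=0.5, seed=0):
    """Run n_passes stochastic forward passes with dropout kept active.

    x: (n_samples, n_features), w1: (n_features, n_hidden), w2: (n_hidden,)
    Returns probabilities of shape (n_passes, n_samples).
    The toy network here is a hypothetical stand-in for a CNN.
    """
    rng = np.random.default_rng(seed)
    keep = 1.0 - drop_rate
    probs = np.empty((n_passes, x.shape[0]))
    for t in range(n_passes):
        hidden = np.maximum(x @ w1, 0.0)          # ReLU hidden layer
        mask = rng.random(hidden.shape) < keep    # fresh dropout mask per pass
        hidden = hidden * mask / keep             # inverted-dropout scaling
        logits = hidden @ w2
        probs[t] = 1.0 / (1.0 + np.exp(-logits))  # sigmoid for the positive class
    return probs

def uncertainty_summary(probs, eps=1e-12):
    """Summarize MC Dropout outputs per sample.

    Returns (mean probability, variance across passes,
    binary entropy of the mean probability in bits).
    """
    mean_p = probs.mean(axis=0)
    variance = probs.var(axis=0)
    p = np.clip(mean_p, eps, 1.0 - eps)           # avoid log(0)
    entropy = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return mean_p, variance, entropy

# Toy usage with random inputs and weights (illustrative only).
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=16)
mean_p, variance, entropy = uncertainty_summary(mc_dropout_predict(x, w1, w2))
```

Under this formulation, samples with high variance or entropy near 1 bit would be flagged as low-confidence predictions, which could then be grouped by protein class or sequence length.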