Benchmarking Recent Computational Tools for DNA-binding Protein Identification
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control and various cellular processes. In this paper, we conduct an unbiased benchmarking of eleven state-of-the-art computational tools as well as traditional tools such as ScanProsite, BLAST, and HMMER for identifying DBPs. We highlight the data leakage issue in conventional datasets leading to inflated performance. We introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques and training methods; and recommend solutions regarding these issues. We show that combining the predictions of the two best computational tools with BLAST based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are available at: https://github.com/Rafeed-bot/DNA_BP_Benchmarking .
Key Points
-
We designed a comprehensive evaluation pipeline which systematically evaluates eleven recent machine learning (ML) based DBP identification tools.
-
We analyzed the test prediction mistakes made by top-performing tools identifying their potential limitations in terms of model architecture, feature extraction and class balancing.
-
We showed that although the best of these tools do not convincingly outperform BLAST, they still provide substantial value when integrated together with BLAST into a simple majority-voting ensemble.
-
We provide recommendations on more robust development & evaluation and better usability of future tools.
-
We provide the two best-performing ML-based tools, BLAST and the ensemble method as user-friendly software, as well as our proposed datasets, publicly available via GitHub.