Benchmarking Recent Computational Tools for DNA-binding Protein Identification

Xizi Luo
Amadeus Song Yi Chi
Andre Huikai Lin
Tze Jet Ong
Limsoon Wong
Chowdhury Rafeed Rahman

Abstract

Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control, and various other cellular processes. In this paper, we conduct an unbiased benchmarking of eleven state-of-the-art computational tools, as well as traditional tools such as ScanProsite, BLAST, and HMMER, for identifying DBPs. We highlight the data leakage issue in conventional datasets, which leads to inflated performance, and we introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques, and training methods, and recommend solutions to these issues. We show that combining the predictions of the two best computational tools with BLAST-based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are available at: https://github.com/Rafeed-bot/DNA_BP_Benchmarking

Key Points

We designed a comprehensive evaluation pipeline that systematically evaluates eleven recent machine learning (ML)-based DBP identification tools.
We analyzed the test prediction mistakes made by the top-performing tools, identifying their potential limitations in terms of model architecture, feature extraction, and class balancing.
We showed that although the best of these tools do not convincingly outperform BLAST, they still provide substantial value when integrated with BLAST into a simple majority-voting ensemble.
We provide recommendations for more robust development and evaluation, and for better usability, of future tools.
We provide the two best-performing ML-based tools, BLAST, and the ensemble method as user-friendly software, and make them and our proposed datasets publicly available via GitHub.
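
The simple majority-voting ensemble mentioned in the Key Points can be illustrated with a minimal sketch. This is not the authors' released software: the function name `majority_vote`, the predictor labels `tool_a`, `tool_b`, and `blast`, and the example labels are all hypothetical, and each predictor is assumed to have already produced a binary call per protein (1 = DBP, 0 = non-DBP).

```python
from typing import Dict, List


def majority_vote(predictions: Dict[str, List[int]]) -> List[int]:
    """Combine binary per-protein calls (1 = DBP, 0 = non-DBP) by simple majority.

    `predictions` maps a predictor name (hypothetical labels here) to its
    list of calls, one per protein, all in the same protein order.
    """
    names = list(predictions)
    # An odd number of voters (e.g. two ML tools plus BLAST) rules out ties.
    if len(names) % 2 == 0:
        raise ValueError("use an odd number of predictors to avoid ties")
    n_proteins = len(predictions[names[0]])
    if any(len(predictions[name]) != n_proteins for name in names):
        raise ValueError("all predictors must score the same proteins")
    threshold = len(names) // 2 + 1  # strict majority of the voters
    per_protein_votes = zip(*(predictions[name] for name in names))
    return [1 if sum(votes) >= threshold else 0 for votes in per_protein_votes]


# Made-up calls for four proteins, purely for illustration.
example = {
    "tool_a": [1, 0, 1, 0],
    "tool_b": [1, 1, 0, 0],
    "blast": [0, 1, 1, 0],
}
consensus = majority_vote(example)
print(consensus)
```

With three voters, a protein is labeled a DBP whenever at least two of the three agree, so a single predictor's mistake is overridden as long as the other two are correct.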

Version published to 10.1101/2024.09.01.610735 on bioRxiv
Sep 3, 2024
