Species-Specific Protein Function Prediction in Flavobacterium covae Using Ensemble Machine Learning

Zaidur Rahman
Harun Pirim
Larry Hanson
Matt Griffin
Hasan Tekedar

Abstract

Protein function prediction remains a critical challenge in computational biology, particularly for species-specific applications. This study presents a machine learning framework for predicting protein functions in Flavobacterium covae, a Gram-negative bacterium responsible for columnaris disease in channel catfish. We formulate the problem as a multi-label classification task where each protein sequence is associated with multiple Gene Ontology (GO) terms. Our approach integrates four feature groups: homologous sequence information from BLAST searches, essential gene properties from the Database of Essential Genes (DEG), subcellular localization predictions from PSORTb, and physicochemical properties derived from protein sequences. We evaluate three ensemble learning algorithms (Random Forest, XGBoost, and AdaBoost) on a dataset of 69,960 protein sequences with 1,868 GO term categories. Random Forest and XGBoost achieved accuracies exceeding 90%, with XGBoost demonstrating superior performance across all metrics (accuracy: 90.50%, precision: 93.92%, recall: 92.23%, F1-score: 92.67%). The models successfully predicted functions for over 99% of previously unannotated hypothetical proteins, substantially outperforming existing tools such as PANNZER. This species-specific approach provides insights into F. covae pathogenicity and demonstrates the efficacy of integrating diverse biological features for protein function prediction in understudied organisms.
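To make the multi-label framing concrete, the sketch below encodes each protein's GO terms as a 0/1 indicator vector and computes example-based accuracy, precision, recall, and F1. It is an illustration only: the abstract does not state how the reported metrics were averaged, so the per-protein (example-based) averaging here is an assumption, and the protein IDs, GO assignments, and function names (`binarize`, `example_based_scores`) are hypothetical toy values, not the authors' data or code.

```python
from typing import Dict, List, Set


def binarize(annotations: Dict[str, Set[str]], vocabulary: List[str]) -> Dict[str, List[int]]:
    """Turn each protein's GO-term set into a 0/1 indicator vector over the vocabulary."""
    return {
        protein_id: [1 if term in terms else 0 for term in vocabulary]
        for protein_id, terms in annotations.items()
    }


def example_based_scores(y_true: List[Set[str]], y_pred: List[Set[str]]) -> Dict[str, float]:
    """Average per-protein scores (assumed averaging scheme, not confirmed by the source)."""
    n = len(y_true)
    exact = precision = recall = f1 = 0.0
    for truth, pred in zip(y_true, y_pred):
        hits = len(truth & pred)  # GO terms predicted correctly for this protein
        exact += 1.0 if truth == pred else 0.0
        precision += hits / len(pred) if pred else 0.0
        recall += hits / len(truth) if truth else 0.0
        f1 += 2 * hits / (len(truth) + len(pred)) if (truth or pred) else 0.0
    return {
        "subset_accuracy": exact / n,
        "precision": precision / n,
        "recall": recall / n,
        "f1": f1 / n,
    }


# Toy annotations for two hypothetical proteins (illustrative, not from the study).
vocabulary = ["GO:0003677", "GO:0005524", "GO:0016020"]
truth = {"P1": {"GO:0005524", "GO:0016020"}, "P2": {"GO:0003677"}}
predicted = {"P1": {"GO:0005524"}, "P2": {"GO:0003677"}}

indicator = binarize(truth, vocabulary)
scores = example_based_scores(
    [truth[p] for p in sorted(truth)],
    [predicted[p] for p in sorted(truth)],
)
print(indicator)
print(scores)
```

Under this averaging scheme, F1 is computed per protein and then averaged, so it need not equal the harmonic mean of the averaged precision and recall.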

Version published to 10.21203/rs.3.rs-8107275/v1 on Research Square
Nov 14, 2025
