Enhanced Identification of Key Bacterial Motility Genes via a Cross-Species Genomic Hybrid Feature Machine Learning Approach
Abstract
Efficient and accurate identification of functional genes is critical to biological research, yet traditional single-species approaches are often limited by low efficiency. Previously, we established a method for identifying key genes using cross-species protein domain features and machine learning. However, because a single domain is often shared by many genes, domain-level predictions leave a substantial workload for subsequent experimental validation. This study proposes an enhanced approach that integrates EggNOG-based protein sequence annotation with domain analysis. Protein sequences are first annotated with EggNOG; sequences that remain unannotated are then analyzed for protein domains, generating a comprehensive "direct gene annotation plus domain" hybrid feature matrix. Although the hybrid matrix model yielded only a modest improvement in predictive accuracy, it substantially enhanced feature resolution: the top 50 predicted features were all known motility-related genes or domains. Furthermore, among the top 100 ranked genes, 58 are confirmed by experimental evidence to be directly related to motility. These results demonstrate that the new method improves the precision of key gene identification. The quantity and accuracy of functional genes identified in a single analysis far exceed those of existing single-species methods, providing a highly efficient solution for mining key genes underlying other complex bacterial phenotypes.
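As a rough illustration of the "direct gene annotation plus domain" hybrid feature matrix described in the abstract, the sketch below builds one row per genome. It is not the authors' pipeline. It assumes that a protein with a gene-level annotation contributes a single gene feature, that an unannotated protein contributes its domains instead, and that features are encoded as presence/absence (the abstract does not state the encoding). All gene names, domain identifiers, species labels, and function names are hypothetical placeholders.

```python
from typing import Dict, List, Optional, Set, Tuple

# One protein = (gene-level annotation or None, list of domain identifiers).
# This record layout is an assumption made for illustration only.
Protein = Tuple[Optional[str], List[str]]


def hybrid_features(proteins: List[Protein]) -> Set[str]:
    """Collect hybrid features for one genome.

    Annotated proteins contribute a 'gene:' feature; only proteins
    without an annotation fall back to 'domain:' features.
    """
    features: Set[str] = set()
    for annotation, domains in proteins:
        if annotation:
            features.add(f"gene:{annotation}")
        else:
            features.update(f"domain:{d}" for d in domains)
    return features


def build_matrix(
    genomes: Dict[str, List[Protein]]
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return sorted feature columns and a 0/1 row per genome (assumed encoding)."""
    per_genome = {name: hybrid_features(p) for name, p in genomes.items()}
    columns = sorted(set().union(*per_genome.values()))
    rows = {
        name: [1 if col in feats else 0 for col in columns]
        for name, feats in per_genome.items()
    }
    return columns, rows


# Hypothetical toy input: two genomes with placeholder annotations and domains.
genomes = {
    "species_A": [("fliC", ["domZ"]), (None, ["domX"])],
    "species_B": [("motA", []), (None, ["domX", "domY"])],
}
columns, rows = build_matrix(genomes)
for name, row in rows.items():
    print(name, dict(zip(columns, row)))
```

In this toy input, the domain of the annotated protein (`domZ`) does not become a column, which reflects the idea that domains are consulted only for sequences lacking a direct annotation. The resulting rows could then be paired with a motility phenotype label per genome and passed to any classifier that reports feature importance.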