Fungi classification from metagenomic data using CNN_FunBar: A simulation study
Abstract
Fungal classification from metagenomic datasets remains a highly challenging task due to the inherent complexity of eukaryotic genomes. The internal transcribed spacer (ITS) region is widely recognised as a reliable DNA marker for fungal taxonomic identification. Previously, a convolutional neural network (CNN)-based architecture, CNN_FunBar, was developed and demonstrated high efficiency in classifying fungal ITS sequences across multiple taxonomic levels. In the present study, two CNN_FunBar classifiers, namely Genus_Model.h5 and Species_Model.h5, were retrained using the UNITE+INSDC reference ITS datasets to classify 429 distinct fungal species and 1,293 genera. The retrained models achieved average accuracies exceeding 90% and 95% for genus- and species-level classification, respectively, on the test dataset. To further evaluate model performance, an agricultural soil metagenome was simulated using 130 microbial whole genomes (65 bacterial and 65 fungal genomes) through the MetaSim read simulator, followed by contig assembly using MEGAHIT. Complete ITS regions were extracted from the assembled contigs using the ITSx tool. The extracted ITS sequences were then classified at the genus and species levels using Genus_Model.h5 and Species_Model.h5, respectively, based on hexamer features. The results demonstrated that Species_Model.h5 and Genus_Model.h5 achieved classification accuracies of 91.93% and 95.16%, respectively, correctly identifying 62 species and 41 genera from the simulated metagenomic dataset. This study provides valuable insights and practical tools for researchers working in computational biology and metagenomics.
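The abstract says the ITS sequences were classified "based on hexamer features" but gives no further detail. As a rough illustration only, the sketch below shows one common way to turn a DNA sequence into a fixed-length hexamer (6-mer) frequency vector. The function name `hexamer_features`, the lexicographic ordering of the 4,096 hexamers, the normalisation to relative frequencies, and the skipping of windows that contain ambiguous bases are all assumptions made for this example; the encoding actually used by CNN_FunBar may differ.

```python
from itertools import product

ALPHABET = "ACGT"
K = 6  # hexamer length

# Fixed lexicographic ordering of all 4**6 = 4096 possible hexamers
# (the ordering is an assumption made for this sketch).
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}


def hexamer_features(sequence: str) -> list[float]:
    """Return a length-4096 list of relative hexamer frequencies.

    Hypothetical helper: windows containing characters other than
    A, C, G or T (for example N) are skipped rather than counted.
    """
    seq = sequence.upper()
    counts = [0.0] * len(KMER_INDEX)
    # Slide a window of width K across the sequence, one base at a time.
    for start in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[start:start + K])
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts)
    if total == 0:
        return counts  # sequence too short or no unambiguous window
    return [c / total for c in counts]


if __name__ == "__main__":
    vec = hexamer_features("ACGTACGTTGCA")
    print(len(vec), sum(1 for v in vec if v > 0))
```

A vector of this kind has the same length for every input sequence, which is what allows ITS regions of different lengths to be fed to a fixed-size model input.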