A machine learning framework for interpreting phylogenetic tree patterns in interkingdom horizontal gene transfer



Abstract

Horizontal gene transfer (HGT), the movement of genetic material between unrelated organisms, is widely recognized as an important driver of genome evolution in bacteria. In eukaryotes, however, the evolutionary impact of HGT remains debated. Identifying interkingdom HGT (iHGT) is especially challenging because no gold-standard method exists.

Traditionally, iHGT identification has relied on manual inspection of phylogenetic trees, a process that is subjective, difficult to reproduce, and not scalable to large datasets. In this study, we present a computational framework that formalizes phylogenetic tree interpretation as a supervised machine-learning problem. We define five recurrent phylogenetic patterns that capture both clear and ambiguous evolutionary scenarios: iHGT, NoHGT, Limited donor evidence, Multiple major clades (Multiple MC), and Patchy phylogeny.

To operationalize these patterns, we developed a feature-extraction pipeline that quantifies taxonomic composition and phylogenetic topology using seven biological descriptors derived from gene trees. These features were used to train and evaluate multiple machine-learning models, among which a Random Forest (RF) classifier achieved the best performance (AUC–ROC = 0.98; accuracy = 0.89). Model interpretability analyses revealed that topological distance to additional clades and lineage diversity are the most informative predictors, reflecting key signals used in expert-driven phylogenetic interpretation.
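As a rough illustration of what quantifying taxonomic composition and topology from a gene tree can look like, the sketch below computes two kinds of descriptors for a toy tree: the fraction of leaves per kingdom, and the number of edges between a query leaf and the nearest other leaf of each kingdom. These are not the seven descriptors used in the study, which the abstract does not enumerate. The tree, the `Kingdom|taxon` leaf labels, and the function names are all hypothetical.

```python
from collections import Counter, deque

# Toy rooted gene tree as nested tuples; leaves are "Kingdom|taxon" (hypothetical labels).
TREE = (
    (("Fungi|query", "Bacteria|b1"), "Bacteria|b2"),
    (("Fungi|f1", "Fungi|f2"), "Metazoa|m1"),
)


def build_graph(tree):
    """Turn the nested-tuple tree into an undirected adjacency map plus a leaf index."""
    adjacency, leaves = {}, {}

    def visit(node, parent_id):
        node_id = len(adjacency)
        adjacency[node_id] = []
        if parent_id is not None:
            adjacency[node_id].append(parent_id)
            adjacency[parent_id].append(node_id)
        if isinstance(node, str):
            leaves[node] = node_id
        else:
            for child in node:
                visit(child, node_id)

    visit(tree, None)
    return adjacency, leaves


def kingdom_of(leaf):
    return leaf.split("|", 1)[0]


def extract_features(tree, query):
    """Illustrative descriptors only: kingdom fractions and nearest-leaf edge distances."""
    adjacency, leaves = build_graph(tree)

    # Breadth-first search from the query leaf gives edge counts to every node.
    dist = {leaves[query]: 0}
    queue = deque([leaves[query]])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)

    counts = Counter(kingdom_of(leaf) for leaf in leaves)
    total = sum(counts.values())
    features = {}
    for kingdom, n in counts.items():
        features[f"frac_{kingdom}"] = n / total
        others = [dist[i] for leaf, i in leaves.items()
                  if leaf != query and kingdom_of(leaf) == kingdom]
        # None when the query is the only leaf of that kingdom.
        features[f"dist_{kingdom}"] = min(others) if others else None
    return features


features = extract_features(TREE, "Fungi|query")
print(features)
```

In this toy tree the fungal query sits closer to bacterial leaves than to other fungal leaves, which is the kind of signal a classifier could pick up. A fixed-length vector of such values per gene tree could then be passed to any supervised classifier, such as a Random Forest.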

The RF model was further validated using 1,000 simulated phylogenies and 1,438 real iHGT candidates, achieving low misclassification rates (7.8% and 10.43%, respectively). Benchmarking against AVP (Alienness vs. Predictor), a comparable tool for iHGT detection, demonstrated improved performance across all evaluation metrics, highlighting the advantages of incorporating global phylogenetic structure into the classification process. This study provides a reproducible and scalable framework for phylogenetic pattern classification that captures complex evolutionary signals while maintaining biological interpretability. Beyond improving iHGT detection, the approach offers a more nuanced representation of evolutionary scenarios by explicitly accounting for inconclusive cases, supporting more robust inference in comparative genomics.
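For readers unfamiliar with the metric, a misclassification rate is the share of trees whose predicted pattern differs from the reference pattern. The sketch below shows the calculation on five made-up labels. The label lists are invented for illustration and are not data from the study, and the helper names are hypothetical.

```python
from collections import Counter


def misclassification_rate(true_labels, predicted_labels):
    """Fraction of items whose predicted label differs from the reference label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    errors = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return errors / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count (reference, predicted) pairs to see which patterns get confused."""
    return Counter(zip(true_labels, predicted_labels))


# Invented toy labels using the five pattern names from the abstract.
reference = ["iHGT", "NoHGT", "Limited donor evidence", "Multiple MC", "Patchy phylogeny"]
predicted = ["iHGT", "NoHGT", "Limited donor evidence", "Patchy phylogeny", "Patchy phylogeny"]

rate = misclassification_rate(reference, predicted)
pairs = confusion_counts(reference, predicted)
print(f"misclassification rate: {rate:.1%}")
print(pairs[("Multiple MC", "Patchy phylogeny")])
```

The pair counts show which patterns are confused with which, and that is often more informative than the single rate when some classes are deliberately ambiguous.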

Author summary

Horizontal gene transfer is the movement of genes between unrelated organisms rather than through normal inheritance from parent to offspring. While this process is known to play a major role in bacterial evolution, its importance in complex organisms such as fungi, plants, and animals remains debated. One reason for this uncertainty is that identifying these events often depends on manually interpreting phylogenetic trees, a process that can be subjective, difficult to reproduce, and impractical for analyzing the rapidly growing amount of genomic data.

In this study, we developed a computational framework that transforms phylogenetic tree interpretation into a machine-learning problem. Instead of simply classifying genes as transferred or non-transferred, our approach recognizes several distinct evolutionary scenarios, including cases where the evidence is ambiguous or inconclusive. To achieve this, we extracted biologically meaningful features from phylogenetic trees describing evolutionary relationships and taxonomic diversity, and used them to train machine-learning models capable of recognizing recurrent phylogenetic patterns.

Our framework successfully classified complex evolutionary scenarios and outperformed an existing automated method for interkingdom horizontal gene transfer detection. More broadly, this work demonstrates how expert-driven evolutionary reasoning can be translated into scalable and reproducible computational approaches. As genomic datasets continue to expand, such methods may help improve evolutionary inference and support more rigorous comparative genomics analyses.
