Preserving Hidden Hierarchical Structure: Poincaré Distance for Enhanced Genomic Sequence Analysis
Abstract
The analysis of large volumes of molecular (genomic, proteomic, etc.) sequences has become a significant research field, especially after the recent coronavirus pandemic. Although machine learning (ML) has proven beneficial to sequence analysis, it is not without its difficulties, particularly when the feature space becomes high-dimensional. While most ML models operate with the conventional Euclidean distance, the hidden hierarchical structure present among a set of phylogenetically related sequences is difficult to represent in Euclidean space without losing much information or requiring many dimensions. Since such hierarchical structure can be informative for analysis tasks such as clustering and classification, we propose two measures for generating a distance matrix from a set of sequences based on distance in the Poincaré disk model of hyperbolic geometry, or the Poincaré distance, for short. Such a distance measure allows even a fully resolved phylogenetic tree to be embedded in just two dimensions with minimal distortion of its hierarchical structure. Our first approach is based purely on the classical Poincaré distance, while the second modifies this distance by combining the Euclidean norms and the dot product of the sequence representations. A thorough analysis of both measures demonstrates their superiority in a variety of genomic and proteomic sequence classification tasks in terms of efficiency, accuracy, predictive performance, and the capacity to capture significant sequence correlations. These approaches perform better than existing state-of-the-art methods across the majority of evaluation metrics.
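To make the first measure concrete, the following is a minimal sketch of how a distance matrix could be built from the classical Poincaré distance, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))), which is defined for points strictly inside the unit ball. The abstract does not say how sequence representations are mapped into the ball, so the `scale_into_ball` helper and its `margin` parameter are hypothetical choices made only for illustration, as are all function names. The modified second measure is not specified here and is not reproduced.

```python
import math


def poincare_distance(u, v):
    """Classical Poincaré distance between two points strictly inside the unit ball."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("both points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def scale_into_ball(vectors, margin=1e-3):
    """Assumed preprocessing (not from the paper): rescale all vectors by one
    common factor so that the largest norm becomes 1 - margin."""
    max_norm = max(math.sqrt(sum(a * a for a in vec)) for vec in vectors)
    if max_norm == 0.0:
        return [list(vec) for vec in vectors]
    factor = (1.0 - margin) / max_norm
    return [[a * factor for a in vec] for vec in vectors]


def poincare_distance_matrix(vectors):
    """Symmetric pairwise Poincaré distance matrix as a list of lists."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = poincare_distance(vectors[i], vectors[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist


# Toy numeric vectors standing in for sequence representations.
embeddings = scale_into_ball([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
matrix = poincare_distance_matrix(embeddings)
```

A matrix like this could then be passed to any clustering or classification method that accepts precomputed distances. Note that distances grow rapidly as points approach the boundary of the ball, so the choice of scaling has a strong effect on the result.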