Genome-wide Pervasiveness and Localized Variation of k-mer-based Genomic Signatures in Eukaryotes

Abstract

Genomic signatures are taxon-specific patterns in nucleotide sequence composition observed across different regions of a genome, used in the taxonomic classification of organisms and in inferring their evolutionary relationships. However, the nature and extent of the pervasiveness of a genomic signature across the expanse of a Telomere-to-Telomere (T2T) assembly, especially across functionally diverse sequence elements and highly repetitive regions, remain counterintuitive and underexplored. This study aims to bridge this knowledge gap by systematically investigating the pervasiveness and variation of the genomic signature across the human genome and the genomes of three other eukaryotic species from different kingdoms. Using the alignment-free k-mer-based Frequency Chaos Game Representation (FCGR) of DNA sequences, this study qualitatively and quantitatively analyzes the variation of the genomic signature along an entire genome. Qualitative analysis is first performed through visual inspection of FCGR patterns across different chromosomes of a species. In parallel, a quantitative analysis evaluates the variation of the genomic signature within a genome by comparing eight distance measures to identify the optimal one for the datasets in this study. By taking an intragenomic perspective with detailed analysis of chromosome landscapes, these analyses reveal that, while the genomic signature is preserved in most genomic regions, exceptions exist in localized regions, such as tandem repetitions of short and long repeat units. Having established this pervasiveness, we assemble novel pipelines aimed at selecting a short contiguous representative genomic segment that encapsulates the sequence composition patterns characteristic of the entire genome.
These representative segments are then used to assess intragenomic variation of the genomic signature, demonstrating that only a small proportion of segments (namely those characterized by regional density of short and long tandem repeats) show high distance values from the representative. Notably, in the human genome, 80% of the segments have a distance of less than 0.24 (on a [0,1] DSSIM scale) from the representative. Moreover, we demonstrate that using these representative segments improves downstream tasks, e.g., increasing one-nearest-neighbor (1-NN) taxonomic classification accuracy by 7% compared to selecting a random genomic segment to serve as a proxy of the genome. Lastly, this study presents a special-purpose graphical user interface (GUI) software tool, CGR-Diff, designed to provide both visual and quantitative comparisons of FCGRs of sample or user-provided DNA sequences, thereby facilitating analysis of intragenomic variation of the genomic signature within and across species.
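
To make the FCGR idea concrete, the following is a minimal sketch of how a k-mer frequency matrix can be built from a DNA sequence and how two such matrices can be compared. It is not the authors' pipeline and not the CGR-Diff implementation. The corner assignment (A, C, G, T), the function names `fcgr` and `half_l1_distance`, and the choice of a normalized L1 distance are all assumptions made for illustration; in particular, the distance shown here is a simple stand-in on a [0,1] scale and is not DSSIM.

```python
import numpy as np

# Assumed corner assignment for the chaos game square; conventions vary.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Count k-mers of `seq` into a 2^k x 2^k grid (one cell per k-mer)."""
    size = 2 ** k
    grid = np.zeros((size, size), dtype=np.int64)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous symbols such as N.
        if any(base not in CORNERS for base in kmer):
            continue
        x = y = 0
        for j, base in enumerate(kmer):
            bx, by = CORNERS[base]
            # The last base sets the most significant bit, so it selects
            # the top-level quadrant, as in the chaos game construction.
            x |= bx << j
            y |= by << j
        grid[y, x] += 1
    return grid


def half_l1_distance(a, b):
    """Illustrative [0,1] distance between two FCGRs (not DSSIM)."""
    if a.sum() == 0 or b.sum() == 0:
        raise ValueError("FCGR contains no counted k-mers")
    pa = a / a.sum()  # convert counts to relative frequencies
    pb = b / b.sum()
    return 0.5 * float(np.abs(pa - pb).sum())


if __name__ == "__main__":
    segment_a = fcgr("ACGTACGTTTGACCA", 3)
    segment_b = fcgr("TTTTTTTTTTTTTTT", 3)
    print(half_l1_distance(segment_a, segment_b))
```

Under these assumptions, a segment dominated by a tandem repeat concentrates its counts in a few cells, so its distance from a compositionally typical segment tends to be high, which mirrors the localized exceptions described in the abstract.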
