The shape of evolution: persistent homology of genetic-distance data as an observable of reticulate processes in pathogen and plant genomes


Abstract

Evolutionary biology has long theorized processes (recombination, lineage divergence, drug-resistance sweeps, introgression, refugial persistence) whose signatures in genomic data are incompatible with tree structure. We argue that the shape of genetic-distance data, formalized through simplicial complexes and quantified through persistent homology, is a direct observable of these processes. The Vietoris–Rips filtration of a genetic-distance matrix yields the Betti numbers β₀ (connected components), β₁ (loops), and β₂ (cavities); we read β₁ not as a literal count of recombination events but as a quantity that is monotone in effective recombination above a sampling-dependent geometric baseline, and we organise the resulting shapes into a four-letter alphabet of topological primitives (K₁ clonal, K₂ divergence, K₃ reticulation, K₄ higher-order reticulation). Coalescent and Wright–Fisher simulations establish the two load-bearing claims: β₁ rises monotonically with the recombination rate over six orders of magnitude, and persistent-homology features separate reticulate from non-reticulate histories with 98–100% recall (the residual confusion falls entirely within the non-reticulate K₁/K₂ pair, which β₁ does not distinguish). We then apply the pipeline to four empirical systems. (i) On the MalariaGEN Pf7 Plasmodium falciparum dataset (n = 20,864, 33 countries), per-population β₁ spans two orders of magnitude and diverges significantly from a label-shuffle null (median 20.5, range 8–32); the ordering runs opposite to recombination rate, with freely recombining African populations lowest and clonal or swept Southeast Asian and Papuan populations highest, because at the population scale β₁ is dominated by demographic structure rather than recombination rate, a point we reconcile explicitly with the controlled dose-response.
(ii) Colombian Cauca SP-resistant samples carry β₁ = 12 against a near-clonal SP-sensitive baseline of β₁ = 5 (and two orders of magnitude more total persistence), placing them in the high-β₁, multi-origin band of K₃, consistent with resistance carried on several genomic backgrounds. (iii) The Cambodia artemisinin sweep (2008–2018) traces a K₃ → K₁ trajectory, with β₁ rising to a mid-sweep peak of 45 and collapsing to 13 at fixation; to our knowledge this is the first direct observation of a selective-sweep transient in topological coordinates, with the caveat that the per-bin values are medians of three subsamples with wide bars. (iv) On Arabidopsis thaliana 1001 Genomes data, Iberian relict populations (Spain, β₁/n = 0.64) exceed post-glacial-expansion populations (Sweden 0.54; United Kingdom 0), generalising the framework beyond pathogens. A P. falciparum mitochondrial negative control recovers β₁ = 0 across all subsamples, establishing pipeline specificity. Moving above the 1-skeleton, β₂ is zero at the clonal/expansion limits and positive across the reticulate systems; a controlled two-vs-three-way admixture simulation confirms that β₂ separates regimes that share a β₁ profile. The further suggestion that the ratio η = β₂/β₁ separates microevolutionary from macroevolutionary timescales is presented as a hypothesis for future testing, given the small number of systems and the absence of a β₂ null. Together these results demonstrate that the topology of genetic-distance data is an evolutionary observable, with immediate implications for drug-resistance surveillance in P. falciparum.
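
To make the reading of β₀ and β₁ concrete, the following is a minimal sketch, not the authors' pipeline. It builds the Vietoris–Rips complex of a toy Hamming-distance matrix at a single threshold and computes β₀ and β₁ from boundary-matrix ranks over GF(2). The function names (`gf2_rank`, `hamming_matrix`, `betti_at_scale`) and the two-site haplotypes are illustrative assumptions. Full persistent homology tracks these counts across all thresholds, which this fixed-scale sketch does not attempt.

```python
from itertools import combinations


def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are Python ints used as bitsets."""
    pivots = {}  # highest set bit -> stored row with that leading bit
    rank = 0
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                rank += 1
                break
            row ^= pivots[high]  # eliminate the leading bit and continue
    return rank


def hamming_matrix(seqs):
    """Pairwise Hamming distances between equal-length haplotype strings."""
    return [[sum(a != b for a, b in zip(s, t)) for t in seqs] for s in seqs]


def betti_at_scale(dist, eps):
    """Return (beta0, beta1) of the Vietoris-Rips complex at threshold eps."""
    n = len(dist)
    # Edges: pairs of samples at distance <= eps.
    edges = [(i, j) for i, j in combinations(range(n), 2) if dist[i][j] <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    # Triangles: triples whose three pairwise edges are all present.
    triangles = [
        (i, j, k)
        for i, j, k in combinations(range(n), 3)
        if (i, j) in edge_index and (i, k) in edge_index and (j, k) in edge_index
    ]
    # Boundary of an edge is its two endpoints (bitset over vertices).
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    # Boundary of a triangle is its three edges (bitset over edges).
    d2 = [
        (1 << edge_index[(i, j)]) | (1 << edge_index[(i, k)]) | (1 << edge_index[(j, k)])
        for i, j, k in triangles
    ]
    r1, r2 = gf2_rank(d1), gf2_rank(d2)
    beta0 = n - r1               # connected components
    beta1 = len(edges) - r1 - r2  # independent loops not filled by triangles
    return beta0, beta1


# Toy example (hypothetical data): all four two-site haplotypes are present,
# the pattern a recombination event can produce. They form a 4-cycle.
reticulate = hamming_matrix(["00", "01", "11", "10"])
# Without the fourth haplotype the samples form a simple path, as on a tree.
tree_like = hamming_matrix(["00", "01", "11"])

print(betti_at_scale(reticulate, 1))  # (1, 1): one component, one loop
print(betti_at_scale(tree_like, 1))   # (1, 0): one component, no loop
print(betti_at_scale(reticulate, 2))  # (1, 0): the loop is filled in at eps = 2
```

In this toy case the loop is born at threshold 1 and dies at threshold 2, which is the kind of birth and death interval that persistent homology records. For real data sets, an established library would be the practical choice over this quadratic-memory sketch.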

Author summary

Biologists usually picture the history of life as a tree, in which lineages split and never rejoin. Many of the most consequential evolutionary events break that picture: malaria parasites recombine in the mosquito gut, drug-resistant strains arise repeatedly on different genetic backgrounds, and plant populations that survived the Ice Age in southern refuges carry tangled ancestry that no tree can represent. We ask a different question of genetic data: not “what tree fits?” but “what shape does the data make?” We answer it with topological data analysis, which measures shape through three counts (the Betti numbers) of clusters, loops, and higher-order cavities. Loops appear when lineages recombine and rejoin. We show, in simulations and in real Plasmodium falciparum and Arabidopsis thaliana genomes, that the loop count rises with recombination above a baseline set by finite sampling, cleanly separates recombining from clonal histories, and tracks a real artemisinin-resistance sweep in Cambodia as it rises and collapses over a decade. A non-recombining mitochondrial control correctly shows no loops. The shape of genetic data is thus a direct, tree-free readout of evolutionary process, with immediate value for drug-resistance surveillance.
