Insertions, deletions, and exchangeable couplings: a Dirichlet process over TKF92 domains and sites


Abstract

The TKF92 model of molecular evolution (a linear birth-death process for indels, with finite-state continuous-time Markov chain substitutions) is exchangeable in residue identity at every site: the generative process treats amino acids symmetrically, conditional on a single substitution rate matrix. To introduce local heterogeneity, evolutionary models are often equipped with site-class mixtures, preserving this symmetry in the sense of de Finetti: conditional on the latent class, residues are still exchangeable. In a four-step theoretical ladder, we show how long-range structure such as couplings between distant sites can also be introduced exchangeably by using a Dirichlet process to partition sites into co-evolving classes. Our first step is a thorough analysis of TKF92 to establish sufficient statistics, limiting behavior, and inferential tools. We then lift the pairwise TKF92 hidden Markov model, in the limit of small time, to a time-indexed gravestone-augmented pair stochastic context-free grammar, and from there to its phylogenetic generalisation. This framing allows trajectories to be sampled exactly by Inside-Outside recursion. The third step places a Dirichlet process over the alive sites and asks co-keyed sites to evolve under a sparse Potts interaction: an exchangeably-partitioned hidden direct-coupling model whose marginal alignment likelihood is unchanged from plain TKF92. The fourth rung of the ladder develops inference machinery: a Gibbs–Metropolis sampler that alternates alignment resamples, key-partition resamples, and stochastic parameter updates.
We close several gaps along the way (exact closed-form sufficient statistics for the linear birth–death–immigration component, the resolvable L’Hôpital limit at λ = μ, and a closed-form M-step for a recursive generalisation of TKF92), and we report a 1,000-family Pfam fit with K = 4 site classes whose Potts atoms carry ∼0.54 nats of covariation per class-pair on top of a substantial single-site substitution model. Supplementary material, including full source code for inference, may be found at https://tkfdp.net/.
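
As an illustration of what a resolvable limit at λ = μ looks like, the following minimal sketch uses the standard TKF91-style quantity λβ(t) = λ(1 − e^{(λ−μ)t}) / (μ − λ e^{(λ−μ)t}), which becomes 0/0 when λ = μ and tends to λt / (1 + λt) by L’Hôpital’s rule. This choice of quantity, the function name `tkf_lambda_beta`, and the tolerance `tol` are assumptions made for illustration; the expressions treated in the article itself may differ.

```python
import math


def tkf_lambda_beta(lam: float, mu: float, t: float, tol: float = 1e-9) -> float:
    """Evaluate lam * (1 - e^{(lam-mu)t}) / (mu - lam * e^{(lam-mu)t}).

    Hypothetical helper for illustration only. Near lam == mu the ratio is
    0/0, so we switch to the L'Hopital limit lam*t / (1 + lam*t).
    """
    d = lam - mu
    if abs(d) * t < tol:
        # Limit as lam -> mu: numerator ~ -d*t, denominator ~ -d*(1 + lam*t).
        return lam * t / (1.0 + lam * t)
    # Rewrite with expm1 to avoid cancellation when d*t is small:
    # 1 - e^{dt} = -expm1(dt), and mu - lam*e^{dt} = -(d + lam*expm1(dt)).
    x = math.expm1(d * t)
    return lam * x / (d + lam * x)


if __name__ == "__main__":
    print(tkf_lambda_beta(1.0, 1.0, 0.5))         # exact limit branch
    print(tkf_lambda_beta(1.0, 1.0 + 1e-6, 0.5))  # general branch, close to the limit
```

The `expm1` form keeps the general branch accurate for small |λ − μ|, so the two branches agree closely where they meet.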
