EpiLink: a simulation-based compatibility model for genomic transmission clustering in infectious disease surveillance

Dominic Arthur
Christopher J. Banks
Rowland R. Kao

Abstract

Identifying recently linked infections from pathogen genome sequences is central to infectious disease surveillance, yet many clustering approaches rely on fixed genetic distance thresholds whose relationship to transmission is often unclear. This limitation is especially important in rapidly growing outbreaks and superspreading events, where many cases may be sampled close together in time and share little genetic variation, making true transmission links difficult to distinguish from other closely related infections. Supervised models can improve discrimination, but they require labelled transmission data that are rarely available during outbreak response.

We developed EpiLink, a threshold-free method that estimates whether two cases are compatible with recent transmission. Here, compatibility means how well the observed genetic distance and sampling-time difference between two cases fit what would be expected if they were linked by defined recent transmission scenarios. EpiLink simulates plausible recent transmission histories while accounting for uncertainty in infection timing, testing delay, and mutation accumulation, then assigns higher scores to pairs whose observed differences are typical of those simulations.
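The scoring idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the gamma-distributed generation interval and testing delays, the substitution rate, the restriction to a single direct-transmission scenario, and the tail-based typicality score are all hypothetical choices introduced here.

```python
import random
import statistics


def simulate_direct_pair(rng, subs_per_day=0.08):
    """Simulate one direct-transmission pair A -> B (all parameters hypothetical).

    Returns (sampling-time difference in days, genetic distance in SNPs).
    """
    gen_interval = rng.gammavariate(2.5, 2.0)  # A's infection to B's infection
    delay_a = rng.gammavariate(2.0, 2.0)       # A's infection to A's sample
    delay_b = rng.gammavariate(2.0, 2.0)       # B's infection to B's sample
    # Times are measured from A's infection.
    dt = (gen_interval + delay_b) - delay_a
    # Evolutionary time separating the two sampled genomes, ignoring
    # within-host diversity (an assumption of this sketch).
    evo_time = abs(delay_a - gen_interval) + delay_b
    # Deterministic mutation accumulation: distance proportional to time.
    return dt, subs_per_day * evo_time


def compatibility_score(obs_dt, obs_dist, n_sims=5000, seed=1):
    """Fraction of simulated pairs at least as atypical as the observed pair."""
    rng = random.Random(seed)
    sims = [simulate_direct_pair(rng) for _ in range(n_sims)]
    dts, dists = zip(*sims)
    mu_t, sd_t = statistics.fmean(dts), statistics.pstdev(dts)
    mu_d, sd_d = statistics.fmean(dists), statistics.pstdev(dists)

    def atypicality(dt, dist):
        # Squared standardized distance from the simulated mean
        # (correlation between the two quantities is ignored here).
        return ((dt - mu_t) / sd_t) ** 2 + ((dist - mu_d) / sd_d) ** 2

    obs = atypicality(obs_dt, obs_dist)
    return sum(atypicality(dt, d) >= obs for dt, d in sims) / n_sims


# A pair sampled 4 days apart with about half a SNP of divergence, versus
# a pair sampled 40 days apart with 12 SNPs of divergence.
close_pair = compatibility_score(obs_dt=4.0, obs_dist=0.5)
distant_pair = compatibility_score(obs_dt=40.0, obs_dist=12.0)
```

In this sketch a pair whose differences sit near the centre of the simulated distribution receives a score close to 1, while a pair far outside it receives a score close to 0. The score is a measure of typicality under the simulated scenario and should not be read as a probability of transmission.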

EpiLink was evaluated using synthetic SARS-CoV-2 outbreaks and empirical SARS-CoV-2 data from the 2020 Boston epidemic. Two EpiLink variants were compared with a logistic regression model trained on labelled transmission data. One EpiLink variant assumed deterministic mutation accumulation, with genetic differences proportional to elapsed evolutionary time; the other accounted for stochasticity by sampling mutation counts from a Poisson distribution. The logistic regression model performed better at distinguishing linked from unlinked pairs, but EpiLink achieved comparable clustering accuracy. In the Boston data, EpiLink recovered clusters enriched for documented conference and skilled nursing facility outbreaks. EpiLink thus provides an interpretable, simulation-based approach for identifying recent transmission clusters when fixed thresholds are difficult to justify and labelled transmission data are unavailable.
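The contrast between the two mutation models can be made concrete with a small Python sketch. The substitution rate and elapsed time below are hypothetical values chosen for illustration, and the function names are not taken from EpiLink.

```python
import math


def deterministic_distance(rate, evo_time):
    """Deterministic variant: distance is proportional to elapsed time."""
    return rate * evo_time


def poisson_pmf(k, mean):
    """Probability of observing exactly k mutations under a Poisson model."""
    return math.exp(-mean) * mean**k / math.factorial(k)


rate = 0.08      # hypothetical substitutions per genome per day
evo_time = 10.0  # hypothetical days of evolution separating two samples

# Single expected value under the deterministic model.
expected = deterministic_distance(rate, evo_time)

# Spread of integer mutation counts around the same mean under the
# stochastic model.
probs = {k: poisson_pmf(k, expected) for k in range(4)}
```

Under the deterministic model every pair separated by the same evolutionary time has the same expected distance, whereas the Poisson model spreads probability across several integer counts around that mean, so observed distances of 0, 1, or 2 SNPs can all remain plausible for the same pair.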

Author summary

Grouping infectious disease cases into transmission clusters is a routine part of outbreak surveillance, but many methods rely on fixed genetic distance cut-offs that can be hard to interpret, especially when transmission is rapid and pathogen diversity is low. We developed EpiLink, which instead asks how consistent the observed genetic and sampling-time differences between two cases are with recent transmission. EpiLink simulates plausible transmission histories and scores each pair according to how typical its observed differences are within the simulated distributions. In simulated SARS-CoV-2 outbreaks, EpiLink nearly matched the clustering accuracy of a supervised model trained on labelled transmission pairs, without requiring labelled data. We found a practical trade-off: deterministic configurations performed best when model assumptions were well met, while configurations incorporating uncertainty were more robust when assumptions were misspecified. Applied to SARS-CoV-2 data from the 2020 Boston epidemic, EpiLink recovered clusters enriched for known outbreaks at a conference and skilled nursing facility. EpiLink offers a practical and interpretable approach for transmission clustering when labelled data are unavailable.

PREreview
Jun 21, 2026
This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/20785870.

Major Issues

1. The "threshold-free" framing needs qualification. EpiLink avoids fixed SNP-distance thresholds, but the final clusters still depend on a sparsification threshold and Leiden resolution parameter. The authors should rephrase this as "not dependent on fixed genetic-distance thresholds" and provide clearer guidance on how users should choose or vary graph parameters in real surveillance analyses.
2. The default target scenario is narrow. The main analysis uses direct transmission and co-primary infection as the target set. This is interpretable, but many surveillance questions involve short chains with unsampled intermediates. The manuscript should more clearly explain when this default is appropriate, when users should include hidden intermediates, and how conclusions change as the scenario set expands.
3. Synthetic validation may be optimistic because the evaluation model resembles the inference model. The synthetic data are generated using assumptions close to EpiLink's own natural-history and mutation model. This is useful for controlled benchmarking, but it may overstate performance under real-world model misspecification. The authors should add or discuss simulations with different generation-time distributions, sampling fractions, sequencing error, heterogeneous ascertainment, within-host diversity, and mutation processes not matched to the EpiLink assumptions.
4. Comparator framing could be stronger. Logistic regression trained on 10% of all pair labels is a useful upper benchmark, but it may not represent a realistic outbreak-response setting where labels are scarce and future cases are unseen. A temporal train/test split or low-label regime would make the comparison more practical. The authors could also include simpler baselines such as fixed SNP plus sampling-time thresholds, since these are common in applied surveillance.
5. The cluster ground truth needs more justification. The reference cluster definition appears to emphasize direct infectees and close transmission neighborhoods, while positive pairs also include sibling infections. The authors should justify why this target best matches EpiLink's intended use and report sensitivity to alternative definitions, such as same transmission chain within one or two generations or same superspreading event.
6. Boston validation is suggestive but not definitive. Enrichment for known exposure categories is encouraging, but it does not prove that inferred clusters correspond to transmission clusters. The authors should compare more directly with published outbreak labels, sampling dates, known epidemiological links, and cluster fragmentation/merging patterns. Reporting robustness of Boston clusters across Leiden resolution and sparsification settings would also improve confidence.
7. Score interpretation needs stronger user-facing guidance. The compatibility score is not a posterior probability of transmission. This is stated, but readers may still treat high scores as probabilities. The authors should include calibration-style diagnostics or examples showing how scores should and should not be interpreted in public health decision-making.
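
The dependence on graph parameters raised in the first and sixth points can be illustrated with a toy sensitivity sweep. The sketch below is not the EpiLink pipeline: the pairwise scores are made up, and connected components (via union-find) stand in for Leiden community detection so that the example needs only the Python standard library.

```python
def clusters_at_threshold(scores, n_cases, threshold):
    """Keep edges with score >= threshold, then return connected components.

    Connected components are a simplified stand-in for Leiden clustering.
    """
    parent = list(range(n_cases))

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in scores.items():
        if score >= threshold:  # sparsification step
            parent[find(i)] = find(j)

    groups = {}
    for case in range(n_cases):
        groups.setdefault(find(case), []).append(case)
    return sorted(groups.values())


# Hypothetical compatibility scores for five cases.
scores = {(0, 1): 0.9, (1, 2): 0.6, (2, 3): 0.2, (3, 4): 0.85, (0, 4): 0.05}

# Re-cluster the same scores at three sparsification thresholds.
sweep = {t: clusters_at_threshold(scores, 5, t) for t in (0.1, 0.5, 0.8)}
```

With these made-up scores the five cases form one cluster at a threshold of 0.1, two clusters at 0.5, and three at 0.8. Reporting this kind of sweep alongside the final clusters would show readers how strongly the result depends on the graph settings.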

Minor Issues

1. The acronyms EDD, EDS, ESD, and ESS are hard to remember. A small table defining each variant near first use would help.
2. The use of a 5,000 bp synthetic genome should be justified more clearly, especially because SARS-CoV-2 has a much larger genome. Readers may wonder how this choice affects genetic resolution and transferability.
3. Some figures and tables are dense. Figure captions are informative, but the manuscript would benefit from slightly more visual explanation of what each panel demonstrates.
4. The authors should report computational scaling more explicitly for larger datasets, since pairwise methods can become expensive as surveillance datasets grow.
5. A short practical workflow box would be helpful: choose pathogen parameters, choose target scenarios, run compatibility scoring, vary graph settings, interpret clusters cautiously.
6. There are minor formatting issues in captions and supplementary references, such as awkward "Fig. S1 Fig" phrasing and occasional spacing problems.
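
The scaling concern can be quantified directly, since an all-pairs method must score n(n-1)/2 pairs for n cases. The sketch below counts pairs and shows one possible mitigation, restricting scoring to pairs sampled within a maximum time gap. The pre-filter is a hypothetical option introduced here for illustration; the source does not say whether EpiLink uses one.

```python
import math
from bisect import bisect_right


def n_pairs(n_cases):
    """Number of unordered case pairs an all-pairs method must score."""
    return math.comb(n_cases, 2)


def n_pairs_within_window(sample_days, max_gap_days):
    """Count pairs whose sampling dates differ by at most max_gap_days."""
    days = sorted(sample_days)
    total = 0
    for i, day in enumerate(days):
        # Index just past the last case sampled by day + max_gap_days.
        j = bisect_right(days, day + max_gap_days)
        total += j - i - 1  # later cases inside the window
    return total


# Pair counts grow quadratically with dataset size.
pair_counts = {n: n_pairs(n) for n in (1_000, 10_000, 100_000)}

# Toy example: five cases sampled on these days, with a 2-day window.
windowed = n_pairs_within_window([0, 1, 2, 10, 11], max_gap_days=2)
```

For 1,000, 10,000, and 100,000 cases the all-pairs counts are 499,500, 49,995,000, and 4,999,950,000 respectively. In the toy example the window reduces the ten possible pairs to four.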

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they did not use generative AI to come up with new ideas for their review.
Version published to 10.64898/2026.06.16.26355814 on medRxiv
Jun 20, 2026

Related articles

A modular Bayesian framework for inferring transmission networks from polyclonal infections, with application to Plasmodium falciparum

This article has 4 authors:
1. Maxwell Murphy
2. Rasmus Nielsen
3. T. Alex Perkins
4. Bryan Greenhouse
This article has no evaluations
Latest version May 15, 2026

MxSure: a mixture model for inferring within-host substitution rates and transmission SNP thresholds

This article has 9 authors:
1. Zunair Khurram
2. Chrispin Chaguza
3. Brenda A. Kwambana-Adams
4. Yan Shao
5. Trevor Lawley
6. Michelle Yong
7. Mark Davies
8. Alexander E. Zarebski
9. Gerry Tonkin-Hill
This article has no evaluations
Latest version Jun 29, 2026

Disentangling infectiousness and susceptibility by age group using transmission pair data: a study of SARS-CoV-2 household transmission

This article has 3 authors:
1. Ka Yin Leung
2. Fuminari Miura
3. Jantien A. Backer
This article has no evaluations
Latest version Jun 5, 2026
