EpiLink: a simulation-based compatibility model for genomic transmission clustering in infectious disease surveillance

Abstract

Identifying recently linked infections from pathogen genome sequences is central to infectious disease surveillance, yet many clustering approaches rely on fixed genetic distance thresholds whose relationship to transmission is often unclear. This limitation is especially important in rapidly growing outbreaks and superspreading events, where many cases may be sampled close together in time and share little genetic variation, making true transmission links difficult to distinguish from other closely related infections. Supervised models can improve discrimination, but they require labelled transmission data that are rarely available during outbreak response. We developed EpiLink, a threshold-free method that estimates whether two cases are compatible with recent transmission. Here, compatibility means how well the observed genetic distance and sampling-time difference between two cases fit what would be expected if they were linked by defined recent transmission scenarios. EpiLink simulates plausible recent transmission histories while accounting for uncertainty in infection timing, testing delay, and mutation accumulation, then assigns higher scores to pairs whose observed differences are typical of those simulations. EpiLink was evaluated using synthetic outbreak data and empirical SARS-CoV-2 data from the 2020 Boston epidemic. Two EpiLink variants were compared to a logistic regression model trained on labelled transmission data. One EpiLink variant assumed deterministic mutation accumulation, with genetic differences proportional to elapsed evolutionary time; the other accounted for stochasticity by sampling mutation counts from a Poisson distribution. The logistic regression model performed better at distinguishing linked from unlinked pairs, but EpiLink achieved comparable clustering accuracy. In the Boston data, EpiLink recovered clusters enriched for documented conference and skilled nursing facility outbreaks.
EpiLink thus provides an interpretable, simulation-based approach for identifying recent transmission clusters when fixed thresholds are difficult to justify and labelled transmission data are unavailable.
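
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it covers only a direct-transmission scenario with Poisson mutation counts, and the gamma parameters for generation time and testing delay, the substitution rate of 0.07 per day, the tolerance window, and all function names are assumptions chosen for illustration. The score here is simply the fraction of simulated pairs that fall near the observed pair, a stand-in for whatever scoring rule EpiLink actually uses.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw a Poisson count with Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def simulate_direct_pair(rng, subs_per_day):
    """Simulate one infector/infectee pair; the infector is infected at day 0.

    All distribution parameters are illustrative assumptions.
    """
    gen_time = rng.gammavariate(2.5, 2.0)        # infection of infectee, mean 5 days
    delay_infector = rng.gammavariate(2.0, 2.5)  # infection-to-sampling delay
    delay_infectee = rng.gammavariate(2.0, 2.5)
    # Absolute difference between the two sampling times.
    dt = abs(gen_time + delay_infectee - delay_infector)
    # Evolutionary time separating the two sampled genomes.
    evo_time = abs(delay_infector - gen_time) + delay_infectee
    snps = sample_poisson(rng, subs_per_day * evo_time)
    return snps, dt


def compatibility(obs_snps, obs_dt, n_sims=5000, subs_per_day=0.07,
                  snp_tol=0, day_tol=2.0, seed=1):
    """Fraction of simulated pairs that resemble the observed pair."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        snps, dt = simulate_direct_pair(rng, subs_per_day)
        if abs(snps - obs_snps) <= snp_tol and abs(dt - obs_dt) <= day_tol:
            hits += 1
    return hits / n_sims


near = compatibility(obs_snps=1, obs_dt=3.0)    # close in time and sequence
far = compatibility(obs_snps=15, obs_dt=60.0)   # distant on both axes
print(near, far)
```

Under these assumed parameters, a pair one substitution and three days apart scores higher than a pair fifteen substitutions and sixty days apart. Swapping the Poisson draw for `round(subs_per_day * evo_time)` would give a rough analogue of the deterministic variant mentioned in the abstract.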

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/20785870.

    Major Issues

    1. The "threshold-free" framing needs qualification. EpiLink avoids fixed SNP-distance thresholds, but the final clusters still depend on a sparsification threshold and the Leiden resolution parameter. The authors should rephrase this as "not dependent on fixed genetic-distance thresholds" and provide clearer guidance on how users should choose or vary graph parameters in real surveillance analyses.

    2. The default target scenario is narrow. The main analysis uses direct transmission and co-primary infection as the target set. This is interpretable, but many surveillance questions involve short chains with unsampled intermediates. The manuscript should more clearly explain when this default is appropriate, when users should include hidden intermediates, and how conclusions change as the scenario set expands.

    3. Synthetic validation may be optimistic because the evaluation model resembles the inference model. The synthetic data are generated using assumptions close to EpiLink's own natural-history and mutation model. This is useful for controlled benchmarking, but it may overstate performance under real-world model misspecification. The authors should add or discuss simulations with different generation-time distributions, sampling fractions, sequencing error, heterogeneous ascertainment, within-host diversity, and mutation processes not matched to the EpiLink assumptions.

    4. Comparator framing could be stronger. Logistic regression trained on 10% of all pair labels is a useful upper benchmark, but it may not represent a realistic outbreak-response setting where labels are scarce and future cases are unseen. A temporal train/test split or low-label regime would make the comparison more practical. The authors could also include simpler baselines such as fixed SNP plus sampling-time thresholds, since these are common in applied surveillance.

    5. The cluster ground truth needs more justification. The reference cluster definition appears to emphasize direct infectees and close transmission neighborhoods, while positive pairs also include sibling infections. The authors should justify why this target best matches EpiLink's intended use and report sensitivity to alternative definitions, such as same transmission chain within one or two generations or same superspreading event.

    6. Boston validation is suggestive but not definitive. Enrichment for known exposure categories is encouraging, but it does not prove that inferred clusters correspond to transmission clusters. The authors should compare more directly with published outbreak labels, sampling dates, known epidemiological links, and cluster fragmentation/merging patterns. Reporting robustness of Boston clusters across Leiden resolution and sparsification settings would also improve confidence.

    7. Score interpretation needs stronger user-facing guidance. The compatibility score is not a posterior probability of transmission. This is stated, but readers may still treat high scores as probabilities. The authors should include calibration-style diagnostics or examples showing how scores should and should not be interpreted in public health decision-making.
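
The robustness reporting suggested in major issues 1 and 6 could take a form like the following sketch. It uses connected components over a thresholded score graph as a simple stand-in for Leiden community detection, which would require an external library; the function name, the toy scores, and the threshold values are all hypothetical.

```python
def clusters_at_threshold(n_cases, scored_pairs, threshold):
    """Group cases by connected components after dropping weak edges.

    scored_pairs is a list of (case_i, case_j, score) tuples; edges with a
    score below the threshold are removed (the sparsification step).
    """
    parent = list(range(n_cases))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, score in scored_pairs:
        if score >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for case in range(n_cases):
        groups.setdefault(find(case), []).append(case)
    return sorted(groups.values())


# Hypothetical compatibility scores for five cases.
toy_scores = [(0, 1, 0.9), (1, 2, 0.6), (2, 3, 0.2), (3, 4, 0.8)]
for threshold in (0.1, 0.5, 0.85):
    print(threshold, clusters_at_threshold(5, toy_scores, threshold))
```

In this toy example the five cases form one cluster at a threshold of 0.1, two clusters at 0.5, and four at 0.85, which is the kind of fragmentation and merging pattern a sensitivity table would make visible.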

    Minor Issues

    1. The acronyms EDD, EDS, ESD, and ESS are hard to remember. A small table defining each variant near first use would help.

    2. The use of a 5,000 bp synthetic genome should be justified more clearly, especially because SARS-CoV-2 has a much larger genome. Readers may wonder how this choice affects genetic resolution and transferability.

    3. Some figures and tables are dense. Figure captions are informative, but the manuscript would benefit from slightly more visual explanation of what each panel demonstrates.

    4. The authors should report computational scaling more explicitly for larger datasets, since pairwise methods can become expensive as surveillance datasets grow.

    5. A short practical workflow box would be helpful: choose pathogen parameters, choose target scenarios, run compatibility scoring, vary graph settings, interpret clusters cautiously.

    6. There are minor formatting issues in captions and supplementary references, such as awkward "Fig. S1 Fig" phrasing and occasional spacing problems.
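
The scaling concern in minor issue 4 follows from the number of case pairs, which grows as n(n-1)/2. The short sketch below only counts pairs; the dataset sizes are arbitrary examples, and actual runtime would also depend on the number of simulations per pair and any pre-filtering the method applies.

```python
import math


def pair_count(n_cases):
    """Number of unordered case pairs an all-pairs method must score."""
    return math.comb(n_cases, 2)


for n_cases in (1_000, 10_000, 100_000):
    print(f"{n_cases:>7} cases -> {pair_count(n_cases):,} pairs")
```

A tenfold increase in cases gives roughly a hundredfold increase in pairs.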

    Competing interests

    The author declares that they have no competing interests.

    Use of Artificial Intelligence (AI)

    The author declares that they did not use generative AI to come up with new ideas for their review.