Using phylogenetic summary statistics for epidemiological inference

Rafael C. Núñez
Gregory R. Hart
Michael Famulare
Christopher Lorton
Joshua T. Herbeck

Abstract

Since the coining of the term phylodynamics, the use of phylogenies to understand infectious disease dynamics has steadily increased. As methods for phylodynamics and genomic epidemiology have proliferated and grown more computationally expensive, the epidemiological information they extract has also evolved to better complement what can be learned through traditional epidemiological data. However, for genomic epidemiology to continue to grow, and for the accumulating number of pathogen genetic sequences to fulfill their potential widespread utility, the extraction of epidemiological information from phylogenies needs to be simpler and more efficient. Summary statistics provide a straightforward way of extracting information from a phylogenetic tree, but the relationship between these statistics and epidemiological quantities needs to be better understood. In this work we address this need via simulation. Using two different benchmark scenarios, we evaluate 74 tree summary statistics and their relationship to epidemiological quantities. In addition to evaluating the epidemiological information that can be inferred from each summary statistic, we also assess the computational cost of each statistic. This helps us optimize the selection of summary statistics for specific applications. Our study offers guidelines on essential considerations for designing or choosing summary statistics. The evaluated set of summary statistics, along with additional helpful functions for phylogenetic analysis, is accessible through an open-source Python library. Our research not only illuminates the main characteristics of many tree summary statistics but also provides valuable computational tools for real-world epidemiological analyses. These contributions aim to enhance our understanding of disease spread dynamics and advance the broader utilization of genomic epidemiology in public health efforts.
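To make the idea of a tree summary statistic concrete, here is a minimal sketch of three classic tree-shape statistics (the Sackin index, the Colless index, and the cherry count), together with a simple timing helper in the spirit of the cost assessment described above. It is an illustration only: it does not use the authors' library, and the nested-tuple tree representation and all function names are assumptions introduced for this example. Whether these particular statistics are among the 74 evaluated is not stated in the abstract.

```python
import time

# Hypothetical toy representation (not the authors' library):
# a rooted binary tree is a nested 2-tuple, and a leaf is a string label.

def is_leaf(node):
    return not isinstance(node, tuple)

def leaf_depths(node, depth=0):
    """Return the depth (number of edges from the root) of every leaf."""
    if is_leaf(node):
        return [depth]
    depths = []
    for child in node:
        depths.extend(leaf_depths(child, depth + 1))
    return depths

def count_leaves(node):
    """Number of leaves below (and including) this node."""
    if is_leaf(node):
        return 1
    return sum(count_leaves(child) for child in node)

def sackin_index(tree):
    """Sackin index: sum of leaf depths. Larger values mean a less balanced tree."""
    return sum(leaf_depths(tree))

def colless_index(node):
    """Colless index: sum over internal nodes of |leaves left - leaves right|."""
    if is_leaf(node):
        return 0
    left, right = node
    imbalance = abs(count_leaves(left) - count_leaves(right))
    return imbalance + colless_index(left) + colless_index(right)

def cherry_count(node):
    """Number of internal nodes whose two children are both leaves."""
    if is_leaf(node):
        return 0
    if all(is_leaf(child) for child in node):
        return 1
    return sum(cherry_count(child) for child in node)

def timed(stat, tree, repeats=1000):
    """Return (value, mean seconds per call) for one statistic on one tree."""
    start = time.perf_counter()
    for _ in range(repeats):
        value = stat(tree)
    elapsed = time.perf_counter() - start
    return value, elapsed / repeats

# Two four-leaf trees with the same tips but different shapes.
balanced = (("A", "B"), ("C", "D"))
ladder = ((("A", "B"), "C"), "D")

if __name__ == "__main__":
    for name, tree in [("balanced", balanced), ("ladder", ladder)]:
        for stat in (sackin_index, colless_index, cherry_count):
            value, cost = timed(stat, tree)
            print(f"{name:9s} {stat.__name__:14s} value={value} cost={cost:.2e}s")
```

On these two toy trees the fully balanced shape scores lower on both imbalance indices and has more cherries than the ladder shape, which is the kind of shape signal a summary statistic reduces a tree to. Real analyses would read trees with a phylogenetics package and would usually also draw on branch lengths, which this sketch ignores.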

Author Summary

Our study focuses on using phylogenetic analysis to obtain epidemiological insights. We conducted a simulation study to evaluate 74 phylogenetic summary statistics and their relationship to epidemiological quantities, shedding light on how well each statistic can quantify different characteristics of disease spread dynamics. We also assessed the computational cost of each statistic, which provides an additional criterion when selecting a statistic for a particular application. The evaluated statistics are available through an open-source Python library. This work improves our understanding of phylogenetic tree structures and contributes to the broader application of genomic epidemiology in public health initiatives.

Version published to 10.1101/2024.08.07.607080 on bioRxiv
Aug 7, 2024