Using viral genomics to estimate undetected infections and extent of superspreading events for COVID-19
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (eLife)
- Evaluated articles (ScreenIT)
Abstract
Asymptomatic infections and limited testing capacity have led to under-reporting of SARS-CoV-2 cases. This has hampered the ability to ascertain true infection numbers, evaluate the effectiveness of surveillance strategies, determine transmission dynamics, and estimate reproductive numbers. Leveraging both viral genomic and time series case data offers methods to estimate these parameters.
Using a Bayesian inference framework to fit a branching process model to viral phylogeny and time series case data, we estimated time-varying reproductive numbers and their variance, the total numbers of infected individuals, the probability of case detection over time, and the estimated time to detection of an outbreak for 12 locations in Europe, China, and the United States.
The median percentage of undetected infections ranged from 13% in New York to 92% in Shanghai, China, with the length of local transmission prior to two cases being detected ranging from 11 days (95% CI: 4-21) in California to 37 days (9-100) in Minnesota. The probability of detection was as low as 1% at the start of local epidemics, increasing as the number of reported cases increased exponentially. The precision of estimates increased with the number of full-length viral genomes in a location. The viral phylogeny was informative of the variance in the reproductive number with the 32% most infectious individuals contributing 80% of total transmission events.
This is the first study that incorporates both the viral genomes and time series case data in the estimation of undetected COVID-19 infections. Our findings suggest the presence of undetected infections broadly and that superspreading events are contributing less to observed dynamics than during the SARS epidemic in 2003. This genomics-informed modeling approach could estimate in near real-time critical surveillance metrics to inform ongoing COVID-19 response efforts.
Funding
AWS provided computational credit via the Diagnostic Development Initiative.
Article activity feed
-
### Reviewer #3:
The relative contributions of asymptomatic infections and superspreading events to the ongoing SARS-CoV-2 pandemic are critical, controversial questions. As far as I know, this may be the first paper to combine phylogenetic inferences from genomic data with time series case data to estimate these parameters for the ongoing SARS-CoV-2 pandemic. However, with so many papers coming out so quickly, it's possible I missed this.
Here, the authors combine viral phylogenetics with time series case data to estimate parameters (including temporally structured estimates of the reproductive number) of the SARS-CoV-2 pandemic in 12 locations globally. They find that the percentage of undetected infections ranges substantially by location, from 13% to 92%, and that the precision of their estimates improves substantially with the number of viral genomes included from each location; this is visualized in Figure 2.
However, in its current form the manuscript suffers from some shortcomings.
SARS-CoV-2 evolves slowly relative to other viruses, which can lead to high levels of phylogenetic uncertainty in recovered trees, and this uncertainty can strongly influence parameter estimates. According to the methods and the supplemental material, the authors inferred a single phylogenetic tree for each location. The authors should be encouraged to infer a distribution of trees for each location and condition their analyses across this additional uncertainty. If this has already been done, the manuscript needs to make this clear.
Abstract:
This section requires a thorough edit to improve clarity. In its current form it is rather disjointed and needs to better link aims to results to conclusions.
Introduction:
The first two paragraphs of the introduction should be switched. The introduction should start with the big questions (in this case, why it is important in the big picture of epidemiology to estimate parameters like the total number of infections) and then introduce the study system used to address them, in this case SARS-CoV-2.
The third paragraph addresses other ways to directly estimate the number of infected individuals through serological surveys. Missing from this paragraph is an acknowledgement of the assumption that markers of immunity last long enough for such surveys to detect past infections.
The final paragraph of the introduction outlines the aims and is rather lacking in scientific detail. What are the hypotheses? What are the alternatives? What are the predictions and tests of the hypotheses in play? What specific hypotheses are the authors testing by applying their method? This requires clarification.
Methods:
Generally, the methods lack sufficient detail to replicate what the authors have done.
In the Viral genomes section of the methods it is stated that several locations were excluded due to "multiple circulating lineages"; however, nearly all of the locations included (e.g. Guangdong, Hubei, Shanghai, UK) also have multiple circulating lineages. What was done here needs to be clarified considerably.
Phylogenetic inference as performed in IQ-TREE is fine; however, as previously mentioned, the authors need at a minimum to infer a distribution of trees for each region and condition their subsequent analyses across it.
In the section on sub-sampling the sequences to the dominant lineages, how was lineage assignment done? Using Pangolin? Or another classification system? More detail is needed.
A bit more detail on how the authors determined convergence was achieved would be valuable. For example, how was visual confirmation of convergence done? Via visual inspection of parameter traces? A generalist reader may need more detail than has been provided.
Results:
More detail is needed in the figure legend for Figure 1. For example, unless I misunderstand, the legend states that the red lines are HPD intervals on those days, but the figure actually shows a shaded area with a measure of central tendency as a red line.
Discussion:
Overall, the discussion puts the results in appropriate context. However, the caveats associated with these analyses were not appropriately acknowledged. More thought should be put into acknowledging the factors that may affect the authors' estimates and their interpretation of the findings.
On balance, I do think that the approach used in this manuscript makes a potentially useful contribution to addressing the current pandemic, and to my knowledge this approach has not yet been applied to SARS-CoV-2. I would like to see additional analyses (incorporation of phylogenetic uncertainty) and a thorough edit and revision for clarity.
-
### Reviewer #2:
The authors present a Bayesian inference framework to fit a branching process model that incorporates both viral genomes and time series of case data to estimate undetected COVID-19 infections. While the method seems to be valid, its application to the data is subject to some uncertainties, especially for locations in Asia such as Japan, Shanghai and Hong Kong. Please see below for my comments and suggestions:
Major comments:
My biggest concern is that in many of the locations in Asia in Table 1/Figure 1, no sustained local outbreak has been detected. So far the majority of cases in Hong Kong were imported cases (https://www.chp.gov.hk/files/pdf/local_situation_covid19_en.pdf). By the end of February 2020, more than 50% of cases in Guangdong, China, were imported from Hubei. How would the sequence analysis and model fit change if imported cases were excluded?
As mentioned above, the proportion of imported cases would likely affect the estimation of Rt and of undetected infections. What if the method were applied to imported and local cases separately for locations such as Hong Kong, where the imported/local status is clear for every case?
-
### Reviewer #1:
In this work the authors use previously developed methods linking viral sequence data and reported case counts to estimate the percentage of undetected infections and the effective reproduction number Rt through time in a number of locations. This is an extremely important topic: despite the urgency, there has not been consistent population-based viral testing, and the fraction of COVID-19 cases that are reported remains largely unknown. If genomics can help, that is very valuable.
However, there are some concerns about the methods for this specific application. Validation on simulated data, and exploration of robustness to some of the assumptions and limitations, could help.
Dates of confirmation may differ from dates of symptom onset by many days. This is discussed briefly but the impact of a shift is not explored. The bias may additionally depend on the population size, with more bias towards the beginning when there are few cases and few sequences. It could also impact the sequencing; this is discussed briefly but could be explored to some extent by shifting the dates and re-estimating.
The authors subsampled the sequences to the dominant lineages. More information about how this was done would be helpful. In addition, without information linking viral genomes to reported case counts, the same adjustment cannot be made to the reported cases -- could this impact the results? It is not quite clear how multiple lineages, introductions, and geographical mixing in the phylogeny are treated. For example, consider a case in which the California sequences have some Minnesota ones embedded in them, scattered in a clade. If the Minnesota sequences in their entirety are treated as one phylogeny (without any of the CA tips), then there would be very long branches between these and other Minnesota sequences, and the likelihood would reflect no branching events on these branches. In reality there were plenty of events, but they were in CA. Meanwhile those branching events do not occur in the CA tree either, because their descendants have been pruned out of the CA analysis. In any case it is not clear what precisely is meant by not including locations with co-circulating lineages, nor how geographical mixing is treated.
The probability of sequencing, and its variation over time, may affect the model's inferences, because in times of more dense sequencing the intervals in the tree will be shorter (and conversely). The model may not be able to distinguish this from changes in prevalence and reporting fraction. Should there be a rho_t that applies to the sequencing data?
I wonder if the authors are able to model tips that occur in the reported data, handling these dates differently. It seems that the only link is through the conditional independence of the yi and zi information (conditional on the xi information). I also wonder about the impact of phylogenetic uncertainty.
There seems to be a possible identifiability issue with rho_t and x_t, because surely a higher x and lower rho could give the same likelihood, particularly since we can't sequence cases that we can't detect.
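The identifiability concern can be illustrated with a minimal sketch. It assumes, purely for illustration and not as a description of the authors' observation model, that reported cases are a binomial thinning of infections, y_t ~ Binomial(x_t, rho_t). The scenario values and helper names below are hypothetical. Two (x, rho) pairs with the same product give the same expected case count and differ only slightly in variance, so a single observed count barely separates them.

```python
import math

def binom_mean_var(x, rho):
    """Mean and variance of Binomial(x, rho), an assumed detection model."""
    return x * rho, x * rho * (1.0 - rho)

def binom_logpmf(y, x, rho):
    """Log-probability of observing y reported cases among x infections."""
    return (math.lgamma(x + 1) - math.lgamma(y + 1) - math.lgamma(x - y + 1)
            + y * math.log(rho) + (x - y) * math.log1p(-rho))

# Two hypothetical scenarios with the same expected number of reported cases.
few_infections = (1_000, 0.10)    # fewer infections, better detection
many_infections = (10_000, 0.01)  # more infections, poorer detection

mean_a, var_a = binom_mean_var(*few_infections)
mean_b, var_b = binom_mean_var(*many_infections)

# Log-likelihood of one observed count under each scenario.
observed = 100
ll_a = binom_logpmf(observed, *few_infections)
ll_b = binom_logpmf(observed, *many_infections)

print("means:", mean_a, mean_b)
print("variances:", var_a, var_b)
print("log-likelihoods:", ll_a, ll_b)
```

Under this assumed model the case counts alone constrain mainly the product of x and rho; any separation of the two would have to come from the genomic data or from prior information.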
How do the estimates of the reporting fraction compare to those obtained, for example, with the model by Russell et al. (https://cmmid.github.io/topics/covid19/global_cfr_estimates.html) or with other estimates of under-reporting? (Some of these are given in the results, but the CIs are wide.)
I would have liked to see more information for how this was done: "we computed the smallest number of individuals that could contribute to 80% of infections during each week (Figure 4)". Similarly, detailed methods are not given for the 'time to detect an outbreak' results.
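One possible reading of the quoted sentence can be sketched as follows; this is an illustration under stated assumptions, not the authors' procedure. Given per-individual offspring counts, sort them in descending order and take the smallest number of individuals whose cumulative offspring reach 80% of the total. The offspring counts here are simulated from a negative binomial distribution (drawn as a gamma-Poisson mixture) with an illustrative mean of 1.5 and illustrative dispersion values; all function names and parameter values are hypothetical.

```python
import math
import random

def smallest_fraction_for_share(offspring, share=0.8):
    """Smallest fraction of individuals whose offspring sum to `share` of all transmissions."""
    total = sum(offspring)
    if total == 0:
        return 0.0
    running = 0
    # Count the most infectious individuals first.
    for rank, count in enumerate(sorted(offspring, reverse=True), start=1):
        running += count
        if running >= share * total:
            return rank / len(offspring)
    return 1.0

def poisson(rng, lam):
    """Knuth's Poisson sampler; adequate for the moderate rates used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_offspring(n, mean_r, dispersion_k, seed=0):
    """Negative binomial offspring counts drawn as a gamma-Poisson mixture."""
    rng = random.Random(seed)
    scale = mean_r / dispersion_k  # gamma scale so that the mean rate is mean_r
    return [poisson(rng, rng.gammavariate(dispersion_k, scale)) for _ in range(n)]

# Illustrative dispersion values: smaller k means more superspreading.
for k in (0.1, 1.0, 10.0):
    counts = simulate_offspring(5000, mean_r=1.5, dispersion_k=k)
    print(k, smallest_fraction_for_share(counts))
```

The fraction is taken over all individuals, including those with no offspring; whether the manuscript uses this denominator is exactly the kind of detail the methods would need to state.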
It would be interesting to see the comparison between the estimated reporting fractions and the testing data available at (for example) https://covidtracking.com, which allows downloads of data on testing through time by state. This is mentioned in the discussion; information about testing is available for many places (US states and otherwise).
I am also concerned about the large population assumption that is inherent in the mathematics behind the core equation for lambda_t (which the authors should either derive or give the citation for). This equation requires that the mean of the number of offspring in the data is equal to the mean of the offspring distribution, which only happens in the limit when the present and past populations are both large. The same assumption is required for the variance. Particularly in the early stages the large population assumption is unlikely to be met.
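The large-population point can be illustrated with a small simulation; this is a sketch under stated assumptions, not a reconstruction of the equation for lambda_t. It assumes a geometric offspring distribution (a negative binomial with dispersion 1) with an illustrative mean of 1.5, and it measures how far the realized mean offspring number in a cohort of infected individuals can stray from the distribution mean at different cohort sizes. All names and values are hypothetical.

```python
import math
import random
import statistics

def geometric_offspring(rng, mean_r):
    """One offspring count from a geometric distribution with mean `mean_r`."""
    p = 1.0 / (1.0 + mean_r)
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p))

def spread_of_sample_mean(n_infected, mean_r, replicates=1000, seed=1):
    """Mean and standard deviation, across replicates, of the realized mean offspring number."""
    rng = random.Random(seed)
    means = []
    for _ in range(replicates):
        cohort = [geometric_offspring(rng, mean_r) for _ in range(n_infected)]
        means.append(statistics.fmean(cohort))
    return statistics.fmean(means), statistics.stdev(means)

# Small cohorts (early epidemic) versus large cohorts.
for n in (5, 50, 500):
    avg, sd = spread_of_sample_mean(n, mean_r=1.5)
    print(n, round(avg, 3), round(sd, 3))
```

With only a handful of infected individuals the realized mean varies widely between replicates, which is the regime the reviewer flags for the early stages of each local epidemic.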
-
## Preprint Review
This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 3 of the manuscript.
### Summary:
This paper uses a combination of sequence and case data to estimate the ascertainment rate of COVID-19 in different settings. The methods are known, but this is the first application to SARS-CoV-2 data, and the topic is of very high importance. The reviewers had some substantial concerns about the methodology and the clarity of description.
-
SciScore for 10.1101/2020.05.05.20092098:
Please note, not all rigor criteria are appropriate for all manuscripts.
Table 1: Rigor
NIH rigor criteria are not applicable to paper type.
Table 2: Resources
No key resources detected.
Results from OddPub: Thank you for sharing your code and data.
Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:
One of the limitations of this study was that we did not take into account time-varying delays in testing during an individual’s course of infection, so the time series data could be offset from the genetic data by a number of days. This is a reasonable assumption in locations with consistent testing capacity, however these delays likely changed across the course of the epidemic with changing testing regimens and burdens on the healthcare and public health systems. More data on testing capacity over time in different locations could support refined parameterization of the delay distribution over the course of the pandemic. Using genomic data to infer infected numbers generally is not as sensitive to sampling schema as serological surveys, though we do see signals in the data that opportunistic genomic sampling may be over-representing subpopulations and biasing estimates for a subset of locations. This could explain the seemingly high detection rates in California, New York, and the United Kingdom, where multiple smaller outbreaks are occurring within the country or state, but genomic data were only generated from a subset of those locations, biasing results towards underestimation of the total number of cases at the state or country level. Conversely, the estimated infection numbers are likely biased upwards near the root of the phylogeny due to multiple introductions into each location [27]. This would unlikely impact the overall detection rate as those early infections only a...
Results from TrialIdentifier: No clinical trial numbers were referenced.
Results from Barzooka: We did not find any issues relating to the usage of bar graphs.
Results from JetFighter: We did not find any issues relating to colormaps.
Results from rtransparent:
- Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
- Thank you for including a protocol registration statement.