COVID-19 cluster size and transmission rates in schools from crowdsourced case reports

Paul Tupper
Shraddha Pai
COVID Schools Canada
Caroline Colijn

Curated by eLife

Evaluation Summary:

This paper is the first to characterize overdispersion of COVID-19 spread in schools using crowdsourcing. It has the potential to serve as a useful platform for assessing preventative measures in schools but needs more clarity regarding the sensitivity of the approach to the completeness of input data, as evidenced by the different model conclusions reached when sparse data from the US are used as input rather than the more detailed Canadian data.

(This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

Abstract

The role of schools in the spread of SARS-CoV-2 is controversial, with some claiming they are an important driver of the pandemic and others arguing that transmission in schools is negligible. School cluster reports that have been collected in various jurisdictions are a source of data about transmission in schools. These reports consist of the name of a school, a date, and the number of students known to be infected. We provide a simple model for the frequency and size of clusters in this data, based on random arrivals of index cases at schools who then infect their classmates with a highly variable rate, fitting the overdispersion evident in the data. We fit our model to reports from four Canadian provinces, providing estimates of mean and dispersion for cluster size, as well as the distribution of the instantaneous transmission parameter β, whilst factoring in imperfect ascertainment. According to our model with parameters estimated from the data, in all four provinces (i) more than 65% of non-index cases occur in the 20% largest clusters, and (ii) reducing the instantaneous transmission rate and the number of contacts a student has at any given time are effective in reducing the total number of cases, whereas strict bubbling (keeping contacts consistent over time) contributes little to reducing cluster sizes. We predict strict bubbling to be more valuable in scenarios with substantially higher transmission rates.
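
As an illustration of claim (i), the following minimal Python sketch draws the number of non-index cases per cluster from a negative binomial distribution and measures the share that falls in the largest 20% of clusters. The parameter values (mean 1.0, dispersion 0.3), the function names, and the choice to ignore imperfect ascertainment are assumptions made for illustration; they are not the fitted provincial estimates.

```python
import numpy as np


def simulate_secondary_cases(mean, k, n_clusters, rng):
    """Draw non-index case counts per cluster from a negative binomial
    with the given mean and dispersion k (smaller k = more overdispersed)."""
    p = k / (k + mean)  # NumPy's (n, p) parameterization with n = k
    return rng.negative_binomial(k, p, size=n_clusters)


def top_share(secondary_cases, top_fraction=0.2):
    """Fraction of all non-index cases found in the largest
    `top_fraction` of clusters."""
    ordered = np.sort(np.asarray(secondary_cases))[::-1]  # largest first
    n_top = int(np.ceil(top_fraction * ordered.size))
    total = ordered.sum()
    return ordered[:n_top].sum() / total if total > 0 else 0.0


# Hypothetical parameters, chosen only to show the effect of overdispersion.
rng = np.random.default_rng(0)
cases = simulate_secondary_cases(mean=1.0, k=0.3, n_clusters=100_000, rng=rng)
print(f"Share of non-index cases in the top 20% of clusters: {top_share(cases):.2f}")
```

Raising `k` toward large values makes the distribution approach a Poisson, and the share captured by the top 20% of clusters falls accordingly.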

Version published to 10.7554/elife.76174 on eLife
Oct 21, 2022
eLife
Jun 6, 2022

Author Response

Reviewer #1 (Public Review):

There are several key weaknesses. As the authors describe honestly and thoroughly, the high potential for misclassification of clusters is a real limitation. This is likely to be of higher relevance for the US data. Perhaps this is too subjective on my part, but the Canadian data seems likely to be more complete and less biased, particularly in terms of including singlet events with n=1 case. It also strains belief that the true cluster distribution would differ markedly between Canada and the US based on overlapping demographics, culture, class size, etc. For this reason, I would favor labeling the Canadian data as more representative of reality and interpreting the analysis accordingly. It seems fair to use the US data as a likely surrogate of what occurs when the model is applied to incomplete datasets with overrepresentation of large clusters. I would therefore consider excluding all data from the US in the main figures and writing the US data into a separate section at the end of the results associated with a supplementary figure. Overall, I think it is fair to assume that the Canadian dataset is more complete and more representative, not just of Canada but also of the US.

Thank you for these points. We agree. We have followed your suggestion, moved the US data and analysis to the Appendix, and made its limitations clearer.

In the discussion, the authors fail to state the most obvious contributor to overdispersion, which is aerosolization. Notably, influenza virus is associated with equivalently heterogeneous contact networks, similarly high variation in viral load, and an overlapping major route of transmission. Yet its degree of overdispersion is substantially lower than that of SARS-CoV-2, SARS, or MERS, likely due to less aerosolization. Accordingly, influenza is much less commonly associated with large super-spreader events. Please see Goyal et al. (eLife, 2021).

We now discuss aerosolization by adding the following sentence, which cites the important Goyal et al. paper: “But a key factor in higher dispersion with SARS-CoV-2 in comparison to other pathogens such as influenza is aerosolization (Goyal et al), which allows the index case to infect others in the room even if they are not a close contact.”

Next, I am puzzled by one result, namely the Canadian data in Figure 6. This panel suggests that clusters involving more than 12 cases will never happen. This is probably not correct. I think the issue is that this analysis fails to account for the rarity, but high importance, of larger super-spreader events. I am assuming that this figure shows average values, as it is directly extrapolated from parameter values. It would be more useful to show the range of expected times needed to see a cluster of different sizes. This would require stochastic simulation, which could be performed by drawing randomly from a distribution with given values for Rc & k. The result would likely be a wide range in time to cluster for a given set of Rc & k values. Without accounting for stochasticity, this figure is misleading and should probably be removed.

Following this suggestion, we have removed the time to detect analysis.
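
For readers curious what the stochastic simulation suggested by the reviewer might look like, here is a minimal Python sketch. It is not the authors' removed analysis. It assumes that cluster size is one index case plus a negative binomial number of secondary cases with mean Rc and dispersion k, and it uses hypothetical values throughout (Rc = 1.0, k = 0.3, a threshold of 13 cases, and 5 reported clusters per week); the function name is likewise an assumption.

```python
import numpy as np


def clusters_until_size(rc, k, threshold, n_replicates, max_clusters, rng):
    """For each replicate, count how many clusters are observed until one
    reaches at least `threshold` total cases.

    Cluster size = 1 index case + negative binomial secondary cases
    (mean rc, dispersion k). Replicates that never reach the threshold
    within `max_clusters` clusters are marked with np.inf.
    """
    p = k / (k + rc)
    secondary = rng.negative_binomial(k, p, size=(n_replicates, max_clusters))
    hit = (secondary + 1) >= threshold
    first = hit.argmax(axis=1) + 1.0      # 1-based position of first qualifying cluster
    first[~hit.any(axis=1)] = np.inf      # threshold never reached in this replicate
    return first


# Hypothetical inputs, for illustration only.
rng = np.random.default_rng(1)
waits = clusters_until_size(rc=1.0, k=0.3, threshold=13,
                            n_replicates=1000, max_clusters=2000, rng=rng)

clusters_per_week = 5.0                   # assumed reporting rate
weeks = waits / clusters_per_week
low, median, high = np.quantile(weeks, [0.05, 0.5, 0.95])
print(f"Weeks until a cluster of 13+ cases: "
      f"5% {low:.1f}, median {median:.1f}, 95% {high:.1f}")
```

Reporting the 5% to 95% range rather than a single average makes the spread in waiting times visible, which is the point of the reviewer's comment.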

eLife
Feb 23, 2022

Reviewer #1 (Public Review):

This is an innovative and interesting study using crowdsourced data to estimate overdispersion of COVID-19 clusters in US & Canadian classrooms. The writing is exceedingly clear and the linkage of the observed overdispersion of case clusters with specific classroom prevention strategies in Figure 7 is potentially extremely useful. I particularly appreciate that the authors very clearly state the limitations of the cluster size data in terms of ascertainment. The writing in these sections is excellent. Overall, the study thoughtfully addresses an important public health question with a novel and perhaps less expensive method. The study achieved its aims, with the caveats listed below.

There are several key weaknesses. As the authors describe honestly and thoroughly, the high potential for misclassification of clusters is a real limitation. This is likely to be of higher relevance for the US data. Perhaps this is too subjective on my part, but the Canadian data seems likely to be more complete and less biased, particularly in terms of including singlet events with n=1 case. It also strains belief that the true cluster distribution would differ markedly between Canada and the US based on overlapping demographics, culture, class size, etc. For this reason, I would favor labeling the Canadian data as more representative of reality and interpreting the analysis accordingly. It seems fair to use the US data as a likely surrogate of what occurs when the model is applied to incomplete datasets with overrepresentation of large clusters. I would therefore consider excluding all data from the US in the main figures and writing the US data into a separate section at the end of the results associated with a supplementary figure. Overall, I think it is fair to assume that the Canadian dataset is more complete and more representative, not just of Canada but also of the US.

To harp on this a bit more, it is particularly worrisome that public health officials might interpret the US time-to-cluster analyses in Figure 6 literally when the Canadian estimates are more likely to approximate the truth. Similarly, the differing estimates for Rc between Canada and the US are alarming. As written, a public health official could interpret the Rc in schools in Florida as 7! This is false. Accordingly, all estimates of k in the US are likely to be deeply misclassified.

In the discussion, the authors fail to state the most obvious contributor to overdispersion, which is aerosolization. Notably, influenza virus is associated with equivalently heterogeneous contact networks, similarly high variation in viral load, and an overlapping major route of transmission. Yet its degree of overdispersion is substantially lower than that of SARS-CoV-2, SARS, or MERS, likely due to less aerosolization. Accordingly, influenza is much less commonly associated with large super-spreader events. Please see Goyal et al. (eLife, 2021).

Next, I am puzzled by one result, namely the Canadian data in Figure 6. This panel suggests that clusters involving more than 12 cases will never happen. This is probably not correct. I think the issue is that this analysis fails to account for the rarity, but high importance, of larger super-spreader events. I am assuming that this figure shows average values, as it is directly extrapolated from parameter values. It would be more useful to show the range of expected times needed to see a cluster of different sizes. This would require stochastic simulation, which could be performed by drawing randomly from a distribution with given values for Rc & k. The result would likely be a wide range in time to cluster for a given set of Rc & k values. Without accounting for stochasticity, this figure is misleading and should probably be removed.

A final limitation concerns data presentation in the figures. Multiple suggestions are provided for the authors to strengthen scientific messaging.

Reviewer #2 (Public Review):

This manuscript has a number of strengths. First, the paper concerns a topic of considerable importance and interest, as a safe return to in-person education will be critical for the safe reopening of societies. Second, there have been limited analyses of school transmission clusters, and they present the opportunity to better understand school transmission risks. Finally, the authors integrate an analysis that estimates transmission risk with a previous infection risk framework, which provides actionable guidance to school administrators concerning the most effective mitigation measures.

However, there are a number of weaknesses in the analysis. First, the manuscript relies on reported cluster data rather than systematically collected datasets. This causes issues related to reporting biases such as differences in reporting standards across jurisdictions and a propensity to miss smaller case clusters. Second, the modeling methodology relies on assumed ascertainment rates of infection, which appear sensitive to assumptions related to the proportion of cases that are detected (particularly in low ascertainment ranges). Finally, it is not fully clear what constitutes a cluster within a school in the dataset and the methodology. This makes it difficult to interpret the fitted model results, particularly for analyses comparing the cluster size across regions.

Overall, given my concerns related to the underlying data, I find it difficult to interpret the results without significant alterations to the methodology and manuscript overall.

Reviewer #3 (Public Review):

The strength of this paper lies in its simplicity. The authors have, as above, fitted simple negative binomial models to available school outbreak case distributions. The sensitivity analysis in which plausible variation in ascertainment fraction does relatively little to cluster size estimates is also important.

More exploration of mechanisms underlying Canada-US differences would be helpful.

SciScore for 10.1101/2021.12.07.21267381: (What is this?)

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

NIH rigor criteria are not applicable to paper type.

Table 2: Resources

Software and Algorithms
Sentences	Resources
The US data was gathered from the National Educational Association website (4) (originally started by Alisha Morris, an educator at a Kansas high school) which collected data from news media and from reports submitted by volunteers (39).	National Educational Association suggested: None

Results from OddPub: Thank you for sharing your code and data.

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

Our data and model have some limitations. The data rely on crowdsourcing, and there is reason to believe that reporting is incomplete. Larger clusters may be more likely to be reported. The rate at which exposures were reported in these data varies a great deal from state to state, and so these results cannot be used to estimate the true rate of exposures in schools in the jurisdictions. For example, California and New York are some of the most populous states in the US, both of which were severely hit by COVID during this period, and yet neither of them had enough reports to make it into the top 8 most reported states. However, the consistency of the Canadian estimates, despite the data being from different jurisdictions with different reporting, lend them credibility. In the modelling, we assumed a Poisson random variable for the cluster size, with an underlying gamma-distributed rate variable. This is a flexible model allowing for considerable overdispersion, but it is simple and does not explicitly handle complexities such as the differences between elementary and high schools. Our estimates of the transmission rate were derived (where feasible) from a model with a fixed number of hours that the index case would be infectious in the classroom, and fixed class sizes. Accounting for variation in these would result in more variability in the estimates. We also assumed that reported clusters in the Canadian data occurred in the same class. Errors where distinct clusters are i...
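
The Poisson assumption with a gamma-distributed rate quoted above is the standard construction of the negative binomial distribution, which is why the fits are described elsewhere on this page as negative binomial. The short derivation below is a sketch in assumed notation: Z is the count being modelled, R_c its mean, and k the dispersion parameter (the gamma shape); whether Z counts all cases in a cluster or only non-index cases is not stated in the excerpt.

```latex
% Assumed notation: Z = modelled case count, R_c = mean, k = dispersion (gamma shape).
\begin{align*}
\lambda &\sim \mathrm{Gamma}\!\left(\text{shape } k,\ \text{scale } R_c/k\right),
\qquad Z \mid \lambda \sim \mathrm{Poisson}(\lambda), \\
P(Z = z) &= \int_0^\infty \frac{\lambda^{z} e^{-\lambda}}{z!}\,
\frac{\lambda^{k-1} e^{-k\lambda/R_c}}{\Gamma(k)\,(R_c/k)^{k}}\, d\lambda
= \frac{\Gamma(z+k)}{z!\,\Gamma(k)}
\left(\frac{k}{k+R_c}\right)^{k}\left(\frac{R_c}{k+R_c}\right)^{z}, \\
\mathbb{E}[Z] &= R_c, \qquad \operatorname{Var}(Z) = R_c + \frac{R_c^{2}}{k}.
\end{align*}
```

The variance exceeds the Poisson value R_c by the term R_c^2/k, so small k corresponds to strong overdispersion and the Poisson case is recovered as k grows large.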

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
Thank you for including a funding statement. Authors are encouraged to include this statement when submitting to a journal.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.

Version published to 10.1101/2021.12.07.21267381 on medRxiv
Dec 8, 2021
