Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C)

Jason A Thomas
Randi E Foraker
Noa Zamstein
Jon D Morrow
Philip R O Payne
Adam B Wilcox
the N3C Consortium
Melissa A Haendel
Christopher G Chute
Kenneth R Gersing
Anita Walden
Melissa A Haendel
Tellen D Bennett
Christopher G Chute
David A Eichmann
Justin Guinney
Warren A Kibbe
Hongfang Liu
Philip R O Payne
Emily R Pfaff
Peter N Robinson
Joel H Saltz
Heidi Spratt
Justin Starren
Christine Suver
Adam B Wilcox
Andrew E Williams
Chunlei Wu
Christopher G Chute
Emily R Pfaff
Davera Gabriel
Stephanie S Hong
Kristin Kostka
Harold P Lehmann
Richard A Moffitt
Michele Morris
Matvey B Palchuk
Xiaohan Tanner Zhang
Richard L Zhu
Emily R Pfaff
Benjamin Amor
Mark M Bissell
Marshall Clark
Andrew T Girvin
Stephanie S Hong
Kristin Kostka
Adam M Lee
Robert T Miller
Michele Morris
Matvey B Palchuk
Kellie M Walters
Anita Walden
Yooree Chae
Connor Cook
Alexandra Dest
Racquel R Dietz
Thomas Dillon
Patricia A Francis
Rafael Fuentes
Alexis Graves
Julie A McMurry
Andrew J Neumann
Shawn T O'Neil
Usman Sheikh
Andréa M Volz
Elizabeth Zampino
Christopher P Austin
Kenneth R Gersing
Samuel Bozzette
Mariam Deacy
Nicole Garbarini
Michael G Kurilla
Sam G Michael
Joni L Rutter
Meredith Temple-O'Connor
Benjamin Amor
Mark M Bissell
Katie Rebecca Bradwell
Andrew T Girvin
Amin Manna
Nabeel Qureshi
Mary Morrison Saltz
Christine Suver
Christopher G Chute
Melissa A Haendel
Julie A McMurry
Andréa M Volz
Anita Walden
Carolyn Bramante
Jeremy Richard Harper
Wenndy Hernandez
Farrukh M Koraishy
Federico Mariona
Saidulu Mattapally
Amit Saha
Satyanarayana Vedula
Yujuan Fu
Nisha Mathews
Ofer Mendelevitch

This article has been Reviewed by the following groups

Listed in

Evaluated articles (ScreenIT)

Abstract

Objective

This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses.

Materials and Methods

Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the data was statistically and qualitatively evaluated.
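
The epidemic-curve comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(test_date, zip_code, is_positive)`, the function names, and the toy records are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def epidemic_curve(tests):
    """Count positive tests per calendar day.

    `tests` is an iterable of (test_date, zip_code, is_positive) tuples.
    This layout is a hypothetical stand-in, not the N3C schema.
    """
    return Counter(d for d, _zip, positive in tests if positive)


def align_curves(original, synthetic):
    """Return both curves as equal-length lists over the union of dates."""
    days = sorted(set(original) | set(synthetic))
    # Counter returns 0 for days that are missing from one dataset.
    return days, [original[d] for d in days], [synthetic[d] for d in days]


# Toy records for illustration only (not real data).
original_tests = [
    (date(2020, 4, 1), "00001", True),
    (date(2020, 4, 1), "00002", True),
    (date(2020, 4, 2), "00001", False),
    (date(2020, 4, 3), "00001", True),
]
synthetic_tests = [
    (date(2020, 4, 1), "00001", True),
    (date(2020, 4, 2), "00001", True),
    (date(2020, 4, 3), "00001", True),
]

days, orig_counts, synth_counts = align_curves(
    epidemic_curve(original_tests), epidemic_curve(synthetic_tests)
)
```

Aligning both curves on the same set of days yields paired observations, which a paired statistical comparison requires. Filtering the records by zip code before calling `epidemic_curve` would give a zip code-level curve.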

Results

In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves and monthly indicator counts were similar between synthetic and original data in a random sample of the most tested (top 1%; n = 171) and for all unsuppressed zip codes (n = 5819), respectively. In small sample sizes, synthetic data utility was notably decreased.
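
One way to summarize zip code label suppression of the kind reported above is sketched below. It assumes, for illustration only, that a zip code counts as suppressed when it appears in the original data but not in the synthetic data; the function name, input layout, and toy labels are hypothetical.

```python
from collections import Counter
from statistics import mean


def suppression_summary(original_zips, synthetic_zips):
    """Summarize zip code labels absent from the synthetic data.

    Each argument is a list holding one zip code label per test
    (a hypothetical layout chosen for this sketch).
    """
    orig_counts = Counter(original_zips)
    kept = set(synthetic_zips)
    # Tests per zip code, restricted to zip codes missing from synthetic data.
    suppressed = {z: n for z, n in orig_counts.items() if z not in kept}
    return {
        "pct_reduction": 100 * len(suppressed) / len(orig_counts),
        "mean_tests_suppressed": mean(suppressed.values()) if suppressed else 0,
        "max_tests_suppressed": max(suppressed.values(), default=0),
    }


# Toy labels: zip codes "C" and "D" have few tests and vanish from synthetic data.
summary = suppression_summary(
    ["A"] * 50 + ["B"] * 40 + ["C"] * 3 + ["D"] * 1,
    ["A"] * 49 + ["B"] * 41,
)
```

In this toy input, two of four unique zip codes are missing from the synthetic data, so the reduction is 50%, and the suppressed zip codes have a mean of 2 tests and a maximum of 3.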

Discussion

Analyses on the population-level and of densely tested zip codes (which contained most of the data) were similar between original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression.

Conclusion

In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression—an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.

Version published to 10.1093/jamia/ocac045
May 13, 2022

SciScore for 10.1101/2021.07.06.21259051: (What is this?)

Please note, not all rigor criteria are appropriate for all manuscripts.

Table 1: Rigor

Ethics	not detected.
Sex as a biological variable	not detected.
Randomization	We randomly sampled ten zip codes from the list and constructed epidemic curves for these zip codes’ original and synthetic data (Figures 3-4).
Blinding	not detected.
Power Analysis	not detected.
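
The random sampling of zip codes quoted in the Randomization row could look like the sketch below. The "top 1% most tested" list follows the abstract; the function name, the `zip code -> test count` mapping, the fixed seed, and the toy counts are assumptions, since the source does not describe how the sampling was implemented.

```python
import random


def sample_most_tested(test_counts, top_fraction=0.01, k=10, seed=0):
    """Randomly sample k zip codes from the most-tested fraction.

    `test_counts` maps zip code label -> number of tests (hypothetical input).
    """
    # Rank zip codes from most to least tested.
    ranked = sorted(test_counts, key=test_counts.get, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(ranked[:n_top], min(k, n_top))


# Toy counts: zip code "00001" has 1 test, ..., "02000" has 2000 tests.
toy_counts = {f"{i:05d}": i for i in range(1, 2001)}
sampled = sample_most_tested(toy_counts)
```

Seeding a dedicated `random.Random` instance keeps the sample reproducible without altering the global random state.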

Table 2: Resources

Software and Algorithms
Sentences	Resources
All code was written in Python (v3.6.10) and, as required by N3C, ran within the secure N3C enclave using the Palantir Foundry Analytic Platform	Python suggested: (IPython, RRID:SCR_001658)
To assess the statistical difference between original and synthetic epidemic curves, we conducted the paired two-sided t-test (scipy v1.5.3, stats.ttest_rel) and two-sided Wilcoxon signed-rank test (scipy v1.5.3,	scipy suggested: (SciPy, RRID:SCR_008058)
Visualizations: All visualizations (Plotly v4.14.1,	Plotly suggested: (Plotly, RRID:SCR_013991)
Each visualization was qualitatively tested for colorblind deuteranopia, protanopia, and tritanopia interpretability by one member of the research team (JAT) using Color Oracle.[39]	Color Oracle. suggested: (Color Oracle, RRID:SCR_018400)
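
For readers unfamiliar with the paired t-test named in the table above, the statistic can be computed with the standard library alone, as sketched here. The manuscript reports using `stats.ttest_rel` from SciPy, which also returns the two-sided p-value; the function below is a hand-rolled illustration, and the daily counts are invented toy values.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(original, synthetic):
    """Return the paired t statistic and its degrees of freedom.

    Both inputs are equal-length sequences of counts for the same days.
    """
    diffs = [o - s for o, s in zip(original, synthetic)]
    n = len(diffs)
    # Standard error of the mean difference (sample standard deviation).
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Toy daily counts for illustration only (not real data).
t_stat, dof = paired_t_statistic([10, 12, 15, 11, 9], [9, 12, 14, 12, 9])
```

A t statistic near zero, as in this toy case, means the mean daily difference between the two curves is small relative to its variability. The test is undefined when every paired difference is identical, because the standard deviation of the differences is then zero.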

Results from OddPub: Thank you for sharing your data.

Results from LimitationRecognizer: We detected the following sentences addressing limitations in the study:

Our findings show the importance of understanding the characteristics and limitations of the original data since we found these biases affected synthetic data utility. Data biases resulting in poorer performance of software tools, clinical guidelines and other applications for groups underrepresented in source data have been previously reported for separate tasks.[12,42–45] Foraker et al. (2021) found that censored zip codes had greater missingness of SDOH values in the original data than uncensored zip codes. In our study, we found the bulk of patients in the N3C data live in a small minority of zip codes (Figure 2), likely those most adjacent to institutions contributing data. These zip codes are therefore more likely to be urban and less likely to have their zip code censored (Table 3). As a consequence, rural zip codes, which are already underrepresented in the original data, become even less available to directly analyze. Additionally, patients with censored zip codes were older, potentially due to older patients traveling from sparsely tested regions to receive care offered at distant academic medical centers which participate in N3C. While our results demonstrate the utility of using synthetic data for a broad range of geospatial analyses, a caveat to synthetic data use is its utility to analyze rural N3C populations since nearly all zip codes with <10 tests were censored and much more likely to be rural within the original data. Suppression of non-zero counts <10 is a ...
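
The suppression of non-zero counts below 10 mentioned in the excerpt above can be illustrated with a generic small-cell masking rule. This sketch is not the mechanism used to generate the synthetic data, which the excerpt does not describe; the function name, the `(zip code, month)` keys, and the toy counts are assumptions.

```python
def suppress_small_counts(counts, threshold=10):
    """Mask non-zero counts below `threshold` with None; keep zeros and larger counts."""
    return {
        key: (None if 0 < n < threshold else n)
        for key, n in counts.items()
    }


# Toy monthly counts keyed by (zip code, month); values are invented.
monthly_counts = {
    ("00001", "2020-04"): 120,
    ("00002", "2020-04"): 4,
    ("00003", "2020-04"): 0,
}
masked = suppress_small_counts(monthly_counts)
```

In this sketch, only the count of 4 is masked. The consequence for analysis matches the caveat in the excerpt: sparsely tested strata lose their values.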

Results from TrialIdentifier: No clinical trial numbers were referenced.

Results from Barzooka: We did not find any issues relating to the usage of bar graphs.

Results from JetFighter: We did not find any issues relating to colormaps.

Results from rtransparent:

Thank you for including a conflict of interest statement. Authors are encouraged to include this statement when submitting to a journal.
No funding statement was detected.
No protocol registration statement was detected.

Results from scite Reference Check: We found no unreliable references.

Version published to 10.1101/2021.07.06.21259051 on medRxiv
Jul 8, 2021

An assessment of overdose mortality risk across the urban–rural continuum: Integrating satellite-derived and socioeconomic indicators

This article has 2 authors:
1. Gia Barboza
2. Taylor Harrington
This article has no evaluations. Latest version: Jul 30, 2025

A Statewise Analysis of the Socioeconomic and Health Impacts of the COVID-19 Pandemic in India: Lessons for Future Health System Preparedness

This article has 9 authors:
1. Geetha R. Menon
2. U Venkatesh
3. Jeetendra Yadav
4. Krushna Chandra Sahoo
5. Tanu Anand
6. Ashoo Grover
7. Saurabh Sharma
8. Sandhya Singh
9. Firoz Khan
This article has no evaluations. Latest version: Jul 2, 2025

Spatiotemporal dynamics of waterborne and foodborne disease outbreaks in Brazil: regional inequality analysis and risk area mapping to inform public health strategies

This article has 13 authors:
1. Matheus Santos Melo
2. Janaína de Sousa Menezes
3. Vitor Vieira Vasconcelos
4. Allan Dantas dos Santos
5. Tarcilla Corrente Borghesan
6. Renata Carla de Oliveira
7. Pedro de Alcântara Brito Júnior
8. Josivânia Arrais de Figueiredo
9. Luís Ricardo Santos de Melo
10. Carla Oliveira de Castro
11. Silene Lima Dourado Ximenes Santos
12. Francisco Edilson Ferreira de Lima Júnior
13. Alda Maria Da-Cruz
This article has no evaluations. Latest version: Aug 8, 2025
