EpiCurveBench: Evaluating epidemic curve digitization

Thomas Berkane
Maimuna S. Majumder

Abstract

Accurate data on disease case counts over time is essential for training reliable disease forecasting models. However, such data is often locked in non-machine-readable formats, most commonly as epidemic curve (epicurve) images—charts that depict case counts of a given disease over time, for a given location. Digitizing these charts would greatly expand the data available for forecasting models, improving their accuracy. Manual digitization, though, is very time-consuming, and existing automated methods struggle with real-world epicurves due to dense datapoints, overlapping series, and varied visual styles. To address this, we present EpiCurveBench, a benchmark of 100 manually curated and annotated epicurve images collected from diverse sources. The dataset spans a wide range of chart styles, from simple to highly complex. We also introduce Epi-Curve Similarity (ECS), a new evaluation metric that captures the temporal structure of epicurves, handles series of varying lengths, and remains stable in the presence of incomplete data. Using this metric, we evaluate state-of-the-art chart data extraction methods on EpiCurveBench and find substantial room for improvement, with the best model achieving an ECS of only 42.9%. We release the dataset and evaluation pipeline to accelerate progress in epicurve extraction. More broadly, the difficulty of EpiCurveBench compared to existing chart extraction benchmarks provides a rigorous testbed for advancing chart data extraction methods beyond disease forecasting.

Institutional Review Board (IRB)

This research does not require IRB approval.

Version published to 10.1101/2025.09.23.25336494 on medRxiv, Sep 25, 2025

Related articles

Classification of Bio-Data with Interval Dissimilarities: A Multidimensional Scaling Framework

This article has 4 authors:
1. Md. Anwarul Islam Bhuiyan
2. Sohana Jahan
3. Md. Babul Hasan
4. Md. Maruf Hossain
This article has no evaluations
Latest version Jan 21, 2026

Enhancing Time-Varying Reproduction Number Estimates with Behavior and Surveillance Data

This article has 5 authors:
1. Byul Nim Kim
2. Suhyeon Kim
3. Haram Seo
4. Gerardo Chowell
5. Sunmi Lee
This article has no evaluations
Latest version Dec 10, 2025

From Points to Predictions: Data Curation for Geospatial Machine Learning

This article has 3 authors:
1. Louis Saumier
2. Joe Melton
3. Scott Winton
This article has no evaluations
Latest version Jan 22, 2026
