EpiCurveBench: Evaluating epidemic curve digitization
Abstract
Accurate data on disease case counts over time is essential for training reliable disease forecasting models. However, such data is often locked in non-machine-readable formats, most commonly as epidemic curve (epicurve) images: charts that depict case counts of a given disease over time for a given location. Digitizing these charts would greatly expand the data available for forecasting models, improving their accuracy. Manual digitization, though, is very time-consuming, and existing automated methods struggle with real-world epicurves because of dense datapoints, overlapping series, and varied visual styles. To address this, we present EpiCurveBench, a benchmark of 100 manually curated and annotated epicurve images collected from diverse sources. The dataset spans a wide range of chart styles, from simple to highly complex. We also introduce Epi-Curve Similarity (ECS), a new evaluation metric that captures the temporal structure of epicurves, handles series of varying lengths, and remains stable in the presence of incomplete data. Using this metric, we evaluate state-of-the-art chart data extraction methods on EpiCurveBench and find substantial room for improvement, with the best model achieving an ECS of only 42.9%. We release the dataset and evaluation pipeline to accelerate progress in epicurve extraction. More broadly, because EpiCurveBench is harder than existing chart extraction benchmarks, it provides a rigorous testbed for advancing chart data extraction methods beyond disease forecasting.
Institutional Review Board (IRB)
This research does not require IRB approval.