EpiCurveBench: Evaluating epidemic curve digitization

Abstract

Accurate data on disease case counts over time is essential for training reliable disease forecasting models. However, such data is often locked in non-machine-readable formats, most commonly as epidemic curve (epicurve) images: charts that depict case counts over time for a given location. Digitizing these charts would greatly expand the data available for forecasting models, improving their accuracy. Manual digitization, though, is very time-consuming, and existing automated methods struggle with real-world epicurves because of dense data points, overlapping series, and diverse visual styles. To address this, we present EpiCurveBench, a benchmark of 100 manually curated and annotated epicurve images collected from diverse sources. The dataset spans a wide range of chart styles, from simple to highly complex. We also introduce a new evaluation metric, based on Edit Distance with Real Penalty, that captures the temporal structure of epicurves, handles series of varying lengths, and remains stable in the presence of incomplete data. Using this metric, we evaluate state-of-the-art chart data extraction methods on EpiCurveBench and find substantial room for improvement, with the best model scoring only 25%. We release the dataset and evaluation pipeline to accelerate progress in epicurve extraction. More broadly, EpiCurveBench provides a challenging testbed for advancing research in chart data extraction beyond epidemic forecasting.
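
For orientation, the classic Edit Distance with Real Penalty (ERP) that the metric is described as building on can be sketched as a small dynamic program. This is only the textbook distance, not the benchmark's metric, whose normalization and handling of incomplete data are not specified in the abstract. The function name `erp_distance`, the gap value of 0, and the sample series are assumptions made for illustration.

```python
from typing import Sequence


def erp_distance(a: Sequence[float], b: Sequence[float], gap: float = 0.0) -> float:
    """Textbook Edit Distance with Real Penalty between two numeric series.

    Aligning two points costs their absolute difference; leaving a point
    unaligned costs its absolute difference from the constant `gap`.
    The gap value of 0 is an illustrative assumption.
    """
    m, n = len(a), len(b)
    # dp[i][j] holds the ERP distance between a[:i] and b[:j].
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]

    # Base cases: one series is empty, so every point in the other is a gap.
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + abs(a[i - 1] - gap)
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + abs(b[j - 1] - gap)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),  # align a[i-1] with b[j-1]
                dp[i - 1][j] + abs(a[i - 1] - gap),           # leave a[i-1] unaligned
                dp[i][j - 1] + abs(b[j - 1] - gap),           # leave b[j-1] unaligned
            )
    return dp[m][n]


# Hypothetical weekly case counts: a reference series and an extraction
# that missed the middle point.
reference = [1.0, 2.0, 3.0]
extracted = [1.0, 3.0]
distance = erp_distance(reference, extracted)
```

Because unaligned points are charged against a fixed reference value rather than being forced onto a neighbor, the two series may differ in length, which is the property the abstract highlights for comparing an extracted curve with its ground truth.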
