Child Sexual Abuse Datasets: A Systematic Review

Abstract

The rapid growth of Child Sexual Abuse Imagery (CSAI) online demands automated tools to support timely and effective investigations. Machine learning models are essential for triage tasks such as CSAI classification. However, strict legal restrictions on data access force researchers to rely on datasets curated by law enforcement or proxy datasets from related domains. In this systematic review, we examine datasets — containing both real CSAI and CSAI-like content (e.g., synthetic or approximate) — used for training, evaluation, or statistical analysis in CSAI-related machine learning research. We distinguish between main datasets, used directly for tasks such as CSAI classification, and proxy datasets, used for related tasks like age estimation. Our analysis reveals a prevailing model-centric paradigm that prioritizes algorithmic performance while neglecting critical dataset properties, such as diversity, documentation, and fairness. This tendency risks introducing harmful biases and unintended effects when models are deployed in real-world contexts. To address these concerns, we evaluate the strengths and limitations of existing datasets, highlight key CSAI-specific data attributes, and advocate for a shift toward data-centric practices. We emphasize the urgent need for transparent dataset creation and standardized documentation to improve AI systems' ethical integrity and reliability in this high-stakes domain.
