Automatic Pruning and Quality Assurance of Object Detection Datasets for Autonomous Driving
Abstract
Large amounts of high-quality data are required to train artificial intelligence (AI) models; however, curating such data through human intervention remains cumbersome, time-consuming, and error-prone. In particular, erroneous annotations and statistical imbalances in object detection datasets can significantly degrade model performance in real-world autonomous driving scenarios. This study proposes an automated pruning framework and quality assurance strategy for 2D object detection datasets to address these issues. The framework consists of two stages: (1) identification and deletion of noisy labels based on labeling scores derived from the inference results of multiple object detection models, and (2) statistical distribution whitening based on class and bounding box size diversity metrics. The proposed method was designed in accordance with the ISO/IEC 25012 data quality standard to ensure data consistency, accuracy, and completeness. Experiments were conducted on widely used autonomous driving datasets, including KITTI, Waymo, nuScenes, and large-scale publicly available datasets from South Korea. The automated pruning process eliminated anomalous and redundant samples, yielding a more reliable and compact dataset for model training. The results demonstrate that the proposed method substantially reduces the amount of training data required while enhancing detection performance and minimizing manual inspection effort.
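The two stages can be illustrated with a minimal sketch. The abstract does not define the labeling score or the diversity metrics, so everything below is an assumption made for illustration: the score is taken to be the fraction of detectors that produce a same-class box overlapping the annotation (IoU at or above a threshold), and the "whitening" step is approximated by capping the number of labels per (class, size bucket) bin. The function names, thresholds, and the COCO-style size cut-offs (32² and 96² pixels) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# A label or detection is (class_name, (x1, y1, x2, y2)) in pixel coordinates.

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def labeling_score(label, detections_per_model, iou_thr=0.5):
    """Assumed score: fraction of models with a same-class detection
    whose IoU with the annotated box is at least iou_thr."""
    if not detections_per_model:
        return 0.0
    cls, box = label
    hits = sum(
        any(d_cls == cls and iou(box, d_box) >= iou_thr for d_cls, d_box in dets)
        for dets in detections_per_model
    )
    return hits / len(detections_per_model)

def prune_noisy(labels, detections_per_model, min_score=0.5):
    """Stage 1 (sketch): drop labels that too few models agree with."""
    return [l for l in labels if labeling_score(l, detections_per_model) >= min_score]

def size_bucket(box, small=32 ** 2, large=96 ** 2):
    """Assumed size binning by box area, using COCO-style cut-offs."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < small:
        return "small"
    return "medium" if area < large else "large"

def flatten_distribution(labels, cap):
    """Stage 2 (sketch): keep at most `cap` labels per (class, size) bin,
    a simple stand-in for the distribution whitening described above."""
    kept, counts = [], defaultdict(int)
    for cls, box in labels:
        key = (cls, size_bucket(box))
        if counts[key] < cap:
            counts[key] += 1
            kept.append((cls, box))
    return kept
```

In this sketch, an annotation confirmed by two of three detectors receives a score of 2/3 and survives a `min_score` of 0.5, whereas one that no detector reproduces is removed. A real pipeline would likely also weight detections by confidence and prune at the image level rather than per label; those choices are left open here.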