Impact of Data Quality on CNN-Based Sewer Defect Detection
Abstract
Sewer pipelines are essential urban infrastructure and play a key role in sanitation and disaster prevention. Regular condition assessments are necessary to detect defects early and to determine optimal maintenance timing. However, traditional visual inspection of closed-circuit television (CCTV) footage is time-consuming, labor-intensive, and dependent on subjective human judgment. To address these limitations, this study develops a convolutional neural network (CNN)-based sewer defect classification model and analyzes how data quality issues, such as mislabeled or redundant images, affect model accuracy. A large-scale public dataset of approximately 470,000 sewer images was used for training. The model was designed to classify images into a non-defect class and three major defect categories. Built on the ResNet50 architecture, the model incorporated dropout and L2 regularization to reduce overfitting. In the experiments, the highest accuracy, 92.75%, was obtained with a dropout rate of 0.2 and an L2 regularization coefficient of 0.01. Further analysis revealed that mislabeled, redundant, or obscured images within the dataset negatively impacted model performance. Additional experiments quantified the impact of data quality on accuracy, underscoring the importance of careful dataset curation. This study provides practical insights for optimizing data-driven approaches to automated sewer defect detection and for developing high-performance models.
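As an illustration of one curation step the abstract alludes to, the sketch below flags redundant images in a dataset folder. The abstract does not say how redundant images were identified in the study, so the approach here (matching byte-identical files by SHA-256 digest) is an assumption, and the function names, the directory layout, and the `*.png` pattern are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def split_unique_and_redundant(image_dir, pattern="*.png"):
    """Partition images into first-seen files and byte-identical repeats.

    Assumption: "redundant" means an exact byte-level copy of an earlier file.
    Returns (unique_paths, [(redundant_path, original_path), ...]).
    """
    seen = {}  # digest -> first path found with that content
    unique, redundant = [], []
    # Sort so that the result does not depend on filesystem ordering.
    for path in sorted(Path(image_dir).rglob(pattern)):
        digest = file_digest(path)
        if digest in seen:
            redundant.append((path, seen[digest]))
        else:
            seen[digest] = path
            unique.append(path)
    return unique, redundant
```

Exact hashing only catches verbatim copies. Consecutive CCTV frames that look nearly identical but differ in a few pixels would pass through unflagged and would need a similarity-based comparison instead.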