Do We Practice What We Preach? An Exploratory Introspective Analysis of ICSE Research Artifacts
Abstract
Background/Context: Research artifacts in Software Engineering accompany empirical studies and are expected to embody the engineering practices our community advocates. These artifacts serve dual purposes: supporting reproducibility of research findings and demonstrating practical application of software engineering principles. Yet the alignment between advocated practices and their actual implementation in research artifacts remains underexamined.

Goals: This study provides a quantitative examination of engineering practices in Python research artifacts from the International Conference on Software Engineering (ICSE) (2019--2023), focusing on process metrics, code quality, and testing practices. We contribute empirical evidence to discussions on reproducibility, artifact evaluation, and the role of engineering quality in research artifacts.

Methods: We developed the CIRAS (Code Insight and Repository Analysis System) framework and implemented it for Python (PyCIRAS) to conduct repository mining on 90 badged research artifacts. The analysis covered Git process data via the Delta Maintainability Model, static code quality via Pylint, and unit-testing metrics, including test-to-production (TPS) ratios.

Results: Findings revealed substantial variability. While artifacts showed reasonable structural maintainability, many exhibited low linting scores (mean 1.72/10), frequent errors (54.66% import errors), sparse documentation (68% undocumented modules), and minimal testing (mean TPS ratio 0.01). No statistically significant correlation emerged between these metrics and repository popularity (GitHub stars).

Conclusions: Within our bounded sample of 90 Python artifacts from ICSE 2019--2023, we observe a gap between advocated practices and their implementation in research artifacts. The contribution is primarily introspective, demonstrating how repository mining can quantify this gap and provide evidence-based recommendations. We discuss implications for artifact evaluation, training, and the potential role of generative AI in addressing quality gaps.