Do We Practice What We Preach? An Exploratory Introspective Analysis of ICSE Research Artifacts
Abstract
Background/Context: Research artifacts in Software Engineering accompany empirical studies and are expected to embody the engineering practices our community advocates. These artifacts serve dual purposes: supporting reproducibility of research findings and demonstrating practical application of software engineering principles. Yet the alignment between advocated practices and their actual implementation in research artifacts remains underexamined.

Goals: This study provides a quantitative examination of engineering practices in Python research artifacts from the International Conference on Software Engineering (ICSE) (2019--2023), focusing on process metrics, code quality, and testing practices. We contribute empirical evidence to discussions on reproducibility, artifact evaluation, and the role of engineering quality in research artifacts.

Methods: We developed the CIRAS (Code Insight and Repository Analysis System) framework and implemented it for Python (PyCIRAS) to conduct repository mining on 90 badged research artifacts. The analysis covered Git process data via the Delta Maintainability Model, static code quality via Pylint, and unit-testing metrics, including test-to-production (TPS) ratios.

Results: Findings revealed substantial variability. While artifacts showed reasonable structural maintainability, many exhibited low linting scores (mean 1.72/10), frequent errors (54.66% import errors), sparse documentation (68% undocumented modules), and minimal testing (mean TPS ratio 0.01). No statistically significant correlation emerged between these metrics and repository popularity (GitHub stars).

Conclusions: Within our bounded sample of 90 Python artifacts from ICSE 2019--2023, we observe a gap between advocated practices and their implementation in research artifacts. The contribution is primarily introspective, demonstrating how repository mining can quantify this gap and provide evidence-based recommendations. We discuss implications for artifact evaluation, training, and the potential role of generative AI in addressing quality gaps.