Temporal validity of software datasets for code metrics: an empirical assessment of sampling strategies

Abstract

Context: In empirical research, drawing reliable conclusions about a target population requires working with representative samples. Representativeness is the degree to which a sample's properties of interest resemble those of the target population. However, a sample that was representative in the past may no longer be representative today if the population has evolved significantly in the interim.

Objective: To evaluate the effectiveness of a dataset extraction tool for collecting current samples of software repositories and maintaining their temporal validity over time.

Method: We performed a Mining Software Repositories study using three datasets: Tempero et al.'s Qualitas Corpus, a sample from GitHub, and an updated version of the Qualitas Corpus. From these datasets, we derived thresholds for three source code metrics (Lines of Code, Cyclomatic Complexity, and Weighted Methods per Class) and compared whether the thresholds yielded consistent results.

Results: We observed significant differences in all three source code metrics when comparing the Qualitas Corpus against samples containing projects with recent development data, with the former registering higher thresholds. Furthermore, the samples collected with our extraction tool yielded consistent thresholds across collections.

Conclusions: Using outdated code-based datasets in empirical studies can affect study results; it is therefore important that researchers not only publish their datasets but also provide strategies to keep them up to date. In addition, we presented and validated the implemented sampling approaches, demonstrating their effectiveness at collecting current samples.
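To make the method concrete, the sketch below shows one common way to derive metric thresholds from a sample: take fixed percentiles of the observed metric distribution. This is an illustrative assumption, not necessarily the paper's exact procedure; the function name, quantile choices, and the toy cyclomatic-complexity values are all hypothetical.

```python
def metric_thresholds(values, quantiles=(0.70, 0.80, 0.90)):
    """Derive percentile-based thresholds from a sample of metric values.

    A simple sketch of threshold derivation: sort the observed values
    and read off the chosen percentiles. The quantile levels here are
    illustrative, not the ones used in the study.
    """
    ordered = sorted(values)
    n = len(ordered)
    return tuple(ordered[min(n - 1, int(q * n))] for q in quantiles)

# Hypothetical cyclomatic-complexity samples, illustrating how an older
# corpus can register higher thresholds than recently mined projects.
older_sample = [1, 2, 2, 3, 4, 5, 6, 8, 12, 20]    # e.g. legacy corpus
recent_sample = [1, 1, 2, 2, 3, 3, 4, 5, 7, 10]    # e.g. fresh GitHub sample

print(metric_thresholds(older_sample))   # → (8, 12, 20)
print(metric_thresholds(recent_sample))  # → (5, 7, 10)
```

Comparing the two tuples mirrors the paper's comparison: if the population has drifted, thresholds computed from an outdated sample will disagree with those from a current one.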
