QModel: A Time-Aware GitHub Mining Framework for Empirical Software Quality Studies

Abstract

Empirical studies in software engineering frequently rely on ad hoc scripts to mine GitHub data, which makes metrics hard to compare and results difficult to reproduce. This paper presents QModel, an open-source framework that automatically collects and links repository information about commits (as a DAG, a directed acyclic graph), pull requests, issues, timelines, file changes, and user reactions into a consistent relational schema designed for quality analysis. Its companion module, QModel Compilation, turns SQL queries over this schema into executable analyses by generating feature-target datasets and running statistical or machine-learning strategies (correlation, regression, PCA, random forest, and others). Together, the tools provide an end-to-end, containerized pipeline that allows researchers and practitioners to define quality hypotheses in SQL, recreate analyses across projects, and explore how process and structural characteristics (e.g., branching depth, merge activity, developer responsiveness) relate to outcomes such as review time and defect density. We illustrate the framework on long-lived GitHub projects, combining time-aware graph metrics with SZZ-style defect linking, and show how metrics of bug-introducing commits can serve as lightweight proxies for process bottlenecks and delayed defect handling in distributed development. All source code, container images, and replication notebooks are publicly available, supporting the community goal of transparent, reusable, and extensible research on software quality.
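
The idea of stating a quality hypothesis as an SQL query that yields a feature-target dataset can be illustrated with a minimal sketch. It does not use QModel's actual schema or API, which the abstract does not specify: the `pull_requests` table, its columns, the synthetic rows, and the `pearson` helper are all hypothetical, and a plain Pearson correlation stands in for the analysis strategies the abstract lists.

```python
import sqlite3
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical, simplified table; not the schema used by QModel.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pull_requests ("
    "id INTEGER PRIMARY KEY, changed_files INTEGER, "
    "created_at TEXT, merged_at TEXT)"
)

# Synthetic rows for illustration only; not data from any real project.
conn.executemany(
    "INSERT INTO pull_requests VALUES (?, ?, ?, ?)",
    [
        (1, 2, "2024-01-01 00:00:00", "2024-01-01 06:00:00"),
        (2, 5, "2024-01-02 00:00:00", "2024-01-02 12:00:00"),
        (3, 9, "2024-01-03 00:00:00", "2024-01-04 00:00:00"),
        (4, 14, "2024-01-05 00:00:00", "2024-01-06 12:00:00"),
    ],
)

# The hypothesis as SQL: does the number of changed files (feature)
# relate to the review time in hours (target)?
QUERY = """
SELECT changed_files AS feature,
       (julianday(merged_at) - julianday(created_at)) * 24.0 AS target
FROM pull_requests
WHERE merged_at IS NOT NULL
ORDER BY id
"""

rows = conn.execute(QUERY).fetchall()
features = [row[0] for row in rows]
targets = [row[1] for row in rows]

r = pearson(features, targets)
print(f"Pearson r between changed files and review hours: {r:.3f}")
```

Keeping the hypothesis in the query and the statistic in a separate function means the same query could be rerun against another project's database, or the correlation swapped for a regression, without touching the other half.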
