Duplicate Pull Requests in Code Management Platforms: A Systematic Literature Review

Abstract

Context. Duplicate pull requests (PRs) plague code management platforms such as GitHub, wasting reviewer effort, delaying integration, and frustrating contributors. Although previous research has explored automated detection methods, the problem remains prevalent in both open-source and proprietary projects.

Objective. This study conducts a systematic literature review of duplicate PRs. We synthesize existing knowledge on their root causes, frequency, detection methods, and impacts, evaluate the effectiveness of current approaches, and identify gaps for future research.

Methods. We followed a systematic methodology for identifying and analyzing relevant literature. We selected and reviewed 11 primary studies focused specifically on duplicate PR detection, complemented by an analysis of 39 additional works on general PR management to provide context. The review combined quantitative analysis of reported results with qualitative synthesis of methodologies, features, and limitations.

Results. Approximately 3-12% of PRs in active repositories are duplicates, with the highest occurrence in projects lacking clear contribution guidelines. Current detection approaches fall into two categories, retrieval-based and classification-based, and use features ranging from simple textual similarity to combinations of textual and non-textual attributes. Traditional methods using TF-IDF and cosine similarity achieve 55-83% recall, while more recent approaches incorporating deep learning and contextual analysis report accuracy of up to 92%.

Conclusion. Duplicate PRs represent a substantial inefficiency that demands systematic solutions. Our synthesis suggests that combining improved detection algorithms with better community practices could significantly reduce duplication. The review identifies critical research gaps, including limited cross-platform studies, inadequate handling of semantic duplicates, and insufficient attention to human factors, providing a foundation for future work in this domain.
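
To make the retrieval-based approach mentioned in the Results concrete, the following is a minimal sketch of ranking existing PRs against a new one by TF-IDF and cosine similarity. It is an illustration only, not the method of any reviewed study: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_candidates`), the smoothed IDF formula, and the sample PR titles are all assumptions introduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant) keeps every weight positive.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(new_pr, existing_prs):
    """Return (index, similarity) pairs for existing PRs, most similar first."""
    vectors = tfidf_vectors(existing_prs + [new_pr])
    query = vectors[-1]
    scores = [(i, cosine(query, v)) for i, v in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical PR titles, used only to exercise the sketch.
existing = [
    "Fix null pointer crash in login handler",
    "Add dark mode toggle to settings page",
    "Update README installation instructions",
]
ranking = rank_candidates(
    "Fix crash caused by null pointer in the login handler", existing
)
```

Here `ranking` lists the existing PRs in descending order of similarity to the new one, so a reviewer or bot could inspect only the top few candidates. A purely lexical score like this cannot recognize two PRs that make the same change in different words, which is the semantic-duplicate gap the review points out.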
