Detecting Duplicates in Bug Tracking Systems with Artificial Intelligence: A Combined Retrieval and Classification Approach

Abstract

Duplicate bug reports increase the workload of software engineering teams and delay the resolution of critical issues, making automated detection essential. This paper presents a two-stage approach that combines transformer-based semantic retrieval with classical machine-learning classification. In the first stage, the text of each defect report is embedded using a transformer model such as BERT (Bidirectional Encoder Representations from Transformers, google-bert/bert-base-uncased), MiniLM (Miniature Language Model, sentence-transformers/all-MiniLM-L6-v2) or MPNet (Masked and Permuted Pre-training for Language Understanding, sentence-transformers/all-mpnet-base-v2), and the embeddings are used to retrieve semantically similar reports, which narrows the candidate search space. In the second stage, the retrieved candidate pairs are classified as duplicates or non-duplicates using an algorithm such as XGBoost (eXtreme Gradient Boosting), SVM (support vector machine) or logistic regression. Because the classifier scores only the retrieved candidates rather than every possible pair of reports, this hybrid method improves accuracy while substantially lowering computational cost. Experimental results validate the proposed approach, demonstrating robust accuracy and consistent performance in identifying duplicate defect reports.
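The retrieve-then-classify structure described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. To keep it runnable without model downloads, a bag-of-words count vector stands in for the transformer embedding, and a small hand-written logistic regression stands in for the second-stage classifier. All function names, pair features, and report texts below are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # ASSUMPTION: a token-count vector stands in for a transformer embedding.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Stage 1: indices of the k reports most similar to the query."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(doc)), i) for i, doc in enumerate(corpus)),
                    reverse=True)
    return [i for _, i in scored[:k]]


def pair_features(a, b):
    # Hypothetical pair features: embedding cosine and token Jaccard overlap.
    ta, tb = set(tokenize(a)), set(tokenize(b))
    union = ta | tb
    jaccard = len(ta & tb) / len(union) if union else 0.0
    return [cosine(embed(a), embed(b)), jaccard]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=2000):
    """Stage 2 training: logistic regression fitted by stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def duplicate_probability(model, a, b):
    w, bias = model
    x = pair_features(a, b)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + bias)


# Toy labelled pairs (1 = duplicate, 0 = not duplicate); invented for this sketch.
train_pairs = [
    ("Export to PDF fails with error", "PDF export fails with an error", 1),
    ("Search returns no results", "Search gives no results for any query", 1),
    ("Export to PDF fails with error", "Font size too small on mobile", 0),
    ("Search returns no results", "Notifications arrive twice", 0),
]
model = train_logreg([pair_features(a, b) for a, b, _ in train_pairs],
                     [label for _, _, label in train_pairs])

corpus = [
    "App crashes when saving a file on Windows",
    "Login page shows blank screen after update",
    "Dark mode colors are wrong in settings menu",
]
new_report = "Crash when saving file on Windows 11"

candidates = retrieve(new_report, corpus, k=2)           # stage 1: narrow the search
probs = {i: duplicate_probability(model, new_report, corpus[i])
         for i in candidates}                            # stage 2: score candidates only
for i, p in probs.items():
    print(f"report {i}: duplicate probability {p:.2f}")
```

In a fuller version, `embed` would call one of the transformer encoders named in the abstract and the classifier could be swapped for XGBoost or an SVM; the point of the sketch is only that the classifier runs on the `k` retrieved candidates rather than on every report in the tracker.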