Detecting Duplicates in Bug Tracking Systems with Artificial Intelligence: A Combined Retrieval and Classification Approach
Abstract
The presence of duplicate bug reports in defect tracking systems places an additional burden on software engineering specialists and can delay the fixing of critical bugs. Automated duplicate detection relieves this burden and reduces the time and cost of processing the reports. Detecting duplicate bug reports in large databases is a challenging task that requires a balance between computational efficiency and prediction accuracy. Traditional approaches either rely on resource-intensive searches or use classification models that, while highly accurate, compromise performance. This paper proposes a new approach to automatic duplicate bug detection based on a two-level analysis of the textual features of reports. The first stage vectorises the text data using the BERT (Bidirectional Encoder Representations from Transformers), MiniLM (Miniature Language Model) and MPNet (Masked and Permuted Pre-training for Language Understanding) transformer models, which capture the semantic similarity between defect descriptions. This narrows the set of potential duplicates and reduces the volume of reports that need to be compared. The second stage classifies pairs of potential duplicates using machine learning algorithms, including XGBoost (eXtreme Gradient Boosting), SVM (Support Vector Machines) and logistic regression. The models are trained on vector representations of the text to assess the degree of similarity between reports. Combining transformer models with classical classification algorithms yields high accuracy in duplicate detection while significantly reducing query processing time. The experimental results confirm the effectiveness of the approach, demonstrating its ability to reduce the number of required comparisons, cut the cost of analysing defect reports, and achieve sufficient accuracy in duplicate detection.
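As a rough illustration of the two-stage pipeline described above, the sketch below combines the sentence-transformers and scikit-learn libraries: a MiniLM encoder retrieves candidate duplicates by cosine similarity, and a logistic regression classifier scores each candidate pair. The model name, top-k cutoff, pairwise features, and toy data are assumptions made for illustration only and do not reflect the paper's exact configuration.

```python
# Minimal sketch of a retrieval-then-classification duplicate detector.
# Assumes the sentence-transformers and scikit-learn packages; all names,
# thresholds and toy labels below are illustrative, not the paper's setup.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# Stage 1: embed report texts and retrieve candidate duplicates by cosine similarity.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # MiniLM variant; MPNet/BERT models are drop-in alternatives

def candidate_pairs(reports, top_k=5):
    """Return embeddings and (i, j, similarity) triples for each report's top_k nearest reports."""
    embeddings = encoder.encode(reports, convert_to_numpy=True, normalize_embeddings=True)
    sims = cosine_similarity(embeddings)
    np.fill_diagonal(sims, -1.0)  # exclude self-matches
    pairs = []
    for i, row in enumerate(sims):
        for j in np.argsort(row)[::-1][:top_k]:
            pairs.append((i, int(j), float(row[j])))
    return embeddings, pairs

# Stage 2: build simple pairwise features and classify candidates as duplicate / non-duplicate.
def pair_features(embeddings, pairs):
    """Element-wise |difference| of the two embeddings plus the retrieval similarity."""
    return np.array([
        np.concatenate([np.abs(embeddings[i] - embeddings[j]), [sim]])
        for i, j, sim in pairs
    ])

# Illustrative usage with toy reports; real labels would come from the tracker's duplicate links.
reports = [
    "App crashes when opening the settings page",
    "Crash on settings screen after the latest update",
    "Login button does nothing on first tap",
]
embeddings, pairs = candidate_pairs(reports, top_k=2)
X = pair_features(embeddings, pairs)
y = np.array([1, 0, 1, 0, 0, 0])  # hypothetical labels, one per candidate pair
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X)[:, 1])  # duplicate probability for each candidate pair
```

In this arrangement the encoder limits the expensive pairwise comparison to a small candidate set, while the lightweight classifier (logistic regression here; SVM or XGBoost could be substituted) makes the final duplicate decision on each retrieved pair.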