Evaluating Embedding Representations for Multiclass Code Smell Detection: A Comparative Study of CodeBERT and General-Purpose Embeddings
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Code smells are indicators of potential design problems in software systems and are commonly used to guide refactoring activities. Recent advances in representation learning have enabled the use of embedding-based models for analyzing source code, offering an alternative to traditional approaches based on manually engineered metrics. However, the effectiveness of different embedding representations for multiclass code smell detection remains insufficiently explored. This study presents an empirical comparison of embedding models for the automatic detection of three widely studied code smells: Long Method, God Class, and Feature Envy. Using the Crowdsmelling dataset as an empirical basis, source code fragments were extracted from the original projects and transformed into vector representations using two embedding approaches: a general-purpose embedding model and the code-specialized CodeBERT model. The resulting representations were evaluated using several machine learning classifiers under a stratified group-based validation protocol. The results show that CodeBERT consistently outperforms the general-purpose embeddings across multiple evaluation metrics, including balanced accuracy, macro F1-score, and Matthews correlation coefficient. Dimensionality reduction analyses using PCA and t-SNE further indicate that CodeBERT organizes code smell instances in a more structured latent representation space, which facilitates the separation of smell categories. These findings provide empirical evidence that domain-specific pretraining plays an important role in representation learning for software engineering tasks and that embedding choice significantly influences the separability of code smell categories in multiclass detection settings.