AI-Driven Code Documentation: Comparative Evaluation of LLMs for Commit Message Generation

Abstract

Commit messages are essential for understanding software evolution and maintaining project traceability; nevertheless, their quality varies widely across repositories. Recent large language models (LLMs) offer a promising path to automating this task by generating concise, context-sensitive commit messages directly from code diffs. This paper presents a comparative study of three LLM paradigms: zero-shot prompting, retrieval-augmented generation, and fine-tuning, using the large-scale CommitBench dataset, which spans six programming languages. We assess model performance with automatic metrics, namely BLEU, ROUGE-L, METEOR, and Adequacy, and with a human assessment of 100 commits in which experienced developers rated each generated commit message for Adequacy and Fluency on a five-point Likert scale. The results show that fine-tuning and domain adaptation yield models that consistently outperform general-purpose baselines across all evaluation metrics, producing commit messages with higher semantic adequacy and clearer phrasing than zero-shot prompting. The correlation analysis suggests that the Adequacy and BLEU scores align more closely with human judgment, while ROUGE-L and METEOR tend to underestimate quality when models generate stylistically diverse or paraphrased outputs. Finally, the study outlines a conceptual integration pathway for incorporating such models into software development workflows, emphasizing a human-in-the-loop approach to quality assurance.
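The automatic metrics named in the abstract (BLEU, ROUGE-L, METEOR) and their correlation with human Likert ratings can be computed with standard open-source tooling. The sketch below is illustrative only: it assumes NLTK, the rouge-score package, and SciPy, uses hypothetical reference/generated commit-message pairs and Adequacy ratings, and does not reproduce the paper's actual evaluation pipeline or CommitBench loading code.

# Minimal sketch of the evaluation setup described in the abstract.
# Assumed tooling: nltk, rouge-score, scipy. Data below is hypothetical.
# nltk.download('wordnet') may be required once for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

references = [  # ground-truth commit messages (hypothetical)
    "fix null pointer check in user login handler",
    "add retry logic to the payment client",
    "update README with build instructions",
]
hypotheses = [  # model-generated commit messages (hypothetical)
    "add null check to login handler to avoid NPE",
    "introduce retries for payment client requests",
    "document build steps in README",
]
human_adequacy = [4, 5, 3]  # 1-5 Likert ratings from developers (hypothetical)

smooth = SmoothingFunction().method1
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

bleu_scores, rouge_scores, meteor_scores = [], [], []
for ref, hyp in zip(references, hypotheses):
    ref_tok, hyp_tok = ref.split(), hyp.split()
    bleu_scores.append(sentence_bleu([ref_tok], hyp_tok, smoothing_function=smooth))
    rouge_scores.append(rouge.score(ref, hyp)["rougeL"].fmeasure)
    meteor_scores.append(meteor_score([ref_tok], hyp_tok))

# Correlation of each automatic metric with human Adequacy ratings,
# analogous to the abstract's correlation analysis (a real study would use far more commits).
for name, scores in [("BLEU", bleu_scores), ("ROUGE-L", rouge_scores), ("METEOR", meteor_scores)]:
    rho, _ = spearmanr(scores, human_adequacy)
    print(f"{name}: mean={sum(scores) / len(scores):.3f}, spearman_rho={rho:.3f}")

On a full evaluation set, this kind of per-commit scoring followed by rank correlation against human ratings is what would reveal the pattern the abstract reports, namely some metrics tracking human judgment more closely than others when outputs are paraphrased.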