AI-Driven Code Documentation: Comparative Evaluation of LLMs for Commit Message Generation

Abstract

Commit messages are essential for understanding software evolution and maintaining project traceability; nevertheless, their quality varies widely across repositories. Recent large language models (LLMs) offer a promising path to automating this task by generating concise, context-sensitive commit messages directly from code diffs. This paper presents a comparative study of three LLM paradigms: zero-shot prompting, retrieval-augmented generation, and fine-tuning, using the large-scale CommitBench dataset, which spans six programming languages. We assess model performance with automatic metrics, namely BLEU, ROUGE-L, METEOR, and Adequacy, and with a human assessment of 100 commits in which experienced developers rated each generated commit message for Adequacy and Fluency on a five-point Likert scale. The results show that fine-tuning and domain adaptation yield models that consistently outperform general-purpose baselines across all evaluation metrics, producing commit messages with higher semantic adequacy and clearer phrasing than zero-shot prompting. The correlation analysis suggests that the Adequacy and BLEU scores align more closely with human judgment, while ROUGE-L and METEOR tend to underestimate quality when models generate stylistically diverse or paraphrased outputs. Finally, the study outlines a conceptual integration pathway for incorporating such models into software development workflows, emphasizing a human-in-the-loop approach to quality assurance.
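The automatic metrics named in the abstract (BLEU, ROUGE-L, METEOR) and their correlation with human Likert ratings can be computed with standard open-source tooling. The sketch below is illustrative only: it assumes NLTK, the rouge-score package, and SciPy, uses hypothetical reference/generated commit-message pairs and Adequacy ratings, and does not reproduce the paper's actual evaluation pipeline or CommitBench loading code.

# Minimal sketch of the evaluation setup described in the abstract.
# Assumed tooling: nltk, rouge-score, scipy. Data below is hypothetical.
# nltk.download('wordnet') may be required once for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

references = [  # ground-truth commit messages (hypothetical)
    "fix null pointer check in user login handler",
    "add retry logic to the payment client",
    "update README with build instructions",
]
hypotheses = [  # model-generated commit messages (hypothetical)
    "add null check to login handler to avoid NPE",
    "introduce retries for payment client requests",
    "document build steps in README",
]
human_adequacy = [4, 5, 3]  # 1-5 Likert ratings from developers (hypothetical)

smooth = SmoothingFunction().method1
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

bleu_scores, rouge_scores, meteor_scores = [], [], []
for ref, hyp in zip(references, hypotheses):
    ref_tok, hyp_tok = ref.split(), hyp.split()
    bleu_scores.append(sentence_bleu([ref_tok], hyp_tok, smoothing_function=smooth))
    rouge_scores.append(rouge.score(ref, hyp)["rougeL"].fmeasure)
    meteor_scores.append(meteor_score([ref_tok], hyp_tok))

# Correlation of each automatic metric with human Adequacy ratings,
# analogous to the abstract's correlation analysis (a real study would use far more commits).
for name, scores in [("BLEU", bleu_scores), ("ROUGE-L", rouge_scores), ("METEOR", meteor_scores)]:
    rho, _ = spearmanr(scores, human_adequacy)
    print(f"{name}: mean={sum(scores) / len(scores):.3f}, spearman_rho={rho:.3f}")

On a full evaluation set, this kind of per-commit scoring followed by rank correlation against human ratings is what would reveal the pattern the abstract reports, namely some metrics tracking human judgment more closely than others when outputs are paraphrased.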