Investigating Reproducibility Challenges in LLM Bugfixing on the HumanEvalFix Benchmark

Abstract

Benchmark results for Large Language Models (LLMs) often show inconsistencies across studies. This paper investigates the challenges of reproducing such results for automatic bugfixing with LLMs on the HumanEvalFix benchmark. To determine the cause of the differing results in the literature, we attempted to reproduce a subset of them by evaluating 11 models from the DeepSeekCoder, CodeGemma, and CodeLlama model families, across different sizes and tunings. A total of 32 unique results were reported for these models across studies, of which we successfully reproduced 16. We identified several factors that influence the results. Base models can be confused with their instruction-tuned variants, making the reported results better than expected. Incorrect prompt templates or generation lengths can decrease benchmark performance, as can 4-bit quantization. Using sampling instead of greedy decoding can increase variance, especially at higher temperature values. We found that floating-point precision and 8-bit quantization have less influence on benchmark results.
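To make the decoding and quantization factors concrete, the following is a minimal sketch (not the authors' evaluation harness) of how such settings are commonly configured with the Hugging Face transformers library. The model identifier, prompt, and generation parameters are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of the evaluation settings discussed above
# (greedy vs. sampling decoding, 4-bit quantization). Not the paper's harness;
# the model name, prompt, and parameter values are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"  # assumed example model

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit quantization: one of the factors reported to lower benchmark scores.
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Placeholder HumanEvalFix-style bugfixing prompt.
prompt = "Fix the bug in the following function:\n# <buggy code here>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, so repeated runs give identical outputs.
greedy_out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Sampling: higher temperature increases run-to-run variance in results.
sampled_out = model.generate(
    **inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.95
)

print(tokenizer.decode(greedy_out[0], skip_special_tokens=True))
```

Under these assumptions, a too-small `max_new_tokens` value or a prompt template that does not match the model's tuning would truncate or derail the generated fix, which is consistent with the performance drops described in the abstract.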
