Investigating Reproducibility Challenges in LLM Bugfixing on the HumanEvalFix Benchmark

Abstract

Benchmark results for Large Language Models (LLMs) often show inconsistencies across studies. This paper investigates the challenges of reproducing such results for automatic bugfixing with LLMs on the HumanEvalFix benchmark. To determine the cause of the differing results in the literature, we attempted to reproduce a subset of them by evaluating 11 models from the DeepSeekCoder, CodeGemma, and CodeLlama model families, across different sizes and tunings. A total of 32 unique results were reported for these models across studies, of which we successfully reproduced 16. We identified several factors that influence the results. Base models can be confused with their instruction-tuned variants, making the reported results better than expected. Incorrect prompt templates or generation lengths can decrease benchmark performance, as can 4-bit quantization. Using sampling instead of greedy decoding can increase variance, especially at higher temperature values. We found that floating-point precision and 8-bit quantization have less influence on benchmark results.
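To make the decoding and quantization factors concrete, the following is a minimal sketch (not the authors' evaluation harness) of how such settings are commonly configured with the Hugging Face transformers library. The model identifier, prompt, and generation parameters are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of the evaluation settings discussed above
# (greedy vs. sampling decoding, 4-bit quantization). Not the paper's harness;
# the model name, prompt, and parameter values are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"  # assumed example model

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit quantization: one of the factors reported to lower benchmark scores.
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Placeholder HumanEvalFix-style bugfixing prompt.
prompt = "Fix the bug in the following function:\n# <buggy code here>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, so repeated runs give identical outputs.
greedy_out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Sampling: higher temperature increases run-to-run variance in results.
sampled_out = model.generate(
    **inputs, max_new_tokens=512, do_sample=True, temperature=0.8, top_p=0.95
)

print(tokenizer.decode(greedy_out[0], skip_special_tokens=True))
```

Under these assumptions, a too-small `max_new_tokens` value or a prompt template that does not match the model's tuning would truncate or derail the generated fix, which is consistent with the performance drops described in the abstract.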
