Systematic Evaluation of AI-Generated Python Code: A Comparative Study across Progressive Programming Tasks

Yang Qianyi


Abstract

Background: AI-based code assistants are increasingly used in software development, where they promise to streamline code generation and improve code quality. Their effectiveness varies widely, however, so understanding their strengths and weaknesses is important for using them well.

Introduction: This study evaluates the capabilities of four prominent AI-based code assistants: GitHub Copilot, Microsoft Copilot, Tabnine, and ChatGPT. It examines whether the code they produce is functional, efficient, and maintainable, and where it could be improved.

Methodology: The AI-generated code was compared in terms of correctness, McCabe complexity (cyclomatic complexity), efficiency, and code size. Correctness was the percentage of generated code that was free of errors, McCabe complexity measured structural complexity, execution performance represented efficiency, and code size was measured in lines of code. All tools were benchmarked against the same standard set of 100 prompts to ensure a like-for-like assessment.

Results: GitHub Copilot had the highest correctness at 42%, and ChatGPT generated the most complex code, with a McCabe complexity score of 2.92. ChatGPT also led on efficiency, producing the highest number of solutions that met the "good" criterion. On average, Tabnine produced the shortest code, whereas GitHub Copilot and ChatGPT were among the most verbose. The analysis showed that although AI-based assistants can generate high-quality code, their output usually differs substantially from solutions written by developers, and they struggle to handle dependencies between classes.

Conclusion: AI-based code assistants have substantial potential for improving code generation and software development efficiency. However, challenges remain, particularly in handling complex dependencies and producing ready-to-use code. The study suggests that leveraging the strengths of different assistants and improving their ability to manage complex coding scenarios could lead to significant advances. Ongoing research and development are essential to address these limitations and fully harness the potential of AI-based code assistants in software development.
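The metrics named in the methodology can be illustrated with a minimal Python sketch. This is not the study's tooling: the function names, the set of AST node types counted as decision points, and the toy sample are all assumptions made for illustration, and the abstract does not say how efficiency was timed, so that metric is left out.

```python
import ast

# Node types treated as decision points. This choice is an assumption of the
# sketch; dedicated complexity tools may count a slightly different set.
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.AsyncFor,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 plus the number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths (one per extra operand).
            complexity += len(node.values) - 1
    return complexity


def lines_of_code(source: str) -> int:
    """Count lines that are neither blank nor comment-only."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def correctness_percentage(results: list[bool]) -> float:
    """Share of prompts (in percent) whose generated code ran without errors."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


# Hypothetical generated snippet used only to exercise the functions above.
SAMPLE = '''
def classify(n):
    # toy function
    if n < 0 and n % 2 == 0:
        return "negative even"
    for _ in range(3):
        pass
    return "other"
'''

if __name__ == "__main__":
    print(cyclomatic_complexity(SAMPLE))
    print(lines_of_code(SAMPLE))
    print(correctness_percentage([True, False, True, False]))
```

Counting decision points on the syntax tree keeps the sketch dependency-free; a real evaluation would more likely rely on an established complexity tool and a fixed test harness per prompt.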

Version published to 10.21203/rs.3.rs-4955982/v1 on Research Square
Sep 23, 2024

Related articles

Benchmarking Large Language Models for Data Pipeline Code Generation and Execution

This article has 4 authors:
1. Chiara Rucco
2. Motaz Saad
3. Tobia Martina
4. Antonella Longo
This article has no evaluations. Latest version Jul 2, 2025

Self-Programming AI: Code-Learning Agents for Autonomous Refactoring and Architectural Evolution

This article has 1 author:
1. Kushal Khemani
This article has no evaluations. Latest version May 20, 2025

AnnCoder: A Multi-Agent-Based Code Generation and Optimization Model

This article has 5 authors:
1. Zhenhua Zhang
2. Jianfeng Wang
3. Zhengyang Li
4. Yunpeng Wang
5. Jiayun Zheng
This article has no evaluations. Latest version May 29, 2025
