ChatGPT vs DeepSeek: A Comparative Evaluation on the International Computer Science Benchmark – ACM ICPC

Abstract

This study evaluates the effectiveness of two leading generative AI (GenAI) models, ChatGPT and DeepSeek, on complex programming problems drawn from the ACM International Collegiate Programming Contest (ICPC), a widely accepted standard in competitive programming. Both models are assessed on readability, error handling, computation speed, code accuracy, and educational value. In a two-trial experimental setup, each model is tested on 145 ICPC problems spanning data structures, algorithms, mathematics, geometry, and advanced optimization. Prompts were standardized across all problems, and the evaluation was repeated over two trials to mimic iterative learning. The results indicate that both DeepSeek and ChatGPT improved their performance over time. DeepSeek consistently outperformed ChatGPT in code accuracy (88.28% vs. 84.14%), generated more efficient linear-time algorithms (41 vs. 19), and exhibited a lower logical error rate (7.58% vs. 15.86%). The two models performed almost identically on code quality scores (37.79 vs. 37.85). In addition, 46.90% of the solutions generated by DeepSeek were rated fully insightful, surpassing ChatGPT's 42.07%. However, ChatGPT improved markedly across trials, most notably reducing its syntax error rate from 4.83% to 0.69%. This comparative analysis suggests that DeepSeek may be the more suitable option for high-stakes programming tasks, and the findings offer practical guidance for integrating GenAI tools into advanced programming education.
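The reported percentages are simple ratios over the 145-problem set (e.g., 128 accepted solutions out of 145 gives the stated 88.28% accuracy). As a rough illustration only, the Python sketch below shows how per-trial judgments could be aggregated into these metrics; the paper does not publish its evaluation harness, so the `Result` fields and the `summarize` helper here are hypothetical.

```python
# Illustrative sketch of the two-trial metric aggregation described in the
# abstract. The study's actual harness, judge, and rubric are not published;
# every name below is a hypothetical placeholder.

from dataclasses import dataclass

N_PROBLEMS = 145  # number of ICPC problems reported in the study


@dataclass
class Result:
    accepted: bool      # solution passes the judge's test cases
    logic_error: bool   # wrong answer caused by flawed logic
    syntax_error: bool  # code fails to compile or parse
    linear_time: bool   # solution runs in O(n) time


def percentage(count: int, total: int = N_PROBLEMS) -> float:
    """Express a per-problem count as a percentage of the problem set."""
    return 100.0 * count / total


def summarize(results: list[Result]) -> dict[str, float]:
    """Aggregate one trial's per-problem judgments into summary metrics."""
    return {
        "accuracy_pct": percentage(sum(r.accepted for r in results)),
        "logic_error_pct": percentage(sum(r.logic_error for r in results)),
        "syntax_error_pct": percentage(sum(r.syntax_error for r in results)),
        "linear_time_count": sum(r.linear_time for r in results),
    }


# Sanity check: 128 accepted solutions out of 145 reproduces the reported
# 88.28% accuracy figure for DeepSeek.
print(f"{percentage(128):.2f}%")  # -> 88.28%
```

Running `summarize` once per trial and per model would yield the kind of cross-trial comparison the abstract reports, such as ChatGPT's syntax error rate dropping from 4.83% (7 of 145) to 0.69% (1 of 145) between trials.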
