ChatGPT vs DeepSeek: A Comparative Evaluation on the International Computer Science Benchmark – ACM ICPC

doi:10.21203/rs.3.rs-7077588/v1

2025 · doi:10.21203/rs.3.rs-7077588/v1

preprint OA: closed CC-BY-4.0

Research Article

Harshita Vyas, Ravindra Giriraj Bhardwaj

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7077588/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

This study evaluates the effectiveness of two leading GenAI models, ChatGPT and DeepSeek, on complex programming problems from the ACM International Collegiate Programming Contest (ICPC), a widely accepted standard in competitive coding. Both models are assessed on code accuracy, error handling, computational efficiency, readability, and educational value.
In a two-trial experimental setup, both models were evaluated on 145 ICPC problems spanning data structures, algorithms, mathematics, geometry, advanced optimization, and related topics. The prompts for all problems were standardized, and the evaluation ran across two iterations to mimic iterative learning. Both DeepSeek and ChatGPT improved between trials. DeepSeek consistently outperformed ChatGPT in code accuracy (88.28% vs. 84.14%), generated more linear-time solutions (41 vs. 19), and had a lower logical error rate (7.58% vs. 15.86%). The two models scored almost the same on code quality (37.79 vs. 37.85). Approximately 46.90% of the solutions generated by DeepSeek were fully insightful, surpassing ChatGPT's 42.07%. However, ChatGPT improved markedly across trials, in particular reducing syntax errors from 4.83% to 0.69%. This comparative analysis suggests that DeepSeek may be the more suitable option for high-stakes programming tasks. The findings offer guidance for integrating GenAI tools into advanced programming education.

Subjects and keywords: Educational Philosophy and Theory; Theoretical Computer Science; Artificial Intelligence and Machine Learning; ACM ICPC; ChatGPT; Code Generation; Code Quality; DeepSeek; Education; Gen AI

1. Introduction

Generative artificial intelligence (GenAI) creates new content, such as text, images, or music, based on patterns in the data it has been trained on. GenAI relies on large language models (LLMs), which are trained on extensive text datasets and have a large number of model parameters [1]. With the rapid adoption of AI models for code generation [2], GenAI has emerged as a transformative force in computing education.
Tools based on GenAI, such as ChatGPT, have quickly become commonplace in education [3], particularly for tasks like programming [4]. ChatGPT and DeepSeek represent the latest frontier in AI-powered assistance, combining natural language processing with domain-specific knowledge to help users write, debug, and comprehend code. ChatGPT in particular has become popular because of its flexibility and human-like interaction. The use of GenAI for code generation in educational settings has become an area of great interest [5]. Researchers and industry have also taken notice of the recently released DeepSeek [6]. This open-source tool has grown in popularity for AI modeling, offering competitive performance at lower inference costs [7]. DeepSeek-Coder, a specialized LLM tailored for coding, offers open-source access and improved performance on difficult programming tasks [8]. The rise of these tools has sparked an academic debate on their effectiveness and limitations in high-stakes educational contexts, especially across the various dimensions of programming and software engineering [9].

Recent studies highlight the growing significance of GenAI tools in programming education and software development. Models like ChatGPT have been shown to enhance student engagement, debugging capabilities, and learning outcomes in coding tasks through natural language explanations and real-time feedback [10, 11]. Comparative evaluations with traditional learning methods suggest that GenAI improves comprehension of algorithms and data structures [12]. Educators are increasingly integrating these tools into classrooms for automated grading, problem generation, and concept clarification [13]. Furthermore, GenAI models trained on domain-specific corpora exhibit higher code accuracy and fewer syntactic errors than general-purpose ones [14]. Despite limitations in logical reasoning, GenAI excels at generating readable, modular code with useful documentation.
A recent case study on the transformative potential of AI in software engineering, using LeetCode and ChatGPT, explores how prompt engineering significantly influences model performance in competitive coding contexts [15].

This study compares ChatGPT and DeepSeek on a challenging benchmark: the International Collegiate Programming Contest (ICPC) of the Association for Computing Machinery (ACM). The ICPC is one of the world's most prestigious algorithmic programming competitions for university students. It has a multi-tiered structure of local and regional contests culminating in the World Finals, to which only about the top 1% of teams worldwide are invited. Teams of three students are given 5 hours to solve 8–13 challenging problems on a single computer, covering a range of topics. The contest assesses the problem-solving skills central to competitive programming and software engineering. Participation in the ICPC World Finals is widely recognized as a mark of distinction in the field of computer science. The competition serves as a powerful motivator for students and entry-level professionals to refine their programming skills under timed conditions and realistic problem contexts [16].

Beyond skill development, success in ACM/ICPC is widely regarded as an indicator of high intellectual aptitude. Notably, many technical job interviews incorporate algorithmic challenges that closely resemble those encountered in such competitions, giving experienced contestants a competitive edge in recruitment and further enhancing their employability. Corporate sponsors, such as IBM, not only fund contest prizes but also actively recruit medalists through internships and direct employment pathways, often prioritizing members of gold-winning teams for technical roles. Moreover, the career benefits of contest participation extend beyond the private sector.
Universities and academic institutions frequently hire accomplished ACM/ICPC participants as programming instructors or team coaches, creating additional employment avenues in education and research. Taken together, these factors illustrate why students are increasingly motivated to take part in the competition [17].

In this study, we employ a dual-trial framework to evaluate ChatGPT and DeepSeek on the officially available ACM ICPC World Finals problems from 2013 to 2024. The models were tested on 145 questions covering a wide spectrum of topics: an initial code generation attempt was followed by a second attempt after indirect training (a corrective re-prompt), with solutions executed and tested in a standardized C++ environment. Beyond quantifying accuracy, we assess technical code quality through execution-based metrics (syntax, logical, and runtime errors) and structural analysis, including overall readability scores derived from comment density, identifier quality, indentation score, and function size. This approach bridges the gap between theoretical code generation capabilities and the demands of real programming competitions by providing actionable insights into how GenAI tools can complement competitive programming training. Our methodology is designed to give a robust and fair assessment of each model's strengths and limitations in competitive programming contexts.

Overall, this study analyzes the performance of ChatGPT and DeepSeek across varied difficulty levels and topics, including data structures, algorithms, mathematics, computational geometry, optimization, bit manipulation, and string manipulation. It highlights each model's strengths and limitations in solving competitive coding tasks, offering insights into their practical utility in advanced programming education and assessment.

2.
Background

Given the rapid adoption of AI in education, it has become increasingly important to evaluate how effectively GenAI tools perform in rigorous academic settings. Although AI originated within computer science, its adoption in education has accelerated rapidly, and AI is now being integrated across various disciplines and educational levels [3, 18]. One such high-stakes context is competitive programming, which demands critical thinking, logical and algorithmic reasoning, coding skill, and more. Ranked programming contests such as the ICPC consist of regional and national qualifying tournaments leading to the World Finals. All ICPC problems require well-structured, efficient, and logically sound code that adheres to strict input-output and time constraints. Applying GenAI tools to ICPC-style problems offers a promising opportunity to enhance programming education by helping students improve their algorithmic thinking, logical reasoning, and coding abilities in a structured and competitive way.

Every computer science student faces the critical challenge of learning and strengthening programming skills during their degree. When traditional resources fall short, students turn to GenAI assistants such as ChatGPT and DeepSeek for guidance. Alarmingly, when students cannot find effective learning resources to develop strong programming foundations, many shift their career interests toward non-programming fields such as data analytics and business management. This trend motivated us to rigorously investigate the educational efficacy of GenAI tools, especially in competitive programming contexts. The critical question is this: which AI tool will actually guide students toward a correct, efficient solution rather than leading them down a rabbit hole of syntactically correct but logically flawed code?
There remains a significant knowledge gap, as no comprehensive comparative study has assessed the effectiveness of GenAI tools on standard programming problems such as those of the ACM ICPC. This research addresses that gap by systematically comparing ChatGPT and DeepSeek across 145 real ACM ICPC problems, providing students, educators, and institutions with evidence-based insights into which AI tools truly enhance learning outcomes in competitive programming.

2.1. Coding & AI

Programming abilities are crucial for computer science students to thrive in the digital age, both academically and in practical applications that call for technical precision, logical thinking, and algorithmic design. The landscape of education is evolving with the emergence of AI, specifically LLMs such as DeepSeek and ChatGPT. By interpreting natural language prompts and generating syntactically and semantically sound code, these tools give students dynamic feedback and support. Because it has been trained on a variety of text and code datasets, ChatGPT offers flexible explanations and problem-solving techniques appropriate for students of different skill levels [5]. In contrast, DeepSeek uses fine-tuning on programming-specific corpora to provide better performance on more specialized technical challenges [19].

However, the first generations of AI-written code, especially from models such as ChatGPT, remain to be investigated in detail. A primary concern is the validity of the produced code. AI models trained on massive datasets from particular sources could generate code that is flawed or does not follow best practices. This is compounded by the possibility that a model produces syntactically valid code that is semantically incorrect, so that the running application crashes or behaves unpredictably. Maintainability is also a key challenge for AI-generated code.
AI-generated code is frequently not as clear or as well documented as code written by human developers, making it hard to understand and modify [2]. Platforms like GeeksforGeeks have recognized the growing role of AI coding assistants in helping students practice and comprehend programming logic through contest-level problem generation, real-time tutoring interfaces, and debugging support. Motivated by these advances, our study assesses the usefulness of two leading AI tools, ChatGPT and DeepSeek, in advanced programming education by evaluating their actual performance on extremely difficult ACM ICPC problems.

2.2. Overview of ACM – ICPC

The Association for Computing Machinery (ACM), founded in 1947, is the world's foremost scientific and educational society in computing, with a mission to advance the computing profession through publications, conferences, special interest groups, and educational resources. Among its flagship global initiatives is the International Collegiate Programming Contest (ICPC), a globally renowned and highly prestigious algorithmic programming competition for university students. The ICPC serves as a platform for fostering innovation, problem-solving skills, and collaboration in real-time, high-pressure environments. It is structured into three main tiers: local contests, where universities select their best teams; regional contests, which feature participants from broader geographic zones; and the culminating World Finals, which host only the top 1% of teams worldwide from over 3,000 universities across more than 100 countries. In this contest, teams of three students work collaboratively on a single computer to solve 8–13 algorithmic problems within a 5-hour time frame.
These problems vary in difficulty and test a wide range of technical skills, including data structures (trees, heaps, disjoint sets), algorithms (graph theory, dynamic programming, greedy approaches, backtracking), mathematics (combinatorics, number theory, geometry), computational geometry, bit manipulation, and advanced topics such as network flows, line sweep algorithms, and segment trees. The contest emphasizes code efficiency, algorithmic thinking, and real-time problem-solving under pressure. The programming languages allowed include C, C++, and Java, with C++ the predominant choice owing to its fast execution and rich Standard Template Library (STL). Each year, over 50,000 students participate across the ICPC, but only a select few reach the elite World Finals. Winners of the World Finals receive not only global recognition and trophies but also internship offers, job opportunities, and monetary prizes from top tech companies that sponsor the event, such as IBM, JetBrains, and Huawei. Some editions have awarded cash prizes of up to $15,000 for the top team, along with medals, certificates, and often fast-track recruitment interviews with prestigious organizations.

As thousands of students prepare annually for the ACM ICPC, the lack of dedicated, structured resources for contest-specific training remains a significant gap. Given this context, the potential of GenAI models to support such preparation by offering real-time feedback, enhancing algorithmic thinking, and improving code precision demands rigorous empirical evaluation. This underscores the relevance and timeliness of our comparative study, which assesses the practical utility of ChatGPT and DeepSeek in meeting the unique challenges of competitive programming education.

3. Methodology

3.1.
Dataset and Prompting

To thoroughly evaluate the code generation abilities of ChatGPT and DeepSeek in a competitive programming setting, we gathered and prepared a dataset of 145 programming problems from the official ACM ICPC archives and related online judge platforms. To ensure diversity in problem difficulty, topic distribution, and algorithmic scope, the selection covers contests from 2013 to 2024, spanning more than ten years. Based on the historical success rate and time-to-solve information supplied in contest metadata, each problem was initially assigned to one of three standard tiers: easy, moderate, or difficult. The dataset consists of 40 easy, 70 moderate, and 35 difficult questions. The questions were then further categorized into topical domains, as illustrated in Fig. 1: Data Structures, Algorithms, Mathematics, String and Bit Manipulation, Computational Geometry, Advanced Topics, and Optimization Criteria. In the data structures category, problems involving trees, heaps, and disjoint sets test the model's comprehension of hierarchical relationships, memory management, and union-find optimizations. In the algorithms category, dynamic programming, graph traversal (Depth First Search/Breadth First Search), greedy strategies, and backtracking tasks exposed subtle model behaviors.

In the initial trial, we created a standardized prompt template for submitting problems to both models in order to guarantee evaluation consistency. The prompt stated the official problem statement, the expected input format, and the constraints. With this methodical approach, each model received the same input, free from formatting bias. A second round of testing was carried out for problems that had incorrect answers in the first trial because of logical mistakes, inefficiencies, or failure to compile.
Based on the nature of the prior failure, a revised prompt template was created in this phase to incorporate indirect training. This iterative interaction allowed us to replicate the learning and feedback loop that students frequently encounter during practice. For both evaluation trials, the outputs generated by ChatGPT and DeepSeek were systematically documented in a structured tabular format [20]. Each entry captured relevant metadata, including problem ID, difficulty level, topical category, trial number, and key performance indicators such as execution success, logical correctness, and code quality metrics. Code correctness was verified not only against the sample input/output cases provided in the original problem statements but also against the official solution sketches and reference implementations published by ACM ICPC.

3.2. Performance and Quality Metrics

This study employs a multidimensional evaluation framework, shown in Fig. 2, to assess the performance and quality of AI-generated code solutions through both quantitative and qualitative metrics. The primary quantitative measure is accuracy, which serves as a fundamental benchmark for problem-solving capability and for comparison against other AI systems. Computational complexity is analyzed using Big O notation to evaluate time and space complexity, providing insight into the algorithmic efficiency of the code. Error analysis encompasses three categories: runtime errors (execution failures due to memory overflow or invalid operations that compromise system stability), syntax errors (compilation-stage violations of language rules that prevent program execution), and logical errors (algorithmic reasoning defects that produce incorrect outputs despite successful compilation and execution).
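The three error categories can be read as an ordered decision procedure: a solution that fails to compile counts as a syntax error, one that compiles but crashes counts as a runtime error, and one that runs to completion but prints the wrong answer counts as a logical error. The Python sketch below is illustrative only. The name `classify_outcome`, its parameters, and the whitespace-insensitive output comparison are assumptions made for this example and do not describe the authors' tooling.

```python
from enum import Enum


class Outcome(Enum):
    SYNTAX_ERROR = "syntax error"
    RUNTIME_ERROR = "runtime error"
    LOGICAL_ERROR = "logical error"
    CORRECT = "correct"


def classify_outcome(compiled: bool, exit_code: int,
                     actual: str, expected: str) -> Outcome:
    """Map one compile-and-run attempt to an outcome (hypothetical helper)."""
    if not compiled:
        # The compiler rejected the source, so the program never ran.
        return Outcome.SYNTAX_ERROR
    if exit_code != 0:
        # The program started but terminated abnormally.
        return Outcome.RUNTIME_ERROR
    # Assumption: two outputs match when their whitespace-separated
    # tokens are identical, so trailing spaces or newlines are ignored.
    if actual.split() != expected.split():
        return Outcome.LOGICAL_ERROR
    return Outcome.CORRECT
```

Checking compilation before execution, and execution before output, keeps the categories mutually exclusive, so each solution contributes to at most one error rate.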
Code quality assessment integrates four readability sub-metrics into a composite score: function size, indentation score (percentage of correctly formatted code blocks), identifier quality (percentage of meaningful variable names), and comment density (percentage of commented code lines). Additionally, theoretical explanation quality is evaluated on a five-point scale considering output correctness, code correctness, compendiousness, validity, and non-obviousness. Together, these qualitative and quantitative metrics check that generated solutions not only demonstrate functional correctness but also adhere to established software engineering principles, thereby facilitating code maintenance, collaborative development, and educational use in professional programming environments.

3.3. Experimental Setup

To ensure consistency and fairness in evaluation, we employed a manual process for inputting and testing prompts across both ChatGPT and DeepSeek. A standardized prompt format was created for the initial trial (before training) and applied uniformly to all 145 ACM ICPC problems. Prompt 1, used before training, was: "PROVIDE C++ CODE, SOLUTION SKETCH EXPLAINING THE APPROACH AND SAMPLE OUTPUT FOR THE GIVEN PROBLEM." This prompt was followed by the official problem statement as provided in the contest archives. If the output provided by ChatGPT or DeepSeek was incorrect or suboptimal, we performed indirect training using a second prompt: "THE CODE IS GIVING WRONG OUTPUT FOR PROBLEM, REWRITE." This prompt, which we call the 'indirect prompt', aimed to guide the models toward correction without introducing human bias or additional hints. All responses were collected, executed, and analyzed manually to preserve the integrity of the evaluation.

The research was conducted on an HP EliteBook laptop equipped with an Intel Core i5 processor, with a base frequency of 1.4 GHz and Turbo Boost capability up to 5.0 GHz.
The system had 32 GB of RAM and PCIe NVMe SSD storage ranging from 512 GB to 1 TB, ensuring fast data access and high-performance multitasking. The device included a 14-inch display with either WUXGA (1920 × 1200) or Full HD (1920 × 1080) resolution. For graphics processing, it used Intel UHD Graphics and AMD Radeon Graphics.

All responses generated by ChatGPT and DeepSeek across the two iterative trials were meticulously recorded. Each of the 145 ACM ICPC problems was randomly scrutinized against the performance and quality metrics described above to assess the suitability of the generated code and solutions. Compiler responses were noted to check syntactic correctness and runtime behavior, and the code was executed in a standardized C++ compilation setup to ensure uniformity. To determine functional correctness, all outputs were compared against the official sample outputs provided with the problem sets. Across iterations, we tracked any gains or regressions in code correctness and quality using the same procedure for both the initial and improved prompt trials. The double-trial regime, combined with quantitative metric analysis, thus gave a thorough and objective assessment of the two models' code generation abilities under real contest settings.

4. Results

The comparative analysis found that DeepSeek outperformed ChatGPT across multiple dimensions of code generation on 145 ACM ICPC problems. DeepSeek demonstrated higher accuracy than ChatGPT (88.28% vs. 84.14%), lower error rates, particularly in logical consistency, and more optimal algorithmic performance, reflected by a higher proportion of O(1) and O(n) solutions. Interestingly, ChatGPT and DeepSeek produced comparably readable and well-documented code, as evidenced by approximately the same code quality scores.
Additionally, DeepSeek achieved greater instructional depth, as confirmed by the larger number of solutions rated fully insightful on the Density of Insight (DOI) metric. Although ChatGPT exhibited substantial improvements after iterative refinement, DeepSeek maintained a clear performance advantage before and after training. The following subsections provide a detailed comparative evaluation of ChatGPT and DeepSeek.

4.1. Accuracy Analysis

Accuracy is defined as the proportion of correct answers given by the GenAI model out of the total number of questions input to the model. The accuracy of the two evaluated large language models was calculated as the percentage of correctly answered questions with successfully executed code relative to the total number of questions posed (Bhardwaj & Bedi, 2025). According to the comparative analysis, DeepSeek consistently outperformed ChatGPT in both experimental trials. ChatGPT's accuracy before training was 77.93%, while DeepSeek's was 84.14%, a difference of 6.21 percentage points in favor of DeepSeek, as seen in Fig. 3(a). Both models performed better after training. ChatGPT's accuracy increased to 84.14%, matching DeepSeek's first-trial baseline. In the second trial, DeepSeek made further progress, reaching an accuracy of 88.28%, a gain of 4.14 percentage points. These results show that DeepSeek maintains an accuracy advantage of 4 to 6 percentage points, even though both models gain a great deal from optimization. Because ChatGPT's second-trial accuracy equals DeepSeek's first-trial result, DeepSeek appears to have a naturally higher ability to generate accurate code and solve computational problems in the tested scenarios. On the other hand, ChatGPT showed the larger gain from training: a 6.21 percentage point improvement compared with DeepSeek's 4.14 percentage points.
Accuracy in percentage is calculated using Eq. (1):

$$\text{Accuracy (\%)} = \frac{\text{Number of correct answers}}{\text{Total number of questions}} \times 100 \tag{1}$$

4.2. Error Analysis

A programming error is a state that violates the specification [21]. Three main types of errors were included in the error rate analysis: syntax, logical, and runtime errors. As shown in Fig. 3(b), both ChatGPT and DeepSeek produced logical errors most often, followed by syntax and runtime errors.

Syntax errors are mistakes in the structure of a program's code that prevent it from being compiled or executed. They are among the most common categories of programming errors. ChatGPT generated syntax errors in 4.83% of solutions before training, which decreased sharply to 0.69% after training, a 4.14 percentage point improvement. DeepSeek, on the other hand, improved by only 2.07 percentage points, from syntax errors in 4.83% of solutions before training to 2.76% after training.

In contrast to a syntax error, which prevents execution, a logical error causes incorrect program execution. Across successive trials, ChatGPT showed a marginal reduction in its logical error rate, from 13.79% of solutions to 13.10%. DeepSeek showed enhanced logical consistency, lowering logical errors from 9.66% to 7.59% of solutions.

A runtime error occurs during the execution of a program and causes it to terminate unexpectedly. Runtime errors were comparatively stable and minimal in the code provided by both ChatGPT and DeepSeek. DeepSeek showed runtime errors in 1.38% of solutions both before and after training, while ChatGPT improved slightly, lowering its runtime error rate from 1.38% to 0.69% of solutions after training.
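As a worked check of the accuracy ratio, the reported percentages are consistent with whole-number counts out of 145 problems. The counts used below (113, 122, and 128 correct answers) are inferred by back-calculation from the published percentages and are not stated in the paper; `rate_pct` is a hypothetical helper name.

```python
TOTAL_PROBLEMS = 145  # number of ICPC problems in the study


def rate_pct(count: int, total: int = TOTAL_PROBLEMS) -> float:
    """Percentage of `count` out of `total`, rounded to two decimals."""
    return round(count / total * 100, 2)


# Assumption: correct-answer counts back-calculated from the reported
# percentages, not taken from the paper's data table.
chatgpt_trial1 = rate_pct(113)
chatgpt_trial2 = rate_pct(122)
deepseek_trial1 = rate_pct(122)
deepseek_trial2 = rate_pct(128)

# Gains between trials, expressed in percentage points.
chatgpt_gain = round(chatgpt_trial2 - chatgpt_trial1, 2)
deepseek_gain = round(deepseek_trial2 - deepseek_trial1, 2)

print(chatgpt_trial1, chatgpt_trial2, chatgpt_gain)
print(deepseek_trial1, deepseek_trial2, deepseek_gain)
```

The same ratio reproduces the error rates: under the same back-calculation, 7 of 145 solutions gives 4.83% and 1 of 145 gives 0.69%.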
Both ChatGPT and DeepSeek show complex behavior in the errors they generate and in how their code improves after training. Overall, Fig. 3(b) shows that while ChatGPT reduced syntax and runtime errors more effectively, DeepSeek reduced logical errors more effectively. Figure 1 (supplementary) shows the total number of syntax, logical, and runtime errors found in the solutions obtained from ChatGPT and DeepSeek before and after training. The syntax, logical, and runtime error rates in percentage are calculated using Eqs. (2), (3), and (4), respectively:

$$\text{Syntax error (\%)} = \frac{\text{Number of syntax errors}}{\text{Total number of questions}} \times 100 \tag{2}$$

$$\text{Logical error (\%)} = \frac{\text{Number of logical errors}}{\text{Total number of questions}} \times 100 \tag{3}$$

$$\text{Runtime error (\%)} = \frac{\text{Number of runtime errors}}{\text{Total number of questions}} \times 100 \tag{4}$$

4.3. Time and Space Complexity Analysis

Time complexity refers to the computational time an algorithm requires to execute as a function of the input size, while space complexity measures the amount of memory used during execution [22]. Both are usually expressed in Big O notation. The Big O calculation tool was used to determine time and space complexity for the solutions obtained from both ChatGPT and DeepSeek for all 145 questions. Figure 4 shows that, before training, DeepSeek generated more constant-time O(1) solutions (7 vs. 6), linear-time O(n) solutions (40 vs. 31), quadratic-time O(n²) solutions (47 vs. 19), and cubic-time O(n³) solutions (24 vs. 11) than ChatGPT, suggesting a preference for more efficient algorithms. After training, DeepSeek showed fewer higher-complexity outputs such as cubic-time O(n³) solutions (4 vs. 9) and more constant-time solutions (8 vs. 6) and linear-time solutions (41 vs. 19).
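To make the complexity classes concrete, the sketch below solves one small task (does any pair of numbers in a list sum to a target?) first in quadratic time and then in linear time. The task and both function names are illustrative assumptions and are not drawn from the ICPC problem set used in the study.

```python
def has_pair_quadratic(values: list[int], target: int) -> bool:
    """O(n^2) time, O(1) extra space: check every pair of positions."""
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            if values[i] + values[j] == target:
                return True
    return False


def has_pair_linear(values: list[int], target: int) -> bool:
    """O(n) expected time, O(n) extra space: remember values seen so far."""
    seen: set[int] = set()
    for v in values:
        # A pair exists if the complement of v appeared earlier in the list.
        if target - v in seen:
            return True
        seen.add(v)
    return False
```

The linear version buys its speed with extra memory, which is the kind of time and space trade-off the two complexity measures capture separately.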
Interestingly, the number of quadratic-time solutions from DeepSeek dropped from 47 before training to 12 after training. Conversely, ChatGPT performed worse after training, showing a rise in higher-complexity solutions from 12 to 26 and a decrease in linear-time solutions from 31 to 19. These findings underline DeepSeek's superior capability in producing scalable and optimal code, especially in scenarios requiring stringent performance under competitive programming constraints.

In terms of space complexity, as observed in Fig. 5, both ChatGPT and DeepSeek frequently generated solutions with higher space complexity despite improvements after prompt refinement. A significant portion of the output from both models fell into complex categories, such as quadratic, exponential, and factorial space, which are generally less desirable in competitive programming contexts. In the initial trial, ChatGPT produced 83 complex-space solutions and DeepSeek 80; after training, these values remained high at 78 and 81, respectively. This indicates that both models, while capable of solving the problems, often relied on less memory-efficient approaches. After training, DeepSeek generated 10 quadratic-space O(n²) and 32 linear-space O(n) solutions, while ChatGPT generated 9 and 40, respectively, with occurrences of higher-complexity classes such as cubic, polynomial, exponential, and factorial significantly reduced. These findings highlight the need for further refinement in guiding GenAI models toward space-optimized solution strategies.

4.4. Code Quality and Readability Analysis

Four sub-parameters were used to calculate the overall code quality: comment density, identifier quality, indentation score, and function size [23]. Comment density is defined as the percentage of code lines containing meaningful comments and is calculated by the following Eq.
(5):

$$\text{Comment density (\%)} = \frac{\text{Number of comment lines}}{\text{Total number of lines in code}} \times 100$$ (5)

An identifier is the name assigned to a variable, function, class, constant, or any other named entity in program code. The number of identifiers helps in assessing a code's complexity, function size, and maintainability, and indicates how dense the code is. Too few identifiers suggest simple or unclear code, while too many may reduce readability or indicate poor modularization in small code. In this work, the number of identifiers is counted for each solution obtained from ChatGPT and DeepSeek before and after training. Thereafter, identifier quality is calculated using Eq. (6):

$$\text{Identifier quality} = \frac{\text{Number of identifiers}}{\text{Maximum number of identifiers}} \times 100$$ (6)

Indentation is the use of spaces or tabs at the beginning of lines to visually structure code into blocks; the metric measures how consistently this is done throughout the code. Indentation helps indicate the logical hierarchy and flow of a program, especially within conditionals, loops, functions, and classes. The indentation score is calculated using Eq. (7):

$$\text{Indentation score (\%)} = \frac{\text{Number of properly indented blocks}}{\text{Total number of blocks}} \times 100$$ (7)

Function size is measured here by the number of functions in a solution, favoring modular code; more generally, the term refers to the length and complexity of a function or method. In this work, function size is normalized using Eq. (8):

$$\text{Function size} = \frac{\text{Number of functions}}{10} \times 100$$ (8)

The overall readability score is a quantitative measure of how easy code is to read and understand. In this work, we calculated the overall readability score from the comment density, identifier quality, indentation score, and function size score by using Eq.
(9):

$$\text{Overall readability score} = \frac{CD + IQ + IS + FS}{4}$$ (9)

where CD, IQ, IS, and FS are the comment density, identifier quality, indentation score, and function size, respectively.

We calculated the average comment density for both GenAI tools, ChatGPT and DeepSeek, before and after training. ChatGPT shows an average comment density of 3.29 and 3.39 before and after training, respectively, while DeepSeek shows 3.06 and 3.45. The maximum comment density, about 40.48, was produced by ChatGPT, indicating that ChatGPT has the potential to provide very well-documented program code. Next, we calculated identifier quality by considering small/medium code blocks. The maximum number of identifiers, 64, was found in a DeepSeek solution and is used as the base for calculating identifier quality in this work. ChatGPT showed an average identifier quality of 26.01 and 26.78, while DeepSeek provided 26.67 and 29.11 before and after training, respectively. This indicates that the code provided by DeepSeek is more readable, maintainable, and understandable than that of ChatGPT. Both models exhibited exceptional performance in code formatting, achieving perfect 100% indentation scores across all trials, indicating outstanding adherence to structural coding standards. On the other hand, average function size metrics showed significant differences: ChatGPT's outputs were relatively verbose, with function sizes of 20.62 before training and 21.17 after training, whereas DeepSeek produced more concise code, with function sizes of 17.38 and 18.62 before and after training, respectively. Lastly, we calculated the overall readability score. ChatGPT achieved average overall readability scores of 37.48 and 37.85, while DeepSeek scored 36.78 and 37.79 for the same tasks, respectively. Collectively, DeepSeek performed as well as or better than ChatGPT on the sub-parameters of comment density (after training), identifier quality, indentation score, and function size.
Also, DeepSeek's outputs were consistently more readable and maintainable, according to additional analysis of identifier naming and structural organization. Even though both models improved after training and have approximately the same overall readability score, ChatGPT can be considered the better tool for producing production-grade code because of its higher baseline performance, reaching an overall readability score of 37.85 after training.

4.5. Density of Insight (DOI) Analysis

The Density of Insight (DOI) analysis is used to assess the instructional quality and explanatory depth of the generated outputs. The DOI is defined as the ratio of the number of insights available in the explanation to the total number of insight criteria, i.e., DOI = $\frac{\text{Number of insights}}{\text{Total number of insight criteria}}$ [3]. Figure 6 illustrates the DOI scores for both models, ChatGPT and DeepSeek, before and after training, across the range of values from 0 to 1. DOI values were calculated on this scale using five criteria: output correctness, code correctness, compendiousness, validity, and non-obviousness. A score of 1 indicated fully insightful content, while scores below 1 indicated insight deficiency. DeepSeek performed better in terms of DOI both before and after training. In the highest insight category (DOI = 1), DeepSeek achieved 44.14% before training and 46.90% after training, ahead of ChatGPT's 38.62% and 42.07%. Both ChatGPT and DeepSeek produced null solutions with no insight (DOI = 0). The comparison is less clear-cut in the intermediate insight categories, where the two models show mixed behavior. These findings imply that DeepSeek consistently produced solutions that were more perceptive and educationally useful, covering algorithmic justification and solution methodology more thoroughly.
The DOI analysis emphasizes that DeepSeek's outputs have additional instructional utility, which increases its efficacy for use cases involving instruction and explanation.

5. Discussions

In this work, we identified the strengths and weaknesses of two Gen AI tools, ChatGPT and DeepSeek, in complex learning and problem-solving scenarios on programming tasks at the ACM ICPC level. DeepSeek usually had greater accuracy and better performance than ChatGPT on most metrics, although ChatGPT also showed significant benefits in flexibility and iterative development. DeepSeek's lead, indicated by its higher accuracy and algorithmic solution ratings, reflects the advantage of domain-specific fine-tuning. Its code is more time- and space-optimized, with a higher count of O(1) and O(n) solutions and fewer complex O(n²) or O(n³) algorithms, which suits the competitive programming perspective well. Additionally, by maintaining a lower logical error rate before and after training, DeepSeek appeared to have a firmer grasp of algorithmic structure and computational logic. In contrast, ChatGPT displayed remarkable flexibility across the two-trial setting. Of particular interest is how much the model improved between trials: its syntax errors dropped from 4.83% to 0.69% of the number of solutions, a greater decrease than DeepSeek's. This suggests that quick improvement and high sensitivity to fast optimization are ChatGPT's strengths. ChatGPT also demonstrates more efficient learning than DeepSeek, showing a 6.21 percentage point improvement after training compared to DeepSeek's 4.14 percentage point improvement. Furthermore, due to its flexibility and a larger training corpus, ChatGPT provided responses for various types of problems, including some that DeepSeek sometimes failed to tackle, such as complex string manipulations.
DeepSeek had the advantage in generating code with good readability and structural clarity, with better function modularity, identifier quality, and comment density. Nevertheless, ChatGPT's ability to give explanations in natural language is remarkable. ChatGPT's articulate narratives often made its explanations more accessible to novices, making it a useful tool for those in the early stages of preparation, even though DeepSeek gave the more algorithmically insightful commentary. Furthermore, according to the Density of Insight (DOI) metric, although DeepSeek produced a higher number of deeply insightful suggestions, ChatGPT's outputs retained their educational value at a more conceptual and beginner-friendly level of explanation. This means the two models can be used in complementary ways in the learning space: ChatGPT for range and pedagogical clarity, and DeepSeek for depth and technical accuracy. In conclusion, ChatGPT's flexibility, quick development, and user-friendly results confirm its applicability in educational support systems, even though DeepSeek's unique architecture gives it a distinct edge in competitive programming scenarios. Their combined strengths indicate the potential for hybrid instructional models that use both general-purpose and domain-specific AI tools to improve programming education.

6. Conclusion

This work shows that Gen AI tools built for programming have considerable potential to improve competitive programming instruction and training. Iterative prompt refinement helps both ChatGPT and DeepSeek provide improved outputs. DeepSeek consistently outperforms ChatGPT in important performance metrics such as solution accuracy, complexity optimization, error resilience, code readability, and instructional insight.
These results highlight how useful it is to integrate these models into learning environments for algorithm-heavy subjects, especially when specialized assistance is required. The competitive advantage of DeepSeek implies that domain-specific fine-tuning is essential for shaping AI capabilities in niche settings such as competitive coding. This study emphasizes the significance of critical model evaluation and customized application as educational institutions and students increasingly rely on AI for support. Particularly in high-stakes situations like the ACM ICPC, DeepSeek, in its current configuration, stands out as a very capable model for assisting programming pedagogy. However, further training, whether indirect or direct, will help improve the output of Gen AI tools.

7. Limitations and Future Work

Even though this study is thorough, there are a few limitations to consider. One major limitation is that, currently, neither DeepSeek nor ChatGPT offers metadata about response generation time. The ability to evaluate model response latency would provide important insights into their real-time applicability in competitive programming contexts where time efficiency is critical. Future models or API implementations that incorporate timestamped output generation may make possible a more comprehensive evaluation framework that considers both code quality and assistance speed. The evaluation's exclusive use of the C++ programming language is another drawback. Although C++ is the most common language in ACM ICPC contests, including Java and Python, which are extensively used in professional and educational contexts, could provide a more generalizable understanding of each model's capabilities across various programming paradigms. Additionally, the study used a uniform prompt template to guarantee equity; the impact of prompt engineering can be examined more thoroughly in the future.
Future research could methodically examine the impact of different prompt structures, instructions, and contextual framing on model performance. For educators and developers looking to optimize the use of AI coding assistants, knowing how various prompt designs affect output accuracy, efficiency, and explanation depth may offer crucial insights. Furthermore, broadening the focus to incorporate real-time collaborative environments, like AI-assisted hackathons or pair programming simulations, would better reflect real-world use cases. Incorporating other emerging LLMs designed for code generation, such as Code Llama or StarCoder, would improve comparative robustness and allow new AI tools in this field to be benchmarked. To sum up, this study offers a strong framework for assessing Gen AI models in competitive programming; it can be expanded in the future by incorporating response latency tracking, multi-language support, sophisticated prompt engineering, and real-time educational integration. The incorporation of these models into dynamic, iterative real-time development environments is another possible direction for future research. Monitoring long-term learning outcomes when students use these models for guided problem-solving, and gathering qualitative input from educators and users, may provide additional evidence of their educational value.

Declarations

Supporting Information
Supplementary file ‘ACM ICPC Supplementary File.docx’ is provided.

Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflict of Interest
The authors declared no conflict of interest.
Funding
Not applicable

Authors' contributions
Harshita Vyas: Data Curation, Analysis, Visualization, Writing – original draft
Ravindra G Bhardwaj: Conceptualization, Analysis, Supervision, Methodology, Visualization, Validation, Writing – review and editing
All authors read and approved the final manuscript.

Acknowledgements
Not applicable

References
1. Varghese, J. and J. Chapiro, ChatGPT: The transformative influence of generative AI on science and healthcare. Journal of Hepatology, 2024. 80(6): p. 977-980. DOI: 10.1016/j.jhep.2023.07.028.
2. Almanasra, S. and K. Suwais, Analysis of ChatGPT-Generated Codes Across Multiple Programming Languages. IEEE Access, 2025. 13: p. 23580-23596. DOI: 10.1109/ACCESS.2025.3538050.
3. Bhardwaj, R.G. and H.S. Bedi, ChatGPT as an education and learning tool for engineering, technology and general studies: performance analysis of ChatGPT 3.0 on CSE, GATE and JEE examinations of India. Interactive Learning Environments, 2025. 33(1): p. 321-334. DOI: 10.1080/10494820.2024.2344054.
4. Yilmaz, R. and F.G. Karaoglan Yilmaz, Augmented intelligence in programming learning: Examining student views on the use of ChatGPT for programming learning. Computers in Human Behavior: Artificial Humans, 2023. 1(2): p. 100005. DOI: 10.1016/j.chbah.2023.100005.
5. Bucaioni, A., et al., Programming with ChatGPT: How far can we go? Machine Learning with Applications, 2024. 15: p. 100526. DOI: 10.1016/j.mlwa.2024.100526.
6. Manik, M.M.H., ChatGPT vs. DeepSeek: A Comparative Study on AI-Based Code Generation. arXiv preprint arXiv:.18467, 2025.
7. Xu, Y., et al., Artificial intelligence: A powerful paradigm for scientific research. The Innovation, 2021. 2(4): p. 100179. DOI: 10.1016/j.xinn.2021.100179.
8. Guo, D., et al., DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence. arXiv preprint arXiv:.14196, 2024.
9. Nikolic, S., et al., ChatGPT versus engineering education assessment: a multidisciplinary and multi-institutional benchmarking and analysis of this generative artificial intelligence tool to investigate assessment integrity. European Journal of Engineering Education, 2023. 48(4): p. 559-614. DOI: 10.1080/03043797.2023.2213169.
10. Yang, A., Z. Li, and J. Li, Advancing GenAI assisted programming--a comparative study on prompt efficiency and code quality between GPT-4 and GLM-4. arXiv preprint arXiv:.12782, 2024.
11. Huang, A.Y.Q., et al., The impact of GenAI-enabled coding hints on students' programming performance and cognitive load in an SRL-based Python course. British Journal of Educational Technology, 2025. n/a(n/a). DOI: 10.1111/bjet.13589.
12. Yu, L., Paradigm shift on Coding Productivity Using GenAI. arXiv preprint arXiv:.18404, 2025.
13. Zimmermann, M., H. Janetzko, and B. Haymond. Integrating generative AI methods in computer science education: perspectives, strategies, and outcomes. in EDULEARN24 Proceedings. 2024. IATED.
14. Cubillos, C., et al., Generative Artificial Intelligence in Computer Programming: Does It Enhance Learning, Motivation, and the Learning Environment? IEEE Access, 2025. 13: p. 40438-40455. DOI: 10.1109/ACCESS.2025.3532883.
15. Merkel, M. and J. Dörpinghaus, A case study on the transformative potential of AI in software engineering on LeetCode and ChatGPT. arXiv preprint arXiv:.03639, 2025.
16. Zheng, Y. and M. Sarem. C++ Teaching Reform and Exploration Based on ACM/ICPC and Live Code. in Proceedings of the 2022 5th International Conference on Education Technology Management. 2022.
17. Blum, J.J., Competitive programming participation rates: an examination of trends in U.S. ICPC regional contests. Discover Education, 2023. 2(1): p. 11. DOI: 10.1007/s44217-023-00034-1.
18. Hajkowicz, S., et al., Artificial intelligence adoption in the physical sciences, natural sciences, life sciences, social sciences and the arts and humanities: A bibliometric analysis of research publications from 1960-2021. Technology in Society, 2023. 74: p. 102260. DOI: 10.1016/j.techsoc.2023.102260.
19. Zhu, Q., et al., Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:.11931, 2024.
20. Vyas, H. and R.G. Bhardwaj, Data Set - ChatGPT vs DeepSeek: A Comparative Evaluation on the International Computer Science Benchmark – ACM ICPC. 2025. DOI: 10.17632/4ccc4dhpms.1.
21. Laski, J., Programming faults and errors: Towards a theory of software incorrectness. Annals of Software Engineering, 1997. 4(1): p. 79-114. DOI: 10.1023/A:1018966827888.
22. Gao, Q. and X. Xu. The analysis and research on computational complexity. in The 26th Chinese Control and Decision Conference (2014 CCDC). 2014. DOI: 10.1109/CCDC.2014.6852777.
23. Feng, Y., et al. Investigating Code Generation Performance of ChatGPT with Crowdsourcing Social Data. in 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC). 2023. DOI: 10.1109/COMPSAC57700.2023.00117.

Additional Declarations
The authors declare no competing interests.
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7077588","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":482556711,"identity":"920c1b0e-d6e8-4e5c-a00f-c008c70534f4","order_by":0,"name":"Harshita Vyas","email":"","orcid":"https://orcid.org/0009-0007-2497-3519","institution":"Stanley College of Engineering and Technology for Women","correspondingAuthor":false,"prefix":"","firstName":"Harshita","middleName":"","lastName":"Vyas","suffix":""},{"id":482556712,"identity":"a88c8fff-a4a4-4078-becb-1c2d5a65c39a","order_by":1,"name":"RAVINDRA GIRIRAJ BHARDWAJ","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA8ElEQVRIiWNgGAWjYJCCAwxsEIbBByDBxk6KFsMZIC3MRNkD1cLMAyYJKDZvP2N4uKDsnjyDRPKDYptf2+T5mBkYP3zMwa1F5kyOweEZ54oNGyTSDIxz+24btjEzMEvO3IZbiwRDWsJh3rYExgbpBKCWntuMQC1szLz4tPA/A2uxb5BO/2Bs2XPbnrAWieQDIC2JDdI5BsYMP24nEqHl8YHDPOcSktvk3xQY9jbcTm5jZmzG7xf+xObPPGUJtv08x7cZ/Phz23Z+e/PBDx/xaIEDYNSwGTC2gZiMDUSohwDmBwx/iFY8CkbBKBgFIwgAAPyVS9l66sE9AAAAAElFTkSuQmCC","orcid":"https://orcid.org/0000-0003-3816-9437","institution":"Birla Institute of Technology and Science, Pilani, Dubai Campus","correspondingAuthor":true,"prefix":"","firstName":"RAVINDRA","middleName":"GIRIRAJ","lastName":"BHARDWAJ","suffix":""}],"badges":[],"createdAt":"2025-07-08 
19:02:25","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":true,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":true},"doi":"10.21203/rs.3.rs-7077588/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7077588/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":86325775,"identity":"a398f804-5682-4191-b398-cd40b8461b6b","added_by":"auto","created_at":"2025-07-09 10:42:13","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":283379,"visible":true,"origin":"","legend":"\u003cp\u003eCategorization of questions based on topics and difficulty level\u003c/p\u003e","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7077588/v1/43ed792658c20ff98a7e8c58.png"},{"id":86325776,"identity":"dbb13298-9e73-4f37-82c7-2b9c1432484a","added_by":"auto","created_at":"2025-07-09 10:42:13","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":45551,"visible":true,"origin":"","legend":"\u003cp\u003eFlowchart demonstrating the dual trial framework adopted for evaluating code generated by ChatGPT and DeepSeek on ACM ICPC problems\u003c/p\u003e","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7077588/v1/c0c87d31d47bd27d4c818a4a.png"},{"id":86325772,"identity":"a49f7d55-96d7-45a6-bf6a-1a50b6d5f4e7","added_by":"auto","created_at":"2025-07-09 10:42:13","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":88471,"visible":true,"origin":"","legend":"\u003cp\u003e(a) Comparison of the accuracy of ChatGPT and DeepSeek before and after training. 
(b) Occurrence of syntax, logical, and runtime errors before and after training in ChatGPT and DeepSeek code generation\u003c/p\u003e","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7077588/v1/ccfe4e83a7cd03054aecaf3a.png"},{"id":86325961,"identity":"636f69a4-25a3-4380-bc66-8ad32e86cb8e","added_by":"auto","created_at":"2025-07-09 10:50:13","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":104469,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of nine types of time complexities found in the solution obtained from ChatGPT and DeepSeek before and after training.\u003c/p\u003e","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7077588/v1/de88c2ef87843609eb908818.png"},{"id":86325786,"identity":"994fa326-dd43-40dd-be1a-3213c4c730b8","added_by":"auto","created_at":"2025-07-09 10:42:14","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":112119,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of nine types of space complexities found in the solution obtained from ChatGPT and DeepSeek before and after training.\u003c/p\u003e","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7077588/v1/024528fd19478e940fee718c.png"},{"id":86325779,"identity":"6aa5f447-4a33-449a-8d07-6f8687b5141b","added_by":"auto","created_at":"2025-07-09 10:42:13","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":90402,"visible":true,"origin":"","legend":"\u003cp\u003eDensity of Insight calculated in the solution acquired from ChatGPT and DeepSeek before and after 
training\u003c/p\u003e","description":"","filename":"Onlinefloatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-7077588/v1/1d287b4dc7410e7b7b6b9db8.png"},{"id":86326779,"identity":"b9390e03-2a9b-45cb-bde9-c2cfbf4dbf04","added_by":"auto","created_at":"2025-07-09 11:06:23","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1941095,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7077588/v1/60304bf3-2d80-4b47-a95d-140b113bdea1.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cstrong\u003eChatGPT vs DeepSeek: A Comparative Evaluation on the International Computer Science Benchmark – ACM ICPC\u003c/strong\u003e\u003c/p\u003e","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eGenerative artificial intelligence (Gen AI) creates new content, such as text, images, or music, based on patterns in the data it has been trained on.\u0026nbsp;Gen AI utilizes large language models (LLMs), it involves training on extensive text datasets and employing AI models with a large number of model parameters [1]. With the rapid adoption of AI models for code generation [2], GenAI has emerged as a transformative force in computer education. Tools based on\u0026nbsp;Gen AI, such as ChatGPT, have quickly become commonplace in education [3], particularly in tasks like programming\u0026nbsp;[4].\u0026nbsp;ChatGPT and DeepSeek represent the latest frontier in AI-powered assistance, combining natural language processing with domain-specific knowledge to help users write, debug, and comprehend code.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eChatGPT has become popular in various aspects because of its flexibility and human-like interaction, in particular. Educational settings have started using Gen AI for code generation as an area of great interest [5]. 
Researchers and industry have taken notice of the recently released DeepSeek [6]. This well-known open-source tool has grown in popularity for AI modeling, giving performance that is competitive but with inference costs [7].\u0026nbsp;DeepSeek-Coder, a specialized LLM custom-made for coding, gives you open-source and performance improvement for difficult programming jobs\u0026nbsp;[8].\u0026nbsp;The rise of these tools has sparked an academic debate on their effectiveness and limitations in high-stakes educational contexts, especially in programming engineering dimensions to a varied degree\u0026nbsp;[9]. Recent research studies highlight the growing significance of GenAI tools in programming education and software development. Studies show that models like ChatGPT enhance student engagement, debugging capabilities, and learning outcomes in coding tasks through natural language explanations and real-time feedback\u0026nbsp;[10, 11]. \u0026nbsp;Comparative evaluations with traditional learning methods suggest that GenAI improves comprehension of algorithms and data structures\u0026nbsp;[12]. Educators are increasingly integrating these tools into classrooms for automated grading, problem generation, and concept clarification\u0026nbsp;[13]. Furthermore, GenAI models trained on domain-specific corpora exhibit higher code accuracy and reduced syntactic errors than general-purpose ones\u0026nbsp;[14]. \u0026nbsp;Despite limitations in logical reasoning, GenAI excels at generating readable, modular code with useful documentation. A recent case study on the transformative potential of AI in software engineering using LeetCode and ChatGPT explores how prompt engineering significantly influences model performance in competitive coding contexts\u0026nbsp;[15].\u003c/p\u003e\n\u003cp\u003eThis study compares ChatGPT and DeepSeek on a challenging benchmark: the Association for Computing Machinery (ACM)’s International Collegiate Programming Contest (ICPC). 
The ICPC is one of the world’s most prestigious algorithmic programming competitions for university students, organized by the ACM. It features a multi-tiered structure comprising local and regional contests, culminating in the World Finals, which invite only the top ~1% of teams globally, where teams of three students are given 5 hours to solve 8–13 challenging problems on a single computer, assessing a range of topics. With only the top 1% of global teams moving on to the world finals, this world-renowned exam assesses the problem-solving skills of competitive programming and software code engineering. Participation in the ICPC world finals is widely recognized as a mark of distinction in the field of computer science. This competition serves as a powerful motivator for students and entry-level professionals to refine their programming skills under timed conditions and realistic problem contexts [16]. Beyond skill development, success in ACM/ICPC is widely regarded as an indicator of high intellectual aptitude. Notably, many technical job interviews incorporate algorithmic challenges that closely resemble those encountered in such competitions, giving experienced contestants a competitive edge in the recruitment process, further enhancing employability. Corporate sponsors, such as IBM, not only fund contest prizes but also actively recruit medalists through internships and direct employment pathways, often prioritizing gold-winning team members for technical roles. Moreover, the career benefits of contest participation extend beyond the private sector. Universities and academic institutions frequently hire accomplished ACM/ICPC participants as programming instructors or team coaches, creating additional employment avenues in education and research. 
Taken together, these illustrate why students are increasingly motivated to engage in the programming competition [17].\u003c/p\u003e\n\u003cp\u003eIn this study, we employ a dual-trial framework for conducting a comprehensive evaluation of ChatGPT and DeepSeek by testing their performance on past ACM ICPC World Finals problems from the years 2013 to 2024 that were officially available. The models were tested on 145 questions covering a wide spectrum of topics by initial code generation attempts, followed by model fine-tuning using indirect training, with solutions executed and tested in a standardized C++ environment. Beyond quantifying accuracy, we assess technical code quality through execution-based metrics (syntax/logical/runtime errors) and structural analysis, including overall readability scores derived from comment density, identifier quality, indentation score, and function size. This approach bridges the gap between theoretical code generation capabilities and real-world programming competition demands by providing actionable insights into how GenAI tools can complement competitive programming training. Our methodology ensures a robust and fair assessment of each model’s strengths and limitations in competitive programming contexts. Overall, this study analyzes the performance of ChatGPT and DeepSeek across varied difficulty levels and topics, including data structures, algorithms, mathematics, computational geometry, optimization, bit manipulation, and string manipulation. This research highlights each model's strengths and limitations in solving competitive coding tasks, offering insights into their practical utility in advanced programming education and assessments.\u003c/p\u003e"},{"header":"2. Background","content":"\u003cp\u003eGiven the rapid adoption of AI in education, it has become increasingly important to evaluate how effectively GenAI tools perform in rigorous academic settings. 
Although AI originated within computer science, its adoption in education has accelerated rapidly, with AI now being integrated across various disciplines and educational levels [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. One such high-stakes context is competitive programming, which requires critical thinking, logical thinking, algorithmic thinking, coding, and more. Ranked programming contests such as ICPC consist of regional and national qualifying tournaments, leading to the World Finals competition. All ICPC problems require well-structured, efficient, and logically sound code that adheres to strict input-output and time constraints. The use of GenAI tools on ICPC-style problems offers a promising opportunity to enhance programming education by helping students improve their algorithmic thinking, logical reasoning, and coding abilities in a structured and competitive way.\u003c/p\u003e\u003cp\u003eEvery computer science student faces the critical challenge of learning and enhancing their programming skills during their undergraduate studies. This leads them to turn to GenAI assistants like ChatGPT, DeepSeek, and others for guidance when traditional resources fall short. Alarmingly, when students cannot find effective learning resources to develop strong programming foundations, many shift their career interests toward non-programming fields such as data analytics, business management, and other domains. This trend motivated us to rigorously investigate the educational efficacy of GenAI tools, especially in competitive programming contexts. The critical question is: which AI tool will actually guide them toward a correct, efficient solution, rather than leading them down a rabbit hole of syntactically correct but logically flawed code? 
There remains a significant knowledge gap, as no comprehensive comparative study has assessed the effectiveness of GenAI tools on standard programming problems such as ACM ICPC problems. This research addresses that gap by systematically comparing ChatGPT and DeepSeek across 145 real ACM ICPC problems, providing students, educators, and institutions with evidence-based insights into which AI tools truly enhance learning outcomes in competitive programming.\u003c/p\u003e\u003cdiv id=\"Sec2\" class=\"Section2\"\u003e\u003ch2\u003e2.1. Coding \u0026amp; AI\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eProgramming abilities are crucial for computer science students to thrive in the digital age, both academically and in practical applications that call for technical precision, logical thinking, and algorithmic design. The landscape of education is evolving due to the emergence of AI, specifically LLMs such as DeepSeek and ChatGPT, among others. By interpreting natural language prompts and generating syntactically and semantically sound code, these AI tools give students dynamic feedback and support. ChatGPT offers flexible explanations and problem-solving techniques appropriate for students of different skill levels because it has been trained on a variety of text and code datasets [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. In contrast, DeepSeek uses fine-tuning on programming-specific data to provide better performance on more specialized technical challenges [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]. However, the first generations of AI-written code, especially from models such as ChatGPT, remain to be investigated in detail. A primary worry is the validity of the produced code. AI models trained on massive datasets from certain sources could generate code that is flawed or does not follow best practices. 
This is further compounded by the possibility that these models produce code that is syntactically valid but semantically incorrect, so that the running application crashes or behaves unpredictably. Maintainability is also a key challenge when it comes to AI-generated code. AI-generated code is frequently not as clear or as well documented as code written by human developers, making it hard to understand and modify [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Platforms like GeeksforGeeks have recognized the growing role of AI coding assistants in helping students practice and comprehend programming logic through contest-level problem generation, real-time tutoring interfaces, and debugging support. Motivated by these advancements, our study assesses the usefulness of two leading AI tools, ChatGPT and DeepSeek, in advanced programming education by evaluating their actual performance on extremely difficult ACM ICPC problems.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.2. Overview of ACM \u0026ndash; ICPC\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eThe Association for Computing Machinery (ACM), founded in 1947, is the world\u0026rsquo;s foremost scientific and educational society in computing, with a mission to advance the computing profession through publications, conferences, special interest groups, and educational resources. Among its flagship initiatives is the International Collegiate Programming Contest (ICPC), a globally renowned and highly prestigious algorithmic programming competition for university students. The ICPC serves as a platform for fostering innovation, problem-solving skills, and collaboration in real-time, high-pressure environments. 
It is structured into three main tiers: local contests, where universities select their best teams; regional contests, which feature participants from broader geographic zones; and the culminating World Finals, which host only the top 1% of teams worldwide from over 3,000 universities across more than 100 countries.\u003c/p\u003e\u003cp\u003eIn this contest, teams of three students work collaboratively on a single computer to solve 8\u0026ndash;13 algorithmic problems within a 5-hour time frame. These problems vary in difficulty and test a wide range of technical skills including data structures (trees, heaps, disjoint sets), algorithms (graph theory, dynamic programming, greedy approaches, backtracking), mathematics (combinatorics, number theory, geometry), computational geometry, bit manipulation, and advanced topics such as network flows, line sweep algorithms, and segment trees. The contest emphasizes code efficiency, algorithmic thinking, and real-time problem-solving under pressure. Programming languages allowed include C, C++, and Java, with C++ being the predominant choice owing to its fast execution and rich standard library (STL). Each year, over 50,000 students participate across the ICPC, but only a select few reach the elite World Finals. Winners of the World Finals are awarded not only global recognition and trophies, but also internship offers, job opportunities, and monetary prizes from top tech companies such as IBM, JetBrains, Huawei, and others who sponsor the event. Some editions have awarded cash prizes of up to \u003cspan\u003e$\u003c/span\u003e15,000 for the top team, along with medals, certificates, and often direct fast-track recruitment interviews with prestigious organizations. As thousands of students prepare annually for the ACM ICPC, the lack of dedicated, structured resources for contest-specific training remains a significant gap. 
Given this context, the potential of GenAI models to support such preparation, by offering real-time feedback, enhancing algorithmic thinking, and improving code precision, demands rigorous empirical evaluation. This underscores the relevance and timeliness of our comparative study, which aims to assess the practical utility of ChatGPT and DeepSeek in meeting the unique challenges of competitive programming education.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"3. Methodology","content":"\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e3.1. Dataset and Prompting\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eTo thoroughly evaluate the code generation abilities of ChatGPT and DeepSeek in a competitive programming setting, we gathered and prepared a dataset of 145 programming problems from the official ACM ICPC archives and related online judge platforms. To ensure diversity in problem difficulty, topic distribution, and algorithmic scope, the selection covers contests from 2013 to 2024, spanning more than ten years. Based on the historical success rate and time-to-solve information supplied in contest metadata, each problem was initially assigned to one of three standard tiers: easy, moderate, or difficult. The dataset consists of 40 easy, 70 moderate, and 35 difficult questions. The questions were then further categorized into topical domains as illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. These included Data Structures, Algorithms, Mathematics, String and Bit Manipulation, Computational Geometry, Advanced Topics, and Optimization Criteria. In the Data Structures category, problems involving trees, heaps, and disjoint sets test the models' comprehension of hierarchical relationships, memory management, and the application of union-find optimizations. 
In the Algorithms category, tasks on dynamic programming, graph traversal (Depth First Search/Breadth First Search), greedy strategies, and backtracking exposed subtle model behaviors.\u003c/p\u003e\u003cp\u003eIn the initial trial, we created a standardized prompt template for submitting problems to both models in order to guarantee evaluation consistency. The official problem statement, expected input format, and constraints were all made clear in the prompt. Through this methodical approach, each model received the same input, free from formatting bias. A second round of testing was carried out for problems that had incorrect answers in the first trial, whether because of logical mistakes, inefficiencies, or failure to compile. Based on the nature of the prior failure, a revised prompt template was created in this phase to incorporate indirect training. This iterative interaction allowed us to replicate the learning and feedback loop that students frequently encounter during practice. For both evaluation trials, the outputs generated by ChatGPT and DeepSeek were systematically documented in a structured tabular format [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. Each entry captured relevant metadata, including problem ID, difficulty level, topical category, trial number, and key performance indicators such as execution success, logical correctness, and code quality metrics. Code correctness was verified not only through the sample input/output cases provided in the original problem statements but also by cross-checking against the official solution sketches and reference implementations published by ACM ICPC.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e3.2. 
Performance and Quality Metrics\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eThis study employs a comprehensive multidimensional evaluation framework, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, to assess the performance and quality of AI-generated code solutions through both quantitative and qualitative metrics. The primary quantitative measure is accuracy, which serves as a fundamental benchmark for problem-solving capability and comparative analysis against other AI systems. Computational complexity analysis is conducted using the Big O notation to evaluate the time and space complexity, providing insights into algorithmic efficiency of the code. Error analysis encompasses three critical categories: runtime errors (execution failures due to memory overflow or invalid operations that compromise system stability), syntax errors (compilation-stage violations of language rules that prevent program execution), and logical errors (algorithmic reasoning defects resulting in incorrect outputs despite successful compilation and execution). Code quality assessment integrates four readability sub-metrics to generate a composite score: function size, indentation score (percentage of correctly formatted code blocks), identifier quality (percentage of meaningful variable names), and comment density (percentage of commented code lines). Additionally, theoretical explanation quality is evaluated on a five-point scale considering output correctness, code correctness, compendiousness, validity, and non-obviousness. 
These integrated qualitative and quantitative metrics ensure that generated solutions not only demonstrate functional correctness but also adhere to established software engineering principles, thereby facilitating code maintenance, collaborative development, and educational applications in professional programming environments.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e3.3. Experimental Setup\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eTo ensure consistency and fairness in evaluation, we employed a manual process for inputting and testing prompts across both ChatGPT and DeepSeek models. A standardized prompt format was created for the initial trial (before training), which was uniformly applied to all 145 ACM ICPC problems. Prompt 1, used before training, was: \u003cem\u003e\"PROVIDE C++ CODE, SOLUTION SKETCH EXPLAINING THE APPROACH AND SAMPLE OUTPUT FOR THE GIVEN PROBLEM.\"\u003c/em\u003e This prompt was followed by the official problem statement as provided in the contest archives. If the output provided by ChatGPT or DeepSeek was incorrect or suboptimal, we performed indirect training using a second prompt: \u003cem\u003e\"THE CODE IS GIVING WRONG OUTPUT FOR PROBLEM, REWRITE.\"\u003c/em\u003e This second prompt, which we call the \u0026lsquo;indirect prompt\u0026rsquo;, aimed to guide the models toward correction without introducing human bias or additional hints. All responses were collected, executed, and analyzed manually to preserve the integrity of the evaluation.\u003c/p\u003e\u003cp\u003eThe research was conducted on an HP EliteBook laptop equipped with an Intel Core i5 processor, offering a base frequency of 1.4 GHz and Turbo Boost capability up to 5.0 GHz. 
The system was supported by 32 GB of RAM and featured PCIe NVMe SSD storage ranging from 512 GB to 1 TB, ensuring fast data access and high-performance multitasking. The device included a 14-inch display with either WUXGA (1920 \u0026times; 1200) or Full HD (1920 \u0026times; 1080) resolution, providing clear visual output. For graphics processing, it utilized Intel UHD Graphics and AMD Radeon Graphics.\u003c/p\u003e\u003cp\u003eAll responses generated by ChatGPT and DeepSeek across the two iterative trials were meticulously recorded. Each of the 145 ACM ICPC problems was randomly scrutinized based on the stated performance and quality metrics to assess the suitability of the generated code and solutions. The generated code was executed in a standardized C++ compilation setup to ensure uniformity, and compiler responses were noted to check syntactic correctness and runtime behavior. To determine functional correctness, all outputs were compared against the official sample outputs provided with the problem sets. The same procedure was applied to both the initial and improved prompt trials, so that any gains or regressions in code correctness and quality could be tracked across iterations. Hence, the dual-trial regime combined with quantitative metric analysis gave a thorough yet objective assessment of the two models' code generation abilities under real contest settings.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"4. Results","content":"\u003cp\u003eThe comparative analysis found that DeepSeek outperformed ChatGPT across multiple dimensions of code generation on 145 ACM ICPC problems. DeepSeek demonstrated higher accuracy than ChatGPT (88.28% vs. 84.14%), lower error rates, particularly in logical consistency, and more optimal algorithmic performance, reflected by a higher proportion of O(1) and O(n) solutions. 
Interestingly, ChatGPT and DeepSeek produced comparably readable and well-documented code, as evidenced by approximately the same code quality scores. Additionally, DeepSeek achieved greater instructional depth, as confirmed by its larger share of fully insightful responses under the Density of Insight (DOI) metric. Although ChatGPT exhibited substantial improvements after iterative refinement, DeepSeek maintained a clear performance advantage before and after training. The following subsections provide a detailed comparative evaluation of ChatGPT and DeepSeek.\u003c/p\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e4.1. Accuracy Analysis\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eAccuracy is defined as the proportion of questions that the GenAI model answers correctly, with successfully executed code, out of the total number of questions posed to it (Bhardwaj \u0026amp; Bedi, 2025). DeepSeek consistently outperformed ChatGPT in both experimental trials. ChatGPT's accuracy before training was 77.93%, while DeepSeek's was 84.14%, a 6.21 percentage point difference in favor of DeepSeek, as seen in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e (a). Both models performed better after training. ChatGPT's accuracy increased to 84.14%, matching DeepSeek's initial trial baseline performance. In the second trial, DeepSeek improved further, achieving an accuracy of 88.28%, a gain of 4.14 percentage points. These results show that DeepSeek maintains a 4 to 6 percentage point accuracy advantage, even though both models gain a great deal from training. 
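As an illustrative check of the accuracy figures above, the following Python sketch applies the accuracy formula used in this study (correct answers divided by total questions, times 100). The per-model counts of correct answers are assumptions back-computed from the reported percentages and the 145-question total; they are not stated in the paper, and the names `accuracy_percent`, `accuracy_table`, and `assumed_correct` are hypothetical.

```python
TOTAL_QUESTIONS = 145  # dataset size stated in the study


def accuracy_percent(correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Accuracy (%) = correct answers / total questions * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# ASSUMPTION: counts inferred from the reported percentages, not given in the paper.
assumed_correct = {
    ("ChatGPT", "before"): 113,
    ("ChatGPT", "after"): 122,
    ("DeepSeek", "before"): 122,
    ("DeepSeek", "after"): 128,
}


def accuracy_table(counts):
    """Map each (model, trial) pair to its accuracy, rounded to two decimals."""
    return {key: round(accuracy_percent(n), 2) for key, n in counts.items()}


if __name__ == "__main__":
    for (model, trial), acc in accuracy_table(assumed_correct).items():
        print(f"{model:8s} {trial:6s} {acc:.2f}%")
```

Under these assumed counts, the sketch reproduces the reported 77.93%, 84.14%, and 88.28% values, and the 6.21 and 4.14 percentage point gaps follow by subtraction.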
ChatGPT's second trial accuracy equaled DeepSeek's first trial result, indicating that DeepSeek has a naturally higher ability to generate accurate code and solve computational problems in the tested scenarios. On the other hand, ChatGPT gained more from training than DeepSeek, showing a 6.21 percentage point improvement compared to DeepSeek's 4.14 percentage point improvement. Accuracy in percentage is calculated using Eq.\u0026nbsp;(\u003cspan refid=\"Equ1\" class=\"InternalRef\"\u003e1\u003c/span\u003e),\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$\\:Accuracy\\:\\left(\\%\\right)=\\frac{Number\\:of\\:correct\\:answers}{Total\\:number\\:of\\:questions}\\times\\:100$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e4.2. Error Analysis\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eA programming error is a state that violates the specification [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Three main types of errors were included in the error rate analysis: syntax, logical, and runtime errors. As shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e (b), for both ChatGPT and DeepSeek, logical errors were the most frequent, followed by syntax and runtime errors. Syntax errors are mistakes in the structure of a program's code that prevent it from being compiled or executed. In general programming practice, they are a very common category of error. 
ChatGPT generated syntax errors in 4.83% of solutions before training, which decreased sharply to 0.69% after training, a 4.14 percentage point improvement. DeepSeek, on the other hand, improved by only 2.07 percentage points, from syntax errors in 4.83% of solutions before training to 2.76% after training. A logic error causes incorrect program execution, in contrast to a syntax error, which prevents execution. Across successive trials, ChatGPT exhibited a marginal reduction in logical error rate, from 13.79% of solutions to 13.10%. DeepSeek showed enhanced logical consistency, lowering logical errors from 9.66% to 7.59% of the total number of solutions. A runtime error is an error that occurs during the execution of a program, causing it to terminate unexpectedly. Runtime errors were comparatively stable and minimal in the code provided by both ChatGPT and DeepSeek. While DeepSeek showed runtime errors in 1.38% of the solutions both before and after training, ChatGPT demonstrated a slight improvement, lowering runtime errors from 1.38% to 0.69% of the total number of solutions after training. Both models thus show mixed behavior in the errors they generate and in how the code improves after training. Overall, from Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e(b), it is observed that while ChatGPT reduced syntax and runtime errors more effectively, DeepSeek reduced logical errors more effectively. Figure\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e (supplementary) shows the total number of syntax, logical, and runtime errors found in the solutions obtained from ChatGPT and DeepSeek before and after training. 
The following equations are used to calculate the syntax, logical, and runtime errors in percentage, respectively.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$$\\:Syntax\\:error\\:\\left(\\%\\right)=\\:\\frac{Number\\:of\\:syntax\\:errors}{Total\\:number\\:of\\:questions}\\times\\:100$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e\n$$\\:Logical\\:error\\:\\left(\\%\\right)=\\:\\frac{Number\\:of\\:logical\\:errors}{Total\\:number\\:of\\:questions}\\times\\:100$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equ4\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ4\" name=\"EquationSource\"\u003e\n$$\\:Runtime\\:error\\:\\left(\\%\\right)=\\:\\frac{Number\\:of\\:runtime\\:errors}{Total\\:number\\:of\\:questions}\\times\\:100$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e4\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e4.3. Time and Space Complexity Analysis\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eTime complexity refers to the computational time an algorithm requires to execute as a function of the input size, while space complexity measures the amount of memory utilized during execution [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. Both time and space complexity are usually expressed using Big O notation. The Big O calculation tool was used to calculate time and space complexity for solutions obtained from both ChatGPT and DeepSeek for 145 questions. 
Figure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e shows that, before training, DeepSeek generated more constant-time O(1) solutions (7 vs. 6) and linear-time O(n) solutions (40 vs. 31) than ChatGPT, suggesting a preference for more efficient algorithms, although it also generated more quadratic-time O(n\u0026sup2;) solutions (47 vs. 19) and cubic-time O(n\u0026sup3;) solutions (24 vs. 11). After training, DeepSeek showed fewer higher-complexity outputs such as cubic-time O(n\u0026sup3;) solutions (4 vs. 9), and more constant-time solutions (8 vs. 6) and linear-time solutions (41 vs. 19). Notably, DeepSeek's quadratic-time solutions dropped from 47 before training to 12 after training. Conversely, ChatGPT performed worse after training, showing a rise in complexity classes from 12 to 26 and a decrease in linear solutions from 31 to 19. These findings underline DeepSeek's superior capability in producing scalable and optimal code, especially in scenarios requiring stringent performance under competitive programming constraints.\u003c/p\u003e\u003cp\u003eIn terms of space complexity, as observed in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e, both ChatGPT and DeepSeek frequently generated solutions with higher space complexity despite improvements after prompt refinement. A significant portion of output from both models fell into complex categories, such as quadratic, exponential, and factorial space, which are generally less desirable in competitive programming contexts. In the initial trial, ChatGPT produced 83 complex-space solutions, while DeepSeek generated 80; after training, these values remained high at 78 and 81, respectively. This indicates that both models, while capable of solving the problems, often relied on less memory-efficient approaches. 
After training, DeepSeek generated 10 quadratic-space O(n\u0026sup2;) and 32 linear-space O(n) solutions, while ChatGPT generated 9 and 40, respectively, with occurrences of higher-complexity classes such as cubic, polynomial, exponential, and factorial significantly reduced. These findings highlight the need for further refinement in guiding GenAI models toward space-optimized solution strategies.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e4.4. Code Quality and Readability Analysis\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eFour sub-parameters were used to calculate the overall code quality: comment density, identifier quality, indentation score, and function size [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. Comment density is defined as the percentage of code lines containing meaningful comments and is calculated using Eq.\u0026nbsp;(\u003cspan refid=\"Equ5\" class=\"InternalRef\"\u003e5\u003c/span\u003e),\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Equ5\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ5\" name=\"EquationSource\"\u003e\n$$\\:Comment\\:density\\:\\left(\\%\\right)=\\:\\frac{Number\\:of\\:comment\\:lines}{Total\\:number\\:of\\:lines\\:in\\:code}\\times\\:100$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e5\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eAn identifier is the name assigned to variables, functions, classes, constants, or any other named entity in programming code. The number of identifiers helps in determining a code's complexity, function size, and maintainability, indicating the density of the code. 
A small number of identifiers suggests simple or unclear code, while too many may reduce readability or indicate poor modularization in small code. In this work, the number of identifiers is counted for each solution obtained from ChatGPT and DeepSeek before and after training. Thereafter, identifier quality is calculated using Eq.\u0026nbsp;(\u003cspan refid=\"Equ6\" class=\"InternalRef\"\u003e6\u003c/span\u003e),\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Equ6\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ6\" name=\"EquationSource\"\u003e\n$$\\:Identifier\\:quality=\\:\\frac{Number\\:of\\:identifiers}{Maximum\\:number\\:of\\:identifiers}\\times\\:100$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e6\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eIndentation refers to the consistent use of spaces or tabs at the beginning of lines of code to visually structure the code into blocks. It helps indicate the logical hierarchy and flow of a program, especially within conditionals, loops, functions, and classes. The indentation score is calculated using Eq.\u0026nbsp;(\u003cspan refid=\"Equ7\" class=\"InternalRef\"\u003e7\u003c/span\u003e),\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Equ7\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ7\" name=\"EquationSource\"\u003e\n$$\\:Indentation\\:score\\:\\left(\\%\\right)=\\frac{Number\\:of\\:properly\\:indented\\:blocks}{Total\\:number\\:of\\:blocks}\\times\\:100$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e7\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eFunction size is the number of functions in a code, favoring modular functions, and it refers to the length and complexity of a function or method in a code. 
In this work, function size is normalized using Eq.\u0026nbsp;(\u003cspan refid=\"Equ8\" class=\"InternalRef\"\u003e8\u003c/span\u003e),\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Equ8\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ8\" name=\"EquationSource\"\u003e\n$$\\:Function\\:size=\\:\\frac{Number\\:of\\:functions}{10}\\times\\:100$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e8\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eThe overall readability score is a quantitative measure used to evaluate how easy it is to read and understand code. In this work, we calculated the overall readability score from comment density, identifier quality, indentation score, and function size using Eq.\u0026nbsp;(\u003cspan refid=\"Equ9\" class=\"InternalRef\"\u003e9\u003c/span\u003e),\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Equ9\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ9\" name=\"EquationSource\"\u003e\n$$\\:Overall\\:readability\\:score=\\frac{CD+IQ+IS+FS}{4}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e9\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003ewhere CD, IQ, IS, and FS are comment density, identifier quality, indentation score, and function size, respectively.\u003c/p\u003e\u003cp\u003eWe calculated the average comment density for both GenAI tools, ChatGPT and DeepSeek, before and after training. ChatGPT shows an average comment density of 3.29 before and 3.39 after training, while DeepSeek shows 3.06 before and 3.45 after training. The maximum comment density, about 40.48, was observed for ChatGPT, indicating that ChatGPT has the potential to provide very well-documented program code. Next, we calculated identifier quality by considering small/medium code blocks. 
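The sub-metric formulas in Eqs. (5) to (9) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the sample counts (a 50-line solution with 4 comment lines, 16 identifiers, 10 blocks, 2 functions) are invented for the example, and the base of 64 identifiers is the maximum value reported in this work.

```python
def comment_density(comment_lines: int, total_lines: int) -> float:
    """Eq. (5): comment lines as a percentage of all code lines."""
    return 100.0 * comment_lines / total_lines


def identifier_quality(n_identifiers: int, max_identifiers: int) -> float:
    """Eq. (6): identifier count relative to the maximum observed count."""
    return 100.0 * n_identifiers / max_identifiers


def indentation_score(indented_blocks: int, total_blocks: int) -> float:
    """Eq. (7): properly indented blocks as a percentage of all blocks."""
    return 100.0 * indented_blocks / total_blocks


def function_size(n_functions: int) -> float:
    """Eq. (8): number of functions normalized by the fixed base of 10."""
    return 100.0 * n_functions / 10


def overall_readability(cd: float, iq: float, is_: float, fs: float) -> float:
    """Eq. (9): unweighted mean of the four sub-metrics."""
    return (cd + iq + is_ + fs) / 4


# Hypothetical sample solution (values invented for illustration only).
cd = comment_density(comment_lines=4, total_lines=50)           # 8.0
iq = identifier_quality(n_identifiers=16, max_identifiers=64)   # 25.0
is_ = indentation_score(indented_blocks=10, total_blocks=10)    # 100.0
fs = function_size(n_functions=2)                               # 20.0
score = overall_readability(cd, iq, is_, fs)                    # 38.25
```

Because the four sub-metrics are averaged with equal weight, a perfect indentation score of 100 contributes 25 points on its own, which is worth keeping in mind when comparing overall scores in the high 30s.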
The largest number of identifiers in any solution, 64, was produced by DeepSeek and is used as the base for calculating identifier quality in this work. ChatGPT showed an average identifier quality of 26.01 and 26.78, while DeepSeek showed 26.67 and 29.11, before and after training, respectively. By this measure, the code produced by DeepSeek is more readable, maintainable, and understandable than that of ChatGPT.\u003c/p\u003e\u003cp\u003eBoth models formatted code consistently, achieving 100% indentation scores across all trials, which indicates strong adherence to structural coding standards. Average function size, by contrast, differed between the models: ChatGPT's outputs were relatively verbose, with function sizes of 20.62 before training and 21.17 after training, whereas DeepSeek produced more concise code, with function sizes of 17.38 and 18.62 before and after training, respectively. Lastly, we calculated the overall readability score. ChatGPT achieved average overall readability scores of 37.48 and 37.85 before and after training, while DeepSeek scored 36.78 and 37.79 for the same tasks, respectively.\u003c/p\u003e\u003cp\u003eCollectively, DeepSeek outperformed in all four sub-parameters: average comment density, identifier quality, indentation score, and function size. Also, DeepSeek's outputs were consistently more readable and maintainable, according to additional analysis of identifier naming and structural organization. Although both models improved after training and reached approximately the same overall readability score, ChatGPT can be considered the better tool for producing production-grade code because of its higher baseline performance and its final overall readability score of 37.85.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e4.5. 
Density of Insight (DOI) Analysis\u003c/h2\u003e\u003cp\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eThe Density of Insight (DOI) analysis assesses the instructional quality and explanatory depth of the generated outputs. The DOI is defined as the ratio of the number of insights available in the explanation to the total number of insight criteria, i.e., DOI = \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\:\\frac{Number\\:of\\:insights}{Total\\:number\\:of\\:insight\\:criteria}\$\u003c/span\u003e\u003c/span\u003e [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Figure\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e illustrates the DOI scores for both models, ChatGPT and DeepSeek, before and after training, across the range from 0 to 1. DOI values were calculated using five criteria: output correctness, code correctness, compendiousness, validity, and non-obviousness. A score of 1 indicates fully insightful content, while scores below 1 indicate an insight deficiency. DeepSeek performed better in terms of DOI both before and after training. In the highest insight category (DOI = 1), DeepSeek achieved 44.14% before training and 46.90% after training, compared with ChatGPT's 38.62% and 42.07%. Both ChatGPT and DeepSeek produced some null solutions with no insight (DOI = 0). The intermediate insight categories, by contrast, show more complex behavior. These findings imply that DeepSeek consistently produced solutions that were more perceptive and educationally useful, covering algorithmic justification and solution methodology more thoroughly. 
The DOI analysis highlights the additional instructional utility of DeepSeek's outputs, which increases its efficacy for use cases involving instruction and explanation.\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"5. Discussions","content":"\u003cp\u003eIn this work, we identified the strengths and weaknesses of two Gen AI tools, ChatGPT and DeepSeek, in complex learning and problem-solving scenarios on programming tasks at the ACM ICPC level. DeepSeek generally achieved greater accuracy and better performance than ChatGPT on most metrics, although ChatGPT showed notable strengths in flexibility and iterative improvement. The advantage of domain-specific fine-tuning is reflected in DeepSeek's higher accuracy and algorithmic solution ratings. DeepSeek also produced more time- and space-efficient code, with a higher count of O(1) and O(n) solutions and fewer O(n\u003csup\u003e2\u003c/sup\u003e) or O(n\u003csup\u003e3\u003c/sup\u003e) algorithms, which suits the demands of competitive programming. Additionally, by maintaining a lower logical error rate before and after training, DeepSeek appeared to have a firmer grasp of algorithmic structure and computational logic. However, across the two trials, ChatGPT displayed remarkable flexibility. Of particular interest is how much the model improved over time, reducing syntax errors from 4.83% to 0.69% of solutions, a greater decrease than DeepSeek's. This suggests that rapid improvement and high sensitivity to iterative optimization are ChatGPT's strengths. 
In addition, ChatGPT demonstrates more efficient learning than DeepSeek, with a 6.21 percentage point improvement after training compared with DeepSeek's 4.14 percentage points.\u003c/p\u003e\u003cp\u003eFurthermore, owing to its flexibility and larger training corpus, ChatGPT provided responses for a wide range of problem types, including some that DeepSeek sometimes failed to tackle, such as complex string manipulation. DeepSeek had the advantage in generating code with good readability and structural clarity, with better function modularity, identifier quality, and comment density. Nevertheless, ChatGPT's ability to give explanations in natural language is notable. Its articulate narratives often made its explanations more accessible to novices, making it a useful tool for those in the early stages of preparation, even though DeepSeek gave the more algorithmically insightful commentary. Furthermore, according to the Density of Insight (DOI) metric, although DeepSeek produced more fully insightful solutions, ChatGPT's outputs retained educational value at a more conceptual and beginner-friendly level of explanation. The two models are therefore complementary in the learning space: ChatGPT for range and pedagogical clarity, and DeepSeek for depth and technical accuracy. In conclusion, ChatGPT's flexibility, quick improvement, and user-friendly results confirm its applicability in educational support systems, even though DeepSeek's unique architecture gives it a distinct edge in competitive programming scenarios. Their combined strengths indicate the potential for hybrid instructional models that use both general-purpose and domain-specific AI tools to improve programming education.\u003c/p\u003e"},{"header":"6. 
Conclusion","content":"\u003cp\u003eThis work shows that Gen AI tools for programming have considerable potential to improve competitive programming instruction and training. Iterative prompt refinement helps both ChatGPT and DeepSeek produce improved outputs. DeepSeek consistently outperforms ChatGPT on important performance metrics such as solution accuracy, complexity optimization, error resilience, code readability, and instructional insight. These results highlight the value of integrating these models into learning environments for algorithm-intensive subjects, especially when specialized assistance is required. The competitive advantage of DeepSeek implies that domain-specific fine-tuning is essential for shaping AI capabilities in niche settings such as competitive coding. This study emphasizes the significance of critical model evaluation and customized application as educational institutions and students increasingly rely on AI for support. Particularly in high-stakes situations like the ACM ICPC, DeepSeek, in its current configuration, stands out as a very capable model for assisting programming pedagogy. However, further training, whether direct or indirect, will help improve the output of Gen AI tools.\u003c/p\u003e"},{"header":"7. Limitations and Future Work","content":"\u003cp\u003eAlthough this study is thorough, it has a few limitations. One major limitation is that neither DeepSeek nor ChatGPT currently offers metadata about response generation time. The ability to evaluate response latency would provide important insight into the models' real-time applicability in competitive programming contexts, where time efficiency is critical. Future models or API implementations that incorporate timestamped output generation may enable a more comprehensive evaluation framework that considers both code quality and assistance speed. 
The evaluation's exclusive use of the C++ programming language is another limitation. Although C++ is the most common language in ACM ICPC contests, evaluating Java and Python, which are extensively used in professional and educational contexts, could provide a more generalizable understanding of each model's capabilities across various programming paradigms.\u003c/p\u003e\u003cp\u003eAdditionally, although this study used a uniform prompt template to ensure fairness, the impact of prompt engineering remains to be examined. Future research could methodically examine how different prompt structures, instructions, and contextual framing affect model performance. For educators and developers looking to optimize the use of AI coding assistants, knowing how various prompt designs affect output accuracy, efficiency, and explanation depth may offer crucial insights. Furthermore, broadening the focus to incorporate real-time collaborative environments, like AI-assisted hackathons or pair programming simulations, would better reflect real-world use cases. Incorporating other emerging LLMs designed for code generation, such as Code Llama or StarCoder, would improve comparative robustness and allow new AI tools in this field to be benchmarked. To sum up, this study offers a strong framework for assessing Gen AI models in competitive programming; however, it can be expanded in the future by incorporating response latency tracking, multi-language support, sophisticated prompt engineering, and real-time educational integration. The incorporation of these models into dynamic, iterative real-time development environments is another possible direction for future research. 
Additional evidence of these models' educational value may be provided by monitoring long-term learning outcomes when students use them for guided problem-solving and by gathering qualitative input from educators and users.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eSupporting Information\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSupplementary file ‘ACM ICPC Supplementary File.docx’ is provided.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of Interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAuthors declared no conflict of interest.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors' contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003eHarshita Vyas:\u003cem\u003e\u0026nbsp;Data Curation, Analysis, Visualization, Writing – original draft\u003c/em\u003e\u003c/li\u003e\n \u003cli\u003eRavindra G Bhardwaj:\u003cem\u003e\u0026nbsp;Conceptualization, Analysis, Supervision, Methodology, Visualization, Validation, Writing – review and editing\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eAll authors read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eVarghese, J. and J. Chapiro, \u003cem\u003eChatGPT: The transformative influence of generative AI on science and healthcare.\u003c/em\u003e Journal of Hepatology, 2024. \u003cstrong\u003e80\u003c/strong\u003e(6): p. 977-980. 
DOI: https://doi.org/10.1016/j.jhep.2023.07.028.\u003c/li\u003e\n\u003cli\u003eAlmanasra, S. and K. Suwais, \u003cem\u003eAnalysis of ChatGPT-Generated Codes Across Multiple Programming Languages.\u003c/em\u003e IEEE Access, 2025. \u003cstrong\u003e13\u003c/strong\u003e: p. 23580-23596. DOI: 10.1109/ACCESS.2025.3538050.\u003c/li\u003e\n\u003cli\u003eBhardwaj, R.G. and H.S. Bedi, \u003cem\u003eChatGPT as an education and learning tool for engineering, technology and general studies: performance analysis of ChatGPT 3.0 on CSE, GATE and JEE examinations of India.\u003c/em\u003e Interactive Learning Environments, 2025. \u003cstrong\u003e33\u003c/strong\u003e(1): p. 321-334. DOI: 10.1080/10494820.2024.2344054.\u003c/li\u003e\n\u003cli\u003eYilmaz, R. and F.G. Karaoglan Yilmaz, \u003cem\u003eAugmented intelligence in programming learning: Examining student views on the use of ChatGPT for programming learning.\u003c/em\u003e Computers in Human Behavior: Artificial Humans, 2023. \u003cstrong\u003e1\u003c/strong\u003e(2): p. 100005. DOI: https://doi.org/10.1016/j.chbah.2023.100005.\u003c/li\u003e\n\u003cli\u003eBucaioni, A., et al., \u003cem\u003eProgramming with ChatGPT: How far can we go?\u003c/em\u003e Machine Learning with Applications, 2024. \u003cstrong\u003e15\u003c/strong\u003e: p. 100526. DOI: https://doi.org/10.1016/j.mlwa.2024.100526.\u003c/li\u003e\n\u003cli\u003eManik, M.M.H., \u003cem\u003eChatGPT vs. DeepSeek: A Comparative Study on AI-Based Code Generation.\u003c/em\u003e arXiv preprint arXiv:.18467, 2025.\u003c/li\u003e\n\u003cli\u003eXu, Y., et al., \u003cem\u003eArtificial intelligence: A powerful paradigm for scientific research.\u003c/em\u003e The Innovation, 2021. \u003cstrong\u003e2\u003c/strong\u003e(4): p. 100179. 
DOI: https://doi.org/10.1016/j.xinn.2021.100179.\u003c/li\u003e\n\u003cli\u003eGuo, D., et al., \u003cem\u003eDeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence.\u003c/em\u003e arXiv preprint arXiv:.14196, 2024.\u003c/li\u003e\n\u003cli\u003eNikolic, S., et al., \u003cem\u003eChatGPT versus engineering education assessment: a multidisciplinary and multi-institutional benchmarking and analysis of this generative artificial intelligence tool to investigate assessment integrity.\u003c/em\u003e European Journal of Engineering Education, 2023. \u003cstrong\u003e48\u003c/strong\u003e(4): p. 559-614. DOI: 10.1080/03043797.2023.2213169.\u003c/li\u003e\n\u003cli\u003eYang, A., Z. Li, and J. Li, \u003cem\u003eAdvancing GenAI assisted programming--a comparative study on prompt efficiency and code quality between GPT-4 and GLM-4.\u003c/em\u003e arXiv preprint arXiv:.12782, 2024.\u003c/li\u003e\n\u003cli\u003eHuang, A.Y.Q., et al., \u003cem\u003eThe impact of GenAI-enabled coding hints on students\u0026apos; programming performance and cognitive load in an SRL-based Python course.\u003c/em\u003e British Journal of Educational Technology, 2025. \u003cstrong\u003en/a\u003c/strong\u003e(n/a). DOI: https://doi.org/10.1111/bjet.13589.\u003c/li\u003e\n\u003cli\u003eYu, L., \u003cem\u003eParadigm shift on Coding Productivity Using GenAI.\u003c/em\u003e arXiv preprint arXiv:.18404, 2025.\u003c/li\u003e\n\u003cli\u003eZimmermann, M., H. Janetzko, and B. Haymond. \u003cem\u003eIntegrating generative AI methods in computer science education: perspectives, strategies, and outcomes\u003c/em\u003e. in \u003cem\u003eEDULEARN24 Proceedings\u003c/em\u003e. 2024. IATED.\u003c/li\u003e\n\u003cli\u003eCubillos, C., et al., \u003cem\u003eGenerative Artificial Intelligence in Computer Programming: Does It Enhance Learning, Motivation, and the Learning Environment?\u003c/em\u003e IEEE Access, 2025. \u003cstrong\u003e13\u003c/strong\u003e: p. 40438-40455. 
DOI: 10.1109/ACCESS.2025.3532883.\u003c/li\u003e\n\u003cli\u003eMerkel, M. and J. D\u0026ouml;rpinghaus, \u003cem\u003eA case study on the transformative potential of AI in software engineering on LeetCode and ChatGPT.\u003c/em\u003e arXiv preprint arXiv:.03639, 2025.\u003c/li\u003e\n\u003cli\u003eZheng, Y. and M. Sarem. \u003cem\u003eC++ Teaching Reform and Exploration Based on ACM/ICPC and Live Code\u003c/em\u003e. in \u003cem\u003eProceedings of the 2022 5th International Conference on Education Technology Management\u003c/em\u003e. 2022.\u003c/li\u003e\n\u003cli\u003eBlum, J.J., \u003cem\u003eCompetitive programming participation rates: an examination of trends in U.S. ICPC regional contests.\u003c/em\u003e Discover Education, 2023. \u003cstrong\u003e2\u003c/strong\u003e(1): p. 11. DOI: 10.1007/s44217-023-00034-1.\u003c/li\u003e\n\u003cli\u003eHajkowicz, S., et al., \u003cem\u003eArtificial intelligence adoption in the physical sciences, natural sciences, life sciences, social sciences and the arts and humanities: A bibliometric analysis of research publications from 1960-2021.\u003c/em\u003e Technology in Society, 2023. \u003cstrong\u003e74\u003c/strong\u003e: p. 102260. DOI: https://doi.org/10.1016/j.techsoc.2023.102260.\u003c/li\u003e\n\u003cli\u003eZhu, Q., et al., \u003cem\u003eDeepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence.\u003c/em\u003e arXiv preprint arXiv:.11931, 2024.\u003c/li\u003e\n\u003cli\u003eVyas, H.B., RAVINDRA (2025). \u003cem\u003eData Set - ChatGPT vs DeepSeek: A Comparative Evaluation on the International Computer Science Benchmark \u0026ndash; ACM ICPC\u003c/em\u003e. doi:10.17632/4ccc4dhpms.1\u003c/li\u003e\n\u003cli\u003eLaski, J., \u003cem\u003eProgramming faults and errors: Towards a theory of software incorrectness.\u003c/em\u003e Annals of Software Engineering, 1997. \u003cstrong\u003e4\u003c/strong\u003e(1): p. 79-114. DOI: 10.1023/A:1018966827888.\u003c/li\u003e\n\u003cli\u003eGao, Q. and X. 
Xu. \u003cem\u003eThe analysis and research on computational complexity\u003c/em\u003e. in \u003cem\u003eThe 26th Chinese Control and Decision Conference (2014 CCDC)\u003c/em\u003e. 2014. DOI: 10.1109/CCDC.2014.6852777.\u003c/li\u003e\n\u003cli\u003eFeng, Y., et al. \u003cem\u003eInvestigating Code Generation Performance of ChatGPT with Crowdsourcing Social Data\u003c/em\u003e. in \u003cem\u003e2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC)\u003c/em\u003e. 2023. DOI: 10.1109/COMPSAC57700.2023.00117.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Birla Institute of Technology and Science, Pilani - Dubai Campus","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"ACM ICPC, ChatGPT, Code Generation, Code Quality, DeepSeek, Education, Gen AI.","lastPublishedDoi":"10.21203/rs.3.rs-7077588/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7077588/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe effectiveness of two leading Gen AI models, ChatGPT and DeepSeek, is evaluated in addressing complex programming problems based on the ACM International Collegiate 
Programming Contest (ICPC), a widely accepted standard in competitive coding. The study evaluates both models on readability, error handling, speed of computation, accuracy of code, and educational value. In a two-trial experimental setup, both models are evaluated on 145 different ICPC problems spanning data structures, algorithms, mathematics, geometry, advanced optimization, and other topics. The prompts for all these problems were standardized, and the evaluation took place across two iterations, mimicking iterative learning. The results indicate that both DeepSeek and ChatGPT improved their performance over time. DeepSeek consistently outperformed ChatGPT in code accuracy (88.28% vs. 84.14%), generated more efficient linear-time algorithms (41 vs. 19), and had lower logical error rates (7.58% vs. 15.86%). DeepSeek and ChatGPT performed almost the same in code quality scores (37.79 vs. 37.85). Approximately 46.90% of the solutions generated by DeepSeek were fully insightful, surpassing ChatGPT\u0026rsquo;s 42.07%. However, ChatGPT demonstrated significant improvement across trials, in particular reducing syntax errors from 4.83% to 0.69%. This comparative analysis suggests that DeepSeek may be a more suitable option for high-stakes programming tasks. 
The findings offer valuable guidance for integrating GenAI tools into advanced programming education.\u003c/p\u003e","manuscriptTitle":"ChatGPT vs DeepSeek: A Comparative Evaluation on the International Computer Science Benchmark – ACM ICPC","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-07-09 10:42:08","doi":"10.21203/rs.3.rs-7077588/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"46d09357-ac92-49b6-8ed2-95e628d923d2","owner":[],"postedDate":"July 9th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":51237286,"name":"Educational Philosophy and Theory"},{"id":51237287,"name":"Theoretical Computer Science"},{"id":51237288,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2025-07-09T10:42:09+00:00","versionOfRecord":[],"versionCreatedAt":"2025-07-09 10:42:08","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7077588","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7077588","identity":"rs-7077588","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
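
The readability metrics in Eqs. (6) to (9) and the DOI ratio from Section 4.5 can be illustrated with a short sketch. The Python below is not the authors' code: the function names are hypothetical, and feeding the reported averages into Eq. (9) is an assumption made purely for illustration. It recomputes the overall readability score from the four average sub-scores reported in Section 4.4.

```python
# Illustrative sketch only; all function names are hypothetical.

def identifier_quality(num_identifiers: int, max_identifiers: int = 64) -> float:
    """Eq. (6): identifiers in a solution relative to the maximum observed (64 in the paper)."""
    return 100.0 * num_identifiers / max_identifiers


def indentation_score(properly_indented_blocks: int, total_blocks: int) -> float:
    """Eq. (7): percentage of blocks that are properly indented."""
    return 100.0 * properly_indented_blocks / total_blocks


def function_size(num_functions: int) -> float:
    """Eq. (8): number of functions normalized by the constant 10."""
    return 100.0 * num_functions / 10


def overall_readability(cd: float, iq: float, is_: float, fs: float) -> float:
    """Eq. (9): plain mean of comment density, identifier quality,
    indentation score, and function size."""
    return (cd + iq + is_ + fs) / 4


def density_of_insight(criteria_met: int, total_criteria: int = 5) -> float:
    """DOI: insights present divided by the number of insight criteria (five in the paper)."""
    return criteria_met / total_criteria


# Average sub-scores (CD, IQ, IS, FS) as reported in Section 4.4.
reported = {
    ("ChatGPT", "before"): (3.29, 26.01, 100.0, 20.62),
    ("ChatGPT", "after"): (3.39, 26.78, 100.0, 21.17),
    ("DeepSeek", "before"): (3.06, 26.67, 100.0, 17.38),
    ("DeepSeek", "after"): (3.45, 29.11, 100.0, 18.62),
}

for (model, trial), scores in reported.items():
    print(f"{model:8s} {trial:6s} overall readability = {overall_readability(*scores):.3f}")
```

Because Eq. (9) is a plain mean, applying it to the reported averages gives values close to the reported overall scores (about 37.48 for ChatGPT before training, for example); small differences elsewhere can come from rounding in the reported figures.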



License: CC-BY-4.0