The effectiveness of generative artificial intelligence models in software development is determined not only by their ability to generate correct solutions but also by their adherence to quality metrics and their resilience to exceptional scenarios. In this context, four models (ChatGPT, Gemini, Claude, and Copilot) were comparatively evaluated on 10 fundamental algorithm problems and 10 object-oriented programming problems in the C# programming language. The generated solutions were assessed in terms of time complexity, memory usage, lines of code, number of variables and methods, and execution time. In addition, meaningful edge-case scenarios were used to measure error tolerance and exception-handling performance. The findings indicate that all models produced functionally valid solutions yet exhibited limitations in advanced software engineering practices such as modularity, comprehensive error management, performance measurement, and unit testing. The analysis revealed that ChatGPT and Gemini stood out for structural quality and consistency, Claude demonstrated greater reliability in error handling, and Copilot offered advantages in code simplicity. Overall, the results highlight the importance of evaluating generative AI models not only under ideal conditions but also in atypical scenarios to ensure software quality and reliability.
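To illustrate the kind of edge-case handling and execution-time checks the abstract describes, the following is a minimal C# sketch; the `FindMax` method, its null/empty-input guard, and the `Stopwatch` timing are hypothetical illustrations under assumed problem definitions, not the actual benchmark code or generated solutions from the study.

```csharp
using System;
using System.Diagnostics;

// Hypothetical example of the style of solution and edge-case checks
// discussed in the study; not the actual evaluated code.
static class MaxFinder
{
    // Returns the largest element, guarding the edge cases of a
    // null or empty input with an explicit exception.
    public static int FindMax(int[] values)
    {
        if (values is null || values.Length == 0)
            throw new ArgumentException("Input must contain at least one element.", nameof(values));

        int max = values[0];
        foreach (int v in values)
            if (v > max) max = v;
        return max;
    }
}

class Program
{
    static void Main()
    {
        // Typical case, timed with Stopwatch as a simple execution-time measure.
        var sw = Stopwatch.StartNew();
        int result = MaxFinder.FindMax(new[] { 3, 7, 1, 9, 4 });
        sw.Stop();
        Console.WriteLine($"Max = {result}, elapsed = {sw.Elapsed.TotalMilliseconds} ms");

        // Edge case: an empty input should raise a handled exception
        // rather than fail silently or crash.
        try
        {
            MaxFinder.FindMax(Array.Empty<int>());
        }
        catch (ArgumentException ex)
        {
            Console.WriteLine($"Edge case handled: {ex.Message}");
        }
    }
}
```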
| Field | Value |
|---|---|
| Primary Language | English |
| Subjects | Natural Language Processing, Automated Software Engineering, Programming Languages |
| Journal Section | Research Article |
| Authors | |
| Submission Date | September 16, 2025 |
| Acceptance Date | November 25, 2025 |
| Publication Date | December 29, 2025 |
| Published in Issue | Year 2025, Volume: 13, Issue: 2 |
Manas Journal of Engineering