Research Article

Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)

Volume: 9 Number: 2 February 1, 2025

Abstract

Grid-based games such as Tic-Tac-Toe, Connect-Four, and Gomoku offer a valuable platform for evaluating large language models (LLMs) on reasoning, rule comprehension, and strategic thinking, which are key skills for advancing Artificial General Intelligence (AGI). Current evaluation benchmarks often focus on tasks such as natural language understanding or domain-specific problem-solving and rarely assess multi-step reasoning and decision-making. This study introduces an extensible benchmark framework that uses these games to evaluate LLMs with three prompt types: list, illustration, and image. The framework's modular design facilitates the addition of new games, dynamic rule changes, and advanced prompt engineering techniques, enabling deeper examination of LLM capabilities. Through 2,310 simulated matches, we evaluated leading LLMs, including Claude 3.5 Sonnet, GPT-4 Turbo, and Llama3-70B. Results revealed substantial performance variations: simpler games like Tic-Tac-Toe yielded fewer invalid moves, while more complex games like Connect-Four and Gomoku posed greater challenges. List prompts were generally handled well, whereas illustration and image prompts led to higher rates of disqualifications and missed opportunities. The findings underscore the utility of grid-based games as benchmarks for evaluating strategic thinking and adaptability, with implications for robotics, autonomous systems, and interactive AI. Limitations in handling visual data and complex scenarios suggest areas for improvement. The open-source nature of the benchmark encourages transparency and community contributions, fostering collaborative advancements in LLM research. Future directions include expanding to more complex games, refining prompt techniques, and exploring dynamic rule changes to deepen insights into LLM reasoning capabilities. This study lays the groundwork for advancing AI evaluation through flexible and comprehensive benchmarking tools, guiding progress toward more sophisticated, real-world applications.
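
To make the prompt types and the notion of an invalid move more concrete, the following is a minimal Python sketch of how a Tic-Tac-Toe position might be rendered as a list prompt and as an illustration prompt, and how a proposed move might be checked. It is an illustration only, not the benchmark's actual code: the function names (`list_prompt`, `illustration_prompt`, `is_valid_move`), the text formats, and the 1-based coordinates are all assumptions made for this example, and the image prompt type is omitted.

```python
from typing import List, Optional

# Hypothetical board representation: each cell is "X", "O", or None (empty).
Board = List[List[Optional[str]]]


def list_prompt(board: Board) -> str:
    """Describe the position as a list of occupied cells per player (1-based row, column)."""
    lines = []
    for player in ("X", "O"):
        cells = [
            f"({r + 1},{c + 1})"
            for r, row in enumerate(board)
            for c, cell in enumerate(row)
            if cell == player
        ]
        lines.append(f"{player} moves: " + (", ".join(cells) if cells else "none"))
    return "\n".join(lines)


def illustration_prompt(board: Board) -> str:
    """Draw the position as a text grid, using '.' for empty cells."""
    return "\n".join("".join(cell or "." for cell in row) for row in board)


def is_valid_move(board: Board, row: int, col: int) -> bool:
    """A 1-based move is valid if it lies on the board and targets an empty cell."""
    if not (1 <= row <= len(board) and 1 <= col <= len(board[0])):
        return False
    return board[row - 1][col - 1] is None


board: Board = [
    ["X", None, None],
    [None, "O", None],
    [None, None, "X"],
]
print(list_prompt(board))
print(illustration_prompt(board))
print(is_valid_move(board, 2, 2))  # occupied cell, so an invalid move
print(is_valid_move(board, 1, 2))  # empty cell, so a valid move
```

In a sketch like this, a model's reply would be parsed into a row and column and passed to `is_valid_move`; how the actual framework counts invalid moves or disqualifies a player is not specified here.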

Details

Primary Language

English

Subjects

Artificial Intelligence (Other)

Journal Section

Research Article

Authors

Edell Colby
United States

Harper Jackson
United States

Publication Date

February 1, 2025

Submission Date

January 1, 2025

Acceptance Date

January 10, 2025

Published in Issue

Year 2024 Volume: 9 Number: 2

APA
Topsakal, O., Colby, E., & Jackson, H. (2025). Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). The Journal of Cognitive Systems, 9(2), 8-19. https://doi.org/10.52876/jcs.1611181
AMA
1. Topsakal O, Colby E, Jackson H. Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). JCS. 2025;9(2):8-19. doi:10.52876/jcs.1611181
Chicago
Topsakal, Oguzhan, Edell Colby, and Harper Jackson. 2025. “Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)”. The Journal of Cognitive Systems 9 (2): 8-19. https://doi.org/10.52876/jcs.1611181.
EndNote
Topsakal O, Colby E, Jackson H (February 1, 2025) Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). The Journal of Cognitive Systems 9 2 8–19.
IEEE
[1] O. Topsakal, E. Colby, and H. Jackson, “Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)”, JCS, vol. 9, no. 2, pp. 8–19, Feb. 2025, doi: 10.52876/jcs.1611181.
ISNAD
Topsakal, Oguzhan - Colby, Edell - Jackson, Harper. “Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)”. The Journal of Cognitive Systems 9/2 (February 1, 2025): 8-19. https://doi.org/10.52876/jcs.1611181.
JAMA
1. Topsakal O, Colby E, Jackson H. Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). JCS. 2025;9:8–19.
MLA
Topsakal, Oguzhan, et al. “Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)”. The Journal of Cognitive Systems, vol. 9, no. 2, Feb. 2025, pp. 8-19, doi:10.52876/jcs.1611181.
Vancouver
1. Oguzhan Topsakal, Edell Colby, Harper Jackson. Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). JCS. 2025 Feb. 1;9(2):8-19. doi:10.52876/jcs.1611181