Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)

Oguzhan Topsakal; Edell Colby; Harper Jackson

doi:10.52876/jcs.1611181



Abstract

Grid-based games such as Tic-Tac-Toe, Connect-Four, and Gomoku offer a valuable platform for evaluating large language models (LLMs) on reasoning, rule comprehension, and strategic thinking, which are key skills for advancing Artificial General Intelligence (AGI). Current evaluation benchmarks often focus on tasks such as natural language understanding or domain-specific problem-solving and rarely assess multi-step reasoning and decision-making. This study introduces an extensible benchmark framework that uses these games to evaluate LLMs with three prompt types: list, illustration, and image. The framework's modular design facilitates the addition of new games, dynamic rule changes, and advanced prompt engineering techniques, enabling deeper examination of LLM capabilities. Through 2,310 simulated matches, we evaluated leading LLMs, including Claude 3.5 Sonnet, GPT-4 Turbo, and Llama3-70B. Results revealed significant performance variations: simpler games like Tic-Tac-Toe yielded fewer invalid moves, while more complex games like Connect-Four and Gomoku posed greater challenges. List prompts were generally handled well, while illustration and image prompts led to higher rates of disqualifications and missed opportunities. The findings underscore the utility of grid-based games as benchmarks for evaluating strategic thinking and adaptability, with implications for robotics, autonomous systems, and interactive AI. Limitations in handling visual data and complex scenarios suggest areas for improvement. The open-source nature of the benchmark encourages transparency and community contributions, fostering collaborative advancements in LLM research. Future directions include expanding to more complex games, refining prompt techniques, and exploring dynamic rule changes to deepen insights into LLM reasoning capabilities. This study lays the groundwork for advancing AI evaluation through flexible and comprehensive benchmarking tools, guiding progress toward more sophisticated and real-world applications.
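To make the prompt types more concrete, the following is a minimal Python sketch of how a Tic-Tac-Toe board might be rendered as a list prompt and as an illustration prompt, together with a simple check for invalid moves. It is not taken from the benchmark's code: the function names (`list_prompt`, `illustration_prompt`, `is_valid_move`), the prompt wording, and the 1-indexed coordinate convention are all assumptions made for illustration. The image prompt type is omitted, and the validity rule shown applies only to free-placement games such as Tic-Tac-Toe, not to Connect-Four.

```python
from typing import List, Optional

# Hypothetical board type: None marks an empty cell.
Board = List[List[Optional[str]]]


def list_prompt(board: Board) -> str:
    """Describe the board as a list of occupied cells, one line per player."""
    lines = []
    for mark in ("X", "O"):
        cells = [
            f"({r + 1},{c + 1})"
            for r, row in enumerate(board)
            for c, cell in enumerate(row)
            if cell == mark
        ]
        lines.append(f"{mark} moves: " + (", ".join(cells) if cells else "none"))
    return "\n".join(lines)


def illustration_prompt(board: Board) -> str:
    """Draw the board as text, using '.' for empty cells."""
    return "\n".join(" ".join(cell or "." for cell in row) for row in board)


def is_valid_move(board: Board, row: int, col: int) -> bool:
    """Assumed rule: a move is valid if it is on the grid and the cell is empty.

    Coordinates are 1-indexed to match the prompts above.
    """
    size = len(board)
    return 1 <= row <= size and 1 <= col <= size and board[row - 1][col - 1] is None


board: Board = [
    ["X", None, None],
    [None, "O", None],
    [None, None, "X"],
]

print(list_prompt(board))
print(illustration_prompt(board))
print(is_valid_move(board, 2, 2))  # the centre cell is already occupied
```

A harness along these lines could count a model's reply as an invalid move whenever `is_valid_move` returns `False`, which is one plausible way to arrive at the invalid-move rates the abstract reports.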



Details

Primary Language

English

Subjects

Artificial Intelligence (Other)

Journal Section

Research Article

Authors

Oguzhan Topsakal *
0000-0002-9731-6946
United States

Edell Colby
United States

Harper Jackson
United States

Publication Date

February 1, 2025

Submission Date

January 1, 2025

Acceptance Date

January 10, 2025

Published in Issue

Year: 2024, Volume: 9, Number: 2

DOI

https://doi.org/10.52876/jcs.1611181

IZ

https://izlik.org/JA84RY77FY

Cite


APA

Topsakal, O., Colby, E., & Jackson, H. (2025). Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). The Journal of Cognitive Systems, 9(2), 8-19. https://doi.org/10.52876/jcs.1611181

AMA

1. Topsakal O, Colby E, Jackson H. Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). JCS. 2025;9(2):8-19. doi:10.52876/jcs.1611181

Chicago

Topsakal, Oguzhan, Edell Colby, and Harper Jackson. 2025. “Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)”. The Journal of Cognitive Systems 9 (2): 8-19. https://doi.org/10.52876/jcs.1611181.

EndNote

Topsakal O, Colby E, Jackson H (February 1, 2025) Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). The Journal of Cognitive Systems 9 2 8–19.

IEEE

[1] O. Topsakal, E. Colby, and H. Jackson, “Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)”, JCS, vol. 9, no. 2, pp. 8–19, Feb. 2025, doi: 10.52876/jcs.1611181.

ISNAD

Topsakal, Oguzhan - Colby, Edell - Jackson, Harper. “Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)”. The Journal of Cognitive Systems 9/2 (February 1, 2025): 8-19. https://doi.org/10.52876/jcs.1611181.

JAMA

1. Topsakal O, Colby E, Jackson H. Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). JCS. 2025;9:8–19.

MLA

Topsakal, Oguzhan, et al. “Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)”. The Journal of Cognitive Systems, vol. 9, no. 2, Feb. 2025, pp. 8-19, doi:10.52876/jcs.1611181.

Vancouver

1. Oguzhan Topsakal, Edell Colby, Harper Jackson. Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). JCS. 2025 Feb. 1;9(2):8-19. doi:10.52876/jcs.1611181