Research Article
Year 2024, Volume: 9 Issue: 2, 8 - 19, 01.02.2025
https://doi.org/10.52876/jcs.1611181

Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI)


Abstract

Grid-based games such as Tic-Tac-Toe, Connect-Four, and Gomoku offer a valuable platform for evaluating large language models (LLMs) in reasoning, rule comprehension, and strategic thinking, which are key skills for advancing Artificial General Intelligence (AGI). Current evaluation benchmarks often focus on tasks such as natural language understanding or domain-specific problem solving and rarely assess multi-step reasoning and decision-making. This study introduces an extensible benchmark framework that leverages these games to evaluate LLMs using three prompt types: list, illustration, and image. The framework's modular design facilitates the addition of new games, dynamic rule changes, and advanced prompt engineering techniques, enabling deeper examination of LLM capabilities. Through 2,310 simulated matches, we evaluated leading LLMs, including Claude 3.5 Sonnet, GPT-4 Turbo, and Llama3-70B. The results revealed significant performance variations: simpler games such as Tic-Tac-Toe yielded fewer invalid moves, while more complex games such as Connect-Four and Gomoku posed greater challenges. List prompts were generally handled well, whereas illustration and image prompts led to higher rates of disqualification and missed opportunities. These findings underscore the utility of grid-based games as benchmarks for evaluating strategic thinking and adaptability, with implications for robotics, autonomous systems, and interactive AI. Limitations in handling visual data and complex scenarios indicate areas for improvement. The open-source nature of the benchmark encourages transparency and community contributions, fostering collaborative advances in LLM research. Future directions include expanding to more complex games, refining prompt techniques, and exploring dynamic rule changes to deepen insight into LLM reasoning capabilities. This study lays the groundwork for advancing AI evaluation through flexible and comprehensive benchmarking tools, guiding progress toward more sophisticated real-world applications.
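
To make the "list" prompt type concrete, the sketch below shows one possible way a list-style Tic-Tac-Toe board prompt and a move-validity check could be implemented. It is a minimal illustration only: the cell notation, prompt wording, and function names are assumptions made for this example and are not taken from the benchmark code published in the repository cited as [37].

```python
# Illustrative sketch only (not the paper's benchmark code): building a
# "list"-style Tic-Tac-Toe prompt and checking whether an LLM's reply is a
# valid move. Cell notation (A1..C3) and prompt wording are assumptions.

def board_as_list(board):
    """Render a 3x3 board as list-style text, one 'cell: occupant' line per cell."""
    cols = "ABC"
    lines = []
    for row in range(3):
        for col in range(3):
            occupant = board[row][col] or "empty"
            lines.append(f"{cols[col]}{row + 1}: {occupant}")
    return "\n".join(lines)

def build_prompt(board, symbol):
    """Compose the text sent to the model for a single turn."""
    return (
        f"You are playing Tic-Tac-Toe as '{symbol}'.\n"
        "Current board (cell: occupant):\n"
        f"{board_as_list(board)}\n"
        "Reply with exactly one empty cell, e.g. B2."
    )

def is_valid_move(board, reply):
    """Return True if the reply names an empty cell; anything else would count as an invalid move."""
    reply = reply.strip().upper()
    if len(reply) != 2 or reply[0] not in "ABC" or reply[1] not in "123":
        return False
    row, col = int(reply[1]) - 1, "ABC".index(reply[0])
    return board[row][col] is None

if __name__ == "__main__":
    empty_board = [[None] * 3 for _ in range(3)]
    print(build_prompt(empty_board, "X"))
    print(is_valid_move(empty_board, "b2"))  # True: B2 is empty
    print(is_valid_move(empty_board, "D9"))  # False: not a cell on the board
```

The illustration and image prompt types would replace the list rendering above with a drawn grid or an encoded board image, while the move-validation step stays the same.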

References

  • [1] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian. A comprehensive overview of large language models. arXiv, 2023.
  • [2] B. Goertzel and C. Pennachin, editors. Artificial General Intelligence, volume 2. Springer, New York, NY, USA, 2007.
  • [3] I. Sutskever. The exciting, perilous journey toward agi, 2024. Available online: https://www.ted.com/talks/ilya_sutskever_the_exciting_perilous_journey_toward_agi (accessed on 7 June 2024).
  • [4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • [5] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018. arXiv:1810.04805.
  • [6] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training, 2024. Available online: https://paperswithcode.com/paper/improving-language-understanding-by (accessed on 7 June 2024).
  • [7] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2303.12712 (accessed on 7 June 2024).
  • [8] G. Team, R. Anil, S. Borgeaud, Y. Wu, J. B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2312.11805 (accessed on 7 June 2024).
  • [9] Anthropic. Model card and evaluations for claude models, 2024. Available online: https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf (accessed on 7 June 2024).
  • [10] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2302.13971 (accessed on 7 June 2024).
  • [11] J. Kaddour, J. Harris, M. Mozes, H. Bradley, R. Raileanu, and R. McHardy. Challenges and applications of large language models. arXiv, 2023. arXiv:2307.10169.
  • [12] S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, and J. Gao. Large language models: A survey. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2402.06196 (accessed on 7 June 2024).
  • [13] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al. A survey on evaluation of large language models. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2307.03109 (accessed on 7 June 2024).
  • [14] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/1804.07461 (accessed on 7 June 2024).
  • [15] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/1905.00537 (accessed on 7 June 2024).
  • [16] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2211.09110 (accessed on 7 June 2024).
  • [17] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020. Available online: https://arxiv.org/abs/2009.03300 (accessed on 7 June 2024).
  • [18] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2206.04615 (accessed on 7 June 2024).
  • [19] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv, 2018. arXiv:1803.05457.
  • [20] S. Lin, J. Hilton, and O. Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv, 2021. arXiv:2109.07958.
  • [21] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? arXiv, 2019. arXiv:1905.07830.
  • [22] C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, et al. Livebench: A challenging, contamination-free llm benchmark. arXiv, 2024. arXiv:2406.19314.
  • [23] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.
  • [24] Z. Guo, R. Jin, C. Liu, Y. Huang, D. Shi, L. Yu, et al. Evaluating large language models: A comprehensive survey. arXiv, 2023. arXiv:2310.19736.
  • [25] A. Ng. We need better evals for llm applications, 2024. Available online: https://www.deeplearning.ai/the-batch/we-need-better-evals-for-llm-applications/ (accessed on 7 June 2024).
  • [26] Q. Tan, A. Kazemi, and R. Mihalcea. Text-based games as a challenging benchmark for large language models, 2024. Available online: https://openreview.net/forum?id=2g4m5S_knF (accessed on 7 June 2024).
  • [27] D. Qiao, C. Wu, Y. Liang, J. Li, and N. Duan. Gameeval: Evaluating llms on conversational games. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2308.10032 (accessed on 7 June 2024).
  • [28] Y. Wu, X. Tang, T. M. Mitchell, and Y. Li. Smartplay: A benchmark for llms as intelligent agents. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2310.01557 (accessed on 7 June 2024).
  • [29] E. Akata, L. Schulz, J. Coda-Forno, S. J. Oh, M. Bethge, and E. Schulz. Playing repeated games with large language models. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2305.16867 (accessed on 7 June 2024).
  • [30] C. F. Tsai, X. Zhou, S. S. Liu, J. Li, M. Yu, and H. Mei. Can large language models play text games well? current state-of-the-art and open questions. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2304.02868 (accessed on 7 June 2024).
  • [31] C. Fan, J. Chen, Y. Jin, and H. He. Can large language models serve as rational players in game theory? a systematic analysis. arXiv [Cs.CL], 2024. Available online: http://arxiv.org/abs/2312.05488 (accessed on 7 June 2024).
  • [32] J. Duan, R. Zhang, J. Diffenderfer, B. Kailkhura, L. Sun, E. Stengel-Eskin, et al. Gtbench: Uncovering the strategic reasoning limitations of llms via game-theoretic evaluations. arXiv, 2024. arXiv:2402.12348.
  • [33] A. Costarelli, M. Allen, R. Hauksson, G. Sodunke, S. Hariharan, C. Cheng, et al. Gamebench: Evaluating strategic reasoning abilities of llm agents. arXiv, 2024. arXiv:2406.06613.
  • [34] O. Topsakal and J. B. Harper. Benchmarking large language model (llm) performance for game playing via tic-tac-toe. Electronics, 13(8), 2024.
  • [35] R. Gallotta, G. Todd, M. Zammit, S. Earle, A. Liapis, J. Togelius, and G. N. Yannakakis. Large language models and games: A survey and roadmap. arXiv, 2024. arXiv:2402.18659.
  • [36] S. Hu, T. Huang, F. Ilhan, S. Tekin, G. Liu, R. Kompella, and L. Liu. A survey on large language model-based game agents. arXiv, 2024. arXiv:2404.02039.
  • [37] LLM Game Benchmark. Llm game benchmark, 2024. Available online: https://github.com/research-outcome/LLM-Game-Benchmark/ (accessed on 19 June 2024).
  • [38] L. V. Allis, H. J. van den Herik, and M. P. Huntjens. Go-moku solved by new search techniques. Computational Intelligence, 12(1):7–72, 1996.
  • [39] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, et al. A survey of large language models. arXiv, 2023. arXiv:2303.18223.
There are 39 citations in total.

Details

Primary Language English
Subjects Artificial Intelligence (Other)
Journal Section Articles
Authors

Oguzhan Topsakal (ORCID: 0000-0002-9731-6946)

Edell Colby

Harper Jackson

Publication Date February 1, 2025
Submission Date January 1, 2025
Acceptance Date January 10, 2025
Published in Issue Year 2024 Volume: 9 Issue: 2

Cite

APA Topsakal, O., Colby, E., & Jackson, H. (2025). Evaluating the Performance of Large Language Models (LLMs) Through Grid-Based Game Competitions: An Extensible Benchmark and Leaderboard on the Path to Artificial General Intelligence (AGI). The Journal of Cognitive Systems, 9(2), 8-19. https://doi.org/10.52876/jcs.1611181