Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models

The potential data contamination issue in contemporary large language models (LLMs) benchmarks presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, they predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce \textsc{Squid Game}, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings elaborated to evaluate LLMs through interactive gameplay against other LLM opponents. Squid Game consists of six elimination-style levels, focusing on multi-faceted abilities, including instruction-following, code, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios. We observe a clear generational phase transition in performance in the same model lineage and find evidence that some models resort to speculative shortcuts to win the game, indicating the possibility of higher-level evaluation paradigm contamination in static benchmarks. We also compare prominent LLM benchmarks and \textsc{Squid Game}, highlighting that dynamic evaluation can serve as a complementary part for static evaluations. Project page: https://github.com/zijianchen98/LLM_Squid_Game.

📜 Original Paper Content