LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations
Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six state-of-the-art LLMs alongside results from human participants across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions across multiple rounds. We compare models to humans on action alignment, risk calibration via the severity of chosen actions, and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioral profiles and strategy updates. Across all models, explanations for chosen actions exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.
💡 Research Summary
This paper provides one of the first systematic evaluations of large language models (LLMs) acting as decision-makers in real-world geopolitical crisis simulations. The authors replicated four multi-round simulation exercises (Arctic security, US-China-Taiwan tensions, Middle East dynamics, and a US wildfire response) originally run with MBA students. In each scenario, participants chose one action from predefined economic, security, and political options (each annotated with a severity level on a six-point scale) and supplied a justification. The same briefing materials, action menus, and prompts were fed to six state-of-the-art chat-style LLMs (Claude Sonnet 4.6, ChatGPT 5.2, Gemini 3, Grok 4.1, Mistral 3, and Qwen 3.5-Plus).
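To make the protocol concrete, here is a minimal sketch of how one round might be posed to a chat-style model. The scenario text, action menu, severity values, and response format below are illustrative assumptions, not the authors' actual briefing materials or prompts.

```python
# Hypothetical sketch of one simulation round posed to a chat-style LLM.
# All actions and severity annotations are placeholders.

ACTIONS = [
    {"id": "A1", "label": "Open multilateral negotiations", "severity": 1},
    {"id": "A2", "label": "Impose targeted sanctions",      "severity": 4},
    {"id": "A3", "label": "Deploy naval assets",            "severity": 6},
]

def build_prompt(briefing: str) -> str:
    """Combine the scenario briefing with the fixed action menu."""
    menu = "\n".join(
        f'{a["id"]} (severity {a["severity"]}/6): {a["label"]}' for a in ACTIONS
    )
    return (
        f"{briefing}\n\n"
        "Choose exactly one action from the menu below and justify it.\n"
        f"{menu}\n\n"
        'Answer as JSON: {"action": "<id>", "justification": "<text>"}'
    )
```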
Alignment between model‑selected actions and human choices was measured with micro‑ and macro‑averaged F1 scores. In the first round, Gemini (micro‑F1 = 0.540) and Claude (0.533) achieved the highest agreement, while Qwen and Grok lagged behind. By the second round, alignment dropped dramatically for all models (0.16–0.33), indicating that initial convergence stems from shared default heuristics, but extended strategic reasoning amplifies model‑specific divergences. Inter‑model consistency (Krippendorff’s α ≈ 0.41) was slightly higher than model‑human consistency (α ≈ 0.39), suggesting that the models share common training priors that do not fully capture expert judgment. Scenario‑specific analysis revealed the highest human‑model agreement in the cooperative wildfire exercise and the lowest in the competitive Arctic and US‑China‑Taiwan simulations.
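Assuming the per-decision action choices are encoded as categorical labels, the alignment and consistency metrics could be computed along these lines; the data layout, label encodings, and variable names are assumptions for illustration. Note that with a single human reference label per decision point, micro-F1 reduces to simple agreement accuracy, while macro-F1 weights rarely chosen actions equally.

```python
import numpy as np
from sklearn.metrics import f1_score
import krippendorff  # pip install krippendorff

# Hypothetical encodings: human modal choices as the reference,
# one model's picks as the predictions (indices into the action menu).
human_choices = np.array([2, 0, 1, 3, 2])
model_choices = np.array([2, 0, 3, 3, 1])

micro_f1 = f1_score(human_choices, model_choices, average="micro")
macro_f1 = f1_score(human_choices, model_choices, average="macro")

# Inter-model consistency: rows are raters (models), columns are
# decision points; np.nan marks a missing rating.
ratings = np.array([
    [2, 0, 3, 3, 1],
    [2, 1, 3, 3, 2],
    [2, 0, 1, 3, np.nan],
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
```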
Risk calibration was examined by tracking the severity level of chosen actions across rounds. All agents, human and artificial, displayed an upward shift in median severity from round 1 to round 2, reflecting escalation as contexts evolved.
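The severity analysis amounts to tracking a per-round distribution shift. A minimal pandas sketch, with invented agents, rounds, and severity values standing in for the paper's logs:

```python
import pandas as pd

# Hypothetical long-format decision log: one row per agent per round.
df = pd.DataFrame({
    "agent":    ["human", "human", "claude", "claude", "gemini", "gemini"],
    "round":    [1, 2, 1, 2, 1, 2],
    "severity": [2, 4, 3, 5, 2, 4],  # six-point scale from the action menu
})

# Median chosen severity per agent per round; an upward shift from
# round 1 to round 2 indicates escalation.
medians = df.groupby(["agent", "round"])["severity"].median().unstack("round")
print(medians)
```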
To probe the argumentative style of the justifications, the authors built a theory-grounded framing taxonomy derived from International Relations (IR) literature, covering realism (strategic stability, deterrence), liberal institutionalism (multilateral cooperation, rule-based legitimacy), and constructivism (humanitarian values, climate justice). Each explanation was automatically annotated with a primary and a secondary frame using GPT-4o. Across all models, explanations were dominated by normative-cooperative frames emphasizing stability, coordination, and risk mitigation; adversarial or deterrence-focused frames were rare. Lexical analyses (token length, type-token ratio, TF-IDF cosine similarity) showed modest diversity, with Claude and Gemini slightly more varied, while Grok and Qwen tended toward repetitive phrasing.
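The lexical diversity measures are straightforward to reproduce. A sketch using standard scikit-learn utilities, with placeholder justification texts and a simple whitespace tokenizer (the paper's exact tokenization is not specified):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder justification texts, one per model explanation.
justifications = [
    "Coordinated multilateral action stabilizes the region and shares risk.",
    "Escalation risks outweigh any short-term deterrence gains here.",
]

def type_token_ratio(text: str) -> float:
    """Ratio of unique tokens to total tokens (naive whitespace split)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

ttrs = [type_token_ratio(t) for t in justifications]

# Pairwise TF-IDF cosine similarity: values near 1 indicate
# repetitive phrasing across explanations.
tfidf = TfidfVectorizer().fit_transform(justifications)
similarity = cosine_similarity(tfidf)
```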
The study concludes that while LLMs can approximate human baseline strategic heuristics in early decision points, they diverge over time, producing distinct behavioral profiles and limited adversarial reasoning. The strong bias toward normative framing raises concerns about AI‑generated policy advice being overly risk‑averse or consensus‑seeking. The authors recommend future work on calibrating risk tolerance, enriching adversarial framing, and exploring multi‑agent environments where LLMs can learn from dynamic interaction and theory‑of‑mind reasoning.