From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Deep learning agents can achieve high performance in complex game domains, often without understanding the underlying causal game mechanics. To address this, we investigate causal induction, the ability to infer governing laws from observational data, by tasking Large Language Models (LLMs) with reverse-engineering Video Game Description Language (VGDL) rules from gameplay traces. To reduce redundancy, we select nine representative games from the General Video Game AI (GVGAI) framework using semantic embeddings and clustering. We compare two approaches to VGDL generation: direct code generation from observations, and a two-stage method that first infers a structural causal model (SCM) and then translates it into VGDL. Both approaches are evaluated across multiple prompting strategies and controlled context regimes, varying the amount and form of information provided to the model, from raw gameplay observations alone to partial VGDL specifications. Results show that the SCM-based approach more often produces VGDL descriptions closer to the ground truth than direct generation, achieving preference win rates of up to 81% in blind evaluations and yielding fewer logically inconsistent rules. The learned SCMs can support downstream use cases such as causal reinforcement learning, interpretable agents, and procedurally generating novel but logically consistent games.


💡 Research Summary

This paper investigates whether large language models (LLMs) can perform causal induction—inferring the underlying rules of a video game from raw gameplay observations. The authors use the General Video Game AI (GVGAI) framework, whose games are defined in the human‑readable Video Game Description Language (VGDL). Because VGDL explicitly encodes sprites, level layouts, interaction rules, and termination conditions, it can be viewed as a concrete structural causal model (SCM) where sprites correspond to endogenous variables, level mappings to exogenous inputs, and interaction rules to structural equations.
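The VGDL-as-SCM correspondence above can be made concrete with a small sketch. This is a hypothetical illustration, not the paper's implementation: the sprite names and rule effects are placeholders, and only the mapping (sprites as endogenous variables, interaction rules as structural equations triggered by exogenous collision events) follows the description.

```python
# Hypothetical sketch: a VGDL fragment viewed as a structural causal model.
# Sprite names and rules are illustrative, not from the paper's benchmark.

# Endogenous variables: sprite states (positions are elided for brevity).
sprites = {"avatar", "wall", "gem", "exit"}

# Interaction rules act like structural equations: given a collision event
# (an exogenous input arising from the level layout and player actions),
# they deterministically update the state of the affected sprite.
interaction_set = {
    ("avatar", "wall"): "stepBack",   # movement is undone
    ("gem", "avatar"): "killSprite",  # the gem is collected (removed)
}

def apply_collision(state, a, b):
    """Apply the structural equation for a collision of sprites a and b."""
    effect = interaction_set.get((a, b))
    new_state = dict(state)
    if effect == "killSprite":
        new_state[a] = "dead"          # first argument is the affected sprite
    elif effect == "stepBack":
        new_state[a] = "stepped_back"
    return new_state

state = {s: "alive" for s in sprites}
state = apply_collision(state, "gem", "avatar")
print(state["gem"])  # -> dead
```

Under this reading, inferring VGDL rules from traces amounts to recovering the `interaction_set` mapping from observed state transitions.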

To obtain a robust test set, the authors first translate the VGDL of 80 GVGAI games into concise natural‑language summaries (<100 words). These summaries are embedded with a Sentence‑Transformer (S‑BERT) model, and K‑Means clustering (k = 9) is applied to the 384‑dimensional vectors. The game nearest each cluster centroid is selected, yielding a semantically diverse benchmark of nine representative games.
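The selection pipeline can be sketched as follows. This is a minimal stand-in, not the authors' code: random vectors replace the S-BERT embeddings, the dimensions are shrunk from 384/80/9 for readability, and a plain k-means loop replaces whatever library implementation the paper used. Only the procedure (cluster the embeddings, pick the game nearest each centroid) follows the description.

```python
import math
import random

random.seed(0)
DIM, K, N = 8, 3, 12  # small stand-ins for 384-dim vectors, k = 9, 80 games

# Stand-in embeddings; in the paper these come from an S-BERT model applied
# to <100-word natural-language summaries of each game's VGDL.
embeddings = {f"game_{i}": [random.gauss(0, 1) for _ in range(DIM)]
              for i in range(N)}

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm: assign points, then recompute centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

centroids = kmeans(list(embeddings.values()), K)

# The benchmark is the game whose embedding lies nearest each centroid.
benchmark = [min(embeddings, key=lambda g: dist(embeddings[g], c))
             for c in centroids]
print(benchmark)  # K representative, semantically spread-out games
```

Picking the centroid-nearest member (rather than the centroid itself) guarantees every benchmark entry is an actual game from the corpus.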

Model selection begins with a quick classification test on ten random games, identifying Qwen‑3‑8B as the most cost‑effective performer. Additional “reasoning” models from the Qwen family, including a quantized Qwen‑32B, are evaluated for a Pareto trade‑off between accuracy and runtime; Qwen‑3‑8B and Qwen‑32B lie on the frontier and are used for the main experiments.
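The Pareto filter behind this model choice is simple to state in code. The accuracy and runtime numbers below are illustrative placeholders, not results from the paper; only the dominance criterion (no other model is at least as accurate and at least as fast, with one strict improvement) is the technique being shown.

```python
# Hypothetical (accuracy, seconds-per-game) pairs -- placeholders only.
candidates = {
    "qwen3-8b":  (0.78, 12.0),
    "qwen3-32b": (0.85, 41.0),
    "model-x":   (0.70, 30.0),  # dominated: less accurate AND slower than 8B
}

def pareto_frontier(models):
    """Keep models not dominated on (higher accuracy, lower runtime)."""
    frontier = []
    for name, (acc, rt) in models.items():
        dominated = any(
            (a >= acc and r <= rt) and (a > acc or r < rt)
            for other, (a, r) in models.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

print(pareto_frontier(candidates))  # -> ['qwen3-32b', 'qwen3-8b']
```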

Two tasks are defined:

  1. Multi‑Class Game Identification – Given a short (10‑frame) ASCII‑grid trace and a set of candidate game descriptions, the LLM must output the correct game label. Four prompt variants are tested to separate memorization from genuine causal reasoning: (a) Standard – fixed expert‑written descriptions; (b) Cons – model‑refined expert descriptions; (c) Dest – descriptions generated solely from the game name; (d) VGDL – summaries of the raw VGDL. Accuracy is recorded for each model‑prompt combination.

  2. VGDL Synthesis – The core contribution is a hierarchical context injection scheme with five levels (L0–L4). At each level the model receives increasingly rich information: raw observations only (L0), VGDL grammar plus an example (L1), the target game’s name and natural‑language description (L2), a dictionary of distractor game descriptions (L3), and finally a partially‑filled VGDL file with missing InteractionSet and TerminationSet (L4). For every level two parallel generation streams are run:

    • Stream A (Direct) – The model directly produces executable VGDL code from the supplied context.
    • Stream B (SCM‑Guided) – The model first drafts an explicit SCM (listing variables and structural equations) and then translates this SCM into VGDL.
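The five nested context levels can be sketched as an assembly function. The prompt wording below is hypothetical; only the cumulative level structure (each level adds one block on top of the previous ones) follows the paper's scheme.

```python
# Sketch of the hierarchical context-injection scheme (L0-L4).
# Section wording is invented; the level structure mirrors the description.
def build_context(level, trace, grammar=None, name=None, description=None,
                  distractors=None, partial_vgdl=None):
    parts = [f"Gameplay observations:\n{trace}"]                   # L0: always
    if level >= 1:
        parts.append(f"VGDL grammar and example:\n{grammar}")      # L1
    if level >= 2:
        parts.append(f"Game: {name}\nDescription: {description}")  # L2
    if level >= 3:
        parts.append(f"Other games (distractors):\n{distractors}") # L3
    if level >= 4:
        # L4: partial VGDL with InteractionSet/TerminationSet left blank
        parts.append(f"Complete this VGDL file:\n{partial_vgdl}")
    return "\n\n".join(parts)

ctx = build_context(2, trace="<10 ASCII frames>", grammar="<grammar>",
                    name="zelda", description="<summary>")
print(ctx.count("\n\n") + 1)  # number of context sections at level 2 -> 3
```

The same context string would then feed both streams: Stream A prompts for VGDL directly, while Stream B first prompts for an explicit SCM and translates it in a second call.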

Evaluation metrics include exact string match to the ground‑truth VGDL, counts of logical inconsistencies (e.g., contradictory collision rules), and blind human preference judgments.
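The preference metric reduces to a simple win-rate computation. The judgment labels below are invented placeholders, and excluding ties is an assumption for illustration; the paper's exact aggregation may differ.

```python
# Hypothetical blind-preference judgments between the two streams.
judgments = ["scm", "scm", "direct", "scm", "tie", "scm"]  # placeholder data

decided = [j for j in judgments if j != "tie"]  # ties excluded (an assumption)
win_rate = decided.count("scm") / len(decided)
print(f"SCM win rate: {win_rate:.0%}")  # -> SCM win rate: 80%
```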

Results show that the SCM‑guided pipeline consistently outperforms direct synthesis. Across all context levels, and especially at L4 (the “Completionist” condition), the SCM approach yields higher accuracy, fewer logical errors, and an 81% win rate in blind human evaluations. Direct synthesis often produces syntactically valid code that nevertheless contains subtle rule violations that break gameplay.

The paper’s contributions are threefold: (1) a semantically diverse, nine‑game benchmark derived via embedding‑based clustering; (2) a self‑conditioned evaluation framework that isolates model memorization from true causal understanding; (3) a dual‑stream generation architecture that demonstrates the practical benefit of forcing LLMs to perform intermediate causal reasoning before code generation.

Beyond the immediate task, the learned SCMs are shown to be useful for downstream applications such as causal reinforcement learning (where interventions can be simulated), interpretable agent design, and procedurally generating new games that are guaranteed to be logically consistent.

In discussion, the authors acknowledge limitations: experiments are confined to 2‑D grid games with short observation windows, and the approach relies on a fixed VGDL grammar. Future work is suggested on extending to 3‑D environments, longer and multi‑agent traces, and more general programming languages. Nonetheless, the study provides compelling evidence that LLMs, when guided to construct explicit causal models, can move beyond pattern matching toward genuine rule induction, opening new avenues for AI systems that need to understand and manipulate the causal structure of complex interactive domains.

