Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Behavior cloning has seen a resurgence as scaling model and data sizes demonstrates strong performance. In this work, we introduce an open recipe for training a video-game-playing foundation model designed for real-time inference on a consumer GPU. We release all data (8,300+ hours of high-quality human gameplay), training and inference code, and pretrained checkpoints under an open license. Empirically, we show that our best model achieves performance competitive with human players across a variety of 3D games. We use this recipe to investigate the scaling laws of behavior cloning, with a focus on causal reasoning. In a controlled toy setting, we first demonstrate that increasing training data and network depth leads the model to learn a more causal policy. We then validate these findings at scale, analyzing models up to 1.2 billion parameters. We observe that the causal improvements seen in the toy domain hold as model size and training steps increase.


💡 Research Summary

This paper revisits behavior cloning (BC) as a scalable approach for learning game‑playing policies and demonstrates that increasing both model size and dataset volume systematically improves causal reasoning. The authors release a massive, high‑quality dataset comprising over 8,300 hours of human gameplay across a diverse set of 3D titles, amounting to roughly 600 million image‑action pairs captured at 20 FPS. In addition to raw visual frames and corresponding keyboard/mouse actions, the dataset includes automatically generated textual instructions (via a commercial vision‑language model), a small fraction of human correction trajectories collected in a DAgger‑style loop, and a large corpus of unlabeled public gameplay videos.
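The headline dataset figures above are internally consistent; a quick arithmetic check (using the summary's own numbers, not values from the paper itself):

```python
# Sanity check: 8,300+ hours of gameplay recorded at 20 FPS should
# yield roughly 600 million image-action pairs.
hours = 8300                # reported dataset size (hours)
fps = 20                    # reported capture rate (frames per second)

pairs = hours * 3600 * fps  # seconds per hour * frames per second
# 8,300 h -> 597.6 million pairs, within 1% of the quoted ~600 M
assert abs(pairs - 600_000_000) / 600_000_000 < 0.01
```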

The core model, named Pixels2Play (P2P), is a decoder‑only transformer policy that processes multimodal inputs: a small set of visual tokens derived from the first six layers of EfficientNet‑B0, a frozen text embedding from EmbeddingGemma, and a “reasoning” token that grants an extra computation step. Ground‑truth action tokens are fed to the transformer during training, and a single latent action prediction token is emitted. An auxiliary lightweight action decoder then autoregressively expands this token into eight detailed action tokens (four keyboard keys, two mouse‑movement axes, two mouse‑button states). This design keeps the token count per timestep low, enabling real‑time inference at 20 Hz on a consumer‑grade GPU (e.g., RTX 5090) while still modeling the full combinatorial action space.
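The action factorization described above can be sketched as a simple pack/unpack round trip. This is a minimal illustration of the eight-token layout (four key slots, two mouse-movement axes, two button states) implied by the summary; the field names, token values, and `ActionTokens` type are hypothetical, not the paper's actual API:

```python
# Hedged sketch of the P2P action factorization: a single latent action
# token is expanded into eight discrete sub-tokens. The layout below
# (4 key slots, 2 mouse axes, 2 button states) follows the summary;
# all names and bin conventions here are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class ActionTokens:
    keys: list       # 4 keyboard-key slots (0 = no key)
    mouse_dx: int    # binned horizontal mouse movement
    mouse_dy: int    # binned vertical mouse movement
    buttons: list    # 2 mouse-button states (0 = up, 1 = down)


def pack_action(a: ActionTokens) -> list:
    """Flatten a structured action into the 8-token sequence that the
    auxiliary decoder would emit autoregressively."""
    toks = list(a.keys) + [a.mouse_dx, a.mouse_dy] + list(a.buttons)
    assert len(toks) == 8
    return toks


def unpack_action(toks: list) -> ActionTokens:
    """Inverse of pack_action."""
    return ActionTokens(keys=toks[:4], mouse_dx=toks[4],
                        mouse_dy=toks[5], buttons=toks[6:8])


a = ActionTokens(keys=[17, 30, 0, 0], mouse_dx=12, mouse_dy=-3, buttons=[1, 0])
assert unpack_action(pack_action(a)) == a
```

The design point this illustrates: the transformer's context only pays for one latent action token per timestep, while the cheap auxiliary decoder covers the full combinatorial keyboard/mouse action space.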

To quantify causal confusion—where a policy relies on spurious correlations such as brake‑light illumination rather than the true visual cause—the authors devise a causality metric in a controlled toy environment. Experiments reveal a clear power‑law relationship: as training data and model depth increase, the causality score rises, indicating that larger models attend more to genuine causal signals. Scaling experiments span four model sizes (150 M, 350 M, 750 M, 1.2 B parameters) and five data fractions (6 %–100 % of the full set). Test loss decreases predictably with data volume, and causal confusion diminishes markedly for the biggest models, confirming that scaling mitigates the non‑causal behavior often observed in small BC agents.
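The power-law relationship described above is typically estimated by linear regression in log-log space. A minimal sketch, using synthetic points rather than the paper's measurements:

```python
# Fit loss ~= a * x**(-b) by ordinary least squares on (log x, log y).
# The data points below are synthetic and exactly power-law, so the
# fit recovers the parameters; real scaling curves are noisier.
import math


def fit_power_law(xs, ys):
    """Return (a, b) such that y ~= a * x**(-b), via log-log regression."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope


xs = [0.06, 0.12, 0.25, 0.5, 1.0]      # data fractions, as in the summary
ys = [2.0 * x ** -0.3 for x in xs]     # synthetic losses (a=2.0, b=0.3)
a, b = fit_power_law(xs, ys)
assert abs(a - 2.0) < 1e-6 and abs(b - 0.3) < 1e-6
```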

Practical engineering challenges are also addressed. The authors identify a training‑inference gap caused by differences in video compression and resizing pipelines. Experiments show that using RGB color space for resizing and ensuring bit‑identical resizing functions (by unifying PyTorch and Rust implementations) substantially reduces this gap. They further balance video encoding quality and speed by employing a mixture of QP values (6–18) compatible with NVIDIA hardware, which only supports YUV encoding.
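The bit-identity requirement above can be checked mechanically by hashing resized frames from both pipelines. The sketch below uses a self-contained nearest-neighbor resize as a stand-in for the unified PyTorch/Rust resizers mentioned in the summary; the function names and test frame are hypothetical:

```python
# Hedged sketch of a train/inference consistency check: resize the same
# frame through both pipelines and require bit-identical bytes. Here a
# single deterministic nearest-neighbor resize stands in for both the
# PyTorch and Rust implementations referenced above.
import hashlib


def nearest_resize(pixels, w, h, new_w, new_h):
    """Deterministic nearest-neighbor resize over flat RGB bytes."""
    out = bytearray()
    for y in range(new_h):
        sy = y * h // new_h
        for x in range(new_w):
            sx = x * w // new_w
            i = (sy * w + sx) * 3
            out += pixels[i:i + 3]
    return bytes(out)


def digest(frame: bytes) -> str:
    return hashlib.sha256(frame).hexdigest()


frame = bytes(range(256)) * 3                    # 16x16 RGB test frame
train = nearest_resize(frame, 16, 16, 8, 8)      # "training" pipeline
infer = nearest_resize(frame, 16, 16, 8, 8)      # "inference" pipeline
assert digest(train) == digest(infer)            # bit-identical: no gap
```

Any divergence in interpolation mode, color space, or rounding shows up immediately as a hash mismatch, which is exactly the class of gap the authors report closing.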

Evaluation on several commercial and custom 3D games (e.g., “Quarter Odis”, “Simple‑FPS”, “Hovercraft”) demonstrates human‑competitive performance. Real‑time deployment exhibits average latency below 45 ms per frame, confirming that the model runs smoothly on consumer hardware. Qualitative analyses illustrate more human‑like sustained actions and better long‑term planning compared to baselines that predict all action tokens directly.
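The 45 ms figure fits the 50 ms per-frame budget implied by 20 Hz control. A minimal harness for checking such a budget, with a placeholder in place of real model inference (`dummy_policy` is a stand-in; timings on actual hardware will differ):

```python
# Illustrative latency-budget check: at 20 Hz the policy has 50 ms per
# frame; the summary reports average latency below 45 ms on an RTX 5090.
import time

FRAME_BUDGET_MS = 1000.0 / 20        # 50 ms per frame at 20 Hz


def dummy_policy(frame: bytes) -> int:
    """Placeholder for model inference on one frame."""
    return sum(frame) % 256


latencies = []
for _ in range(100):
    t0 = time.perf_counter()
    dummy_policy(bytes(1024))
    latencies.append((time.perf_counter() - t0) * 1000.0)

avg_ms = sum(latencies) / len(latencies)
assert avg_ms < FRAME_BUDGET_MS      # trivially true for this stand-in
```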

In summary, the paper makes three major contributions: (1) an openly licensed, large‑scale multi‑game BC dataset with multimodal annotations; (2) a lightweight, transformer‑based policy architecture capable of real‑time inference on consumer GPUs; and (3) empirical evidence that scaling both model capacity and data volume systematically improves causal reasoning in BC policies. By releasing code, data, and pretrained checkpoints, the work provides a reproducible foundation for future research in game AI, robotics, and any domain where imitation learning from visual streams is desired.

