Can vision language models learn intuitive physics from interaction?

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.


💡 Research Summary

The paper investigates whether large vision‑language models (VLMs) can acquire intuitive physics by interacting with an environment, a hypothesis motivated by cognitive‑science findings that children learn robust physical concepts through active experimentation. The authors take a state‑of‑the‑art 8‑billion‑parameter Qwen3‑VL model (quantized to 4‑bit) and fine‑tune it under two contrasting regimes. (1) In the interactive reinforcement‑learning (RL) condition, trained with Group‑Relative Policy Optimization (GRPO), the model receives an image of a block tower and a textual prompt and must output a token sequence representing an action (e.g., “move block 2 units left”); the environment, built on the ThreeDWorld physics engine, returns a scalar reward based on the stability of the resulting tower and the distance of the moved block from the optimal position. (2) In the non‑interactive supervised fine‑tuning (SFT) baseline, the same data are presented with ground‑truth action tokens and the model is trained with cross‑entropy loss. Both regimes employ parameter‑efficient fine‑tuning (PEFT) via low‑rank LoRA adapters (rank = 16) inserted in every transformer layer, and both are trained for 10 000 steps on a single 80 GB A100 GPU, keeping hyper‑parameters as identical as possible.
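The group‑relative update at the core of GRPO can be sketched in a few lines: rewards for a group of sampled actions are standardized against their own group, so no learned value function is needed. This is a minimal illustration; the group size, rewards, and normalization epsilon below are assumptions, not details from the paper.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each sampled action's
    reward is standardized against its own group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, a hypothetical group of 4 sampled actions and their rewards:
adv = grpo_advantages([20.0, 5.0, 5.0, -1.0])
# Actions rewarded above the group mean get positive advantage, others negative;
# these advantages then weight the policy-gradient update on the sampled tokens.
```

In the full algorithm these advantages multiply the per‑token log‑probability ratios under a clipped objective, but the group‑wise standardization shown here is the distinguishing ingredient.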

The experimental corpus consists of two synthetic datasets generated by the physics engine: “top‑block” (a displaced block on the tower’s apex) and “side‑block” (a displaced block on the floor next to the tower). Each dataset contains towers of 2–4 randomly colored cubes rendered from a fixed camera angle (256 × 256 RGB). Four tasks are defined: (i) binary stability judgment (is the tower stable?), (ii) x‑only relocation (output a single integer moving the block horizontally toward the center), (iii) x‑y relocation (output two integers moving the block both horizontally and vertically), and (iv) a transfer task using real‑world wooden tower photographs from Lerer et al. (2016). Reward functions are carefully crafted: correct binary answers receive +1, incorrect answers 0, and unparsable outputs −1; for relocation tasks, a Gaussian‑shaped reward peaks when the block lands at the optimal position and is scaled by a factor of 20 for stable towers, with penalties for unparsable or physically impossible moves.
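The shaped relocation reward can be sketched as below. The summary only states the Gaussian peak at the optimal position, the ×20 scaling for stable towers, and penalties for unparsable or impossible moves, so the Gaussian width `sigma` and the exact penalty value are assumptions.

```python
import math

def relocation_reward(x_pred, x_opt, stable, sigma=1.0,
                      parsable=True, feasible=True):
    """Hypothetical sketch of the shaped relocation reward:
    a Gaussian bump peaking at the optimal block position,
    scaled by 20 when the resulting tower is stable.
    sigma and the -1 penalty are assumptions."""
    if not parsable or not feasible:
        return -1.0  # penalize unparsable or physically impossible moves
    gauss = math.exp(-((x_pred - x_opt) ** 2) / (2 * sigma ** 2))
    return 20.0 * gauss if stable else gauss
```

Under this shape, a perfectly placed block on a stable tower earns the maximum reward of 20, consistent with the near‑20 average rewards reported in the results.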

Results show that both GRPO and SFT achieve near‑perfect performance on the tasks they are trained on. For binary stability on the top‑block dataset, both reach 0.969 accuracy (ceiling = 1). For the x‑only relocation tasks, both achieve an average reward of ~19.999 out of a maximum of 20, indicating that the models learn to output the optimal integer. However, when evaluated on held‑out tasks that differ in either dataset or action type, performance collapses dramatically. Models trained on the top‑block dataset do not generalize to side‑block scenarios, and models trained on x‑only do not transfer to x‑y, despite the x‑dimension being identical. Moreover, on the real‑world photograph benchmark, both interactive and non‑interactive models perform at chance level. The authors also probe internal activations: linear decoders trained on each layer can predict physical quantities such as block offset and tower stability with high correlation, especially in deeper layers, suggesting that the models encode the relevant physics internally. Yet this latent competence does not translate into reliable textual outputs, highlighting a disconnect between representation and language generation.
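The probing analysis follows the standard linear‑decoder recipe, which can be illustrated with a minimal sketch. In the paper the features are hidden states read from each transformer layer; here they are synthetic stand‑ins, and the dimensions and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-example hidden states (n examples x d dims) and a
# physical quantity (e.g. block offset) linearly encoded in them.
d, n = 64, 500
w_true = rng.normal(size=d)
H = rng.normal(size=(n, d))
offset = H @ w_true + 0.1 * rng.normal(size=n)

# Fit a linear decoder by least squares -- the standard probing method --
# and measure how well it predicts the physical quantity.
w_hat, *_ = np.linalg.lstsq(H, offset, rcond=None)
pred = H @ w_hat
corr = np.corrcoef(pred, offset)[0, 1]
# A high correlation indicates the quantity is linearly decodable
# from that layer's activations, as the paper reports for deeper layers.
```

The paper's finding is that such probes succeed even when the model's textual outputs fail, which is what motivates the representation‑generation disconnect discussed next.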

The authors discuss why interactive RL did not confer the expected generalization advantage. Their RL setup is limited to a single‑step decision problem; human children, by contrast, engage in multi‑step, temporally extended experiments that expose them to a richer causal structure. The limited capacity of the LoRA adapters (rank = 16) and the modest number of training steps may also restrict the model’s ability to reshape its internal dynamics. Additionally, the reward shaping, while intuitive, may not provide sufficient gradient signal for learning abstract physical principles beyond the narrow task distribution. The paper therefore concludes that while interaction can boost within‑task performance, it is insufficient for endowing VLMs with human‑like intuitive physics. Future work should explore multi‑step RL, richer multimodal feedback (e.g., tactile or proprioceptive signals), larger adapters or full‑model fine‑tuning, and perhaps hybrid objectives that directly supervise latent physical variables. Only by aligning the learning process more closely with the embodied, exploratory nature of human cognition can VLMs hope to acquire robust, transferable physical intuition.

