SimWorld: An LLM/VLM Agent Simulator for Complex, Real-World-Like Environments

February 23, 2026

Reading time: 8 minutes

...

📝 Abstract

While LLM/VLM-powered AI agents have advanced rapidly in math, coding, and computer use, their applications in complex physical and social environments remain challenging. Building agents that can survive and thrive in the real world (e.g., by autonomously earning income or running a business) requires massive-scale interaction, reasoning, training, and evaluation across diverse embodied scenarios. However, existing world simulators for such development fall short: they often rely on limited hand-crafted environments, simulate simplified game-like physics and social rules, and lack native support for LLM/VLM agents. We introduce SimWorld, a new simulator built on Unreal Engine 5, designed for developing and evaluating LLM/VLM agents in rich, real-world-like settings. SimWorld offers three core capabilities: (1) realistic, open-ended world simulation, including accurate physical and social dynamics and language-driven procedural environment generation; (2) rich interface for LLM/VLM agents, with multi-modal world inputs/feedback and open-vocabulary action outputs at varying levels of abstraction; and (3) diverse extensible physical and social reasoning scenarios that are easily customizable by users. We demonstrate SimWorld by deploying frontier LLM agents (e.g., GPT-4o, Gemini-2.5-Flash, Claude-3.5, and DeepSeek-Prover-V2) on long-horizon multi-agent delivery tasks involving strategic cooperation and competition. The results reveal distinct reasoning patterns and limitations across models. We open-source SimWorld and hope it becomes a foundational platform for advancing real-world agent intelligence across disciplines: https://simworld.org.
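To make the idea of a Gym-style interface with language observations and free-form action strings concrete, here is a minimal toy sketch in Python. It is not SimWorld's API: `ToyDeliveryEnv`, `Observation`, `scripted_agent`, and `run_episode` are hypothetical names invented for illustration, the world is a small grid rather than an Unreal Engine scene, and a scripted function stands in for the LLM/VLM policy.

```python
from dataclasses import dataclass


# Hypothetical toy stand-in; none of these names come from SimWorld itself.
@dataclass
class Observation:
    text: str        # language description of the current state
    position: tuple  # (x, y) grid cell of the agent


class ToyDeliveryEnv:
    """A single-agent grid world exposing a Gym-style reset/step loop."""

    MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

    def __init__(self, goal=(2, 1), max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.pos = (0, 0)
        self.steps = 0

    def _observe(self):
        text = f"You are at {self.pos}; deliver to {self.goal}."
        return Observation(text, self.pos)

    def reset(self):
        self.pos, self.steps = (0, 0), 0
        return self._observe()

    def step(self, action: str):
        """Accept a free-form action string; unrecognized text is a no-op."""
        self.steps += 1
        for word, (dx, dy) in self.MOVES.items():
            if word in action.lower():
                self.pos = (self.pos[0] + dx, self.pos[1] + dy)
                break
        reached = self.pos == self.goal
        done = reached or self.steps >= self.max_steps
        reward = 1.0 if reached else 0.0
        return self._observe(), reward, done


def scripted_agent(obs: Observation, goal) -> str:
    """Placeholder for an LLM/VLM policy: returns a natural-language action."""
    x, y = obs.position
    if x < goal[0]:
        return "walk east one block"
    if x > goal[0]:
        return "walk west one block"
    if y < goal[1]:
        return "walk north one block"
    if y > goal[1]:
        return "walk south one block"
    return "wait"


def run_episode(env, agent):
    """Run one episode and return (total reward, number of steps taken)."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(agent(obs, env.goal))
        total += reward
    return total, env.steps
```

Calling `run_episode(ToyDeliveryEnv(), scripted_agent)` runs one delivery episode. In a real setup the scripted function would be replaced by a call to a language model that reads `obs.text` and replies with an action string, which is the part of the loop the abstract's "open-vocabulary action outputs" refers to.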

📄 Content

Technical Report

SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds

Jiawei Ren 1*, Yan Zhuang 2*, Xiaokang Ye 1*, Lingjun Mao 1, Xuhong He 3, Jianzhi Shen 4, Mrinaal Dogra 1, Yiming Liang 5, Ruixuan Zhang 4, Tianai Yue 4, Yiqing Yang 6, Eric Liu 7, Ryan Wu 4, Kevin Benavente 1, Rajiv Mandya Nagaraju 7, Muhammad Faayez 4, Xiyan Zhang 4, Dhruv Vivek Sharma 1, Xianrui Zhong 3, Ziqiao Ma 8, Tianmin Shu 4†, Zhiting Hu 1†, Lianhui Qin 1†

1 UCSD, 2 UVA, 3 UIUC, 4 JHU, 5 Purdue, 6 PolyU, 7 USC, 8 UMich

https://simworld.org

[Figure 1 panel labels: Realistic Simulation (Physics, Traffic, Social Interaction, Diverse Scenes); Open-endedness (Open-ended Environment, Open-ended Action, Text-to-3D Generation); Diverse Use (LLM/VLM agents in SimWorld, Large-scale Simulation, Real-world Planning, Data Synthesis, Robots)]

Figure 1: An Overview of the SimWorld Simulator, featuring three key designs: (1) realistic, open-ended world simulation, (2) rich interface for LLM/VLM agents, and (3) diverse physical and social reasoning scenarios.

* Equal contribution; † Equal advising

arXiv:2512.01078v2 [cs.AI], 22 Jan 2026

Table of Contents

1 Introduction
2 The SimWorld Simulator
  2.1 Unreal Engine Backend
    2.1.1 Diverse Scenes
    2.1.2 Rich Assets and Physics Realism
  2.2 Environment
    2.2.1 Procedural City Generation
    2.2.2 LLM-based Scene Editing
    2.2.3 Waypoint System
    2.2.4 Traffic System
    2.2.5 Gym-like Interface for Agent-Environment Interaction
  2.3 Agent
    2.3.1 Agent Framework
    2.3.2 Observation Space
    2.3.3 Action Space
    2.3.4 Action Planner
  2.4 UnrealCV+ Communication Module
3 Case Study: Delivery Task
  3.1 Task Formulation
  3.2 Main Results
  3.3 Ablation Study
4 Related Works

Introduction

Large language and vision models (e.g., LLMs and VLMs) have emerged as powerful foundations for building intelligent agents, demonstrating remarkable reasoning capab

This content is AI-processed based on ArXiv data.
