SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration

February 23, 2026

Reading time: 6 minute

...

📝 Original Info

Title: SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration
ArXiv ID: 2512.10046
Date: 2025-12-10
Authors: Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu

📝 Abstract

Recent advances in foundation models have shown promising results in developing generalist robotics that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has been mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instructions grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that stateof-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking robust perception, reasoning, and planning abilities necessary for urban environments.

💡 Deep Analysis

Deep Dive into SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration.

📄 Full Content

SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration Yan Zhuang1 Jiawei Ren2∗ Xiaokang Ye2∗ Jianzhi Shen3 Ruixuan Zhang3 Tianai Yue3 Muhammad Faayez3 Xuhong He4 Xiyan Zhang3 Ziqiao Ma5 Lianhui Qin2‡ Zhiting Hu2‡ Tianmin Shu3‡ 1University of Virginia 2UC San Diego 3Johns Hopkins University 4Carnegie Mellon University 5University of Michigan Abstract Recent advances in foundation models have shown promising results in developing generalist robotics that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has been mainly focused on indoor, household scenarios. In this work, we present SimWorld- Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and trafﬁc systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and trafﬁc; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instructions grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and trafﬁc, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state- of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking robust perception, reasoning, and planning abilities necessary for urban environments. Project website: SimWorld-Robotics 1 Introduction There has been tremendous progress in engineering general-purpose robotics that can follow human instructions and perform open-ended tasks [2, 28, 15, 27, 46], thanks to the advances in robot foundation models. Training these models requires a large amount of data, much of which can be generated in high-ﬁdelity embodied simulators, such as Habitat 3 [40], RoboTHOR [10], TDW [14], VirtualHome [38], Virtual Community [61] and BEHAVIOR [27]. They can also be systematically evaluated in diverse scenarios created in these simulators. However, current embodied simulators for robotics have been focused on tabletop [35, 58, 34, 22, 59] or household tasks [48, 27, 26, 47, 46]. In this work, we want to study how to create a realistic and scalable embodied simulator for outdoor robotics tasks. * Equal contribution. ‡ Equal advising. 39th Conference on Neural Information Processing Systems (NeurIPS 2025). arXiv:2512.10046v3 [cs.AI] 23 Jan 2026 Figure 1: Overview of SimWorld Robotics (SWR). Built upon Unreal Engine 5, SWR is a simulation platform for large-scale, photorealistic, and dynamic urban environments. It offers diverse high-ﬁdelity building and object assets, supports embodied agents with rich action spaces, includes a background trafﬁc system powered by city-scale waypoint generation, and enables comprehensive city procedural generation. Compared to indoor scenarios, robotics in outdoor environments, in particular, large urban environments, introduces additional challenges, such as (1) 3D perception, spatial reasoning and grounding in large environments; (2) safe navigation in dynamic scenes with people and trafﬁc; (3) long-range spatial memory; and (4) multi-agent collaboration and communication in task. There have been urban simulators developed in recent years. However, to address the critical challenges faced by real-world robotics in urban environments, they lack the necessary realism, customizability, scalability, and versatility. For instance, well-known simulators such as AirSim [45], CARLA [12] mainly focus on autonomous driving domains. While it supports some manual customization to the provided city environments, it does not support procedural city environment generation. It also does not support the ﬂexible control of embodied agents (such as mobile robots or pedestrians) other than vehicles. More recent city simulators, such as MetaDrive [29], MetaUrban [56], signiﬁcantly improve the scalability. However, the simulated environments still lack photorealism as shown in Figure 2. Therefore, we introduce SimWorld-Robotics (SWR), a new embodied AI simulation platform for large-scale, photorealistic, and dynamic urban environments. As illustrated in Figure 1, SWR offers diverse high-ﬁdelity building and object assets, multiple types of embodied agents with rich action spaces, a waypoint-based background trafﬁc system, and a c

…(Full text truncated)…

📄 Read Full PDF on ArXiv