Embodied navigation that adheres to social norms remains an open research challenge. We present SocialNav, a foundation model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable these dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first instill general navigation skills and social-norm understanding via imitation learning, and then refine these skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate over the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Our project page: https://amap-eai.github.
As embodied agents become increasingly integrated into everyday social environments, robotic navigation must prioritize not only operational efficiency but also social awareness to ensure safety and compliance with established social norms. However, most existing approaches [24,34,35,38,40] focus primarily on shortest-path planning and collision avoidance, while overlooking the social compliance essential to real-world deployment (e.g., a robotic guide dog). Consequently, trajectories that appear optimal from a geometric or efficiency perspective may still lead to socially disruptive or inappropriate behaviors, such as jaywalking and traversing restricted zones (e.g., landscaped lawns).
To bridge this gap, we propose SocialNav, a hierarchical foundation model for socially aware navigation that integrates the understanding of social norms with the generation of socially compliant trajectories. SocialNav consists of two core components: (1) the Brain Module, built upon a vision-language model (VLM), which encodes rich social navigation priors and is capable of generating interpretable chain-of-thought (CoT) explanations or explicitly predicting socially traversable regions; and (2) the Action Expert, based on conditional flow matching [21], which translates the semantic priors provided by the Brain Module into robot-executable trajectories that adhere to social norms.
Developing such a model requires rich, multimodal data that encodes both cognitive knowledge and action-oriented intuition, both of which are scarce in existing embodied navigation corpora. To this end, we construct the SocNav Dataset, a large-scale heterogeneous corpus of 7 million samples that integrates two complementary modalities: (1) the Cognitive Activation Dataset (CAD), which encapsulates navigational knowledge through chain-of-thought (CoT) reasoning, social traversability prediction, and embodied question answering; and (2) the Expert Trajectories Pyramid (ETP), which aggregates trajectories from internet videos, simulated environments, and real-world robot deployments to distill rich, context-aware action priors tailored for complex social navigation scenarios.
As is widely recognized, aligning agent behavior with social norms goes beyond the representational and reasoning capabilities of standard imitation learning. Even when social priors are implicitly embedded in demonstration data, behavior cloning fails to capture the causal structure underlying normative conduct. To address this limitation, we propose Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly promotes socially compliant behavior through norm-aware reward mechanisms. This approach enables agents to internalize the underlying principles governing social conventions, rather than merely mimicking surface-level actions.
Finally, we introduce the SocNav Benchmark, a high-fidelity evaluation platform blending physics simulation (Isaac Sim [27]) and photorealistic rendering (3DGS [18]) in 9 newly captured large-scale social scenes. Our benchmark enables comprehensive comparisons of both fundamental navigation capabilities and social compliance. Our contributions are summarized as follows:
• SocialNav Foundation Model: a hierarchical brain-action architecture that unifies high-level social norm understanding with the generation of norm-compliant trajectories.
• SAFE-GRPO RL Framework: the first flow-based RL framework for embodied navigation designed to enforce social compliance through norm-aware reward shaping.
Visual navigation [3] has evolved from classical SLAM-based methods [4,17,41,42] to end-to-end learning approaches such as GNM [34], ViNT [35], and NoMaD [40]. To improve generalization, a line of work has expanded training corpora beyond expert demonstrations. Some methods, such as CityWalker [24] and MBRA [14], leverage massive internet-sourced video collections to capture diverse action priors, while others [8,12] employ simulation platforms [32,48] to generate controlled, rule-compliant trajectories at scale. Concurrently, VLMs are used to enhance semantic understanding [9,29,46,49,51]. However, these directions still struggle with high-level reasoning and alignment with human values in socially complex scenarios.
Social navigation requires adherence to social conventions, moving beyond early handcrafted costs [7,31] to VLM-based reasoning [19,25,45]. However, VLM reasoning is often disconnected from low-level action generation. Recently, flow matching (FM) [21] has been widely adopted in vision-language-action (VLA) models [15,50] for its ability to model multimodal action distributions, but these models are typically limited to behavior cloning. In the navigation domain, pure imitation learning often lacks the causal understanding needed to robustly adapt to novel or dynamic social situations. There is therefore an inherent need for models that not only generate actions but also align with complex human preferences and social norms. Foundational work such as GRPO [10,37] and Flow-GRPO [23] demonstrates that combining generative models with online RL enables alignment with human preferences, inspiring our approach.
To enable the development and evaluation of socially-aware embodied navigation systems, we introduce the SocNav Dataset and the SocNav Bench. Together, they form a comprehensive ecosystem for training, validating, and benchmarking embodied agents.
We present the SocNav Dataset, a large-scale, multi-source heterogeneous dataset to support robust, generalizable, and socially-aware navigation. Comprising over 7 million training samples, the SocNav Dataset consists of two core components: the Expert Trajectories Pyramid (ETP) for learning action priors and the Cognitive Activation Dataset (CAD) for cognitive supervision.
• Trajectories from simulated environments (D_sim). Across all scenes, trajectories are generated on manually annotated traversable road networks. These trajectories encompass not only standard on-road navigation but also challenging recovery scenarios, such as near-collisions. This rich data enables the model to learn both efficient navigation and robust recovery behaviors.
• Trajectories from real-world robot data (D_real). This layer provides 340K high-quality trajectories collected from autonomous robots deployed in real-world environments (from public datasets including SCAND [16], HuRoN [13], RECON [33], and CityWalker teleoperation data [24]). These trajectories provide ground-truth metric accuracy, physical realism, and sensor consistency, capturing true physical dynamics, sensor noise, and environmental interactions. They are ideal for supervised fine-tuning (SFT) and closing the sim-to-real gap.
• Chain-of-Thought (CoT) Reasoning: Using first-person images and curated prompt templates, we prompted Qwen2.5-VL-72B [1] to generate 825K Chain-of-Thought (CoT) samples. These samples, comprising step-by-step textual rationales for navigation decisions, were designed to teach the agent explicit reasoning.
• General Visual Question Answering (VQA): To ensure the model maintains general world knowledge, we curated 1 million general VQA samples from [5,6,11,20,22,36]. This task trains the agent to reason about spatial relationships and object properties in the environment.
Together, the ETP and CAD establish the SocNav Dataset as a comprehensive foundation for training foundational navigation agents, unifying scale, realism, and cognition within a single, coherent data framework.
We introduce the SocNav Bench, a unified, high-fidelity evaluation platform for socially-aware navigation.
It achieves a unique blend of realism by combining the physics simulation of Isaac Sim with the photorealistic rendering of 3DGS. The benchmark is built upon 9 new large-scale social scenes we captured and reconstructed using 3DGS, covering a total area of 73K m². These new scenes include diverse, human-centric environments: three parks, three street-level roads, two offices, and one campus. All evaluations employ a standardized setup: a unified Unitree Go2 robot model and a consistent locomotion policy. To enable realistic physical interaction, each 3DGS scene is converted into a mesh and imported into Isaac Sim for accurate collision feedback. Additionally, to simulate potential dynamic collisions, digital humans are randomly introduced into the scenes. For rigorous benchmarking, we sample 10 start-goal pairs at distances of 20 m and 100 m within each scene, creating 20 evaluation cases per scene.
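As a concrete illustration, the sketch below shows how such evaluation episodes could be sampled. It is a minimal stand-in that assumes each scene exposes a set of traversable 2D positions; the function and parameter names are ours, not the benchmark's actual API.

```python
# Minimal sketch of SocNav Bench episode sampling (illustrative, not the
# released benchmark code): 10 start-goal pairs at 20 m and 100 m per scene.
import numpy as np

def sample_episodes(positions: np.ndarray, rng: np.random.Generator,
                    distances=(20.0, 100.0), pairs_per_distance=10, tol=2.0):
    """Sample start-goal pairs whose separation is close to each target distance."""
    episodes = []
    for target in distances:
        found = 0
        while found < pairs_per_distance:
            start, goal = positions[rng.choice(len(positions), 2, replace=False)]
            if abs(np.linalg.norm(goal - start) - target) < tol:
                episodes.append((start, goal))
                found += 1
    return episodes  # 20 evaluation cases per scene

rng = np.random.default_rng(0)
pts = rng.uniform(0, 200, size=(5000, 2))  # stand-in for traversable positions
print(len(sample_episodes(pts, rng)))      # 20
```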
We formulate the foundational navigation task as a vision-based, history-conditioned point-goal navigation problem in diverse environments, similar to the setup in CityWalker [24].
At each time step $t$, the agent receives a sequence of recent monocular visual observations $O_{t-n:t} = \{o_{t-n}, \ldots, o_t\}$, where $o \in \mathbb{R}^{H \times W \times 3}$, along with their associated 2D positional information $P_{t-n:t} = \{p_{t-n}, \ldots, p_t\}$ with $p \in \mathbb{R}^2$. Given a specified 2D target location $g \in \mathbb{R}^2$, the objective is to learn a policy $\pi_\theta$ that maps the historical observations and positions to a sequence of future actions:

$$\pi_\theta(O_{t-n:t}, P_{t-n:t}, g) = \{a_{t+1}, \ldots, a_{t+m}\}, \quad a \in \mathbb{R}^2.$$
Unless otherwise specified, we set n = 5 and m = 5.
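To make the formulation concrete, here is a toy stand-in for the policy interface with the stated shapes (n = 5 history frames, m = 5 future waypoints). The class and its crude pooled encoder are illustrative only, not the paper's architecture.

```python
# Toy point-goal policy interface matching the formulation above (a sketch,
# not SocialNav itself): history of RGB frames + 2D positions + goal -> waypoints.
import torch
import torch.nn as nn

class PointGoalPolicy(nn.Module):
    def __init__(self, n=5, m=5, hidden=128):
        super().__init__()
        self.m = m
        self.encoder = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())  # crude visual summary
        self.head = nn.Sequential(
            nn.Linear((n + 1) * 3 + (n + 1) * 2 + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, m * 2),
        )

    def forward(self, obs, pos, goal):
        # obs: (B, n+1, 3, H, W) frames o_{t-n:t}; pos: (B, n+1, 2); goal: (B, 2)
        B = obs.shape[0]
        feats = self.encoder(obs.flatten(0, 1)).view(B, -1)
        x = torch.cat([feats, pos.flatten(1), goal], dim=-1)
        return self.head(x).view(B, self.m, 2)  # waypoints a_{t+1:t+m}

policy = PointGoalPolicy()
actions = policy(torch.randn(2, 6, 3, 224, 224), torch.randn(2, 6, 2), torch.randn(2, 2))
print(actions.shape)  # torch.Size([2, 5, 2])
```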
Our complete model architecture, illustrated in Fig. 3, adopts a hierarchical “brain-action” design for robust and socially aware embodied navigation. It consists of two tightly coupled branches with complementary roles: one for high-level semantic understanding and the other for low-level action generation.
The Brain Module. The Brain Module serves as the cognitive core of the system, implemented as a Vision-Language Model (VLM) denoted by π VLM . It performs generative, autoregressive textual reasoning to infer critical environmental semantics and can produce 3 types of interpretable outputs:
• Socially traversable regions: represented as polygons, these delineate areas such as sidewalks, crosswalks, and stairs.
• CoT reasoning: step-by-step textual explanations for navigational decisions.
• VQA: responses to free-form questions that enhance scene understanding.
By explicitly generating textual outputs, the VLM provides interpretable insights essential for safe and socially responsible navigation. This design enables the agent not only to see but also to reason about its surroundings, forming the cognitive foundation for socially aware navigation.
The Action Expert. The action expert specializes in end-to-end trajectory generation. Inspired by recent advances [2,15] in action prediction, we leverage conditional flow matching [21] to model a distribution of actions. The module is conditioned on latent semantic features $Z_{\text{VLM}}$ extracted from the last-layer features of the VLM, which steer the learned velocity field that transports noise into actions:

$$\frac{dx_t}{dt} = v_{\text{flow}}(x_t, t \mid Z_{\text{VLM}}).$$
This mechanism enables the action expert to produce efficient and socially compliant trajectories in complex environments by decoupling high-level reasoning from lowlevel control while preserving a strong semantic connection between them.
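The sketch below shows the general conditional flow matching recipe in this setting: training a velocity field against straight-line targets and sampling with K = 5 Euler steps (the inference setting reported in the implementation details). The network width and conditioning dimension are illustrative; only the overall structure follows the text.

```python
# Hedged sketch of a conditional-flow-matching action expert, following the
# general recipe of flow matching [21]; not the paper's exact architecture.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    def __init__(self, m=5, cond_dim=1536, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(m * 2 + 1 + cond_dim, hidden), nn.GELU(),
            nn.Linear(hidden, m * 2),
        )

    def forward(self, x, t, z):
        # x: (B, m*2) noisy waypoints, t: (B,) flow time, z: (B, cond_dim) VLM features
        return self.net(torch.cat([x, t[:, None], z], dim=-1))

def fm_loss(v_net, actions, z):
    """Train v(x_t, t | z) to match the straight-line velocity x1 - x0."""
    x1 = actions.flatten(1)                        # expert waypoints
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)
    xt = (1 - t[:, None]) * x0 + t[:, None] * x1   # linear interpolation path
    return ((v_net(xt, t, z) - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample(v_net, z, m=5, steps=5):
    """K = 5 Euler steps of the deterministic ODE at inference time."""
    x = torch.randn(z.shape[0], m * 2, device=z.device)
    for k in range(steps):
        t = torch.full((z.shape[0],), k / steps, device=z.device)
        x = x + v_net(x, t, z) / steps
    return x.view(-1, m, 2)
```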
Our training pipeline follows a multi-stage strategy designed to progressively instill both general navigation priors and social compliance into SocialNav.
Stage 1: Pre-training for General Navigation Ability.
In the first stage, we aim to activate the VLM’s navigation capability and train the flow model to predict low-level waypoints. This is achieved through pretraining on the ETP datasets (D_video and D_sim) together with the cognitive activation dataset D_cog. D_video provides diverse real-world navigation scenarios with implicit expert behaviors, while D_sim introduces challenging synthetic cases to enhance robustness in rare and complex situations. D_cog further improves the VLM’s reasoning and decision-making through CoT and VQA tasks, and equips it with the ability to predict traversable regions, laying a solid foundation for subsequent social-norm alignment.
Stage 2: Fine-tuning with High-Quality Real-World Data. In the second stage, we fine-tune the model on high-quality expert trajectories collected from real-world robots (D_real) to reduce the sim-to-real gap. During this phase, the VLM is frozen, and only the action expert is optimized. This approach preserves the VLM’s semantic and social reasoning capabilities while allowing the flow model to adapt to real-world dynamics and spatial scales.
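In code, this stage reduces to a selective optimizer setup; a minimal sketch follows, assuming hypothetical `model.vlm` and `model.action_expert` submodule names.

```python
# Minimal sketch of the Stage 2 recipe: freeze the VLM brain, optimize only
# the action expert. Attribute names are assumptions for illustration.
import torch

def build_stage2_optimizer(model, lr=1e-5):
    for p in model.vlm.parameters():
        p.requires_grad_(False)          # preserve semantic/social reasoning
    trainable = [p for p in model.action_expert.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)  # adapt flow model to real-world data
```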
Stage 3: Reinforcement Learning for Social Rule Alignment. Although the previous stages equip the model with strong navigation priors and real-world adaptability, imitation learning still lacks causal reasoning in social environments. We therefore introduce SAFE-GRPO (Socially-Aware Flow Exploration GRPO), a reinforcement learning stage that explicitly aligns the policy with human social conventions. The model is trained using expert trajectories from the SocCity scenes within D_sim, which provide accurate and rich pathway annotations crucial for precise reward feedback.
To encourage diverse yet meaningful exploration, we draw inspiration from the flow-based formulation of Flow-GRPO [23,43], converting the deterministic ordinary differential equation (ODE) of the flow policy into a stochastic differential equation (SDE) with matching marginals:

$$dx_t = \Big[v_{\text{flow}}(x_t, t \mid Z_{\text{VLM}}) + \frac{\sigma_t^2}{2}\,\nabla_x \log p_t(x_t)\Big]\,dt + \sigma_t\,dw_t,$$
where σ_t controls the exploration magnitude. Here, v_flow denotes the velocity field of the flow policy, conditioned on the current state x_t and time t as well as the additional context Z_VLM provided by the VLM. Unlike unstructured stochastic exploration, our approach is controlled and semantically grounded: randomness is introduced only during flow integration, while the semantic conditioning signal derived from the VLM “Brain” remains fixed throughout. This latent prior encodes high-level spatial and social cues. Without explicit comprehension of scene semantics and walkable areas, conventional RL agents struggle to discover socially compliant behaviors from inefficient exploration and sparse rewards alone.
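The sketch below shows the structure of one stochastic integration step: the deterministic velocity plus σ_t-scaled Gaussian noise, with the VLM context held fixed. The score-based drift correction from the equation above is omitted for brevity, so this is a structural illustration rather than the exact update.

```python
# Structural sketch of one SDE exploration step (drift correction omitted):
# noise enters only the integration; the VLM context z stays fixed.
import torch

def sde_step(v_net, x, t, z, dt, sigma_t):
    drift = v_net(x, t, z)                       # semantic conditioning fixed
    noise = torch.randn_like(x)                  # exploration enters only here
    return x + drift * dt + sigma_t * (dt ** 0.5) * noise
```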
Trajectories that are collision-free and socially valid receive higher rewards, reinforcing alignment between action generation and human expectations. The overall reward balances social compliance and navigational efficiency by combining four weighted terms:

$$R = \lambda_{\text{social}} R_{\text{social}} + \lambda_{\text{expert}} R_{\text{expert}} + \lambda_{\text{smooth}} R_{\text{smooth}} + \lambda_{\text{eff}} R_{\text{eff}},$$
where R_social is the primary reward, derived from a semantic occupancy map M_occ that encourages the agent to maintain safe clearance from all non-traversable areas; R_expert promotes consistency with expert trajectories; R_smooth enforces natural motion continuity; and R_eff rewards efficient progress toward the goal. Together, these terms ensure that the policy learns trajectories that are not only efficient but also socially compliant.
To ensure a fair comparison, we evaluate SocialNav in three distinct settings: (1) open-loop benchmark proposed by CityWalker [24], (2) closed-loop evaluation on our SocNav Benchmark, and (3) real-world robotic deployments.
Metrics. For open-loop evaluation, we use Maximum Average Orientation Error (MAOE) following CityWalker [24]. Closed-loop performance is measured by success rate (SR; defined as reaching within 3 meters of the goal with fewer than three collisions), route completion (RC), and success weighted by path length (SPL). To assess social compliance, we introduce the Distance Compliance Rate (DCR) and Time Compliance Rate (TCR):

$$\text{DCR} = s \cdot \frac{d_{\text{compliant}}}{d_{\text{actual}}},$$
where s is a binary indicator (1 for success, 0 for failure), d_compliant denotes the distance traveled within socially compliant regions, and d_actual represents the total distance traveled. TCR is formulated similarly over time rather than distance.
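A direct reading of this definition in code, assuming a hypothetical `compliant_mask` lookup into the annotated socially compliant regions:

```python
# Sketch of the DCR computation defined above; `compliant_mask(p)` is a
# hypothetical point-in-region test, not part of the released evaluation code.
import numpy as np

def distance_compliance_rate(traj: np.ndarray, success: bool, compliant_mask) -> float:
    steps = np.diff(traj, axis=0)                  # per-step displacements
    seg_len = np.linalg.norm(steps, axis=1)
    d_actual = seg_len.sum()
    in_region = np.array([compliant_mask(p) for p in traj[1:]])
    d_compliant = seg_len[in_region].sum()         # distance inside compliant regions
    s = 1.0 if success else 0.0
    return s * d_compliant / max(d_actual, 1e-8)
```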
Model Details. We employ Qwen2.5-VL (3B) as the brain module. The action expert is designed as a Diffusion Transformer [28] with L = 12 layers, H = 12 attention heads per layer, and hidden dimension D = 1536. During inference, the trajectory denoising is performed iteratively for K = 5 steps.
Training Details. Pre-training Stage: The full model is trained end-to-end using AdamW [26] (3 epochs, 96 H20 GPUs) with a batch size of 192 and a learning rate of 5 × 10⁻⁵. Fine-tuning Stage: Only the action expert is fine-tuned on 32 H20 GPUs with a batch size of 256 and a learning rate of 1 × 10⁻⁵. SAFE-GRPO: We further optimize the action expert on 16 H20 GPUs, using a rollout batch size of 128 and a learning rate of 5 × 10⁻⁷.
We compare SocialNav against state-of-the-art (SOTA) open-source point-based navigation methods. The selected baselines include CityWalker [24], ViNT [35], GNM [34],
and NoMaD [40]. While ViNT, GNM, and NoMaD were originally designed for image-goal navigation, we retrained them for point-goal tasks to ensure a fair comparison. In Tab. 2 and Tab. 3, an asterisk (*) denotes the models that were exclusively trained on the D_real dataset.
We evaluate on the open-loop benchmark introduced by CityWalker [24]. As shown in Tab. 1, our approach consistently outperforms prior methods across key scenarios, demonstrating improved generalization and robustness.
Since ground truth trajectories are human-operated, these results indicate that our method more closely aligns with human social walking norms.
We conduct closed-loop evaluation on our SocNav Benchmark, with all environments unseen during training.
Quantitative results are shown in Tab. 2 and qualitative visualizations in Fig. 4.
Navigation Performance. SocialNav achieves state-of-the-art results across all navigation metrics. Specifically, it significantly outperforms CityWalker [24], the second-best method, with concurrent gains of +38.3 in SR, +26.5 in RC, and +32.7 in SPL.
Social Compliance. SocialNav achieves remarkable improvements in social compliance, attaining a DCR of 82.5 and a TCR of 82.9, both more than double those of CityWalker (DCR: 36.1, TCR: 36.6), and significantly better than the other baselines. Representative qualitative visualizations in Fig. 4 further highlight these advantages: SocialNav consistently selects paths that adhere to sidewalks and designated walkways, whereas the baselines frequently opt for shorter but socially inappropriate routes traversing restricted areas. This substantial improvement validates our model’s ability to internalize complex social norms. Importantly, these gains in social compliance are attained without sacrificing navigation performance.
In this work, we presented SocialNav, a hierarchical foundation model capable of socially aware embodied navigation. While SocialNav provides a scalable and generalizable framework for socially intelligent navigation, several directions remain for future exploration. First, our reinforcement learning paradigm could be extended beyond semantic traversability to capture a broader range of context-dependent human conventions. Second, the current reward formulation relies on hand-crafted rules; integrating vision-language models to deliver richer, more adaptive reward signals represents a promising step toward stronger human alignment. We believe this work marks an important milestone toward embodied agents that can navigate complex, dynamic social environments with genuine social awareness.

To construct recovery cases, the agent's initial heading is offset from the expert path by an angle within [45°, 90°]. This forces the agent to start by facing away from the correct path, requiring an immediate and decisive turn.
τ_rec is generated by linearly interpolating between the recovery start state s_rec and the convergence point p_conv with a fixed spatial step of approximately 5 cm. To simulate natural human micro-corrections and avoid trivial straight-line paths, we perturb each interpolated point q_t with small, zero-mean Gaussian noise:

$$q_t' = q_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2 I).$$
The final perturbed path τ′_rec is formed by concatenating τ_rec with the path from p_conv to g. It is kept only if all its points remain in cells of M_occ with sufficient clearance from obstacles. These systematically constructed recovery trajectories mimic plausible failure scenarios, such as starting from the wrong orientation or position, and provide explicit supervision on how to execute safe and efficient corrective actions. This enriches the training data far more effectively than simple random noise, significantly improving the policy’s robustness in challenging situations.
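The sketch below mirrors this construction: fixed-step interpolation, zero-mean Gaussian jitter, and a clearance check against a distance map derived from M_occ. The noise scale, grid resolution, and clearance threshold are assumed values for illustration.

```python
# Sketch of recovery-path construction (assumed parameter values): ~5 cm
# interpolation steps, Gaussian micro-corrections, clearance-based rejection.
import numpy as np

def build_recovery_path(s_rec, p_conv, occ_dist, cell_size=0.05,
                        step=0.05, noise_std=0.02, min_clear=0.3, rng=None):
    rng = rng or np.random.default_rng()
    s_rec, p_conv = np.asarray(s_rec), np.asarray(p_conv)
    n = max(int(np.linalg.norm(p_conv - s_rec) / step), 1)
    ts = np.linspace(0.0, 1.0, n + 1)[:, None]
    pts = (1 - ts) * s_rec + ts * p_conv                       # fixed-step interpolation
    pts[1:-1] += rng.normal(0.0, noise_std, pts[1:-1].shape)   # micro-corrections
    cells = np.clip((pts / cell_size).astype(int), 0,
                    np.array(occ_dist.shape) - 1)
    clear = occ_dist[cells[:, 0], cells[:, 1]]                 # clearance from DT map
    return pts if np.all(clear >= min_clear) else None         # reject unsafe paths

dist_map = np.full((400, 400), 1.0)                            # toy clearance map
path = build_recovery_path([0.0, 0.0], [5.0, 3.0], dist_map)
print(None if path is None else path.shape)
```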
Figure 5 illustrates typical standard and recovery trajectories on M_occ, highlighting the diversity introduced by local recovery paths.
Socially traversable regions provide the supervision signal for training the SocialNav Brain to predict socially compliant polygons. We describe the annotation pipeline and the corresponding guidelines.
1. Internet-scale data collection. We gather first-person Internet videos and images depicting pedestrians moving in outdoor scenes such as streets, campuses, and parks.
2. Automatic filtering. A vision-language model is used to filter out frames with unsuitable viewpoints (e.g., too close to walls, heavily occluded, or dominated by sky/ground) and to discard low-quality images.
3. Manual polygon annotation. Human annotators draw coarse polygons on the remaining frames to delineate socially traversable regions. We create one or more polygons to cover all regions where walking is legally and socially allowed.
• Socially Traversable Regions. These are defined as outdoor areas where pedestrians are permitted to walk by both legal regulations and common social norms. Annotators were instructed to label surfaces such as sidewalks, pedestrian-only streets, marked crosswalks, public plazas, and accessible outdoor staircases. For indoor scenes, all non-obstacle areas are considered traversable.
• Non-Traversable Regions. This category encompasses all areas that are unsafe, illegal, or socially unacceptable for pedestrian traffic. Key examples include motor vehicle lanes, bike lanes, bus lanes, green belts, ornamental lawns, flower beds, water bodies, etc.
• Annotation Protocol and Polygon Standards. Annotators draw coarse, low-vertex polygons to cover the full extent of the walkable surface; pixel-perfect alignment with curbs or markings is not required. Annotation focuses on the walkable region directly connected to the camera’s viewpoint. A single polygon can span multiple connected surfaces (e.g., a sidewalk leading into a plaza). Isolated regions that cannot be reached without crossing non-traversable areas are ignored, except for pedestrian safety islands that are visibly reachable via crosswalks.
Figure 6 presents qualitative comparisons of predicted socially traversable regions across six unseen test scenes.
Each column corresponds to a different scene, while the four rows show: (1) Qwen2.5-VL predictions in SocCity, (2) SocialNav predictions in SocCity, (3) Qwen2.5-VL predictions in the real world, and (4) SocialNav predictions in the real world. Green polygons denote model-predicted socially traversable regions. Across both simulated and real scenes, the base Qwen2.5-VL-3B model often produces polygons that are coarse, spatially misaligned, or leak into socially invalid areas such as curb edges, grass, or vehicle lanes. In contrast, the SocialNav Brain, trained with our large-scale traversability annotations, provides predictions that are more structured and socially consistent: polygons more tightly follow sidewalks, pedestrian paths, and crosswalks, and avoid visually similar but non-walkable regions. The improvements hold across all six diverse unseen scenes, indicating that fine-tuning not only enhances accuracy but also significantly strengthens cross-domain generalization.

Figure 6. Predicted socially traversable regions on unseen scenes. Green polygons denote predicted socially traversable regions, and red arrows highlight areas incorrectly classified as traversable. SocialNav yields more semantically aligned polygons in both domains.
To construct the Cognitive Activation Dataset (CAD) for navigation, we elicit chain-of-thought (CoT) style explanations using an instruction prompt. The goal is to obtain, for each navigation step, (i) a structured reasoning trace that explains the agent’s decision in terms of scene layout, social norms, and future consequences, and (ii) a final discrete action selected from a predefined action space. This subsection details the prompt design and the corresponding input-output specification.
Task and Role Description. We explicitly cast the brain model as the Thinking Module of a professional navigation system. Its responsibility is to produce a logically coherent CoT and a final movement decision.
The prompt then enforces a strict three-stage reasoning protocol: 1. Global situation analysis: jointly parse all input information, including the robot state, the target location, and visual observations. 2. CoT generation: build a structured, logically sound reasoning process and evaluate all plausible alternatives. 3. Final decision: select exactly one action from the candidate action space and output it in the specified format.

Prompt Template. Figure 7 presents the complete English-language prompt template used to generate the navigation CoTs for our CAD. The resulting CoT and Decision segments are stored alongside the visual observations and trajectory states, forming rich supervision for the Brain Module to acquire socially aware navigation reasoning.
Figure 8 provides a qualitative example of the chain-of-thought (CoT) generated by our Brain Module for a navigation decision in an unseen urban intersection. The CoT showcases the model’s ability to perform complex, multi-stage reasoning by integrating its state, goal, and rich perceptual understanding.
In this example, the model correctly identifies a conflict between the long-term goal vector (to the forward-right) and the immediate safety and social constraints. Instead of pursuing a direct path, the model decomposes the problem, recognizing that the sanctioned crosswalk directly ahead, aided by a green pedestrian signal, is the necessary intermediate step. It explicitly evaluates actions against a learned hierarchy of navigational rules (1. Safety, 2. Social Compliance, 3. Task Efficiency) to invalidate unsafe actions (e.g., Move Forward-Right) and inefficient ones (e.g., Stay Still). The final decision to Go Straight is thus not a simple reactive choice but the output of a deliberate plan to safely and compliantly progress toward the goal.

Figure 7. Prompt template for CoT generation (excerpt). Role and task description: You are now the Thinking Module of a professional navigation AI system. Your single core task is to generate, for every movement of a quadruped robot dog, a comprehensive, in-depth, and logically rigorous Chain of Thought (CoT), and then output a clear movement decision. This CoT must integrate all available environmental information and explain why the final decision is the optimal choice under the current situation. Workflow (you must strictly follow these three steps, without omission or reordering): 1. Comprehensive analysis: parse all input information, including the robot’s own state, the goal position, and environmental perception. 2. CoT generation: based on your analysis, build a structured, logically sound reasoning process and evaluate all plausible alternatives. 3. Final decision: select one and only one action from the candidate action space, and output it in the specified format.
Social Compliance Reward R social : This is the primary incentive for respecting both physical safety and social norms.
From $M_{\text{occ}} \in \{0,1\}^{H \times W}$, we compute a Distance Transform (DT) map $D(x)$, which assigns each traversable location $x$ its Euclidean distance to the nearest non-traversable cell. This DT map encodes both collision avoidance and social distancing principles: higher values indicate safer, more normatively acceptable regions. Let $\{x_t\}_{t=1}^{T}$ be the predicted trajectory in world coordinates, and let $\bar{d}_{\text{pred}} = \frac{1}{T}\sum_{t=1}^{T} D(x_t)$ denote the average obstacle-free clearance along the path. Similarly, we compute $\bar{d}_{\text{gt}}$ for the expert trajectory. The social reward is then formulated as a sigmoid of the clearance difference:

$$R_{\text{social}} = \beta \cdot \sigma\!\left(\frac{\bar{d}_{\text{pred}} - \bar{d}_{\text{gt}}}{\alpha}\right),$$
where σ(·) denotes the sigmoid function, and the hyperparameters α = 0.5 and β = 2.0 control sensitivity and scaling. This formulation rewards trajectories that maintain comparable or greater clearance than the expert.

Expert Trajectory Similarity Reward (R_expert): We measure similarity in both spatial proximity and directional consistency. Given the predicted trajectory p and the expert trajectory g in world coordinates, the reward combines a distance term and a direction term:

$$R_{\text{expert}} = w_d \, r_{\text{dist}} + w_\theta \, r_{\text{dir}},$$

where r_dist scores the mean pointwise distance between p and g against the threshold τ_d, r_dir scores directional agreement, and w_d = 0.7, w_θ = 0.3, τ_d = 1.0 m. Here, Δθ_avg is the average angular difference between consecutive displacement vectors.
Trajectory Smoothness Reward (R_smooth): We encourage consistent step lengths by penalizing high variance in inter-step distances:

$$R_{\text{smooth}} = \exp\!\big(-\alpha_s \cdot \mathrm{std}\big(\{\|x_{t+1} - x_t\|\}_{t=1}^{T-1}\big)\big),$$

where α_s = 0.8 and std(·) computes the standard deviation of step magnitudes. Lower variance yields higher reward, promoting natural, gait-like movement.
Path Efficiency Reward (R_eff): To encourage forward progress without excessive detours, we compare the agent’s net advancement toward the goal with that of the expert, with α_l = 5.0 and β_l = 2.0 controlling the sharpness and offset of this comparison. By combining these reward components, our design ensures that the agent learns to navigate not only effectively, but also in a manner that is predictable, respectful, and aligned with human expectations within shared environments.
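A hedged sketch of this reward computation follows: a distance-transform map from M_occ drives R_social, combined here with the smoothness term. Component forms track the descriptions above; where the paper's exact expressions are not given, simple stand-ins are used and marked as such.

```python
# Sketch of reward terms (stand-in forms where the exact expressions are
# unspecified); R_expert and R_eff are omitted for brevity.
import numpy as np
from scipy.ndimage import distance_transform_edt

def social_reward(m_occ, traj, expert, cell=0.05, alpha=0.5, beta=2.0):
    d_map = distance_transform_edt(m_occ) * cell        # clearance (meters) to non-traversable cells
    def mean_clear(path):
        c = np.clip((path / cell).astype(int), 0, np.array(m_occ.shape) - 1)
        return d_map[c[:, 0], c[:, 1]].mean()
    diff = mean_clear(traj) - mean_clear(expert)
    return beta / (1.0 + np.exp(-diff / alpha))         # sigmoid of clearance difference

def smooth_reward(traj, alpha_s=0.8):
    steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    return float(np.exp(-alpha_s * steps.std()))        # low step variance -> high reward

def total_reward(m_occ, traj, expert, weights=(1.0, 1.0)):
    w_soc, w_smooth = weights                           # illustrative weights
    return w_soc * social_reward(m_occ, traj, expert) + w_smooth * smooth_reward(traj)
```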
We ablate R_social as defined in Eq. (4) of the main paper. For this variant, we set the weight of R_social to zero while keeping all other training configurations unchanged.
The results in Table 8 show that R social is crucial for high DCR and TCR; without it, the agent tends to take shorter but socially risky shortcuts.
We provide an extended comparison on the CityWalker open-loop benchmark. Model variants marked with an asterisk (*) denote models that were trained exclusively on the D_real dataset.
Compared to the official CityWalker model, the retrained CityWalker* shows consistent improvements across all scenarios, confirming that our D_real provides more comprehensive navigation motion priors. Similarly, GNM*, ViNT*, and NoMaD* benefit from retraining. SocialNav (Full) achieves the lowest MAOE in every scenario and in the overall mean. The largest relative gains are observed in the Turn and Crossing categories, where understanding of social layout and high-level semantics is particularly important. This trend supports our claim that integrating the Brain Module with the flow-based Action Expert leads to trajectories that more closely match human-operated behaviors.
To further illustrate the real-world performance of SocialNav, we visualize third-person views from the Unitree Go2 deployments described in Table 3.
The visualizations highlight that SocialNav can be executed in real-time on a cloud server with an NVIDIA A10 GPU, maintaining over 5 Hz control frequency while preferring socially acceptable walkways even when shorter but socially inappropriate shortcuts exist.
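For context, a cloud-inference deployment of this kind typically reduces to a fixed-rate control loop; the sketch below is illustrative, with `query_model` and `robot` as placeholders rather than the actual deployment stack.

```python
# Illustrative 5 Hz cloud-inference control loop; placeholder interfaces,
# not the actual SocialNav deployment code.
import time

def control_loop(robot, query_model, hz=5.0):
    period = 1.0 / hz
    while not robot.reached_goal():
        t0 = time.time()
        obs, pos, goal = robot.get_state()       # camera frames + odometry
        waypoints = query_model(obs, pos, goal)  # remote SocialNav inference
        robot.follow(waypoints[0])               # execute the next waypoint
        time.sleep(max(0.0, period - (time.time() - t0)))  # hold the 5 Hz rate
```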
Effect of Data Composition. To dissect the impact of various data components on SocialNav’s performance, we conduct an ablation study by progressively adding D_video, D_sim, and D_cog into the IL pipeline (No. 1-4 in Tab. 4). • Effect of D_video (No. 2 vs. No. 1): Large-