Encoding Predictability and Legibility for Style-Conditioned Diffusion Policy
Authors: Adrien Jacquet Crétides, Mouad Abrini, Hamed Rahimi, Mohamed Chetouani
Adrien Jacquet Crétides, Mouad Abrini, Hamed Rahimi, and Mohamed Chetouani

Institut des Systèmes Intelligents et de Robotique (ISIR), CNRS UMR7222, Inserm ERL U1150, Sorbonne Université, Paris, France
{first_name.last_name}@sorbonne-universite.fr

Abstract. Striking a balance between efficiency and transparent motion is a core challenge in human-robot collaboration, as highly expressive movements often incur unnecessary time and energy costs. In collaborative environments, legibility allows a human observer a better understanding of the robot's actions, increasing safety and trust. However, these behaviors result in sub-optimal and exaggerated trajectories that are redundant in low-ambiguity scenarios where the robot's goal is already obvious. To address this trade-off, we propose Style-Conditioned Diffusion Policy (SCDP), a modular framework that constrains the trajectory generation of a pre-trained diffusion model toward either legibility or efficiency based on the environment's configuration. Our method uses a post-training pipeline that freezes the base policy and trains a lightweight scene encoder and conditioning predictor to modulate the diffusion process. At inference time, an ambiguity detection module activates the appropriate conditioning, prioritizing expressive motion only for ambiguous goals and reverting to efficient paths otherwise. We evaluate SCDP on manipulation and navigation tasks, and results show that it enhances legibility in ambiguous settings while preserving optimal efficiency when legibility is unnecessary, all without retraining the base policy.

1 Introduction

Achieving a balance between operational efficiency and transparent robot motion remains a significant challenge in human-robot collaboration [3].
While these expressive behaviors often result in sub-optimal or exaggerated trajectories, they are a necessary trade-off to ensure motion legibility, a modality in which the robot's movement alone allows an external observer to correctly infer its intended goal [11]. This becomes especially critical on minimalist platforms or in constrained environments where other communication channels, such as speech or gaze [14], are unavailable. Ultimately, legibility is essential to achieve safe and intuitive interactions between humans and robots [21]. To acquire such expressive behaviors, Imitation Learning has recently served as a foundational approach, enabling robots to mimic strategies directly from data [24].

Fig. 1. Style-Conditioned Diffusion Policy is an offline imitation learning framework that allows for motion conditioning depending on the environment's context. In ambiguous scenes (top), SCDP produces intent-expressive motion to resolve goal ambiguity. When ambiguity is low (bottom), it prioritizes task efficiency, avoiding sub-optimal and exaggerated trajectories.

Building upon this, Diffusion Models [5] have emerged as a solution offering a robust framework for modeling these complex trajectory distributions in a stochastic yet controllable manner [1]. While legibility is an implicit but powerful form of communication, it often comes at the cost of efficiency. Making a trajectory more expressive may require deviations from a classical path that would be more direct or optimal from a usual robot-generation point of view [11]. In many practical scenarios, being legible is not always necessary [3].
For example, when the goal is clearly distinguishable or when ambiguity is low, a standard, efficient trajectory can be sufficient for an external observer to quickly infer the goal of the motion. Therefore, finding a trade-off between legibility and efficiency is a key challenge, which calls for systems that can adaptively produce legible motion only when required, balancing expressiveness with task performance.

In this work we introduce Style-Conditioned Diffusion Policy (SCDP), a novel approach that modulates trajectory generation by constraining a diffusion model toward a specific style. Depending on the spatial ambiguity of the environment, SCDP dynamically shifts the robot's motion between legibility and predictability, as shown in Figure 1. Our method consists of a post-training pipeline that freezes the base diffusion model and trains external environment-aware modules, namely a scene encoder and legibility/predictability predictor multi-layer perceptrons, to generate an additional conditioning signal based on the scene ambiguity, without requiring any modification to the diffusion model's internal structure. Through experiments, we show that our architecture: (i) enables style-conditioned trajectory generation without retraining or altering the base policy, (ii) supports adaptive and scene-dependent modulation based on an ambiguity detection mechanism, and (iii) improves motion interpretability in ambiguous scenarios while focusing on efficiency when legibility is not needed.

2 Related Work

2.1 Legible Motion Generation

The concept of legible motion for robotics has emerged as a key topic for HRI, emphasizing the design of movements that clearly communicate the robot's intentions to human observers [22]. Dragan et al.
[3] first introduced the distinction between legible and predictable motion, and modeled human inference through a Bayesian framework, relying on cost functions and optimization to maximize the probability of the correct goal being identified by the human observer, given the robot's trajectory from its start to its current position [11]. Learning-based approaches have since gained prominence, utilizing reinforcement learning to optimize legibility metrics [20] and supervised observer models to predict human goal inference [23]. A notable example is Legibility Diffuser [4], which employs diffusion with classifier-free guidance [12] to generate legible motion. However, this method remains agnostic to the environment's spatial ambiguity and relies on manually tuned guidance weights at test time, which motivates the need for a more environment-aware and interpretable mechanism to encode legibility into the generation process.

2.2 Robot Motion and Conditional Generative Models

Generative models have gained increasing attention in robotic motion generation due to their ability to learn complex, high-dimensional distributions of motion data from demonstrations [15]. More recently, diffusion models have shown promising results in generating diverse and temporally coherent robot trajectories, especially in manipulation and locomotion tasks [1]. Notably, Diffusion Policy [2] remains competitive across visuomotor manipulation tasks; its generation is conditioned on the observation and the robot's current state using the FiLM method [7]. Beyond the scope of robotics, recent advancements in generative modeling have introduced conditioning techniques to constrain the diffusion process [18, 19]. Specifically, Li et al. [6] added an additional context vector to constrain the diffusion generation, allowing for fine-grained control over the sampling process without altering the base model.
Inspired by this, we propose an external, environment-aware module that modulates trajectory generation based on the scene's context.

3 Preliminaries

3.1 Diffusion Policy

Our work builds upon Diffusion Policy in its U-Net form [2, 8], which adapts the Denoising Diffusion Probabilistic Models (DDPM) [5] framework to learn conditioned action sequences from demonstrations. The denoising update rule is defined as:

X_t^{k-1} = \alpha \left( X_t^k - \gamma\, \epsilon_\theta(O_t, X_t^k, k) + \mathcal{N}(0, \sigma^2 I) \right) \quad (1)

where X_t^k represents the noisy action at denoising step k, and \epsilon_\theta estimates the noise component added to the action sequence, conditioned on the observation O_t. While the observation can be vision-based, we choose to limit it in our work to the goal state g^* and the current state of the robot s_t.

3.2 Legibility and Predictability

Legibility refers to the robot's ability to communicate its intended goal to a human observer through the expressive characteristics of its motion alone [11]. By performing legible, intent-expressive actions, a robot allows an external observer to better understand the robot's action, increasing safety and efficiency in human-robot collaboration [16, 21]. A legible motion seeks to maximize the probability that an external observer can infer which objective is targeted, even at the cost of motion effectiveness, i.e., by following a longer, sub-optimal trajectory. It can be defined as follows [11]:

\text{legibility}(\xi) = \frac{\int P(g^* \mid \xi_{s \to \xi(t)})\, f(t)\, dt}{\int f(t)\, dt} \quad (2)

where g^* is the targeted goal, \xi_{s \to \xi(t)} represents the partial trajectory up to time t, and f(t) is a temporal weighting function (e.g., f(t) = T - t), emphasizing early steps of the trajectory, where the observer's uncertainty is highest because multiple potential goals are equally plausible.
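To make the legibility measure concrete, the sketch below evaluates Eq. (2) on a toy 2-D scene with two candidate goals. The Boltzmann-style goal posterior is an illustrative assumption (the definition above only requires some observer model P(g | \xi)); the inverse temperature `beta` and the example trajectories are hypothetical.

```python
import numpy as np

def goal_posterior(traj, goals, beta=5.0):
    """Illustrative Boltzmann observer: P(g | partial trajectory), scoring
    each goal by how efficiently the path so far heads toward it (an
    assumption, not the paper's exact observer model)."""
    s, x = traj[0], traj[-1]
    # cost already paid along the partial trajectory
    c_sofar = np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
    scores = []
    for g in goals:
        c_total = c_sofar + np.linalg.norm(g - x)   # cost via the current point
        c_direct = np.linalg.norm(g - s)            # optimal straight-line cost
        scores.append(np.exp(-beta * (c_total - c_direct)))
    scores = np.array(scores)
    return scores / scores.sum()

def legibility(traj, goals, target_idx):
    """Eq. (2): time-weighted average posterior of the true goal,
    with f(t) = T - t emphasizing early timesteps."""
    T = len(traj)
    num, den = 0.0, 0.0
    for t in range(2, T + 1):
        f_t = T - t
        num += goal_posterior(traj[:t], goals)[target_idx] * f_t
        den += f_t
    return num / den

goals = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]
start = np.array([0.0, 0.0])
# a path that detours away from the distractor scores as more legible
direct = np.linspace(start, goals[0], 20)
arc = direct + np.stack([np.zeros(20),
                         0.6 * np.sin(np.linspace(0.0, np.pi, 20))], axis=1)
print(legibility(direct, goals, 0), legibility(arc, goals, 0))
```

Under this toy observer, the curved path that bends away from the distractor yields a higher legibility score than the straight, efficient one, which is exactly the trade-off discussed above.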
While legibility focuses on the inference of the goal from the motion, predictability reflects the observer's ability to anticipate the trajectory itself, provided the goal is already known. Mathematically, if legibility maximizes P(g | \xi), predictability seeks to maximize P(\xi | g). A predictable trajectory aligns with the human observer's expectation of how an action should be performed, which typically corresponds to the most efficient or cost-optimal path.

3.3 Spatial Ambiguity

We focus in this work on spatial ambiguity, which depends on the environment's configuration. Given a goal space G, we formalize a scene as spatially ambiguous for a target goal g^* \in G relative to a set of distractor goals g^- by examining the observer's ability to infer the correct intent from an efficient trajectory. Let \xi^* denote an optimal trajectory toward g^*, typically characterized by the shortest Euclidean distance. Formally, given a confidence threshold \tau, we define a scene as ambiguous if there exists at least one goal g^- \in G \setminus \{g^*\} such that the following condition holds for the efficient trajectory \xi^*:

P(g^* \mid \xi^*_{s \to \xi(t)}) - P(g^- \mid \xi^*_{s \to \xi(t)}) < \tau \quad (3)

where P(g \mid \xi^*_{s \to \xi(t)}) represents the posterior probability of a goal g given the observed partial trajectory \xi^*_{s \to \xi(t)}.

4 Style-Conditioned Diffusion Policy

We propose a composite framework that builds on Diffusion Policy [2] as the base model and uses an additional module to condition it. This module is composed of three multi-layer perceptrons (MLPs): one acting as a scene encoder that generates a context vector corresponding to the environment the robot acts in, and the two others acting as conditioning predictors that depend on this context.
At inference time, we use an ambiguity detection module to determine whether legibility or efficiency is required.

4.1 Task Formalization

We consider robot motion generation as a sequential decision-making problem in which an agent must move toward a specific goal. Let G \subset \mathbb{R}^2 denote the set of possible goals, where g \in G represents a position for the robot to reach. We model this problem as a discrete-time, infinite-horizon Markov Decision Process (MDP), M = (S, A, T, \rho_0), where S is the state space, A the action space, T(\cdot \mid s, a) the state transition distribution, and \rho_0(\cdot) the initial state distribution. At each step, the agent observes a state s_t and uses a policy \pi to select an action a_t = \pi(s_t), where states and actions correspond to the robot's current and next joint configurations. The agent then transitions to the next state s_{t+1} \sim T(\cdot \mid s_t, a_t). We assume a dataset of N_d task demonstrations D = \{\xi_i\}_{i=1}^{N_d}, where each demonstration is a trajectory \xi_i = (s_0^i, a_0^i, s_1^i, a_1^i, \ldots, s_T^i), along with the environment state, namely a positive goal g^* \in G. In the case of legible motion generation, our objective is to generate motion toward g^* that is distinct from motion toward any negative goal g^- \in G \setminus \{g^*\}, such that p(g^* \mid s_t, a_t) > p(g^- \mid s_t, a_t). In the case of efficient motion, our objective is to minimize the cost of reaching g^*, which corresponds to maximizing the likelihood of the trajectory given the goal, p(\xi \mid g^*).

4.2 Scene Encoding

As legibility depends entirely on the environment's configuration and the positions of the objects present, we introduce a scene encoder, trained independently from the rest of the architecture, to learn a latent representation of the environment and of the spatial relations between the different goals.
This MLP takes as input the coordinates of the different goals g^* and g^- in the scene. For every goal g_i^- \in G \setminus \{g^*\}, we compute r_i = g_i^- - g^*, the relative vector to g^*, and j_i = \|g_i^- - g^*\|_2 \in \mathbb{R}, the Euclidean distance to g^*. We then construct the enriched vectors \tilde{g}_i = [g_i^{-\top} \; r_i^\top \; j_i]^\top \in \mathbb{R}^5. The enriched vector for g^* is defined as \tilde{g}^* = [g^{*\top} \; 0 \; 0]^\top \in \mathbb{R}^5, acting as the origin of the scene's coordinate system and grounding the relative vectors of the other goals. Given N negative goals, we concatenate these enriched vectors into a single vector x = [\tilde{g}^{*\top} \; \tilde{g}_1^\top \; \ldots \; \tilde{g}_N^\top]^\top \in \mathbb{R}^{5(N+1)}, which is passed into the encoder S : \mathbb{R}^{5(N+1)} \to \mathbb{R}^s to obtain a latent contextual vector of the scene:

c = S(x) \in \mathbb{R}^s \quad (4)

To train this encoder, we employ a reconstruction-based approach using an autoencoder architecture [17], where the scene encoder is jointly trained with a decoder.

4.3 Constraining the Diffusion

Fig. 2. (a) Training process: the predictor module is integrated via a post-training pipeline in which the base Diffusion Policy weights remain frozen. By training the lightweight MLP on a subset of expressive demonstrations, the module learns to specifically compensate for the residuals between the style-specific trajectories and the general paths the base model was originally trained to reproduce. (b) U-Net conditioning: the conditioning from the predictor is applied only to the bottleneck of the diffusion U-Net, using FiLM to denoise X at each timestep t.

We design a second MLP, responsible for the trajectory style encoding. Figure 2 shows the post-training process.
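As a concrete illustration of the scene-encoding and conditioning pipeline of Sections 4.2-4.3 (Eq. 4 and the FiLM modulation that follows), the sketch below builds the enriched scene vector, encodes it, and modulates a stand-in bottleneck feature. All layer sizes and the random weights are illustrative assumptions; the real encoder and predictors are trained as described in the text.

```python
import numpy as np

def enrich_scene(g_star, negatives):
    """Section 4.2: each goal becomes a 5-D enriched vector [g, r, j]
    with r = g - g* and j = ||g - g*||; g* anchors the frame."""
    vecs = [np.concatenate([g_star, np.zeros(2), [0.0]])]
    for g in negatives:
        r = g - g_star
        vecs.append(np.concatenate([g, r, [np.linalg.norm(r)]]))
    return np.concatenate(vecs)            # x in R^{5(N+1)}

rng = np.random.default_rng(0)

def mlp_params(sizes):
    """Random weights for a small MLP (illustrative stand-in for the
    trained scene encoder / style predictor)."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(m))
            for n, m in zip(sizes, sizes[1:])]

def mlp(params, x):
    for i, (W, b) in enumerate(params):
        x = W @ x + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

# Assumed toy dimensions: latent scene code s = 8, U-Net bottleneck l = 16.
encoder = mlp_params([10, 32, 8])          # S: R^{5(N+1)} -> R^s, here N = 1
predictor = mlp_params([8, 32, 2 * 16])    # outputs gamma and beta

def film(h, c):
    """FiLM modulation of the frozen U-Net bottleneck feature h with the
    scene-conditioned gamma and beta."""
    gb = mlp(predictor, c)
    gamma, beta = gb[:16], gb[16:]
    return gamma * h + beta

x = enrich_scene(np.array([1.0, 1.0]), [np.array([1.0, -1.0])])
c = mlp(encoder, x)                        # latent context vector (Eq. 4)
h = rng.normal(size=16)                    # stand-in bottleneck activation
print(film(h, c).shape)
```

Because only the predictor's parameters feed the FiLM modulation, training it on style-specific data leaves the frozen base policy untouched, matching the post-training design above.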
After training the base Diffusion Policy's U-Net on a large set of demonstrations, we freeze its weights and add the scene encoder, also pretrained, and the predictor MLP. We then train this new composite model on a smaller dataset containing only style-specific demonstrations. As the base model is frozen, only the style MLP is updated by the MSE loss during this training phase; it must learn to compensate, through the generation, for the difference between the legible demonstrations seen here and the classical demonstrations Diffusion Policy was trained to reproduce. When training on a specific data subset, the context vector c generated by the scene encoder passes through the MLP, which generates two vectors \gamma and \beta:

\gamma = W_\gamma c + b_\gamma \in \mathbb{R}^l, \quad \beta = W_\beta c + b_\beta \in \mathbb{R}^l \quad (5)

where l is the feature dimension of the bottleneck layer, and W_\gamma, W_\beta \in \mathbb{R}^{l \times s} and b_\gamma, b_\beta \in \mathbb{R}^l are learned weights and biases. These parameters \gamma, \beta are then used to modulate the U-Net's middle-layer feature vector h using FiLM conditioning [7]:

\text{FiLM}(h) = \gamma \odot h + \beta, \quad h \in \mathbb{R}^l \quad (6)

This results in an additional module that can be added to the base diffusion model to constrain its generation toward a desired controlled distribution.

4.4 Ambiguity Detection

Fig. 3. (a) Evaluation process: the environment state is passed through the ambiguity detection module to determine whether the scene is spatially ambiguous and to decide which conditioning should be used. (b) Visualization of the ellipse of ambiguity used for scene classification. The scene is labeled as spatially ambiguous when g^- falls inside the elliptical boundary.
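The elliptical test illustrated in Fig. 3(b) can be sketched compactly: the scene is flagged as ambiguous when the distractor lies inside an ellipse oriented along the motion direction, with the robot's current state as one focal point. The particular \kappa and semi-minor axis `b` below are illustrative choices, as the paper leaves M and \kappa as design parameters.

```python
import numpy as np

def ambiguity(g_star, g_neg, s_t, kappa=0.7, b=0.3):
    """Return 1 when the distractor g_neg falls inside an ellipse centered
    at e = s_t + kappa (g* - s_t), oriented along the motion direction,
    with s_t as one focal point. kappa and b are illustrative values."""
    d = g_star - s_t
    e = s_t + kappa * d                       # ellipse center
    c_f = kappa * np.linalg.norm(d)           # focal distance |e - s_t|
    a = np.hypot(b, c_f)                      # semi-major axis: a^2 = b^2 + c_f^2
    u = d / np.linalg.norm(d)                 # major-axis direction
    R = np.column_stack([u, [-u[1], u[0]]])   # rotation into the ellipse frame
    M = R @ np.diag([1.0 / a**2, 1.0 / b**2]) @ R.T
    v = g_neg - e
    return int(v @ M @ v <= 1.0)

s_t = np.array([0.0, 0.0])
g_star = np.array([2.0, 0.0])
print(ambiguity(g_star, np.array([1.5, 0.1]), s_t))  # distractor near the path
print(ambiguity(g_star, np.array([0.0, 2.0]), s_t))  # distractor far off-axis
```

A distractor lying near the straight path to g^* trips the test, while one well off-axis does not, which is the binary decision the detection module uses to select the conditioning.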
To approximate the probabilistic definition of spatial ambiguity from Section 3.3, we define an ellipse of ambiguity E that represents the spatial zone of confusion where a negative goal g^- is most likely to cause an observer to misinterpret the robot's intent. The ellipse is constructed such that the robot's current state s_t acts as one of its focal points, and it is centered at a point e located between the robot's current state s_t and the goal g^*. This center is defined as e = s_t + \kappa (g^* - s_t), where \kappa \in (0.5, 1) is a scaling factor. A scene is considered spatially ambiguous if any negative goal g^- falls within this elliptical boundary:

\text{Ambiguity}(g^*, g^-, s_t) = \begin{cases} 1 & \text{if } (g^- - e)^\top M (g^- - e) \le 1 \\ 0 & \text{otherwise} \end{cases} \quad (7)

Here, M is a symmetric positive-definite matrix that defines the orientation and semi-axes of the ellipse. Figure 3 shows the architecture used at inference time and a visualization of the ellipse of ambiguity for a given environment. The ambiguity detection module serves as an arbitrator that selects the optimal conditioning for the Diffusion Policy based on the environment's risk of goal confusion. If the scene is flagged as spatially ambiguous, the legibility predictor is activated to constrain the diffusion process toward a more expressive trajectory. Otherwise, the predictability predictor constrains the generation toward the most efficient path.

5 Evaluation

Our experiments are designed to evaluate how effectively the model balances legibility and efficiency across varying environmental contexts. To address this trade-off, we consider the following research questions:

• Can SCDP increase legibility performance when the situation requires higher clarity?
• Does SCDP effectively reduce legibility to prioritize efficiency in unambiguous environments?
To provide robust answers, we test these questions across diverse datasets featuring a large range of spatial configurations.

5.1 Datasets

We conduct our experiments on two tasks. Each task is divided into two environmental scenarios: spatial ambiguity and no spatial ambiguity.

Block Reach: the Block Reach task is a benchmark commonly used in legibility studies [3]. In this setup, two objects are placed randomly in the scene, and a manipulator arm, here a Franka Emika Panda robot, is required to reach one of them.

Navigation: in this task, a mobile robot has to reach one of the two goals present in the room. The robot used here is a TurtleBot.

To train our agents, we collected datasets of 200 demonstrations for each task, using the Gazebo simulator to perform and record the robots' states and actions. These demonstrations were procedurally generated using quadratic Bézier curves to ensure a diverse distribution of trajectories, ranging from near-optimal straight lines to highly curved paths.

Fig. 4. Visualization of SCDP and baseline inferences in ambiguous (top) and non-ambiguous (bottom) scene configurations for the navigation task. While Diffusion Policy captures the entire data distribution and Legibility Diffuser collapses on the most legible mode, SCDP constrains its generation depending on the scene configuration.

5.2 Baselines

We compare and evaluate our method with respect to other diffusion-based baselines.

Diffusion Policy [2]: Diffusion Policy is the original DDPM-based model our architecture is built on. Similarly to our method, it is used here in its goal-conditioned form, to isolate the effect of our contribution.
Legibility Diffuser [4]: also built on a Diffusion Policy, Legibility Diffuser is a variant that introduces a guidance term to pull the trajectory away from a negative goal g^- while moving toward g^*. As the official implementation is not available, their method was reproduced for comparison, using the following hyperparameter values: w_t = 4.5, \alpha = 0.9, and \lambda = 0.5. These parameters were chosen to recreate behavior similar to that presented in the baseline's evaluation.

Training Dataset: we additionally report the scores of the datasets used to train the different baselines.

5.3 Implementation Details

This section details the architectural and training configurations for the SCDP framework and the comparative baselines.

Base Diffusion Policy: we employ a U-Net-based Diffusion Policy [2] with approximately 70M parameters for the different baselines and SCDP. The base model is trained on a foundational dataset of 200 expert demonstrations for 100 epochs.

Scene Encoder: to extract latent environmental context, we use a 3-layer MLP comprising 26,000 parameters. This module is pre-trained using a reconstruction loss on 5,000 randomized goal configurations over 50 epochs to ensure a robust representation of spatial relations.

Conditioning Predictors: the style-specific modulation is handled by two separate 4-layer MLP backbones (1M parameters each), representing the legibility and predictability predictors. These modules are fine-tuned for 300 epochs using specialized subsets containing, respectively, the top 20% most legible and top 20% most efficient demonstrations from the original datasets.

5.4 Metrics

We evaluate the generated trajectories using two primary performance indicators, which are then combined into a final adaptive transparency score metric.
Primary Metrics. The first metric, the detachment score, was introduced by the authors of Legibility Diffuser [4] and serves as a proxy for legibility when the targeted and negative goals are close. It quantifies the divergence from the negative goal g^- along the trajectory \xi_{s \to g^*}:

D(\xi_{s_0 \to g^*}) = \sum_{s_t \in \xi_{s \to g^*}} \|g^- - s_t\|_2 \, t \quad (8)

The second metric is trajectory efficiency, calculated as the reciprocal of the total Euclidean distance traveled during motion execution:

E(\xi_{s_0 \to g^*}) = \left( \sum_{s_t \in \xi_{s \to g^*}} \|s_t - s_{t-1}\|_2 + \epsilon \right)^{-1} \quad (9)

where \epsilon is a small constant to ensure numerical stability. Values of D and E are normalized using the min-max method with respect to the values obtained from the training dataset, to ensure they share a comparable scale.

Adaptive Transparency Score. To evaluate how effectively the model navigates the conflict between legibility and efficiency, we define a trade-off metric, the adaptive transparency score T:

T = (1 - w_{amb}) \hat{D} + w_{amb} \hat{E} \quad (10)

where \hat{D} and \hat{E} are the normalized detachment and efficiency scores. The weight w_{amb} is modeled as a continuous sigmoid function of the Euclidean distance j between the goals g^* and g^-, which is also min-max normalized:

w_{amb}(j) = \frac{1}{1 + e^{-u(j - x_0)}} \quad (11)

This formulation allows for a fluid transition between behavioral modes based on the environment's spatial configuration, quantifying the agent's ability to adapt its motion style to the specific geometric requirements of the scene. In our evaluation, we set the steepness parameter u = 2.5 and the midpoint distance x_0 = 0.5.

5.5 Results

Table 1 presents the evaluation results using the adaptive transparency score T (Eq. 10) across the different methods, while Table 2 reports the detachment and trajectory efficiency scores separately. Results are averaged in simulation over 100 inferences. Figure 4 illustrates the qualitative differences in trajectory generation between the baselines in both ambiguous and non-ambiguous settings.

Table 1. Performance comparison using the adaptive transparency score for the Block Reach and Navigation tasks.

Method              | Spatial Ambiguity | No Spatial Ambiguity
Block Reach Evaluation
Diffusion Policy    | 0.52 ± 0.06       | 0.61 ± 0.06
Legibility Diffuser | 0.61 ± 0.09       | 0.43 ± 0.10
Ours (SCDP)         | 0.58 ± 0.08       | 0.74 ± 0.06
Dataset             | 0.47 ± 0.17       | 0.62 ± 0.18
Navigation Evaluation
Diffusion Policy    | 0.50 ± 0.03       | 0.61 ± 0.05
Legibility Diffuser | 0.62 ± 0.02       | 0.32 ± 0.12
Ours (SCDP)         | 0.59 ± 0.03       | 0.76 ± 0.06
Dataset             | 0.46 ± 0.17       | 0.59 ± 0.20

Performance Analysis. Our results demonstrate that SCDP effectively balances trajectory legibility and path efficiency by adapting its behavior to the scene's spatial ambiguity.

• Ambiguous scenarios: in environments requiring higher clarity, the Legibility Diffuser baseline achieves the highest mean scores (0.61 in Block Reach and 0.62 in Navigation). This is driven by its prioritization of detachment from negative goals, though it suffers from low efficiency. SCDP follows closely (0.58 and 0.59, respectively), significantly outperforming the standard Diffusion Policy.
• Non-ambiguous scenarios: when the goal is clear, SCDP consistently outperforms all other methods, achieving the highest fused scores (0.74 and 0.76) in both tasks.

Table 2. Separate detachment and trajectory efficiency scores for the Block Reach and Navigation tasks.

Method              | Spatial Ambiguity             | No Spatial Ambiguity
                    | Detachment    | Efficiency    | Detachment    | Efficiency
Block Reach Evaluation
Diffusion Policy    | 0.52 ± 0.06   | 0.67 ± 0.06   | 0.43 ± 0.09   | 0.65 ± 0.09
Legibility Diffuser | 0.89 ± 0.12 ↑ | 0.13 ± 0.10 ↓ | 0.67 ± 0.16 ↑ | 0.31 ± 0.20 ↓
Ours (SCDP)         | 0.70 ± 0.10 ↑ | 0.43 ± 0.06 ↓ | 0.42 ± 0.10   | 0.80 ± 0.12 ↑
Dataset             | 0.46 ± 0.22   | 0.64 ± 0.25   | 0.50 ± 0.22   | 0.64 ± 0.24
Navigation Evaluation
Diffusion Policy    | 0.42 ± 0.03   | 0.77 ± 0.05   | 0.58 ± 0.16   | 0.61 ± 0.04
Legibility Diffuser | 0.85 ± 0.04 ↑ | 0.02 ± 0.03 ↓ | 1.06 ± 0.28 ↑ | −0.07 ± 0.19 ↓
Ours (SCDP)         | 0.64 ± 0.02 ↑ | 0.47 ± 0.05 ↓ | 0.31 ± 0.01 ↓ | 0.85 ± 0.08 ↑
Dataset             | 0.39 ± 0.21   | 0.70 ± 0.25   | 0.47 ± 0.28   | 0.64 ± 0.26

It is worth noting that across all experiments, the success rate for reaching the target goal remained above 0.98 for all baselines and our proposed SCDP. This confirms that the style-conditioning modules modulate the trajectory path without compromising the underlying task performance of the base model.

Discussion. While Legibility Diffuser can be tuned to maximize intent expression, our approach is limited to the data seen in the training set, meaning it is unlikely to generate trajectories more legible than those observed. Nonetheless, SCDP provides a data-driven, environment-aware alternative that maintains high efficiency in clear scenarios without explicit guidance tuning, offering a superior overall trade-off across diverse environmental contexts.

6 Deployment

To evaluate the portability of SCDP beyond simulated environments, we deployed the model on a physical Franka Emika Panda robot to perform the Block Reach task, as shown in Figure 5.

6.1 Perception

For object detection, we use a fine-tuned YOLO [13] model to detect target and distractor blocks in frames from an Intel RealSense RGB-D camera. The detected 2D bounding boxes are projected into 3D space using the camera's depth map and its extrinsic calibration with the robot's base frame. The obtained coordinates can then be used by the model.
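The projection step described above can be sketched as follows, assuming a standard pinhole camera model with known intrinsics K and a camera-to-base homogeneous transform; the function name and the numeric values are illustrative, not the deployed code.

```python
import numpy as np

def bbox_to_base_frame(bbox, depth_map, K, T_base_cam):
    """Illustrative back-projection of a detected 2D bounding box to a 3D
    point in the robot base frame, assuming a pinhole model with intrinsics
    K and a known camera-to-base transform (the actual calibration pipeline
    may differ)."""
    u_min, v_min, u_max, v_max = bbox
    u, v = (u_min + u_max) / 2.0, (v_min + v_max) / 2.0  # box center pixel
    z = depth_map[int(v), int(u)]                        # depth at that pixel (m)
    # Pinhole deprojection: pixel + depth -> 3D point in the camera frame
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    p_cam = np.array([x, y, z, 1.0])
    return (T_base_cam @ p_cam)[:3]                      # express in base frame

# Toy example with assumed intrinsics and a camera 0.5 m above the base origin.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_base_cam = np.eye(4)
T_base_cam[2, 3] = 0.5
depth = np.full((480, 640), 0.8)                         # flat 0.8 m depth image
print(bbox_to_base_frame((300, 220, 340, 260), depth, K, T_base_cam))
```

In practice the transform would come from the extrinsic calibration mentioned above, and the depth would be read from the RealSense depth stream rather than a synthetic map.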
Fig. 5. Real-world deployment of SCDP on a Franka Emika Panda robot for the Block Reach task. The image sequence (left to right) illustrates the model successfully generating an exaggerated, intent-expressive trajectory toward the target blue object to resolve spatial ambiguity relative to the distractor pink object.

6.2 Sim-to-Real Transfer and Computational Efficiency

We perform inference on an NVIDIA RTX A3000 GPU. The generated trajectories are interpolated to execute motion at a frequency of 1000 Hz. By using a state-based Diffusion Policy model, the system avoids any reality gap and mirrors the performance observed in simulation, without any specific calibration or fine-tuning on real demonstrations. Our method achieves a mean total inference time of 5 seconds, similar to the base model's, as our additional modules are lightweight and add negligible latency.

7 Conclusion

We proposed an architecture that specializes a lightweight MLP, separate from the initial model, to learn a context-dependent trajectory style that conditions a pretrained Diffusion Policy model to constrain its generation when needed, in our case toward legible and predictable motion. Compared to the baseline, these style-specific modules require smaller datasets for training and impose no overhead at inference time, offering a practical enhancement for diffusion-based policies. While the focus of this work was primarily on legibility, we believe that the same learning method could be used to learn other trajectory-level concepts, such as safety, enabling style-conditioned motion generation through modular and reusable components.
Future work could aim to further develop the ambiguity detection, as the current module can be considered a geometric proxy that could be swapped for a more sophisticated ambiguity detector, and to conduct user studies to validate how human observers perceive and interpret these adaptive trajectories. Furthermore, future research could explore the framework's flexibility and scalability when faced with a higher number of potential goals.

Acknowledgments. This work used IDRIS HPC resources under the allocation 2025-[AD011017084] made by GENCI. It was funded by the French National Research Agency (ANR) under the OSTENSIVE project (ANR-24-CE33-6907-01) and the France 2030 program, reference ANR-23-PAVH-0005 (INNOVCARE project). This project has received funding from the European Union's Horizon Europe Framework Programme under grant agreement No 101070596.

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

References

1. Wolf, R., Shi, Y., Liu, S., Rayyes, R.: Diffusion Models for Robotic Manipulation: A Survey. In: Frontiers in Robotics and AI (2025)
2. Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In: Proceedings of Robotics: Science and Systems (RSS) (2023)
3. Dragan, A., Lee, K.C.T., Srinivasa, S.S.: Legibility and predictability of robot motion. In: 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 301–308 (2013)
4. Bronars, M., Cheng, S., Xu, D.: Legibility Diffuser: Offline Imitation for Intent Expressive Motion. In: IEEE Robotics and Automation Letters (RA-L) 9(11), 10161–10168 (2024)
5. Ho, J., Jain, A., Abbeel, P.: Denoising Diffusion Probabilistic Models.
In: Proceed- ing of the 34th International Conference on Neural Information Processing Systems (NeurIPS) (2020) 6. Li, H., Shen, C., T orr, P ., T resp, V., Gu, J.: Self-Disco vering Interpretable Diffusion Laten t Directions for Responsible T ext-to-Image Generation. In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pp. 12006–12016 (2024) 7. P erez, E., Strub, F., de V ries, H., Dumoulin, V., Courville, A.: FiLM: Visual Rea- soning with a General Conditioning Lay er. In: Pro ceedings of the AAAI Conference on Artificial Intelligence (2018) 8. Ronneb erger, O., Fisc her, P ., Brox, T.: U-Net: Con volutional Net works for Biomed- ical Image Segmentation. In: MICCAI 2015, LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015) 9. P eebles, W., Xie, S.: Scalable Diffusion Mo dels with T ransformers. In: Pro ceedings of the IEEE/CVF In ternational Conference on Computer Vision (ICCV), pp. 4172– 4182 (2022) 10. Lin, H., Cheng, X., W u, X., Y ang, F., Shen, D., W ang, Z., Song, Q., Y uan, W.: CA T: Cross A ttention in Vision T ransformer. In: IEEE International Conference on Multimedia and Exp o (ICME), pp. 1–6 (2022) 11. Dragan, A., Sriniv asa, S.S.: Generating Legible Motion. In: Pro ceedings of Rob otics: Science and Systems (RSS) (2013) 12. Ho, J., Salimans, T.: Classifier-F ree Diffusion Guidance. arXiv preprin t arXiv:2207.12598 (2022) 13. Redmon, J., Divv ala, S., Girshick, R., F arhadi, A.: Y ou Only Look Once: Unified, Real-Time Ob ject Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016) Enco ding Predictability and Legibilit y for Style-Conditioned Diffusion Policy 15 14. W allk ötter, S., T ulli, S., Castellano, G., Paiv a, A., Chetouani, M.: Explainable Em- b odied Agents Through So cial Cues: A Review. In: ACM T ransactions on Human- Rob ot Interactions (THRI) 10 (3), art. 27 (2021) 15. 
Urain, J., Mandlek ar, A., Du, Y., Shafiullah, M., Xu, D., F ragkiadaki, K., Chal- v atzaki, G., Peters, J.: Deep Generativ e Mo dels in Rob otics: A Survey on Learning from Multimo dal Demonstrations. arXiv preprint arXiv:2408.04380 (2024) 16. Lic hten thäler, C., Lorenzy , T., Kirsch, A.: Influence of legibility on perceived safety in a virtual h uman-rob ot path crossing task. In: The 21st IEEE In ternational Symp osium on Robot and Human Interactiv e Communication (RO-MAN), pp. 676–681 (2012) 17. Rumelhart, D.E., Hin ton, G.E., Williams, R.J.: Learning internal representations b y error propagation. In: Parallel Distributed Pro cessing, pp. 318–362. MIT Press (1987) 18. P anagiotakopoulos, T., K otsiantis, S., Gkillas, A., Lalos, A.S.: Conditional Dif- fusion Mo dels: A Survey of T echniques, Applications, and Challenges. In: IEEE A ccess 13 , 183617–183643 (2025) 19. Berrada, T., Astolfi, P ., Hall, M., Hemmat, R.A., Benchetrit, Y., Hav asi, M., Muc k- ley , M.J., Alahari, K., Romero-Soriano, A., V erb eek, J., Drozdzal, M.: On impro ved Conditioning Mec hanisms and Pre-training Strategies for Diffusion Mo dels. In: Pro ceeding of the 37th International Conference on Neural Information Pro cessing Systems (NeurIPS) (2024) 20. Bied, M., Chetouani, M.: In tegrating an Observ er in In teractive Reinforcement Learning to Learn Legible T ra jectories. In: 29th IEEE International Symp osium on Rob ot and Human Interactiv e Communication (RO-MAN), pp. 760–767 (2020) 21. Dragan, A.D., Bauman, S., F orlizzi, J., Sriniv asa, S.S.: Effects of Rob ot Motion on Human-Rob ot Collaboration. In: 10th ACM/IEEE International Conference on Human-Rob ot Interaction (HRI), pp. 51–58 (2015) 22. Lic hten thäler, C., Lorenz, T., Kirsch, A.: T o wards a Legibilit y Metric: How to Measure the Perceiv ed V alue of a Rob ot. In: International Conference on So cial Rob otics (ICSR) (2011) 23. 
W allk ötter, S., Chetouani, M., Castellano, G.: SLOT-V: Sup ervised Learning of Observ er Models for Legible Robot Motion Planning in Manipulation. In: 31st IEEE In ternational Conference on Robot and Human In teractive Communication (R O-MAN), pp. 1421–1428 (2022) 24. Zare, M., Kebria, P .M., Khosravi, A., Nahav andi, S.: A Surv ey of Imitation Learn- ing: Algorithms, Recent Developmen ts, and Challenges. In: IEEE T ransactions on Cyb ernetics 54 (12), 7173–7186 (2024)