Encoding Predictability and Legibility for Style-Conditioned Diffusion Policy
Authors: Adrien Jacquet Crétides, Mouad Abrini, Hamed Rahimi, Mohamed Chetouani
Adrien Jacquet Crétides, Mouad Abrini, Hamed Rahimi, and Mohamed Chetouani

Institut des Systèmes Intelligents et de Robotique (ISIR), CNRS UMR7222, Inserm ERL U1150, Sorbonne Université, Paris, France
{first_name.last_name}@sorbonne-universite.fr

Abstract. Striking a balance between efficiency and transparent motion is a core challenge in human-robot collaboration, as highly expressive movements often incur unnecessary time and energy costs. In collaborative environments, legibility allows a human observer a better understanding of the robot's actions, increasing safety and trust. However, these behaviors result in sub-optimal and exaggerated trajectories that are redundant in low-ambiguity scenarios where the robot's goal is already obvious. To address this trade-off, we propose Style-Conditioned Diffusion Policy (SCDP), a modular framework that constrains the trajectory generation of a pre-trained diffusion model toward either legibility or efficiency based on the environment's configuration. Our method uses a post-training pipeline that freezes the base policy and trains a lightweight scene encoder and conditioning predictor to modulate the diffusion process. At inference time, an ambiguity detection module activates the appropriate conditioning, prioritizing expressive motion only for ambiguous goals and reverting to efficient paths otherwise. We evaluate SCDP on manipulation and navigation tasks, and results show that it enhances legibility in ambiguous settings while preserving optimal efficiency when legibility is unnecessary, all without retraining the base policy.

1 Introduction

Achieving a balance between operational efficiency and transparent robot motion remains a significant challenge in human-robot collaboration [3].
While these expressive behaviors often result in sub-optimal or exaggerated trajectories, they are a necessary trade-off to ensure motion legibility, a modality in which the robot's movement alone allows an external observer to correctly infer its intended goal [11]. This becomes especially critical on minimalist platforms or in constrained environments where other communication channels, such as speech or gaze [14], are unavailable. Ultimately, legibility is essential to achieve safe and intuitive interactions between humans and robots [21]. To acquire such expressive behaviors, Imitation Learning has recently served as a foundational approach, enabling robots to mimic strategies directly from data [24].

Fig. 1. Style-Conditioned Diffusion Policy is an offline imitation learning framework that allows for motion conditioning depending on the environment's context. In ambiguous scenes (top), SCDP produces intent-expressive motion to resolve goal ambiguity. When ambiguity is low (bottom), it prioritizes task efficiency, avoiding sub-optimal and exaggerated trajectories.

Building upon this, Diffusion Models [5] have emerged as a solution offering a robust framework for modeling these complex trajectory distributions in a stochastic yet controllable manner [1]. While legibility is an implicit but powerful form of communication, it often comes at the cost of efficiency. Making a trajectory more expressive may require deviations from a classical path that would be more direct or optimal from a usual robot-generation point of view [11]. In many practical scenarios, being legible is not always necessary [3].
For example, when the goal is clearly distinguishable or when ambiguity is low, a standard, efficient trajectory can be sufficient for an external observer to quickly infer the goal of the motion. Therefore, finding a trade-off between legibility and efficiency is a key challenge, which calls for systems that can adaptively produce legible motion only when required, balancing expressiveness with task performance.

In this work we introduce Style-Conditioned Diffusion Policy (SCDP), a novel approach that modulates trajectory generation by constraining a diffusion model toward a specific style. Depending on the spatial ambiguity of the environment, SCDP dynamically shifts the robot's motion between legibility and predictability, as shown in Figure 1. Our method consists of a post-training pipeline that freezes the base diffusion model and trains external environment-aware modules, namely a scene encoder and legibility/predictability predictor multi-layer perceptrons, to generate an additional conditioning signal based on the scene ambiguity, without requiring any modification to the diffusion model's internal structure. Through experiments, we show that our architecture: (i) enables style-conditioned trajectory generation without retraining or altering the base policy, (ii) supports adaptive and scene-dependent modulation based on an ambiguity detection mechanism, and (iii) improves motion interpretability in ambiguous scenarios while focusing on efficiency when legibility is not needed.

2 Related Work

2.1 Legible Motion Generation

The concept of legible motion for robotics has emerged as a key topic for HRI, emphasizing the design of movements that clearly communicate the robot's intentions to human observers [22]. Dragan et al.
[3] first introduced the distinction between legible and predictable motion, and modeled human inference through a Bayesian framework, relying on cost functions and optimization to maximize the probability of the correct goal being identified by the human observer, given the robot's trajectory from its start to its current position [11]. Learning-based approaches have since gained prominence, utilizing reinforcement learning to optimize legibility metrics [20] and supervised observer models to predict human goal inference [23]. A notable example is Legibility Diffuser [4], which employs diffusion with classifier-free guidance [12] to generate legible motion. However, this method remains agnostic to the environment's spatial ambiguity and relies on manually tuned guidance weights at test time, which motivates the need for a more environment-aware and interpretable mechanism to encode legibility into the generation process.

2.2 Robot Motion and Conditional Generative Models

Generative models have gained increasing attention in robotic motion generation due to their ability to learn complex, high-dimensional distributions of motion data from demonstrations [15]. More recently, diffusion models have shown promising results in generating diverse and temporally coherent robot trajectories, especially in manipulation and locomotion tasks [1]. Notably, Diffusion Policy [2] remains competitive across visuomotor manipulation tasks; its generation is conditioned on the observation and the robot's current state using the FiLM method [7]. Beyond the scope of robotics, recent advancements in generative modeling have introduced conditioning techniques to constrain the diffusion process [18, 19]. Specifically, Li et al. [6] added an additional context vector to constrain the diffusion generation, allowing for fine-grained control over the sampling process without altering the base model.
Inspired by this, we propose an external, environment-aware module that modulates trajectory generation based on the scene's context.

3 Preliminaries

3.1 Diffusion Policy

Our work builds upon Diffusion Policy in its U-Net form [2, 8], which adapts the Denoising Diffusion Probabilistic Models (DDPM) [5] framework to learn conditioned action sequences from demonstrations. The denoising update rule is defined as:

X_t^{k-1} = \alpha \left( X_t^k - \gamma\, \epsilon_\theta(O_t, X_t^k, k) + \mathcal{N}(0, \sigma^2 I) \right) \quad (1)

where X_t^k represents the noisy action at denoising step k, and \epsilon_\theta estimates the noise component added to the action sequence, conditioned on the observation O_t. While the observation can be vision-based, we choose to limit it in our work to the goal state g^* and the current state of the robot s_t.

3.2 Legibility and Predictability

Legibility refers to the robot's ability to communicate its intended goal to a human observer through the expressive characteristics of its motion alone [11]. By performing legible, intent-expressive actions, a robot allows an external observer to better understand the robot's action, increasing safety and efficiency in human-robot collaboration [16, 21]. A legible motion seeks to maximize the probability that an external observer can infer which objective is targeted, even at the cost of motion effectiveness, i.e., by following a longer, sub-optimal trajectory. It can be defined as follows [11]:

\text{legibility}(\xi) = \frac{\int P(g^* \mid \xi_{s \to \xi(t)})\, f(t)\, dt}{\int f(t)\, dt} \quad (2)

where g^* is the targeted goal, \xi_{s \to \xi(t)} represents the partial trajectory up to time t, and f(t) is a temporal weighting function (e.g., f(t) = T - t), emphasizing early steps of the trajectory, where the observer's uncertainty is highest because multiple potential goals are equally plausible.
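To make the legibility measure concrete, the sketch below evaluates Eq. (2) on a toy 2-D scene with two candidate goals. The Boltzmann-style goal posterior is an illustrative assumption (the definition above only requires some observer model P(g | \xi)); the inverse temperature `beta` and the example trajectories are hypothetical.

```python
import numpy as np

def goal_posterior(traj, goals, beta=5.0):
    """Illustrative Boltzmann observer: P(g | partial trajectory), scoring
    each goal by how efficiently the path so far heads toward it (an
    assumption, not the paper's exact observer model)."""
    s, x = traj[0], traj[-1]
    # cost already paid along the partial trajectory
    c_sofar = np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
    scores = []
    for g in goals:
        c_total = c_sofar + np.linalg.norm(g - x)   # cost via the current point
        c_direct = np.linalg.norm(g - s)            # optimal straight-line cost
        scores.append(np.exp(-beta * (c_total - c_direct)))
    scores = np.array(scores)
    return scores / scores.sum()

def legibility(traj, goals, target_idx):
    """Eq. (2): time-weighted average posterior of the true goal,
    with f(t) = T - t emphasizing early timesteps."""
    T = len(traj)
    num, den = 0.0, 0.0
    for t in range(2, T + 1):
        f_t = T - t
        num += goal_posterior(traj[:t], goals)[target_idx] * f_t
        den += f_t
    return num / den

goals = [np.array([1.0, 1.0]), np.array([1.0, -1.0])]
start = np.array([0.0, 0.0])
# a path that detours away from the distractor scores as more legible
direct = np.linspace(start, goals[0], 20)
arc = direct + np.stack([np.zeros(20),
                         0.6 * np.sin(np.linspace(0.0, np.pi, 20))], axis=1)
print(legibility(direct, goals, 0), legibility(arc, goals, 0))
```

Under this toy observer, the curved path that bends away from the distractor yields a higher legibility score than the straight, efficient one, which is exactly the trade-off discussed above.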
While legibility focuses on the inference of the goal from the motion, predictability reflects the observer's ability to anticipate the trajectory itself, provided the goal is already known. Mathematically, if legibility maximizes P(g | \xi), predictability seeks to maximize P(\xi | g). A predictable trajectory aligns with the human observer's expectation of how an action should be performed, which typically corresponds to the most efficient or cost-optimal path.

3.3 Spatial Ambiguity

We focus in this work on spatial ambiguity, which depends on the environment's configuration. Given a goal space G, we formalize a scene as spatially ambiguous for a target goal g^* \in G relative to a set of distractor goals g^- by examining the observer's ability to infer the correct intent from an efficient trajectory. Let \xi^* denote an optimal trajectory toward g^*, typically characterized by the shortest Euclidean distance. Formally, given a confidence threshold \tau, we define a scene as ambiguous if there exists at least one goal g^- \in G \setminus \{g^*\} such that the following condition holds for the efficient trajectory \xi^*:

P(g^* \mid \xi^*_{s \to \xi(t)}) - P(g^- \mid \xi^*_{s \to \xi(t)}) < \tau \quad (3)

where P(g \mid \xi^*_{s \to \xi(t)}) represents the posterior probability of a goal g given the observed partial trajectory \xi^*_{s \to \xi(t)}.

4 Style-Conditioned Diffusion Policy

We propose a composite framework that builds on Diffusion Policy [2] as the base model and uses an additional module to condition it. This module is composed of three multi-layer perceptrons (MLPs): one acting as a scene encoder that generates a context vector corresponding to the environment the robot acts in, and the two others acting as conditioning predictors that depend on this context.
At inference time, we use an ambiguity detection module to determine whether legibility or efficiency is required.

4.1 Task Formalization

We consider robot motion generation as a sequential decision-making problem in which an agent must move toward a specific goal. Let G \subset \mathbb{R}^2 denote the set of possible goals, where g \in G represents a position for the robot to reach. We model this problem as a discrete-time, infinite-horizon Markov Decision Process (MDP), M = (S, A, T, \rho_0), where S is the state space, A the action space, T(\cdot \mid s, a) the state transition distribution, and \rho_0(\cdot) the initial state distribution. At each step, the agent observes a state s_t and uses a policy \pi to select an action a_t = \pi(s_t), where states and actions correspond to the robot's current and next joint configurations. The agent then transitions to the next state s_{t+1} \sim T(\cdot \mid s_t, a_t). We assume a dataset of N_d task demonstrations D = \{\xi_i\}_{i=1}^{N_d}, where each demonstration is a trajectory \xi_i = (s_0^i, a_0^i, s_1^i, a_1^i, \ldots, s_T^i), along with the environment state, namely a positive goal g^* \in G. In the case of legible motion generation, our objective is to generate motion toward g^* that is distinct from motion toward any negative goal g^- \in G \setminus \{g^*\}, such that p(g^* \mid s_t, a_t) > p(g^- \mid s_t, a_t). In the case of efficient motion, our objective is to minimize the cost of reaching g^*, which corresponds to maximizing the likelihood of the trajectory given the goal, p(\xi \mid g^*).

4.2 Scene Encoding

As legibility depends entirely on the environment's configuration and the positions of the objects present, we introduce a scene encoder, trained independently from the rest of the architecture, to learn a latent representation of the environment and of the spatial relations between the different goals.
This MLP takes as input the coordinates of the different goals g^* and g^- in the scene. For every goal g_i^- \in G \setminus \{g^*\}, we compute r_i = g_i^- - g^*, the relative vector to g^*, and j_i = \|g_i^- - g^*\|_2 \in \mathbb{R}, the Euclidean distance to g^*. We then construct the enriched vectors \tilde{g}_i = [g_i^{-\top} \; r_i^\top \; j_i]^\top \in \mathbb{R}^5. The enriched vector for g^* is defined as \tilde{g}^* = [g^{*\top} \; 0 \; 0]^\top \in \mathbb{R}^5, acting as the origin of the scene's coordinate system and grounding the relative vectors of the other goals. Given N negative goals, we concatenate these enriched vectors into a single vector x = [\tilde{g}^{*\top} \; \tilde{g}_1^\top \; \ldots \; \tilde{g}_N^\top]^\top \in \mathbb{R}^{5(N+1)}, which is passed into the encoder S : \mathbb{R}^{5(N+1)} \to \mathbb{R}^s to obtain a latent contextual vector of the scene:

c = S(x) \in \mathbb{R}^s \quad (4)

To train this encoder, we employ a reconstruction-based approach using an autoencoder architecture [17], where the scene encoder is jointly trained with a decoder.

4.3 Constraining the Diffusion

Fig. 2. (a) Training process: the predictor module is integrated via a post-training pipeline in which the base Diffusion Policy weights remain frozen. By training the lightweight MLP on a subset of expressive demonstrations, the module learns to specifically compensate for the residuals between the style-specific trajectories and the general paths the base model was originally trained to reproduce. (b) U-Net conditioning: the conditioning from the predictor is applied only to the bottleneck of the diffusion U-Net, using FiLM to denoise X at each timestep t.

We design a second MLP, responsible for the trajectory style encoding. Figure 2 shows the post-training process.
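As a concrete illustration of the scene-encoding and conditioning pipeline of Sections 4.2-4.3 (Eq. 4 and the FiLM modulation that follows), the sketch below builds the enriched scene vector, encodes it, and modulates a stand-in bottleneck feature. All layer sizes and the random weights are illustrative assumptions; the real encoder and predictors are trained as described in the text.

```python
import numpy as np

def enrich_scene(g_star, negatives):
    """Section 4.2: each goal becomes a 5-D enriched vector [g, r, j]
    with r = g - g* and j = ||g - g*||; g* anchors the frame."""
    vecs = [np.concatenate([g_star, np.zeros(2), [0.0]])]
    for g in negatives:
        r = g - g_star
        vecs.append(np.concatenate([g, r, [np.linalg.norm(r)]]))
    return np.concatenate(vecs)            # x in R^{5(N+1)}

rng = np.random.default_rng(0)

def mlp_params(sizes):
    """Random weights for a small MLP (illustrative stand-in for the
    trained scene encoder / style predictor)."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(m))
            for n, m in zip(sizes, sizes[1:])]

def mlp(params, x):
    for i, (W, b) in enumerate(params):
        x = W @ x + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

# Assumed toy dimensions: latent scene code s = 8, U-Net bottleneck l = 16.
encoder = mlp_params([10, 32, 8])          # S: R^{5(N+1)} -> R^s, here N = 1
predictor = mlp_params([8, 32, 2 * 16])    # outputs gamma and beta

def film(h, c):
    """FiLM modulation of the frozen U-Net bottleneck feature h with the
    scene-conditioned gamma and beta."""
    gb = mlp(predictor, c)
    gamma, beta = gb[:16], gb[16:]
    return gamma * h + beta

x = enrich_scene(np.array([1.0, 1.0]), [np.array([1.0, -1.0])])
c = mlp(encoder, x)                        # latent context vector (Eq. 4)
h = rng.normal(size=16)                    # stand-in bottleneck activation
print(film(h, c).shape)
```

Because only the predictor's parameters feed the FiLM modulation, training it on style-specific data leaves the frozen base policy untouched, matching the post-training design above.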
After training the base Diffusion Policy's U-Net on a large set of demonstrations, we freeze its weights and add the scene encoder, also pretrained, and the predictor MLP. We then train this new composite model on a smaller dataset containing only style-specific demonstrations. As the base model is frozen, only the style MLP is updated by the MSE loss during this training phase; it must learn to compensate, through the generation, for the difference between the legible demonstrations seen here and the classical demonstrations Diffusion Policy was trained to reproduce. When training on a specific data subset, the context vector c generated by the scene encoder passes through the MLP, which generates two vectors \gamma and \beta:

\gamma = W_\gamma c + b_\gamma \in \mathbb{R}^l, \quad \beta = W_\beta c + b_\beta \in \mathbb{R}^l \quad (5)

where l is the feature dimension of the bottleneck layer, and W_\gamma, W_\beta \in \mathbb{R}^{l \times s} and b_\gamma, b_\beta \in \mathbb{R}^l are learned weights and biases. These parameters \gamma, \beta are then used to modulate the U-Net's middle-layer feature vector h using FiLM conditioning [7]:

\text{FiLM}(h) = \gamma \odot h + \beta, \quad h \in \mathbb{R}^l \quad (6)

This results in an additional module that can be added to the base diffusion model to constrain its generation toward a desired controlled distribution.

4.4 Ambiguity Detection

Fig. 3. (a) Evaluation process: the environment state is passed through the ambiguity detection module to determine whether the scene is spatially ambiguous and to decide which conditioning should be used. (b) Visualization of the ellipse of ambiguity used for scene classification. The scene is labeled as spatially ambiguous when g^- falls inside the elliptical boundary.
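The elliptical test illustrated in Fig. 3(b) can be sketched compactly: the scene is flagged as ambiguous when the distractor lies inside an ellipse oriented along the motion direction, with the robot's current state as one focal point. The particular \kappa and semi-minor axis `b` below are illustrative choices, as the paper leaves M and \kappa as design parameters.

```python
import numpy as np

def ambiguity(g_star, g_neg, s_t, kappa=0.7, b=0.3):
    """Return 1 when the distractor g_neg falls inside an ellipse centered
    at e = s_t + kappa (g* - s_t), oriented along the motion direction,
    with s_t as one focal point. kappa and b are illustrative values."""
    d = g_star - s_t
    e = s_t + kappa * d                       # ellipse center
    c_f = kappa * np.linalg.norm(d)           # focal distance |e - s_t|
    a = np.hypot(b, c_f)                      # semi-major axis: a^2 = b^2 + c_f^2
    u = d / np.linalg.norm(d)                 # major-axis direction
    R = np.column_stack([u, [-u[1], u[0]]])   # rotation into the ellipse frame
    M = R @ np.diag([1.0 / a**2, 1.0 / b**2]) @ R.T
    v = g_neg - e
    return int(v @ M @ v <= 1.0)

s_t = np.array([0.0, 0.0])
g_star = np.array([2.0, 0.0])
print(ambiguity(g_star, np.array([1.5, 0.1]), s_t))  # distractor near the path
print(ambiguity(g_star, np.array([0.0, 2.0]), s_t))  # distractor far off-axis
```

A distractor lying near the straight path to g^* trips the test, while one well off-axis does not, which is the binary decision the detection module uses to select the conditioning.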
To approximate the probabilistic definition of spatial ambiguity from Section 3.3, we define an ellipse of ambiguity E that represents the spatial zone of confusion where a negative goal g^- is most likely to cause an observer to misinterpret the robot's intent. The ellipse is constructed such that the robot's current state s_t acts as one of its focal points, and it is centered at a point e located between the robot's current state s_t and the goal g^*. This center is defined as e = s_t + \kappa (g^* - s_t), where \kappa \in (0.5, 1) is a scaling factor. A scene is considered spatially ambiguous if any negative goal g^- falls within this elliptical boundary:

\text{Ambiguity}(g^*, g^-, s_t) = \begin{cases} 1 & \text{if } (g^- - e)^\top M (g^- - e) \le 1 \\ 0 & \text{otherwise} \end{cases} \quad (7)

Here, M is a symmetric positive-definite matrix that defines the orientation and semi-axes of the ellipse. Figure 3 shows the architecture used at inference time and a visualization of the ellipse of ambiguity for a given environment. The ambiguity detection module serves as an arbitrator that selects the optimal conditioning for the Diffusion Policy based on the environment's risk of goal confusion. If the scene is flagged as spatially ambiguous, the legibility predictor is activated to constrain the diffusion process toward a more expressive trajectory. Otherwise, the predictability predictor constrains the generation toward the most efficient path.

5 Evaluation

Our experiments are designed to evaluate how effectively the model balances legibility and efficiency across varying environmental contexts. To address this trade-off, we consider the following research questions:

• Can SCDP increase legibility performance when the situation requires higher clarity?
• Does SCDP effectively reduce legibility to prioritize efficiency in unambiguous environments?
To provide robust answers, we test these questions across diverse datasets featuring a large range of spatial configurations.

5.1 Datasets

We conduct our experiments on two tasks. Each task is divided into two environmental scenarios: spatial ambiguity and no spatial ambiguity.

Block Reach: the Block Reach task is a benchmark commonly used in legibility studies [3]. In this setup, two objects are placed randomly in the scene, and a manipulator arm, here a Franka Emika Panda robot, is required to reach one of them.

Navigation: in this task, a mobile robot has to reach one of the two goals present in the room. The robot used here is a TurtleBot.

To train our agents, we collected datasets of 200 demonstrations for each task, using the Gazebo simulator to perform and record the robots' states and actions. These demonstrations were procedurally generated using quadratic Bézier curves to ensure a diverse distribution of trajectories, ranging from near-optimal straight lines to highly curved paths.

Fig. 4. Visualization of SCDP and baseline inferences in ambiguous (top) and non-ambiguous (bottom) scene configurations for the navigation task. While Diffusion Policy captures the entire data distribution and Legibility Diffuser collapses on the most legible mode, SCDP constrains its generation depending on the scene configuration.

5.2 Baselines

We compare and evaluate our method with respect to other diffusion-based baselines.

Diffusion Policy [2]: Diffusion Policy is the original DDPM-based model our architecture is built on. Similarly to our method, it is used here in its goal-conditioned form, to isolate the effect of our contribution.
Legibility Diffuser [4]: also built on a Diffusion Policy, Legibility Diffuser is a variant that introduces a guidance term to pull the trajectory away from a negative goal g^- while moving toward g^*. As the official implementation is not available, their method was reproduced for comparison, using the following hyperparameter values: w_t = 4.5, \alpha = 0.9, and \lambda = 0.5. These parameters were chosen to recreate behavior similar to that presented in the baseline's evaluation.

Training Dataset: we additionally report the scores of the datasets used to train the different baselines.

5.3 Implementation Details

This section details the architectural and training configurations for the SCDP framework and the comparative baselines.

Base Diffusion Policy: we employ a U-Net-based Diffusion Policy [2] with approximately 70M parameters for the different baselines and SCDP. The base model is trained on a foundational dataset of 200 expert demonstrations for 100 epochs.

Scene Encoder: to extract latent environmental context, we use a 3-layer MLP comprising 26,000 parameters. This module is pre-trained using a reconstruction loss on 5,000 randomized goal configurations over 50 epochs to ensure a robust representation of spatial relations.

Conditioning Predictors: the style-specific modulation is handled by two separate 4-layer MLP backbones (1M parameters each), representing the legibility and predictability predictors. These modules are fine-tuned for 300 epochs using specialized subsets containing, respectively, the top 20% most legible and top 20% most efficient demonstrations from the original datasets.

5.4 Metrics

We evaluate the generated trajectories using two primary performance indicators, which are then combined into a final adaptive transparency score metric.
Primary Metrics. The first metric, the detachment score, was introduced by the authors of Legibility Diffuser [4] and serves as a proxy for legibility when the targeted and negative goals are close. It quantifies the divergence from the negative goal g^- along the trajectory \xi_{s \to g^*}:

D(\xi_{s_0 \to g^*}) = \sum_{s_t \in \xi_{s \to g^*}} \|g^- - s_t\|_2 \, t \quad (8)

The second metric is trajectory efficiency, calculated as the reciprocal of the total Euclidean distance traveled during motion execution:

E(\xi_{s_0 \to g^*}) = \left( \sum_{s_t \in \xi_{s \to g^*}} \|s_t - s_{t-1}\|_2 + \epsilon \right)^{-1} \quad (9)

where \epsilon is a small constant to ensure numerical stability. Values of D and E are normalized using the min-max method with respect to the values obtained from the training dataset, to ensure they share a comparable scale.

Adaptive Transparency Score. To evaluate how effectively the model navigates the conflict between legibility and efficiency, we define a trade-off metric, the adaptive transparency score T:

T = (1 - w_{amb}) \hat{D} + w_{amb} \hat{E} \quad (10)

where \hat{D} and \hat{E} are the normalized detachment and efficiency scores. The weight w_{amb} is modeled as a continuous sigmoid function of the Euclidean distance j between the goals g^* and g^-, which is also min-max normalized:

w_{amb}(j) = \frac{1}{1 + e^{-u(j - x_0)}} \quad (11)

This formulation allows for a fluid transition between behavioral modes based on the environment's spatial configuration, quantifying the agent's ability to adapt its motion style to the specific geometric requirements of the scene. In our evaluation, we set the steepness parameter u = 2.5 and the midpoint distance x_0 = 0.5.

5.5 Results

Table 1 presents the evaluation results using the adaptive transparency score T (Eq. 10) across the different methods, while Table 2 reports the detachment and trajectory efficiency scores separately. Results are averaged in simulation over 100 inferences. Figure 4 illustrates the qualitative differences in trajectory generation between the baselines in both ambiguous and non-ambiguous settings.

Table 1. Performance comparison using the adaptive transparency score for the Block Reach and Navigation tasks.

Method              | Spatial Ambiguity | No Spatial Ambiguity
Block Reach Evaluation
Diffusion Policy    | 0.52 ± 0.06       | 0.61 ± 0.06
Legibility Diffuser | 0.61 ± 0.09       | 0.43 ± 0.10
Ours (SCDP)         | 0.58 ± 0.08       | 0.74 ± 0.06
Dataset             | 0.47 ± 0.17       | 0.62 ± 0.18
Navigation Evaluation
Diffusion Policy    | 0.50 ± 0.03       | 0.61 ± 0.05
Legibility Diffuser | 0.62 ± 0.02       | 0.32 ± 0.12
Ours (SCDP)         | 0.59 ± 0.03       | 0.76 ± 0.06
Dataset             | 0.46 ± 0.17       | 0.59 ± 0.20

Performance Analysis. Our results demonstrate that SCDP effectively balances trajectory legibility and path efficiency by adapting its behavior to the scene's spatial ambiguity.

• Ambiguous scenarios: in environments requiring higher clarity, the Legibility Diffuser baseline achieves the highest mean scores (0.61 in Block Reach and 0.62 in Navigation). This is driven by its prioritization of detachment from negative goals, though it suffers from low efficiency. SCDP follows closely (0.58 and 0.59, respectively), significantly outperforming the standard Diffusion Policy.
• Non-ambiguous scenarios: when the goal is clear, SCDP consistently outperforms all other methods, achieving the highest fused scores (0.74 and 0.76) in both tasks.

Table 2. Separate detachment and trajectory efficiency scores for the Block Reach and Navigation tasks.

Method              | Spatial Ambiguity             | No Spatial Ambiguity
                    | Detachment    | Efficiency    | Detachment    | Efficiency
Block Reach Evaluation
Diffusion Policy    | 0.52 ± 0.06   | 0.67 ± 0.06   | 0.43 ± 0.09   | 0.65 ± 0.09
Legibility Diffuser | 0.89 ± 0.12 ↑ | 0.13 ± 0.10 ↓ | 0.67 ± 0.16 ↑ | 0.31 ± 0.20 ↓
Ours (SCDP)         | 0.70 ± 0.10 ↑ | 0.43 ± 0.06 ↓ | 0.42 ± 0.10   | 0.80 ± 0.12 ↑
Dataset             | 0.46 ± 0.22   | 0.64 ± 0.25   | 0.50 ± 0.22   | 0.64 ± 0.24
Navigation Evaluation
Diffusion Policy    | 0.42 ± 0.03   | 0.77 ± 0.05   | 0.58 ± 0.16   | 0.61 ± 0.04
Legibility Diffuser | 0.85 ± 0.04 ↑ | 0.02 ± 0.03 ↓ | 1.06 ± 0.28 ↑ | −0.07 ± 0.19 ↓
Ours (SCDP)         | 0.64 ± 0.02 ↑ | 0.47 ± 0.05 ↓ | 0.31 ± 0.01 ↓ | 0.85 ± 0.08 ↑
Dataset             | 0.39 ± 0.21   | 0.70 ± 0.25   | 0.47 ± 0.28   | 0.64 ± 0.26

It is worth noting that across all experiments, the success rate for reaching the target goal remained above 0.98 for all baselines and our proposed SCDP. This confirms that the style-conditioning modules modulate the trajectory path without compromising the underlying task performance of the base model.

Discussion. While Legibility Diffuser can be tuned to maximize intent expression, our approach is limited to the data seen in the training set, meaning it is unlikely to generate trajectories more legible than those observed. Nonetheless, SCDP provides a data-driven, environment-aware alternative that maintains high efficiency in clear scenarios without explicit guidance tuning, offering a superior overall trade-off across diverse environmental contexts.

6 Deployment

To evaluate the portability of SCDP beyond simulated environments, we deployed the model on a physical Franka Emika Panda robot to perform the Block Reach task, as shown in Figure 5.

6.1 Perception

For object detection, we use a fine-tuned YOLO [13] model to detect target and distractor blocks in frames from an Intel RealSense RGB-D camera. The detected 2D bounding boxes are projected into 3D space using the camera's depth map and its extrinsic calibration with the robot's base frame. The obtained coordinates can then be used by the model.
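The projection step described above can be sketched as follows, assuming a standard pinhole camera model with known intrinsics K and a camera-to-base homogeneous transform; the function name and the numeric values are illustrative, not the deployed code.

```python
import numpy as np

def bbox_to_base_frame(bbox, depth_map, K, T_base_cam):
    """Illustrative back-projection of a detected 2D bounding box to a 3D
    point in the robot base frame, assuming a pinhole model with intrinsics
    K and a known camera-to-base transform (the actual calibration pipeline
    may differ)."""
    u_min, v_min, u_max, v_max = bbox
    u, v = (u_min + u_max) / 2.0, (v_min + v_max) / 2.0  # box center pixel
    z = depth_map[int(v), int(u)]                        # depth at that pixel (m)
    # Pinhole deprojection: pixel + depth -> 3D point in the camera frame
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    p_cam = np.array([x, y, z, 1.0])
    return (T_base_cam @ p_cam)[:3]                      # express in base frame

# Toy example with assumed intrinsics and a camera 0.5 m above the base origin.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_base_cam = np.eye(4)
T_base_cam[2, 3] = 0.5
depth = np.full((480, 640), 0.8)                         # flat 0.8 m depth image
print(bbox_to_base_frame((300, 220, 340, 260), depth, K, T_base_cam))
```

In practice the transform would come from the extrinsic calibration mentioned above, and the depth would be read from the RealSense depth stream rather than a synthetic map.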
Fig. 5. Real-world deployment of SCDP on a Franka Emika Panda robot for the Block Reach task. The image sequence (left to right) illustrates the model successfully generating an exaggerated, intent-expressive trajectory toward the target blue object to resolve spatial ambiguity relative to the distractor pink object.

6.2 Sim-to-Real Transfer and Computational Efficiency

We perform inference on an NVIDIA RTX A3000 GPU. The generated trajectories are interpolated to execute motion at a frequency of 1000 Hz. By using a state-based Diffusion Policy model, the system avoids any reality gap and mirrors the performance observed in simulation, without any specific calibration or fine-tuning on real demonstrations. Our method achieves a mean total inference time of 5 seconds, similar to the base model's, as our additional modules are lightweight and add negligible latency.

7 Conclusion

We proposed an architecture that specializes a lightweight MLP, separate from the initial model, to learn a context-dependent trajectory style that conditions a pretrained Diffusion Policy model to constrain its generation when needed, in our case toward legible and predictable motion. Compared to the baseline, these style-specific modules require smaller datasets for training and impose no overhead at inference time, offering a practical enhancement for diffusion-based policies. While the focus of this work was primarily on legibility, we believe that the same learning method could be used to learn other trajectory-level concepts, such as safety, enabling style-conditioned motion generation through modular and reusable components.
Future work could aim to further develop the ambiguity detection, as the current module can be considered a geometric proxy that could be swapped for a more sophisticated ambiguity detector, and to conduct user studies to validate how human observers perceive and interpret these adaptive trajectories. Furthermore, future research could explore the framework's flexibility and scalability when faced with a higher number of potential goals.

Acknowledgments. This work used IDRIS HPC resources under the allocation 2025-[AD011017084] made by GENCI. It was funded by the French National Research Agency (ANR) under the OSTENSIVE project (ANR-24-CE33-6907-01) and the France 2030 program, reference ANR-23-PAVH-0005 (INNOVCARE project). This project has received funding from the European Union's Horizon Europe Framework Programme under grant agreement No 101070596.

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

References

1. Wolf, R., Shi, Y., Liu, S., Rayyes, R.: Diffusion Models for Robotic Manipulation: A Survey. In: Frontiers in Robotics and AI (2025)
2. Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In: Proceedings of Robotics: Science and Systems (RSS) (2023)
3. Dragan, A., Lee, K.C.T., Srinivasa, S.S.: Legibility and predictability of robot motion. In: 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 301–308 (2013)
4. Bronars, M., Cheng, S., Xu, D.: Legibility Diffuser: Offline Imitation for Intent Expressive Motion. In: IEEE Robotics and Automation Letters (RA-L) 9(11), 10161–10168 (2024)
5. Ho, J., Jain, A., Abbeel, P.: Denoising Diffusion Probabilistic Models.
In: Proceed- ing of the 34th International Conference on Neural Information Processing Systems (NeurIPS) (2020) 6. Li, H., Shen, C., T orr, P ., T resp, V., Gu, J.: Self-Disco vering Interpretable Diffusion Laten t Directions for Responsible T ext-to-Image Generation. In: IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pp. 12006–12016 (2024) 7. P erez, E., Strub, F., de V ries, H., Dumoulin, V., Courville, A.: FiLM: Visual Rea- soning with a General Conditioning Lay er. In: Pro ceedings of the AAAI Conference on Artificial Intelligence (2018) 8. Ronneb erger, O., Fisc her, P ., Brox, T.: U-Net: Con volutional Net works for Biomed- ical Image Segmentation. In: MICCAI 2015, LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015) 9. P eebles, W., Xie, S.: Scalable Diffusion Mo dels with T ransformers. In: Pro ceedings of the IEEE/CVF In ternational Conference on Computer Vision (ICCV), pp. 4172– 4182 (2022) 10. Lin, H., Cheng, X., W u, X., Y ang, F., Shen, D., W ang, Z., Song, Q., Y uan, W.: CA T: Cross A ttention in Vision T ransformer. In: IEEE International Conference on Multimedia and Exp o (ICME), pp. 1–6 (2022) 11. Dragan, A., Sriniv asa, S.S.: Generating Legible Motion. In: Pro ceedings of Rob otics: Science and Systems (RSS) (2013) 12. Ho, J., Salimans, T.: Classifier-F ree Diffusion Guidance. arXiv preprin t arXiv:2207.12598 (2022) 13. Redmon, J., Divv ala, S., Girshick, R., F arhadi, A.: Y ou Only Look Once: Unified, Real-Time Ob ject Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016) Enco ding Predictability and Legibilit y for Style-Conditioned Diffusion Policy 15 14. W allk ötter, S., T ulli, S., Castellano, G., Paiv a, A., Chetouani, M.: Explainable Em- b odied Agents Through So cial Cues: A Review. In: ACM T ransactions on Human- Rob ot Interactions (THRI) 10 (3), art. 27 (2021) 15. 
Urain, J., Mandlek ar, A., Du, Y., Shafiullah, M., Xu, D., F ragkiadaki, K., Chal- v atzaki, G., Peters, J.: Deep Generativ e Mo dels in Rob otics: A Survey on Learning from Multimo dal Demonstrations. arXiv preprint arXiv:2408.04380 (2024) 16. Lic hten thäler, C., Lorenzy , T., Kirsch, A.: Influence of legibility on perceived safety in a virtual h uman-rob ot path crossing task. In: The 21st IEEE In ternational Symp osium on Robot and Human Interactiv e Communication (RO-MAN), pp. 676–681 (2012) 17. Rumelhart, D.E., Hin ton, G.E., Williams, R.J.: Learning internal representations b y error propagation. In: Parallel Distributed Pro cessing, pp. 318–362. MIT Press (1987) 18. P anagiotakopoulos, T., K otsiantis, S., Gkillas, A., Lalos, A.S.: Conditional Dif- fusion Mo dels: A Survey of T echniques, Applications, and Challenges. In: IEEE A ccess 13 , 183617–183643 (2025) 19. Berrada, T., Astolfi, P ., Hall, M., Hemmat, R.A., Benchetrit, Y., Hav asi, M., Muc k- ley , M.J., Alahari, K., Romero-Soriano, A., V erb eek, J., Drozdzal, M.: On impro ved Conditioning Mec hanisms and Pre-training Strategies for Diffusion Mo dels. In: Pro ceeding of the 37th International Conference on Neural Information Pro cessing Systems (NeurIPS) (2024) 20. Bied, M., Chetouani, M.: In tegrating an Observ er in In teractive Reinforcement Learning to Learn Legible T ra jectories. In: 29th IEEE International Symp osium on Rob ot and Human Interactiv e Communication (RO-MAN), pp. 760–767 (2020) 21. Dragan, A.D., Bauman, S., F orlizzi, J., Sriniv asa, S.S.: Effects of Rob ot Motion on Human-Rob ot Collaboration. In: 10th ACM/IEEE International Conference on Human-Rob ot Interaction (HRI), pp. 51–58 (2015) 22. Lic hten thäler, C., Lorenz, T., Kirsch, A.: T o wards a Legibilit y Metric: How to Measure the Perceiv ed V alue of a Rob ot. In: International Conference on So cial Rob otics (ICSR) (2011) 23. 
W allk ötter, S., Chetouani, M., Castellano, G.: SLOT-V: Sup ervised Learning of Observ er Models for Legible Robot Motion Planning in Manipulation. In: 31st IEEE In ternational Conference on Robot and Human In teractive Communication (R O-MAN), pp. 1421–1428 (2022) 24. Zare, M., Kebria, P .M., Khosravi, A., Nahav andi, S.: A Surv ey of Imitation Learn- ing: Algorithms, Recent Developmen ts, and Challenges. In: IEEE T ransactions on Cyb ernetics 54 (12), 7173–7186 (2024)