ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Cheng Luo¹†, Bizhu Wu²,⁴,⁵†, Bing Li¹*, Jianfeng Ren⁴, Ruibin Bai⁴, Rong Qu⁵, Linlin Shen²,³*, and Bernard Ghanem¹

¹ King Abdullah University of Science and Technology
² Computer Vision Institute, School of Artificial Intelligence, Shenzhen University
³ Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University
⁴ School of Computer Science, University of Nottingham Ningbo China
⁵ School of Computer Science, University of Nottingham, United Kingdom

Project page: https://reactmotion.github.io

† Equal contribution. * Corresponding authors.

Abstract. In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to assess reactive appropriateness, which conventional motion metrics focused on input–motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.

Keywords: Dyadic interaction · Interactional AI systems

1 Introduction

Modeling dyadic human communication is crucial for virtual agents [33], digital humans [50, 104], and social robots [71]. While prior work has advanced speech-to-speech dialogue [15], language-based interfaces [1, 28], and listener facial reactions [57, 70], reactive listener body motions remain largely overlooked despite being central to face-to-face interaction. Listeners often convey engagement and understanding through posture and subtle gestures, and generating such feedback is important for natural dyadic communication.

Fig. 1: Illustration of the proposed new task: Reactive Listener Motion Generation from Speech Utterance. Given a speaker's utterance, i.e., transcript and/or audio (optionally supplemented with emotion), a generative model such as our ReactMotion generates a corresponding responsive body-motion sequence for the listener.

We introduce a new task, Reactive Listener Motion Generation from Speech Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance given its audio and/or transcript.
Unlike text-to-motion [21, 62, 75, 76, 96] or audio-driven motion generation [88], which primarily realize the input content, our setting models conversational reactions, where speaker cues are indirect and the output is inherently one-to-many.

This task poses three challenges. (i) The same utterance can elicit multiple valid listener reactions [57, 70]; such non-deterministic listener behavior makes modeling the listener's motion responses difficult. (ii) To the best of our knowledge, there is no publicly available large-scale dataset with multiple listener-reactive body motions per utterance. (iii) Reactive appropriateness is difficult to evaluate: metrics based on a single ground truth or on motion diversity are insufficient to measure the appropriateness of a listener's reaction.

To address these challenges, we introduce ReactMotionNet, a curated dataset with 151,328 (speaker utterance, listener motion) pairs. Unlike prior motion datasets that typically provide a single target per condition, we associate each utterance with multiple candidate reactions and annotate them into three preference tiers: Gold, Silver, and Negative. This tiered design captures one-to-many ambiguity and enables preference-style supervision and evaluation [11, 13, 102]. Moreover, we propose a scalable pipeline that re-purposes existing motion data into dyadic speaker–listener pairs for dataset construction, which avoids relying on expensive speaker–listener motion capture.

To evaluate reactive appropriateness, we introduce a tier-aware ranking protocol. We train a multimodal judge network to score and rank candidate reactions under the same speaker input and report win rates against the Gold, Silver, or Negative tiers. This relative evaluation goes beyond single-reference similarity and better reflects that multiple reactions can be appropriate for the same utterance. Finally, we propose ReactMotion, a unified generative framework that jointly models speaker transcript, emotion, and audio to generate listener motions. We leverage the tiered annotations with preference-based objectives that learn from relative comparisons within each utterance group during training.

Contributions. (i) To the best of our knowledge, we introduce the first task of reactive listener body motion generation from speaker speech in dyadic interaction. (ii) We present ReactMotionNet, a new dataset with multi-tier (Gold/Silver/Negative) reactive listener motions and a tier-aware evaluation protocol for reactive appropriateness, enabling research on nonverbal listener response behavior. (iii) We propose ReactMotion, a unified multimodal generative model that processes multiple speaker cues and generates high-quality listener body motions in response to the speaker.

2 Related Work

Human Motion Generation. Human motion generation can be conditioned on diverse modalities, including text [8, 30, 42, 48, 52, 63, 78, 84, 95, 98], action classes [60, 64, 74], and audio signals such as music [37, 39, 40, 90] or speech [38, 45, 85]. Among these, text- and audio-driven motion generation are most related to our setting. Text-based approaches generate motions from explicit action descriptions [4, 10, 18, 26, 34, 61, 79, 80, 97, 101, 105], while audio-driven methods synthesize gestures aligned with temporally synchronized acoustic signals [7, 53, 99].
Representative modeling paradigms include transformer-based latent models (e.g., [43, 60, 100]), discrete motion tokenization with autoregressive modeling (e.g., [3, 9, 91, 96]), and diffusion-based frameworks (e.g., [2, 22, 44, 76]). Beyond single-person generation, recent works [24, 41, 53, 55, 73, 81] extend motion synthesis to multi-person scenarios. These approaches typically generate multi-person motions by conditioning on explicit textual descriptions of joint actions or on the audio streams of both individuals. In contrast, in our problem setting the target motion is not directly specified by explicit action instructions or synchronized signals. Instead, the model must infer the implicit interaction intention from the speaker's utterance, including transcript, audio, and emotion cues, and produce a socially appropriate reactive motion for the listener. This requires reasoning over cross-speaker dynamics rather than a direct condition-to-motion mapping.

Human Reaction Generation. Human reaction generation is crucial for AI interaction systems. Spoken language modeling has progressed from cascaded ASR → LLM → TTS pipelines to end-to-end and full-duplex speech-to-speech models [15, 66, 77, 94], while facial reaction generation has advanced from conditional GANs [27] to uncertainty-aware and diffusion-based methods [49, 51, 57, 70, 103]. Audio-visual face-to-face dialogue modeling has also been explored [14, 57, 59, 103]. In 3D human body modeling, most methods synthesize reactor motion conditioned on actor motion [12, 17, 46, 47, 87]. For instance, InterFormer [12] uses temporal-spatial attention in Transformers, and ReGenNet [87] and ReMoS [17] employ diffusion models for full-body motion. Recently, HERO [93] generates 3D reactive motion directly from RGB videos, incorporating the actor's facial expressions to capture emotional cues. Differently, our method generates 3D reactor motion from the speaker's utterance, which includes the transcript, audio, and optional emotion annotations. The transcript provides a lightweight, user-friendly modality, the audio offers rich vocal cues, and emotion labels explicitly indicate mood, facilitating more effective interaction modeling.

3D Human Body Interaction Datasets. Recent datasets have facilitated research on multi-person dynamics and interaction-aware 3D motion. Several works [20, 25, 41, 86, 92] provide paired human motions, modeling interaction as symmetric kinematic coupling, where one participant's motion is predicted from the other's. While effective for spatial coordination, this ignores the linguistic and affective signals that drive conversation. Other datasets [31, 32, 35, 56, 67, 68, 93] supply silent RGB videos with 3D reactive motions, offering richer context but still lacking speech semantics and emotional cues, which are central to communicative intent. Some datasets [24, 36, 55, 73] include both audio and motion for human interactions, but their movements primarily involve the upper body, such as the arms, and are limited to one-to-one speaker–listener pairs. In contrast, our dataset provides a one-to-many mapping between speaker utterances and listener reactive motions. Each utterance has multiple responses labeled gold, silver, and negative for appropriate, partially appropriate, and irrelevant reactions, making it better suited for practical applications.
In addition, the curated motions are more dynamic (e.g., jumping), enabling more diverse body reactions.

3 Task Definition

In this paper, we study Reactive Listener Motion Generation in dyadic interaction, which involves a speaker and a listener. Given a speaker utterance $C^s$, the goal is to generate an appropriate reactive body motion of the listener, denoted as $R^l$. Formally, the objective is to learn the conditional distribution:

$$p_\theta\left(R^l \mid C^s\right), \qquad C^s \in \left\{ A^s,\ T^s,\ (A^s, T^s),\ (A^s, E^s),\ (T^s, E^s),\ (A^s, T^s, E^s) \right\}. \tag{1}$$

Here, $A^s$ denotes the speaker audio, $T^s$ is the corresponding textual transcript, $E^s$ represents the speaker emotion, and $\theta$ denotes the model parameters. As shown in Eqn. 1, $C^s$ may consist of different modalities of the speaker utterance or their combinations. At inference time, diverse listener reactions can be sampled from $p_\theta(R^l \mid C^s)$. In contrast to conventional text-to-motion generation, the speaker utterance does not explicitly specify the target listener motion. The mapping from $C^s$ to $R^l$ is therefore inherently one-to-many, which requires the model to generate motions that are contextually appropriate while maintaining diversity.
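To make the conditioning set in Eqn. 1 concrete, the sketch below shows one way the speaker utterance could be packaged as an arbitrary subset of modalities and multiple listener reactions sampled from the learned distribution. The `SpeakerUtterance` container, the `model.sample` call, and `sample_reactions` are hypothetical placeholders for illustration, not an interface defined by the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class SpeakerUtterance:
    """Speaker condition C^s: any subset of {audio A^s, transcript T^s, emotion E^s}."""
    audio: Optional[np.ndarray] = None   # raw waveform A^s
    transcript: Optional[str] = None     # transcript T^s
    emotion: Optional[str] = None        # emotion label E^s

    def modalities(self) -> tuple:
        """Report which modalities are present, mirroring the set in Eqn. 1."""
        present = [("audio", self.audio), ("transcript", self.transcript), ("emotion", self.emotion)]
        return tuple(name for name, value in present if value is not None)


def sample_reactions(model, condition: SpeakerUtterance, num_samples: int = 5) -> List[np.ndarray]:
    """Draw several listener motions R^l ~ p_theta(R^l | C^s).

    `model.sample` is a hypothetical generation call; drawing several samples per
    utterance reflects the one-to-many nature of the task.
    """
    return [model.sample(condition) for _ in range(num_samples)]


# Example condition using transcript + emotion, i.e. C^s = (T^s, E^s):
cs = SpeakerUtterance(transcript="I'm so excited you're here!", emotion="ecstatic")
print(cs.modalities())  # ('transcript', 'emotion')
```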
Fig. 2: ReactMotionNet dataset construction. We curate dyadic listener motions (Step 1), synthesize speaker conditions via inverse inference and Text-to-Speech (TTS) (Step 2), filter unreliable samples (Step 3), and rank/re-tier speaker–listener pairs into gold/silver/negative preferences (Step 4).

4 ReactMotionNet Dataset

To bridge the gap between existing 3D human motion interaction datasets and real-world conversational dynamics, we construct a dataset, ReactMotionNet, featuring one-to-many speaker utterance–listener reaction mappings with graded appropriateness annotations. To construct this dataset, we present a novel data construction pipeline (Fig. 2) that repurposes existing human motion data into speaker–listener motion–response pairs using powerful LLMs [58, 89], thereby avoiding costly data collection.

4.1 Dataset Construction Pipeline

Step 1: Dyadic Listener Reactive Motion Curation. Unlike existing audio-driven 3D human interaction datasets, which mainly focus on upper-body movements while standing still, we curate motions from the more dynamic and commonly used HumanML3D dataset [19]. Leveraging the textual captions of motions, we filter out conversation-irrelevant ones (e.g., doing a handstand) using multiple LLM-based verifiers (e.g., ChatGPT-o1 [29], ChatGPT-o3 mini [58]). This step results in a set of motions with reaction-like semantics, which serve as the listener's reactive motions.

Step 2: Inverse Speaker-Condition Synthesis. For each listener motion $R^l$ from the previous step, we infer multiple plausible speaker utterances that could elicit the observed reaction. Concretely, we input the listener motion's caption into OpenAI o3-mini [1, 58, 69] to generate potential speaker transcripts $T^s$ and associated emotion labels $E^s$. We incorporate emotion into utterance generation, as the speaker's emotional state influences the listener's reaction. For example, the same transcript, "Do whatever you want," can lead to different responses: a supportive tone may cause the listener to jump happily in place, whereas a frustrated tone may cause the listener to walk away feeling hurt. Given $T^s$ and $E^s$, we synthesize the corresponding speaker audio $A^s$ using GPT-4o mini TTS [28]. These steps produce a pool of possible speaker utterances ($A^s$, $T^s$, $E^s$).

Table 1: Dataset statistics. #Pairs is the total number of labeled speaker–listener pairs (i.e., candidate reactions). #Trans., #Audio, and #Emo. denote the numbers of unique transcripts, audio files, and emotion categories of the speaker utterances, respectively. #Motion is the number of unique listener motion sequences. #Motion/Utter. reports the average number of candidate motions per speaker utterance. Label counts report the numbers of gold/silver/negative candidates (#G / #S / #N).

| Split | #Pairs  | #Trans. | #Audio | #Emo. | #Motion | #Motion/Utter. (avg.) | Labels (#G / #S / #N)    |
|-------|---------|---------|--------|-------|---------|-----------------------|--------------------------|
| Train | 137,879 | 6,631   | 6,631  | 46    | 1,822   | 20.79                 | 7,527 / 30,862 / 99,490  |
| Val   | 6,790   | 841     | 841    | 40    | 195     | 8.07                  | 903 / 1,682 / 4,205      |
| Test  | 6,659   | 826     | 826    | 39    | 197     | 8.06                  | 877 / 1,652 / 4,130      |
| All   | 151,328 | 8,298   | 8,298  | 47    | 2,029   | 18.24                 | 9,307 / 34,196 / 107,825 |

Step 3: Data Filtering. We perform a series of procedures to ensure dataset quality. First, for each speaker utterance, we verify whether the synthesized audio $A^s$ faithfully reflects the intended emotion $E^s$. Specifically, we apply an automatic speech emotion recognizer (i.e., Hume AI⁶) to the generated audio and discard any utterance whose predicted emotion is inconsistent with its assigned emotion label. Next, we pair each remaining speaker utterance with the caption of every listener reactive motion $R^l$ obtained in Step 1. We then employ Qwen (Qwen3-235B-A22B-Instruct) [89] to assign a dyadic-conversation appropriateness score to each pair of speaker utterance and listener motion caption. For each speaker utterance, we retain only the top-scoring listener reactive motions, thereby removing inappropriate pairs.
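A minimal sketch of the Step 3 filtering logic is shown below. The emotion-consistency check and the appropriateness scoring are represented by placeholder callables (`recognize_emotion`, `score_appropriateness`) standing in for the Hume AI recognizer and the Qwen-based scorer, and the top-k cutoff is an assumed hyperparameter rather than a value stated in the paper.

```python
from typing import Callable, Dict, List, Tuple


def filter_and_pair(
    utterances: List[Dict],                 # each: {"audio": ..., "transcript": str, "emotion": str}
    motion_captions: List[str],             # captions of curated listener motions from Step 1
    recognize_emotion: Callable,            # placeholder for the speech emotion recognizer (Hume AI)
    score_appropriateness: Callable,        # placeholder for the LLM-based appropriateness scorer (Qwen)
    top_k: int = 20,                        # assumed number of candidates kept per utterance
) -> List[Tuple[Dict, str, float]]:
    """Keep emotion-consistent utterances and their top-k most appropriate motion captions."""
    kept_pairs = []
    for utt in utterances:
        # Emotion consistency check: discard utterances whose synthesized audio
        # does not match the intended emotion label E^s.
        if recognize_emotion(utt["audio"]) != utt["emotion"]:
            continue
        # Score every (utterance, motion caption) pair and retain the top-k candidates.
        scored = [
            (cap, score_appropriateness(utt["transcript"], utt["emotion"], cap))
            for cap in motion_captions
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        kept_pairs.extend((utt, cap, score) for cap, score in scored[:top_k])
    return kept_pairs
```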
Step 4: Speaker–Listener Candidate Ranking and Preference Tiering. Given a pair consisting of a speaker utterance and one of its corresponding listener reactive motions from Step 3, we use multiple agents (i.e., ChatGPT-o1 [29], ChatGPT-o3 mini [58], and Qwen3-235B-A22B-Instruct [89]) to evaluate the pair. They score it according to (1) semantic appropriateness (whether the reaction fits the utterance) and (2) conversational plausibility (whether it sounds like a natural dyadic response). We further use a natural language inference (NLI) model⁷ to verify whether the listener motion caption is a logically plausible inference from the speaker utterance. We then take a weighted sum of the agents' scores to obtain a final score, which is used to label the pair as gold, silver, or negative according to predefined thresholds.

⁶ https://www.hume.ai/expression-measurement
⁷ https://huggingface.co/MoritzLaurer/deberta-v3-large-zeroshot-v1.1-all-33

Fig. 3: Overview of the ReactMotion framework. We use modality-specific tokenizers to convert raw data, i.e., the speaker's utterances (including transcript, audio, and emotion) and the listener's reactive motions, into discrete special tokens. With these tokenizers, a Seq2Seq model is employed to integrate information across modalities and to learn to generate the listener's reactive motions from the speaker's utterances.

4.2 Dataset Statistics

In total, our dataset contains 151,328 labeled (speaker utterance, listener reactive motion) pairs, covering 8,298 unique speaker utterances and 2,029 unique listener reactive motions. On average, each speaker utterance is paired with 18.24 candidate reactive motions, highlighting the one-to-many nature of listener reactions. Overall, 9,307, 34,196, and 107,825 pairs are labeled as Gold, Silver, and Negative, respectively, reflecting the graded appropriateness of candidate reactions. We split the dataset by speaker utterance with an 8:1:1 ratio for train/val/test, such that speaker utterances are disjoint across splits (i.e., no utterance appears in more than one split). Tab. 1 lists detailed statistics. Our automated construction pipeline further enables straightforward scaling to larger datasets.

5 Methodology

We present ReactMotion, a unified framework for Reactive Listener Motion Generation from Speaker Utterance. As illustrated in Fig. 3, we first introduce modality-specific tokenizers that convert raw inputs, i.e., the speaker utterance (including transcript, audio, and emotion) and the listener's reactive motions, into discrete special tokens. With these tokenizers, we employ a Seq2Seq model to unify information across modalities and learn the conditional distribution of the task (Eqn. 1). To capture the one-to-many nature of dyadic interactions, we further train the model with a group-wise preference-based learning objective, which explicitly allows the generation of multiple appropriate reactions for the same speaker utterance.
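The excerpt above does not spell out the exact form of this group-wise objective. As one plausible illustration of learning from relative comparisons within an utterance group, the sketch below applies a margin ranking loss over per-candidate sequence log-likelihoods so that gold candidates score above silver, and silver above negative; this is an assumed instantiation for clarity, not necessarily ReactMotion's actual loss.

```python
import torch
import torch.nn.functional as F


def groupwise_preference_loss(logp_gold, logp_silver, logp_negative, margin: float = 1.0):
    """Margin ranking loss over log p_theta(R^l | C^s) values within one utterance group.

    Each argument is a 1-D tensor of sequence log-likelihoods for candidates of that
    tier. The loss penalizes any lower-tier candidate that comes within `margin`
    of a higher-tier one, enforcing gold > silver > negative.
    """
    def rank_term(higher, lower):
        diff = higher.unsqueeze(1) - lower.unsqueeze(0)   # all higher-vs-lower pairs
        return F.relu(margin - diff).mean()

    return rank_term(logp_gold, logp_silver) + rank_term(logp_silver, logp_negative)


# Toy usage with made-up log-likelihoods for one speaker utterance:
loss = groupwise_preference_loss(
    logp_gold=torch.tensor([-10.0, -11.5]),
    logp_silver=torch.tensor([-12.0, -13.0, -12.5]),
    logp_negative=torch.tensor([-15.0, -14.0]),
)
print(loss.item())
```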
5.1 Modality-Specific Tokenization

We employ modality-specific tokenizers to convert raw data from different modalities into discrete tokens.

Audio Tokenization. We use Moshi [15] (specifically its neural audio codec, MiMi) to convert the audio waveform of the speaker utterance $A^s$ into discrete codes. Its audio encoder $E_{aud}(\cdot)$ extracts audio features from $A^s$, which are then quantized using the base codebook $\mathcal{C}_{aud}$:

$$h^s_a = E_{aud}(A^s), \qquad x^s_a = Q_{aud}(h^s_a), \tag{2}$$

where the quantizer $Q_{aud}(\cdot)$ maps the features to their nearest entries in the codebook $\mathcal{C}_{aud}$ and outputs the corresponding codebook indices $x^s_a$. The resulting indices are treated as discrete audio tokens, allowing the unified model to incorporate audio information while retaining prosody and paralinguistic cues that are informative for reactive behaviors.

Motion Tokenization. We represent the listener's reactive motion $R^l$ as discrete tokens following [96], analogous to the audio tokenization process:

$$h^l_m = E_{mot}(R^l), \qquad x^l_m = Q_{mot}(h^l_m), \tag{3}$$

where $E_{mot}$ and $Q_{mot}$ are the motion encoder and quantizer, respectively, and $x^l_m$ are the discrete indices of the motion codebook $\mathcal{C}_{mot}$. The listener reactive motion predicted by the unified model in the form of discrete tokens can be mapped back to raw motion data through:

$$h^l_m = Q^{-1}_{mot}(x^l_m), \qquad R^l = D_{mot}(h^l_m), \tag{4}$$

where $Q^{-1}_{mot}(\cdot)$ maps the discrete token indices back to the corresponding codebook vectors, and a VQ-VAE motion decoder [82, 96] $D_{mot}(\cdot)$ decodes the vectors back to the raw motion data.

5.2 Unified Seq2Seq Modeling

With the above modality-specific tokenizers, we can represent information across modalities in a unified space, enabling a Seq2Seq model to generate a listener reactive motion conditioned on the speaker utterance. Specifically, we adopt T5-base [65] as the Seq2Seq backbone and extend its original textual vocabulary $\mathcal{V}_t$ to include audio and motion vocabularies:

$$\mathcal{V} = \mathcal{V}_t \cup \mathcal{V}_m \cup \mathcal{V}_a \cup \mathcal{V}_s, \tag{5}$$

where $\mathcal{V}_m$ are the code indices of the motion codebook $\mathcal{C}_{mot}$, represented as special tokens indexed from $0$ to $|\mathcal{C}_{mot}| - 1$, and $\mathcal{V}_a$ are the code indices of the audio codebook $\mathcal{C}_{aud}$, represented in the same way.
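As a concrete illustration of how codebook indices can be folded into the T5 vocabulary (Eqn. 5), the sketch below maps motion and audio code indices to special-token strings and registers them with a Hugging Face T5 tokenizer and model. The token naming scheme (`<audio_i>`, `<motion_i>`) and the codebook sizes are assumptions for illustration, not the exact conventions or sizes used by ReactMotion.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed codebook sizes for illustration; the real sizes depend on the
# MiMi audio codec and the motion VQ-VAE configuration.
NUM_AUDIO_CODES = 2048
NUM_MOTION_CODES = 512

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Extend the textual vocabulary V_t with audio (V_a) and motion (V_m) tokens.
audio_tokens = [f"<audio_{i}>" for i in range(NUM_AUDIO_CODES)]
motion_tokens = [f"<motion_{i}>" for i in range(NUM_MOTION_CODES)]
tokenizer.add_tokens(audio_tokens + motion_tokens)
model.resize_token_embeddings(len(tokenizer))


def codes_to_tokens(indices, prefix):
    """Render quantizer indices (e.g., x^s_a or x^l_m) as special-token strings."""
    return " ".join(f"<{prefix}_{i}>" for i in indices)


# Example: build a source sequence from transcript, emotion, and audio code indices.
speaker_text = "I'm so excited you're here!"
speaker_emotion = "ecstatic"
speaker_audio_codes = [17, 403, 95]  # in practice produced by the MiMi encoder/quantizer
source = f"{speaker_text} emotion: {speaker_emotion} {codes_to_tokens(speaker_audio_codes, 'audio')}"
input_ids = tokenizer(source, return_tensors="pt").input_ids
print(input_ids.shape)
```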
