ReactMotion: Generating Reactive Listener Motions from Speaker Utterance

Cheng Luo1†, Bizhu Wu2,4,5†, Bing Li1*, Jianfeng Ren4, Ruibin Bai4, Rong Qu5, Linlin Shen2,3*, and Bernard Ghanem1

1 King Abdullah University of Science and Technology
2 Computer Vision Institute, School of Artificial Intelligence, Shenzhen University
3 Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University
4 School of Computer Science, University of Nottingham Ningbo China
5 School of Computer Science, University of Nottingham, United Kingdom

† Equal contribution. * Corresponding authors.

Project page: https://reactmotion.github.io

Abstract. In this paper, we introduce a new task, Reactive Listener Motion Generation from Speaker Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance. However, modeling such nonverbal listener behaviors remains underexplored and challenging due to the inherently non-deterministic nature of human reactions. To facilitate this task, we present ReactMotionNet, a large-scale dataset that pairs speaker utterances with multiple candidate listener motions annotated with varying degrees of appropriateness. This dataset design explicitly captures the one-to-many nature of listener behavior and provides supervision beyond a single ground-truth motion. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to evaluating reactive appropriateness, which conventional motion metrics focusing on input–motion alignment ignore. We further propose ReactMotion, a unified generative framework that jointly models text, audio, emotion, and motion, and is trained with preference-based objectives to encourage both appropriate and diverse listener responses. Extensive experiments show that ReactMotion outperforms retrieval baselines and cascaded LLM-based pipelines, generating more natural, diverse, and appropriate listener motions.

Keywords: Dyadic interaction · Interactional AI systems

1 Introduction

Modeling dyadic human communication is crucial for virtual agents [33], digital humans [50, 104], and social robots [71]. While prior work has advanced speech-to-speech dialogue [15], language-based interfaces [1, 28], and listener facial reactions [57, 70], reactive listener body motions remain largely overlooked despite being central to face-to-face interaction.

Fig. 1: Illustration of the proposed new task: Reactive Listener Motion Generation from Speech Utterance. Given a speaker's utterance, i.e., transcript and/or audio (optionally supplemented with emotion), a generative model such as our ReactMotion generates a corresponding responsive body-motion sequence for the listener.

Listeners often convey engagement and understanding through posture and subtle gestures, and generating such feedback is important for natural dyadic communication. We introduce a new task, Reactive Listener Motion Generation from Speech Utterance, which aims to generate naturalistic listener body motions that appropriately respond to a speaker's utterance given its audio and/or transcript.
Unlik e text-to-motion [21, 62, 75, 76, 96] or audio-driven motion generation [88] that primarily realize the input con tent, our setting models con versational reac- tions where speaker cues are indirect and the output is inherently one-to-many . This task p oses three challenges. (i) The same utterance can elicit m ultiple v alid listener reactions [57, 70]. Suc h non-deterministic listener behaviour p oses a significan t challenge for modeling the listener’s motion resp onses. (ii) There is no publicly a v ailable large-scale dataset with multiple listener-reactiv e b o dy motions p er utterance, to the b est of our knowledge. (iii) Reactive appropriateness is difficult to ev aluate. Metrics based on a single ground truth or motion diversit y are insufficien t to measure the appropriateness of a listener’s reaction. T o address these challenges, we introduce ReactMotionNet , a curated dataset with 151,328 (sp eak er utterance, listener motion) pairs. Unlike prior motion datasets that typically pro vide a single target p er condition , we asso ciate eac h utterance with multiple candidate reactions and annotate them in to three preference tiers, Gold , Silver , and Ne gative . This tiered design captures one-to-man y am biguity and enables preference-st yle supervision and ev al- uation [11, 13, 102]. Moreov er, w e prop ose a scalable pip eline that re-purposes existing motion data into dy adic sp eak er-listener pairs for dataset construction, whic h av oids relying on exp ensiv e sp eak er–listener motion capture . T o ev aluate reactiv e appropriateness, we in tro duce a tier-a ware ranking proto col . W e train a m ultimo dal judge net work to score and rank candidate re- actions under the same speaker input and rep ort win rates against the Gold, Silv er, or Negative tiers. This relativ e ev aluation goes beyond single- reference similarit y and better reflects that m ultiple reactions can be ap- propriate for the same utterance. Finally , w e prop ose ReactMotion , a unified generativ e framew ork that join tly mo dels sp eak er transcript, emotion, and audio Abbreviated pap er title 3 to generate listener motions. W e leverage the tiered annotations with preference- based ob jectives that learn from r elative comparisons within each utterance group for the training. Contributions. (i) T o the b est of our knowledge, we in tro duce the first task of reactiv e listener b o dy motion generation from sp eak er sp eec h in dy adic in- teraction. (ii) W e present ReactMotionNet , a new dataset with multi-tier (Gold/Silv er/Negative) reactive listener motions and a tier-aw are ev aluation proto col for reactiv e appropriateness, enabling researc h on nonv erbal listener resp onse behavior. (iii) W e propose ReactMotion , a unified m ultimo dal gen- erativ e mo del that pro cesses multiple sp eak er cues and generates high-quality listener b ody motions in resp onse to the speaker. 2 Related W ork Human Motion Generation. Human motion generation can b e conditioned on diverse mo dalities, including text [8, 30, 42, 48, 52, 63, 78, 84, 95, 98], action classes [60, 64, 74], and audio signals such as m usic [37, 39, 40, 90] or sp eec h [38, 45, 85]). Among these, text- and audio-driven motion generation are most related to our setting. T ext-based approac hes generate motions from explicit action de- scriptions [4, 10, 18, 26, 34, 61, 79, 80, 97, 101, 105], while audio-driv en methods syn- thesize gestures aligned with temp orally sync hronized acoustic signals [7, 53, 99]. 
Represen tative mo deling paradigms include transformer-based latent mo dels ( e.g. , [43, 60, 100]), discrete motion tokenization with autoregressive mo deling ( e.g. , [3, 9, 91, 96]), and diffusion-based framew orks ( e.g. , [2, 22, 44, 76]). Bey ond single-person generation, recent w orks [24, 41, 53, 55, 73, 81] extend motion synthesis to multi-person scenarios. These approac hes typically generate m ulti-p erson motions b y conditioning on explicit textual descriptions of joint actions or on the audio streams of both individuals. In contrast, our problem setting differs in that the target motion is not directly sp ecified b y explicit action instructions or synchronized signals. Instead, the model must infer the implicit in teraction in tention from the sp eak er’s utterance, including transcript, audio, and emotion cues, and pro duce a so cially appropriate reactive motion for the listener. This requires reasoning o ver cross-sp eak er dynamics rather than direct condition-to-motion mapping. Human Reaction Generation. Human reaction generation is crucial for AI in- teraction systems. Sp oken language mo deling has progressed from cascaded ASR → LLM → TTS pipelines to end-to-end and full-duplex sp eech-to-speech mo d- els [15, 66, 77, 94], while facial reaction generation has adv anced from conditional GANs [27] to uncertain ty-a w are and diffusion-based metho ds [49, 51, 57, 70, 103]. Audio-visual face-to-face dialogue mo deling has b een explored [14, 57, 59, 103]. In 3D human b ody modeling, most methods syn thesize reactor motion con- ditioned on actor motion [12, 17, 46, 47, 87]. F or instance, In terF ormer [12] uses temp oral-spatial atten tion in T ransformers, and ReGenNet [87] and ReMoS [17] emplo y diffusion mo dels for full-b ody motion. Recen tly , HERO [93] generates 4 C. Luo et al 3D reactive motion directly from R GB videos, incorporating the actor’s facial expressions to capture emotional cues. Differently , our metho d generates 3D re- actor motion from the sp eaker’s utterance, whic h includes transcript, audio, and optional emotion annotations. T ranscript provides a light w eight, user-friendly mo dalit y , audio offers ric h v o cal cues, and emotion labels explicitly indicate mo od, facilitating more effective interaction modeling. 3D Human Bo dy Interaction Datasets. Recen t datasets hav e facilitated researc h on multi-person dynamics and interaction-a w are 3D motion. Sev eral w orks [20, 25, 41, 86, 92] pro vide paired h uman motions, modeling interaction as symmetric kinematic coupling, where one participan t’s motion is predicted from the other’s. While effectiv e for spatial co ordination, this ignores linguistic and affectiv e signals that drive con versation. Other datasets [31, 32, 35, 56, 67, 68, 93] supply silen t R GB videos with 3D reactiv e motions, offering richer con text but still lacking sp eec h seman tics and emotional cues, which are central to communicativ e inten t. Some datasets [24, 36, 55, 73] include b oth audio and motion for human interactions, but the mov emen ts of their motions primarily fo cus on the upp er b ody , such as arms, and are limited to one-to-one sp eak er-listener pairs. In contrast, our dataset provides a one-to-many mapping b et ween sp eaker utterances and listener reactive motions. Eac h utterance has multiple resp onses lab eled gold , silver , and ne g for appropriate, partially appropriate, and irrelev ant reactions, making it b etter suited for practical applications. 
In addition, the motions in our dataset are more dynamic (e.g., jumping), enabling more diverse body reactions.

3 Task Definition

In this paper, we study Reactive Listener Motion Generation in dyadic interaction, which consists of a speaker and a listener. Given a speaker utterance C_s, the goal is to generate an appropriate reactive body motion of the listener, denoted as R_l. Formally, the objective is to learn the conditional distribution:

p_θ(R_l | C_s),   C_s ∈ {A_s, T_s, (A_s, T_s), (A_s, E_s), (T_s, E_s), (A_s, T_s, E_s)}.   (1)

Here, A_s denotes the speaker audio, T_s is the corresponding textual transcript, E_s represents the speaker emotion, and θ denotes the model parameters. As shown in Eqn. 1, C_s may consist of different modalities of the speaker utterance or their combinations. At inference time, diverse listener reactions can be sampled from p_θ(R_l | C_s). In contrast to conventional text-to-motion generation, the speaker utterance does not explicitly specify the target listener motion. The mapping from C_s to R_l is therefore inherently one-to-many, which requires the model to generate motions that are contextually appropriate while maintaining diversity.
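To make the conditioning modes of Eq. (1) concrete, the sketch below enumerates the supported modality combinations and samples several listener reactions for one utterance. The `SpeakerUtterance` container and the `model.sample` call are hypothetical placeholders for illustration, not the released interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpeakerUtterance:
    audio: Optional[bytes] = None       # A_s: waveform or codec tokens
    transcript: Optional[str] = None    # T_s
    emotion: Optional[str] = None       # E_s, e.g. "excited"

# The six condition modes of Eq. (1): A, T, T+A, A+E, T+E, T+A+E.
CONDITION_MODES = [
    {"audio"}, {"transcript"},
    {"audio", "transcript"}, {"audio", "emotion"},
    {"transcript", "emotion"}, {"audio", "transcript", "emotion"},
]

def sample_reactions(model, utterance: SpeakerUtterance, mode: set,
                     num_samples: int = 3) -> list:
    """Draw several listener motions R_l ~ p_theta(R_l | C_s), keeping only
    the modalities that are active in `mode`."""
    cond = SpeakerUtterance(
        audio=utterance.audio if "audio" in mode else None,
        transcript=utterance.transcript if "transcript" in mode else None,
        emotion=utterance.emotion if "emotion" in mode else None,
    )
    # `model.sample` is a hypothetical generation call; drawing multiple
    # samples per condition realizes the one-to-many mapping.
    return [model.sample(cond) for _ in range(num_samples)]
```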
Fig. 2: ReactMotionNet dataset construction. We curate dyadic listener motions (Step 1), synthesize speaker conditions via inverse inference and Text-to-Speech (TTS) (Step 2), filter unreliable samples (Step 3), and rank/re-tier speaker–listener pairs into gold/silver/negative preferences (Step 4).

4 ReactMotionNet Dataset

To bridge the gap between existing 3D human motion interaction datasets and real-world conversational dynamics, we construct a dataset, ReactMotionNet, featuring one-to-many speaker utterance–listener reaction mappings with graded appropriateness annotations. To construct this dataset, we present a novel data construction pipeline (Fig. 2) that repurposes existing human motion data into speaker–listener motion–response pairs using powerful LLMs [58, 89], thereby avoiding costly data collection.

4.1 Dataset Construction Pipeline

Step 1: Dyadic Listener Reactive Motion Curation. Unlike existing audio-driven 3D human interaction datasets, which mainly focus on upper-body movements while standing still, we curate motions from the more dynamic and commonly used HumanML3D dataset [19]. Leveraging the textual captions of motions, we filter out conversation-irrelevant ones (e.g., doing a handstand) using multiple LLM-based verifiers (e.g., ChatGPT-o1 [29], ChatGPT-o3 mini [58]). This step results in a set of motions with reaction-like semantics, which serve as the listener's reactive motions.

Step 2: Inverse Speaker-Condition Synthesis. For each listener motion R_l from the last step, we infer multiple plausible speaker utterances that could elicit the observed reaction. Concretely, we input the listener motion's caption into OpenAI o3-mini [1, 58, 69] to generate potential speaker transcripts T_s and associated emotion labels E_s. We incorporate emotion into utterance generation, as the speaker's emotional state influences the listener's reaction. For example, the same transcript, "Do whatever you want," can lead to different responses: a supportive tone may cause the listener to jump happily in place, whereas a frustrated tone may cause the listener to walk away feeling hurt. Given T_s and E_s, we synthesize the corresponding speaker audio A_s using GPT-4o mini TTS [28]. These steps produce a pool of possible speaker utterances (A_s, T_s, E_s).

Table 1: Dataset statistics. #Pairs is the total number of labeled speaker–listener pairs (i.e., candidate reactions). #Trans., #Audio, and #Emo. denote the numbers of unique transcripts, audio files, and emotion categories, respectively. #Motion is the number of unique motion sequences. #Motion/Utter. reports the average number of candidate motions per speaker utterance. Label counts report the numbers of gold/silver/negative candidates (#G/#S/#N).

Split | #Pairs | #Trans. | #Audio | #Emo. | #Motion | #Motion/Utter. (avg.) | Labels (#G/#S/#N)
Train | 137,879 | 6,631 | 6,631 | 46 | 1,822 | 20.79 | 7,527 / 30,862 / 99,490
Val | 6,790 | 841 | 841 | 40 | 195 | 8.07 | 903 / 1,682 / 4,205
Test | 6,659 | 826 | 826 | 39 | 197 | 8.06 | 877 / 1,652 / 4,130
All | 151,328 | 8,298 | 8,298 | 47 | 2,029 | 18.24 | 9,307 / 34,196 / 107,825

Step 3: Data Filtering. We perform a series of procedures to ensure dataset quality. First, for each speaker utterance, we verify whether the synthesized audio A_s faithfully reflects the intended emotion E_s. Specifically, we apply an automatic speech emotion recognizer (i.e., Hume AI, https://www.hume.ai/expression-measurement) to the generated audio and discard any utterance whose predicted emotion is inconsistent with its assigned emotion label. Next, we pair each remaining speaker utterance with the caption of every listener reactive motion R_l obtained in Step 1. We then employ Qwen (Qwen3-235B-A22B-Instruct) [89] to assign a dyadic conversation appropriateness score to each speaker-utterance and listener-motion-caption pair. For each speaker utterance, we retain only the top several higher-scoring listener reactive motions, thereby removing inappropriate pairs.
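The two filters in Step 3 reduce to an emotion-consistency check on the synthesized audio and a top-k cut on the LLM appropriateness scores. A minimal sketch is given below; `recognize_emotion` and `score_appropriateness` stand in for the Hume AI recognizer and the Qwen scorer, and the retention size k is an illustrative choice rather than the value used to build the dataset.

```python
def filter_utterances(utterances, recognize_emotion):
    """Keep only utterances whose synthesized audio matches the target emotion.

    Each utterance is a dict with keys 'audio', 'transcript', 'emotion'.
    `recognize_emotion` is a stand-in for the speech emotion recognizer.
    """
    return [u for u in utterances
            if recognize_emotion(u["audio"]) == u["emotion"]]

def pair_and_prune(utterances, motion_captions, score_appropriateness, k=30):
    """Pair each surviving utterance with every listener motion caption and
    keep only the top-k captions by LLM appropriateness score."""
    pairs = {}
    for u in utterances:
        scored = sorted(
            ((score_appropriateness(u, cap), cap) for cap in motion_captions),
            key=lambda x: x[0], reverse=True)
        pairs[u["transcript"]] = [cap for _, cap in scored[:k]]
    return pairs
```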
Step 4: Speaker–Listener Candidate Ranking and Preference Tiering. Given a pair consisting of a speaker utterance and one of its corresponding listener reactive motions from Step 3, we use multiple agents (i.e., ChatGPT-o1 [29], ChatGPT-o3 mini [58], and Qwen3-235B-A22B-Instruct [89]) to evaluate the pair. They score it according to (1) semantic appropriateness (whether the reaction fits the utterance) and (2) conversational plausibility (whether it sounds like a natural dyadic response). We further use a natural language inference (NLI) model (https://huggingface.co/MoritzLaurer/deberta-v3-large-zeroshot-v1.1-all-33) to verify whether the listener motion caption is a logically plausible inference from the speaker utterance. We then compute a weighted sum of the agents' scores to obtain a final score, which is used to label the pair as gold, silver, or negative according to predefined thresholds.
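The tier assignment in Step 4 amounts to fusing several judge scores and thresholding the result. The sketch below shows this with illustrative weights and thresholds; the actual weights, thresholds, and score scales used for ReactMotionNet are not specified here and should be treated as assumptions.

```python
def assign_tier(agent_scores, nli_score, weights=(0.3, 0.3, 0.2, 0.2),
                gold_thr=0.8, silver_thr=0.5):
    """Fuse per-agent appropriateness scores and an NLI plausibility score
    into one value, then map it to a preference tier.

    agent_scores: scores in [0, 1] from the LLM judges (e.g., o1, o3-mini, Qwen).
    nli_score:    entailment probability from the NLI model.
    weights, gold_thr, silver_thr: illustrative values, not the paper's.
    """
    scores = list(agent_scores) + [nli_score]
    fused = sum(w * s for w, s in zip(weights, scores))
    if fused >= gold_thr:
        return "gold"
    if fused >= silver_thr:
        return "silver"
    return "negative"

# Example: three agent scores plus an NLI score -> tier label.
tier = assign_tier(agent_scores=[0.9, 0.85, 0.8], nli_score=0.7)
```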
Fig. 3: Overview of the ReactMotion framework. We use modality-specific tokenizers to convert raw data, i.e., the speaker's utterances (including transcript, audio, and emotion) and the listener's reactive motions, into discrete special tokens. With these tokenizers, a Seq2Seq model is employed to integrate information across modalities and learns to generate the listener's reactive motions from the speaker's utterances.

4.2 Dataset Statistics

In total, our dataset contains 151,328 labeled (speaker utterance, listener reactive motion) pairs, covering 8,298 unique speaker utterances and 2,029 unique listener reactive motions. On average, each speaker's utterance is paired with 18.24 candidate reactive motions, highlighting the one-to-many nature of listener reactions. Overall, 9,307, 34,196, and 107,825 pairs are labeled as Gold, Silver, and Negative, respectively, reflecting graded appropriateness of candidate reactions. We split the dataset by speaker utterance with an 8:1:1 ratio for train/val/test, such that speaker utterances are disjoint across splits (i.e., no utterance appears in more than one split). Tab. 1 lists detailed statistics. Our automated construction pipeline further enables straightforward scaling to larger datasets.

5 Methodology

We present ReactMotion, a unified framework for Reactive Listener Motion Generation from Speaker Utterance. As illustrated in Fig. 3, we first introduce modality-specific tokenizers that convert raw inputs, i.e., the speaker utterance (including transcript, audio, and emotion) and the listener's reactive motions, into discrete special tokens. With these tokenizers, we employ a Seq2Seq model to unify information across modalities and learn the conditional distribution of the task (Eqn. 1). To capture the one-to-many nature of dyadic interactions, we further train the model with a group-wise preference-based learning objective, which explicitly allows the generation of multiple appropriate reactions for the same speaker utterance.

5.1 Modality-Specific Tokenization

We employ modality-specific tokenizers to convert raw data from different modalities into discrete tokens.

Audio Tokenization. We use Moshi [15] (its Neural Audio Codec MiMi) to convert the audio waveform in the speaker utterance A_s into discrete codes. Specifically, its audio encoder E_aud(·) is employed to extract audio features from A_s, which are then quantized using the base codebook C_aud:

h^s_a = E_aud(A_s),   x^s_a = Q_aud(h^s_a),   (2)

where the quantizer Q_aud(·) maps the features to their nearest entries in the codebook C_aud and outputs the corresponding codebook indices x^s_a. The resulting indices are treated as discrete audio tokens, allowing the unified model to incorporate audio information while retaining prosody and paralinguistic cues that are informative for reactive behaviors.

Motion Tokenization. We represent the listener's reactive motion R_l as discrete tokens with [96], similar to the audio tokenization process:

h^l_m = E_mot(R_l),   x^l_m = Q_mot(h^l_m),   (3)

where E_mot and Q_mot are the motion encoder and quantizer, respectively, and x^l_m are discrete indices of the motion codebook C_mot. The predicted listener reactive motion, output by the unified model as discrete tokens, can be mapped back to raw motion data through

h^l_m = Q^{-1}_mot(x^l_m),   R_l = D_mot(h^l_m),   (4)

where Q^{-1}_mot(·) maps the discrete token indices to the vectors in the codebook, and a VQ-VAE motion decoder [82, 96] D_mot(·) decodes the vectors back to the raw motion data.
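Eqs. (2)–(4) are standard vector-quantization lookups. The sketch below shows the nearest-codebook-entry quantization and its inverse in PyTorch; the random encoder output and codebook below are toy placeholders, whereas the actual tokenizers are the pretrained MiMi codec and the T2M-GPT VQ-VAE.

```python
import torch

def quantize(h, codebook):
    """Map features h (T, d) to the indices of their nearest codebook entries
    (K, d), as in Eqs. (2)-(3)."""
    dists = torch.cdist(h, codebook)     # (T, K) pairwise L2 distances
    return dists.argmin(dim=-1)          # discrete token indices

def dequantize(indices, codebook):
    """Inverse lookup Q^{-1}: token indices -> codebook vectors, as in Eq. (4)."""
    return codebook[indices]             # (T, d)

# Toy example with a random "motion encoder" output and a 512-entry codebook.
codebook = torch.randn(512, 256)         # |V_m| = 512 entries in the paper
h = torch.randn(48, 256)                 # E_mot(R_l): 48 frames of features
tokens = quantize(h, codebook)           # x_m^l, fed to the Seq2Seq model
h_rec = dequantize(tokens, codebook)     # passed to the VQ-VAE decoder D_mot
```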
5.2 Unified Seq2Seq Modeling

With the above modality-specific tokenizers, we can represent information across modalities in a unified space, which enables a Seq2Seq model to generate a listener reactive motion conditioned on the speaker utterance. Specifically, we adopt T5-base [65] as the Seq2Seq backbone and extend its original textual vocabulary V_t to include the audio and motion vocabularies:

V = V_t ∪ V_m ∪ V_a ∪ V_s,   (5)

where V_m consists of |C_mot| dedicated tokens, one per index of the motion codebook C_mot, and V_a consists of |C_aud| dedicated tokens, one per index of the audio codebook C_aud. V_s contains special tokens that wrap the motion, audio, and emotion token sequences.

This unified vocabulary allows us to formulate reactive listener motion generation, conditioned on different modalities or their combinations C_s, in a general format and achieve it within a single model. Specifically, we first fit the discrete codes of the speaker utterance C_s and the listener reactive motion R_l into fixed prompt templates. Due to the page limit, only a coarse example template using speaker audio as the sole condition is shown; the detailed version and templates for other conditions are provided in Appendix A.2.

Input: You are modeling a speaker-listener dyadic interaction. Given SPEAKER_AUDIO: [Audio Tokens Placeholder], return ONLY a sequence of listener reactive motion tokens.
Output: [Motion Tokens Placeholder]

The modeling process of generating the listener reactive motion can then be represented as an auto-regressive one, where each motion token is generated with probability p_θ(x^out_t | x^in(C_s), x^out_{<t}).

5.3 Group-wise Preference Learning

To exploit the tiered annotations, we compare candidate reactions within each utterance group. For each group, every candidate is scored by its length-normalized log-likelihood under the model, and the candidates of each tier are aggregated (via LogSumExp) into set-level scores ℓ_G, ℓ_S, and ℓ_N for the Gold, Silver, and Negative sets, respectively. The model should prefer more appropriate tiers, i.e., ℓ_G > ℓ_S > ℓ_N. We enforce this ordering with a soft-margin ranking loss:

L_rank = log(1 + exp(m − (ℓ_G − ℓ_S))) + log(1 + exp(m − (ℓ_S − ℓ_N))) + λ_gn · log(1 + exp(m − (ℓ_G − ℓ_N))),   (8)

where m specifies the margin between different labels, and λ_gn controls the strength of the Gold ≻ Negative constraint.

Training objective with frequency reweighting. To mitigate the dominance of frequently occurring motion sequences, we apply inverse-frequency weighting based on motion sequence IDs. Let i index a group (corresponding to one speaker utterance) and let r_ij denote the motion sequence ID of the j-th candidate in group i. We compute freq(r) as the number of times motion ID r appears in the training set and assign an item weight w̃_ij = 1/√freq(r_ij). We then define the group weight as the mean item weight within the group, w_i = (1/|C_i|) Σ_j w̃_ij, where C_i denotes the candidate set of group i. Finally, we maximize the aggregated Gold score while applying the ranking loss:

L = [ Σ_i w_i ( −ℓ^(i)_G + λ_rank L^(i)_rank ) ] / Σ_i w_i.   (9)
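A compact PyTorch sketch of Eqs. (8)–(9) is shown below, assuming the set-level scores ℓ_G, ℓ_S, ℓ_N have already been computed per group (e.g., by length-normalized LogSumExp over candidate log-likelihoods); the batched tensor layout and the softplus formulation are the only additions beyond the equations themselves.

```python
import torch
import torch.nn.functional as F

def rank_loss(l_g, l_s, l_n, margin=0.5, lambda_gn=0.25):
    """Soft-margin ranking loss of Eq. (8) for a batch of groups.
    l_g, l_s, l_n: (B,) set-level scores of Gold/Silver/Negative candidates.
    Note that log(1 + exp(z)) == softplus(z)."""
    return (F.softplus(margin - (l_g - l_s))
            + F.softplus(margin - (l_s - l_n))
            + lambda_gn * F.softplus(margin - (l_g - l_n)))

def total_loss(l_g, l_s, l_n, group_weights, lambda_rank=0.25):
    """Frequency-reweighted objective of Eq. (9).
    group_weights: (B,) mean inverse-sqrt-frequency weight w_i per group."""
    per_group = -l_g + lambda_rank * rank_loss(l_g, l_s, l_n)
    return (group_weights * per_group).sum() / group_weights.sum()

# Toy usage with three groups of set-level log-likelihood scores.
l_g = torch.tensor([-1.0, -0.8, -1.2])
l_s = torch.tensor([-1.5, -1.1, -1.6])
l_n = torch.tensor([-2.3, -2.0, -2.5])
w = torch.tensor([1.0, 0.7, 0.9])
loss = total_loss(l_g, l_s, l_n, w)
```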
6 Experiments

6.1 Implementation Details

We train ReactMotion for 100,000 iterations using the default AdamW optimizer and a cosine learning-rate schedule. The learning rate is set to 2 × 10^{-5} with 1,000 warmup steps. We use a per-device batch size of 8 with gradient accumulation of 2 steps on a single NVIDIA A100 GPU. We train with six conditioning variants (T, A, T+A, T+E, A+E, T+A+E) and apply modality dropout (p = 0.3) to improve robustness (see Appendix A for more implementation details).

6.2 Evaluation Protocol

Evaluation metrics. (i) Reactive appropriateness, i.e., how well the generated reactive human motions respond to the speaker's input, is a core objective of our task. Inspired by preference-based evaluation paradigms [6, 11, 13, 16, 72, 102], we evaluate reactive appropriateness using group-level win rates Win(g>G), Win(g>S), and Win(g>N). Specifically, we compare the best generated sample g with annotated listener motions labeled as Gold (G), Silver (S), and Negative (N), and compute the win rate against each reference tier. A win against a higher reference tier (e.g., Silver) indicates that the generated motion is ranked above a higher-quality annotated response, reflecting stronger reactive appropriateness. To realize this evaluation, we train a multimodal judge network to rank generated reactive body motions conditioned on the same speaker input. Details of the judge network are provided in the appendix. We also report Gen@3, the fraction of groups in which a generated candidate is ranked within the top-3 among {G, S, N} plus generated candidates under the same group. (ii) Motion quality is measured by Fréchet Inception Distance (FID) [23] computed in a motion feature space, and (iii) Diversity is measured as the average pairwise embedding distance across generated samples, following human motion generation [82, 96] (see Appendix B.4 for more details of the evaluation metrics).

Table 2: Multimodal judge network reliability under strict modality missingness (Strict-L2). We evaluate six input modes (text T, audio A, emotion E, and their fusions) on the test set, reporting pairwise win rates (Win(G>N), Win(G>S), Win(S>N)) and ranking metrics (MRR(G), nDCG@K) with graded relevance G>S>N.

Mode | Win(G>N)↑ | Win(G>S)↑ | Win(S>N)↑ | MRR(G)↑ | nDCG@3↑ | nDCG@5↑ | nDCG@10↑
T | 0.992 | 0.873 | 0.983 | 0.829 | 0.864 | 0.878 | 0.932
A | 0.992 | 0.872 | 0.983 | 0.832 | 0.866 | 0.878 | 0.933
T+E | 0.993 | 0.876 | 0.982 | 0.826 | 0.857 | 0.876 | 0.929
A+E | 0.992 | 0.874 | 0.983 | 0.831 | 0.865 | 0.878 | 0.933
T+A | 0.993 | 0.879 | 0.982 | 0.820 | 0.855 | 0.875 | 0.928
T+A+E | 0.993 | 0.878 | 0.982 | 0.828 | 0.859 | 0.878 | 0.930

Validation of the multimodal judge network. Since the judge network is central to measuring reactive appropriateness, we validate it on samples with tiered appropriateness annotations (G/S/N). Specifically, we compute the tier-consistency win rates Win(G>S), Win(G>N), and Win(S>N) to test whether the judge assigns higher scores to more appropriate reactions. Higher values indicate a more reliable judge. We further report MRR(G), which measures how highly the Gold reaction is ranked, and nDCG@3 / nDCG@5 / nDCG@10 to assess graded ranking quality among the top-K candidates.

Table 2 shows that the judge consistently preserves the expected preference ordering with near-perfect separation across all six modes on the test set. Gold almost always beats negatives (Win(G>N) ≈ 0.99) and silver also strongly beats negatives (Win(S>N) ≈ 0.98), indicating that the judge reliably distinguishes poor motions from plausible ones.
Mean while, gold beats silv er with a clear margin (Win(G > S) ≈ 0.87–0.88), reflecting sensitivity to fine-grained quality differences b ey ond simply rejecting negatives. The judge further ac hieves strong ranking qualit y (MRR(G) ≈ 0.82–0.84; nDCG@5 ≈ 0.87–0.88; nDCG@10 ≈ 0.93), demonstrating stable and meaningful top- K ordering. Although our multimodal judge netw ork is trained on multiple input mo dal- ities, i.e ., text ( T ), audio ( A ), and emotion ( E ), it supp orts missing mo dalities using Strict-L2. Disabled mo dalities are replaced with information-free inputs (all-padding text, all-padding audio co des, or an unknown emotion token). This enables the judge net work to op erate with any subset of modalities; ev en with a 12 C. Luo et al single mo dalit y , it p erforms w ell in ev aluation. (see the Appendix B.1 and B.2 for more details of the judge net work). 6.3 Quan titative Results Since reactive listener motion generation remains underexplored, we ev aluate a set of representativ e baselines. (a) Random Selection uniformly samples a mo- tion sequence from HumanML3D [19]. (b) Retriev al applies the text–motion matc hing netw ork from prior HumanML3D T2M w ork [82, 96] to compute text– motion similarity and retrieves the nearest-neighbor listener motion sequence from the training set giv en the sp eak er transcript. W e also consider stronger cascaded LLM → T2M baselines: given a sp eak er utterance (and emotion), an LLM [89] first generates a listener-motion caption, whic h is then passed to a T2M generator to synthesize the final motion. W e instan tiate the LLM with Qwen3-30B-A3B (30.5B parameters) and a fine-tuned Qwen3-4B-Thinking (4B parameters) trained on our training-set (sp eak er utterance, listener-motion caption) pairs. The resulting captions are fed into tw o representativ e T2M gen- erators, T2M-GPT [96] and MG-MotionLLM [82]. More details of baselines are in the App endix B.3. T ab. 3 sho ws that ReactMotion outperforms all baselines in reactiv e appropri- ateness. Among the cascaded LLM → T2M pip elines, LLM → MG-MotionLLM * is the strongest, improving ov er Random Selection and Retriev al. Ho w ever, de- spite using a p ow erful motion generator, it still p erforms p oorly under strict com- parisons to Silv er references (Win(g > S)), indicating that the t wo-stage caption- then-generate pip eline struggles to pro duce highly appropriate listener reactions. In contrast, ReactMotion achiev es near-p erfect Win(g > N) across input mo des and substantially improv es Win(g > S) and Gen@3. Our full mo del ( T + A + E ) yields the best ov erall Win rates, while maintaining low FID and comp etitiv e div ersity . Although Retriev al attains the highest div ersity by construction, it yields m uch low er appropriateness and worse realism than our approach. More exp erimen tal results are pro vided in the App endix D. 6.4 Qualitativ e Results W e visualize representativ e examples in Fig. 4, comparing our ReactMotion (Ours), a cross-entrop y trained v ariant (CE), and LLM → MG-MotionLLM * with a finetuned Qwen [89] ( Qwen3-4B-Thinking ) on training set, together with gold and silv er reference reactions under the same sp eaker condition. Overall, ReactMotion pro duces reactiv e motions that are b oth seman tically consisten t with the sp eak er conten t and expressiv e in intensit y . 
For instance, for the utterance "The energy in here feels electric right now" with excited emotion, our model generates larger, more dynamic upper-body and arm movements, which better reflect the high-energy "electric" cue and match the communicative style seen in the gold reaction. In contrast, the silver reaction exhibits a rapid hand-wave but remains relatively low-energy, making it less aligned with the excited condition. The CE variant tends to regress to generic, weakly-conditioned responses (e.g., a static pose such as crossing arms), indicating limited ability to exploit preference structure and model the one-to-many nature of reactive behaviors. Finally, the LLM→T2M baseline often generates repetitive motions (e.g., near-constant waving) with limited temporal variation, which appears less suitable for dyadic communication, where reactions typically evolve over time (e.g., hands rising and lowering, pose changes and subtle turns). Moreover, because dyadic reactions can be difficult to describe in natural language, the out-of-domain captions produced by the LLM may be noisy, which can lead MG-MotionLLM to produce degraded outputs, including overly short motion sequences.

Table 3: Quantitative results on the test set. Main evaluation metrics are Win(g>N), Win(g>S), Win(g>G), and Gen@3, measuring reactive appropriateness. We additionally evaluate motion quality (FID) and diversity. * indicates that the LLM is fine-tuned using training-set speaker utterance and listener motion caption pairs.

Method | Input Mod. | Win(g>N)↑ | Win(g>S)↑ | Win(g>G)↑ | Gen@3↑ | FID↓ | Diversity↑
GT | - | - | - | - | - | 0.278 | 6.187
Random Selection | - | 0.265 | 0.122 | 0.006 | 0.099 | 42.363 | 9.880
Retrieval | T | 0.392 | 0.252 | 0.130 | 0.206 | 7.429 | 8.207
LLM→T2M-GPT | T+E | 0.138 | 0.038 | 0.016 | 0.199 | 49.920 | 4.946
LLM→T2M-GPT* | T+E | 0.171 | 0.027 | 0.017 | 0.350 | 42.589 | 6.102
LLM→MG-MotionLLM | T+E | 0.775 | 0.245 | 0.044 | 0.345 | 23.629 | 5.082
LLM→MG-MotionLLM* | T+E | 0.883 | 0.274 | 0.047 | 0.380 | 25.723 | 4.546
ReactMotion (Ours) | T | 0.993 | 0.774 | 0.258 | 0.916 | 4.706 | 4.789
ReactMotion (Ours) | A | 0.992 | 0.614 | 0.164 | 0.864 | 6.221 | 4.009
ReactMotion (Ours) | T+E | 0.990 | 0.696 | 0.206 | 0.930 | 5.422 | 4.475
ReactMotion (Ours) | A+E | 0.993 | 0.736 | 0.323 | 0.981 | 6.485 | 4.162
ReactMotion (Ours) | T+A | 0.993 | 0.651 | 0.215 | 0.931 | 6.560 | 4.145
ReactMotion (Ours) | T+A+E | 1.000 | 0.797 | 0.266 | 0.960 | 4.760 | 4.804

6.5 User Study

We recruit 59 volunteers and conduct a user study to evaluate the reactive appropriateness of listener motions generated by ReactMotion (Ours) against two baselines (the CE variant and LLM→MG-MotionLLM*) and the best-in-group Silver reference. In each case, participants watch two motion videos (A/B) conditioned on the same speaker utterance (audio with transcript shown) and select the more appropriate listener reaction. Each participant completes 36 cases covering six speaker conditions (six pairwise comparisons per condition). As shown in Fig. 5, Ours is preferred over the generative baselines, achieving win rates of 67.8% against CE and 72.0% against LLM→MG-MotionLLM*. Ours is also competitive with the Silver reference, receiving 44.1% of votes in Silver vs. Ours, substantially higher than CE (31.9%) and LLM→MG-MotionLLM (31.4%).

Fig. 4: Qualitative results.
We compare gold and silver listener reactions, motions generated by our ReactMotion (Ours), a cross-entropy trained variant (CE), and a cascaded LLM→T2M baseline, all conditioned on the same speaker utterance. We visualize the resulting 3D motion sequences.

6.6 Ablation Studies

Modality study. We study the effect of input modalities in Tab. 3. Across settings, multimodal fusion performs best overall. Text is the strongest single cue, giving high alignment and the lowest single-modality FID (e.g., T: Win(g>N)=0.993, Win(g>S)=0.774, FID=4.706). Audio alone is weaker for fine-grained appropriateness, but adding emotion substantially improves it (best Win(g>G)=0.323 and Gen@3=0.981). Full fusion (T+A+E) is the most balanced, achieving the best Win(g>N)=1.000, strong Win(g>S)=0.797, and a low FID=4.760.

Ablations on group-wise preference learning. Tab. 4 ablates key components of our group-wise preference learning objective. Compared to training with cross-entropy only, our full model substantially improves both reactive appropriateness and motion quality (e.g., Win(g>S): 0.741→0.797; Gen@3: 0.938→0.960; FID: 6.555→4.760). Removing inverse-frequency reweighting leads to the largest appropriateness drop, especially against the strongest tier (Win(g>G): 0.266→0.220), highlighting the importance of mitigating the dominance of frequent and generic motions. Removing the ranking loss degrades fidelity (FID: 4.760→5.950) while increasing diversity (4.804→5.453), suggesting that the ranking constraints help enforce correct relative ordering among tiers. Finally, removing ℓ_G consistently harms both appropriateness and quality, indicating that likelihood supervision on Gold reactions remains necessary.

Fig. 5: User study on reactive appropriateness.

Table 4: Ablation studies on the test split (all use A+T+E unless noted). w/o denotes training without the corresponding component. The CE baseline trains the same model using only a cross-entropy loss by pairing each speaker input with a single Gold reaction as supervision.

Method | Win(g>N)↑ | Win(g>S)↑ | Win(g>G)↑ | Gen@3↑ | FID↓ | Diversity↑
CE baseline | 0.990 | 0.741 | 0.262 | 0.938 | 6.555 | 5.448
Ours (full) | 1.000 | 0.797 | 0.266 | 0.960 | 4.760 | 4.804
w/o Inverse-frequency reweighting | 0.979 | 0.704 | 0.220 | 0.946 | 5.177 | 4.929
w/o L_rank | 0.996 | 0.781 | 0.260 | 0.960 | 5.950 | 5.453
w/o ℓ_G | 0.996 | 0.712 | 0.215 | 0.943 | 6.376 | 4.493

7 Conclusion

We introduce Reactive Listener Motion Generation from Speaker Utterance, a new task for modeling listener motion responses in dyadic interactions. To support this task, we present ReactMotionNet, a multi-modal dataset that explicitly captures the inherent non-determinism of human behavior: for each speaker utterance, we provide multiple candidate listener motions with preference annotations, enabling supervision beyond a single "ground-truth" response. Building on this dataset design, we develop preference-oriented evaluation protocols tailored to reactive motion generation.
Finally , w e prop ose ReactMotion , a unified framew ork that pro cesses multi-modal sp eak er cues, substantially outperforms strong baselines in motion qualit y and reactive appropriateness. W e believe this w ork provides a foundation for future researc h on modeling dyadic interactions. 16 C. Luo et al Outline of the Supplementary Material The supplemen tary material is organized as follows: • Section A presents the implemen tation details, including the model config- uration, v o cabulary construction, optimization settings, and training h yp er- parameters. • Section A.1 presents the mo del size of ReactMotion. • Section A.2 : prompt templates for different sp eak er-condition settings; • Section B further provides the additional ev aluation details, including: • Section B.1 : the formulation of the multimodal judge netw ork; • Section B.3 : details of the baseline metho ds. • Section B.4 in tro duces the ev aluation metrics, co v ering reactiv e appropri- ateness, motion qualit y , and diversit y . • Section C provides additional statistics and analysis of the ReactMotionNet dataset. • Section D.1 presents the hyperparameter sensitivity analysis, including the full sweep results, represen tative configurations, and heatmap visualizations. • Section D.2 ev aluates the inference efficiency of the proposed metho d. • Section D.3 rep orts the proto col and results of the user study . • Section D.4 shows representativ e failure cases. • Section E discusses the limitations of the current framework. A Implemen tation Details T ab. 5 summarizes the key implementation details and training hyperparameters used in our exp erimen ts. Sp ecifically , ReactMotion is instantiated with a T5-base Seq2Seq backbone, comprising 222.9M backbone parameters and 235.9M train- able parameters after extending the v o cabulary . In accordance with the method- ology section, the original textual v o cabulary ( | V t | = 32 , 100 ) is augmen ted with motion tok ens ( | V m | = 512 ), MiMi audio tok ens ( | V a | = 2 , 048 p er codeb ook; 8 co debo oks), and modality-specific sp ecial tokens that mark the b oundaries of differen t mo dalities, resulting in a unified vocabulary of size 63 , 338 . Notably , the vocabulary includes tokens from all 8 MiMi co deb ooks for completeness, while in practice we only use tok ens from the base co deb ook during training to accelerate the process. The mo del takes tok enized speaker utterances as input and autoregressiv ely predicts listener reactiv e motion tok ens, with maximum source and target lengths set to 512 and 256, resp ectiv ely . W e train the mo del using Adam W with learning rate 2 . 0 × 10 − 5 , β 1 = 0 . 9 , β 2 = 0 . 999 , weigh t deca y 0.0, 1,000 warm up steps, per-device batc h size 8, gradient accumulation o v er 2 steps, and 100,000 total optimization steps. T o capture the one-to-many mapping from a speaker utterance to plausible listener reactions, training adopts the pro- p osed group-wise preference ob jective with λ rank = 0 . 25 , λ gn = 0 . 25 , and margin m = 0 . 5 . W e further apply mo dalit y drop out with rate 0.3 to impro ve robustness to missing mo dalities, while length-normalized LogSumExp aggregation is used to obtain stable set-lev el scores during preference optimization. Abbreviated pap er title 17 T able 5: Implemen tation details and hyperparameters used in training. 
Setup V alue Seq2Seq backbone mo del T5-base [65] T ext tokenizer T5-base tokenizer [65] Audio tokenizer MiMi neural audio co dec [15] Motion tokenizer VQ-V AE from T2M-GPT [96] Per-device batch size 8 Gradient accumulation steps 2 T raining steps 100,000 W arm up steps 1,000 Optimizer Adam W Adam β 1 0.9 Adam β 2 0.999 W eigh t decay 0.0 Learning rate 2 . 0 × 10 − 5 Maximum source length 512 Maximum target length 256 T ext vocabulary size | V t | 32,100 Audio co debook size | V a | 2,048 Number of MiMi audio co debooks 8 Motion VQ-V AE co debook size | V m | 512 T otal vocabulary size | V | 49,002 Backbone parameters 222.9M T otal trainable parameters after vocabulary expansion 235.9M Ranking loss weight λ rank 0.25 Gold-negative loss weigh t λ gn 0.25 Ranking Margin m 0.5 Modality drop out rate 0.3 LogSumExp normalization Enabled A.1 Mo del Size T able 6: Model Configuration and P arameters of ReactMotion. Metric V alue Backbone parameters 222.9M T otal trainable parameters 235.9M Unified v o cabulary size 49,002 T able 6 summarizes the model size of ReactMotion. The mo del is built up on a T5-base bac kb one with 222.9M parameters and 235.9M trainable parameters after extending the v o cabulary to incorp orate multimodal tokens. A.2 Prompt T emplates T o supp ort unified generation under different sp eak er-condition settings, we con- v ert the av ailable sp eak er cues into a fixed natural-language prompt template. 18 C. Luo et al Giv en a sp eak er utterance consisting of transcription, audio, and optional emo- tion annotation, we construct the input prompt by selectively enabling the corre- sp onding fields. The mo del is ins tructed to output only the listener motion-token sequence in a strict format, without an y additional natural language. F ormally , for a sp eak er utterance C s , the prompt is constructed as Input: Y ou are mo deling a sp eak er-listener dyadic interaction. Input: - SPEAKER_TRANSCRIPTION: [Speaker T ranscription] - SPEAKER_AUDIO: [Sp e aker Audio] - SPEAKER_EMOTION: [Speaker Emotion] Output: Return ONL Y a sequence of listener motion tok ens in the exact format: ... Do NOT output any other w ords. In practice, the fields in the prompt are enabled or disabled dep ending on the c hosen condition mo de. F or example, when transcription is used but audio is not, the SPEAKER_AUDIO field is left empty; when emotion is disabled, the emotion line is omitted entirely . This design allows us to handle text-only , audio-only , text+audio, text+emotion, audio+emotion, and text+audio+emotion settings within a single unified framew ork. Belo w we sho w several concrete examples. T ext-only c ondition ( T ). Input: Y ou are mo deling a sp eak er-listener dyadic interaction. Input: - SPEAKER_TRANSCRIPTION: [Speaker T ranscription] - SPEAKER_AUDIO: Output: Return ONL Y a sequence of listener motion tok ens in the exact format: ... Do NOT output any other w ords. T ext+Emotion c ondition ( T + E ). Input: Y ou are mo deling a sp eak er-listener dyadic interaction. Input: - SPEAKER_TRANSCRIPTION: [Speaker T ranscription] - SPEAKER_AUDIO: - SPEAKER_EMOTION: [Speaker Emotion] Output: Return ONL Y a sequence of listener motion tok ens in the exact format: ... Do NOT output any other w ords. A udio-only c ondition ( A ). Input: Y ou are mo deling a sp eak er-listener dyadic interaction. Input: - SPEAKER_TRANSCRIPTION: - SPEAKER_AUDIO: [Sp e aker Audio] Output: Return ONL Y a sequence of listener motion tok ens in the exact format: Abbreviated pap er title 19 ... 
Do NOT output any other words.

Audio+Emotion condition (A+E).
Input: You are modeling a speaker-listener dyadic interaction.
Input:
- SPEAKER_TRANSCRIPTION:
- SPEAKER_AUDIO: [Speaker Audio]
- SPEAKER_EMOTION: [Speaker Emotion]
Output: Return ONLY a sequence of listener motion tokens in the exact format: ...
Do NOT output any other words.

Text+Audio condition (T+A).
Input: You are modeling a speaker-listener dyadic interaction.
Input:
- SPEAKER_TRANSCRIPTION: [Speaker Transcription]
- SPEAKER_AUDIO: [Speaker Audio]
Output: Return ONLY a sequence of listener motion tokens in the exact format: ...
Do NOT output any other words.

Text+Audio+Emotion condition (T+A+E).
Input: You are modeling a speaker-listener dyadic interaction.
Input:
- SPEAKER_TRANSCRIPTION: [Speaker Transcription]
- SPEAKER_AUDIO: [Speaker Audio]
- SPEAKER_EMOTION: [Speaker Emotion]
Output: Return ONLY a sequence of listener motion tokens in the exact format: ...
Do NOT output any other words.

Given the constructed prompt x^in(C_s), the model auto-regressively predicts the listener motion-token sequence x^out as p_θ(x^out_t | x^in(C_s), x^out_{<t}).

B Additional Evaluation Details

B.1 Formulation of the Multimodal Judge Network

The judge network encodes each modality of the speaker utterance with its own branch (a pre-trained T5-base encoder for the transcript, and transformer encoders over the audio tokens and the emotion label), producing modality embeddings z_t, z_a, and z_e that are fused into a condition embedding z_f; the candidate listener motion x^l_m is encoded into a motion embedding z_m. All embeddings are projected into a shared score space and ℓ2-normalized, and the similarity between a condition embedding and a motion embedding is defined as the scaled inner product

ϕ(u, v) = τ u⊤v,   (23)

where τ > 0 is the corresponding scaling factor. Because all score-space embeddings are ℓ2-normalized, Eq. (23) is a scaled cosine similarity. The fused compatibility score is defined as

s_ψ(C_s, x^l_m) = ϕ(z_f, z_m).   (24)

In addition, we compute auxiliary modality-specific compatibility scores

s^(k)_ψ(C_s, x^l_m) = ϕ(z_k, z_m),   k ∈ {t, a, e},   (25)

which allow the judge to score candidate motions under partial speaker utterances.

Group-wise contrastive training. For each speaker utterance C^s_i, we construct a candidate set

U_i = G(C^s_i) ∪ S(C^s_i) ∪ N(C^s_i),   (26)

where G(C^s_i), S(C^s_i), and N(C^s_i) denote the Gold, Silver, and Negative listener motion sets, respectively. During training, we randomly sample a small number of candidates from each tier and encode them jointly. To improve robustness to incomplete conditions, we randomly vary the active modality set o during training. This encourages the judge to remain reliable under different condition modes, including single-modality settings such as text-only and audio-only.

Let P_i ⊆ U_i denote the positive set associated with C^s_i; in our default setting, P_i = G(C^s_i). Given a condition embedding z_i (which can be the fused embedding z_f or an active modality-specific embedding z_t, z_a, z_e), we optimize the following group-wise InfoNCE objective:

L_con(z) = −(1/|B|) Σ_{i∈B} log [ Σ_{x∈P_i} exp(ϕ(z_i, z_m(x))) / ( Σ_{x∈U_i} exp(ϕ(z_i, z_m(x))) + Σ_{b∈B_bank} exp(β ϕ(z_i, z_m(b))) ) ],   (27)

where B is the mini-batch, B_bank is an auxiliary motion bank providing additional generic negatives, z_m(x) denotes the motion embedding of candidate x, z_m(b) denotes the embedding of a motion sampled from the bank, and β controls the contribution of bank negatives. The motion bank discourages the judge from assigning overly high compatibility scores to generic or template-like motions.

We always apply Eq. (27) to the fused embedding z_f. For the modality-specific auxiliary losses, we apply it only to the modalities active under the current mode o:

L_judge = λ_f L_con(z_f) + Σ_{k∈o} λ_k L_con(z_k),   k ∈ {t, a, e},   (28)

where λ_f, λ_t, λ_a, λ_e are loss weights to balance the different loss terms.
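A minimal PyTorch sketch of the group-wise InfoNCE objective in Eq. (27) is given below for a single condition embedding; the batching over groups, the memory-bank update, and the modality-specific auxiliary terms of Eq. (28) are omitted, and the tensor layout is an assumption made for illustration.

```python
import torch

def groupwise_infonce(z_cond, z_pos, z_group, z_bank, tau=1.0 / 0.07, beta=1.0):
    """Group-wise InfoNCE loss of Eq. (27) for one speaker utterance.

    z_cond:  (d,)   l2-normalized condition embedding (fused or single-modality).
    z_pos:   (P, d) embeddings of the positive (Gold) candidates.
    z_group: (U, d) embeddings of all candidates in the group (Gold+Silver+Negative).
    z_bank:  (M, d) embeddings sampled from the auxiliary motion bank.
    tau:     scaling factor of the scaled cosine similarity phi.
    beta:    weight on bank negatives inside the partition function.
    """
    phi = lambda z: tau * (z @ z_cond)            # scaled cosine similarity
    num = torch.logsumexp(phi(z_pos), dim=0)      # log of the positive sum
    den = torch.logsumexp(
        torch.cat([phi(z_group), beta * phi(z_bank)]), dim=0)
    return -(num - den)

# Toy usage with random, normalized embeddings.
d = 512
norm = lambda x: torch.nn.functional.normalize(x, dim=-1)
loss = groupwise_infonce(norm(torch.randn(d)), norm(torch.randn(2, d)),
                         norm(torch.randn(6, d)), norm(torch.randn(16, d)))
```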
V alidation of the multimo dal judge network. Because the judge net work is central to our ev aluation proto col, w e further verify whether its rankings resp ect the annotated tier ordering G ≻ S ≻ N . F or any tier A ∈ {G , S , N } , we define its mean judge score under condition C s as ¯ s A ( C s ) = 1 |A ( C s ) | X x ∈A ( C s ) s ψ ( C s , x ) . (29) W e then rep ort Win(G > S) , Win(G > N) , and Win(S > N) , defined as Win( A > B ) = 1 |D | X C s ∈D κ ¯ s A ( C s ) , ¯ s B ( C s ) , (30) where ( A , B ) ∈ { ( G , S ) , ( G , N ) , ( S , N ) } , D denotes the ev aluation s et of speaker utterances, and κ ( u, v ) = 1 , u > v , 0 . 5 , u = v , 0 , u < v . (31) W e further rep ort MRR(G) , defined as MRR( G ) = 1 |D | X C s ∈D 1 min x ∈G ( C s ) rank C s ( x ) , (32) where all candidates in U ( C s ) are sorted in descending order of s ψ ( C s , x ) , and rank C s ( x ) denotes the resulting 1-based rank of candidate x . Finally , we rep ort nDCG@3 , nDCG@5 , and nDCG@10 , using graded rel- ev ance lab els 2 , 1 , and 0 for Gold, Silver, and Negative candidates, respectively . These metrics v erify whether the learned judge produces rankings aligned with the annotated appropriateness structure. Strict-L2 missing-mo dality inje ction. F or partial-condition ev aluation, w e adopt a Strict-L2 missing-mo dalit y injection proto col. Given an active mo dalit y set o ⊆ { t, a, e } , every unav ailable mo dalit y is replaced b y a n ull input b efor e it is pro cessed by its enco der branc h. This differs from a weak masking strategy that remo ves a modality only during fusion while still allowing its enco der to observe the original input. F ormally , let δ t ( o ) , δ a ( o ) , and δ e ( o ) indicate whether text, audio, and emotion are ac tiv e under mode o , resp ectiv ely . F or text, if δ t ( o ) = 0 , w e replace the transcript with an all-padding sequence and set its padding mask to zero: x s t ← PAD , M t ( j ) = 0 , ∀ j. (33) Abbreviated pap er title 25 F or audio, if δ a ( o ) = 0 , w e replace all co dec tokens with the audio padding index and mark all time steps as padded: x s a ← PAD a , M a ( j ) = 0 , ∀ j. (34) F or emotion, if δ e ( o ) = 0 , we replace the original label with a dedicated unknown sym b ol: e s ← . (35) A t the fusion stage, the corresp onding mo dalit y tok en is additionally masked out through M f . As a result, unav ailable mo dalities contribute no semantic information to the final condition represen tation. This proto col provides a strict test of whether the judge can reliably score listener motions using only the actually a v ailable sp eak er signals. Unless otherwise specified, all partial-condition reliability exp erimen ts are conducted under this Strict-L2 proto col. B.2 Implemen tation Details of Judge Net w ork T able 7: Hyperparameters for the m ultimo dal judge netw ork. P arameter V alue Bac kb one enco der T5-base Hidden dimension d 768 Em b edding dimension 512 T ransformer heads 12 T ransformer la yers 6 F eedforward dimension 3072 Drop out 0.1 T emp erature 0.07 Memory bank size 4096 Optimizer A dam W Learning rate 5 × 10 − 5 W eight deca y 0.01 Batc h size 16 Ep och 50 λ f 1.0 λ t 0.5 λ a 0.5 λ e 0.2 The m ultimo dal judge net work is implem en ted using a transformer-based arc hitecture that ev aluates the compatibility b et ween sp eak er utterances and candidate listener motions. The textual mo dalit y is encoded using a pre-trained T5-base encoder, while audio tok ens, emotion labels, and motion tok ens are 26 C. 
B.2 Implementation Details of the Judge Network

Table 7: Hyperparameters for the multimodal judge network.

Parameter | Value
Backbone encoder | T5-base
Hidden dimension d | 768
Embedding dimension | 512
Transformer heads | 12
Transformer layers | 6
Feedforward dimension | 3072
Dropout | 0.1
Temperature | 0.07
Memory bank size | 4096
Optimizer | AdamW
Learning rate | 5 × 10^{-5}
Weight decay | 0.01
Batch size | 16
Epochs | 50
λ_f | 1.0
λ_t | 0.5
λ_a | 0.5
λ_e | 0.2

The multimodal judge network is implemented using a transformer-based architecture that evaluates the compatibility between speaker utterances and candidate listener motions. The textual modality is encoded using a pre-trained T5-base encoder, while audio tokens, emotion labels, and motion tokens are embedded and processed through transformer encoders to obtain modality representations. These representations are projected into a shared embedding space where the final compatibility score is computed.

Table 7 summarizes the key hyperparameters used for training the judge network. The model adopts a hidden dimension of 768 and projects the representations into a 512-dimensional embedding space. The transformer encoder uses 12 attention heads and 6 layers with a feedforward dimension of 3072. Training is performed using the AdamW optimizer with a learning rate of 5 × 10^{-5}, weight decay of 0.01, and batch size of 16. A memory bank of size 4096 is used to provide additional negative samples for contrastive training.

Table 8: We evaluate the multimodal matching judge on the validation and test sets across six input modes (text T, audio A, emotion E, and their fusions). We report pairwise win rates based on mean score comparisons (Win(G>N), Win(G>S), Win(S>N)) and ranking metrics (MRR(G), nDCG@K with graded relevance G>S>N), where G = Gold, S = Silver, and N = Negative.

Mode | Split | Win(G>N)↑ | Win(G>S)↑ | Win(S>N)↑ | MRR(G)↑ | nDCG@3↑ | nDCG@5↑ | nDCG@10↑
T | Val | 0.990 | 0.873 | 0.985 | 0.839 | 0.878 | 0.891 | 0.939
A | Val | 0.990 | 0.873 | 0.985 | 0.842 | 0.881 | 0.893 | 0.940
T+A | Val | 0.993 | 0.883 | 0.988 | 0.840 | 0.875 | 0.890 | 0.937
T+E | Val | 0.994 | 0.881 | 0.988 | 0.841 | 0.875 | 0.891 | 0.938
A+E | Val | 0.990 | 0.875 | 0.985 | 0.840 | 0.878 | 0.892 | 0.939
T+A+E | Val | 0.993 | 0.882 | 0.988 | 0.840 | 0.876 | 0.890 | 0.937
T | Test | 0.992 | 0.873 | 0.983 | 0.829 | 0.864 | 0.878 | 0.932
A | Test | 0.992 | 0.872 | 0.983 | 0.832 | 0.866 | 0.878 | 0.933
T+A | Test | 0.993 | 0.879 | 0.982 | 0.820 | 0.855 | 0.875 | 0.928
T+E | Test | 0.993 | 0.876 | 0.982 | 0.826 | 0.857 | 0.876 | 0.929
A+E | Test | 0.992 | 0.874 | 0.983 | 0.831 | 0.865 | 0.878 | 0.933
T+A+E | Test | 0.993 | 0.878 | 0.982 | 0.828 | 0.859 | 0.878 | 0.930

B.3 Baseline Methods

GT. We use the ground-truth listener motion sequences from the test set as an upper-bound reference.

Random Selection. We randomly sample a motion sequence from HumanML3D [19] as a naive baseline.

Retrieval. Following standard text–motion matching protocols [19, 82], we retrieve a listener motion by matching the speaker transcription against candidate motions and returning the top-1 nearest neighbor from the training set. Specifically, we use the pretrained text and motion encoders from [19], which are trained with a contrastive objective so that matched text–motion pairs are close in the shared embedding space, while mismatched pairs are separated by a margin. The text encoder maps the input transcription to a semantic feature vector, while the motion encoder first converts a pose sequence into motion snippet codes and then maps them to a motion feature vector. In practice, the text encoder follows the architecture in [19], and the motion encoder is implemented as a bidirectional GRU with hidden size 1,024.

Cascaded LLM→T2M. We construct cascaded baselines by first prompting an LLM to generate a caption of the listener reactive motion conditioned on the speaker transcription and emotion. Then, we feed the generated caption into a text-to-motion (T2M) model to synthesize the final motion. Here, we consider two LLMs, Qwen3-30B-A3B and a fine-tuned Qwen3-4B-Thinking, together with two representative T2M generators, T2M-GPT and MG-MotionLLM. Accordingly, LLM→T2M-GPT denotes the cascade using Qwen3-30B-A3B and T2M-GPT, while LLM→T2M-GPT* uses the fine-tuned Qwen3-4B-Thinking together with T2M-GPT. Similarly, LLM→MG-MotionLLM denotes the cascade using Qwen3-30B-A3B and MG-MotionLLM, while LLM→MG-MotionLLM* uses the fine-tuned Qwen3-4B-Thinking together with MG-MotionLLM. To keep the main table concise, we report the cascaded baselines under the T+E setting.
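The cascaded baselines are a two-stage pipeline: an LLM turns the speaker utterance (and emotion) into a listener-motion caption, and a text-to-motion model renders that caption. The sketch below shows the control flow only; `generate_caption` and `text_to_motion` are hypothetical wrappers around the LLM and the T2M generator, and the prompt wording is illustrative rather than the one used in our experiments.

```python
def cascaded_llm_t2m(transcript, emotion, generate_caption, text_to_motion):
    """Two-stage LLM -> T2M baseline.

    generate_caption(prompt) -> str : wrapper around the LLM (e.g., Qwen3).
    text_to_motion(caption)  -> motion sequence : wrapper around a T2M model
                                (e.g., T2M-GPT or MG-MotionLLM).
    """
    prompt = (
        "A speaker says the following with a given emotion.\n"
        f"Transcript: {transcript}\nEmotion: {emotion}\n"
        "Describe, in one sentence, a plausible full-body listener reaction."
    )
    caption = generate_caption(prompt)   # stage 1: listener-motion caption
    return text_to_motion(caption)       # stage 2: synthesize the motion
```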
Cascaded LLM → T2M. We construct cascaded baselines by first prompting an LLM to generate a caption of the listener's reactive motion, conditioned on the speaker transcription and emotion. We then feed the generated caption into a text-to-motion (T2M) model to synthesize the final motion. Here, we consider two LLMs, Qwen3-30B-A3B and a fine-tuned Qwen3-4B-Thinking, together with two representative T2M generators, T2M-GPT and MG-MotionLLM. Accordingly, LLM → T2M-GPT denotes the cascade using Qwen3-30B-A3B and T2M-GPT, while LLM → T2M-GPT* uses the fine-tuned Qwen3-4B-Thinking together with T2M-GPT. Similarly, LLM → MG-MotionLLM denotes the cascade using Qwen3-30B-A3B and MG-MotionLLM, while LLM → MG-MotionLLM* uses the fine-tuned Qwen3-4B-Thinking together with MG-MotionLLM. To keep the main table concise, we report the cascaded baselines under the T + E setting.

B.4 Evaluation Metrics

We evaluate model performance from three complementary perspectives: (i) reactive appropriateness, (ii) motion quality, and (iii) diversity.

Reactive appropriateness. Reactive appropriateness measures how well the generated listener motions respond to the speaker utterance. For each speaker utterance C^s, the annotated listener motions are partitioned into three relevance tiers: Gold G(C^s), Silver S(C^s), and Negative N(C^s). Let

\widehat{\mathcal{R}}^l(C^s) = \{\hat{x}^l_{m,1}, \ldots, \hat{x}^l_{m,M}\}   (36)

denote the set of M generated listener motion sequences for the same condition. To assess relative appropriateness, we use the multimodal judge network introduced in Sec. B.1, which assigns a compatibility score

s_\psi(C^s, x^l_m)   (37)

to a candidate listener motion x^l_m conditioned on the speaker input C^s. For any candidate set A(C^s), we define its mean judge score as

\bar{s}_{\mathcal{A}}(C^s) = \frac{1}{|\mathcal{A}(C^s)|} \sum_{x^l_m \in \mathcal{A}(C^s)} s_\psi(C^s, x^l_m).   (38)

For brevity, we denote the mean scores of the generated set and the three annotated tiers by

g(C^s) = \bar{s}_{\widehat{\mathcal{R}}^l}(C^s), \quad G(C^s) = \bar{s}_{\mathcal{G}}(C^s), \quad S(C^s) = \bar{s}_{\mathcal{S}}(C^s), \quad N(C^s) = \bar{s}_{\mathcal{N}}(C^s).   (39)

We then report Win(g > G), Win(g > S), and Win(g > N), defined as

\mathrm{Win}(g > \mathcal{A}) = \frac{1}{|\mathcal{D}|} \sum_{C^s \in \mathcal{D}} \kappa\big(g(C^s), \bar{s}_{\mathcal{A}}(C^s)\big), \quad \mathcal{A} \in \{\mathcal{G}, \mathcal{S}, \mathcal{N}\},   (40)

where D denotes the evaluation set, and

\kappa(u, v) = \begin{cases} 1, & u > v, \\ 0.5, & u = v, \\ 0, & u < v. \end{cases}   (41)

Intuitively, Win(g > N) measures whether the generated motions are preferred over clearly inappropriate responses, Win(g > S) is a stricter criterion against moderately appropriate responses, and Win(g > G) is the most challenging criterion against highly appropriate annotated reactions. Higher values indicate stronger reactive appropriateness.

We further report Gen@3, which measures whether at least one generated motion is ranked within the top 3 among all candidates under the same speaker utterance. For each C^s, we form the candidate pool

\mathcal{C}(C^s) = \mathcal{G}(C^s) \cup \mathcal{S}(C^s) \cup \mathcal{N}(C^s) \cup \widehat{\mathcal{R}}^l(C^s),   (42)

rank all candidates in C(C^s) by s_ψ(C^s, ·) in descending order, and denote the resulting rank of a candidate x^l_m by rank_{C^s}(x^l_m). We then compute

\mathrm{Gen}@3 = \frac{1}{|\mathcal{D}|} \sum_{C^s \in \mathcal{D}} \mathbb{I}\Big[ \min_{\hat{x}^l_m \in \widehat{\mathcal{R}}^l(C^s)} \mathrm{rank}_{C^s}(\hat{x}^l_m) \le 3 \Big].   (43)

This metric is particularly suitable for our task because reactive listener behavior is inherently one-to-many: the same speaker utterance may admit multiple plausible listener reactions, and Gen@3 evaluates whether the model can produce at least one highly competitive response within a limited candidate budget.
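The following sketch shows one way these preference metrics could be computed from judge scores, directly mirroring Eqs. (40)-(43). The per-utterance data layout (dicts holding generated, Gold, Silver, and Negative candidates) and the `judge` callable standing in for s_ψ are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np

def kappa(u, v):
    """Eq. (41): 1 for a win, 0.5 for a tie, 0 for a loss."""
    return 1.0 if u > v else (0.5 if u == v else 0.0)

def reactive_appropriateness(eval_set, judge):
    """eval_set: list of dicts with keys 'condition', 'generated', 'G', 'S', 'N';
    judge(condition, motion) plays the role of s_psi."""
    wins = {'G': [], 'S': [], 'N': []}
    gen_at_3 = []
    for item in eval_set:                                      # one speaker utterance C^s
        cond = item['condition']
        gen_scores = [judge(cond, x) for x in item['generated']]
        tier_scores = {t: [judge(cond, x) for x in item[t]] for t in ('G', 'S', 'N')}
        g_mean = float(np.mean(gen_scores))
        for t in wins:                                         # Eq. (40): win rate vs. each tier mean
            wins[t].append(kappa(g_mean, float(np.mean(tier_scores[t]))))
        # Eq. (43): Gen@3 over the pooled candidates, ranked by judge score (descending)
        pool = gen_scores + tier_scores['G'] + tier_scores['S'] + tier_scores['N']
        order = np.argsort(pool)[::-1]                         # pool indices, best first
        best_gen_rank = int(np.where(order < len(gen_scores))[0].min()) + 1
        gen_at_3.append(float(best_gen_rank <= 3))
    results = {f'Win(g>{t})': float(np.mean(v)) for t, v in wins.items()}
    results['Gen@3'] = float(np.mean(gen_at_3))
    return results
```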
Motion quality. We evaluate motion quality using Fréchet Inception Distance (FID) [23] in a motion feature space. Let f_eval(x^l_m) denote the feature representation of a motion sequence extracted by a pretrained motion evaluation network. We compute the feature statistics of generated motions and real motions in the test set, and then measure the Fréchet distance between the two Gaussian distributions:

\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),   (44)

where (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of the real and generated motion features, respectively. Lower FID indicates that the generated motions are closer to the distribution of real listener motions, and therefore reflects better overall motion quality.

Diversity. Since a single speaker utterance may admit multiple plausible listener reactions, it is also important to evaluate the diversity of generated motions. Following prior work in human motion generation [82, 96], we measure diversity in the same motion feature space. Given the set of all generated motions, we randomly sample two subsets of equal size S_d, denoted by {x̂^l_{m,1}, ..., x̂^l_{m,S_d}} and {x̂^{l'}_{m,1}, ..., x̂^{l'}_{m,S_d}}, and define diversity as

\mathrm{Diversity} = \frac{1}{S_d} \sum_{i=1}^{S_d} \big\| f_{\mathrm{eval}}(\hat{x}^l_{m,i}) - f_{\mathrm{eval}}(\hat{x}^{l'}_{m,i}) \big\|_2.   (45)

Higher diversity indicates that the generated motions exhibit greater variation and are less likely to collapse to a small set of repetitive motion patterns.
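A minimal sketch of how Eqs. (44) and (45) could be computed in the motion feature space is given below. Here f_eval is the pretrained motion evaluation network, feature matrices are assumed to be NumPy arrays of shape (num_sequences, feature_dim), and the subset size `s_d` is an illustrative default since the text above does not fix its value.

```python
import numpy as np
from scipy import linalg

def motion_fid(real_feats, gen_feats):
    """Eq. (44): Frechet distance between Gaussians fitted to real and generated features."""
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    sig_r = np.cov(real_feats, rowvar=False)
    sig_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sig_r @ sig_g, disp=False)       # (Sigma_r Sigma_g)^{1/2}
    covmean = covmean.real                                     # drop tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sig_r + sig_g - 2.0 * covmean))

def motion_diversity(gen_feats, s_d=300, seed=0):
    """Eq. (45): mean pairwise distance between two disjoint random subsets of size S_d."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(gen_feats), size=2 * s_d, replace=False)
    a, b = gen_feats[idx[:s_d]], gen_feats[idx[s_d:]]
    return float(np.mean(np.linalg.norm(a - b, axis=1)))
```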
Table 9: Full hyperparameter sweep results for group-wise preference training. We vary the ranking margin m, the ranking-loss weight λ_rank, and the Gold-vs-Negative weight λ_gn. We report pairwise preference metrics (Win(g>N), Win(g>S), Win(g>G)), together with Gen@3, FID, and Diversity.
m  λ_rank  λ_gn  Win(g>N)↑  Win(g>S)↑  Win(g>G)↑  Gen@3↑  FID↓  Diversity↑
0.00  0.00  0.00  0.9976  0.7809  0.2585  0.9600  5.2638  5.3005
0.00  0.00  0.25  0.9976  0.7809  0.2633  0.9600  5.2638  5.3005
0.00  0.00  0.50  0.9976  0.7809  0.2615  0.9600  5.2638  5.3005
0.00  0.00  1.00  0.9964  0.7809  0.2615  0.9613  5.2638  5.3005
0.00  0.25  0.00  0.9988  0.7809  0.2331  0.9467  5.9644  4.6993
0.00  0.25  0.25  0.9952  0.7482  0.2240  0.9467  5.2102  4.8197
0.00  0.25  0.50  0.9988  0.7288  0.2137  0.9455  5.3426  4.9865
0.00  0.25  1.00  0.9939  0.7815  0.2458  0.9528  5.3948  4.7384
0.00  0.50  0.00  0.9927  0.7760  0.2548  0.9443  4.6552  4.7315
0.00  0.50  0.25  0.9952  0.7730  0.2482  0.9600  5.4479  4.4127
0.00  0.50  0.50  0.9952  0.7476  0.2379  0.9600  5.9814  4.3124
0.00  0.50  1.00  0.9939  0.7694  0.2512  0.9576  5.3426  4.5137
0.00  1.00  0.00  0.9964  0.7548  0.2312  0.9443  6.5379  3.9613
0.00  1.00  0.25  0.9891  0.7306  0.2391  0.9479  7.0065  3.9543
0.00  1.00  0.50  0.9964  0.7391  0.2125  0.9540  5.5322  4.4312
0.00  1.00  1.00  0.9855  0.6731  0.1925  0.9407  6.8036  3.9632
0.50  0.00  0.00  0.9964  0.7809  0.2597  0.9600  5.2638  5.3005
0.50  0.00  0.25  0.9976  0.7809  0.2639  0.9613  5.2638  5.3005
0.50  0.00  0.50  0.9976  0.7809  0.2615  0.9600  5.2638  5.3005
0.50  0.00  1.00  0.9952  0.7809  0.2615  0.9588  5.2638  5.3005
0.50  0.25  0.00  0.9939  0.7494  0.2349  0.9407  5.0807  4.8318
0.50  0.25  0.25  1.0000  0.7966  0.2663  0.9600  4.7596  4.8039
0.50  0.25  0.50  0.9903  0.7337  0.2343  0.9407  4.8888  4.6845
0.50  0.25  1.00  0.9952  0.8184  0.2778  0.9552  5.1955  4.8183
0.50  0.50  0.00  0.9964  0.8287  0.3057  0.9625  5.8396  4.1884
0.50  0.50  0.25  0.9952  0.7579  0.2318  0.9310  5.3855  4.3443
0.50  0.50  0.50  0.9952  0.7736  0.2385  0.9625  6.2371  4.3488
0.50  0.50  1.00  0.9952  0.6762  0.1913  0.9467  6.1306  4.3766
0.50  1.00  0.00  0.9915  0.7337  0.2403  0.9492  6.7096  3.9289
0.50  1.00  0.25  0.9915  0.7082  0.2149  0.9443  5.4811  4.1878
0.50  1.00  0.50  0.9673  0.6132  0.1901  0.9334  6.9334  3.9102
0.50  1.00  1.00  0.9891  0.6168  0.1834  0.9237  6.5986  3.9541
1.00  0.00  0.00  0.9976  0.7809  0.2597  0.9600  5.2638  5.3005
1.00  0.00  0.25  0.9976  0.7809  0.2609  0.9588  5.2638  5.3005
1.00  0.00  0.50  0.9964  0.7809  0.2609  0.9600  5.2638  5.3005
1.00  0.00  1.00  0.9976  0.7809  0.2627  0.9588  5.2638  5.3005
1.00  0.25  0.00  0.9964  0.8008  0.2851  0.9516  6.0285  4.2946
1.00  0.25  0.25  0.9939  0.7676  0.2464  0.9552  5.1537  4.6242
1.00  0.25  0.50  0.9939  0.7821  0.2682  0.9516  5.3639  4.5391
1.00  0.25  1.00  0.9988  0.8117  0.2706  0.9625  5.1943  4.6935
1.00  0.50  0.00  0.9927  0.7524  0.2288  0.9528  5.3754  4.3702
1.00  0.50  0.25  0.9952  0.7361  0.2288  0.9455  5.6698  4.2394
1.00  0.50  0.50  0.9903  0.7113  0.2010  0.9516  5.8942  4.3384
1.00  0.50  1.00  0.9915  0.6501  0.1816  0.9310  5.6888  4.2328
1.00  1.00  0.00  0.9952  0.6562  0.1973  0.9310  7.0648  3.9867
1.00  1.00  0.25  0.9849  0.5938  0.1774  0.9262  7.4283  3.8852
1.00  1.00  0.50  0.9921  0.5914  0.1798  0.9104  8.6083  3.6349
1.00  1.00  1.00  0.9831  0.5847  0.1731  0.9237  6.2941  3.9609
2.00  0.00  0.00  0.9976  0.7809  0.2567  0.9600  5.2638  5.3005
2.00  0.00  0.25  0.9976  0.7809  0.2585  0.9600  5.2638  5.3005
2.00  0.00  0.50  0.9964  0.7809  0.2579  0.9600  5.2638  5.3005
2.00  0.00  1.00  0.9976  0.7809  0.2627  0.9613  5.2638  5.3005
2.00  0.25  0.00  0.9952  0.7639  0.2512  0.9540  5.6781  4.4907
2.00  0.25  0.25  0.9891  0.7433  0.2452  0.9588  5.1178  4.7459
2.00  0.25  0.50  0.9964  0.7815  0.2603  0.9588  5.6664  4.3494
2.00  0.25  1.00  0.9939  0.7748  0.2785  0.9697  5.7083  4.1561
2.00  0.50  0.00  0.9939  0.7264  0.2228  0.9516  6.1482  4.1211
2.00  0.50  0.25  0.9964  0.6477  0.1828  0.9249  6.7075  3.8914
2.00  0.50  0.50  0.9964  0.6326  0.1901  0.9249  5.4215  4.1601
2.00  0.50  1.00  0.9909  0.6610  0.1907  0.9370  6.8355  3.7096
2.00  1.00  0.00  0.9927  0.6423  0.1998  0.9298  7.1093  3.8085
2.00  1.00  0.25  0.9715  0.6483  0.2046  0.9407  6.8560  3.7436
2.00  1.00  0.50  0.9752  0.6362  0.1907  0.9298  6.1279  3.8659
2.00  1.00  1.00  0.9655  0.5648  0.1544  0.9140  6.1125  4.0394

C More Details of ReactMotionNet Dataset

ReactMotionNet exhibits three desirable properties for studying reactive listener motion generation. First, it provides large-scale supervision, containing over 151K labeled speaker–listener pairs. Second, it explicitly captures the one-to-many nature of listener behavior by associating each speaker utterance with multiple candidate reactive motions. Third, it provides graded supervision through Gold, Silver, and Negative labels, supporting both generative modeling and preference-aware evaluation. Moreover, the dataset is split by disjoint speaker utterances, enabling a cleaner evaluation of generalization to unseen conversational conditions.

In total, ReactMotionNet contains 151,328 labeled speaker–listener pairs, covering 8,298 unique speaker utterances and 2,029 unique listener reactive motions. On average, each speaker utterance is paired with 18.24 candidate reactive motions, further highlighting the inherently one-to-many nature of reactive listener behavior. Among all pairs, 9,307, 34,196, and 107,825 are annotated as Gold, Silver, and Negative, respectively, reflecting the graded appropriateness of candidate reactions. We partition the dataset by speaker utterance using an 8:1:1 train/validation/test split, ensuring that utterances are disjoint across splits, i.e., no utterance appears in more than one partition.
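For concreteness, an utterance-disjoint 8:1:1 split of the pairs could be produced as sketched below; the record layout and the `utterance_id` field are hypothetical, and only the disjointness-by-utterance property follows the description above.

```python
import random

def split_by_utterance(pairs, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Assign every (speaker utterance, listener motion) pair to a split so that
    all pairs sharing an utterance land in the same partition."""
    utt_ids = sorted({p['utterance_id'] for p in pairs})
    random.Random(seed).shuffle(utt_ids)
    cut1 = int(ratios[0] * len(utt_ids))
    cut2 = int((ratios[0] + ratios[1]) * len(utt_ids))
    bucket = {u: ('train' if i < cut1 else 'val' if i < cut2 else 'test')
              for i, u in enumerate(utt_ids)}
    splits = {'train': [], 'val': [], 'test': []}
    for p in pairs:
        splits[bucket[p['utterance_id']]].append(p)
    return splits
```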
Fig. 7: Emotion distributions over the full dataset and across the train/validation/test splits. Panels: (a) All, (b) Train, (c) Val, (d) Test.

The dataset covers 47 emotion categories, including admiring, adoring, aesthetically appreciative, amused, angry, anxious, ashamed, aware, awed, awkward, bored, calm, confused, contemplative, contemptuous, content, craving, desirous, determined, disappointed, disgusted, distressed, doubtful, ecstatic, embarrassed, empathetic (in pain), entranced, envious, excited, fearful, focused, guilty, horrified, interested, joyful, loving, nostalgic, pained, proud, relieved, romantic, sad, satisfied, surprised, sympathetic, tired, and triumphant. As shown in Fig. 7, these emotion labels exhibit a broad yet imbalanced distribution across the full dataset and each split, making ReactMotionNet a realistic benchmark for modeling diverse affective conversational responses.

D Additional Experimental Results

D.1 Hyperparameter Sensitivity Analysis

We study the sensitivity of group-wise preference training to the ranking margin m, the ranking-loss weight λ_rank, and the Gold-vs-Negative weight λ_gn. We primarily consider Gen@3, which measures whether generated motions can be ranked among the top plausible candidates under the same candidate budget. We additionally report Win(g > S) and Win(g > G) to assess relative preference quality against medium-quality and high-quality reference candidates, respectively. FID and Diversity are further included to characterize motion realism and output diversity.

Table 10: Representative hyperparameter configurations selected from the full sweep. We emphasize Gen@3, Win(g > S), and Win(g > G), together with FID and Diversity.
Config  m  λ_rank  λ_gn  Win(g>N)↑  Win(g>S)↑  Win(g>G)↑  Gen@3↑  FID↓  Diversity↑
C1  2.00  0.25  1.00  0.9939  0.7748  0.2785  0.9697  5.7083  4.1561
C2  0.50  0.50  0.00  0.9964  0.8287  0.3057  0.9625  5.8396  4.1884
C3  0.00  0.50  0.00  0.9927  0.7760  0.2548  0.9443  4.6552  4.7315
C4  1.00  0.00  0.25  0.9976  0.7809  0.2609  0.9588  5.2638  5.3005

The hyperparameter sweep reveals several consistent patterns. First, introducing a small positive ranking margin is beneficial and more reliable than using no margin. Under λ_rank = 0.25 and λ_gn = 0.25, increasing m from 0 to 0.5 improves Win(g > S) from 0.7482 to 0.7966, Win(g > G) from 0.2240 to 0.2663, and Gen@3 from 0.9467 to 0.9600, while simultaneously reducing FID from 5.2102 to 4.7596. Although larger margins can further increase Gen@3 in certain cases, such gains are not consistently accompanied by improvements in preference alignment or motion quality, suggesting that excessively large margins may over-specialize the objective.

Second, λ_rank is the most sensitive hyperparameter in the sweep. Moderate ranking supervision is beneficial, whereas overly large values tend to degrade both alignment and generation quality. For instance, at m = 0.5 and λ_gn = 0.25, increasing λ_rank from 0.25 to 0.5 and 1.0 decreases Win(g > S) from 0.7966 to 0.7579 and 0.7082, decreases Win(g > G) from 0.2663 to 0.2318 and 0.2149, and worsens FID from 4.7596 to 5.3855 and 5.4811. This indicates that excessive ranking pressure can bias optimization toward relative ordering at the expense of generative fidelity.

Third, λ_gn has a secondary but non-negligible effect, with a moderate value yielding the most favorable trade-off.
At m = 0.5 and λ_rank = 0.25, setting λ_gn = 0.25 improves Win(g > S), Win(g > G), and Gen@3 over λ_gn = 0, while also reducing FID. By contrast, further increasing λ_gn to 1.0 slightly improves pairwise preference scores, but lowers Gen@3 and degrades FID, indicating that stronger Gold-vs-Negative separation does not necessarily translate into better overall generation quality.

Accordingly, we use m = 0.5, λ_rank = 0.25, and λ_gn = 0.25 in all main experiments, as this setting resides in a stable regime of the sweep and yields the most balanced overall performance across preference-oriented and generation-oriented criteria.

Fig. 8: Hyperparameter sensitivity heatmaps under different ranking margins. We show Gen@3, Win(g > S), and FID as functions of λ_rank and λ_gn.

D.2 Inference Efficiency

Table 11 lists the inference efficiency of the proposed ReactMotion.

Table 11: Inference efficiency on a single NVIDIA A100 (80GB).
Metric / Value
Token generation speed  63.6 tokens/s
Motion generation speed  1.74 turns/s
End-to-end generation speed  1.66 turns/s
Average latency per sample  ~0.60 s
VQ-VAE decoding speed  39.12 turns/s

During inference, ReactMotion runs on a single NVIDIA A100 80GB GPU and autoregressively generates listener motion tokens conditioned on the speaker's multimodal inputs. In our evaluation, the model generates 50 listener reactive motions corresponding to 50 speaker utterances. In total, it produces 1,830 motion tokens in 28.8 seconds, achieving a generation throughput of 63.6 tokens per second and 1.74 motion sequences per second, which corresponds to an average latency of approximately 0.60 seconds per listener motion sequence.

The generated motion tokens are then decoded into joint sequences using the VQ-VAE decoder. The decoder processes 39.1 motion sequences per second, introducing minimal computational overhead. As a result, the complete pipeline achieves an end-to-end throughput of 1.66 motion sequences per second. These results indicate that ReactMotion maintains a favorable balance between model capacity and inference efficiency, enabling near real-time reactive motion generation in conversational scenarios.
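As a back-of-envelope check, the figures in Table 11 can be reproduced from the counts quoted above (1,830 tokens and 50 sequences generated in 28.8 seconds, decoded at 39.12 sequences per second); the snippet below only redoes this arithmetic and implies no measurement code.

```python
tokens, seqs, gen_time = 1830, 50, 28.8          # generated tokens, sequences, seconds
decode_rate = 39.12                              # VQ-VAE decoder throughput, sequences/s

print(tokens / gen_time)                         # ~63.5 tokens/s   (reported: 63.6)
print(seqs / gen_time)                           # ~1.74 sequences/s
print(gen_time / seqs)                           # ~0.58 s average latency (reported: ~0.60 s)
print(seqs / (gen_time + seqs / decode_rate))    # ~1.66 sequences/s end-to-end
```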
D.3 More Details of User Study

We conducted a user study on the Tencent Questionnaire platform to evaluate the listener motions generated by ReactMotion (Ours) against two generative baselines, namely the CE variant and LLM → MG-MotionLLM*, as well as the best-in-group Silver reference. A total of 59 volunteers (16 female and 43 male), all with relevant backgrounds in machine learning or deep learning, participated in the study through an online survey. In each trial, participants were presented with a pair of listener-motion videos (A/B) conditioned on the same speaker utterance, with the speaker's transcript displayed and the corresponding audio played. They were asked to choose which video exhibited the more appropriate reactive listener motion. To avoid positional bias, the two compared motions were randomly assigned to the A/B positions. Each participant completed 36 trials, covering six speaker utterances with six pairwise comparisons per condition. For the Silver condition, we selected the best candidate within each speaker-condition group based on its motion caption and rendered motion clip.

The results in Fig. 5 reveal three notable findings. First, Ours is consistently preferred over both generative baselines, achieving 67.8% preference against CE and 72.0% against LLM → MG-MotionLLM, which demonstrates the advantage of our unified multimodal Seq2Seq formulation over both standard CE training and cascaded generation pipelines. Second, although the Silver reference remains stronger overall, Ours is substantially closer to Silver than either baseline: Ours receives 44.1% of the votes against Silver, whereas CE and LLM → MG-MotionLLM receive only 31.9% and 31.4%, respectively. This indicates that the motions generated by Ours are perceptually much closer to high-quality in-group references. Third, these results highlight the effectiveness of the proposed group-wise preference learning objective, which explicitly models the ordering among Gold, Silver, and Negative reactions and leads to more appropriate listener behaviors under human evaluation. At the same time, the remaining gap between Ours and Silver suggests that reactive listener motion generation remains challenging, leaving room for further improvement in motion naturalness, contextual precision, and diversity.

D.4 Failure Cases

While the model effectively generates contextually appropriate listener motions in many scenarios, capturing deeper conversational intent in complex dialogues remains challenging. In ambiguous or long-tail situations where appropriate listener behavior requires deeper intent understanding, the current model may still exhibit limited robustness. This highlights a promising research direction for future work to further enhance intent-aware interaction modeling in dyadic interaction.

E Limitations

Since we are the first to explore this task, we design a relatively simple yet effective model architecture to maintain training stability and computational efficiency. This design allows us to validate the core idea of our approach without introducing excessive architectural complexity. The proposed approach already achieves promising results, demonstrating its feasibility and effectiveness. Nevertheless, there remains substantial potential for further improvement. Future work could explore more advanced network architectures and more sophisticated training techniques to further enhance performance.

References

1. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report (2023)
2. Alexanderson, S., Nagy, R., Beskow, J., Henter, G.E.: Listen, denoise, action! Audio-driven motion synthesis with diffusion models. ACM Transactions on Graphics (TOG) 42(4), 1–20 (2023)
3. Ao, T., Gao, Q., Lou, Y., Chen, B., Liu, L.: Rhythmic gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Transactions on Graphics (TOG) 41(6), 1–19 (2022)
4. Barquero, G., Escalera, S., Palmero, C.: Seamless human motion composition with blended positional encodings. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
5. Bishop, C.M., Nasrabadi, N.M.: Pattern recognition and machine learning, vol. 4. Springer (2006)
6. Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4), 324–345 (1952)
7.
Chen, B., Li, Y., Ding, Y.X., Shao, T., Zhou, K.: Enabling synergistic full-b ody con trol in prompt-based co-sp eec h motion generation. In: Pro ceedings of the ACM In ternational Conference on Multimedia (A CM M M). pp. 6774–6783 (2024) 36 C. Luo et al 8. Chen, C., Zhang, J., Lakshmik anth, S.K., F ang, Y., Shao, R., W etzstein, G., F ei-F ei, L., A deli, E.: The language of motion: Unifying verbal and non-verbal language of 3d human motion. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6200–6211 (June 2025) 9. Chen, C., Zhang, J., Lakshmik an th, S.K., F ang, Y., Shao, R., W etzstein, G., F ei- F ei, L., Adeli, E.: The language of motion: Unifying v erbal and non-verbal lan- guage of 3d human motion. In: Proceedings of the Computer Vision and P attern Recognition Conference. pp. 6200–6211 (2025) 10. Chen, X., Jiang, B., Liu, W., Huang, Z., F u, B., Chen, T., Y u, G.: Executing your commands via motion diffusion in laten t space. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18000– 18010 (2023) 11. Chiang, W.L., Zheng, L., Sheng, Y., Angelop oulos, A.N., Li, T., Li, D., Zhu, B., Zhang, H., Jordan, M., Gonzalez, J.E., et al.: Chatb ot arena: An op en platform for ev aluating llms b y human preference. In: The In ternational Conference on Mac hine Learning (ICML) (2024) 12. Chopin, B., T ang, H., Otb erdout, N., Daoudi, M., Sebe, N.: Interaction trans- former for human reaction generation. IEEE T ransactions on Multimedia (TMM) 25 , 8842–8854 (2023) 13. Christiano, P .F., Leike, J., Bro wn, T., Martic, M., Legg, S., Amo dei, D.: Deep reinforcemen t learning from human preferences. Adv ances in Neural Information Pro cessing Systems (NeurIPS) 30 (2017) 14. Ch u, X., Liu, R., Huang, Y., Liu, Y., Peng, Y., Zheng, B.: Unils: End-to-end audio- driv en av atars for unified listening and sp eaking. arXiv preprint (2025) 15. Défossez, A., et al.: Moshi: a sp eec h-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037 (2024) 16. Dub ois, Y., Galam b osi, B., Liang, P ., Hashimoto, T.B.: Length-controlled alpacaev al: A simple wa y to debias automatic ev aluators. arXiv preprint arXiv:2404.04475 (2024) 17. Ghosh, A., Dabral, R., Golyanik, V., Theobalt, C., Slusallek, P .: Remos: 3d motion-conditioned reaction syn thesis for t wo-person interactions. In: Europ ean Conference on Computer Vision (ECCV). pp. 418–437 (2024) 18. Guo, C., Mu, Y., Ja ved, M.G., W ang, S., Cheng, L.: Momask: Generative masked mo deling of 3d human motions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1900–1910 (2024) 19. Guo, C., Zou, S., Zuo, X., W ang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d h uman motions from text. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 5152– 5161 (2022) 20. Guo, W., Bie, X., Alameda-Pineda, X., Moreno-Noguer, F.: Multi-p erson extreme motion prediction. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 13053–13064 (2022) 21. Han, B., Peng, H., Dong, M., Ren, Y., Shen, Y., Xu, C.: AMD: autoregressiv e motion diffusion. In: W o oldridge, M.J., Dy , J.G., Natara jan, S. (eds.) The As- so ciation for the A dv ancemen t of Artificial Intelligence (AAAI). pp. 2022–2030 (2024) 22. 
He, X., Huang, Q., Zhang, Z., Lin, Z., W u, Z., Y ang, S., Li, M., Chen, Z., Xu, S., W u, X.: Co-sp eec h gesture video generation via motion-decoupled diffusion mo del. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 2263–2273 (2024) Abbreviated pap er title 37 23. Heusel, M., Ramsauer, H., Un terthiner, T., Nessler, B., Ho chreiter, S.: Gans trained by a tw o time-scale up date rule conv erge to a lo cal nash equilibrium. A dv ances in Neural Information Pro cessing Systems (NeurIPS) 30 (2017) 24. Ho, L., Huang, Y., Qin, D., Shi, M., T se, W., Liu, W., Y amagishi, J., Kom ura, T.: In teract: A large-scale dataset of dynamic, expressive and interactiv e activities b et w een tw o p eople in daily scenarios. Pro ceedings of the ACM on Computer Graphics and In teractive T echniques (P ACMCGIT) 8 (4), 1–27 (2025) 25. Hu, T., Zhu, X., Guo, W., Su, K.: Efficien t in teraction recognition through p ositiv e action represen tation. Mathematical Problems in Engineering 2013 (1), 795360 (2013) 26. Huang, Y., W an, W., Y ang, Y., Callison-Burch, C., Y atsk ar, M., Liu, L.: Como: Con trollable motion generation through language guided pose co de editing. In: Europ ean Conference on Computer Vision (ECCV). pp. 180–196 (2024) 27. Huang, Y., Khan, S.M.: Dy adgan: Generating facial expressions in dyadic inter- actions. In: Pro ceedings of the IEEE Conference on Computer Vision and Pattern Recognition W orkshops (CVPR W) (2017) 28. Hurst, A., Lerer, A., Goucher, A.P ., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., W elihinda, A., Hay es, A., Radford, A., et al.: Gpt-4o system card (2024) 29. Jaec h, A., Kalai, A., Lerer, A., Ric hardson, A., El-Kishky , A., Lo w, A., Hely ar, A., Madry , A., Beutel, A., Carney , A., et al.: Op enai o1 system card (2024) 30. Jeong, M., Hwang, Y., Lee, J., Jung, S., Kim, W.H.: Hgm 3 : Hierarchical generativ e mask ed motion mo deling with hard tok en mining. In: International Conference on Learning Represen tations (ICLR) (2025) 31. Khiro dk ar, R., Bansal, A., Ma, L., Newcom b e, R., V o, M., Kitani, K.: Ego- h umans: An ego-centric 3d m ulti-h uman benchmark. In: Pro ceedings of the IEEE/CVF In ternational Conference on Computer Vision (ICCV). pp. 19807– 19819 (2023) 32. Khiro dk ar, R., Song, J.T., Cao, J., Luo, Z., Kitani, K.: Harmony4d: A video dataset for in-the-wild close human interactions. Adv ances in Neural Information Pro cessing Systems (NeurIPS) 37 , 107270–107285 (2024) 33. Kim, D.Y., Lee, H.K., Ch ung, K.: A v atar-mediated exp erience in the metav erse: The impact of av atar realism on user-av atar relationship. Journal of Retailing and Consumer Services 73 , 103382 (2023) 34. Kim, J., Kim, J., Choi, S.: Flame: F ree-form language-based motion synthesis & editing. In: The Asso ciation for the Adv ancement of Artificial In telligence (AAAI). v ol. 37, pp. 8255–8263 (2023) 35. K o, W.R., Jang, M., Lee, J., Kim, J.: Air-act2act: Human–human in teraction dataset for teaching non-verbal so cial b ehaviors to robots. The International Jour- nal of Robotics Researc h 40 (4-5), 691–697 (2021) 36. Lee, G., Deng, Z., Ma, S., Shiratori, T., Sriniv asa, S.S., Sheikh, Y.: T alking with hands 16.2 m: A large-scale dataset of synchronized b ody-finger motion and audio for conv ersational motion analysis and syn thesis. In: Pro ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 763–772 (2019) 37. 
Li, B., Zhao, Y., Zhelun, S., Sheng, L.: Danceformer: Music conditioned 3d dance generation with parametric motion transformer. In: The Asso ciation for the Ad- v ancemen t of Artificial Intelligence (AAAI). vol. 36, pp. 1272–1279 (2022) 38. Li, J., Kang, D., P ei, W., Zhe, X., Zhang, Y., Bao, L., He, Z.: Audio2gestures: Generating div erse gestures from audio. IEEE T ransactions on Visualization and Computer Graphics (TV CG) 30 (8), 4752–4766 (2023) 38 C. Luo et al 39. Li, R., Dai, Y., Zhang, Y., Li, J., Y ang, J., Guo, J., Li, X.: Exploring m ulti-mo dal con trol in m usic-driven dance generation. In: IEEE In ternational Conference on A coustics, Speech, and Signal Processing (ICASSP). pp. 8281–8285 (2024) 40. Li, R., Zhang, H., Zhang, Y., Zhang, Y., Zhang, Y., Guo, J., Zhang, Y., Li, X., Liu, Y.: Lodge++: High-qualit y and long dance generation with robust choreog- raph y patterns. IEEE T ransactions on Pattern Analysis and Mac hine Intelligence (TP AMI) pp. 1–15 (2025) 41. Liang, H., Zhang, W., Li, W., Y u, J., Xu, L.: In tergen: Diffusion-based m ulti-human motion generation under complex in teractions. arXiv preprint arXiv:2304.05684 (2023) 42. Liao, T.H., Zhou, Y., Shen, Y., Huang, C.H.P ., Mitra, S., Huang, J.B., Bhat- tac harya, U.: Shap e m y mov es: T ext-driv en shap e-a ware synthesis of human mo- tions. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 1917–1928 (2025) 43. Liu, H., Zh u, Z., Becherini, G., Peng, Y., Su, M., Zhou, Y., Zhe, X., Iwamoto, N., Zheng, B., Blac k, M.J.: Emage: T o wards unified holistic co-sp eech gesture generation via expressiv e mask ed audio gesture modeling. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 1144–1154 (2024) 44. Liu, P ., Song, L., Huang, J., Liu, H., Xu, C.: Gesturelsm: Laten t shortcut based co-sp eec h gesture generation with spatial-temporal mo deling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10929– 10939 (2025) 45. Liu, Y., Cao, Q., W en, Y., Jiang, H., Ding, C.: T ow ards v ariable and co ordinated holistic co-speech motion generation. In: Pro ceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR). pp. 1566–1576 (2024) 46. Liu, Y., Chen, C., Ding, C., Yi, L.: Physreaction: Physically plausible real-time h umanoid reaction syn thesis via forward dynamics guided 4d imitation. In: Pro- ceedings of the ACM International Conference on Multimedia (A CM MM). pp. 3771–3780 (2024) 47. Liu, Y., Chen, C., Yi, L.: Interactiv e humanoid: Online full-bo dy motion reaction syn thesis with so cial affordance canonicalization and forecasting (2023) 48. Lu, S., W ang, J., Lu, Z., Chen, L.H., Dai, W., Dong, J., Dou, Z., Dai, B., Zhang, R.: Scamo: Exploring the scaling law in autoregressive motion generation model. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 27872–27882 (2025) 49. Luo, C., Song, S., Y an, S., Y u, Z., Ge, Z.: Reactdiff: F undamental multiple appro- priate facial reaction diffusion model. In: Pro ceedings of the ACM International Conference on Multimedia (ACM MM). pp. 5607–5616 (2025) 50. Luo, C., W ang, J., Li, B., Song, S., Ghanem, B.: Omniresp onse: Online multi- mo dal conv ersational resp onse generation in dyadic interactions. In: A dv ances in Neural Information Processing Systems (NeurIPS) (2025) 51. 
Luo, C., et al.: Reactface: Online multiple appropriate facial reaction generation in dyadic in teractions. arXiv preprint (2024), 52. Meng, Z., Xie, Y., Peng, X., Han, Z., Jiang, H.: Rethinking diffusion for text- driv en human motion generation: Redundant representations, ev aluation, and mask ed autoregression. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 27859–27871 (2025) Abbreviated pap er title 39 53. Mughal, M.H., Dabral, R., Habibie, I., Donatelli, L., Habermann, M., Theobalt, C.: Conv ofusion: Multi-modal conv ersational diffusion for co-sp eec h gesture syn- thesis. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 1388–1398 (2024) 54. Murra y , K., Chiang, D.: Correcting length bias in neural mac hine translation. In: Proceedings of the Conference on Machine T ranslation (WMT). pp. 212–223 (2018) 55. Ng, E., Romero, J., Bagautdinov, T., Bai, S., Darrell, T., Kanazaw a, A., Richard, A.: F rom audio to photoreal embo dimen t: Synthesizing humans in conv ersations. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1001–1010 (2024) 56. Ng, E., Xiang, D., Joo, H., Grauman, K.: Y ou2me: Inferring bo dy pose in ego- cen tric video via first and second person interactions. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 9890–9900 (2020) 57. Ng, E., et al.: Learning to listen: Modeling non-deterministic dyadic facial motion. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022) 58. Op enAI: In troducing op enai o3 and o4-mini. https : / / openai . com / index / openai- o3- mini/ (2025) 59. P ark, S., Kim, C., Rha, H., Kim, M., Hong, J., Y eo, J., Ro, Y.: Let’s go real talk: Sp ok en dialogue mo del for face-to-face conv ersation. In: Pro ceedings of the Ann ual Meeting of the Association for Computational Linguistics (A CL). pp. 16334–16348 (2024) 60. P etrovic h, M., Black, M.J., V arol, G.: Action-conditioned 3d human motion syn- thesis with transformer v ae. In: Pro ceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV) (2021) 61. P etrovic h, M., Black, M.J., V arol, G.: T emos: Generating diverse human motions from textual descriptions. In: Europ ean Conference on Computer Vision (ECCV). pp. 480–497 (2022) 62. P etrovic h, M., Litany , O., Iqbal, U., Black, M.J., V arol, G., Bin Peng, X., Remp e, D.: Multi-trac k timeline control for text-driv en 3d h uman motion generation. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1911–1921 (2024) 63. Pin yoan untapong, E., Saleem, M.U., Karunratanakul, K., W ang, P ., Xue, H., Chen, C., Guo, C., Cao, J., Ren, J., T ulyak ov, S.: Con trolmm: Controllable mask ed motion generation (2024) 64. Raab, S., Leib o vitc h, I., Li, P ., Ab erman, K., Sorkine-Hornung, O., Cohen-Or, D.: Mo di: Unconditional motion synthesis from div erse data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 13873–13883 (2023) 65. Raffel, C., Shazeer, N., Rob erts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P .J.: Exploring the limits of transfer learning with a unified text-to- text transformer. Journal of Mac hine Learning Research (JMLR) 21 (140), 1–67 (2020) 66. Rub enstein, P .K., et al.: Audiopalm: A large language mo del that can speak and listen. 
arXiv preprin t arXiv:2306.12925 (2023) 67. Ry o o, M.S., F uchs, T.J., Xia, L., Aggarw al, J.K., Matthies, L.: Rob ot-cen tric ac- tivit y prediction from first-person videos: What will they do to me? In: Pro ceed- ings of the T en th Annual ACM/IEEE International Conference on Human-Robot In teraction (HRI). pp. 295–302 (2015) 40 C. Luo et al 68. Ry o o, M.S., Matthies, L.: First-p erson activity recognition: What are they doing to me? In: Proceedings of the IEEE/CVF In ternational Conference on Computer Vision (ICCV). pp. 2730–2737 (2013) 69. Singh, A., F ry , A., Perelman, A., T art, A., Ganesh, A., El-Kishky , A., McLaughlin, A., Lo w, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card (2025) 70. Song, S., et al.: React 2024: the second m ultiple appropriate facial reaction gen- eration challenge. arXiv preprint (2024), 71. Spaccatini, F., Corlito, G., Sacchi, S.: New dy ads? the effect of social rob ots’ an throp omorphization on empathy tow ards h uman beings. Computers in Human Beha vior 146 , 107821 (2023) 72. Stiennon, N., Ouyang, L., W u, J., Ziegler, D., Low e, R., V oss, C., Radford, A., Amo dei, D., Christiano, P .F.: Learning to summarize with human feedback. A d- v ances in Neural Information Pro cessing Systems (NeurIPS) 33 , 3008–3021 (2020) 73. Sun, M., Xu, C., Jiang, X., Liu, Y., Sun, B., Huang, R.: Beyond talking–generating holistic 3d human dyadic motion for comm unication. In ternational Journ al of Computer Vision 133 (5), 2910–2926 (2025) 74. T evet, G., Gordon, B., Hertz, A., Bermano, A.H., Cohen-Or, D.: Motionclip: Exp osing h uman motion generation to clip space. In: Europ ean Conference on Computer Vision (ECCV). pp. 358–374 (2022) 75. T evet, G., Raab, S., Cohan, S., Reda, D., Luo, Z., Peng, X.B., Bermano, A.H., v an de Panne, M.: Closd: Closing the loop betw een simulation and diffusion for m ulti-task character con trol. In: International Conference on Learning Represen- tations (ICLR) (2025) 76. T evet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprin t arXiv:2209.14916 (2022), also known as MDM; widely used as a diffusion baseline. 77. V eluri, B., P elo quin, B.N., Y u, B., Gong, H., Gollak ota, S.: Beyond turn-based in terfaces: Synchronous llms as full-duplex dialogue agents. In: Pro ceedings of the Conference on Empirical Metho ds in Natural Language Pro cessing (EMNLP). pp. 21390–21402 (2024) 78. W ang, T., W u, Z., He, Q., Chu, J., Qian, L., Cheng, Y., Xing, J., Zhao, J., Jin, L.: Stic kmotion: Generating 3d human motions by dra wing a stickman. In: Pro ceed- ings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 12370–12379 (2025) 79. W ang, Y., Leng, Z., Li, F.W., W u, S.C., Liang, X.: F g-t2m: Fine-grained text- driv en human motion generation via diffusion mo del. In: Pro ceedings of the IEEE/CVF In ternational Conference on Computer Vision (ICCV). pp. 22035– 22044 (2023) 80. W ang, Y., Li, M., Liu, J., Leng, Z., Li, F.W., Zhang, Z., Liang, X.: F g-t2m++: Llms-augmen ted fine-grained text driven human motion generation. International Journal of Computer Vision (IJCV) 133 (7), 4277–4293 (2025) 81. W ang, Z., W ang, J., Li, Y., Lin, D., Dai, B.: Intercon trol: Zero-shot human inter- action generation by controlling ev ery join t. In: A dv ances in Neural Information Pro cessing Systems (NeurIPS) (2024) 82. 
W u, B., Xie, J., Shen, K., K ong, Z., Ren, J., Bai, R., Qu, R., Shen, L.: Mg- motionllm: A unified framew ork for motion comprehension and generation across m ultiple gran ularities. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 27849–27858 (2025) 83. W u, Y., Sc huster, M., Chen, Z., Le, Q.V., Norouzi, M., Mac herey , W., Krikun, M., Cao, Y., Gao, Q., Mac herey , K., et al.: Google’s neural mac hine translation system: Bridging the gap b et ween h uman and machine translation (2016) Abbreviated pap er title 41 84. Xiao, L., Lu, S., Pi, H., F an, K., P an, L., Zhou, Y., F eng, Z., Zhou, X., Peng, S., W ang, J.: Motionstreamer: Streaming motion generation via diffusion-based autoregressiv e mo del in causal latent space. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10086– 10096 (2025) 85. Xu, C., Sun, M., Cheng, Z.Q., W ang, F., Liu, Y., Sun, B., Huang, R., Haupt- mann, A.: Combo: Co-sp eec h holistic 3d h uman motion generation and efficien t customizable adaptation in harmony . IEEE T ransactions on Pattern Analysis and Mac hine In telligence (TP AMI) pp. 1–18 (2025) 86. Xu, L., Lv, X., Y an, Y., Jin, X., W u, S., Xu, C., Liu, Y., Zhou, Y., Rao, F., Sheng, X., et al.: In ter-x: T ow ards versatile human-h uman interaction analysis. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22260–22271 (2024) 87. Xu, L., Zhou, Y., Y an, Y., Jin, X., Zh u, W., Rao, F., Y ang, X., Zeng, W.: Regen- net: T ow ards h uman action-reaction syn thesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1759–1769 (2024) 88. Xu, S., Dou, Z., Shi, M., P an, L., Ho, L., W ang, J., Liu, Y., Lin, C., Ma, Y., W ang, W., et al.: Mospa: Human motion generation driven by spatial audio (2025) 89. Y ang, A., Li, A., Y ang, B., Zhang, B., Hui, B., Zheng, B., Y u, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., W ei, H., Lin, H., T ang, J., Y ang, J., T u, J., Zhang, J., Y ang, J., Y ang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Y ang, K., Y u, L., Deng, L., Li, M., Xu e, M., Li, M., Zhang, P ., W ang, P ., Zh u, Q., Men, R., Gao, R., Liu, S., Luo, S., Li, T., T ang, T., Yin, W., Ren, X., W ang, X., Zhang, X., Ren, X., F an, Y., Su, Y., Zhang, Y., Zhang, Y., W an, Y., Liu, Y., W ang, Z., Cui, Z., Zhang, Z., Zhou, Z., Qiu, Z.: Qw en3 tec hnical rep ort (2025) 90. Y ang, Y., Huang, Z., Xu, C., He, S.: Lagrangian motion fields for long-term mo- tion generation. IEEE T ransactions on Pattern Analysis and Machine Intelligence (TP AMI) 48 (2), 1171–1184 (2026) 91. Yi, H., Liang, H., Liu, Y., Cao, Q., W en, Y., Bolk art, T., T ao, D., Black, M.J.: Generating holistic 3d human motion from speech. In: Proceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 469–480 (2023) 92. Yin, Y., Guo, C., Kaufmann, M., Zarate, J.J., Song, J., Hilliges, O.: Hi4d: 4d in- stance segmentation of close h uman in teraction. In: Pro ceedings of the IEE E/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 17016– 17027 (2023) 93. Y u, C., Zhai, W., Y ang, Y., Cao, Y., Zha, Z.J.: Hero: Human reaction genera- tion from videos. In: Pro ceedings of the IEEE/CVF In ternational Conference on Computer Vision (ICCV). pp. 10262–10274 (2025) 94. 
Zhang, D., Li, S., Zhang, X., Zhan, J., W ang, P ., Zhou, Y., Qiu, X.: Sp eec hgpt: Emp o w ering large language mo dels with intrinsic cross-mo dal conv ersational abil- ities. In: Findings of the Association for Computational Linguistics. pp. 15757– 15773 (2023) 95. Zhang, J., F an, H., Y ang, Y.: Energymogen: Comp ositional human motion gen- eration with energy-based diffusion mo del in latent space. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR). pp. 17592–17602 (2025) 42 C. Luo et al 96. Zhang, J., Zhang, Y., Cun, X., Zhang, Y., Zhao, H., Lu, H., Shen, X., Shan, Y.: Generating human motion from textual descriptions with discrete representations. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14730–14740 (2023) 97. Zhang, M., Cai, Z., P an, L., Hong, F., Guo, X., Y ang, L., Liu, Z.: Motiondiffuse: T ext-driven human motion generation with diffusion model. IEEE T ransactions on P attern Analysis and Mac hine Intelligence (TP AMI) 46 (6), 4115–4128 (2024) 98. Zhang, P ., Liu, P ., Garrido, P ., Kim, H., Chaudhuri, B.: Kinmo: Kinematic-aw are h uman motion understanding and generation. In: Pro ceedings of the IEEE/CVF In ternational Conference on Computer Vision (ICCV). pp. 11187–11197 (2025) 99. Zhang, X., Li, J., Zhang, J., Dang, Z., Ren, J., Bo, L., T u, Z.: Semtalk: Holistic co-sp eec h motion generation with frame-level semantic emphasis. In: Proceedings of the IEEE/CVF In ternational Conference on Computer Vision (ICCV). pp. 13761–13771 (2025) 100. Zhang, X., Li, J., Zhang, J., Ren, J., Bo, L., T u, Z.: Ec homask: Speech-queried atten tion-based mask mo deling for holistic co-sp eec h motion generation. In: Pro- ceedings of the ACM International Conference on Multimedia (A CM MM). pp. 10827–10836 (2025) 101. Zhang, Y., Huang, D., Liu, B., T ang, S., Lu, Y., Chen, L., Bai, L., Chu, Q., Y u, N., Ouy ang, W.: Motiongpt: Finetuned llms are general-purpose motion generators. In: The Association for the Adv ancement of Artificial Intelligence (AAAI). vol. 38, pp. 7368–7376 (2024) 102. Zheng, L., Chiang, W.L., Sheng, Y., Zh uang, S., W u, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging llm-as-a-judge with mt-bench and chatbot arena. Adv ances in Neural Information Pro cessing Systems (NeurIPS) 36 , 46595– 46623 (2023) 103. Zhou, M., Bai, Y., Zhang, W., Y ao, T., Zhao, T., Mei, T.: Resp onsiv e listening head generation: A b enc hmark dataset and baseline. In: European Conference on Computer Vision (ECCV) (2022) 104. Zh u, Y., Zhang, L., Rong, Z., Hu, T., Liang, S., Ge, Z.: Infp: Audio-driv en inter- activ e head generation in dyadic conv ersations. In: Pro ceedings of the IEEE/CVF Conference on Computer Vision and P attern Recognition (CVPR) (2025) 105. Zou, Q., Y uan, S., Du, S., W ang, Y., Liu, C., Xu, Y., Chen, J., Ji, X.: Parco: P art-co ordinating text-to-motion syn thesis. In: Leonardis, A., Ricci, E., Roth, S., Russak ovsky , O., Sattler, T., V arol, G. (eds.) Europ ean Conference on Computer Vision (ECCV). v ol. 15114, pp. 126–143 (2024)