Gesticulator: A framework for semantically-aware speech-driven gesture generation
Taras Kucherenko (KTH, Stockholm, Sweden, tarask@kth.se), Patrik Jonell (KTH, pjjonell@kth.se), Sanne van Waveren (KTH, sannevw@kth.se), Gustav Eje Henter (KTH, ghe@kth.se), Simon Alexanderson (KTH, simonal@kth.se), Iolanda Leite (KTH, iolanda@kth.se), Hedvig Kjellström (KTH, hedvig@kth.se)

ABSTRACT
During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying "high"): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning-based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page svito-zar.github.io/gesticulator.

KEYWORDS
Gesture generation; virtual agents; socially intelligent systems; co-speech gestures; multi-modal interaction; deep learning

ACM Reference Format:
Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, and Hedvig Kjellström. 2020. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the 2020 International Conference on Multimodal Interaction (ICMI '20), October 25–29, 2020, Virtual event, Netherlands. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3382507.3418815

1 INTRODUCTION
When speaking, people often spontaneously produce hand gestures, also referred to as co-speech gestures. These co-speech gestures can accompany the content of the speech – what is being said – on all levels, from partial word meanings to situation descriptions [25].

Figure 1: Overview of the proposed autoregressive model.

Gesture generation is hence an important part of animation, as well as of human-agent interaction research and applications.
Virtual agents have been developed for a diverse set of applications, such as serious gaming [32], interpersonal skills training [35, 45] or therapy systems [41]. Interactions with these virtual agents have been shown to be more engaging when the agent's verbal behavior is accompanied by appropriate nonverbal behavior [43]. Moreover, it has been shown that manipulating gesture properties can influence user perception of an agent's emotions [7].

Traditionally, gesture generation for virtual agents has been done by various rule-based systems [6, 21, 44]. Those approaches are constrained by the discrete set of gestures they can produce. Alongside recent advances in deep learning, data-driven approaches have increasingly gained interest for gesture generation [1, 27, 48]. While early work has considered gesture generation as a classification task which aims to deduce a specified gesture class [9, 37], more recent work has considered it as a regression task which aims to produce continuous motion [2, 48]. We focus on the latter task: continuous gesture generation. To date, prior work on continuous gesture generation has used a single input modality: either acoustic or semantic. In contrast, our work makes use of both these modalities to allow for semantic-aware speech-driven continuous gesture generation. The contributions of this work are the following:
(1) the first data-driven model that maps speech acoustic and semantic features into continuous 3D gestures;
(2) a comparison contrasting the effects of different architectures and important modelling choices;
(3) objective and subjective evaluations of the effect of the two speech modalities – audio and semantics – on the resulting gestures.

We additionally extend a publicly available corpus of 3D co-speech gestures, the Trinity College dataset [13], with manual text transcriptions. Video samples from our evaluations are provided at vimeo.com/showcase/6737868.

2 BACKGROUND AND RELATED WORK

2.1 Background
While there are several theories on how gestures are produced by humans [5, 10, 33], there is a consensus that speech and gestures correlate strongly [18, 23, 31, 39]. In this section, we review some concepts relevant to our work, namely gesture classification, the temporal alignment between gestures and speech, and the gesture-generation problem formulation.

2.1.1 Co-Speech Gesture Types. Our work is informed by the gesture classification by McNeill [33], who distinguished the following gesture types: (1) Iconic gestures represent some aspect of the scene; (2) Metaphoric gestures represent an abstract concept; (3) Deictic gestures point to an object or orientation; (4) Beat gestures are used for emphasis and usually correlate with the speech prosody (e.g., intonation and loudness). The first three gesture types, also called representational gestures, depend on the content of the speech – its semantics – while the last type instead depends on the audio signal – the acoustics. Hence, systems that ignore either aspect of speech can only learn to model a subset of human co-speech gesticulation.

2.1.2 Gesture-Speech Alignment. Gesture-speech alignment is an active research field covering several languages, including French [12], German [4], and English [18, 31, 39]. We focus on prior work on gesture-speech alignment for the English language. In English, gestures typically lead the corresponding speech by, on average, 0.22 s (std 0.13 s) [31]; specifically, Pouw et al.
[39] aligned different gesture types with the peak pitch of the speech audio and found that the onset of beat gestures usually precedes the corresponding speech by 0.35 s (std 0.3), the onset of iconic gestures precedes speech by 0.45 s (std 0.4), and the onset of pointing gestures precedes speech by 0.38 s (std 0.4). Informed by these works, we take the widest range among the studies, plus some margin, for the time-span of the speech used to predict the corresponding gesture, and consider 1 s of future speech and 0.5 s of past speech as input to our model detailed in Sec. 4.

2.1.3 The Gesture-Generation Problem. We frame the problem of speech-driven gesture generation as follows: given a sequence of speech features s = [s_t]_{t=1:T}, the task is to generate a corresponding pose sequence ĝ = [ĝ_t]_{t=1:T} of gestures that an agent might perform while uttering this speech. Here, t = 1:T denotes a sequence of vectors for t in 1 to T. Each speech segment s_t is represented by several different features, such as acoustic features (e.g., spectrograms), semantic features (e.g., word embeddings), or a combination of the two. The ground-truth pose g_t and the predicted pose ĝ_t at the same time instance t can be represented in 3D space as a sequence of joint rotations: g_t = [α_{i,t}, β_{i,t}, γ_{i,t}]_{i=1:n}, with n being the number of keypoints of the body and α, β and γ representing rotations about the three axes.

2.2 Related Work
As this work contributes toward data-driven gesture generation, we confine our review to these methods.

2.2.1 Audio-Driven Gesture Generation. Most prior work on data-driven gesture generation has used the audio signal as the only speech-input modality in the model [14, 15, 19, 28, 42]. For example, Sadoughi and Busso [42] trained a probabilistic graphical model to generate a discrete set of gestures based on the speech audio signal, using discourse functions as constraints. Hasegawa et al. [19] developed a more general model capable of generating arbitrary 3D motion using a deep recurrent neural network, applying smoothing as a postprocessing step. Kucherenko et al. [28] extended this work by applying representation learning to the human pose and reducing the need for smoothing. Recently, Ginosar et al. [15] applied a convolutional neural network with adversarial training to generate 2D poses from spectrogram features. However, driving either virtual avatars or humanoid robots requires 3D joint angles. Ferstl et al. [14] followed the approach of adversarial training and applied it to a recurrent neural network together with a gesture phase classifier. Our model differs from these systems in that it leverages both the audio signal and the text transcription for gesture generation.

2.2.2 Text-Transcription-Driven Gesture Generation. Several recent works mapped from text transcripts to co-speech gestures. Ishi et al. [22] generated gestures from text input through a series of probabilistic functions: words were mapped to word concepts using WordNet [34], which then were mapped to a gesture function (e.g., iconic or beat), which in turn were mapped to clusters of 3D hand gestures. Yoon et al. [48] learned a mapping from the utterance text to gestures using a recurrent neural network. The produced gestures were aligned with audio in a post-processing step.
Although these works capture important information from text transcriptions, they may fail to reflect the strong link between gestures and speech acoustics such as intonation, prosody, and loudness [40].

2.2.3 Multimodal Gesture-Generation Models. Only a handful of works have used multiple modalities of the speech to predict matching gestures. The model in Neff et al. [37] predicted gestures based on text, theme, rheme, and utterance focus. They also incorporated text-to-concept mapping. Concepts were then mapped to a set of 28 discrete gestures in a speaker-dependent manner. Chiu et al. [9] used both audio signals and text transcripts as input, to predict a total of 12 gesture classes using deep learning. Our approach differs from these works, as we aim to generate a wider range of gestures: rather than predicting a discrete gesture class, our model produces arbitrary gestures as a sequence of 3D poses.

2.2.4 Regarding Motion Continuity. Separate from the input modalities of the system is the aspect of visual motion quality. Continuous gesture generation can avoid the concatenation-point discontinuities exhibited by playback-based approaches such as motion graphs [3, 26]. That said, comparatively few approaches to continuous gesture generation explicitly try to enforce continuity in the generated pose sequence. Instead, they rely on postprocessing to increase smoothness, as in [19]. Yoon et al. [48] include a velocity penalty in training that discourages jerky motion. The recurrent connections used in several models [13, 19, 48] can also act as a pose memory that may help the model to produce smooth output motion. Autoregressive motion models have recently demonstrated promising results in probabilistic audio-driven gesture generation [2]. In this paper, we similarly investigate autoregressive connections for improving motion quality, which explicitly provide the most recent poses as input to the model when generating the next pose.

3 TRAINING AND TEST DATA
We develop our gesture generation model using machine learning: we learn a gesture estimator ĝ = F(s) based on a dataset of human gesticulation, where we have both speech information s (acoustic and semantic) and gesture data g. For this work, we specifically used the Trinity Gesture Dataset [13], comprising 244 minutes of audio and motion-capture recordings of a male actor speaking freely on a variety of topics. We removed lower-body data, retaining 15 upper-body joints out of the original 69. Fingers were not modelled due to poor data quality.

To obtain semantic information for the speech, we first transcribed the audio recordings using Google Cloud automatic speech recognition (ASR), followed by thorough manual review to correct recognition errors and add punctuation for both the training and test parts of the dataset. The same data was used by the GENEA 2020 gesture generation challenge (genea-workshop.github.io/2020/#gesture-generation-challenge) and has been made publicly available in the original dataset repository (trinityspeechgesture.scss.tcd.ie).

3.1 Test-Segment Selection
Two 10-minute recordings from the dataset were held out from training. We selected 50 segments of 10 s for testing: 30 random segments and 20 semantic segments, in which speech and recorded gestures were semantically linked. Three human annotators marked time instants where the recorded gesture was semantically linked with the speech content. Instances where all three annotators agreed (within a 5 s tolerance) were used as semantic segments in our experiments.
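The paper does not spell out the exact matching procedure, so the following is only a minimal sketch of this kind of agreement check: an instant counts as agreed upon if every annotator marked a time within the tolerance. The annotation times and the pairwise matching rule are illustrative assumptions.

```python
# Minimal sketch of an annotator-agreement check for "semantic" instants.
# Annotator timestamps (in seconds) below are made up for illustration.
TOLERANCE_S = 5.0

def agreed_instants(annotations, tolerance=TOLERANCE_S):
    """annotations: list of per-annotator lists of marked times (seconds)."""
    reference, *others = annotations
    agreed = []
    for t in reference:
        # Agreed if every other annotator has a mark within the tolerance.
        if all(any(abs(t - u) <= tolerance for u in marks) for marks in others):
            agreed.append(t)
    return agreed

print(agreed_instants([[12.0, 48.5, 301.0], [14.1, 299.0], [10.9, 47.0, 300.2]]))
# -> [12.0, 301.0]
```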
3.2 Audio-Text Alignment
Text transcriptions and audio typically have different sequence lengths. To overcome this, we encode words into frame-level features as illustrated in Figure 2. First, the sentence, excluding filler words, is encoded by BERT [11], which is the state-of-the-art model in natural language processing (NLP). We encode filler words and silence, which do not contain semantic information, as special, fixed vectors V_f and V_s, respectively. Filler words typically indicate a thinking process and can occur with a variety of gestures. Therefore, we set the text feature vector V_f during filler words equal to the average of the feature vectors for the most common filler words in the data. Silence typically has no gesticulation [17], so the silence feature vector V_s was made distinct from all other encodings by setting all elements equal to −15. Finally, we use timings from the ASR system to nonuniformly upsample the text features, such that both text and audio feature sequences have the same length and timings. This is a standard text-speech alignment method in the closely-related field of speech synthesis [47].

Figure 2: Encoding text as frame-level features. First, the sentence (omitting filler words) is encoded by BERT [11]. We thereafter repeat each vector according to the duration of the corresponding word. Filler words and silence are encoded as fixed vectors, here denoted V_f and V_s.

Table 1: Text and duration features for each frame.
- BERT encoding of the current word
- Time elapsed from the beginning of the word (in seconds)
- Time left until the end of the word (in seconds)
- Duration of this word (in seconds)
- Relative progress through the word (in %)
- Speaking rate of this word (in syllables/second)

4 SPEECH-DRIVEN GESTURE GENERATION
This section describes our proposed method for generating upper-body motion from speech acoustics and semantics.

4.1 Feature Types
We base our features on the state of the art in speech audio and text processing. Throughout our experiments, we use frame-synchronized features at 20 fps.

Like previous research in gesture generation [13, 15], we represent speech audio by log-power mel-spectrogram features. For this, we extracted 64-dimensional acoustic feature vectors using a window length of 0.1 s and a hop length of 0.05 s (giving 20 fps). For semantic features, we use BERT [11] pretrained on English Wikipedia: each sentence of the transcription is encoded by BERT, resulting in 768 features per word, aligned with the audio as described in Sec. 3.2. We supplement these by five frame-wise scalar features, listed in Table 1.

To extract motion features, the motion-capture data was downsampled to 20 fps and the joint angles were converted to an exponential map representation [16] relative to a T-pose; this is common in computer animation. We verified that the resulting features did not contain any discontinuities. Thereafter, we reduced the dimensionality by applying PCA and keeping 92% of the variance of the training data, similar to [48]. This resulted in 12 components.
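As an illustration of the word-to-frame alignment from Sec. 3.2 and Figure 2, the sketch below repeats each word's BERT vector for every 20 fps frame the word spans and fills the remaining frames with the fixed silence vector (all elements set to −15, as in the paper). The word timings and helper names are hypothetical, and the sketch omits the filler-word vector V_f and the five scalar features of Table 1.

```python
import numpy as np

FPS = 20        # frame rate of the aligned features
BERT_DIM = 768  # BERT embedding size per word
V_SILENCE = np.full(BERT_DIM, -15.0)  # fixed silence vector, as in the paper

def words_to_frames(words, total_duration_s):
    """words: list of (start_s, end_s, bert_vector) tuples from ASR timings."""
    n_frames = int(round(total_duration_s * FPS))
    frames = np.tile(V_SILENCE, (n_frames, 1))   # default every frame to silence
    for start_s, end_s, vec in words:
        first = int(round(start_s * FPS))
        last = int(round(end_s * FPS))
        frames[first:last] = vec                  # repeat the word vector per frame
    return frames                                 # shape: (n_frames, BERT_DIM)

# Example with two fake "words" (0.2-0.7 s and 1.0-1.3 s) in a 2 s clip:
rng = np.random.default_rng(0)
fake = [(0.2, 0.7, rng.normal(size=BERT_DIM)), (1.0, 1.3, rng.normal(size=BERT_DIM))]
print(words_to_frames(fake, total_duration_s=2.0).shape)  # (40, 768)
```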
4.2 Model Architecture and Training
We believe that a simple model architecture is preferable to a more complex one, everything else being equal. Hence, the intent of this work was to develop a straightforward model that solves the studied task. Figure 3 illustrates our model architecture.

First, the text and audio features of each frame are jointly encoded by a feed-forward neural network to reduce dimensionality. To provide more input context for predicting the current frame, we pass a sliding window spanning 0.5 s (10 frames) of past speech and 1 s (20 frames) of future speech features over the encoded feature vectors. These time spans are grounded in research on gesture-speech alignment, as reviewed in Sec. 2.1.2. The encodings inside the context window are concatenated into a long vector and passed through several fully-connected layers. The model is also autoregressive: we feed preceding model predictions back to the model, as can be seen in the figure, to ensure motion continuity. To condition on the information from the previous poses, we use FiLM conditioning [38], which generalizes regular concatenation. FiLM applies element-wise affine transforms FiLM(x, α, β) = x ∗ α + β to network activations x, where the scaling vector α and offset vector β are produced by a neural net taking other information (here previous poses) as input. The final layer of the model and of the conditioning network for FiLM are linear, so as not to restrict the attainable output range.

Figure 3: Our model architecture. Text and audio features are encoded for each frame and the encodings concatenated. Then, several fully-connected layers are applied. The output pose is fed back into the model in an autoregressive fashion.
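To make the FiLM conditioning concrete, here is a minimal PyTorch sketch in which the previous poses are encoded into a conditioning vector that predicts a per-channel scale and offset for a hidden activation, FiLM(x, α, β) = x ∗ α + β. The layer sizes and class name are illustrative assumptions, not the exact Gesticulator implementation.

```python
import torch
import torch.nn as nn

class FiLMFromPoses(nn.Module):
    """FiLM conditioning: hidden activations are scaled and shifted by
    vectors predicted from the previous poses, FiLM(x, a, b) = x * a + b."""
    def __init__(self, pose_dim: int, n_prev_poses: int, hidden_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim * n_prev_poses, 512),
            nn.Tanh(),
        )
        # Final layer is linear (no activation) so scale/offset are unrestricted.
        self.to_scale_offset = nn.Linear(512, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, prev_poses: torch.Tensor) -> torch.Tensor:
        # x: (batch, hidden_dim); prev_poses: (batch, n_prev_poses, pose_dim)
        cond = self.encoder(prev_poses.flatten(start_dim=1))
        scale, offset = self.to_scale_offset(cond).chunk(2, dim=-1)
        return x * scale + offset

# Usage with hypothetical sizes: 3 previous 45-D poses conditioning a 256-D layer.
film = FiLMFromPoses(pose_dim=45, n_prev_poses=3, hidden_dim=256)
out = film(torch.randn(8, 256), torch.randn(8, 3, 45))
print(out.shape)  # torch.Size([8, 256])
```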
4.3 Training Procedure
We train our model on sequences of aligned speech audio, text, and gestures from the dataset. Each training sequence contains 70 consecutive frames from a larger recording. The first 10 and the last 20 frames establish context for the sliding window, while the 40 central frames are used for training. The model is optimized end-to-end for 100 epochs using stochastic gradient descent with the Adam optimizer [24] to minimize the loss function

loss(g, ĝ) = MSE(g, ĝ) + λ MSE(Δg, Δĝ),

where g and Δg are the ground-truth position and velocity, ĝ and Δĝ are the same quantities for the model prediction, and MSE stands for mean squared error. The weight λ was set empirically to 0.6. Our velocity penalty can be seen as an improvement on the penalty used by Yoon et al. [48]: instead of penalizing the absolute value of the velocity, we enforce the velocity to be close to that of the ground truth.

During development, we observed that information from previous poses (the autoregression) tended to overpower the information from the speech: our initial model moved independently of speech input and quickly converged to a static pose. This is a common failure mode in generative sequence models, cf. [8, 20]. To counteract this, we pretrain our model without autoregression for the first seven epochs (a number chosen empirically), before letting the model receive autoregressive input. This pretraining helps the network learn to extract useful features from the speech input, an ability which is not lost during further training. Additionally, while full training begins without any teacher forcing (meaning that the model receives its own previous predictions as autoregressive input instead of the ground-truth poses), this is annealed over time: after one epoch, the model receives the ground-truth poses instead of its own predictions (for two consecutive frames) every 16 frames, which increased to every eight frames after another epoch, to every four frames after the next epoch, and then to every single frame after that. Hence, after five epochs of training with autoregression, our model has full teacher forcing: it always receives the ground-truth poses for autoregression. This procedure greatly helps with learning a model that properly integrates non-autoregressive input.

4.4 Hyper-Parameter Settings
For the experiments in this paper, we used the hyper-parameter search tool Tune [30]. We performed random search over 600 configurations with the velocity loss as the only criterion, obtaining the following hyper-parameters: speech-encoding dimensionality 124 at each of 30 frames, producing 3720 elements after concatenation; the three subsequent layers had 612, 256, and 12 or 45 nodes (the output dimensionality with or without PCA); three previous poses were encoded into a 512-dimensional conditioning vector. The activation function was tanh, the batch size was 64 and the learning rate 10^-4. For regularization, we applied dropout with probability 0.2 to each layer, except for the pose encoding, which had dropout 0.8 to prevent the model from attending too much to past poses.

5 EVALUATION MEASURES
In this section we describe the objective and subjective measures we used in our experiments (Secs. 6 and 7).

5.1 Objective Measures
There is no consensus in the field about which objective measures should be used to evaluate the quality of generated gestures. As a step towards common evaluation measures for the gesture-generation field, we primarily use metrics proposed by previous researchers. Specifically, we evaluated the average values of root-mean-square error (RMSE), acceleration and jerk (the rate of change of acceleration), and acceleration histograms of the produced motion, in line with Kucherenko et al. [28]. To obtain these statistics, the gestures were converted from joint angles to 3D joint positions. The acceleration and jerk were averaged over all frames for all 14 3D joints (except for the hips, which were fixed). To investigate the motion statistics in more detail, we also computed velocity histograms of the generated motion and compared those against histograms derived from the ground-truth test data. We calculated the relative frequency of different velocity values over time-frames in all 50 test sequences, split into bins of width 1 cm/s.
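For illustration, the sketch below computes acceleration, jerk, and a 1 cm/s-binned velocity histogram by finite differences on 3D joint positions sampled at 20 fps. It assumes positions are already in centimetres; the precise averaging and joint selection used for Table 3 may differ.

```python
import numpy as np

FPS = 20  # frame rate of the motion

def motion_statistics(positions_cm):
    """positions_cm: (n_frames, n_joints, 3) joint positions in centimetres."""
    velocity = np.diff(positions_cm, axis=0) * FPS       # cm/s
    acceleration = np.diff(velocity, axis=0) * FPS       # cm/s^2
    jerk = np.diff(acceleration, axis=0) * FPS           # cm/s^3
    speed = np.linalg.norm(velocity, axis=-1)            # per-joint speed magnitude
    stats = {
        "mean_acceleration": np.linalg.norm(acceleration, axis=-1).mean(),
        "mean_jerk": np.linalg.norm(jerk, axis=-1).mean(),
    }
    # Velocity histogram with 1 cm/s bins, expressed as relative frequencies.
    bins = np.arange(0.0, speed.max() + 1.0, 1.0)
    hist, _ = np.histogram(speed, bins=bins)
    stats["velocity_histogram"] = hist / hist.sum()
    return stats

# Example on random motion (200 frames, 15 joints):
print(motion_statistics(np.random.rand(200, 15, 3) * 10)["mean_jerk"])
```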
5.2 Subjective Measures
To investigate human perception of the gestures, we conducted several user studies that all followed the same protocol and procedure.

5.2.1 Experiment Design. We assessed the perceived human-likeness of the virtual character's motion and how the motion related to the character's speech, using measures adapted from recent co-speech gesture generation papers [15, 48]. Specifically, we asked the questions "In which video...":
(Q1) "...are the character's movements most human-like?"
(Q2) "...do the character's movements most reflect what the character says?"
(Q3) "...do the character's movements most help to understand what the character says?"
(Q4) "...are the character's voice and movement more in sync?"
We used attention checks to filter out inattentive participants. For four of the six attention checks, we picked a random video in the pair and heavily distorted either the audio (in the 2nd and 17th video pairs) or the video quality (in the 7th and 21st video pairs). Raters were asked to report any video pairs where they experienced audio or video issues, and were automatically excluded from the study upon failing any two of these four attention checks. In addition, the 13th and 24th video pairs presented the same video (from the random pool) twice; here an attentive rater should answer "no difference".

5.2.2 Experimental Procedure. Participants were recruited on Amazon Mechanical Turk (AMT) and assigned to one specific comparison of two systems; they could complete the study only once, and were thus only exposed to one system pair. Each participant was asked to evaluate 26 same-speech video pairs on the four subjective measures: 10 pairs randomly sampled from a pool of 28 random segments, 10 from a pool of 20 semantic segments, and 6 attention checks (see above). These video pairs were then randomly shuffled. Every participant first completed a training phase to familiarize themselves with the task and interface. This training consisted of five items not included in the analysis, with video segments not present in the study, showing gestures of different quality. Then, during the experiment, the videos in each pair were presented side by side in random order and could be replayed as many times as desired. For each pair, participants indicated which video they thought best corresponded to a given question (one of Q1 through Q4 above), or that they perceived both videos to be equal in regard to the question.

6 ABLATION STUDY
In this section, we evaluate the importance of various model components by individually ablating them, training seven different system variants including the full model (see Table 2). Comparisons against other gesture-generation approaches are reported in Sec. 7.

6.1 Objective Evaluation
In this section we report objective metrics, as described in Sec. 5.1.

6.1.1 Average Motion Statistics. Table 3 reports acceleration and jerk, as well as RMSE, averaged over 50 test samples for the ground truth and the different ablations of the proposed method. Ground-truth statistics are given as reference values for natural motion.

Table 2: The seven system variants in the ablation study.
Full model — The proposed method
No PCA — No PCA is applied to output poses
No Audio — Only text is used as input
No Text — Only audio is used as input
No FiLM — Concatenation instead of FiLM
No Velocity loss — The velocity loss is removed
No Autoregression — The previous poses are not used

Table 3: Objective evaluation of our systems: mean and standard deviation over 50 samples.
System | Accel. (cm/s^2) | Jerk (cm/s^3) | RMSE (cm)
Full model | 37.6 ± 4.3 | 830 ± 89 | 11.4 ± 11.8
No PCA | 63.8 ± 8.3 | 1332 ± 192 | 13.0 ± 14.7
No Audio | 26.9 ± 3.9 | 480 ± 67 | 11.3 ± 11.7
No Text | 27.0 ± 1.9 | 715 ± 63 | 10.9 ± 11.3
No FiLM | 44.2 ± 6.6 | 931 ± 181 | 11.0 ± 11.5
No Velocity loss | 36.4 ± 4.1 | 779 ± 93 | 11.4 ± 12.3
No Autoregression | 120.3 ± 19.2 | 3890 ± 637 | 11.2 ± 12.0
Ground truth | 144.7 ± 36.6 | 2322 ± 538 | 0

We focus our analysis on the jerk, since it is commonly used to evaluate the smoothness of motion: the lower the jerk, the smoother the motion is [36, 46]. We can observe that the proposed model exhibits lower jerk than the original motion. This is probably because our model is deterministic and hence produces gestures closer to the mean pose.
Not using PCA results in higher acceleration and jerk, which made the model statistics closer to the ground truth. Our intuition for this is that PCA reduced variability in the data, which resulted in over-smoothed motion. Removing either audio or text input reduced the jerk even further. This is probably because these ablations provide a weaker input signal to drive the model, making it gesticulate closer to the mean pose. Both FiLM conditioning and the velocity penalty seem to have little effect on the motion statistics and are likely not central to the model. That autoregression is a key aspect of our system is clear from this evaluation: without autoregression, the model loses continuity and generates motion with excessive jerk. RMSE appears not to be informative. This is expected, since there are many plausible ways to gesticulate, so the minimum-expected-loss output gestures do not have to be close to our ground truth.

6.1.2 Motion Velocity Histograms. The values in Table 3 were averaged over all time-frames and over all joints. To investigate the motion statistics in more detail, we computed velocity histograms of the generated motion and compared those against histograms derived from the ground-truth test data. As previous work has shown that wrist histograms are more informative than histograms averaged over all joints [28], we consider only the left and right wrist joints.

Figure 4: Velocity histograms of the wrist joints for the ablation study (velocity in cm/s versus relative frequency). (a) Comparing different architectures. (b) Comparing different input/output data.

Figure 4a illustrates the velocity histogram of the wrist joints for the different model architectures and loss functions we considered. We observe two things: 1) the distributions are not influenced strongly by either FiLM conditioning or by the velocity loss; and 2) autoregression reduces the amount of fast movements, making the velocity histogram more similar to the ground truth.

Velocity histograms for different input/output data are shown in Figure 4b. Removing PCA increases velocity, making the distribution more similar to the ground truth. In other words, training our model in the PCA space leads to reduced variability, which makes sense. We observe that excluding the text input makes the velocity smaller. This agrees with Table 3 and probably means that without semantic information the model produces mainly beat gestures, whose characteristics differ from other gesture types. While these numerical evaluations are valuable, they say very little about people's perceptions of the generated gestures.

6.2 First Perceptual Study
To investigate human perception of the gestures, we conducted several user studies. This section reports on Perceptual Study 1, in which we evaluated participants' perception of a virtual character's gestures as produced by the seven variants of our model described in Table 2. The experimental procedure and evaluation measures (see Sec. 5.2) were identical across all perceptual studies, including this one. Video samples from all systems in this study can be found at vimeo.com/showcase/6737868.
In the comparison of system ablations (Perceptual Study 1), 123 participants (mean age 41.8 ± 12.3; 52 male, 70 female, 1 other) remained after the exclusion of 477 participants who failed the attention checks, experienced technical issues, or stopped the study prematurely. The majority were from the USA (N = 120). Each sub-study had between 19 and 21 participants. We conducted a binomial test excluding ties with Holm-Bonferroni correction of p-values to analyze the responses. (24 responses that participants flagged for technical issues were excluded.) Our analysis was done in a double-blind fashion, such that the conditions were obfuscated during analysis and only revealed to the authors after the statistical tests had been performed. The results are shown in Figure 5.

We can see from the evaluation of the "No Text" system that removing the semantic input drastically decreases both the perceived human-likeness of the produced gestures and how much they are linked to the speech: participants preferred the full model over the one without text across all four questions asked, with p < .0001. This confirms that semantics are important for appropriate automatic gesture generation.

The "No Audio" model is unlikely to generate beats, and might not follow an appropriate speech rhythm. Results in Figure 5 confirm this: participants preferred the full model over the one without audio across all four questions asked (p < .0001).

Removing autoregression from the model only affected perceived naturalness, where it performed significantly worse (p < .0001), as shown in Figure 5. This aligns with the findings from the objective evaluation: without autoregression the model produces jerky, unnatural-looking gestures, but the jerkiness does not influence whether gestures are semantically linked to the speech content.

There was no statistical difference between the full model and the model without FiLM conditioning in terms of Q1 and Q2, but the model without FiLM was preferred with p < .02 for Q3 and p < .04 for Q4. This suggests that FiLM conditioning was not helpful for the model and that regular concatenation worked better.

Removing the velocity penalty did not have a statistically significant effect on user responses, except for reducing user preference on Q4 with p < .04, suggesting that this component is not critical for the model.

The model without PCA gave unexpected results. In videos, we see that removing PCA improved gesture variability. While for human-likeness there was no statistical difference, "No PCA" was significantly better (p < .0001) on Q2, Q3 and Q4 (see Figure 5). In summary, participants preferred the system without PCA, so it was chosen as our final model for the remaining comparisons (in Sec. 7).

6.3 Relation between objective and subjective evaluations
Objective and subjective evaluations each have their pros and cons. In this subsection we analyze the empirical correlation between the two for the experiments reported here. From the "No FiLM" and "No Velocity Loss" conditions, we see that user ratings did not change much for ablations that only produced minor changes in motion statistics. This is not surprising. For models with low jerk compared to the ground truth, we can see that participants preferred the models where the jerk was closer to that of the ground truth motion ("No PCA").
However, too high jerk was associated with unnatural motion ("No Autoregression"). These results seem to indicate that jerk analysis provides information about the human-likeness of the motion.

Figure 5: Results of Perceptual Study 1: comparing different ablations of our model in pairwise preference tests. Four questions, listed above each bar chart, were asked about each pair of videos (Q1: human-likeness; Q2: movements reflect what the character says; Q3: movements help to understand what the character says; Q4: voice and movement in sync). The bars show the preference towards the full model (higher values mean stronger preference) with 95% confidence intervals.
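The preference analyses in Secs. 6.2 and 7 use binomial tests excluding ties with Holm-Bonferroni correction of the p-values. A minimal sketch of such an analysis is shown below; the vote counts are made up, and the exact test configuration used in the paper may differ.

```python
from scipy.stats import binomtest

def holm_bonferroni(p_values):
    """Holm-Bonferroni step-down correction; returns adjusted p-values."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    adjusted, running_max = [0.0] * len(p_values), 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (len(p_values) - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Hypothetical counts: (votes for full model, votes for ablation), ties dropped.
comparisons = {"No Text": (310, 150), "No FiLM": (200, 230), "No PCA": (160, 290)}
raw = [binomtest(a, a + b, 0.5).pvalue for a, b in comparisons.values()]
for name, p_adj in zip(comparisons, holm_bonferroni(raw)):
    print(f"{name}: adjusted p = {p_adj:.4g}")
```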
7 ADDITIONAL EVALUATIONS AND COMPARISONS
The primary goal of this work is to develop the first model for continuous gesture generation that takes into account both the semantics and the acoustics of the speech. That said, we also benchmark our model against the state of the art in gesture generation. We compare the proposed approach to the model by Ginosar et al. [15], which is based on CNNs (convolutional neural networks) and GANs (generative adversarial networks), and is therefore denoted CNN-GAN. The hyper-parameters for the baseline method were fine-tuned by changing one parameter at a time and manually inspecting the visual quality of the resulting gestures on the validation dataset. The final hyper-parameters for the CNN-GAN [15] model were: batch size = 256, number of neurons in the hidden layer = 256, learning rate = 0.001, training duration = 300 epochs, and λ coefficient for the discriminator loss = 5. This tuned system was compared against the best system ("No PCA") identified in Sec. 6.

Table 4: Objective comparison of our systems with the state of the art: mean and standard deviation over 50 samples.
System | Accel. (cm/s^2) | Jerk (cm/s^3)
Final model (no PCA) | 63.8 ± 8.3 | 1330 ± 192
CNN-GAN [15] | 254.7 ± 31.8 | 5280 ± 631
Ground truth | 144.2 ± 35.9 | 2315 ± 530

7.1 Comparing with the State of the Art
Like the previous experiments, we follow the objective evaluation setup described in Sec. 5.1. Table 4 displays the average acceleration and jerk over 50 test sequences. We observe that the proposed method has acceleration and jerk values roughly half of those exhibited by the ground truth, while the CNN-GAN [15] baseline instead has twice the acceleration and jerk of the ground truth.

To investigate which model is preferred by human observers, we conducted another user study. We evaluated participants' preference between the gestures produced by the proposed model (No PCA) and CNN-GAN [15] (Perceptual Study 2). Video samples from this study can be found at vimeo.com/showcase/7127462.

Figure 6: Results of Perceptual Study 2: comparing with the state of the art in pairwise preference tests (Q1: human-likeness; Q2: movements reflect what the character says). The bars show the preference towards the proposed model with 95% confidence intervals.

The study setup was the same as described in Sec. 5.2, except for three minor changes:
(1) We paid online participants more ($5 instead of $3), since we realized that the effort required from participants in the previous study was higher than we had anticipated.
(2) We clarified the instructions for reporting broken audio/video.
(3) We only asked questions Q1 (human-likeness of movements) and Q2 (movements reflect what the character says).

In this study, 27 participants (mean age 41.7 ± 11.3; 14 male, 13 female) remained after the exclusion of 43 participants based on the same criteria as before. The majority were from the USA (N = 25). Like for Perceptual Study 1, we analyzed the responses using binomial tests excluding ties followed by Holm-Bonferroni correction. The results are shown in Figure 6. Our model was preferred over the CNN-GAN baseline for Q1 with p < .0001 and for Q2 with p < .02, indicating that the gestures generated by our model were perceived as more human-like and better reflected what the character said.

7.2 Comparison with the Ground Truth
We also compared our model to the ground-truth gestures using the same procedure as before. In this study (Perceptual Study 3), 20 participants (mean age 39.1 ± 8.4; 9 male, 11 female) remained after excluding 31 participants through the same criteria as in Perceptual Study 1. N = 18 were from the USA. There was a very substantial preference for the ground-truth motion (between 84 and 93%) across all questions. All differences were statistically significant according to Holm-Bonferroni-corrected binomial tests ignoring ties.

7.3 What Do "Semantic" Gestures Even Mean?
Finally, we evaluated whether using text input helps our model to produce more semantically-linked gestures, such as iconic, metaphoric and deictic gestures. To this end, we compared our best model (No PCA) with and without text information in the input: the first variant of this model received both audio and text, while the second one received only audio as input. We asked three annotators to select which of the 50 test segments, for both conditions, were semantically linked with the speech content. The annotators were all male and had an average age of 25.3 years. They were not aware of our research questions. The results of this annotation were interesting and surprising: while all of the annotators marked more gestures as semantically linked with the speech content for the model that used text than for the model without text (2 vs 0, 21 vs 9, and 9 vs 4), they had very low agreement: Cronbach's alpha was below 0.5. The low agreement on which segments were semantic indicates that it is very subjective which gestures should be classified as semantically linked, which makes this and any similar evaluation challenging.
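Cronbach's alpha is used here as a measure of inter-annotator consistency, treating the annotators as "items" and the test segments as cases. A small sketch of the computation on hypothetical binary judgements (1 = segment judged semantically linked) follows; the judgement matrix is made up for illustration.

```python
import numpy as np

def cronbach_alpha(ratings):
    """ratings: (n_segments, n_annotators) array; higher alpha = more agreement."""
    ratings = np.asarray(ratings, dtype=float)
    n_segments, n_raters = ratings.shape
    rater_variances = ratings.var(axis=0, ddof=1)      # variance per annotator
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (n_raters / (n_raters - 1)) * (1 - rater_variances.sum() / total_variance)

# Hypothetical binary judgements for 6 segments by 3 annotators:
judgements = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 0, 1]]
print(round(cronbach_alpha(judgements), 3))
```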
8 CONCLUSIONS AND FUTURE WORK
We have presented a new machine learning-based model for co-speech gesture generation. To the best of our knowledge, this is the first data-driven model capable of generating continuous gestures linked to both the audio and the semantics of the speech. We evaluated different architecture choices and compared our model to an audio-based state-of-the-art baseline using both objective and subjective measures. All the study materials are publicly available at figshare.com/projects/Gesticulator/87128.

Our findings indicate that:
(1) Using both modalities of the speech – audio and text – can improve continuous gesture-generation models.
(2) Autoregressive connections, while not commonplace in contemporary gesture-generation models, can enforce continuity of the gestures, without vanishing-gradient issues and with few parameters to learn. We also described a training scheme that prevents autoregressive information from overpowering other inputs.
(3) PCA applied to the motion space (as used in [48]) can restrict the model by removing perceptually-important variation from the data, which may reduce the range of gestures.
(4) The gestures from our model were preferred over the CNN-GAN [15] baseline by the study participants.

The main limitation of our work is that it requires an annotated dataset (with text transcriptions), which is labor-intensive. To overcome this, one could consider training the model directly on transcriptions from automatic speech recognition. Additionally, the vocabulary used in this dataset is sub-optimal. As we can see in the frequency table provided at tinyurl.com/y22h6rtt, out of the 50k total words there are 4230 unique words, and the first 8 words account for 30% of all words spoken. This makes it challenging to learn semantic relations between gestures and text. Future work also involves making the model stochastic (as in [2]), using larger datasets (such as [29]), and further improving the semantic coherence of the gestures, for instance by treating different gesture types separately.

ACKNOWLEDGEMENT
The authors would like to thank Andre Pereira, Federico Baldassarre and Marcus Klasson for helpful discussions. This work was partially supported by the Swedish Foundation for Strategic Research Grant No. RIT15-0107 (EACare), by the Swedish Research Council projects 2017-05189 (CrowdVR) and 2018-05409 (StyleBot), and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

REFERENCES
[1] Chaitanya Ahuja, Shugao Ma, Louis-Philippe Morency, and Yaser Sheikh. 2019. To react or not to react: End-to-end visual pose forecasting for personalized avatar during dyadic conversations. In Proceedings of the International Conference on Multimodal Interaction. 74–84.
[2] Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. 2020. Style-controllable speech-driven gesture synthesis using normalising flows. Computer Graphics Forum 39, 2 (2020), 487–496.
[3] Okan Arikan and David A. Forsyth. 2002. Interactive motion generation from examples. ACM Transactions on Graphics 21, 3 (2002), 483–490.
[4] Kirsten Bergmann, Volkan Aksu, and Stefan Kopp. 2011. The relation of speech and gestures: Temporal synchrony follows semantic synchrony. In Proceedings of the 2nd Workshop on Gesture and Speech in Interaction (GeSpIn 2011).
[5] Timothy W. Bickmore. 2004. Unspoken rules of spoken interaction. Commun. ACM 47, 4 (2004), 38–44.
[6] Justine Cassell, Hannes Högni Vilhjálmsson, and Timothy Bickmore. 2001.
BEAT: The behavior expression animation toolkit. In Proceedings of the Conference on Computer Graphics and Interactive Techniques. ACM.
[7] Gabriel Castillo and Michael Neff. 2019. What do we express without knowing? Emotion in gesture. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 702–710.
[8] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2017. Variational lossy autoencoder. In Proceedings of the International Conference on Learning Representations.
[9] Chung-Cheng Chiu, Louis-Philippe Morency, and Stacy Marsella. 2015. Predicting co-verbal gestures: A deep and temporal modeling approach. In Proceedings of the International Conference on Intelligent Virtual Agents. Springer.
[10] Mingyuan Chu and Sotaro Kita. 2016. Co-thought and co-speech gestures are generated by the same action generation process. Journal of Experimental Psychology: Learning, Memory, and Cognition 42, 2 (2016), 257.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (2018).
[12] Gaëlle Ferré. 2010. Timing relationships between speech and co-verbal gestures in spontaneous French. In Language Resources and Evaluation, Workshop on Multimodal Corpora, Vol. 6. 86–91.
[13] Ylva Ferstl and Rachel McDonnell. 2018. Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the International Conference on Intelligent Virtual Agents. ACM.
[14] Ylva Ferstl, Michael Neff, and Rachel McDonnell. 2020. Adversarial gesture generation with realistic gesture phasing. Computers & Graphics (2020).
[15] Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. 2019. Learning individual styles of conversational gesture. In Proceedings of the International Conference on Computer Vision and Pattern Recognition. IEEE.
[16] F. Sebastian Grassia. 1998. Practical parameterization of rotations using the exponential map. Journal of Graphics Tools 3, 3 (1998), 29–48.
[17] Maria Graziano and Marianne Gullberg. 2018. When speech stops, gesture stops: Evidence from developmental and crosslinguistic comparisons. Frontiers in Psychology (2018).
[18] Maria Graziano, Elena Nicoladis, and Paula Marentette. 2019. How referential gestures align with speech: Evidence from monolingual and bilingual speakers. Language Learning 70, 1 (2019), 266–304.
[19] Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, and Kazuhiko Sumi. 2018. Evaluation of speech-to-gesture generation using bi-directional LSTM network. In Proceedings of the International Conference on Intelligent Virtual Agents. ACM.
[20] Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. 2020. MoGlow: Probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics 39, 4 (2020), 236:1–236:14. https://doi.org/10.1145/3414685.3417836
[21] Chien-Ming Huang and Bilge Mutlu. 2012. Robot behavior toolkit: Generating effective social behaviors for robots. In Proceedings of the International Conference on Human-Robot Interaction (HRI '12). ACM/IEEE.
[22] Carlos T. Ishi, Daichi Machiyashiki, Ryusuke Mikata, and Hiroshi Ishiguro. 2018.
A speech-driven hand gesture generation method and evaluation in android robots. IEEE Robotics and Automation Letters (2018).
[23] Jana M. Iverson and Esther Thelen. 1999. Hand, mouth and brain: The dynamic emergence of speech and gesture. Journal of Consciousness Studies 6, 11-12 (1999), 19–40.
[24] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
[25] Stefan Kopp, Hannes Rieser, Ipke Wachsmuth, Kirsten Bergmann, and Andy Lücking. 2007. Speech-gesture alignment. In Proceedings of the Conference of the International Society for Gesture Studies.
[26] Lucas Kovar, Michael Gleicher, and Frédéric Pighin. 2002. Motion graphs. ACM Transactions on Graphics 21, 3 (2002), 473–482.
[27] Taras Kucherenko. 2018. Data driven non-verbal behavior generation for humanoid robots. In ACM International Conference on Multimodal Interaction, Doctoral Consortium (Boulder, CO, USA) (ICMI '18). ACM, 520–523.
[28] Taras Kucherenko, Dai Hasegawa, Gustav E. Henter, Naoshi Kaneko, and Hedvig Kjellström. 2019. Analyzing input and output representations for speech-driven gesture generation. In Proceedings of the International Conference on Intelligent Virtual Agents. ACM.
[29] Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, and Yaser Sheikh. 2019. Talking With Hands 16.2M: A large-scale dataset of synchronized body-finger motion and audio for conversational motion analysis and synthesis. In ICCV. 763–772.
[30] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. 2018. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118 (2018).
[31] Daniel P. Loehr. 2012. Temporal, structural, and pragmatic synchrony between intonation and gesture. Laboratory Phonology 3, 1 (2012), 71–89.
[32] Samuel Mascarenhas, Manuel Guimarães, Rui Prada, João Dias, Pedro A. Santos, Kam Star, Ben Hirsh, Ellis Spice, and Rob Kommeren. 2018. A virtual agent toolkit for serious games developers. In Proceedings of the Conference on Computational Intelligence and Games (CIG). IEEE, 1–7.
[33] David McNeill. 1992. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press.
[34] George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM (1995).
[35] Shannon Monahan, Emmanuel Johnson, Gale Lucas, James Finch, and Jonathan Gratch. 2018. Autonomous agent that provides automated feedback improves negotiation skills. In Proceedings of the International Conference on Artificial Intelligence in Education. Springer, 225–229.
[36] Pietro Morasso. 1981. Spatial control of arm movements. Experimental Brain Research 42, 2 (1981), 223–227.
[37] Michael Neff, Michael Kipp, Irene Albrecht, and Hans-Peter Seidel. 2008. Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Transactions on Graphics (2008).
[38] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. 2018. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence.
[39] Wim Pouw and James A. Dixon. 2019. Quantifying gesture-speech synchrony. In Proceedings of the Gesture and Speech in Interaction Workshop.
[40] Wim Pouw, Steven J. Harrison, and James A. Dixon. 2019. Gesture–speech physics: The biomechanical basis for the emergence of gesture–speech synchrony.
Journal of Experimental Psychology: General (2019).
[41] Lazlo Ring, Timothy Bickmore, and Paola Pedrelli. 2016. Real-time tailoring of depression counseling by conversational agent. Iproceedings 2, 1 (2016), e27.
[42] Najmeh Sadoughi and Carlos Busso. 2019. Speech-driven animation with meaningful behaviors. Speech Communication (2019).
[43] Maha Salem, Katharina Rohlfing, Stefan Kopp, and Frank Joublin. 2011. A friendly gesture: Investigating the effect of multimodal robot behavior in human-robot interaction. In Proceedings of the International Symposium on Robot and Human Interactive Communication. IEEE.
[44] Giampiero Salvi, Jonas Beskow, Samer Al Moubayed, and Björn Granström. 2009. SynFace: Speech-driven facial animation for virtual speech-reading support. Journal on Audio, Speech, and Music Processing (2009).
[45] William R. Swartout, Jonathan Gratch, Randall W. Hill Jr., Eduard Hovy, Stacy Marsella, Jeff Rickel, and David Traum. 2006. Toward virtual humans. AI Magazine 27, 2 (2006), 96–96.
[46] Yoji Uno, Mitsuo Kawato, and Rika Suzuki. 1989. Formation and control of optimal trajectory in human multijoint arm movement. Biological Cybernetics 61, 2 (1989), 89–101.
[47] Zhizheng Wu, Oliver Watts, and Simon King. 2016. Merlin: An open source neural network speech synthesis system. In Proceedings of the ISCA Speech Synthesis Workshop.
[48] Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2019. Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots. In Proceedings of the International Conference on Robotics and Automation. IEEE.