Facial Expression Generation Aligned with Human Preference for Natural Dyadic Interaction

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Achieving natural dyadic interaction requires generating facial expressions that are emotionally appropriate and socially aligned with human preference. Human feedback offers a compelling mechanism to guide such alignment, yet how to effectively incorporate this feedback into facial expression generation remains underexplored. In this paper, we propose a facial expression generation method aligned with human preference by leveraging human feedback to produce contextually and emotionally appropriate expressions for natural dyadic interaction. A key element of our method is framing the generation of identity-independent facial expressions as an action learning process, allowing human feedback to assess their validity free from visual or identity bias. We establish a closed feedback loop in which listener expressions dynamically respond to the speaker’s evolving conversational cues. Concretely, we train a vision-language-action model via supervised fine-tuning to map the speaker’s multimodal signals into controllable low-dimensional expression representations of a 3D morphable model. We further introduce a human-feedback reinforcement learning strategy that integrates the imitation of high-quality expression responses with critic-guided optimization. Experiments on two benchmarks demonstrate that our method effectively aligns facial expressions with human preference and achieves superior performance.


💡 Research Summary

The paper tackles the problem of generating listener facial expressions that are both emotionally appropriate and socially aligned with human preferences during dyadic interaction. While recent works have leveraged deep generative models to produce facial motions conditioned on speaker multimodal cues, they typically entangle expression with identity, making it difficult to obtain unbiased human feedback that truly reflects expression quality. To address this, the authors reformulate facial expression generation as an action‑learning problem in an identity‑independent space. Using the FLAME 3D morphable model, they fix the identity parameters and treat only the expression coefficients (a_exp) and head‑pose parameters (a_pose) as the action vector A_t. This isolates the learning target to pure facial dynamics, allowing human raters to evaluate expressions without visual or identity bias.
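As a minimal sketch of this formulation (the dimension counts below are typical FLAME settings used here as assumptions, not values confirmed by the paper), the per-frame action vector is assembled by holding identity fixed and concatenating only the expression and pose coefficients:

```python
import numpy as np

# Hypothetical dimensions: FLAME is commonly used with ~100 expression
# coefficients and a handful of axis-angle pose parameters (global head
# rotation + jaw). These exact sizes are an assumption for illustration.
N_EXP = 100   # expression coefficients a_exp
N_POSE = 6    # head rotation (3) + jaw pose (3)

def make_action(a_exp: np.ndarray, a_pose: np.ndarray) -> np.ndarray:
    """Build the identity-independent action vector A_t.

    FLAME shape (identity) parameters are held fixed and deliberately
    excluded, so A_t encodes pure facial dynamics only.
    """
    assert a_exp.shape == (N_EXP,)
    assert a_pose.shape == (N_POSE,)
    return np.concatenate([a_exp, a_pose])

# A neutral-face action at some frame t.
A_t = make_action(np.zeros(N_EXP), np.zeros(N_POSE))
```

Because identity parameters never enter A_t, raters viewing renderings of the same action sequence on any template face are judging identical facial dynamics, which is what enables unbiased feedback collection.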

The core architecture is a Vision‑Language‑Action (VLA) model built on a 7‑billion‑parameter LLaMA‑2 language model. Multimodal inputs consist of a sequence of speaker video frames and the corresponding textual transcripts. Each frame is processed by two pretrained visual encoders: DINO captures fine‑grained facial motion, while SigLIP extracts high‑level affective semantics. The concatenated visual features are projected into image tokens via an MLP, and the text is tokenized with the LLaMA tokenizer. Image and text tokens are concatenated and fed to the LLM, which predicts a sequence of discrete action tokens. Continuous actions are recovered by a quantization de‑tokenizer that maps each dimension into 256 uniform bins, a strategy inspired by RT‑2 that balances granularity and representational capacity.
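The 256-bin discretization can be illustrated as follows; the action value range [-3, 3] is an assumed placeholder, since the paper's actual normalization is not stated in this summary:

```python
import numpy as np

N_BINS = 256  # RT-2-style uniform discretization per action dimension

def tokenize(a, lo=-3.0, hi=3.0):
    """Map each continuous action dimension into one of 256 uniform bins.

    The [lo, hi] range is an assumption; values outside it are clipped.
    """
    a = np.clip(a, lo, hi)
    bins = ((a - lo) / (hi - lo) * N_BINS).astype(int)
    return np.minimum(bins, N_BINS - 1)  # hi itself falls in the last bin

def detokenize(tokens, lo=-3.0, hi=3.0):
    """Recover a continuous value at the center of each bin."""
    return lo + (tokens + 0.5) * (hi - lo) / N_BINS

# Round-trip: the reconstruction error is at most one bin width.
vals = np.array([0.0, -3.0, 3.0])
tok = tokenize(vals)
rec = detokenize(tok)
```

With 256 bins over a 6-unit range, each bin is about 0.023 wide, which bounds the quantization error on every recovered expression or pose coefficient.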

Training proceeds in two stages. First, Supervised Fine‑Tuning (SFT) uses paired data (speaker video, transcript, ground‑truth listener actions) to teach the VLA model to imitate real listener behavior. The loss combines cross‑entropy terms for expression and pose, weighted by λ_exp and λ_pose, and a temporal coherence regularizer that penalizes abrupt changes across consecutive frames. This stage equips the model with basic lip‑synchronization and smooth motion capabilities.
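A minimal NumPy sketch of this SFT objective follows; the λ values and the simple squared-difference smoothness term are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy over (T, 256) logits."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def sft_loss(exp_logits, pose_logits, exp_tgt, pose_tgt, actions,
             lam_exp=1.0, lam_pose=1.0, lam_smooth=0.1):
    """Weighted cross-entropy on expression and pose action tokens, plus
    a temporal-coherence penalty on consecutive continuous actions.

    exp_logits/pose_logits: (T, 256) per-frame action-token logits.
    actions: (T, D) de-tokenized continuous actions for the sequence.
    All lambda values are placeholder hyperparameters.
    """
    ce = (lam_exp * cross_entropy(exp_logits, exp_tgt)
          + lam_pose * cross_entropy(pose_logits, pose_tgt))
    smooth = np.mean((actions[1:] - actions[:-1]) ** 2)  # penalize jumps
    return ce + lam_smooth * smooth

# Toy check: uniform logits and a perfectly static action sequence.
T = 4
loss = sft_loss(np.zeros((T, 256)), np.zeros((T, 256)),
                np.zeros(T, dtype=int), np.zeros(T, dtype=int),
                np.ones((T, 5)))
```

The smoothness term is what discourages frame-to-frame jitter; with a constant action sequence it contributes zero, leaving only the two cross-entropy terms.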

Second, a Human‑Feedback Reinforcement Learning loop introduces preference alignment. The SFT‑trained policy samples N candidate action sequences for a given dialogue segment; each candidate is rendered into a visual listener response using FLAME. Human annotators rank these responses, designating the highest‑scored sample as “preferred” (Pre) and the lowest‑scored as “dispreferred” (Dispre). The authors then apply Direct Preference Optimization (DPO), which directly maximizes the likelihood ratio between preferred and dispreferred samples without training a separate reward model. DPO updates the policy parameters to increase the probability of generating actions that humans favor, effectively aligning the model with human social norms and emotional expectations.
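The DPO objective on one (preferred, dispreferred) pair can be sketched as below, using a frozen copy of the SFT policy as the reference; β is a placeholder hyperparameter, not a value from the paper:

```python
import numpy as np

def dpo_loss(logp_pref, logp_disp, ref_pref, ref_disp, beta=0.1):
    """Direct Preference Optimization on a single preference pair.

    logp_pref / logp_disp: policy log-likelihoods of the preferred and
    dispreferred action sequences; ref_* are the same quantities under
    the frozen SFT reference policy. Minimizing this loss widens the
    policy's likelihood margin for the human-preferred sample, relative
    to the reference, without training a separate reward model.
    """
    margin = beta * ((logp_pref - ref_pref) - (logp_disp - ref_disp))
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin)).
    return np.logaddexp(0.0, -margin)

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2); raising logp_pref lowers the loss.
base = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(1.0, 0.0, 0.0, 0.0)
```

Because only the likelihood ratio against the reference enters the loss, the update directly increases the probability of actions humans ranked highest while constraining drift from the SFT behavior.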

Experiments on two public dyadic interaction benchmarks—L2L‑trevor and RealTalk—demonstrate that the proposed method outperforms state‑of‑the‑art baselines (e.g., FaceFormer, DualTalk, Avatar Forcing) on both quantitative metrics (expression appropriateness, social consistency) and qualitative user studies. Notably, in scenarios where the speaker expresses disgust, the model reliably generates a disgusted listener expression, whereas baseline systems often produce mismatched emotions such as happiness, leading to conversational dissonance.

The paper’s contributions are threefold: (1) introducing the first closed‑loop human‑feedback framework for facial expression generation that explicitly aligns outputs with human preference; (2) formulating expression generation as an action‑learning task in an identity‑independent space, enabling unbiased feedback collection; and (3) integrating a multimodal large language model to jointly process visual and textual cues and produce controllable low‑dimensional action tokens. This approach opens avenues for more natural and socially aware virtual avatars, remote collaboration tools, and embodied conversational agents.

