UGotMe: An Embodied System for Affective Human-Robot Interaction

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Equipping humanoid robots with the capability to understand the emotional states of human interactants and to express emotions appropriately is essential for affective human-robot interaction. However, deploying current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises two embodiment challenges: handling environmental noise and meeting real-time requirements. First, in multiparty conversation scenarios, noise inherent in the robot's visual observations, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the robot's field of view, hinders the models from extracting emotional cues from vision inputs. Second, real-time response, a desired feature for an interactive system, is also challenging to achieve. To tackle both challenges, we introduce UGotMe, an affective human-robot interaction system designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to address the first issue: to filter out distracting objects in the scene, we extract face images of the speakers from the raw images, and we introduce a customized active face extraction strategy to rule out inactive speakers. For the second issue, we employ efficient data transmission from the robot to a local server to improve real-time response capability. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://lipzh5.github.io/HumanoidVLE/.


💡 Research Summary

The paper presents UGotMe, an embodied affective human‑robot interaction (HRI) system that enables a humanoid robot to recognize human emotions in real time and respond with appropriate facial expressions. The authors identify two major embodiment challenges that hinder the deployment of existing vision‑aware multimodal emotion recognition models in real‑world HRI: (1) environmental visual noise in multiparty conversations, caused by distracting objects and inactive speakers appearing in the robot’s field of view, and (2) the difficulty of meeting real‑time response requirements due to large model inference latency and communication overhead between robot and processing server.

To address the first challenge, the system employs a two‑stage denoising pipeline. First, raw RGB frames captured by the robot’s left‑eye camera are processed with MTCNN to detect faces, and the detected face regions are cropped using the OpenFace toolkit. Second, a customized active‑face extraction strategy aligns the robot’s head pose and camera orientation with the direction of sound arrival, ensuring that the face centered on the image’s x‑axis corresponds to the active speaker. Among the detected faces, only this active face is retained for emotion analysis. The authors further apply person‑specific neutral normalization: for each utterance they subtract a neutral face (either a manually selected neutral frame for each role in the MELD dataset or the first frame of the sequence in deployment) to obtain delta images that reduce inter‑person variability.
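The two denoising steps above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the helper names are assumptions, the face detections are assumed to arrive as bounding boxes (e.g., from MTCNN), and the images are plain nested lists for simplicity.

```python
# Sketch of the denoising pipeline: active-face selection by horizontal
# position, followed by person-specific neutral normalization.
# (Hypothetical helpers; the paper uses MTCNN for detection and the
# OpenFace toolkit for cropping.)

def pick_active_face(boxes, image_width):
    """Select the face whose horizontal center is closest to the image
    center. Under the head-pose/sound-direction alignment described
    above, this face corresponds to the active speaker.
    boxes: list of (x1, y1, x2, y2) detections."""
    cx = image_width / 2.0
    return min(boxes, key=lambda b: abs((b[0] + b[2]) / 2.0 - cx))

def neutral_normalize(frame, neutral):
    """Subtract a person-specific neutral face (the first frame of the
    sequence at deployment time) to obtain a delta image that reduces
    inter-person appearance variability."""
    return [[p - n for p, n in zip(row_f, row_n)]
            for row_f, row_n in zip(frame, neutral)]
```

In practice the subtraction would run on tensors of cropped face frames; the list form here just makes the delta-image idea concrete.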

For the second challenge, the authors implement an efficient data transmission scheme. Images are streamed as byte arrays from the robot to a local edge server using a separate thread and buffered for the most recent T = 640 frames (≈25 FPS). Textual data (speech‑to‑text transcriptions) are sent only when a conversational turn is generated. Communication relies on ZeroMQ over TCP, providing low‑latency, continuous delivery of visual data while keeping network load manageable.
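The buffering side of this scheme can be sketched with a fixed-length ring buffer; a `deque` with `maxlen` drops the oldest frames automatically. This is a minimal sketch of the edge-server structure under the stated T = 640 setting; the actual ZeroMQ receive loop and threading are omitted, and the function names are assumptions.

```python
from collections import deque

# Frame buffer on the local edge server: only the most recent
# T = 640 frames (~25.6 s at ~25 FPS) are retained. Frames arrive as
# byte arrays over ZeroMQ/TCP on a separate receive thread.

T = 640
frame_buffer = deque(maxlen=T)  # oldest frames are evicted automatically

def on_frame_received(frame_bytes: bytes) -> None:
    """Called by the receive thread for each incoming frame."""
    frame_buffer.append(frame_bytes)

def latest_clip(n: int) -> list:
    """Return the n most recent frames for emotion inference."""
    return list(frame_buffer)[-n:]
```

Keeping a bounded buffer means the network and memory load stay constant regardless of session length, while text is only pushed when a conversational turn completes.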

The core emotion recognition model, Vision‑Language to Emotion (VL2E), is designed to work seamlessly with the denoising pipeline. The visual encoder is an InceptionResnet‑v1 network pretrained on CASIA‑WebFace, which processes the extracted face sequences frame by frame. Frame‑level visual features are fed into a self‑attention transformer to capture intra‑modal temporal dynamics. For the textual modality, the authors use SimCSE as a sentence encoder and enrich the input with a prompt‑based context modeling scheme: the most recent k utterances are concatenated, and a prompt “for <u_t>, speaker feels <mask>.” is appended, allowing the model to attend to the current turn while preserving conversational context. The resulting visual and textual embeddings are fused via a cross‑modal transformer, and a final classifier predicts one of seven emotion categories (neutral, surprise, fear, sadness, joy, disgust, anger).
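The prompt-based context construction can be illustrated as plain string assembly. This is a hypothetical sketch: the function name, the default k, and the exact mask token are assumptions, not the authors' implementation.

```python
# Assemble the textual input for VL2E: concatenate the k most recent
# utterances as context, then append the emotion prompt for the
# current turn u_t.

def build_text_input(utterances, k=5, mask_token="<mask>"):
    """utterances: conversation history in order; the last element is
    the current turn u_t whose emotion is being predicted."""
    context = " ".join(utterances[-k:])
    current = utterances[-1]
    prompt = f"for {current}, speaker feels {mask_token}."
    return f"{context} {prompt}"
```

The sentence encoder then embeds this single string, so the prompt steers attention toward the current turn while the concatenated history preserves conversational context.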

Evaluation on the MELD dataset—a multiparty conversational emotion recognition benchmark derived from the TV series Friends—shows that VL2E achieves the highest weighted‑average F1 score among all baselines, demonstrating robustness to noisy visual inputs and the benefit of context‑aware textual prompting.
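For reference, the weighted-average F1 metric used on MELD averages per-class F1 scores weighted by each class's support (number of true samples). A from-scratch sketch, equivalent to scikit-learn's `f1_score(..., average="weighted")`:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by true-class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n / total) * f1
    return score
```

Weighting by support matters on MELD because its seven emotion classes are heavily imbalanced (neutral dominates), so an unweighted macro average would over-penalize rare classes.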

To validate the system in a real‑world setting, the authors deploy UGotMe on Ameca, a commercially available humanoid robot from Engineered Arts. The robot’s onboard camera and microphone feed visual and audio streams to the edge server; Google Cloud Speech‑to‑Text converts audio to text; VL2E predicts the interlocutor’s emotion; the predicted emotion is mapped to one of the robot’s seven predefined facial expressions (neutral, surprise, fear, sadness, joy, disgust, anger); and the robot animates the corresponding facial muscles. Language responses are generated by a GPT‑style model and synthesized with Amazon Polly. All training and inference are performed on a single NVIDIA H100 GPU, using AdamW optimization and a cosine‑warmup learning‑rate schedule.
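The emotion-to-expression step described above is a direct lookup over the seven shared labels. A minimal sketch, assuming hypothetical expression asset names (the robot's actual animation API is not documented in the summary):

```python
# Map a predicted MELD emotion label to one of Ameca's seven predefined
# facial expressions. "expr_*" names are placeholders, not real assets.

EMOTIONS = ("neutral", "surprise", "fear", "sadness",
            "joy", "disgust", "anger")

EXPRESSION_MAP = {e: f"expr_{e}" for e in EMOTIONS}

def expression_for(predicted_emotion: str) -> str:
    """Fall back to the neutral face for any unexpected label."""
    return EXPRESSION_MAP.get(predicted_emotion, EXPRESSION_MAP["neutral"])
```

Defaulting to neutral is a defensive choice for deployment: a classifier glitch then produces a bland expression rather than a crash or an inappropriate one.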

Experiments involve scripted multiparty dialogues with multiple volunteers, where each participant converses with the robot one at a time. The robot successfully recognizes the speaker’s emotional state despite the presence of other people and background objects, and it reacts with appropriate facial expressions in near‑real‑time, confirming the efficacy of both the denoising strategies and the low‑latency data pipeline.

The paper’s contributions are threefold: (1) the design of an embodied HRI system that simultaneously mitigates environmental visual noise and satisfies real‑time constraints, (2) the introduction of the VL2E multimodal emotion recognition model, which outperforms existing methods on a challenging multiparty dataset, and (3) a practical demonstration of the system on a physical humanoid robot, showcasing its applicability in real‑world affective interactions. The authors suggest future work on automatic neutral‑face estimation, handling multiple simultaneous active speakers, and extending robot expressive capabilities beyond facial expressions to include body gestures and prosodic modulation.

