Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems

Reading time: 5 minutes

📝 Original Info

  • Title: Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems
  • ArXiv ID: 2512.10975
  • Date: 2025-12-02
  • Authors: Matvey Nepomnyaschiy, Oleg Pereziabov, Anvar Tliamov, Stanislav Mikhailov, Ilya Afanasyev

📝 Abstract

Effective human-agent interaction (HAI) relies on accurate and adaptive perception of human emotional states. While multimodal deep learning models - leveraging facial expressions, speech, and textual cues - offer high accuracy in emotion recognition, their training and maintenance are often computationally intensive and inflexible to modality changes. In this work, we propose a novel multi-agent framework for training multimodal emotion recognition systems, where each modality encoder and the fusion classifier operate as autonomous agents coordinated by a central supervisor. This architecture enables modular integration of new modalities (e.g., audio features via emotion2vec), seamless replacement of outdated components, and reduced computational overhead during training. We demonstrate the feasibility of our approach through a proof-of-concept implementation supporting vision, audio, and text modalities, with the classifier serving as a shared decision-making agent. Our framework not only improves training efficiency but also contributes to the design of more flexible, scalable, and maintainable perception modules for embodied and virtual agents in HAI scenarios.

💡 Deep Analysis

Figure 1 (image not reproduced here; see the gallery below)

📄 Full Content

Human-agent interaction (HAI) is becoming increasingly important as autonomous agents need to understand and respond to human emotions to work effectively with people [1]. To achieve this, agents must be able to recognize emotions from multiple sources such as facial expressions, speech, and text [2]. This multimodal emotion recognition capability is essential for creating socially intelligent agents that can adapt their behavior based on human emotional states.

Current approaches to multimodal emotion recognition typically use large neural networks that process all input types together [3]. While these methods work well in controlled environments, they face several problems in real-world applications. These monolithic systems are computationally expensive to train, difficult to modify when new input types are added, and challenging to maintain because changing one component affects the entire system [4]. Multi-agent systems offer a promising alternative for solving complex problems by dividing tasks among specialized agents [5]. For emotion recognition, this means that each input type (such as facial expressions or speech) can be processed by a dedicated agent with specific expertise [6]. These agents can then coordinate their results, making the system more modular and allowing individual components to be updated or improved more easily.
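As a rough illustration of this agent-per-modality idea (a sketch, not code from the paper; class names such as `ModalityAgent` and `Coordinator` are hypothetical), each modality can sit behind a common encoding interface so that agents can be added, queried, or swapped independently:

```python
# Sketch of one agent per modality behind a shared interface (hypothetical names).
from abc import ABC, abstractmethod
from typing import Dict, List
import numpy as np


class ModalityAgent(ABC):
    """One autonomous agent per input type (vision, audio, text, ...)."""

    name: str

    @abstractmethod
    def encode(self, raw_input) -> np.ndarray:
        """Turn raw modality data (frame, waveform, transcript) into a feature vector."""


class FaceAgent(ModalityAgent):
    name = "vision"

    def encode(self, frame) -> np.ndarray:
        # Placeholder: a real agent would run a face detector and expression encoder.
        return np.zeros(128)


class SpeechAgent(ModalityAgent):
    name = "audio"

    def encode(self, waveform) -> np.ndarray:
        # Placeholder: a real agent would run a speech-emotion encoder.
        return np.zeros(768)


class Coordinator:
    """Collects features from whichever agents are currently registered."""

    def __init__(self, agents: List[ModalityAgent]):
        self.agents = {a.name: a for a in agents}

    def collect(self, inputs: Dict[str, object]) -> Dict[str, np.ndarray]:
        # Only query agents for which input is available; missing modalities are skipped.
        return {name: agent.encode(inputs[name])
                for name, agent in self.agents.items() if name in inputs}
```

Replacing an encoder then means registering a different agent under the same name, without touching the rest of the pipeline.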

Recent developments in machine learning have produced powerful models for emotion recognition, such as emotion2vec [7], which works well across different languages and audio conditions. Similarly, specialized models for face detection [8] and text emotion analysis have achieved excellent performance in their specific areas. However, combining these different models into a single emotion recognition system remains challenging, especially considering the computational costs and design limitations of traditional approaches.

In this work, we propose a multi-agent framework for multimodal emotion recognition that addresses these limitations using a supervisor-based architecture. Our approach processes video input through specialized agents for each primary modality (facial expressions, speech, and text), with an additional Audio Event Detection (AED) component that provides auxiliary audio tags (e.g., speech presence) rather than acting as a separate modality in the fusion space. A central supervisor then coordinates these agents and makes the final emotion prediction. This design allows for easy integration of new input types, simple replacement of outdated components, and reduced computational costs during training and operation. The complete implementation is available as open-source software.
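Building on the per-modality interface sketched above, a hedged sketch of how such supervisor-based coordination might be wired, with the AED contributing auxiliary tags rather than a fused modality, could look like this (all names are illustrative, not the authors' actual API):

```python
# Illustrative supervisor: gate agents on AED tags, fuse features, delegate the decision.
import numpy as np


class Supervisor:
    def __init__(self, modality_agents, aed, fusion_classifier):
        self.modality_agents = modality_agents      # e.g. face, speech, text agents
        self.aed = aed                              # auxiliary audio event detector
        self.fusion_classifier = fusion_classifier  # shared decision-making agent

    def predict(self, video_clip):
        # 1. Auxiliary audio tags indicate which agents are worth running.
        tags = self.aed.tag(video_clip.audio)       # e.g. {"speech": True}

        # 2. Each registered agent encodes its own modality independently.
        features = []
        for agent in self.modality_agents:
            if agent.name in ("audio", "text") and not tags.get("speech", True):
                continue                            # skip speech-dependent agents
            features.append(agent.encode(video_clip))

        # 3. The supervisor fuses the features and delegates the final emotion label.
        fused = np.concatenate(features)
        return self.fusion_classifier.predict(fused[None, :])[0]
```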

The main contributions of this work are: (1) we introduce a multi-agent architecture for multimodal emotion recognition that processes video input through specialized agents using supervisor-based coordination to overcome the limitations of traditional approaches; (2) we demonstrate our approach through a complete implementation that supports vision, audio, and text processing extracted from video, using state-of-the-art models including emotion2vec, YOLOv8-Face, and FRIDA; and (3) we provide a modular framework that improves system flexibility and maintainability for real-world applications, allowing for easy updates and improvements without affecting the entire system.

Our multi-agent framework provides a new approach to multimodal emotion recognition that offers better modularity and flexibility compared to traditional methods. While our current implementation uses logistic regression as the initial classifier, the modular design makes it easy to integrate more advanced classification methods in the future. This work contributes to the development of more robust and maintainable perception systems for human-agent interaction applications.
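Since the paper names logistic regression as the initial fusion classifier, a minimal sketch of a swappable fusion head might look like the following (scikit-learn is used for illustration; the toy feature dimensions and random labels are assumptions, not the authors' setup):

```python
# Toy example of a replaceable fusion head trained on concatenated per-modality features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: concatenated per-modality features for N clips; y: emotion labels.
X = np.random.randn(200, 128 + 768 + 512)   # assumed vision/audio/text feature sizes
y = np.random.randint(0, 7, size=200)       # e.g. 7 basic emotion classes

fusion_head = LogisticRegression(max_iter=1000)
fusion_head.fit(X, y)

# Because the fusion head sits behind a plain fit/predict interface, it can
# later be replaced (e.g. by a gradient-boosted or MLP classifier) without
# retraining any of the modality encoders.
print(fusion_head.predict(X[:5]))
```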

Multimodal emotion recognition (MER) integrates signals from visual, acoustic, and textual modalities to improve affective understanding in human-agent interaction. Early research focused on unimodal systems such as ResNet and ViT for visual cues [9], [10], PANNs and Whisper Large V3 Turbo for speech [11], [12], and RoBERTa for textual emotion modeling [13]. While effective for isolated channels, these systems fail to capture the inter-modal dependencies critical to robust emotion inference.

To address this, multimodal fusion models employ tensor fusion networks, gating mechanisms, or transformer-based fusion (e.g., BLIP-2) to integrate modalities [14], [15]. However, most of these architectures are monolithic, with tightly coupled input streams, which limits their adaptability and interpretability. Updating or replacing a modality encoder often requires retraining the entire model [16].
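For context, the "gating" style of fusion mentioned above can be sketched as follows (an illustrative PyTorch module, not code from the cited works); because the gates, projections, and classifier head are trained jointly, swapping a single modality encoder generally forces retraining of the whole network:

```python
# Illustrative gated late fusion: learned gates weigh each modality before summation.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    def __init__(self, dims, fused_dim=256, num_classes=7):
        super().__init__()
        # One projection and one gate per modality, all trained jointly.
        self.proj = nn.ModuleList(nn.Linear(d, fused_dim) for d in dims)
        self.gate = nn.ModuleList(nn.Linear(d, fused_dim) for d in dims)
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, feats):                        # list of per-modality tensors
        fused = sum(torch.sigmoid(g(x)) * p(x)       # gate * projected features
                    for x, p, g in zip(feats, self.proj, self.gate))
        return self.head(fused)                      # class logits
```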

Recent work explores modular and collaborative architectures inspired by multi-agent systems [17], [18]. Studies such as Inside Out [19] and Project Riley [20] employ autonomous agents coordinated by supervisory modules, demonstrating enhanced modularity and interpretability.

📸 Image Gallery

  • classificator_pipeline.png
  • cm_catboost.png
  • cm_fusion.png
  • cm_mlp.png
  • dimension_pipeline.png
  • multimodal_agents.png

Reference

This content is AI-processed based on open access ArXiv data.
