DELTA: Deliberative Multi-Agent Reasoning with Reinforcement Learning for Multimodal Psychological Counseling
Psychological counseling is a fundamentally multimodal cognitive process in which clinicians integrate verbal content with visual and vocal cues to infer clients’ mental states and respond empathically. However, most existing language-model-based counseling systems operate on text alone and rely on implicit mental state inference. We introduce DELTA, a deliberative multi-agent framework that models counseling as a structured reasoning process over multimodal signals, separating evidence grounding, mental state abstraction, and response generation. DELTA further incorporates reinforcement learning guided by a distribution-level Emotion Attunement Score to encourage emotionally attuned responses. Experiments on a multimodal counseling benchmark show that DELTA improves both counseling quality and emotion attunement across models. Ablation and qualitative analyses suggest that explicit multimodal reasoning and structured mental state representations play complementary roles in supporting empathic human-AI interaction.
💡 Research Summary
The paper introduces DELTA, a deliberative multi‑agent framework designed to bring multimodal reasoning, structured mental‑state abstraction, and emotion‑attuned response generation into AI‑driven psychological counseling. Recognizing that effective counseling relies on visual, vocal, and linguistic cues, the authors argue that text‑only large language models (LLMs) miss critical non‑verbal information and implicitly infer mental states, limiting empathy and accuracy. DELTA addresses this gap by decomposing the counseling workflow into three distinct stages, each handled by a specialized agent.
In the first stage, the Multimodal Grounding Agent (MGA) has exclusive access to raw video and audio streams. Two inquiry agents—Visual Cue Inquiry Agent (VCIA) and Vocal Cue Inquiry Agent (VoCIA)—formulate modality‑specific questions (e.g., “What facial expression indicates anxiety?” or “What prosodic pattern suggests distress?”). MGA answers these queries using the multimodal inputs, producing a cross‑modal question‑answer history H that explicitly links hypothesized mental states to observable evidence.
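The grounding loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real inquiry agents and MGA are LLM-backed, and here they are replaced by hypothetical stub functions so the shape of the question-answer history H is visible.

```python
# Hypothetical stand-ins for the paper's LLM-backed agents.
def visual_inquiry(context):   # stand-in for VCIA
    return ["What facial expression does the client show?"]

def vocal_inquiry(context):    # stand-in for VoCIA
    return ["What prosodic pattern suggests distress?"]

def mga_answer(question, video, audio):  # stand-in for MGA, the only agent with raw-stream access
    return f"(evidence grounded in video/audio for: {question})"

def build_qa_history(context, video, audio):
    """Collect the cross-modal question-answer history H, linking each
    hypothesized mental state to modality-specific observable evidence."""
    history = []
    for modality, inquire in (("visual", visual_inquiry), ("vocal", vocal_inquiry)):
        for question in inquire(context):
            history.append({
                "modality": modality,
                "question": question,
                "answer": mga_answer(question, video, audio),
            })
    return history
```

Each entry of H pairs a modality-specific question with grounded evidence, which is what the next stage consumes.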
The second stage is mental‑state structuring. The Mental State Structuring Agent (MSSA) consumes H and synthesizes a structured mental‑state representation M, inspired by the clinical Mental State Examination (MSE). This representation encodes affective, behavioral, and cognitive dimensions in a traceable, label‑based format, providing an interpretable bridge between evidence gathering and response generation.
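A minimal sketch of what such a traceable, label-based representation M might look like is given below. The field names follow the affective/behavioral/cognitive dimensions described above; the exact schema used in the paper is not specified, so this is an assumed layout for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MentalState:
    """Hypothetical structured mental-state representation M.
    Each dimension maps a clinical label to the evidence snippets
    (drawn from the QA history H) that support it."""
    affective: dict = field(default_factory=dict)
    behavioral: dict = field(default_factory=dict)
    cognitive: dict = field(default_factory=dict)

def add_finding(state, dimension, label, evidence):
    """Attach an evidence snippet to a labeled finding, keeping every
    label traceable back to its multimodal observations."""
    getattr(state, dimension).setdefault(label, []).append(evidence)
```

Keeping labels paired with their supporting evidence is what makes M an interpretable bridge: CRGA can condition on the labels, while a human reviewer can audit where each label came from.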
In the third stage, the Counseling Response Generation Agent (CRGA) generates the final therapeutic utterance y conditioned on the case context c and the structured mental state M. To ensure that the response is not only coherent but also emotionally attuned, the authors apply reinforcement learning exclusively to CRGA using Group Relative Policy Optimization (GRPO). The reward signal is the Emotion Attunement Score (EAS), a distribution‑level metric that compares the emotion distribution of the generated response (derived from a text‑based emotion encoder) with modality‑specific emotion distributions of the client (derived from visual, vocal, and textual emotion encoders). EAS computes Jensen‑Shannon distances for each modality and aggregates them, rewarding responses whose emotional profile aligns closely with the client’s multimodal emotional context.
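The EAS reward can be sketched as follows, assuming discrete emotion distributions from each encoder. The exact aggregation in the paper (distance vs. divergence, per-modality weighting) is not reproduced here; this sketch simply averages one minus the Jensen-Shannon divergence across modalities, so that closer emotional alignment yields a higher reward.

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.
    Base-2 logs, so the value lies in [0, 1]; eps guards against log(0)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2((ai + eps) / (bi + eps)) for ai, bi in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def emotion_attunement_score(response_dist, client_dists):
    """Hypothetical EAS: compare the response's emotion distribution with the
    client's visual, vocal, and textual emotion distributions, then average.
    Returns a value in [0, 1]; 1 means perfect distributional alignment."""
    divergences = [js_divergence(response_dist, d) for d in client_dists.values()]
    return 1.0 - sum(divergences) / len(divergences)
```

In GRPO training, a scalar like this would score each sampled response in a group, with relative advantages computed within the group.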
Experiments are conducted on a multimodal emotional‑support benchmark covering a variety of client scenarios. The authors evaluate several proprietary and open‑source LLMs (e.g., GPT‑4, Claude, LLaMA‑2) both with and without DELTA. Results show consistent improvements in two key metrics: Counseling Quality (CQ) and Emotion Attunement (EA). Across models, DELTA yields a 12–18% boost in CQ and a comparable rise in EA, with the most pronounced gains in cases rich in visual and vocal cues.
Ablation studies dissect the contribution of each component. Removing the multimodal inquiry agents (reverting to text‑only queries) degrades performance sharply, confirming the value of explicit visual and vocal grounding. Omitting MSSA and feeding raw evidence directly to CRGA reduces EA, highlighting the importance of structured mental‑state abstraction for emotional alignment. Finally, training CRGA without the EAS‑based reinforcement signal leads to higher linguistic fluency but lower emotional compatibility, underscoring the necessity of the distribution‑level reward.
Qualitative analyses illustrate that DELTA‑generated responses contain concrete references to observed non‑verbal behavior (“I notice your shoulders are tense, which often signals anxiety”) and nuanced empathic language, whereas baseline systems tend to produce generic supportive statements. The authors also discuss ethical considerations, emphasizing that DELTA is intended to augment—not replace—human clinicians, and that interpretability of the mental‑state representation can aid safety monitoring.
In summary, DELTA advances AI‑assisted counseling by (1) grounding reasoning in multimodal evidence, (2) providing an interpretable, clinically inspired mental‑state abstraction, and (3) optimizing response generation with a novel emotion‑attunement reinforcement signal. This integrated approach bridges theoretical counseling practice and modern AI, offering a promising direction for more empathetic, reliable, and transparent mental‑health support systems.