
📝 Abstract

We introduce Interactive Intelligence, a novel paradigm of digital humans capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.

Most existing digital humans remain primarily imitative, reproducing surface patterns of behavior without true understanding of interaction logic. While visual fidelity has greatly improved in recent years [154,163], a fundamental gap remains in enabling these avatars to function as responsive, logic-driven entities. To bridge this gap, we introduce Interactive Intelligence, a novel paradigm of digital humans that interact seamlessly with users, while possessing personality-aligned expression, adaptive responsiveness, and self-evolution capabilities. This paradigm transforms the digital human from a passive playback system into an embodied agent capable of coherent multimodal engagement within a dynamic narrative context [93].

Current approaches to digital human creation generally fall into two categories: traditional CG pipelines and generative model-based workflows. Traditional CG methods can offer precise control but are hindered by prohibitive production times and reliance on labor-intensive manual processes. On the other hand, workflows utilizing general-purpose multimodal generative models leverage massive audiovisual corpora to accelerate production but remain fundamentally limited to offline generation [5,13,143,63,19]. Consequently, the resulting characters are primarily imitative rather than autonomous, reproducing surface behavioral patterns without genuine interaction logic. This leaves them incapable of real-time responsiveness and prone to failures in maintaining consistent identity and behavioral coherence over long-term interactions [121,96].

Constructing an end-to-end interactive system presents unique challenges across multiple modalities. In response generation, standard LLMs often violate narrative causality (e.g., revealing spoilers) and drift out of persona during extended interactions [126,57]. In speech synthesis, existing TTS models lack efficient discrete speech representations, hindering the low-latency generation required for fluid conversation [29,123,104,43,9]. In facial animation, a critical issue is the “zombie-face” phenomenon, where digital avatars exhibit stiffness and lack natural listening behaviors when not speaking, breaking user immersion [98,1,114]. Furthermore, generating coherent full-body motion remains difficult: autoregressive models often suffer from error accumulation, while standard diffusion models are computationally prohibitive for real-time streaming [134,155,117,156,157]. Finally, rendering these motions into a visual avatar requires maintaining strict multi-view identity consistency, which is often compromised in image-driven diffusion approaches [140,106,83,92].
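The streaming limitation of standard diffusion models can be illustrated with a toy per-frame noise schedule in the spirit of diffusion forcing, a technique from the broader literature (the function name, `delay` parameter, and schedule shape here are illustrative assumptions, not the paper's formulation): each frame carries its own noise level, so earlier frames finish denoising first and can be streamed out while later frames are still noisy.

```python
import numpy as np

def frame_noise_schedule(num_frames: int, num_steps: int, delay: int = 2) -> np.ndarray:
    """Toy per-frame noise-level schedule (illustrative sketch only).

    Returns an array of shape (num_steps, num_frames): entry [t, f] is the
    noise level of frame f at denoising step t. Later frames start at a
    higher effective level, so earlier frames reach level 0 (clean) first
    and can be emitted while the rest are still being denoised.
    """
    levels = np.empty((num_steps, num_frames), dtype=int)
    for t in range(num_steps):
        for f in range(num_frames):
            # frame f lags frame 0 by `delay` steps per frame index,
            # clamped to the valid noise-level range [0, num_steps - 1]
            levels[t, f] = min(num_steps - 1, max(0, num_steps - 1 - t + f * delay))
    return levels
```

Because frames are denoised at staggered levels rather than all together, a fixed compute budget per step yields a steady stream of finished frames instead of one batch at the end.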

To address these challenges, we introduce Mio (Multimodal Interactive Omni-Avatar), a comprehensive framework that models digital humans as autonomous agents with interactive intelligence. We propose a cascading paradigm composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. The Thinker serves as the cognitive core, utilizing a hierarchical memory system and diegetic knowledge graph to ensure narrative consistency and personality fidelity. The Talker leverages high-fidelity speech representations and produces clear, expressive speech that is well aligned with the context. The Face Animator introduces a unified listening-speaking framework to generate responsive facial dynamics even during silence. The Body Animator utilizes a novel streaming diffusion forcing strategy to convert text instructions into physically plausible body motions in real time. Finally, the Renderer leverages a parameter-based diffusion transformer to synthesize the visual avatar with precise control over facial and body dynamics while ensuring multi-view consistency.
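The five-module cascade described above can be sketched as a data-flow pipeline. This is a hypothetical, schematic sketch: the module names follow the paper, but every function body is a placeholder stand-in (fake tokens, split strings), not the actual models.

```python
# Hypothetical sketch of Mio's five-module cascade; all internals are
# placeholders used only to show how outputs flow between modules.

def thinker(user_utterance, memory):
    # cognitive core: record the turn (stand-in for the hierarchical
    # memory system / knowledge graph) and plan a persona-consistent reply
    memory.append(user_utterance)
    return f"(reply to: {user_utterance})"

def talker(text):
    # TTS stage: emit a discrete audio-token stream (fake integer tokens)
    return [ord(ch) % 256 for ch in text]

def face_animator(audio_tokens, speaking=True):
    # per-frame facial parameters; a real system would also generate
    # natural "listening" motion when speaking=False
    state = "speak" if speaking else "listen"
    return [(state, tok) for tok in audio_tokens]

def body_animator(text):
    # text instruction -> streamed chunks of body motion (placeholder)
    return text.split()

def renderer(face_frames, body_frames):
    # fuse face and body control signals into rendered frames
    return list(zip(face_frames, body_frames))

def mio_step(user_utterance, memory):
    # one interaction turn: Thinker -> Talker -> Animators -> Renderer
    text = thinker(user_utterance, memory)
    audio = talker(text)
    frames = renderer(face_animator(audio), body_animator(text))
    return text, audio, frames
```

The cascade keeps each module independently replaceable: the Thinker's text conditions both the Talker and the Body Animator, while the Renderer consumes only parameter streams, never raw text.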

Extensive quantitative and qualitative experiments demonstrate the superiority of our approach. Our Talker module outperforms existing speech tokenizers and autoregressive TTS models in speech generation metrics with balanced multilingual capability. The Face Animator significantly outperforms baselines in listening naturalness, with over 90% of users preferring ours.

