Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Reading time: 5 minutes
...

📝 Original Info

  • Title: Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis
  • ArXiv ID: 2602.15909
  • Date: 2026-02-16
  • Authors: Author information was not provided with this entry. (Please consult the original paper for the author list.)

📝 Abstract

Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.

💡 Deep Analysis

📄 Full Content

Respiratory auscultation is a fundamental component of clinical diagnosis, providing critical acoustic evidence for assessing pulmonary health (Heitmann et al., 2023; Bohadana et al., 2014). Accurate and automated analysis of respiratory sounds holds substantial clinical value for the early screening, diagnosis, and monitoring of respiratory diseases (Rocha et al., 2019). Although deep learning has driven significant progress in this domain, existing methods remain constrained by fundamental limitations that hinder both performance and practical deployment (Huang et al., 2023; Xia et al., 2022; Coppock et al., 2024).

The first challenge is a unimodal representational bottleneck. Audio models often convert signals into mel-spectrograms for image-style CNNs (Bae et al., 2023; He et al., 2024), which discards phase and blurs fine temporal structure, obscuring transient events such as crackles (Paliwal et al., 2011). Conversely, text-only models capture electronic health record (EHR) context but lack objective acoustic evidence, limiting discrimination between conditions with similar narratives but distinct auscultatory patterns. Without deep multimodal fusion, performance and reliability saturate.
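To make the information loss concrete, the sketch below (assuming librosa and an arbitrary 4 kHz sample rate) shows that a mel-spectrogram is built from the STFT magnitude alone, so the phase, and with it part of the fine temporal structure around transients such as crackles, never reaches the downstream CNN.

```python
import numpy as np
import librosa

sr = 4000                                       # assumed sample rate for lung sounds
y = np.random.randn(sr * 5).astype(np.float32)  # stand-in for a 5-second recording

stft = librosa.stft(y, n_fft=512, hop_length=128)  # complex spectrogram
magnitude, phase = np.abs(stft), np.angle(stft)

# The mel-spectrogram uses |STFT|^2 only; `phase` is discarded from here on.
mel = librosa.feature.melspectrogram(S=magnitude**2, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (n_mels, frames): the image-like input fed to a CNN
```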

The second limitation is the lack of large, well-annotated multimodal datasets. Most public respiratory-sound corpora are small, cover only a few conditions, and lack systematic curation (Zhang et al., 2024a). Even when auxiliary metadata such as demographics and symptoms is available, existing approaches rely on basic fusion techniques and task-specific designs, limiting the development of generalized multimodal models (Zhang et al., 2024b).

We introduce Resp-229k to address the scarcity of multimodal supervision and the lack of robust cross-domain evaluation in respiratory sound analysis. Unlike existing datasets, RESP-229K provides paired audio with standardized clinical summaries, converting diverse metadata into a format suitable for multimodal modeling. We also establish a strict out-of-domain evaluation protocol to explicitly test model generalization. The dataset comprises 229,101 quality-controlled samples sourced from five public databases, categorized into 16 classes (15 conditions and 1 control).
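As a rough illustration, a single Resp-229k record can be thought of as an audio clip paired with its source database, class label, and clinical summary; the field names and example values below are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class RespSample:
    audio_path: str        # quality-controlled recording
    source: str            # one of the five public databases, e.g. "ICBHI"
    label: str             # one of 16 classes (15 conditions + 1 control)
    clinical_summary: str  # standardized, LLM-distilled paragraph

# Hypothetical example record
sample = RespSample(
    audio_path="audio/icbhi_000123.wav",
    source="ICBHI",
    label="COPD",
    clinical_summary="Adult patient; coarse crackles over the posterior left base ...",
)
```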

A core contribution is the textual supervision. Instead of full electronic health records, each clip is paired with a standardized clinical summary, a concise paragraph synthesized from available source fields. Summaries adapt to source coverage: when demographics and symptoms exist, they are included; when only auscultation events and acquisition context are present, the summary focuses on those. Concretely, we retain two typical regimes as a modeling challenge: technical/event-driven summaries (auscultatory events, site, sensor/filter, phases, wheezes/crackles) and clinically enriched summaries (demographics, smoking status, comorbidities, symptoms, past medical history).
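The invented examples below illustrate the two regimes; the wording is hypothetical and only meant to show the difference in coverage between event-driven and clinically enriched summaries.

```python
# Technical / event-driven regime: auscultatory events, site, sensor, phases.
technical_summary = (
    "Recording acquired at the posterior right lower chest with an electronic "
    "stethoscope (100 Hz high-pass filter). Four respiratory cycles; coarse "
    "crackles during two inspiratory phases; no wheezes annotated."
)

# Clinically enriched regime: demographics, smoking status, comorbidities, symptoms.
clinically_enriched_summary = (
    "64-year-old male, former smoker with hypertension, presenting with chronic "
    "cough and exertional dyspnea. Auscultation reveals expiratory wheezes over "
    "both lung fields; prior history of COPD exacerbations."
)
```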

We programmatically convert heterogeneous CSV/TXT/JSON fields and filename-derived codes into standardized summaries using DeepSeek-R1-Distill-Qwen-7B (Guo et al., 2025) as a lightweight data-to-text engine. The model does not interpret audio; instead, it consolidates existing metadata into a schema-grounded paragraph with a consistent style across sources, enabling reproducible, low-cost annotation refreshes while preserving diagnostically relevant heterogeneity.
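A minimal sketch of this data-to-text step is shown below, assuming the publicly available DeepSeek-R1-Distill-Qwen-7B checkpoint on Hugging Face; the prompt template and metadata fields are illustrative assumptions, not the authors' exact pipeline.

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # assumed checkpoint id
)

# Metadata consolidated from CSV/TXT/JSON fields or filename-derived codes.
metadata = {
    "site": "posterior left base",
    "events": "fine crackles, inspiratory phase",
    "sensor": "digital stethoscope, 100 Hz high-pass filter",
}

prompt = (
    "Rewrite the following auscultation metadata as one concise clinical summary "
    "paragraph. Use only the provided fields; do not add new findings.\n"
    f"Fields: {metadata}\nSummary:"
)

summary = generator(prompt, max_new_tokens=120, do_sample=False)[0]["generated_text"]
print(summary)
```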

To mitigate hallucination and governance risks, all LLM-generated clinical summaries undergo a second-stage audit that combines rule-based consistency checks, critique from a stronger reasoning model acting as a verifier, and sampling-based human review. This process ensures that only summaries that pass the pipeline, or are rewritten and reverified after being flagged, are retained in RESP-229K. A detailed description of the auditing pipeline is provided in Appendix E.
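The sketch below shows one plausible rule-based consistency check: any auscultatory event mentioned in a generated summary must also appear in the source metadata. The term list and normalization are assumptions for illustration, not the audited rule set described in Appendix E.

```python
import re

EVENT_TERMS = {"crackle", "wheeze", "stridor", "rhonchi"}

def unsupported_events(summary: str, source_events: str) -> set[str]:
    """Return event terms in the summary that the source metadata never mentions."""
    def terms(text: str) -> set[str]:
        words = set(re.findall(r"[a-z]+", text.lower()))
        words |= {w[:-1] for w in words if w.endswith("s")}  # crude de-pluralization
        return EVENT_TERMS & words
    return terms(summary) - terms(source_events)

# A non-empty result would flag the summary for rewriting and re-verification.
flags = unsupported_events(
    "Expiratory wheezes and fine crackles noted.",
    "wheezes, expiratory phase",
)
print(flags)  # {'crackle'}: the summary mentions crackles the metadata never does
```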

To standardize comparisons, we specify two tasks and metrics: (i) multimodal disease classification, reporting accuracy and macro-F1; and (ii) controllable audio generation conditioned on disease semantics, reporting objective acoustic similarity and clinical-event fidelity. We report both in-domain validation results and strictly out-of-domain test results. For evaluation, RESP-229K enforces a strict cross-domain split: training and validation draw on ICBHI (Rocha et al., 2017), SPRSound (Zhang et al., 2022), and UK COVID-19 (Coppock et al., 2024; Budd et al., 2024; Pigoli et al., 2022), while testing is held out on COUGHVID (Orlandic et al., 2021) and KAUH (Fraiwan et al., 2021).
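For the classification task, the reported metrics correspond to standard accuracy and macro-averaged F1, as in the scikit-learn sketch below; the labels are toy values, not results from the paper.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["COPD", "asthma", "healthy", "COPD", "pneumonia"]
y_pred = ["COPD", "healthy", "healthy", "COPD", "pneumonia"]

print("accuracy:", accuracy_score(y_true, y_pred))             # 0.8
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # ~0.67
```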

The overall architecture of Resp-Agent is depicted in Figure 1. Given the paired text-audio supervision and the cross-domain split established by RESP-229K, Resp-Agent is designed as a centrally planned, compute-aware multi-agent system that integrates standalone audio and NLP modules into a closed loop. A compute-efficient planner, Thinker-A$^2$CA (DeepSeek-V3.2-Exp; Guo et al., 2025), performs semantic intent parsing and plan-execute tool routing.
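A highly simplified sketch of plan-execute tool routing is given below: a planner maps a parsed intent to one of the registered tools (a diagnoser or a generator) and executes it. The function names and keyword-based routing rule are illustrative assumptions, not the Resp-Agent implementation.

```python
from typing import Callable, Dict

def diagnose(request: dict) -> str:
    return f"diagnosis for {request['audio']}"        # placeholder for the Diagnoser

def synthesize(request: dict) -> str:
    return f"synthetic audio for {request['label']}"  # placeholder for the Generator

TOOLS: Dict[str, Callable[[dict], str]] = {"diagnose": diagnose, "synthesize": synthesize}

def plan_and_execute(intent: str, request: dict) -> str:
    # A real planner parses free-form intent with an LLM; keyword routing here
    # only illustrates the plan -> tool -> result loop.
    tool = "synthesize" if "generate" in intent.lower() else "diagnose"
    return TOOLS[tool](request)

print(plan_and_execute("generate hard wheeze samples", {"label": "asthma"}))
print(plan_and_execute("diagnose this recording", {"audio": "clip_001.wav"}))
```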

Reference

This content is AI-processed based on open access ArXiv data.
