Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Reading time: 5 minutes
...

📝 Original Info

  • Title: Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis
  • ArXiv ID: 2602.15909
  • Date: 2026-02-16
  • Authors: Author information was not provided with this entry. (Please consult the original paper for the author list.)

📝 Abstract

Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.

💡 Deep Analysis

📄 Full Content

Respiratory auscultation is a fundamental component of clinical diagnosis, providing critical acoustic evidence for assessing pulmonary health (Heitmann et al., 2023; Bohadana et al., 2014). Accurate and automated analysis of respiratory sounds holds substantial clinical value for the early screening, diagnosis, and monitoring of respiratory diseases (Rocha et al., 2019). Although deep learning has driven significant progress in this domain, existing methods remain constrained by fundamental limitations that hinder both performance and practical deployment (Huang et al., 2023; Xia et al., 2022; Coppock et al., 2024).

The first challenge is a unimodal representational bottleneck. Audio models often convert signals into mel-spectrograms for image-style CNNs (Bae et al., 2023; He et al., 2024), which discards phase and blurs fine temporal structure, obscuring transient events such as crackles (Paliwal et al., 2011). Conversely, text-only models capture electronic health record (EHR) context but lack objective acoustic evidence, limiting discrimination between conditions with similar narratives but distinct auscultatory patterns. Without deep multimodal fusion, performance and reliability saturate.
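To make the information loss concrete, the sketch below (assuming librosa and an arbitrary 4 kHz sample rate) shows that a mel-spectrogram is built from the STFT magnitude alone, so the phase, and with it part of the fine temporal structure around transients such as crackles, never reaches the downstream CNN.

```python
import numpy as np
import librosa

sr = 4000                                       # assumed sample rate for lung sounds
y = np.random.randn(sr * 5).astype(np.float32)  # stand-in for a 5-second recording

stft = librosa.stft(y, n_fft=512, hop_length=128)  # complex spectrogram
magnitude, phase = np.abs(stft), np.angle(stft)

# The mel-spectrogram uses |STFT|^2 only; `phase` is discarded from here on.
mel = librosa.feature.melspectrogram(S=magnitude**2, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (n_mels, frames): the image-like input fed to a CNN
```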

The second limitation is the lack of large, well-annotated multimodal datasets. Most public respiratory-sound corpora are small, cover only a few conditions, and lack systematic curation (Zhang et al., 2024a). Even when auxiliary metadata such as demographics and symptoms is available, existing approaches rely on basic fusion techniques and task-specific designs, limiting the development of generalized multimodal models (Zhang et al., 2024b).

We introduce Resp-229k to address the scarcity of multimodal supervision and the lack of robust cross-domain evaluation in respiratory sound analysis. Unlike existing datasets, RESP-229K provides paired audio with standardized clinical summaries, converting diverse metadata into a format suitable for multimodal modeling. We also establish a strict out-of-domain evaluation protocol to explicitly test model generalization. The dataset comprises 229,101 quality-controlled samples sourced from five public databases, categorized into 16 classes (15 conditions and 1 control).
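As a rough illustration, a single Resp-229k record can be thought of as an audio clip paired with its source database, class label, and clinical summary; the field names and example values below are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class RespSample:
    audio_path: str        # quality-controlled recording
    source: str            # one of the five public databases, e.g. "ICBHI"
    label: str             # one of 16 classes (15 conditions + 1 control)
    clinical_summary: str  # standardized, LLM-distilled paragraph

# Hypothetical example record
sample = RespSample(
    audio_path="audio/icbhi_000123.wav",
    source="ICBHI",
    label="COPD",
    clinical_summary="Adult patient; coarse crackles over the posterior left base ...",
)
```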

A core contribution is the textual supervision. Instead of full electronic health records, each clip is paired with a standardized clinical summary, a concise paragraph synthesized from available source fields. Summaries adapt to source coverage: when demographics and symptoms exist, they are included; when only auscultation events and acquisition context are present, the summary focuses on those. Concretely, we retain two typical regimes as a modeling challenge: technical/event-driven summaries (auscultatory events, site, sensor/filter, phases, wheezes/crackles) and clinically enriched summaries (demographics, smoking status, comorbidities, symptoms, past medical history).
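The invented examples below illustrate the two regimes; the wording is hypothetical and only meant to show the difference in coverage between event-driven and clinically enriched summaries.

```python
# Technical / event-driven regime: auscultatory events, site, sensor, phases.
technical_summary = (
    "Recording acquired at the posterior right lower chest with an electronic "
    "stethoscope (100 Hz high-pass filter). Four respiratory cycles; coarse "
    "crackles during two inspiratory phases; no wheezes annotated."
)

# Clinically enriched regime: demographics, smoking status, comorbidities, symptoms.
clinically_enriched_summary = (
    "64-year-old male, former smoker with hypertension, presenting with chronic "
    "cough and exertional dyspnea. Auscultation reveals expiratory wheezes over "
    "both lung fields; prior history of COPD exacerbations."
)
```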

We programmatically convert heterogeneous CSV/TXT/JSON fields and filename-derived codes into standardized summaries using DeepSeek-R1-Distill-Qwen-7B (Guo et al., 2025) as a lightweight data-to-text engine. The model does not interpret audio; instead, it consolidates existing metadata into a schema-grounded paragraph with a consistent style across sources, enabling reproducible, low-cost annotation refreshes while preserving diagnostically relevant heterogeneity.
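A minimal sketch of this data-to-text step is shown below, assuming the publicly available DeepSeek-R1-Distill-Qwen-7B checkpoint on Hugging Face; the prompt template and metadata fields are illustrative assumptions, not the authors' exact pipeline.

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # assumed checkpoint id
)

# Metadata consolidated from CSV/TXT/JSON fields or filename-derived codes.
metadata = {
    "site": "posterior left base",
    "events": "fine crackles, inspiratory phase",
    "sensor": "digital stethoscope, 100 Hz high-pass filter",
}

prompt = (
    "Rewrite the following auscultation metadata as one concise clinical summary "
    "paragraph. Use only the provided fields; do not add new findings.\n"
    f"Fields: {metadata}\nSummary:"
)

summary = generator(prompt, max_new_tokens=120, do_sample=False)[0]["generated_text"]
print(summary)
```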

To mitigate hallucination and governance risks, all LLM-generated clinical summaries undergo a second-stage audit that combines rule-based consistency checks, critique from a stronger reasoning model acting as a verifier, and sampling-based human review. This process ensures that only summaries that pass the pipeline, or are rewritten and reverified after being flagged, are retained in RESP-229K. A detailed description of the auditing pipeline is provided in Appendix E.
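The sketch below shows one plausible rule-based consistency check: any auscultatory event mentioned in a generated summary must also appear in the source metadata. The term list and normalization are assumptions for illustration, not the audited rule set described in Appendix E.

```python
import re

EVENT_TERMS = {"crackle", "wheeze", "stridor", "rhonchi"}

def unsupported_events(summary: str, source_events: str) -> set[str]:
    """Return event terms in the summary that the source metadata never mentions."""
    def terms(text: str) -> set[str]:
        words = set(re.findall(r"[a-z]+", text.lower()))
        words |= {w[:-1] for w in words if w.endswith("s")}  # crude de-pluralization
        return EVENT_TERMS & words
    return terms(summary) - terms(source_events)

# A non-empty result would flag the summary for rewriting and re-verification.
flags = unsupported_events(
    "Expiratory wheezes and fine crackles noted.",
    "wheezes, expiratory phase",
)
print(flags)  # {'crackle'}: the summary mentions crackles the metadata never does
```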

To standardize comparisons, we specify two tasks and metrics: (i) multimodal disease classification, reporting accuracy and macro-F1; and (ii) controllable audio generation conditioned on disease semantics, reporting objective acoustic similarity and clinical-event fidelity. We report both in-domain validation results and strictly out-of-domain test results. For evaluation, RESP-229K enforces a strict cross-domain split: training and validation draw on ICBHI (Rocha et al., 2017), SPRSound (Zhang et al., 2022), and UK COVID-19 (Coppock et al., 2024; Budd et al., 2024; Pigoli et al., 2022), while testing is held out on COUGHVID (Orlandic et al., 2021) and KAUH (Fraiwan et al., 2021).
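For the classification task, the reported metrics correspond to standard accuracy and macro-averaged F1, as in the scikit-learn sketch below; the labels are toy values, not results from the paper.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["COPD", "asthma", "healthy", "COPD", "pneumonia"]
y_pred = ["COPD", "healthy", "healthy", "COPD", "pneumonia"]

print("accuracy:", accuracy_score(y_true, y_pred))             # 0.8
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # ~0.67
```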

The overall architecture of Resp-Agent is depicted in Figure 1. Given the paired text-audio supervision and the cross-domain split established by RESP-229K, Resp-Agent is designed as a centrally planned, compute-aware multi-agent system that integrates standalone audio and NLP modules into a closed loop. A compute-efficient planner, Thinker-A$^2$CA (DeepSeek-V3.2-Exp; Guo et al., 2025), performs semantic intent parsing and plan-execute tool routing.
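A highly simplified sketch of plan-execute tool routing is given below: a planner maps a parsed intent to one of the registered tools (a diagnoser or a generator) and executes it. The function names and keyword-based routing rule are illustrative assumptions, not the Resp-Agent implementation.

```python
from typing import Callable, Dict

def diagnose(request: dict) -> str:
    return f"diagnosis for {request['audio']}"        # placeholder for the Diagnoser

def synthesize(request: dict) -> str:
    return f"synthetic audio for {request['label']}"  # placeholder for the Generator

TOOLS: Dict[str, Callable[[dict], str]] = {"diagnose": diagnose, "synthesize": synthesize}

def plan_and_execute(intent: str, request: dict) -> str:
    # A real planner parses free-form intent with an LLM; keyword routing here
    # only illustrates the plan -> tool -> result loop.
    tool = "synthesize" if "generate" in intent.lower() else "diagnose"
    return TOOLS[tool](request)

print(plan_and_execute("generate hard wheeze samples", {"label": "asthma"}))
print(plan_and_execute("diagnose this recording", {"audio": "clip_001.wav"}))
```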

Reference

This content is AI-processed based on open access ArXiv data.
