Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
💡 Research Summary
Resp‑Agent tackles two fundamental obstacles that have limited deep‑learning‑based respiratory auscultation: (i) information loss caused by converting raw acoustic signals into spectrograms, which discards transient events and clinical context, and (ii) severe data scarcity compounded by long‑tailed class imbalance. The authors propose an autonomous, closed‑loop system driven by a novel Active Adversarial Curriculum Agent (Thinker‑A²CA). Unlike static pipelines, Thinker‑A²CA continuously monitors the diagnostic model’s error patterns, identifies “weakness zones” in the label space, and dynamically schedules targeted data synthesis to address those zones. This creates a feedback loop where synthetic data are not generic but purposefully crafted to challenge and improve the current model.
The diagnostic backbone, called the Modality‑Weaving Diagnoser, weaves audio tokens derived from respiratory recordings together with electronic health record (EHR) text. It employs Strategic Global Attention to capture long‑range clinical dependencies while introducing Sparse Audio Anchors—special tokens that mark millisecond‑scale acoustic transients such as crackles or wheezes. By explicitly representing these fleeting events, the model preserves fine‑grained acoustic information that conventional Transformers often overlook, and it simultaneously integrates patient‑level metadata (age, smoking history, comorbidities, etc.) for richer context.
To generate the targeted synthetic samples, the authors adapt a text‑only large language model (LLM) through a modality‑injection layer, producing a Flow Matching Generator. This generator decouples pathological content (e.g., “pneumonia”, “asthma”) from acoustic style (recording device characteristics, background noise). During training, real audio recordings are paired with LLM‑distilled clinical narratives, allowing the model to learn a mapping from textual pathology descriptions to audio token space. Flow Matching then defines a continuous probability flow that can be sampled to synthesize high‑fidelity respiratory sounds with controllable style attributes. The generated samples are fed back to Thinker‑A²CA, which selects those that most effectively expose the diagnostic model’s current blind spots.
A new benchmark, Resp‑229k, underpins the entire framework. It comprises 229,000 high‑resolution respiratory recordings paired with automatically generated clinical narratives, covering 27 disease categories plus normal breathing. The dataset preserves the natural long‑tail distribution observed in clinical practice, making it an ideal testbed for evaluating robustness under data scarcity.
Extensive experiments were conducted across four settings: (1) full‑data training, (2) extreme low‑data regimes (10 %, 5 %, 1 % of the full set), (3) balanced oversampling baselines, and (4) varied noise conditions. Resp‑Agent consistently outperformed strong baselines—including CNN‑based spectrogram classifiers, pure Transformer models, and recent multimodal approaches—by 4.2 %–7.8 % absolute accuracy gains. The most pronounced improvements appeared in minority classes such as tuberculosis and pulmonary fibrosis, where F1 scores rose by more than 15 %. Ablation studies revealed that removing Thinker‑A²CA (i.e., using static synthetic data) caused a notable performance drop, and omitting Sparse Audio Anchors reduced the model’s ability to detect brief auscultatory events.
Human evaluation further validated the realism of the synthesized sounds: five board‑certified pulmonologists listened to a blind mix of real and generated recordings and achieved an average 82 % agreement with the ground‑truth labels, indicating that the synthetic data are clinically plausible.
The paper’s contributions are fourfold: (1) introduction of Thinker‑A²CA, an active curriculum agent that closes the loop between diagnosis and data synthesis; (2) design of the Modality‑Weaving Diagnoser with strategic global attention and sparse audio anchors to preserve both long‑range context and millisecond‑scale transients; (3) development of a Flow Matching Generator that leverages a text‑only LLM with modality injection to decouple pathology from acoustic style; and (4) release of the large‑scale Resp‑229k multimodal benchmark.
Limitations include the current focus on respiratory audio alone—extension to other physiological signals (e.g., ECG, blood pressure) remains to be explored—and the reliance on LLM‑generated clinical narratives, which may contain factual errors without an automated verification step. Future work will aim to integrate additional biosignals into the multimodal agent framework and to incorporate rigorous validation pipelines for LLM‑derived clinical text.
Comments & Academic Discussion
Loading comments...
Leave a Comment