ECG-Agent: On-Device Tool-Calling Agent for ECG Multi-Turn Dialogue

ECG-Agent: On-Device Tool-Calling Agent for ECG Multi-Turn Dialogue
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Recent advances in Multimodal Large Language Models have rapidly expanded to electrocardiograms, focusing on classification, report generation, and single-turn QA tasks. However, these models fall short in real-world scenarios, lacking multi-turn conversational ability, on-device efficiency, and precise understanding of ECG measurements such as the PQRST intervals. To address these limitations, we introduce ECG-Agent, the first LLM-based tool-calling agent for multi-turn ECG dialogue. To facilitate its development and evaluation, we also present ECG-Multi-Turn-Dialogue (ECG-MTD) dataset, a collection of realistic user-assistant multi-turn dialogues for diverse ECG lead configurations. We develop ECG-Agents in various sizes, from on-device capable to larger agents. Experimental results show that ECG-Agents outperform baseline ECG-LLMs in response accuracy. Furthermore, on-device agents achieve comparable performance to larger agents in various evaluations that assess response accuracy, tool-calling ability, and hallucinations, demonstrating their viability for real-world applications.


💡 Research Summary

The paper addresses two critical gaps in current ECG‑focused large language models (LLMs): the inability to sustain multi‑turn, context‑aware conversations and the impracticality of deploying such models on resource‑constrained devices like smartphones or wearables. Existing ECG‑LLMs excel at single‑turn classification, report generation, or question answering, but they lack the conversational flow needed for real‑world patient interactions and often sacrifice fine‑grained signal details required for precise measurements (e.g., P‑R‑Q‑S‑T intervals). To bridge these gaps, the authors introduce ECG‑Agent, the first tool‑calling LLM designed specifically for multi‑turn ECG dialogue, and they release a new dataset, ECG‑Multi‑Turn‑Dialogue (ECG‑MTD), to train and evaluate such agents.

Dataset Construction (ECG‑MTD).
The authors mined two large online medical consultation corpora (HealthCareMagic‑100k and icliniq‑10k), filtered for ECG‑related exchanges, and performed a qualitative analysis of ~1,000 representative conversations. From this analysis they derived seven clinically relevant topic categories (heart‑rate concerns, arrhythmias, treatment, preventive health, etc.) and aligned them with three language‑proficiency levels based on the Common European Framework of Reference (CEFR A, B, C). They also defined a set of user actions (ECG inquiry, follow‑up request, goodbye) and seven agent actions (direct response, response‑fail, response‑follow‑up, system goodbye, and three tool calls: classification, measurement, explanation). By combining a topic, a CEFR level, and one of 20 pre‑designed action sequences, they prompted Gemini‑2.5‑Flash to generate realistic dialogues. The final corpus contains 21,837 multi‑turn dialogues (≈98 k training instances) across three lead configurations (12‑lead clinical, Lead I wearable, Lead II patch), with an average of 7.68 turns per dialogue. Medical accuracy was ensured by consulting a licensed physician to map FDA‑cleared diagnostic classes to each lead configuration.

Agent Architecture and Tools.
ECG‑Agent follows the “tool‑calling” paradigm: a lightweight text‑only LLM performs reasoning and decides whether to call an external specialized tool. Three tools are integrated:

  1. Classification tool – a self‑supervised ECG encoder trained with contrastive learning and Random Lead Masking (RLM) to be robust to arbitrary lead subsets.
  2. Measurement tool – NeuroKit2, which extracts precise interval measurements (PR interval, QRS duration, QTc, heart rate, etc.) from raw ECG waveforms.
  3. Explanation tool – SpectralX, a time‑frequency based explainer for univariate time‑series classifiers (used for Lead I and Lead II only).

The LLM backbone is instantiated with four model sizes: Llama‑3.2‑1B, Llama‑3.2‑3B, Llama‑3.1‑8B, and Qwen‑3‑32B. Instruction‑tuning is performed on the ECG‑MTD dataset using LoRA (rank 16, α = 16) and 8‑bit AdamW for efficiency (Unsloth framework). Each training instance consists of the full dialogue history, the current user turn (action + content), and the target agent turn (action, “thought” reasoning, and either tool output or natural language response). The “thought” token sequence explicitly teaches the model to map a user query to a tool‑call decision or a direct answer.

Experimental Setup.
Training uses a batch size of 128, 3 epochs with early stopping, a learning rate schedule starting at 2 × 10⁻⁴, and a maximum sequence length of 4096 tokens. Datasets are split 80/10/10 for train/validation/test. Evaluation is conducted on three lead configurations separately. The authors employ Gemini‑2.5‑Pro as an LLM‑as‑a‑Judge to score accuracy (how well the response matches ground truth) and completeness (coverage of key information) on a 1‑5 scale. Ground truth responses are generated by prompting Gemini‑2.5‑Pro with cardiologist‑labeled PTB‑XL diagnostic codes (for classification) and PTB‑XL+ Uni‑G measurements (for interval queries). Human evaluation on 300 random dialogues validates the automatic scores. Additional metrics include Next Action Prediction (NAP) – the percentage of correctly predicted next actions – and Faithfulness – alignment between tool outputs and the subsequent LLM response, measured both with and without ground‑truth tool outputs.

Results.
Across all lead configurations, ECG‑Agent consistently outperforms baseline ECG‑LLMs (Gemini‑Flash, PULSE, GEM, Llama‑3.2‑Flash). For example, in the 12‑lead setting, the 3 B ECG‑Agent achieves an average accuracy of 3.45 versus 2.27 for the best baseline, and a completeness of 2.64 versus 1.87. Similar gains are observed for Lead I and Lead II. Notably, the on‑device 1 B and 3 B agents achieve performance comparable to the 8 B and 32 B models, demonstrating that tool‑calling compensates for reduced model capacity. NAP scores hover around 70 %+, and Faithfulness exceeds 85 % when ground‑truth tool outputs are provided, indicating reliable tool usage. Direct response evaluation (turns that do not invoke tools) also shows higher scores for ECG‑Agent (e.g., accuracy 3.71 vs. 1.86 for Gemini‑Flash). Per‑class Temporal Intersection‑over‑Union (TIoU) for SpectralX explanations on Lead I/II reaches 64‑76 % for common arrhythmia classes, confirming that the explanation tool works well on univariate leads.

Ablation and Limitations.
The authors note that SpectralX cannot handle multivariate 12‑lead data, so explanation capability is limited to single‑lead configurations. The dataset, while large and diverse, is synthetic; real‑world patient‑physician dialogues may exhibit different linguistic patterns and clinical nuances. Moreover, the current tool set focuses on classification, measurement, and basic explanation; future work could integrate risk stratification, medication recommendation, or longitudinal trend analysis. Privacy and safety mechanisms for on‑device deployment (e.g., secure enclaves, model verification) are mentioned as future directions.

Impact and Future Directions.
ECG‑Agent demonstrates that a lightweight, on‑device LLM equipped with domain‑specific tools can deliver accurate, context‑aware ECG counseling, bridging the gap between high‑performance cloud models and the latency, privacy, and connectivity constraints of wearable health devices. This work paves the way for personalized cardiac monitoring assistants that can answer follow‑up questions, explain findings in lay language, and provide actionable health advice without relying on continuous internet access. Extending the framework to other biosignals (e.g., PPG, respiration) and integrating more sophisticated clinical decision support tools could further broaden its applicability in digital health.


Comments & Academic Discussion

Loading comments...

Leave a Comment