Looking Beyond Accuracy: A Holistic Benchmark of ECG Foundation Models

The electrocardiogram (ECG) is a cost-effective, highly accessible, and widely employed diagnostic tool. With the advent of Foundation Models (FMs), AI-assisted ECG interpretation has begun to evolve, as FMs enable model reuse across tasks by relying on embeddings. However, to employ FMs responsibly, it is crucial to rigorously assess the extent to which the embeddings they produce generalize, particularly in error-sensitive domains such as healthcare. Although prior work has addressed the benchmarking of ECG-expert FMs, it focuses predominantly on downstream performance. To fill this gap, this study proposes an in-depth, comprehensive benchmarking framework for FMs, with a specific focus on ECG-expert ones. To this end, we introduce a benchmark methodology that complements performance-based evaluation with representation-level analysis, leveraging SHAP and UMAP. Building on this methodology, we carry out an extensive evaluation of several ECG-expert FMs pretrained with state-of-the-art techniques across cross-continental datasets and data-availability settings, including data-scarce ones, a common situation in real-world medical scenarios. Experimental results show that our benchmarking protocol provides rich insight into the patterns embedded by ECG-expert FMs, enabling a deeper understanding of their representational structure and generalizability.


💡 Research Summary

The paper addresses a critical gap in the evaluation of ECG-specific foundation models (FMs) by proposing a holistic benchmarking framework that goes beyond downstream task performance and examines the quality of the learned embeddings themselves. Four state-of-the-art ECG FMs are considered: ECG-FM (a CNN-Transformer trained with contrastive learning and masking), ECGFounder (RegNet-based multi-label pre-training), HuBERT-ECG (BERT-style with k-means label induction, evaluated at small, base, and large scales), and ECG-JEPA (a Joint-Embedding Predictive Architecture adapted to ECG signals). Each model is frozen after pre-training and used solely as an embedding extractor, so no FM weights are updated during evaluation.

Four geographically diverse 12‑lead ECG datasets are used: GEO (USA), C15 (Europe), PTB‑XL (Europe) and CHN (Asia). The datasets differ markedly in sample size (from <500 to >5,000) and number of diagnostic classes, allowing the authors to test model robustness under data‑scarcity conditions that are common in clinical practice.

The evaluation proceeds in two complementary stages. First, linear probing is performed by training five lightweight classifiers (XGBoost, Decision Tree, Random Forest, Logistic Regression, MLP) on the frozen embeddings and measuring F1 scores via 15‑fold cross‑validation. The best‑performing classifier for each FM is identified as the “optimal probe.” Second, representation‑level analysis is conducted. SHAP (Shapley Additive Explanations) is applied to the optimal probe to rank feature importance; the top‑50 most influential features are extracted for every dataset, and the overlap of these feature sets across datasets is quantified as a proxy for generalization. UMAP is then used to visualize the high‑dimensional embedding space in two dimensions, while quantitative clustering metrics—k‑Nearest Neighbor intra‑cluster distance, centroid separation, and Adjusted Rand Index (ARI) derived from Gaussian Mixture Model clustering—assess the geometry of the space before dimensionality reduction.
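The two-stage protocol above can be sketched end to end on synthetic stand-in data. Everything below is illustrative rather than the paper's implementation: `make_classification` outputs play the role of frozen FM embeddings from two datasets, and a random forest's `feature_importances_` substitutes for the SHAP values that the paper computes from the optimal probe.

```python
# Sketch of the two-stage evaluation: (1) linear probing on frozen embeddings,
# (2) top-k feature overlap across datasets as a generalization proxy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_f1(X, y, n_folds=15):
    """Linear probing: mean macro-F1 of a lightweight classifier via k-fold CV."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=n_folds, scoring="f1_macro").mean()

def top_k_features(X, y, k=50):
    """Rank embedding dimensions by importance and return the top-k indices.
    (A forest's impurity importances stand in for SHAP values here.)"""
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    return set(np.argsort(rf.feature_importances_)[-k:])

def overlap(feature_sets):
    """Fraction of top-k features shared by every dataset."""
    common = set.intersection(*feature_sets)
    k = len(next(iter(feature_sets)))
    return len(common) / k

# Two hypothetical "datasets" of 256-dimensional embeddings
X1, y1 = make_classification(n_samples=600, n_features=256,
                             n_informative=20, random_state=1)
X2, y2 = make_classification(n_samples=600, n_features=256,
                             n_informative=20, random_state=2)

f1 = probe_f1(X1, y1)
shared = overlap([top_k_features(X1, y1), top_k_features(X2, y2)])
print(f"probe macro-F1: {f1:.2f}, top-50 overlap: {shared:.2f}")
```

In the paper's setting, the probe family is swept over five classifiers (XGBoost, Decision Tree, Random Forest, Logistic Regression, MLP) and the best one per FM becomes the "optimal probe"; the sketch fixes a single probe for brevity.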

Results reveal nuanced differences among the models. All FMs achieve high F1 scores when ample data are available, but performance gaps widen under data-limited regimes. SHAP analysis shows that ECGFounder and HuBERT-ECG consistently select similar high-importance features across datasets (over 70% overlap), suggesting they capture clinically meaningful waveform components (e.g., the QRS complex and ST segment). ECG-FM exhibits more dataset-specific feature selection, indicating weaker generalization. In terms of embedding geometry, ECG-JEPA displays the most distinct class clusters, achieving the highest ARI (0.68) and centroid separation, while ECGFounder attains the lowest average k-NN distance, reflecting tighter intra-class cohesion.
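The three geometry quantities reported above can be sketched on synthetic embeddings. This is a minimal sketch under assumptions: `make_blobs` clusters stand in for per-class embedding clouds, and the exact definitions used here (mean within-class k-NN distance, mean pairwise centroid distance) are plausible choices, not necessarily the paper's formulas.

```python
# Sketch of the embedding-geometry metrics: GMM-based ARI,
# intra-class k-NN distance, and inter-class centroid separation.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

# Synthetic 64-dim "embeddings" with 4 well-separated classes
X, y = make_blobs(n_samples=500, centers=4, n_features=64, random_state=0)

# ARI: agreement between unsupervised GMM clusters and diagnostic labels
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
ari = adjusted_rand_score(y, gmm.predict(X))

def mean_knn_distance(X, y, k=5):
    """Mean distance to the k nearest same-class neighbors (cohesion)."""
    per_class = []
    for c in np.unique(y):
        Xc = X[y == c]
        nn = NearestNeighbors(n_neighbors=k + 1).fit(Xc)  # +1 skips self
        d, _ = nn.kneighbors(Xc)
        per_class.append(d[:, 1:].mean())
    return float(np.mean(per_class))

def centroid_separation(X, y):
    """Mean pairwise distance between class centroids (separation)."""
    cents = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
    d = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=-1)
    n = len(cents)
    return float(d.sum() / (n * (n - 1)))  # average over ordered pairs

knn_d = mean_knn_distance(X, y)
sep = centroid_separation(X, y)
print(f"ARI={ari:.2f}, intra-class kNN dist={knn_d:.2f}, centroid sep={sep:.2f}")
```

Crucially, these metrics are computed on the full-dimensional embeddings; UMAP serves only for 2-D visualization, so the geometry scores are not distorted by the projection.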

The study demonstrates that relying solely on downstream accuracy can mask critical deficiencies in embedding quality, especially in safety‑critical domains like healthcare. By jointly evaluating performance, feature attribution, and embedding structure, the proposed benchmark provides a more comprehensive picture of FM suitability for real‑world ECG analysis. The authors also release the full codebase and benchmark pipeline as open‑source resources, enabling the community to extend the framework to new models, tasks, or modalities. This work thus sets a new standard for responsible assessment of foundation models in medical time‑series domains.

