Interpretable and backpropagation-free Green Learning for efficient multi-task echocardiographic segmentation and classification
Echocardiography is a cornerstone for managing heart failure (HF), with Left Ventricular Ejection Fraction (LVEF) being a critical metric for guiding therapy. However, manual LVEF assessment suffers from high inter-observer variability, while existing Deep Learning (DL) models are often computationally intensive and data-hungry “black boxes” that impede clinical trust and adoption. Here, we propose a backpropagation-free multi-task Green Learning (MTGL) framework that performs simultaneous Left Ventricle (LV) segmentation and LVEF classification. Our framework integrates an unsupervised VoxelHop encoder for hierarchical spatio-temporal feature extraction with a multi-level regression decoder and an XGBoost classifier. On the EchoNet-Dynamic dataset, our MTGL model achieves state-of-the-art classification and segmentation performance, attaining a classification accuracy of 94.3% and a Dice Similarity Coefficient (DSC) of 0.912, significantly outperforming several advanced 3D DL models. Crucially, our model achieves this with over an order of magnitude fewer parameters, demonstrating exceptional computational efficiency. This work demonstrates that the GL paradigm can deliver highly accurate, efficient, and interpretable solutions for complex medical image analysis, paving the way for more sustainable and trustworthy artificial intelligence in clinical practice.
💡 Research Summary
The paper introduces a novel back‑propagation‑free, multi‑task Green Learning framework (MTGL) for simultaneous left‑ventricular (LV) segmentation and left‑ventricular ejection fraction (LVEF) classification in echocardiographic video data. The authors motivate the work by highlighting the limitations of current deep‑learning (DL) approaches: they require large labeled datasets, consume substantial computational resources, and act as “black boxes” that hinder clinical trust and deployment on edge devices. To address these issues, MTGL combines three key components:

1. **An unsupervised VoxelHop encoder** that extracts hierarchical spatio‑temporal features via a series of Saab transforms (a bias‑adjusted PCA). Each VoxelHop layer expands the receptive field, applies 3‑D neighborhood extraction, and decomposes the signal into a direct‑current (DC) component (mean intensity) and multiple alternating‑current (AC) components (edge, texture, motion). This linear, statistically driven representation is fully interpretable; the energy spectrum of the AC filters provides a natural criterion for selecting the most informative channels, dramatically reducing the number of parameters.
2. **A multi‑level regression decoder** for segmentation that follows a coarse‑to‑fine residual‑correction strategy. At the coarsest resolution (14 × 14), an XGBoost regressor predicts a continuous mask (values between 0 and 1) from the VoxelHop features. The ground‑truth masks are down‑sampled by patch averaging, preserving sub‑pixel boundary information. Each subsequent finer level trains a separate XGBoost model to correct the residual errors of the previous level, ultimately yielding a high‑resolution (128 × 128) binary LV mask. Because each decoder stage is a tree‑based regression, the overall segmentation module contains only a few thousand trainable parameters and requires no gradient descent.
3. **An XGBoost classifier** that directly consumes the VoxelHop features for LVEF categorization.
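To make the encoder's core operation concrete, a single Saab stage can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the bias adjustment of the full Saab transform and the 3‑D neighborhood construction are omitted, and `num_ac_kernels` is an illustrative parameter.

```python
import numpy as np

def saab_stage(patches, num_ac_kernels):
    """One simplified Saab stage: a DC (mean) kernel plus PCA-derived AC kernels.

    patches: (N, D) array of flattened 3-D neighborhoods.
    Returns DC responses, AC responses, and the normalized energy spectrum
    used to rank and select AC channels.
    (Sketch only: the bias term of the full Saab transform is omitted.)
    """
    n, d = patches.shape
    # DC kernel: unit-norm constant vector, so the DC response is a scaled patch mean.
    dc_kernel = np.ones(d) / np.sqrt(d)
    dc = patches @ dc_kernel                          # (N,)
    # Remove the DC part, then run PCA on the residual to obtain AC kernels.
    residual = patches - np.outer(dc, dc_kernel)
    residual -= residual.mean(axis=0, keepdims=True)
    cov = residual.T @ residual / n
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                 # sort by energy, descending
    ac_kernels = eigvecs[:, order[:num_ac_kernels]]   # (D, num_ac_kernels)
    energy = eigvals[order] / eigvals.sum()           # energy spectrum per channel
    ac = residual @ ac_kernels                        # (N, num_ac_kernels)
    return dc, ac, energy
```

Because the kernels come from an eigendecomposition rather than gradient descent, AC channels whose energy falls below a threshold can simply be discarded, which is how the energy spectrum drives the parameter reduction described above.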
The LVEF labels are discretized into three clinically relevant classes (> 50 %, 40–50 %, < 40 %). Gradient‑boosted decision trees provide intrinsic feature‑importance scores, enabling clinicians to see which encoder layers and AC filters drive the classification decisions, thereby delivering transparent AI. The encoder is pre‑trained once in an unsupervised manner and then frozen; the segmentation and classification decoders are trained independently, which simplifies the training pipeline and facilitates modular updates.

The authors evaluate MTGL on the publicly available EchoNet‑Dynamic dataset (10,030 echocardiography videos from Stanford Health Care). Input volumes consist of 12 consecutive frames, each containing two channels (end‑diastolic and end‑systolic volumes), resized to 128 × 128 pixels. MTGL achieves a classification accuracy of 94.3 % and a Dice Similarity Coefficient (DSC) of 0.912 for LV segmentation, outperforming strong 3‑D DL baselines such as 3‑D V‑Net, 3‑D U‑Net, 3‑D UNETR, and nnU‑Net.

Remarkably, the total parameter count of MTGL is roughly 0.5 million, an order of magnitude smaller than the millions of parameters typical of the DL baselines. This reduction translates into lower memory footprints, faster inference, and suitability for deployment on low‑power clinical workstations or portable devices. Moreover, because every transformation in the encoder is linear and analytically defined, the model's internal operations can be inspected, quantified, and visualized, addressing the “black‑box” criticism of conventional DL. The paper concludes that Green Learning offers a viable, efficient, and interpretable alternative to gradient‑based deep networks for complex medical image analysis.
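The coarse‑to‑fine residual‑correction decoder described above also lends itself to a compact sketch. The following toy example is an assumption‑laden illustration, not the authors' code: scikit‑learn's `DecisionTreeRegressor` stands in for the XGBoost regressors, normalized pixel coordinates stand in for VoxelHop features, and a 16 → 32 → 64 pyramid stands in for the paper's 14 → 128 pyramid.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in for XGBoost regressors

def downsample(mask, factor):
    """Patch-average the ground-truth mask, keeping fractional boundary values."""
    h, w = mask.shape
    return mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(pred, factor):
    """Nearest-neighbour upsampling of the previous level's soft prediction."""
    return np.kron(pred, np.ones((factor, factor)))

def coord_features(size):
    """Toy per-pixel features (normalized coordinates) in place of VoxelHop features."""
    ys, xs = np.mgrid[0:size, 0:size] / size
    return np.stack([ys.ravel(), xs.ravel()], axis=1)

def fit_coarse_to_fine(gt, sizes=(16, 32, 64)):
    """Train one regressor per level; each finer level corrects the residual
    left by the upsampled prediction of the previous level."""
    models, pred = [], None
    for size in sizes:
        target = downsample(gt, gt.shape[0] // size)  # soft GT at this resolution
        base = upsample(pred, 2) if pred is not None else np.zeros((size, size))
        model = DecisionTreeRegressor(max_depth=8, random_state=0)
        model.fit(coord_features(size), (target - base).ravel())
        pred = base + model.predict(coord_features(size)).reshape(size, size)
        models.append(model)
    return models, pred > 0.5                          # threshold into a binary mask

# Toy ground truth: a disc standing in for an LV mask.
ys, xs = np.mgrid[0:64, 0:64]
gt = ((ys - 32) ** 2 + (xs - 32) ** 2 <= 20 ** 2).astype(float)
_, mask = fit_coarse_to_fine(gt)
```

Each stage is an independent tree ensemble fit once on its level's residual, which is why the full decoder needs no backpropagation and only a few thousand parameters in the paper's setting.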
Future directions include extending the approach to additional cardiac views and other cardiovascular pathologies, and integrating it into real‑world clinical workflows, where the combination of high accuracy, low computational cost, and transparency can accelerate AI adoption in cardiology.