Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework
Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) …
Authors: Ya Zhou, Tianxiang Hao, Ziyi Cai
Ya Zhou 1,*, Tianxiang Hao 2, Ziyi Cai 3, Haojie Zhu 4, Hejun He 3, Jia Liu 1, Xiaohan Fan 4,5,* and Jing Yuan 1,*

1 Department of Information Center, Fuwai Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
2 The Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
3 Institute of Statistics and Big Data, Renmin University of China, Beijing, China
4 Cardiac Arrhythmia Center, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
5 Function Test Center, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

Abstract

Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely solely on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG.
Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.

* Email: Jing Yuan (yuanjing@fuwai.com), Xiaohan Fan (fanxiaohan@fuwaihospital.org), Ya Zhou (zhouya@fuwai.com)

1 Introduction

Heart failure (HF) is a major global health burden, affecting over 64 million individuals worldwide, with prevalence continuing to rise (Savarese et al., 2022). In many patients, left ventricular systolic dysfunction (LVSD) is initially asymptomatic, yet it frequently progresses to symptomatic HF and is associated with poor prognosis if left undetected. Low left ventricular ejection fraction (LEF) is a critical indicator that reflects the progression of LVSD toward clinically manifest HF (Heidenreich et al., 2022).
Early detection of LVSD is critical, as timely initiation of evidence-based therapies can mitigate symptoms, slow disease progression, and improve survival. Although cardiovascular imaging, particularly echocardiography (ECHO), remains the clinical gold standard for assessing ventricular function, it is time- and resource-intensive, requires specialized operators and equipment, and is not readily scalable for population-level screening (Ciampi and Villari, 2007). These limitations underscore the need for cost-effective, widely accessible approaches capable of reliably identifying patients with LEF (Attia et al., 2019a; Lakshmanan and Mbanze, 2023; Tran et al., 2025; Yao et al., 2021).

Electrocardiography (ECG), which records cardiac electrical activity, is a promising tool for large-scale screening (Hughes et al., 2024). Since its introduction by Willem Einthoven, which earned him the Nobel Prize in 1924, ECG science has steadily advanced: acquisition has become easier, signal quality has improved, and pathognomonic features for a wide range of rhythm and conduction disorders have been defined (Khera, 2024). More recently, artificial intelligence (AI) has transformed ECG analysis, enabling models to achieve cardiologist-level performance in traditional interpretation tasks (Hannun et al., 2019; Jiang et al., 2024; Li et al., 2025; Ribeiro et al., 2020) and to detect subtle patterns associated with conditions such as aortic stenosis (Kwon et al., 2020), tricuspid regurgitation (Diao et al., 2025), and LEF (Attia et al., 2019a), which are often imperceptible to humans. Despite these advances, clinical adoption varies: AI-ECG models are widely accepted for well-established conditions such as atrial fibrillation and sinus bradycardia, which can be visually confirmed according to guidelines (Joglar et al., 2024; Kusumoto et al., 2019), but remain limited for emerging tasks such as LEF detection due to concerns about interpretability and the difficulty of linking predictions to recognizable ECG features (Hughes et al., 2024; Van De Leur et al., 2022).

To enhance interpretability and facilitate broader clinical deployment, a recent study proposed a random forest model built on 555 discrete ECG measurements derived from commercial algorithms, which demonstrated superior performance compared with N-terminal prohormone brain natriuretic peptide (NT-proBNP) for LEF detection (Hughes et al., 2024). Although this approach showed promising results, its performance was slightly below that of deep learning models. Moreover, as highlighted in the original study, many features cannot be obtained when the model is applied across different ECG machines, and measurement values differ markedly between devices (Strodthoff et al., 2023), which presents a significant barrier to broader clinical adoption.

In this study, we aim to evaluate the efficacy of interpretable methods for detecting LEF using standard 12-lead ECGs. We propose ECG-based Predictor-Driven LEF (ECGPD-LEF), a framework that integrates traditional AI-ECG interpretation models with either a single-predictor approach or a multi-predictor approach to enhance recognition of subtle ECG features and improve LEF detection. The single-predictor approach performs zero-shot-like inference, requiring no additional training on the target dataset. The multi-predictor approach was trained on the publicly available ECG-ECHO paired dataset EchoNext, a comprehensive dataset encompassing diverse populations across races and clinical contexts. We also constructed an external validation dataset, MIMIC-LEF, based on MIMIC-IV (Johnson et al., 2024), MIMIC-IV-ECG (Gow et al., 2023), and MIMIC-IV-Note (Johnson et al., 2023a), which similarly covers a heterogeneous population.
The model was evaluated on multi-center datasets, including internal validation (the EchoNext test set) and external validation (MIMIC-LEF). Model performance was assessed across multiple dimensions, including the presence or absence of other structural heart disease (SHD) and valvular heart disease (VHD), as well as across different care settings and racial or ethnic subgroups. Global and local model explanations were performed to identify important predictors contributing to LEF detection and to characterize the subtle ECG changes associated with LEF.

2 Methods

2.1 Data sources

An overview of the study design is provided in Figure 1, illustrating the data sources, construction of the predictor extractor, model development for low left ventricular ejection fraction (LEF) detection, and downstream evaluation. Four ECG-only datasets were implicitly used to develop the predictor extractor, and two independent paired datasets (ECG-ECHO and ECG-Note) were explicitly used for development and testing.

The ECG-only datasets included Ningbo (Zheng et al., 2020a), Chapman (Zheng et al., 2020b), CODE-15% (Ribeiro et al., 2021), and PTB-XL (Wagner et al., 2020). Information from Ningbo, Chapman, and CODE-15% was implicitly incorporated through the pre-trained weights of a Transformer-based model (ST-MEM) (Na et al., 2024). PTB-XL was subsequently used for post-training to obtain the final predictor extractor.

For the detection of LEF, we used the ECG-ECHO paired EchoNext dataset collected at Columbia University Irving Medical Center (Elias and Finer, 2025). External validation was performed using an ECG-Note paired dataset constructed from MIMIC-IV (Johnson et al., 2024), MIMIC-IV-ECG (Gow et al., 2023), and MIMIC-IV-Note (Johnson et al., 2023a), comprising records from Beth Israel Deaconess Medical Center.

2.2 Implicit ECG-only datasets

Chapman (Zheng et al., 2020b) and Ningbo, created under the auspices of Chapman University, Shaoxing People's Hospital, and Ningbo First Hospital, contain 10,646 and 34,905 12-lead ECGs, respectively, stored in the MUSE ECG system with a sampling rate of 500 Hz and a duration of 10 seconds. CODE-15% (Ribeiro et al., 2021) was obtained through stratified sampling from the CODE dataset (Ribeiro et al., 2020) and contains 345,779 exams from 233,770 patients, collected by the Telehealth Network of Minas Gerais between 2010 and 2016. The S12L-ECG exams were performed mostly in primary care facilities using a tele-electrocardiograph manufactured by Tecnologia Eletrônica Brasileira (São Paulo, Brazil), model TEB ECGPC, or by Micromed Biotecnologia (Brasilia, Brazil), model ErgoPC 13. The ECG recordings last between 7 and 10 s and are sampled at frequencies ranging from 300 to 600 Hz. Processed subsets of these datasets are implicitly represented in the pre-trained weights of the traditional automatic ECG diagnosis model (the traditional AI-ECG model).

Figure 1: Flowchart of model development and evaluation. Model development is based on a traditional AI-ECG interpretation model, where a large unlabeled ECG dataset is used for pre-training and a smaller labeled ECG dataset is used for enhanced post-training. The automatic ECG diagnosis model outputs diagnostic probabilities for each ECG finding and serves as the predictor extractor. Based on the predictor extractor, we developed both a single-predictor approach and a multi-predictor approach. The single-predictor approach enables zero-shot-like inference without further training to detect LEF and can serve as an important indicator. The multi-predictor approach further trains a tabular model using ECG-ECHO pairs and provides both local- and global-level explanations based on SHAP values.
Model evaluation is conducted on the EchoNext test set and an independently collected ECG-Note paired dataset. Multidimensional model dissection is performed, including overall performance evaluation, model interpretability analysis, and subgroup analyses across diverse populations. LEF, low left ventricular ejection fraction; No-LEF, absence of low left ventricular ejection fraction; SHD, structural heart disease; VHD, valvular heart disease; SHAP, Shapley Additive exPlanations.

The PTB-XL dataset (Wagner et al., 2020) was recorded with devices from Schiller AG between October 1989 and June 1996 and consists of 21,837 clinical 12-lead ECG records of 10 seconds' length from 18,885 patients. The ECG records were annotated by up to two cardiologists with potentially multiple ECG statements out of a set of 71 different statements conforming to the SCP-ECG standard (ISO Central Secretary, 2009). This dataset is a commonly used benchmark for traditional ECG interpretation algorithms (Strodthoff et al., 2020). The dataset is divided into training, validation, and test sets with a ratio of 8:1:1. In this paper, the dataset is implicitly represented in the post-trained weights of the traditional automatic ECG diagnosis model (the traditional AI-ECG model).

2.3 ECG-ECHO and ECG-Note datasets

EchoNext is a recently published ECG-ECHO paired dataset for benchmarking ECG screening models (Elias and Finer, 2025). It comprises 82,543 de-identified paired ECG-ECHO records from 36,286 unique patients aged 18 years or older who underwent a digitally stored 12-lead ECG and a transthoracic ECHO within a 1-year interval between 2008 and 2022. The dataset is divided into training, validation, and test splits. The training set may include multiple ECG-ECHO pairs per patient, whereas only the latest ECG of each patient is used in the validation and test sets.
The ECG signals were extracted from the GE MUSE ECG management system at a sampling frequency of 250 Hz across all 12 leads, each with a duration of 10 seconds.

The ECG-Note dataset was constructed from MIMIC-IV (Johnson et al., 2024), MIMIC-IV-ECG (Gow et al., 2023), and MIMIC-IV-Note (Johnson et al., 2023a), as illustrated in Figure A.1 of the Appendix. We initially identified adult patients who had at least one standard 10-second 12-lead ECG with a sampling frequency of 500 Hz and a corresponding clinical note recorded within one year following the ECG. Exclusion criteria were applied to remove: 1) pairs where clinical notes lacked keywords related to ejection fraction (EF) and ECHO/TTE; 2) ECGs containing missing data (NaN). Following these exclusions, the samples were cross-referenced with MIMIC-ECG-LVEF (Li et al., 2025) to identify consistent ECG-EF class pairs. For each patient, only the ECG-Note pair with the shortest time interval was retained, resulting in a final testing set of 16,017 ECGs.

2.4 Outcomes

For the ECG-ECHO paired dataset, values of left ventricular ejection fraction were extracted from Syngo Dynamics (Siemens) and Xcelera (Philips). The labeling strategy was defined in accordance with Poterucha et al. (2025), using an upper threshold of 45% to define low left ventricular ejection fraction (LEF). An ECG was labeled positive if it was performed within 1 year prior to an echocardiogram demonstrating an ejection fraction ≤ 45%. For patients confirmed to have ejection fraction > 45%, all ECGs prior to the most recent ECHO were labeled negative. For the ECG-Note paired dataset, a large language model (Qwen et al., 2025), used with a modified strategy adapted from Gao et al. (2025), was employed to extract ejection fraction values from the discharge table of MIMIC-IV-Note (Figure A.1b in the Appendix).
The positive class was defined as ejection fraction ≤ 45%, consistent with the definition applied to the ECG-ECHO dataset.

2.5 Model development

We developed an ECG-based Predictor-Driven LEF framework (ECGPD-LEF) for the detection of LEF. As illustrated in Figure 1b, the framework comprises two components: (1) a predictor extractor that generates structured probabilistic representations from raw ECG waveforms and (2) predictor-based inference models, including single-predictor and multi-predictor approaches. The single-predictor approach performs LEF inference without additional task-specific training, whereas the multi-predictor approach learns a lightweight tabular classifier based on the extracted predictors.

2.5.1 Predictor extractor

To instantiate the predictor extractor, we adopted a Transformer-based automatic ECG diagnosis model. Specifically, we used the ST-MEM architecture (Na et al., 2024), pre-trained on the Chapman, Ningbo, and CODE-15% datasets. Using the publicly available pre-trained weights, we further fine-tuned the model on the PTB-XL dataset to predict 71 conventional ECG diagnoses. Two training strategies were implemented: the original approach (denoted as Transformer) and a modified post-training strategy (denoted as Transformer-PT) described previously (Zhou et al., 2025). The final post-trained model, equipped with sigmoid activation, outputs probability estimates for each of the 71 diagnoses. These probabilistic outputs were used as predictors for the downstream LEF modeling task.

2.5.2 Single-predictor and multi-predictor approaches

For each ECG recording, the predictor extractor generates 71 predictor values ranging from 0 to 1, each corresponding to a clinically defined ECG interpretation.

In the single-predictor approach, each predictor value (PV) was evaluated independently for LEF detection without additional training on the ECG-ECHO dataset.
For predictors corresponding to normal ECG or sinus rhythm, we used (1 − PV) to reflect their inverse clinical association with LEF; for all other predictors, the original PV was used directly. Threshold-independent metrics, including AUROC and AUPRC, were computed directly on the test set. For the F1 score, the classification threshold was selected by maximizing validation-set performance.

In the multi-predictor approach, the 71 predictors were jointly modeled using lightweight tabular classifiers. As a linear model, we implemented logistic regression with an ℓ2 penalty, with the regularization parameter selected via grid search over {0.001, 0.01, 0.1, 1.0, 10.0} on the validation set. As a nonlinear alternative, we implemented XGBoost (Chen, 2016), a gradient-boosted decision tree method well suited for structured data. The learning rate and maximum tree depth were tuned via grid search over {0.05, 0.1, 0.2} and {3, 5, 7}, respectively. The number of estimators was set to 1000 with early stopping (30 rounds) based on validation performance. For both models, the final classification threshold was determined by maximizing the F1 score on the validation set.

2.6 Performance evaluation

We first evaluated the predictor extractor on the PTB-XL test set, as it constitutes a key component of the proposed framework. LEF detection performance was subsequently assessed on two independent datasets: the internal test set, consisting of the held-out test set from the ECG-ECHO dataset, and the external test set, ECG-Note, which was constructed in this study (Figure 1). Both single- and multi-predictor approaches were evaluated on these datasets.

Model performance was quantified using the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and F1 score, consistent with prior work (Poterucha et al., 2025).
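The single-predictor scoring rule and validation-based threshold selection from Section 2.5.2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and toy data are hypothetical, `SR` is assumed to be the PTB-XL code for sinus rhythm, and the threshold search simply scans the observed scores.

```python
# Illustrative sketch only: names and toy data are assumptions, not the authors' code.

INVERTED = {"NORM", "SR"}  # assumed PTB-XL codes for normal ECG and sinus rhythm


def lef_score(name, pv):
    """Turn a predictor value (PV) in [0, 1] into an LEF risk score."""
    return 1.0 - pv if name in INVERTED else pv


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_at(labels, scores, thr):
    """F1 score when scores >= thr are called LEF-positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < thr)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(labels, scores):
    """Pick the observed score that maximizes F1 (lowest such score on ties)."""
    return max(sorted(set(scores)), key=lambda t: f1_at(labels, scores, t))


# Toy validation and test sets: LEF labels and NORM predictor values.
val_y, val_pv = [1, 1, 0, 0, 1, 0], [0.02, 0.10, 0.90, 0.30, 0.40, 0.80]
test_y, test_pv = [1, 0, 0, 1], [0.05, 0.70, 0.95, 0.35]

val_scores = [lef_score("NORM", pv) for pv in val_pv]
test_scores = [lef_score("NORM", pv) for pv in test_pv]

threshold = best_f1_threshold(val_y, val_scores)  # chosen on validation only
test_auroc = auroc(test_y, test_scores)           # threshold-independent
test_f1 = f1_at(test_y, test_scores, threshold)   # reuses the validation threshold
print(threshold, test_auroc, test_f1)
```

No model is fitted at any point, which is what makes the inference zero-shot-like: the only quantity taken from LEF-labeled data is the F1 threshold, and it comes from the validation split rather than the test split.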
Confidence intervals were estimated via 1,000 bootstrap resamples.

For benchmarking, we compared our method against the Columbia mini deep learning model (Poterucha et al., 2025), the official end-to-end baseline for LEF detection, which is publicly available with source code and pre-trained weights, enabling reproducible comparison. The Columbia mini model was also evaluated on the external test set (see Section A.3 of the Appendix).

To assess the relative contributions of different components and design choices, we further evaluated multiple configurations of the multi-predictor approach, including alternative predictor extractors, tabular models, and varying numbers of predictors.

2.7 Interpretability

The predictor-driven framework enables transparent interpretation for both single- and multi-predictor approaches. In the single-predictor approach, each predictor value directly represents the probability of the corresponding ECG diagnosis, providing inherent clinical interpretability. We identified the most important predictors and performed combined analyses with the multi-predictor approach.

For the multi-predictor approach, we employed SHAP (SHapley Additive exPlanations) values (Lundberg and Lee, 2017; Shapley et al., 1953) to quantify feature contributions at global and local levels. SHAP is a game-theoretic, additive feature attribution method that provides consistent and locally accurate explanations (Lundberg and Lee, 2017). For computational efficiency in tree-based models, we used the Tree SHAP algorithm (Lundberg et al., 2018). Global explanation plots included cumulative contributions, beeswarm summaries, and SHAP versus predictor value plots to identify key predictors and characterize their behavior in the model.
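To make the additive, locally accurate property of SHAP concrete, the sketch below computes exact Shapley values by brute force for a toy two-predictor risk function. It is an illustration only: the toy model, the predictor values, and the single reference point used for "absent" predictors are assumptions, and the brute-force enumeration stands in for Tree SHAP, which obtains attributions efficiently for tree ensembles.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition drawn from the other features
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi


def toy_risk(v):
    """Hypothetical LEF risk from two predictor values: v[0]=NORM, v[1]=ILBBB."""
    norm, ilbbb = v
    return 0.1 + 0.5 * (1 - norm) + 0.3 * ilbbb + 0.2 * (1 - norm) * ilbbb


x = [0.1, 0.8]          # ECG to explain: low NORM, high ILBBB probability
baseline = [0.9, 0.05]  # hypothetical reference ECG
phi = shapley_values(toy_risk, x, baseline)

# Local accuracy: contributions sum to the gap between prediction and baseline.
gap = toy_risk(x) - toy_risk(baseline)
print(phi, sum(phi), gap)
```

The printed sum of contributions matches the difference between the explained prediction and the reference prediction, which is the property that lets a local explanation plot decompose one ECG's LEF risk into per-predictor terms.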
Local plots highlighted the SHAP values of ten predictors and displayed predicted probabilities relative to both the ECG diagnosis thresholds and the LEF decision thresholds, facilitating interpretation of individual model predictions.

2.8 Subgroup analysis

We evaluated model performance across clinically relevant subgroups in both the internal and external test sets. In the internal test set, predefined subgroups included the presence or absence of other structural heart disease (SHD), valvular heart disease (VHD), age groups, sex, race/ethnicity, and clinical context (definitions of SHD and VHD are provided in Section B.1 of the Appendix). In the external test set, subgroup evaluation was performed for the available variables, including age groups, sex, race/ethnicity, and clinical context.

3 Results

3.1 Population characteristics

The ECG-ECHO (EchoNext) cohort is a publicly available benchmark comprising 82,543 ECG examinations, partitioned into training (n=72,475), validation (n=4,626), and internal test (n=5,442) sets according to the official protocol (Poterucha et al., 2025). We further constructed an external cohort (ECG-Note; n=16,017) as an independent test set (Figure A.1). Baseline demographic and clinical characteristics are summarized in Table 1. Compared with the internal test cohort, the external cohort exhibited a higher proportion of male patients (54.9%) and White individuals (73.0%), whereas racial/ethnic representation in the ECG-ECHO cohort was more evenly distributed. Clinical context distributions also differed substantially, with the external cohort enriched for emergency encounters (57.3%) compared with the more evenly distributed emergency (36.2%), inpatient (40.5%), and outpatient (19.5%) settings in the internal test cohort. These differences reflect substantial demographic and clinical heterogeneity across cohorts.
The distribution of the 71 extracted predictors is detailed in Appendix Table A.1, which also shows differences in binarized counts for selected predictors, including NORM and ILBBB.

3.2 Predictor recognition by the predictor extractor

Reliable predictor extraction is a prerequisite for downstream LEF modeling. Transformer-PT achieved a macro AUROC of 94.5%, AUPRC of 41.4%, and F1 score of 38.3%, outperforming the original Transformer (AUROC 89.8%, AUPRC 30.7%) and demonstrating performance comparable to previously reported state-of-the-art results on PTB-XL (Zhou et al., 2025). Classification performance for 10 representative predictors is shown in Table 2, with results for all 71 predictors provided in Appendix Table C.1. Across predictors, Transformer-PT achieved consistently strong discriminative capacity as reflected by AUROC values, whereas AUPRC and F1 varied due to class imbalance (see Appendix Section C.1). Importantly, the downstream LEF framework leverages continuous predictor scores rather than threshold-dependent binary decisions; thus, the strong ranking performance of Transformer-PT is sufficient to ensure reliable information transfer to subsequent modeling stages, even for ultra-rare categories.

3.3 Single-predictor approach performance

We evaluated the ability of individual predictors to detect LEF using their continuous output scores, without task-specific fine-tuning (zero-shot-like inference). Results for 10 representative predictors in the internal test set are summarized in Table 2, with results for the remaining predictors and the external test set provided in Appendix Tables C.1 and C.2. Predictors are ordered by decreasing F1 score in the internal test set. Notably, several predictors demonstrated substantial standalone discriminative performance.
Eight predictors (NORM, ILBBB, INJAL, ISCLA, ANEUR, ISCAL, ASMI, and SVARR) achieved AUROC values ranging from 71.0% to 81.0% internally, with five maintaining AUROC values between 70.7% and 78.8% externally. The NORM predictor was the strongest individual predictor in both cohorts, yielding an AUROC of 81.0%, AUPRC of 47.4%, and F1 score of 51.4% internally, and an AUROC of 78.8%, AUPRC of 36.2%, and F1 score of 42.3% externally. Among abnormal ECG diagnoses, ILBBB and INJAL achieved the highest discriminative performance, with internal AUROCs of 80.0% and 75.3%, and external AUROCs of 73.4% and 71.6%, respectively.

Across all 71 predictors, discriminative performance varied substantially. Nevertheless, the majority of predictors, 58 in the internal test set and 54 in the external test set, achieved AUROC values above chance level (50%), suggesting that LEF-related information is distributed across diverse ECG-derived predictors rather than confined to a small subset of diagnoses. Predictors with weaker standalone performance still contributed incremental improvements when integrated into the multi-predictor model described in the next subsection.

An additional observation was that optimal LEF detection thresholds were consistently and substantially lower than those used for conventional ECG classification. For example, while a threshold of 0.370 for the NORM predictor identified abnormal ECGs, a substantially lower threshold of 0.003641 was optimal for LEF detection, with similar patterns observed across most predictors (Appendix Figure C.1).

Table 1: Baseline demographic and clinical characteristics in the ECG-ECHO and external ECG-Note cohorts.

| Characteristic | ECG-ECHO: Training set | ECG-ECHO: Validation set | ECG-ECHO: Internal test set | ECG-Note: External test set |
|---|---|---|---|---|
| Patients (n) | 26,218 | 4,626 | 5,442 | 16,017 |
| ECGs (n) | 72,475 | 4,626 | 5,442 | 16,017 |
| Age groups | | | | |
| 18–59 | 29,783 (41.1%) | 1,787 (38.6%) | 2,124 (39.0%) | 4,270 (26.7%) |
| 60–69 | 18,745 (25.9%) | 1,093 (23.6%) | 1,318 (24.2%) | 3,637 (22.7%) |
| 70–79 | 14,898 (20.6%) | 975 (21.1%) | 1,154 (21.2%) | 3,761 (23.5%) |
| 80+ | 9,049 (12.5%) | 771 (16.7%) | 846 (15.5%) | 4,349 (27.2%) |
| Sex | | | | |
| Female | 33,524 (46.3%) | 2,356 (50.9%) | 2,731 (50.2%) | 7,222 (45.1%) |
| Male | 38,951 (53.7%) | 2,270 (49.1%) | 2,711 (49.8%) | 8,795 (54.9%) |
| Race/ethnicity | | | | |
| Hispanic | 22,806 (31.5%) | 1,351 (29.2%) | 1,649 (30.3%) | 638 (4.0%) |
| White | 21,289 (29.4%) | 1,385 (29.9%) | 1,569 (28.8%) | 11,688 (73.0%) |
| Black | 11,559 (15.9%) | 728 (15.7%) | 846 (15.5%) | 1,845 (11.5%) |
| Asian | 2,602 (3.6%) | 134 (2.9%) | 153 (2.8%) | 414 (2.6%) |
| Other | 5,272 (7.3%) | 380 (8.2%) | 457 (8.4%) | 524 (3.3%) |
| Unknown | 8,947 (12.3%) | 648 (14.0%) | 768 (14.1%) | 908 (5.7%) |
| Clinical context | | | | |
| Emergency | 22,811 (31.5%) | 1,688 (36.5%) | 1,971 (36.2%) | 9,170 (57.3%) |
| Inpatient | 34,906 (48.2%) | 1,903 (41.1%) | 2,203 (40.5%) | - |
| Outpatient | 12,423 (17.1%) | 858 (18.5%) | 1,059 (19.5%) | - |
| Procedural | 2,335 (3.2%) | 177 (3.8%) | 209 (3.8%) | - |
| Urgent | - | - | - | 3,028 (18.9%) |
| Observation | - | - | - | 2,173 (13.6%) |
| Surgical Same Day | - | - | - | 1,022 (6.4%) |
| Elective | - | - | - | 624 (3.9%) |
| Outcome | | | | |
| Ejection Fraction ≤ 45% | 16,962 (23.4%) | 866 (18.7%) | 962 (17.7%) | 2,517 (15.7%) |

The training, validation, and internal test splits correspond to the official splits of the ECG-ECHO dataset, EchoNext (Elias and Finer, 2025). The external test cohort, ECG-Note, was derived from the MIMIC-IV database and its associated ECG and clinical note modules (Gow et al., 2023; Johnson et al., 2024, 2023a). Values are shown as counts and percentages.

Table 2: Performance of traditional AI-ECG predictors and LEF detection using a single-predictor approach.
| Predictor | Traditional ECG model: AUROC | Traditional ECG model: AUPRC | Traditional ECG model: F1 score | Traditional ECG model: Thresh | LEF detection (single-predictor): AUROC | LEF detection (single-predictor): AUPRC | LEF detection (single-predictor): F1 score | LEF detection (single-predictor): Thresh |
|---|---|---|---|---|---|---|---|---|
| NORM | 94.9 (94.1–95.7) | 92.8 (91.4–94.2) | 85.1 (83.5–86.7) | 0.370 | 81.0 (79.6–82.4) | 47.4 (44.2–50.9) | 51.4 (49.3–53.8) | 0.003641 |
| ILBBB | 90.9 (65.3–99.7) | 31.6 (8.1–70.3) | 30.0 (0.0–55.6) | 0.163 | 80.0 (78.4–81.5) | 48.5 (45.3–52.0) | 50.5 (48.0–53.1) | 0.000349 |
| INJAL | 98.6 (97.3–99.6) | 52.2 (27.0–79.8) | 47.6 (19.0–72.0) | 0.500 | 75.3 (73.7–76.8) | 34.8 (32.4–37.6) | 45.1 (42.8–47.4) | 0.000386 |
| ISCLA | 92.3 (85.5–97.6) | 19.0 (5.6–45.9) | 12.5 (0.0–36.4) | 0.248 | 75.9 (74.5–77.4) | 38.9 (36.2–42.3) | 43.9 (41.6–46.3) | 0.001918 |
| ANEUR | 96.7 (91.7–99.2) | 15.0 (6.0–37.3) | 11.8 (0.0–33.3) | 0.294 | 74.8 (73.1–76.6) | 38.9 (35.9–42.3) | 42.9 (40.4–45.4) | 0.001924 |
| ISCAL | 95.3 (94.0–96.6) | 33.9 (25.2–46.8) | 34.8 (23.3–45.4) | 0.315 | 73.7 (72.1–75.2) | 32.1 (29.8–34.6) | 42.3 (40.1–44.4) | 0.001148 |
| ASMI | 98.1 (97.4–98.7) | 88.4 (84.9–91.4) | 80.3 (76.4–83.8) | 0.300 | 72.8 (71.0–74.7) | 37.4 (34.3–40.5) | 41.8 (39.4–43.9) | 0.023987 |
| SVARR | 92.0 (85.2–97.2) | 21.3 (6.1–45.9) | 22.2 (5.9–40.0) | 0.061 | 71.0 (69.3–72.6) | 31.8 (29.5–34.5) | 40.4 (38.1–42.9) | 0.000218 |
| INJIL | 92.9 (85.8–99.2) | 2.3 (0.3–12.0) | 0.0 (0.0–0.0) | 0.012 | 69.8 (68.0–71.5) | 29.5 (27.3–32.0) | 40.1 (37.9–42.3) | 0.000035 |
| CRBBB | 99.8 (99.6–99.9) | 89.1 (80.8–95.3) | 83.2 (74.8–89.8) | 0.144 | 69.7 (68.1–71.5) | 28.4 (26.3–30.8) | 39.5 (37.6–41.6) | 0.000005 |

Predictor performance was obtained from the traditional automatic AI-ECG model. LEF detection performance was derived using the single-predictor approach proposed in this study. AUROC, AUPRC, and F1 are reported in percent with 95% confidence intervals. The two Thresh columns indicate the thresholds used to maximize the F1 score on the validation set for predictor performance and LEF detection, respectively. The 10 predictors with the highest F1 scores for LEF detection are reported in this table.
NORM, normal ECG; ILBBB, incomplete left bundle branch block; INJAL, subendocardial injury in anterolateral leads; ISCLA, ischemic in lateral leads; ANEUR, ST-T changes compatible with ventricular aneurysm; ISCAL, ischemic in anterolateral leads; ASMI, anteroseptal myocardial infarction; SVARR, supraventricular arrhythmia; INJIL, subendocardial injury in inferolateral leads; CRBBB, complete right bundle branch block.

3.4 Multi-predictor approach performance

We evaluated the ECGPD-LEF multi-predictor framework across combinations of predictor extractors and tabular models, benchmarking against the official end-to-end Columbia mini model (Poterucha et al., 2025) (Table 3). Among all configurations, XGBoost combined with the post-trained Transformer predictor extractor (Transformer-PT-XGBoost) achieved the highest performance in both internal and external test sets. In the internal test set, Transformer-PT-XGBoost yielded an AUROC of 88.4% (95% CI, 87.1–89.5), an AUPRC of 69.2% (66.2–72.0), and an F1 score of 64.5% (62.2–66.7), significantly outperforming the Columbia mini model, which achieved an AUROC of 85.2% (83.9–86.5), an AUPRC of 59.9% (56.6–63.3), and an F1 score of 57.9% (55.4–60.3), with non-overlapping confidence intervals. Comparable performance improvements were observed in the external test set (Table 3).

To disentangle the contributions of individual components, we conducted controlled comparisons on the internal test set (Table 3; Figure 2). Holding the predictor extractor constant, XGBoost consistently outperformed logistic regression for both Transformer and Transformer-PT, with the largest gains observed for Transformer-PT (AUROC +3.5 percentage points; AUPRC +11.4 points). Holding the tabular model constant, post-training of the predictor extractor (Transformer-PT vs. Transformer) yielded consistent improvements across all metrics.
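The confidence intervals quoted above come from 1,000 bootstrap resamples (Section 2.6). A minimal sketch of one common way to obtain them is given below; the percentile method, the function names, and the synthetic data are assumptions, since the text does not specify which bootstrap variant was used.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval; ECGs are resampled with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # metric undefined without both classes; redraw
            continue
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(round(alpha / 2 * n_boot))]
    upper = stats[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Synthetic cohort: 20 LEF-positive and 80 LEF-negative ECGs with noisy scores.
rng = random.Random(1)
labels = [1] * 20 + [0] * 80
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

lower, upper = bootstrap_ci(labels, scores, auroc)
print(auroc(labels, scores), lower, upper)

# Internal-test AUROC intervals reported in the text (in percent).
print(intervals_overlap((87.1, 89.5), (83.9, 86.5)))  # Transformer-PT-XGBoost vs. Columbia mini
```

Checking for non-overlap of two such intervals is a conservative way to compare models; a paired bootstrap of the metric difference on the same resamples would be a more direct alternative.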
We further examined performance as predictors were progressively added to the tabular model in descending order of single-predictor F1 score. Across configurations, performance generally improved with additional predictors, with Transformer-PT-XGBoost consistently achieving the highest performance across all predictor counts and metrics (Figure 2c-e). Other configurations followed similar trends but did not exceed the Columbia mini model baseline for AUROC or AUPRC, even at their respective peak performance.

3.5 Model explanation

The tabular component of ECGPD-LEF enables both global and local interpretability via SHAP analysis. We analyzed Transformer-PT-XGBoost on the internal test set (Figure 3, Figure 4). At the global level, cumulative SHAP contributions increased with the number of predictors included in the model (Figure 3a), consistent with the performance trends observed in Figure 2 of Section 3.4. The ranking of predictors by mean absolute SHAP value showed slight differences compared with the ranking based on single-predictor F1 scores, although the top contributors remained largely consistent (Figure 3b). For example, NORM, ILBBB, INJAL, ISCLA, and ANEUR ranked 1-5 by single-predictor F1 score, whereas their SHAP-based ranking was 1, 2, 5, 3, and 4, respectively. In addition, SHAP values varied markedly at low predictor values (Figure 3c-g), often at magnitudes substantially below the diagnostic thresholds, which is consistent with the observation in Section 3.3.

At the local level, explanation plots for a positive and a negative case illustrate individualized predictor contributions (Figure 4). In these two examples, global importance patterns are reflected in case-specific prediction profiles, with NORM and ILBBB contributing the most. When combined with the single-predictor approach and corresponding diagnostic thresholds, the predictions for both cases are interpretable.
In particular, the values of NORM and ILBBB exceeded the LEF-positive thresholds in the positive case and the LEF-negative thresholds in the negative case. Furthermore, the LAO/LAE predictor in the positive case and the NORM predictor in the negative case exceeded their standard diagnostic thresholds (Figures 4b-d), indicating that these patterns are potentially human-recognizable.

Table 3: Performance comparison of the proposed ECGPD-LEF framework and baseline Columbia mini model for LEF detection.

| Test set | Method | Tabular model | Predictor extractor | AUROC | AUPRC | F1 score |
|---|---|---|---|---|---|---|
| Internal | Columbia mini model | – | – | 85.2 (83.9–86.5) | 59.9 (56.6–63.3) | 57.9 (55.4–60.3) |
| Internal | ECGPD-LEF | Logistic Regression | Transformer | 84.5 (83.1–85.9) | 56.8 (53.4–60.3) | 57.0 (54.6–59.7) |
| Internal | ECGPD-LEF | Logistic Regression | Transformer-PT | 84.9 (83.6–86.3) | 57.8 (54.6–61.2) | 58.5 (56.2–60.8) |
| Internal | ECGPD-LEF | XGBoost | Transformer | 84.9 (83.5–86.2) | 58.5 (55.1–61.7) | 57.6 (55.1–60.1) |
| Internal | ECGPD-LEF | XGBoost | Transformer-PT | 88.4 (87.1–89.5) | 69.2 (66.2–72.0) | 64.5 (62.2–66.7) |
| External | Columbia mini model | – | – | 80.8 (79.9–81.7) | 45.8 (43.7–47.7) | 47.7 (46.3–49.1) |
| External | ECGPD-LEF | XGBoost | Transformer-PT | 86.9 (86.2–87.7) | 57.6 (55.6–59.7) | 53.8 (52.5–55.0) |

The Columbia mini model is the official benchmark on the internal test set, EchoNext (Poterucha et al., 2025), with trained official weights. This model relies on seven tabular features; since the internal test set provides all seven features, it can be applied directly. For the external cohort, five of these features are not directly available, so we computed them from the MIMIC-IV ECG module to enable inference with the Columbia mini model (details of this feature computation are provided in Section A.3). The proposed ECGPD-LEF framework is evaluated with different configurations of automatic AI-ECG models and tabular models. For example, "XGBoost" indicates that the tabular model is XGBoost applied to the extracted predictors, and "Transformer-PT" indicates that the predictor extractor of the proposed framework is based on Transformer-PT. Since the baseline Columbia mini model is an end-to-end method, it does not depend on the choice of predictor extractor or tabular model. This table thus compares both the performance of the baseline model and the performance of the proposed framework under different configurations. AUROC, AUPRC, and F1 are reported with 95% confidence intervals.

[Figure 2 image. Panels: (a) ROC curves, true positive rate vs. false positive rate; (b) precision–recall curves, precision vs. recall; (c) AUROC (%), (d) AUPRC (%), and (e) F1 score (%) vs. number of predictors (1–71). Curves: Transformer-LR, Transformer-PT-LR, Transformer-XGBoost, Transformer-PT-XGBoost, and the Columbia mini model.]

Figure 2: Model performance across different evaluation settings. The first row shows receiver operating characteristic (ROC) curves (a) and precision–recall (PR) curves (b) for five methods using the full feature set (71 predictors). The second row presents the performance of four models evaluated with an increasing number of predictors (1–71), in terms of AUROC (c), AUPRC (d) and F1 score (e). The gray dashed line indicates the Columbia mini model and serves as a reference baseline.
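The local explanations of Section 3.5 (Figure 4) are expressed in log-odds, and the per-predictor SHAP contributions add up to the model output. The sketch below illustrates that additivity with the rounded values printed in the two waterfall panels of Figure 4 and converts the outputs to probabilities. The logistic conversion assumes the tabular model emits log-odds for a binary outcome; the helper names are illustrative only.

```python
import math


def sigmoid(log_odds):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def additive_output(base_value, contributions):
    """SHAP additivity: model output = base value + sum of feature contributions."""
    return base_value + sum(contributions)


BASE = -1.219  # E[f(X)] printed in both waterfall panels of Figure 4

# Rounded contributions as printed in Figure 4a and 4b. Each list holds ten
# bars, one of which is the aggregated "62 other features" bar; the mapping
# of values to predictor names is omitted here.
positive_case = [1.46, 0.76, 0.59, 0.21, 0.17, 0.17, 0.12, 0.44, -0.52, -0.16]
negative_case = [0.19, 0.21, -0.93, -0.69, -0.36, -0.32, -0.28, -0.26, -0.16, -0.13]

for name, contributions, printed_output in [
    ("positive", positive_case, 2.027),
    ("negative", negative_case, -3.936),
]:
    rebuilt = additive_output(BASE, contributions)
    # The rebuilt value matches the printed f(x) up to rounding of the bars.
    print(name, round(rebuilt, 3), printed_output, round(sigmoid(printed_output), 3))
```

Under these assumptions the baseline of -1.219 corresponds to a probability of about 0.23, the positive case to about 0.88, and the negative case to about 0.02.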
[Figure 3 image. Panels: (a) cumulative contribution (%) vs. number of predictors (1–71); (b) SHAP beeswarm plot (SHAP value, impact on model output; color: feature value, low to high) for NORM, ILBBB, ISCLA, ANEUR, INJIL, ASMI, ISCAL, INJAL, CRBBB, and SVARR; (c-h) SHAP value vs. predictor value for NORM, ILBBB, ISCLA, ANEUR, INJIL, and ASMI, colored by relative density, with vertical lines for the diagnosis threshold and the LEF threshold.]

Figure 3: Global-level explanation of ECG model predictions across all predictors. The first row shows (a) the cumulative absolute SHAP contributions of all 71 predictors, expressed as percentages of the total contribution, and (b) a SHAP beeswarm plot for the top 10 predictors ranked by F1 score obtained from the single-predictor method. The second and third rows show (c-h) the relationships between individual predictors (NORM, ILBBB, ISCLA, ANEUR, INJIL and ASMI) and their SHAP values. Point density was estimated using a Gaussian kernel density on the log-transformed values. Each plot includes two vertical reference lines: the first (LEF threshold) indicates the threshold that achieves the optimal F1 score using a single-predictor method, and the second (Diagnosis threshold) indicates the positivity threshold defined by an independent ECG diagnosis model.

3.6 Subgroup analysis

Model performance was evaluated across clinically relevant subgroups for ECGPD-LEF. Across subgroups defined by age, sex, race/ethnicity, and clinical context, the performance of the single-predictor approach (NORM, ILBBB, INJIL, and ISCLA) in the internal and external test sets is presented in Section C.4 of the Appendix. These predictors remained effective across both test sets, with NORM, ILBBB, and INJIL generally achieving higher performance.

For the multi-predictor approach, subgroup results for Transformer-PT-XGBoost in the internal test set are shown in Table 4. Transformer-PT-XGBoost consistently outperformed the Columbia mini model. Relative improvements ranged from 2.0% to 12.2% for AUROC, 3.5% to 32.3% for AUPRC, and 5.9% to 30.4% for F1 score across 16 subgroups. Improvements in AUPRC and F1 score exceeded 10% in 15 and 12 subgroups, respectively. Similar trends were observed in the external test set (Table B.2).

Stratified analyses by the presence of structural heart disease (SHD) and valvular heart disease (VHD) are presented in Figures B.1 and B.2 in the Appendix. In both settings, Transformer-PT-XGBoost demonstrated superior performance compared with the Columbia mini model. Notably, the performance of Transformer-PT-XGBoost remained stable, whereas the Columbia mini model showed reduced performance in subgroups with SHD and VHD.

[Figure 4 image. Panels: (a) SHAP waterfall plot for a positive case, E[f(X)] = -1.219, f(x) = 2.027; (b) SHAP waterfall plot for a negative case, E[f(X)] = -1.219, f(x) = -3.936; (c, d) predictor probability for the top predictors of each case, with the diagnosis threshold and LEF threshold marked.]

Figure 4: Local-level explanation of model predictions for a positive and a negative case. The first row shows SHAP waterfall plots for a positive case (a) and a negative case (b), illustrating the top ten ECG predictors contributing to the prediction, with the remaining predictors aggregated as "62 other features". Feature contributions are shown in the log-odds space and sum to the final model output. The second row presents the corresponding predictor–probability relationships for the same cases (c, d). Vertical reference lines indicate positivity thresholds derived from an independent ECG diagnosis model.

4 Discussion

4.1 A structured paradigm for LEF detection

In this study, we propose a structured predictor-integration framework (ECGPD-LEF) for ECG-based detection of reduced left ventricular function that departs from conventional end-to-end waveform modeling.
The best-performing multi-predictor configuration (Transformer-PT-XGBoost) demonstrated robust performance in both the internal hold-out test set (AUROC 88.4%, F1 64.5%) and the external test set (AUROC 86.8%, F1 53.6%), significantly outperforming a recent strong end-to-end baseline, the Columbia mini model (Poterucha et al., 2025) (internal AUROC 85.2%, F1 57.9%; external AUROC 80.7%, F1 46.4%). Performance gains were maintained across subgroups defined by age, sex, race/ethnicity, clinical context, and the presence or absence of structural or valvular heart disease, supporting the generalizability of the approach. Within this predictor-based architecture, several clinically recognizable ECG diagnoses, including NORM, ILBBB, and INJAL, were strongly associated with LEF detection (internal AUROC 75.3%–81.0%; external AUROC 71.6%–78.6%), enabling transparent interpretation of model outputs. As a lightweight extension built upon an existing ECG diagnosis model, this structured framework provides improved discrimination while preserving modularity and clinical interpretability.

4.2 The need for publicly benchmarked AI-ECG evaluation

Artificial intelligence applied to ECG has demonstrated strong performance in emerging tasks, including detection of aortic stenosis, tricuspid regurgitation and left ventricular dysfunction. However, most prior models have been developed and evaluated on institution-specific private datasets with heterogeneous population characteristics and outcome definitions, limiting reproducibility and preventing rigorous cross-model comparison. Early work by Attia et al. (2019a) trained a convolutional neural network on 44,959 patients to identify ventricular dysfunction defined as ejection fraction ≤ 35%, with prospective validation performed at the same institution (Attia et al., 2019b).
Subsequent refinements extended detection to ejection fraction ≤ 40% and reported multisite validation using digital ECG input alone (Carter et al., 2026). Although these studies achieved strong discrimination, the absence of publicly accessible benchmarking datasets constrains transparent evaluation. Recently, Poterucha et al. (2025) addressed this gap by releasing a de-identified ECG dataset comprising 36,286 unique patients with predefined training, validation, and test splits, and by benchmarking a baseline model (the Columbia mini model) that achieved discrimination comparable to models trained on substantially larger proprietary cohorts. By providing standardized data partitions, model weights, and open-source code, this work established a transparent and reproducible evaluation framework for AI-ECG research, upon which our study builds to enable rigorous and fair comparison. Furthermore, to extend validation beyond a single benchmarked dataset, we developed ECG-Note using publicly available datasets and large language models, establishing an additional external validation dataset for LEF detection that captures diverse patient populations and clinical contexts.

Table 4: Subgroup performance of ECGPD-LEF and the Columbia mini model in the internal test set.

| Subgroup | n | Prevalence (%) | Columbia mini AUROC | Columbia mini AUPRC | Columbia mini F1 score | ECGPD-LEF (multi-predictor) AUROC | ECGPD-LEF AUPRC | ECGPD-LEF F1 score |
|---|---|---|---|---|---|---|---|---|
| Age groups | | | | | | | | |
| 18–59 | 2124 | 13.6 | 85.2 (82.8–87.5) | 53.5 (48.1–60.0) | 53.8 (49.0–58.3) | 87.4 (85.0–89.8) | 64.6 (59.4–69.9) | 59.0 (54.7–63.6) |
| 60–69 | 1318 | 16.6 | 83.8 (80.9–86.6) | 57.6 (51.1–64.6) | 54.7 (49.0–59.7) | 87.2 (84.2–90.2) | 70.6 (64.9–76.3) | 63.3 (58.1–68.5) |
| 70–79 | 1154 | 21.7 | 86.2 (83.7–88.7) | 65.8 (59.5–72.2) | 63.7 (58.9–67.9) | 89.7 (87.2–91.8) | 74.7 (69.5–80.0) | 70.5 (65.8–74.8) |
| 80+ | 846 | 24.1 | 83.5 (80.4–86.5) | 65.4 (59.5–71.8) | 59.5 (54.7–64.4) | 87.3 (84.8–89.7) | 67.7 (60.9–74.1) | 66.1 (61.4–70.8) |
| Sex | | | | | | | | |
| Female | 2731 | 12.4 | 86.9 (85.0–88.8) | 52.3 (46.9–58.0) | 53.3 (49.0–57.4) | 89.9 (88.0–91.7) | 63.4 (58.2–68.3) | 60.4 (56.0–64.5) |
| Male | 2711 | 23.0 | 83.1 (81.2–84.7) | 64.0 (59.9–68.0) | 60.6 (57.5–63.5) | 86.9 (85.1–88.5) | 72.9 (69.1–76.1) | 67.0 (64.0–69.7) |
| Race / ethnicity | | | | | | | | |
| Hispanic | 1649 | 16.7 | 86.9 (84.5–89.1) | 63.3 (57.2–68.5) | 60.0 (55.3–64.5) | 89.5 (87.3–91.7) | 71.7 (66.0–76.6) | 65.2 (60.1–69.3) |
| White | 1569 | 16.3 | 84.0 (81.1–86.6) | 55.7 (49.4–62.1) | 55.1 (50.2–60.0) | 88.6 (86.4–90.8) | 64.8 (58.7–71.1) | 63.6 (59.0–68.0) |
| Black | 846 | 19.3 | 85.4 (82.3–88.7) | 65.5 (58.5–72.4) | 59.8 (53.6–65.6) | 88.1 (84.9–91.1) | 73.6 (67.2–79.3) | 66.9 (61.2–72.2) |
| Asian | 153 | 16.3 | 87.8 (80.2–94.3) | 61.4 (44.5–81.4) | 57.1 (41.5–69.4) | 92.5 (87.3–96.7) | 72.6 (54.7–86.2) | 66.7 (50.0–79.1) |
| Other | 457 | 15.8 | 83.1 (78.0–87.9) | 52.1 (40.7–64.0) | 49.4 (40.0–58.0) | 86.9 (81.5–91.6) | 65.3 (53.4–76.8) | 59.4 (50.0–68.4) |
| Unknown | 768 | 22.3 | 84.1 (80.8–87.2) | 61.5 (53.9–69.1) | 61.1 (55.0–66.7) | 85.8 (82.4–89.1) | 68.8 (61.6–75.9) | 64.7 (58.5–70.0) |
| Clinical context | | | | | | | | |
| Emergency | 1971 | 15.7 | 86.0 (83.6–88.2) | 59.9 (54.1–65.3) | 58.5 (53.7–62.8) | 87.8 (85.5–89.9) | 67.1 (61.7–72.2) | 62.0 (57.5–65.8) |
| Inpatient | 2203 | 24.4 | 81.9 (79.7–84.0) | 62.4 (58.1–67.0) | 59.1 (55.9–62.5) | 86.1 (84.1–88.1) | 71.8 (67.7–75.7) | 66.8 (63.3–69.8) |
| Outpatient | 1059 | 6.6 | 87.3 (82.0–91.5) | 44.3 (33.4–56.5) | 48.4 (38.4–57.0) | 90.6 (85.9–94.5) | 54.6 (42.8–65.4) | 56.4 (46.4–64.9) |
| Procedural | 209 | 21.5 | 79.5 (71.6–85.9) | 57.3 (41.7–72.6) | 52.3 (37.8–63.8) | 89.2 (83.0–94.0) | 75.8 (62.8–86.3) | 68.2 (54.5–78.2) |

ECGPD-LEF is configured with the predictor extractor based on Transformer-PT and the tabular model XGBoost. This configuration was selected for illustration in the internal test set. The Columbia mini model is the official benchmark on the same set (Poterucha et al., 2025). AUROC, AUPRC, and F1 are reported with 95% confidence intervals.

4.3 Moving beyond end-to-end black-box modeling

Most AI-ECG approaches for left ventricular dysfunction have relied on end-to-end deep learning architectures (Attia et al., 2019a, b; Carter et al., 2026; Poterucha et al., 2025). While such models achieve strong predictive performance, their limited interpretability poses challenges for clinical integration, particularly in applications that extend beyond conventional ECG criteria grounded in established ECG principles. In the absence of transparent mechanistic reasoning, black-box predictions may be difficult to reconcile with established diagnostic frameworks, potentially limiting clinician trust and adoption. Efforts to enhance transparency have included tabular models constructed from 555 discrete ECG measurements (Hughes et al., 2024). Although more interpretable, these approaches depend on proprietary commercial measurement algorithms that may vary across ECG platforms (Strodthoff et al., 2023), thereby constraining generalizability and standardized external evaluation. In contrast, we introduce a structured representation paradigm that integrates the predictive capacity of deep learning with the interpretability of tabular modeling. Each predictor corresponds to a clinically meaningful feature and does not rely on vendor-specific measurement pipelines. Within the publicly benchmarked setting, this approach demonstrates improved performance over the end-to-end Columbia mini model, supporting the feasibility of clinically interpretable yet high-performing AI-ECG systems.

4.4 Interpretability reveals clinically meaningful indicators for LEF detection

Interpretability analyses identified diagnostically informative predictors that provide mechanistic insight into LEF detection. In the single-predictor setting, continuous probability outputs from the trained traditional ECG diagnosis model were sufficient to achieve meaningful discrimination in a zero-shot-like manner. For example, the predicted probability of NORM alone yielded an internal AUROC of 81.0% and an external AUROC of 78.6%.
Importantly, this does not imply that the binary diagnosis of NORM directly indicates LEF. Rather, the continuous probability output, well below the clinical decision threshold, captures graded deviations from normal ECG patterns that are strongly associated with LEF. This finding suggests that deep neural networks encode subclinical ECG variations within their probabilistic representations, even when such variations do not cross conventional diagnostic boundaries, offering a potential explanation for prior observations that large-scale deep learning models can detect novel cardiovascular phenotypes from subtle signal alterations. Within the multi-predictor framework, SHAP analyses demonstrated consistent importance patterns across predictors. At the population level (Figure 3), SHAP values varied substantially across probability ranges, indicating that graded shifts in ECG-derived diagnostic probabilities contribute meaningfully to LEF risk estimation. At the individual level (Figure 4), local explanations highlighted the predictors driving each decision and provided human-interpretable insights into the model's reasoning (Figures 4c-d). Together, these findings suggest that LEF-associated ECG signatures may be decomposed into combinations of clinically recognizable diagnostic dimensions, enhancing the structural transparency of the proposed framework.

4.5 Subgroup analysis demonstrates robustness across populations

Subgroup analyses demonstrated that ECGPD-LEF maintained consistent discriminatory performance across diverse demographic and clinical strata, including age, sex, race/ethnicity, and care settings. Across all evaluated subgroups, ECGPD-LEF outperformed the Columbia mini model in terms of AUROC, AUPRC, and F1 score.
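The relative improvements quoted in Section 3.6 (2.0% to 12.2% for AUROC, 3.5% to 32.3% for AUPRC, and 5.9% to 30.4% for F1 score) can be reproduced from the point estimates in Table 4. The sketch below does so for the subgroups that appear to define the ends of each range; the helper name and dictionary layout are illustrative rather than the authors' own.

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent, rounded to one decimal."""
    return round(100.0 * (new - old) / old, 1)


# (Columbia mini model, ECGPD-LEF) point estimates copied from Table 4.
table4 = {
    "Procedural": {"AUROC": (79.5, 89.2), "AUPRC": (57.3, 75.8), "F1": (52.3, 68.2)},
    "Unknown race": {"AUROC": (84.1, 85.8), "F1": (61.1, 64.7)},
    "Age 80+": {"AUPRC": (65.4, 67.7)},
}

gains = {
    subgroup: {metric: relative_improvement(new, old) for metric, (old, new) in metrics.items()}
    for subgroup, metrics in table4.items()
}

for subgroup, metrics in gains.items():
    print(subgroup, metrics)
```

The largest gains in all three metrics come from the procedural subgroup, which is also the smallest clinical-context subgroup (n = 209), so its intervals in Table 4 are correspondingly wide.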
Notably, performance remained stable in patients with and without concomitant structural heart disease (SHD) or valvular heart disease (VHD), suggesting that the ECG signatures captured by ECGPD-LEF are not merely proxies for coexisting structural abnormalities but instead reflect signal components specifically associated with LEF. Despite limited representation of certain racial subgroups (e.g., Asian participants comprising 3.6% of the training cohort), comparable performance was observed in both internal and external validation cohorts (AUROC 92.5% and 91.0%, respectively), supporting the generalizability and potential clinical applicability of the framework.

4.6 Scalability and extensibility of the framework

The proposed ECGPD-LEF framework exhibits scalability and extensibility across multiple dimensions. First, its modular design enables integration with existing ECG diagnostic models that are increasingly adopted in clinical practice, allowing lightweight extension of current AI-ECG systems without requiring full architectural replacement. Second, performance improvements observed with the Transformer-PT backbone (Table 3) suggest that advances in upstream ECG diagnostic models may directly translate into enhanced LEF detection. As ECG diagnostic models continue to evolve with larger and higher-quality datasets, further gains may be anticipated. Third, the predictor-based structure is inherently expandable. In this study, we incorporated 71 diagnostic predictors derived from PTB-XL, and observed that model performance scaled with the number of predictors included in the tabular component (Figures 2 and 3a). Additional clinically established ECG diagnostic features (Cardiovascular Committee of China Medical Women's Association et al., 2023) may be incorporated within the same framework, enabling progressive refinement without fundamental architectural redesign.
Together, these properties underscore the scalability and extensibility of the framework for future AI-ECG development.

4.7 Limitations and future directions

Despite the favorable performance and robustness of ECGPD-LEF compared with the latest deep learning baseline, several limitations warrant consideration. First, the multi-predictor framework was developed using labels derived from ECHO reports, which may be subject to inter-observer variability in ultrasound interpretation. Although external validation on the MIMIC-IV-Note cohort (comprising data from heterogeneous cardiac imaging sources) partially mitigates this concern, potential labeling inconsistencies cannot be fully excluded. Second, while both the EchoNext and MIMIC-IV-Note datasets include diverse populations, further validation in larger, international cohorts is warranted to ensure broad generalizability. Third, the predictor extractor was pretrained on a moderately sized ECG diagnosis dataset due to the limited availability of large-scale public ECG corpora. Leveraging larger, high-quality ECG diagnosis datasets may further enhance feature representation and downstream LEF detection performance. Future work should explore scaling strategies and prospective clinical validation to confirm real-world clinical applicability.

5 Conclusion

In summary, ECGPD-LEF provides a clinically interpretable, modular, and high-performing framework for ECG-based detection of LEF. By building upon existing ECG diagnostic models, it achieves superior performance compared with strong black-box baselines, while remaining lightweight and scalable. The structured predictor-based design enables transparent interpretation, revealing clinically meaningful indicators of LEF, such as probabilistic outputs from NORM and other predictors.
By combining accuracy, stability, and interpretability, this framework provides a practical and scalable screening tool for real-world clinical applications.

A Supplementary for the datasets

A.1 Population characteristics (estimated predictors)

Table A.1 summarizes the remaining population characteristics of the ECG-ECHO and ECG-Note datasets, as estimated by a traditional automatic ECG diagnosis model (Transformer-PT). It should be emphasized that these characteristics are derived from model-based predictions rather than direct annotations from the original data collection process. Based on these estimates, these datasets cover a broad range of diagnostic ECG predictors.

Table A.1: Estimated patient characteristics across the training, validation, and test splits of the EchoNext dataset and the external cohort.

| Estimated predictor | ECG-ECHO training set | ECG-ECHO validation set | ECG-ECHO internal test set | ECG-Note external test set |
|---|---|---|---|---|
| NORM | 19,259 (26.6%) | 1,485 (32.1%) | 1,806 (33.2%) | 3,343 (20.9%) |
| ILBBB | 161 (0.2%) | 8 (0.2%) | 8 (0.1%) | 74 (0.5%) |
| INJAL | 666 (0.9%) | 48 (1.0%) | 48 (0.9%) | 108 (0.7%) |
| ISCLA | 109 (0.2%) | 3 (0.1%) | 6 (0.1%) | 24 (0.1%) |
| ANEUR | 185 (0.3%) | 13 (0.3%) | 9 (0.2%) | 27 (0.2%) |
| ISCAL | 941 (1.3%) | 50 (1.1%) | 59 (1.1%) | 181 (1.1%) |
| ASMI | 12,498 (17.2%) | 641 (13.9%) | 759 (13.9%) | 2,469 (15.4%) |
| SVARR | 690 (1.0%) | 47 (1.0%) | 57 (1.0%) | 127 (0.8%) |
| INJIL | 2,173 (3.0%) | 125 (2.7%) | 171 (3.1%) | 393 (2.5%) |
| CRBBB | 6,223 (8.6%) | 361 (7.8%) | 382 (7.0%) | 1,147 (7.2%) |
| LAFB | 9,713 (13.4%) | 590 (12.8%) | 701 (12.9%) | 1,965 (12.3%) |
| ALMI | 1,487 (2.1%) | 67 (1.4%) | 62 (1.1%) | 256 (1.6%) |
| ABQRS | 19,488 (26.9%) | 1,074 (23.2%) | 1,179 (21.7%) | 3,267 (20.4%) |
| CLBBB | 2,076 (2.9%) | 137 (3.0%) | 166 (3.1%) | 506 (3.2%) |
| ILMI | 1,536 (2.1%) | 95 (2.1%) | 105 (1.9%) | 334 (2.1%) |
| INJAS | 3,236 (4.5%) | 167 (3.6%) | 190 (3.5%) | 637 (4.0%) |
| INVT | 6,849 (9.5%) | 396 (8.6%) | 400 (7.4%) | 1,268 (7.9%) |
| PVC | 3,850 (5.3%) | 253 (5.5%) | 260 (4.8%) | 864 (5.4%) |
| ISCIL | 1,140 (1.6%) | 63 (1.4%) | 79 (1.5%) | 180 (1.1%) |
| 1AVB | 3,462 (4.8%) | 202 (4.4%) | 231 (4.2%) | 640 (4.0%) |
| ISC_ | 5,632 (7.8%) | 346 (7.5%) | 398 (7.3%) | 1,675 (10.5%) |
| IVCD | 3,062 (4.2%) | 173 (3.7%) | 204 (3.7%) | 696 (4.3%) |
| LAO/LAE | 6,886 (9.5%) | 375 (8.1%) | 460 (8.5%) | 1,330 (8.3%) |
| ISCAN | 3,160 (4.4%) | 154 (3.3%) | 153 (2.8%) | 507 (3.2%) |
| ISCAS | 1,109 (1.5%) | 62 (1.3%) | 51 (0.9%) | 202 (1.3%) |
| AFIB | 5,778 (8.0%) | 362 (7.8%) | 414 (7.6%) | 1,158 (7.2%) |
| BIGU | 430 (0.6%) | 26 (0.6%) | 34 (0.6%) | 89 (0.6%) |
| SVTAC | 173 (0.2%) | 14 (0.3%) | 12 (0.2%) | 44 (0.3%) |
| AMI | 758 (1.0%) | 43 (0.9%) | 54 (1.0%) | 99 (0.6%) |
| NST_ | 5,952 (8.2%) | 371 (8.0%) | 463 (8.5%) | 1,241 (7.7%) |
| 3AVB | 74 (0.1%) | 7 (0.2%) | 5 (0.1%) | 21 (0.1%) |
| IMI | 9,848 (13.6%) | 574 (12.4%) | 614 (11.3%) | 1,679 (10.5%) |
| LPR | 2,619 (3.6%) | 168 (3.6%) | 243 (4.5%) | 607 (3.8%) |
| 2AVB | 326 (0.4%) | 29 (0.6%) | 29 (0.5%) | 83 (0.5%) |
| DIG | 71 (0.1%) | 1 (0.0%) | 8 (0.1%) | 9 (0.1%) |
| LMI | 2,002 (2.8%) | 117 (2.5%) | 118 (2.2%) | 343 (2.1%) |
| LOWT | 2,415 (3.3%) | 167 (3.6%) | 211 (3.9%) | 443 (2.8%) |
| SR | 54,557 (75.3%) | 3,546 (76.7%) | 4,211 (77.4%) | 11,166 (69.7%) |
| STACH | 9,938 (13.7%) | 546 (11.8%) | 620 (11.4%) | 1,642 (10.3%) |
| LPFB | 3,138 (4.3%) | 160 (3.5%) | 159 (2.9%) | 497 (3.1%) |
| PACE | 154 (0.2%) | 3 (0.1%) | 7 (0.1%) | 18 (0.1%) |
| WPW | 57 (0.1%) | 0 (0.0%) | 3 (0.1%) | 3 (0.0%) |
| ISCIN | 2,244 (3.1%) | 141 (3.0%) | 155 (2.8%) | 427 (2.7%) |
| PRC(S) | 1,016 (1.4%) | 72 (1.6%) | 70 (1.3%) | 219 (1.4%) |
| AFLT | 45 (0.1%) | 1 (0.0%) | 5 (0.1%) | 12 (0.1%) |
| INJIN | 1,746 (2.4%) | 104 (2.2%) | 124 (2.3%) | 395 (2.5%) |
| PAC | 3,187 (4.4%) | 202 (4.4%) | 251 (4.6%) | 733 (4.6%) |
| IPMI | 10 (0.0%) | 1 (0.0%) | 0 (0.0%) | 4 (0.0%) |
| STD_ | 11,813 (16.3%) | 756 (16.3%) | 840 (15.4%) | 2,745 (17.1%) |
| LNGQT | 993 (1.4%) | 43 (0.9%) | 62 (1.1%) | 154 (1.0%) |
| TRIGU | 1,564 (2.2%) | 100 (2.2%) | 117 (2.1%) | 370 (2.3%) |
| NDT | 5,067 (7.0%) | 333 (7.2%) | 383 (7.0%) | 1,270 (7.9%) |
| LVH | 19,728 (27.2%) | 1,286 (27.8%) | 1,483 (27.3%) | 4,312 (26.9%) |
| PSVT | 236 (0.3%) | 18 (0.4%) | 18 (0.3%) | 49 (0.3%) |
| INJLA | 4,521 (6.2%) | 268 (5.8%) | 271 (5.0%) | 949 (5.9%) |
| PMI | 4 (0.0%) | 0 (0.0%) | 1 (0.0%) | 1 (0.0%) |
| STE_ | 38,602 (53.3%) | 2,595 (56.1%) | 3,021 (55.5%) | 7,752 (48.4%) |
| SEHYP | 5,294 (7.3%) | 313 (6.8%) | 348 (6.4%) | 1,017 (6.3%) |
| SBRAD | 876 (1.2%) | 75 (1.6%) | 94 (1.7%) | 255 (1.6%) |
| RAO/RAE | 7,546 (10.4%) | 370 (8.0%) | 422 (7.8%) | 1,601 (10.0%) |
| VCLVH | 20,708 (28.6%) | 1,384 (29.9%) | 1,684 (30.9%) | 4,715 (29.4%) |
| IRBBB | 6,541 (9.0%) | 328 (7.1%) | 332 (6.1%) | 1,037 (6.5%) |
| QWAVE | 8,068 (11.1%) | 453 (9.8%) | 507 (9.3%) | 1,590 (9.9%) |
| NT_ | 2,052 (2.8%) | 150 (3.2%) | 177 (3.3%) | 358 (2.2%) |
| EL | 3,703 (5.1%) | 241 (5.2%) | 289 (5.3%) | 825 (5.1%) |
| HVOLT | 13,380 (18.5%) | 1,009 (21.8%) | 1,181 (21.7%) | 3,329 (20.8%) |
| RVH | 1,718 (2.4%) | 101 (2.2%) | 103 (1.9%) | 295 (1.8%) |
| LVOLT | 243 (0.3%) | 12 (0.3%) | 20 (0.4%) | 79 (0.5%) |
| SARRH | 1,298 (1.8%) | 112 (2.4%) | 131 (2.4%) | 386 (2.4%) |
| IPLMI | 159 (0.2%) | 8 (0.2%) | 11 (0.2%) | 34 (0.2%) |
| TAB_ | 86 (0.1%) | 11 (0.2%) | 5 (0.1%) | 8 (0.0%) |

Values are shown as counts and percentages. Counts are derived from automatic AI-ECG diagnoses generated by Transformer-PT. Predictor abbreviations are defined in Wagner et al. (2020).

A.2 Construction of the external test set

We constructed the external test set ECG-Note as shown in Figure A.1. Adult patients who had at least one standard 10 s 12-lead ECG and at least one clinical note within one year following the ECG from MIMIC-IV (Johnson et al., 2024), MIMIC-IV-ECG (Gow et al., 2023), and MIMIC-IV-Note (Johnson et al., 2023a) were initially considered, leading to a starting pool of 1,113,547 ECG-Note pairs from 103,505 patients with 521,654 ECGs and 247,313 notes. Exclusion criteria included: 1) ECG-Note pairs where the clinical note did not contain keywords related to ejection fraction (EF) and Echo/TTE; 2) ECG-Note pairs where the ECG contained NaN data. After exclusions, 382,308 ECG-Note pairs from 37,234 patients, comprising 246,543 ECGs and 62,023 notes, were included for further analysis.
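The pairing and exclusion steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout (dictionaries with `patient_id`, `time`, `signal`, and `text` fields), the keyword lists, and the 365-day inclusive window are all hypothetical choices made for the example.

```python
import math
from datetime import datetime, timedelta

# Hypothetical keyword lists; the source does not enumerate the actual keywords.
EF_KEYWORDS = ("ejection fraction", "lvef")
ECHO_KEYWORDS = ("echo", "tte")


def mentions_ef_and_echo(note_text):
    """True if the note mentions both an EF keyword and an Echo/TTE keyword."""
    text = note_text.lower()
    return any(k in text for k in EF_KEYWORDS) and any(k in text for k in ECHO_KEYWORDS)


def has_nan(signal):
    """True if any sample in any lead is NaN."""
    return any(math.isnan(v) for lead in signal for v in lead)


def build_pairs(ecgs, notes, window=timedelta(days=365)):
    """Pair each usable ECG with same-patient notes written within `window` after it."""
    pairs = []
    for ecg in ecgs:
        if has_nan(ecg["signal"]):
            continue  # exclusion 2: ECG contains NaN data
        for note in notes:
            if note["patient_id"] != ecg["patient_id"]:
                continue
            delay = note["time"] - ecg["time"]
            if not (timedelta(0) <= delay <= window):
                continue  # note must follow the ECG within the window
            if mentions_ef_and_echo(note["text"]):  # exclusion 1 otherwise
                pairs.append((ecg["ecg_id"], note["note_id"]))
    return pairs


# Toy records for illustration only.
ecgs = [
    {"ecg_id": "e1", "patient_id": 1, "time": datetime(2020, 1, 1), "signal": [[0.1, 0.2]]},
    {"ecg_id": "e2", "patient_id": 1, "time": datetime(2020, 1, 2), "signal": [[float("nan")]]},
]
notes = [
    {"note_id": "n1", "patient_id": 1, "time": datetime(2020, 6, 1), "text": "TTE: LVEF 40%"},
    {"note_id": "n2", "patient_id": 1, "time": datetime(2021, 3, 1), "text": "Echo: ejection fraction 55%"},
    {"note_id": "n3", "patient_id": 1, "time": datetime(2020, 2, 1), "text": "Chest pain, no imaging"},
]
pairs = build_pairs(ecgs, notes)
print(pairs)
```

In the toy data only the pair (e1, n1) survives: e2 contains NaN samples, n2 falls outside the one-year window, and n3 lacks the required keywords.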
For the note set, we employed a Large Language Model (LLM), specifically Qwen2.5-72B-Instruct (Qwen et al., 2025), to extract precise ejection fraction (EF) values derived solely from the concurrent Echo/TTE (Fig. A.1b). Based on the values obtained by the LLM, we determined whether the ejection fraction was ≤ 45%, suggesting moderately reduced systolic function. These labels were cross-referenced with MIMIC-ECG-LVEF (Li et al., 2025) to identify consistent ECG-EF class pairs, resulting in the collection of 55,318 available ECGs. To ensure the accuracy of subsequent evaluations, only the ECG-Note pair with the shortest time interval for each patient was retained among the valid samples, resulting in a final testing set of 16,017 ECGs.

Figure A.1: Construction of the external test set. (a) Flow chart illustrating the data selection process, including exclusion criteria applied to the MIMIC-IV database to derive the final high-quality testing set. (b) The prompt template used for the Large Language Model (LLM) Clinical Note Extraction Module, designed to extract quantitative ejection fraction (EF) values from unstructured clinical notes into a structured JSON format.

A.3 Comparison with the Columbia mini model in the external test set

In the ECG-Note dataset, we compared the proposed ECGPD-LEF with the Columbia mini model (Poterucha et al., 2025). ECGPD-LEF requires only raw ECG signals for inference. In contrast, the Columbia mini model requires seven tabular features in addition to the ECG waveform: sex, age, PR interval, QRS duration, corrected QT interval (QTc), atrial rate, and ventricular rate. While age and sex were directly obtained from demographic records, the remaining five electrocardiographic metrics were not explicitly available and were derived using the machine_measurements table from the MIMIC-IV ECG module. Specifically, let $T_{\text{P onset}}$, $T_{\text{QRS onset}}$, $T_{\text{QRS end}}$, and $T_{\text{T end}}$ denote the timings of P-wave onset, QRS onset, QRS offset, and T-wave offset, respectively, and let $RR$ denote the RR interval in milliseconds. The PR interval was calculated as $T_{\text{QRS onset}} - T_{\text{P onset}}$, and the QRS duration was derived as $T_{\text{QRS end}} - T_{\text{QRS onset}}$. The QTc interval was computed using Bazett's formula:

$$\mathrm{QTc} = \frac{T_{\text{T end}} - T_{\text{QRS onset}}}{\sqrt{RR/1000}}.$$

The ventricular rate was calculated as $60{,}000/RR$ (Kligfield et al., 2007). Due to the absence of specific atrial rate data, the derived ventricular rate was used as a proxy for the atrial rate.

B Supplementary analyses for the multi-predictor approach

B.1 Subgroup analysis of the multi-predictor approach stratified by SHD and VHD

Baseline echocardiographic characteristics are presented in Table B.1. Subgroup analysis results stratified by the presence of SHD and VHD are shown in Figures B.1 and B.2.

Table B.1: Baseline echocardiographic characteristics across the training, validation, and test splits of the ECG-ECHO dataset.

| Echocardiographic findings | Training set | Validation set | Test set |
|---|---|---|---|
| LVWT ≥ 1.3 cm | 17,667 (24.4%) | 877 (19.0%) | 1,061 (19.5%) |
| Aortic stenosis | 2,919 (4.0%) | 252 (5.4%) | 286 (5.3%) |
| Aortic regurgitation | 878 (1.2%) | 62 (1.3%) | 66 (1.2%) |
| Mitral regurgitation | 6,137 (8.5%) | 282 (6.1%) | 337 (6.2%) |
| Tricuspid regurgitation | 7,707 (10.6%) | 305 (6.6%) | 353 (6.5%) |
| Pulmonary regurgitation | 603 (0.8%) | 21 (0.5%) | 20 (0.4%) |
| RV systolic dysfunction | 9,597 (13.2%) | 368 (8.0%) | 419 (7.7%) |
| Pericardial effusion | 2,079 (2.9%) | 52 (1.1%) | 69 (1.3%) |
| PASP ≥ 45 mmHg | 13,727 (18.9%) | 581 (12.6%) | 699 (12.8%) |
| TRV max ≥ 3.2 m/s | 7,492 (10.3%) | 267 (5.8%) | 375 (6.9%) |

The training, validation, and test splits correspond to the official splits of the ECG-ECHO dataset (EchoNext) (Elias and Finer, 2025). Values are shown as counts and percentages.
LVWT, left ventricular wall thickness; RV, right ventricular; PASP, pulmonary artery systolic pressure.

Other structural heart disease (SHD). For subgroup analysis, individuals were classified as having other structural heart disease (SHD) if at least one of the following echocardiographic abnormalities was present: left ventricular hypertrophy (left ventricular wall thickness ≥ 1.3 cm); moderate or greater valvular heart disease (aortic, mitral, tricuspid, or pulmonary); moderate or greater right ventricular systolic dysfunction; moderate or large pericardial effusion; or pulmonary hypertension, defined as pulmonary artery systolic pressure (PASP) ≥ 45 mmHg or tricuspid regurgitation peak velocity (TRV max) ≥ 3.2 m/s. Individuals without any of the above findings were classified as having no other SHD. The performance comparison of ECGPD-LEF and the Columbia mini model in patients with and without other SHD is presented in Figure B.1.

Valvular heart disease (VHD). For the VHD subgroup analysis, individuals were categorized according to the presence of moderate or greater valvular heart disease on echocardiography, including aortic, mitral, tricuspid, or pulmonary valve disease. Those without moderate or greater valvular abnormalities were classified as non-VHD. The corresponding performance comparison between ECGPD-LEF and the Columbia mini model in VHD and non-VHD populations is shown in Figure B.2.
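As an illustration of the subgroup definitions above, the following Python sketch encodes the "other SHD" and VHD rules. The record type and its field names (`EchoFindings`, `lv_wall_thickness_cm`, and so on) are hypothetical and are not taken from the EchoNext schema; the severity flags are assumed to have been derived from the echocardiography report beforehand.

```python
from dataclasses import dataclass


@dataclass
class EchoFindings:
    """Hypothetical per-patient echocardiographic summary (field names are assumptions)."""
    lv_wall_thickness_cm: float
    moderate_or_greater_vhd: bool  # aortic, mitral, tricuspid, or pulmonary
    moderate_or_greater_rv_dysfunction: bool
    moderate_or_large_pericardial_effusion: bool
    pasp_mmhg: float
    tr_vmax_m_per_s: float


def has_vhd(e):
    """VHD subgroup: moderate or greater valvular heart disease on echocardiography."""
    return e.moderate_or_greater_vhd


def has_other_shd(e):
    """Other SHD: at least one of the predefined echocardiographic abnormalities."""
    lv_hypertrophy = e.lv_wall_thickness_cm >= 1.3
    pulmonary_hypertension = e.pasp_mmhg >= 45 or e.tr_vmax_m_per_s >= 3.2
    return (
        lv_hypertrophy
        or has_vhd(e)
        or e.moderate_or_greater_rv_dysfunction
        or e.moderate_or_large_pericardial_effusion
        or pulmonary_hypertension
    )
```

Because VHD is one of the criteria for other SHD, every patient in the VHD subgroup also falls in the other-SHD subgroup, while the reverse does not hold.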
[Figure B.1, panels a–d. ROC panels: false positive rate vs. true positive rate; PR panels: recall vs. precision. (a) Transformer-PT-XGBoost AUROC = 0.852, Columbia mini model AUROC = 0.797. (b) Transformer-PT-XGBoost AUPRC = 0.760, Columbia mini model AUPRC = 0.665. (c) Transformer-PT-XGBoost AUROC = 0.860, Columbia mini model AUROC = 0.832. (d) Transformer-PT-XGBoost AUPRC = 0.535, Columbia mini model AUPRC = 0.437.]

Figure B.1: Model performance stratified by other structural heart disease (SHD). ROC (a,c) and PR (b,d) curves for individuals with other SHD (a,b) and without other SHD (c,d). Other SHD was defined as the presence of at least one predefined echocardiographic abnormality independent of left ventricular ejection fraction (see Appendix B.1). Transformer-PT-XGBoost is compared with the Columbia mini model. AUROC and AUPRC are indicated in each panel.

[Figure B.2, panels a–d. ROC panels: false positive rate vs. true positive rate; PR panels: recall vs. precision. (a) Transformer-PT-XGBoost AUROC = 0.853, Columbia mini model AUROC = 0.799. (b) Transformer-PT-XGBoost AUPRC = 0.812, Columbia mini model AUPRC = 0.726. (c) Transformer-PT-XGBoost AUROC = 0.873, Columbia mini model AUROC = 0.838. (d) Transformer-PT-XGBoost AUPRC = 0.621, Columbia mini model AUPRC = 0.516.]

Figure B.2: Model performance stratified by valvular heart disease (VHD). ROC (a,c) and PR (b,d) curves for individuals with moderate or greater VHD (a,b) and without moderate or greater VHD (c,d).
VHD was defined as moderate or greater valvular heart disease on echocardiography (see Appendix B.1). Transformer-PT-XGBoost is compared with the Columbia mini model. AUROC and AUPRC are indicated in each panel.

B.2 Subgroup analysis for the multi-predictor method in the external test set

To evaluate model robustness and fairness across patient characteristics and clinical settings, we conducted a subgroup analysis comparing the proposed Transformer-PT-XGBoost with the Columbia mini model. Subgroups were defined by age, sex, race and ethnicity, and clinical context. The MIMIC-IV and MIMIC-IV-ECG datasets were linked to extract these subgroup attributes. Patient age was recalculated at the time of ECG acquisition using the time-invariant interval method to restore temporal alignment (Johnson et al., 2023b). For race and ethnicity, we aggregated granular categories into broader groups (Table B.3). For clinical context, we determined the admission type by mapping the ECG timestamp to the corresponding hospital admission and discharge window. Performance is reported as AUROC, AUPRC, and F1 score (Table B.2). ECGPD-LEF achieved higher AUROC and AUPRC point estimates than the Columbia mini model in every subgroup, and a higher F1 score in every subgroup except Hispanic patients (53.5% vs. 57.1%). These results suggest that the proposed model generalizes well across heterogeneous patient populations and care environments, supporting its potential utility in real-world deployment.

Table B.2: Subgroup performance of ECGPD-LEF and the Columbia mini model in the external test set.

| Category | Subgroup | n | Prevalence (%) | Columbia mini model: AUROC | Columbia mini model: AUPRC | Columbia mini model: F1 Score | ECGPD-LEF (Multi-predictor): AUROC | ECGPD-LEF (Multi-predictor): AUPRC | ECGPD-LEF (Multi-predictor): F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| Age groups | 18–59 | 4270 | 12.2 | 83.0 (80.9–84.9) | 46.0 (41.5–50.8) | 49.1 (45.4–52.4) | 89.5 (88.1–90.9) | 62.1 (57.4–66.4) | 57.3 (54.2–60.4) |
| Age groups | 60–69 | 3637 | 15.7 | 82.0 (80.2–83.9) | 47.7 (43.7–52.6) | 50.1 (46.9–53.1) | 88.8 (87.3–90.3) | 65.1 (61.4–69.2) | 56.3 (53.4–59.3) |
| Age groups | 70–79 | 3761 | 18.0 | 81.1 (79.2–82.8) | 51.6 (47.7–55.9) | 50.6 (48.0–53.3) | 85.5 (84.0–86.9) | 56.6 (53.0–60.9) | 54.4 (51.6–57.0) |
| Age groups | 80+ | 4349 | 17.2 | 77.9 (76.1–79.7) | 42.9 (39.5–46.8) | 43.5 (40.9–45.9) | 83.5 (82.1–85.1) | 49.2 (45.6–53.2) | 49.6 (47.2–52.1) |
| Sex | Female | 7222 | 11.2 | 79.4 (77.8–80.9) | 32.5 (29.5–36.0) | 38.3 (35.9–40.5) | 85.9 (84.6–87.2) | 45.6 (41.9–48.9) | 44.9 (42.6–47.1) |
| Sex | Male | 8795 | 19.4 | 81.3 (80.2–82.4) | 52.7 (50.2–55.5) | 53.4 (51.7–55.1) | 87.2 (86.3–88.1) | 63.9 (61.4–66.4) | 59.0 (57.3–60.6) |
| Race/ethnicity | Hispanic | 638 | 14.7 | 84.9 (80.4–88.9) | 53.6 (43.3–64.1) | 57.1 (49.1–64.1) | 89.9 (86.9–92.6) | 64.7 (55.0–74.1) | 53.5 (46.4–60.4) |
| Race/ethnicity | White | 11688 | 15.5 | 80.3 (79.2–81.3) | 43.9 (41.6–46.5) | 46.5 (44.9–48.2) | 86.9 (86.0–87.8) | 61.9 (59.4–64.4) | 56.9 (55.4–58.4) |
| Race/ethnicity | Black | 1845 | 17.3 | 84.7 (82.3–86.9) | 57.7 (51.9–63.1) | 54.1 (50.1–58.1) | 89.8 (88.0–91.4) | 72.9 (67.5–78.0) | 62.1 (57.9–66.1) |
| Race/ethnicity | Asian | 414 | 12.6 | 84.6 (79.2–89.1) | 43.4 (31.5–58.1) | 47.7 (37.7–56.1) | 90.7 (86.1–94.1) | 56.0 (40.5–72.6) | 55.3 (45.2–64.7) |
| Race/ethnicity | Other | 524 | 13.0 | 79.8 (73.5–85.4) | 39.5 (29.3–51.9) | 44.4 (35.8–52.9) | 87.5 (82.7–91.6) | 53.5 (41.3–66.4) | 52.7 (44.4–60.9) |
| Race/ethnicity | Unknown | 908 | 18.8 | 75.9 (71.6–79.6) | 47.4 (40.3–56.4) | 45.6 (40.0–51.5) | 81.3 (77.7–84.7) | 54.6 (45.6–65.2) | 51.9 (46.3–57.1) |
| Clinical context | Emergency | 9170 | 14.6 | 81.4 (80.1–82.6) | 44.3 (41.8–47.3) | 46.3 (44.4–48.2) | 88.4 (87.4–89.4) | 61.0 (58.4–63.7) | 55.6 (53.7–57.2) |
| Clinical context | Urgent | 3028 | 20.9 | 79.1 (77.1–81.0) | 50.7 (46.7–54.9) | 53.1 (50.1–55.8) | 84.1 (82.6–85.7) | 60.2 (56.0–64.2) | 60.4 (57.7–62.7) |
| Clinical context | Observation | 2173 | 19.6 | 83.9 (81.8–85.9) | 58.8 (53.7–64.0) | 56.3 (52.8–59.9) | 89.4 (87.7–91.1) | 72.6 (67.7–77.3) | 63.6 (60.2–66.8) |
| Clinical context | Surgical Same Day | 1022 | 6.5 | 76.8 (70.9–82.2) | 22.1 (15.1–33.2) | 25.7 (18.9–32.5) | 85.9 (81.3–89.8) | 26.8 (18.7–38.1) | 36.6 (27.5–44.6) |
| Clinical context | Elective | 624 | 9.3 | 72.2 (65.1–79.0) | 21.0 (14.7–32.0) | 27.0 (20.6–33.6) | 82.2 (76.8–87.1) | 34.3 (25.0–47.5) | 35.1 (27.9–42.8) |

ECGPD-LEF is configured with the predictor extractor based on Transformer-PT and the tabular model XGBoost. This configuration was selected for illustration in the internal test set. The Columbia mini model is the official benchmark on the same set (Poterucha et al., 2025). AUROC, AUPRC, and F1 are reported with 95% confidence intervals.

Table B.3: Mapping of original race/ethnicity categories to aggregated subgroups.

| Grouped | Original categories |
|---|---|
| Asian | ASIAN, ASIAN - ASIAN INDIAN, ASIAN - CHINESE, ASIAN - KOREAN, ASIAN - SOUTH EAST ASIAN |
| Black | BLACK/AFRICAN, BLACK/AFRICAN AMERICAN, BLACK/CAPE VERDEAN, BLACK/CARIBBEAN ISLAND |
| Hispanic | HISPANIC OR LATINO, CENTRAL AMERICAN, COLUMBIAN, CUBAN, DOMINICAN, GUATEMALAN, HONDURAN, MEXICAN, PUERTO RICAN, SALVADORAN, SOUTH AMERICAN |
| White | WHITE, WHITE - BRAZILIAN, EASTERN EUROPEAN, OTHER EUROPEAN, RUSSIAN, PORTUGUESE |
| Other | OTHER, AMERICAN INDIAN/ALASKA NATIVE, MULTIPLE RACE/ETHNICITY, PACIFIC ISLANDER |
| Unknown | UNKNOWN, UNABLE TO OBTAIN, DECLINED |

C Supplementary analyses for the single-predictor approach

C.1 Predictor recognition by the predictor extractor (remaining predictors)

The classification performance of the predictor extractor for the remaining 61 predictors is reported in Table C.1. Across the 71 predictors, AUROC values ranged from 73.8% to 100%, with 58 exceeding 90%, indicating consistently strong discriminative capacity. AUPRC and F1 scores varied substantially (0.3%–97.6% and 0.0%–94.1%, respectively), largely due to class imbalance. For instance, the INJIL category included only two positive cases in the PTB-XL test set (<0.1% prevalence), resulting in unstable precision-recall estimates (test F1 = 0.0%; validation F1 = 8.7%) despite a high AUROC of 92.9%.
Decision thresholds were set to maximize F1 on the validation set, ensuring that all predictors achieved non-zero validation F1 scores. Because the downstream LEF framework uses continuous predictor scores rather than threshold-dependent binary decisions, robust ranking is sufficient to reliably transfer information even for ultra-rare categories.

C.2 Single-predictor performance (remaining predictors and external results)

The performance of the remaining single predictors for LEF detection is reported in Table C.1. External test set performance is provided in Table C.2.

Table C.1: Performance of traditional AI-ECG predictors and LEF detection using a single-predictor approach (remaining 61 predictors).

| Predictor | Traditional ECG model: AUROC | Traditional ECG model: AUPRC | Traditional ECG model: F1 Score | Traditional ECG model: Thresh | LEF detection (one predictor): AUROC | LEF detection (one predictor): AUPRC | LEF detection (one predictor): F1 Score | LEF detection (one predictor): Thresh |
|---|---|---|---|---|---|---|---|---|
| LAFB | 98.8 (98.3–99.2) | 87.0 (82.1–91.2) | 78.9 (74.2–83.5) | 0.274 | 70.0 (68.2–71.7) | 29.1 (26.9–31.6) | 39.0 (36.8–41.1) | 0.001226 |
| ALMI | 97.3 (95.0–99.1) | 61.3 (43.6–77.1) | 58.3 (40.0–73.2) | 0.257 | 68.6 (66.8–70.6) | 35.0 (32.3–38.3) | 38.6 (36.1–41.1) | 0.002106 |
| ABQRS | 87.5 (85.5–89.5) | 57.4 (51.7–63.5) | 54.4 (50.5–58.2) | 0.053 | 68.3 (66.5–70.1) | 29.8 (27.5–32.5) | 38.0 (35.9–40.1) | 0.012123 |
| CLBBB | 99.8 (99.6–100.0) | 93.0 (85.9–98.4) | 89.3 (82.8–94.7) | 0.053 | 67.9 (66.0–69.7) | 35.4 (32.5–38.8) | 38.0 (35.5–40.2) | 0.000006 |
| ILMI | 95.7 (91.9–98.5) | 63.4 (50.6–75.7) | 61.1 (48.9–71.2) | 0.157 | 67.6 (65.7–69.4) | 31.7 (29.0–34.8) | 37.9 (35.9–40.1) | 0.000172 |
| INJAS | 99.1 (98.6–99.5) | 51.2 (32.1–70.9) | 44.4 (25.8–60.7) | 0.257 | 66.8 (64.9–68.7) | 27.0 (25.1–29.2) | 37.9 (36.0–40.0) | 0.000571 |
| INVT | 95.3 (92.5–97.4) | 27.1 (14.4–45.4) | 30.8 (15.4–44.4) | 0.269 | 66.6 (64.7–68.6) | 28.4 (26.3–30.8) | 37.8 (35.8–39.9) | 0.005779 |
| PVC | 99.3 (98.9–99.6) | 79.1 (69.8–88.2) | 85.3 (80.4–89.8) | 0.309 | 67.4 (65.6–69.1) | 29.7 (27.3–32.4) | 37.3 (35.2–39.5) | 0.001254 |
| ISCIL | 95.8 (93.4–97.8) | 17.4 (7.6–36.7) | 25.0 (12.5–37.7) | 0.053 | 66.0 (64.3–67.6) | 24.5 (22.8–26.3) | 37.3 (35.5–39.4) | 0.000142 |
| 1AVB | 98.6 (98.1–99.1) | 68.9 (57.4–80.4) | 68.2 (59.6–76.1) | 0.226 | 67.8 (65.8–69.5) | 28.8 (26.5–31.3) | 37.2 (35.1–39.1) | 0.000059 |
| ISC_ | 96.1 (94.4–97.4) | 67.1 (58.7–74.7) | 59.3 (52.3–66.7) | 0.516 | 66.8 (64.8–68.6) | 29.9 (27.4–32.5) | 37.0 (34.9–39.1) | 0.007637 |
| IVCD | 80.3 (74.4–84.8) | 20.6 (13.7–30.4) | 29.3 (19.8–38.2) | 0.209 | 66.2 (64.1–68.0) | 32.4 (29.5–35.3) | 36.7 (34.2–39.1) | 0.027054 |
| LAO/LAE | 88.3 (83.6–92.6) | 16.0 (10.0–28.2) | 26.5 (14.1–37.6) | 0.189 | 64.1 (61.9–66.3) | 32.9 (30.0–36.0) | 36.5 (34.0–39.0) | 0.051849 |
| ISCAN | 93.8 (91.3–96.7) | 1.7 (0.6–4.6) | 0.0 (0.0–0.0) | 0.024 | 64.7 (62.8–66.6) | 26.2 (24.2–28.7) | 35.8 (33.7–37.9) | 0.000312 |
| ISCAS | 96.7 (94.9–98.1) | 21.3 (8.4–41.1) | 20.5 (5.1–37.5) | 0.247 | 62.4 (60.7–64.2) | 21.6 (20.2–23.2) | 35.5 (33.6–37.2) | 0.000066 |
| AFIB | 98.6 (97.4–99.6) | 93.4 (89.6–96.4) | 91.4 (87.8–94.3) | 0.134 | 64.4 (62.5–66.3) | 27.5 (25.2–30.1) | 35.4 (33.4–37.5) | 0.000004 |
| BIGU | 97.1 (92.3–99.8) | 37.5 (9.7–73.1) | 42.1 (11.8–66.7) | 0.143 | 63.2 (61.4–65.0) | 23.9 (22.1–26.0) | 34.9 (32.9–37.0) | 0.000013 |
| SVTAC | 98.6 (96.5–100.0) | 16.1 (1.4–75.0) | 25.0 (0.0–66.7) | 0.277 | 63.6 (61.7–65.4) | 26.5 (24.4–29.0) | 34.4 (32.2–36.6) | 0.000113 |
| AMI | 92.3 (88.7–95.4) | 31.4 (17.0–48.7) | 38.8 (23.5–52.6) | 0.188 | 61.3 (59.6–62.9) | 22.2 (20.5–24.1) | 34.4 (32.7–36.0) | 0.000868 |
| NST_ | 86.2 (82.6–89.7) | 19.4 (13.6–28.6) | 25.9 (18.6–33.0) | 0.171 | 60.2 (58.4–61.9) | 20.7 (19.2–22.5) | 34.4 (32.7–36.1) | 0.005043 |
| 3AVB | 99.6 (99.0–100.0) | 55.6 (4.8–100.0) | 50.0 (0.0–100.0) | 0.254 | 61.0 (59.0–63.1) | 24.8 (22.9–27.1) | 34.2 (32.4–36.1) | <0.000001 |
| IMI | 94.8 (93.8–95.8) | 73.2 (67.8–78.2) | 66.1 (61.7–70.5) | 0.218 | 63.1 (61.2–65.0) | 25.6 (23.5–28.1) | 34.2 (32.3–36.2) | 0.003210 |
| LPR | 98.4 (97.5–99.1) | 51.4 (35.8–69.4) | 47.6 (33.3–60.5) | 0.260 | 61.6 (59.7–63.6) | 23.2 (21.5–25.4) | 33.7 (31.7–35.6) | 0.000049 |
| 2AVB | 99.6 (99.4–99.9) | 11.1 (7.1–40.0) | 0.0 (0.0–0.0) | 0.028 | 61.2 (59.3–63.0) | 24.0 (22.2–26.1) | 33.3 (31.4–35.2) | 0.000002 |
| DIG | 93.6 (90.1–96.5) | 8.6 (4.4–19.3) | 7.4 (0.0–22.2) | 0.433 | 60.6 (58.7–62.4) | 22.2 (20.6–24.1) | 33.3 (31.5–35.1) | 0.000024 |
| LMI | 93.6 (90.6–96.3) | 10.6 (5.6–21.3) | 20.8 (5.3–35.7) | 0.176 | 60.8 (58.9–62.8) | 25.1 (23.1–27.5) | 33.0 (31.2–34.9) | 0.000937 |
| LOWT | 91.8 (89.8–93.7) | 12.5 (8.7–19.5) | 15.4 (7.1–25.0) | 0.225 | 58.1 (56.2–59.8) | 20.3 (18.8–21.9) | 32.9 (31.3–34.8) | 0.000556 |
| SR | 92.7 (91.2–94.1) | 96.7 (95.7–97.6) | 94.1 (93.3–94.9) | 0.277 | 62.3 (60.3–64.3) | 25.6 (23.4–28.2) | 32.7 (30.8–34.4) | 0.943848 |
| STACH | 99.4 (98.8–99.8) | 86.5 (77.1–94.4) | 85.9 (80.2–91.0) | 0.231 | 60.2 (58.2–62.0) | 22.0 (20.3–23.9) | 32.5 (30.9–34.2) | 0.000001 |
| LPFB | 98.6 (97.7–99.4) | 48.3 (26.4–69.2) | 43.9 (22.8–61.9) | 0.242 | 59.8 (58.0–61.8) | 23.1 (21.3–25.2) | 32.4 (30.6–34.4) | 0.000076 |
| PACE | 98.1 (95.4–100.0) | 87.4 (73.5–96.9) | 82.4 (69.2–92.6) | 0.425 | 58.7 (56.8–60.5) | 22.4 (20.7–24.5) | 31.9 (30.1–33.6) | 0.000033 |
| WPW | 95.0 (84.6–100.0) | 62.2 (24.3–97.6) | 66.7 (17.7–93.4) | 0.695 | 60.1 (58.2–62.1) | 24.0 (22.1–26.2) | 31.9 (29.8–33.9) | 0.000028 |
| ISCIN | 94.3 (90.0–97.2) | 23.8 (8.6–41.1) | 27.9 (15.0–40.4) | 0.060 | 58.4 (56.6–60.2) | 20.8 (19.4–22.4) | 31.8 (29.9–33.8) | 0.000432 |
| PRC(S) | 99.8 (99.5–100.0) | 16.7 (9.1–57.1) | 11.8 (8.3–38.1) | 0.009 | 58.6 (56.8–60.5) | 21.7 (20.1–23.7) | 31.7 (29.6–33.6) | 0.000008 |
| AFLT | 86.5 (53.9–100.0) | 61.5 (19.1–97.3) | 44.4 (0.0–83.4) | 0.950 | 58.0 (56.1–60.0) | 22.9 (21.0–25.1) | 31.5 (29.9–33.2) | 0.000010 |
| INJIN | 99.9 (99.6–100.0) | 64.3 (12.5–100.0) | 11.8 (5.0–27.0) | 0.006 | 58.9 (57.0–60.8) | 22.9 (21.0–25.1) | 31.2 (29.0–33.3) | 0.000063 |
| PAC | 98.1 (97.2–98.8) | 48.6 (34.3–67.8) | 51.9 (40.4–62.6) | 0.298 | 56.9 (54.8–58.7) | 21.0 (19.4–22.8) | 31.1 (29.1–33.0) | 0.000290 |
| IPMI | 98.6 (97.3–99.9) | 10.2 (1.7–47.4) | 0.0 (0.0–0.0) | 0.287 | 57.9 (55.9–59.9) | 22.8 (20.9–25.1) | 31.1 (29.4–32.8) | 0.000016 |
| STD_ | 89.6 (86.6–92.1) | 26.1 (19.9–34.5) | 39.3 (32.5–46.2) | 0.178 | 58.0 (55.9–60.1) | 22.4 (20.7–24.4) | 30.8 (29.2–32.4) | 0.004936 |
| LNGQT | 97.5 (95.0–99.2) | 20.4 (7.6–45.8) | 30.0 (0.0–54.6) | 0.466 | 54.0 (52.0–55.9) | 19.5 (18.0–21.3) | 30.4 (28.8–31.9) | 0.000042 |
| TRIGU | 99.3 (98.3–100.0) | 28.2 (2.9–100.0) | 9.1 (3.9–22.9) | 0.004 | 55.5 (53.5–57.6) | 21.0 (19.4–22.8) | 30.2 (28.3–32.1) | 0.000002 |
| NDT | 93.8 (92.5–95.0) | 58.0 (50.7–65.3) | 57.9 (52.0–63.4) | 0.328 | 42.2 (40.1–44.1) | 14.9 (13.8–16.2) | 30.1 (28.7–31.5) | <0.000001 |
| LVH | 93.6 (92.0–95.1) | 69.2 (63.1–74.9) | 63.6 (58.7–68.5) | 0.267 | 55.2 (52.9–57.3) | 22.9 (21.0–25.2) | 30.0 (28.7–31.5) | 0.000005 |
| PSVT | 99.9 (99.6–100.0) | 64.3 (12.5–100.0) | 33.3 (0.0–80.0) | 0.160 | 52.3 (50.3–54.4) | 19.4 (18.0–21.2) | 30.0 (28.7–31.5) | <0.000001 |
| INJLA | 73.8 (63.8–83.3) | 0.3 (0.1–0.9) | 0.0 (0.0–0.0) | 0.005 | 54.9 (52.8–57.0) | 20.0 (18.6–21.9) | 30.0 (28.2–32.0) | 0.000020 |
| PMI | 89.8 (86.3–93.0) | 0.6 (0.3–2.1) | 0.0 (0.0–0.0) | 0.208 | 46.5 (44.5–48.4) | 16.2 (15.1–17.5) | 30.0 (28.7–31.5) | <0.000001 |
| STE_ | 94.0 (84.4–99.4) | 3.7 (0.3–15.4) | 0.4 (0.1–1.0) | <0.001 | 45.8 (43.9–47.8) | 16.2 (15.0–17.5) | 30.0 (28.7–31.5) | <0.000001 |
| SEHYP | 100.0 (99.8–100.0) | 83.3 (25.0–100.0) | 28.6 (9.5–54.5) | 0.019 | 49.2 (47.2–51.3) | 18.5 (17.1–20.2) | 30.0 (28.7–31.5) | <0.000001 |
| SBRAD | 96.3 (93.8–98.1) | 58.7 (45.3–70.3) | 60.3 (49.5–70.1) | 0.236 | 41.7 (39.9–43.7) | 14.4 (13.4–15.6) | 30.0 (28.6–31.4) | <0.000001 |
| RAO/RAE | 97.3 (95.1–99.1) | 41.4 (7.8–70.4) | 27.8 (7.1–46.7) | 0.094 | 49.6 (47.5–51.6) | 19.3 (17.7–21.3) | 30.0 (28.7–31.5) | <0.000001 |
| VCLVH | 86.4 (82.6–90.1) | 29.1 (21.3–40.2) | 34.3 (27.5–41.9) | 0.122 | 42.5 (40.4–44.6) | 15.3 (14.3–16.7) | 30.0 (28.7–31.5) | 0.000012 |
| IRBBB | 98.3 (97.6–99.0) | 78.8 (71.3–85.2) | 61.4 (52.4–69.1) | 0.775 | 47.7 (45.8–49.6) | 16.1 (15.1–17.3) | 30.0 (28.7–31.5) | <0.000001 |
| QWAVE | 87.5 (82.8–91.5) | 21.2 (13.0–32.2) | 24.5 (15.4–33.8) | 0.165 | 54.2 (52.1–56.4) | 22.0 (20.1–24.3) | 30.0 (28.7–31.5) | 0.000019 |
| NT_ | 95.0 (91.9–97.3) | 33.1 (22.7–49.9) | 35.4 (20.0–48.7) | 0.276 | 41.6 (39.5–43.7) | 14.7 (13.6–15.9) | 30.0 (28.7–31.5) | <0.000001 |
| EL | 93.9 (88.5–97.6) | 5.4 (2.1–13.6) | 9.8 (0.0–21.7) | 0.095 | 46.4 (44.5–48.4) | 16.4 (15.1–17.7) | 30.0 (28.6–31.5) | 0.000001 |
| HVOLT | 89.2 (73.8–98.3) | 4.6 (0.7–17.1) | 6.3 (1.6–12.6) | 0.010 | 32.6 (30.7–34.5) | 12.5 (11.7–13.5) | 30.0 (28.7–31.5) | <0.000001 |
| RVH | 95.1 (89.2–99.1) | 26.2 (8.7–52.5) | 28.6 (7.4–50.0) | 0.415 | 50.9 (48.9–52.9) | 17.8 (16.6–19.4) | 30.0 (28.7–31.5) | <0.000001 |
| LVOLT | 90.9 (83.7–96.2) | 9.6 (4.0–21.1) | 19.0 (6.2–32.1) | 0.127 | 48.6 (46.5–50.6) | 17.2 (15.8–18.8) | 30.0 (28.7–31.5) | <0.000001 |
| SARRH | 97.5 (96.5–98.3) | 63.4 (52.1–73.8) | 56.8 (48.0–65.0) | 0.377 | 48.4 (46.4–50.4) | 17.6 (16.2–18.9) | 29.8 (28.3–31.2) | 0.000012 |
| IPLMI | 82.0 (58.8–98.9) | 4.4 (0.2–23.3) | 0.0 (0.0–0.0) | 0.128 | 52.2 (50.2–54.2) | 19.6 (18.0–21.5) | 29.7 (28.3–31.3) | 0.000006 |
| TAB_ | 89.9 (79.0–96.7) | 1.1 (0.2–3.9) | 0.0 (0.0–0.0) | 0.036 | 50.7 (48.8–52.6) | 17.9 (16.5–19.4) | 29.5 (28.1–31.0) | 0.000014 |

Predictor performance was obtained from the traditional automatic AI-ECG model. LEF detection performance was derived using the single-predictor approach proposed in this study. AUROC, AUPRC, and F1 are reported with 95% confidence intervals. The two Thresh columns indicate the thresholds used to maximize the F1 score on the validation set for predictor performance and LEF detection, respectively. Predictor abbreviations are defined in Wagner et al. (2020).

Table C.2: Performance of LEF detection using the single-predictor approach on the external test cohort (ECG-Note) for all 71 ECG predictors.
| Predictor | LEF detection (one predictor): AUROC | LEF detection (one predictor): AUPRC | LEF detection (one predictor): F1 Score | LEF detection (one predictor): Thresh |
|---|---|---|---|---|
| NORM | 78.8 (77.8–79.6) | 36.2 (34.5–38.1) | 42.3 (41.0–43.5) | 0.003641 |
| ILBBB | 71.6 (70.5–72.6) | 33.2 (31.5–35.0) | 38.2 (36.8–39.6) | 0.000349 |
| INJAL | 73.4 (72.4–74.3) | 27.9 (26.6–29.2) | 40.9 (39.6–42.1) | 0.000386 |
| ISCLA | 67.7 (66.6–68.8) | 26.7 (25.4–28.4) | 34.0 (32.6–35.2) | 0.001918 |
| ANEUR | 71.0 (70.0–72.0) | 32.1 (30.4–33.8) | 38.0 (36.7–39.4) | 0.001924 |
| ISCAL | 65.7 (64.7–66.8) | 21.8 (20.8–22.9) | 33.2 (32.0–34.3) | 0.001148 |
| ASMI | 70.7 (69.6–71.7) | 29.3 (27.8–31.0) | 37.1 (35.8–38.3) | 0.023987 |
| SVARR | 64.0 (62.8–65.1) | 22.8 (21.7–24.0) | 32.1 (30.8–33.4) | 0.000218 |
| INJIL | 66.7 (65.6–67.7) | 22.9 (21.8–24.2) | 34.7 (33.5–35.9) | 0.000035 |
| CRBBB | 66.9 (65.8–67.9) | 23.4 (22.2–24.7) | 33.2 (32.1–34.3) | 0.000005 |
| LAFB | 67.5 (66.5–68.5) | 22.9 (21.9–24.0) | 34.9 (33.7–36.1) | 0.001226 |
| ALMI | 66.7 (65.5–67.8) | 29.7 (28.0–31.4) | 34.2 (32.8–35.6) | 0.002106 |
| ABQRS | 64.4 (63.3–65.5) | 24.2 (22.9–25.6) | 32.8 (31.6–33.9) | 0.012123 |
| CLBBB | 69.3 (68.1–70.4) | 29.7 (28.1–31.3) | 36.0 (34.8–37.2) | 0.000006 |
| ILMI | 65.8 (64.7–66.9) | 25.9 (24.5–27.3) | 33.0 (31.8–34.2) | 0.000172 |
| INJAS | 65.1 (64.0–66.2) | 21.9 (20.9–23.1) | 33.9 (32.7–35.1) | 0.000571 |
| INVT | 62.5 (61.1–63.7) | 22.9 (21.7–24.2) | 31.5 (30.3–32.6) | 0.005779 |
| PVC | 59.5 (58.1–60.7) | 22.6 (21.2–24.0) | 29.5 (28.2–30.6) | 0.001254 |
| ISCIL | 59.2 (58.1–60.3) | 18.7 (17.9–19.8) | 29.9 (28.9–30.9) | 0.000142 |
| 1AVB | 61.5 (60.3–62.6) | 21.5 (20.5–22.7) | 30.5 (29.5–31.6) | 0.000059 |
| ISC_ | 59.6 (58.2–60.8) | 22.0 (20.8–23.3) | 29.5 (28.3–30.6) | 0.007637 |
| IVCD | 64.3 (63.0–65.5) | 28.7 (27.0–30.5) | 33.2 (32.0–34.6) | 0.027054 |
| LAO/LAE | 62.5 (61.0–63.7) | 28.6 (26.7–30.3) | 32.9 (31.4–34.3) | 0.051849 |
| ISCAN | 56.8 (55.5–57.9) | 18.3 (17.4–19.2) | 27.4 (26.2–28.5) | 0.000312 |
| ISCAS | 54.8 (53.5–55.9) | 16.5 (15.7–17.2) | 28.3 (27.3–29.3) | 0.000066 |
| AFIB | 57.8 (56.6–58.9) | 18.3 (17.5–19.2) | 29.0 (28.0–30.0) | 0.000004 |
| BIGU | 58.9 (57.7–60.0) | 19.1 (18.2–20.2) | 29.5 (28.5–30.6) | 0.000013 |
| SVTAC | 52.7 (51.5–54.0) | 16.7 (15.9–17.6) | 24.2 (23.0–25.4) | 0.000113 |
| AMI | 54.0 (52.9–55.2) | 16.7 (15.9–17.5) | 28.0 (27.0–29.0) | 0.000868 |
| NST_ | 46.4 (45.2–47.4) | 13.7 (13.1–14.3) | 26.1 (25.2–27.1) | 0.005043 |
| 3AVB | 57.7 (56.6–58.9) | 19.0 (18.1–20.0) | 28.3 (27.4–29.3) | <0.000001 |
| IMI | 58.9 (57.8–60.0) | 19.7 (18.8–20.8) | 29.8 (28.7–30.7) | 0.003210 |
| LPR | 56.4 (55.1–57.5) | 17.6 (16.8–18.5) | 28.7 (27.7–29.7) | 0.000049 |
| 2AVB | 56.6 (55.4–57.9) | 17.6 (16.7–18.6) | 28.8 (27.7–29.9) | 0.000002 |
| DIG | 53.9 (52.7–55.1) | 16.2 (15.5–17.0) | 28.2 (27.2–29.2) | 0.000024 |
| LMI | 59.0 (57.8–60.2) | 20.2 (19.2–21.4) | 29.4 (28.3–30.4) | 0.000937 |
| LOWT | 49.7 (48.5–50.9) | 15.2 (14.5–16.0) | 26.3 (25.3–27.2) | 0.000556 |
| SR | 57.1 (55.9–58.3) | 19.1 (18.1–20.2) | 29.1 (28.0–30.2) | 0.943848 |
| STACH | 54.5 (53.3–55.6) | 16.4 (15.6–17.2) | 28.6 (27.6–29.5) | 0.000001 |
| LPFB | 58.9 (57.7–60.2) | 19.4 (18.5–20.5) | 29.4 (28.4–30.5) | 0.000075 |
| PACE | 57.9 (56.7–59.1) | 22.3 (21.0–23.8) | 28.3 (27.3–29.2) | 0.000033 |
| WPW | 55.8 (54.5–56.9) | 18.6 (17.6–19.7) | 26.5 (25.3–27.7) | 0.000028 |
| ISCIN | 51.3 (50.1–52.5) | 15.6 (14.9–16.4) | 25.5 (24.4–26.5) | 0.000432 |
| PRC(S) | 53.2 (51.9–54.4) | 16.6 (15.8–17.5) | 26.0 (25.0–27.0) | 0.000008 |
| AFLT | 48.4 (47.1–49.6) | 14.9 (14.1–15.6) | 25.1 (24.0–26.0) | 0.000010 |
| INJIN | 56.6 (55.3–57.8) | 18.4 (17.5–19.5) | 27.8 (26.7–29.0) | 0.000062 |
| PAC | 53.9 (52.6–55.0) | 17.0 (16.2–18.0) | 27.0 (26.0–28.0) | 0.000290 |
| IPMI | 51.1 (49.8–52.4) | 17.1 (16.2–18.3) | 26.3 (25.4–27.3) | 0.000016 |
| STD_ | 53.6 (52.4–54.9) | 17.5 (16.7–18.5) | 26.6 (25.7–27.5) | 0.004936 |
| LNGQT | 44.8 (43.5–46.1) | 13.9 (13.2–14.6) | 26.0 (25.1–26.8) | 0.000042 |
| TRIGU | 51.2 (49.9–52.6) | 16.2 (15.5–17.1) | 25.8 (24.8–26.7) | 0.000002 |
| NDT | 32.4 (31.2–33.6) | 11.0 (10.6–11.5) | 27.1 (26.2–27.9) | <0.000001 |
| LVH | 50.8 (49.4–52.2) | 18.2 (17.1–19.4) | 27.1 (26.3–27.9) | 0.000005 |
| PSVT | 36.6 (35.4–37.9) | 12.0 (11.5–12.6) | 27.2 (26.3–28.0) | <0.000001 |
| INJLA | 52.1 (50.7–53.4) | 17.0 (16.1–17.9) | 25.2 (24.0–26.2) | 0.000020 |
| PMI | 37.2 (36.0–38.3) | 12.0 (11.4–12.5) | 27.2 (26.3–28.0) | <0.000001 |
| STE_ | 44.9 (43.7–46.2) | 14.0 (13.4–14.7) | 27.2 (26.3–28.0) | <0.000001 |
| SEHYP | 51.4 (50.3–52.6) | 16.3 (15.5–17.1) | 27.1 (26.3–27.9) | <0.000001 |
| SBRAD | 49.0 (47.8–50.1) | 14.7 (14.0–15.3) | 27.2 (26.4–28.1) | <0.000001 |
| RAO/RAE | 48.1 (46.9–49.4) | 16.3 (15.4–17.3) | 27.2 (26.3–28.0) | <0.000001 |
| VCLVH | 38.8 (37.5–40.1) | 12.4 (11.9–13.0) | 27.0 (26.1–27.8) | 0.000012 |
| IRBBB | 41.8 (40.6–43.1) | 12.8 (12.2–13.4) | 27.1 (26.3–27.9) | <0.000001 |
| QWAVE | 52.8 (51.5–54.0) | 18.7 (17.7–19.8) | 27.1 (26.3–28.0) | 0.000018 |
| NT_ | 35.4 (34.1–36.5) | 11.6 (11.2–12.2) | 27.2 (26.3–28.0) | <0.000001 |
| EL | 39.3 (38.0–40.6) | 12.5 (12.0–13.2) | 27.1 (26.2–27.9) | 0.000001 |
| HVOLT | 34.0 (32.8–35.1) | 11.2 (10.7–11.7) | 27.2 (26.3–28.0) | <0.000001 |
| RVH | 46.7 (45.5–48.0) | 14.1 (13.5–14.8) | 27.2 (26.3–28.0) | <0.000001 |
| LVOLT | 46.6 (45.4–47.8) | 14.5 (13.8–15.2) | 27.2 (26.3–28.0) | <0.000001 |
| SARRH | 50.8 (49.5–52.1) | 16.3 (15.5–17.2) | 27.2 (26.3–28.0) | 0.000012 |
| IPLMI | 51.1 (49.8–52.3) | 16.6 (15.7–17.7) | 27.2 (26.3–28.1) | 0.000006 |
| TAB_ | 49.1 (47.9–50.3) | 15.4 (14.6–16.2) | 26.6 (25.7–27.5) | 0.000014 |

LEF detection performance was derived using the single-predictor approach proposed in this study. AUROC, AUPRC, and F1 are reported with 95% confidence intervals. The Thresh column indicates the threshold used to maximize the F1 score on the validation set for LEF detection. Predictor abbreviations are defined in Wagner et al. (2020).

C.3 Threshold selection for the single-predictor approach

Following Ribeiro et al. (2020), we adopt the F1 score as a primary performance metric due to its robustness to class imbalance. For the single-predictor method, the classification threshold is selected to maximize the F1 score on the validation set, consistent with prior studies (Diao et al., 2025; Ribeiro et al., 2020).
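The F1-maximizing threshold selection described above can be sketched as follows. This is a minimal illustration, assuming binary LEF labels (1 = LEF) and one continuous predictor score per validation ECG; the function names are hypothetical, and the exhaustive search over observed scores is one simple way to do it, not necessarily the procedure used in the study.

```python
def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_max_threshold(y_true, scores):
    """Scan every observed score as a candidate threshold (predict LEF if score >= t).

    Returns (threshold, f1). Ties are resolved in favor of the lowest threshold.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        y_pred = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, y_pred)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The threshold would be chosen once on validation scores and then applied unchanged to the internal and external test sets.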
As an alternative operating point, we also consider a recall-based threshold that enforces a minimum recall of 90%. Notably, under both thresholding strategies, the thresholds used by the single-predictor method for detecting LEF are substantially lower than the corresponding diagnosis thresholds employed in traditional automatic ECG diagnosis models. This trend is observed for most predictors with AUROC greater than 0.60, as summarized in Figure C.1.

[Figure C.1, panels a and b. Horizontal axis: threshold value (0.0–1.0); vertical axis: predictor. Panel a predictors: NORM, ILBBB, INJAL, ISCLA, ANEUR, ISCAL, ASMI, SVARR, INJIL, CRBBB, LAFB, ALMI, ABQRS, CLBBB, ILMI, INJAS, INVT, PVC, ISCIL, 1AVB. Panel b predictors: ISC_, IVCD, LAO/LAE, ISCAN, ISCAS, AFIB, BIGU, SVTAC, AMI, NST_, 3AVB, IMI, LPR, 2AVB, DIG, LMI, SR, STACH, WPW. Legend: diagnosis threshold, F1-max threshold, recall-based threshold.]

Figure C.1: Operating thresholds of individual predictors for LEF detection. a, Left column: thresholds for the first half of predictors (AUROC ≥ 60%). b, Right column: thresholds for the remaining predictors. For each predictor, three types of thresholds are shown: 1) Diagnosis threshold: threshold for determining whether the ECG shows a positive signal for that predictor; 2) F1-max threshold: threshold for predicting LEF positivity using that predictor, chosen to maximize the F1 score on the validation set; 3) Recall-based threshold: threshold for predicting LEF positivity using that predictor, set to achieve recall ≥ 90%. Only predictors with AUROC ≥ 60% are shown for clarity; complete results for all predictors are provided in Tables 2 and C.1.

C.4 Subgroup analysis for the single-predictor approach

The subgroup analysis results for NORM, ILBBB, INJIL, and ISCLA in the internal test set are presented in Tables C.3 and C.4. Corresponding results in the external test set are shown in Tables C.5 and C.6.
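The recall-based operating point defined in Section C.3 (recall ≥ 90%) can be sketched in the same spirit. This is a minimal illustration under the assumption that, among all thresholds meeting the recall constraint, the highest one is chosen so that as few ECGs as possible are flagged; the function name and this tie-breaking choice are hypothetical.

```python
def recall_based_threshold(y_true, scores, min_recall=0.90):
    """Return the highest observed score t such that predicting LEF when
    score >= t yields recall >= min_recall on the given labels."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("recall is undefined without positive cases")
    # Recall can only grow as the threshold drops, so the first hit is the highest.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        if tp / n_pos >= min_recall:
            return t
    return min(scores)  # not reached: the lowest score always gives recall 1.0
```

Because recall is non-decreasing as the threshold is lowered, a constraint of 90% recall typically pushes the threshold toward the small values seen in the Thresh columns for weakly discriminative predictors.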
Table C.3: Subgroup performance of the single-predictor approach for the NORM and ILBBB predictors on the internal test set.

| Category | Subgroup | n | Prevalence (%) | NORM: AUROC | NORM: AUPRC | NORM: F1 Score | ILBBB: AUROC | ILBBB: AUPRC | ILBBB: F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| Age groups | 18–59 | 2124 | 13.6 | 80.9 (78.2–83.8) | 41.3 (36.0–48.3) | 47.1 (42.6–51.6) | 79.3 (76.2–82.1) | 44.1 (38.7–49.9) | 44.4 (39.6–49.3) |
| Age groups | 60–69 | 1318 | 16.6 | 81.6 (78.3–84.8) | 49.3 (42.2–56.6) | 50.4 (45.5–55.0) | 78.1 (74.7–81.1) | 44.8 (37.9–51.4) | 45.8 (40.1–51.0) |
| Age groups | 70–79 | 1154 | 21.7 | 80.6 (77.6–83.5) | 53.6 (47.7–60.5) | 55.4 (51.0–59.9) | 81.1 (78.0–84.0) | 55.7 (49.5–61.8) | 58.5 (53.4–62.9) |
| Age groups | 80+ | 846 | 24.1 | 77.0 (73.4–80.2) | 47.9 (41.9–54.8) | 52.8 (48.2–57.1) | 77.8 (74.5–81.3) | 50.7 (44.8–58.2) | 53.8 (48.6–58.9) |
| Sex | Female | 2731 | 12.4 | 81.2 (78.6–83.6) | 39.0 (34.2–44.9) | 43.9 (40.4–47.5) | 81.9 (79.5–84.1) | 43.6 (38.0–49.4) | 45.7 (41.3–49.9) |
| Sex | Male | 2711 | 23.0 | 80.2 (78.4–82.0) | 53.5 (49.4–57.8) | 56.4 (53.6–59.4) | 77.5 (75.4–79.6) | 52.3 (48.2–56.6) | 53.5 (50.5–56.9) |
| Race/ethnicity | Hispanic | 1649 | 16.7 | 83.3 (80.6–85.9) | 52.6 (46.5–58.8) | 53.9 (50.0–58.2) | 81.1 (78.3–83.7) | 50.3 (44.5–56.4) | 50.0 (45.0–54.6) |
| Race/ethnicity | White | 1569 | 16.3 | 81.7 (79.1–84.4) | 45.0 (39.1–51.7) | 48.3 (43.9–52.4) | 79.4 (76.4–82.2) | 45.3 (39.1–52.0) | 48.9 (43.5–53.6) |
| Race/ethnicity | Black | 846 | 19.3 | 78.1 (74.4–81.8) | 47.2 (39.9–56.2) | 50.2 (44.5–55.9) | 79.8 (75.8–83.3) | 50.0 (42.4–58.8) | 50.7 (44.2–56.6) |
| Race/ethnicity | Asian | 153 | 16.3 | 82.7 (74.6–90.2) | 50.4 (32.3–69.7) | 45.9 (31.0–58.7) | 88.1 (81.0–94.2) | 65.5 (48.4–81.2) | 57.6 (40.8–71.0) |
| Race/ethnicity | Other | 457 | 15.8 | 81.4 (75.8–86.6) | 42.1 (32.3–54.0) | 50.9 (42.9–58.9) | 77.2 (70.9–83.4) | 50.4 (39.3–62.1) | 50.0 (41.2–58.6) |
| Race/ethnicity | Unknown | 768 | 22.3 | 77.7 (73.6–81.7) | 47.7 (40.4–56.3) | 54.7 (49.0–60.0) | 78.8 (74.8–82.5) | 49.5 (42.0–57.6) | 53.2 (46.5–59.1) |
| Clinical context | Emergency | 1971 | 15.7 | 79.9 (77.2–82.4) | 40.8 (35.8–46.3) | 47.9 (43.9–51.7) | 82.0 (79.5–84.2) | 48.3 (42.7–53.9) | 49.9 (45.2–54.0) |
| Clinical context | Inpatient | 2203 | 24.4 | 79.0 (76.7–81.1) | 54.9 (50.2–59.7) | 56.4 (53.3–59.5) | 75.8 (73.5–78.1) | 52.5 (48.2–57.0) | 52.4 (49.1–55.9) |
| Clinical context | Outpatient | 1059 | 6.6 | 83.8 (78.1–88.8) | 34.6 (25.1–45.4) | 35.7 (28.8–42.7) | 82.0 (76.3–87.2) | 31.5 (21.8–43.7) | 40.9 (32.0–49.3) |
| Clinical context | Procedural | 209 | 21.5 | 81.1 (74.6–87.0) | 46.3 (34.1–61.7) | 55.2 (42.7–65.2) | 75.3 (65.9–83.1) | 44.9 (31.8–60.9) | 50.0 (35.9–61.0) |

AUROC, AUPRC, and F1 are reported with 95% confidence intervals. NORM, normal ECG; ILBBB, incomplete left bundle branch block.

Table C.4: Subgroup performance of the single-predictor method for INJIL and ISCLA in the internal test set.

| Category | Subgroup | n | Prevalence (%) | INJIL: AUROC | INJIL: AUPRC | INJIL: F1 Score | ISCLA: AUROC | ISCLA: AUPRC | ISCLA: F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| Age groups | 18–59 | 2124 | 13.6 | 77.4 (74.6–80.2) | 32.4 (28.4–37.9) | 40.8 (36.6–45.1) | 79.0 (76.4–81.8) | 36.6 (31.8–43.0) | 43.3 (39.3–47.4) |
| Age groups | 60–69 | 1318 | 16.6 | 75.4 (72.0–78.8) | 34.0 (29.0–40.1) | 45.5 (40.6–50.2) | 74.4 (70.9–78.0) | 37.0 (30.8–44.3) | 43.0 (38.1–48.0) |
| Age groups | 70–79 | 1154 | 21.7 | 73.9 (70.6–77.1) | 39.5 (34.5–45.6) | 48.0 (43.1–52.7) | 73.9 (70.6–77.2) | 42.5 (36.9–49.3) | 45.8 (40.9–50.1) |
| Age groups | 80+ | 846 | 24.1 | 69.2 (65.1–72.9) | 36.3 (31.2–42.8) | 47.5 (42.3–52.5) | 70.8 (67.2–74.5) | 43.3 (37.2–50.1) | 43.8 (38.5–48.8) |
| Sex | Female | 2731 | 12.4 | 75.2 (72.7–77.7) | 26.1 (22.8–30.3) | 36.2 (32.8–40.0) | 76.8 (74.4–79.3) | 29.5 (25.6–34.7) | 37.7 (34.1–41.6) |
| Sex | Male | 2711 | 23.0 | 74.3 (72.2–76.3) | 41.2 (37.8–45.1) | 51.1 (48.2–54.0) | 75.3 (73.2–77.3) | 46.2 (42.6–50.4) | 48.5 (45.6–51.5) |
| Race/ethnicity | Hispanic | 1649 | 16.7 | 76.1 (73.0–79.2) | 34.6 (30.1–40.3) | 45.1 (40.8–49.6) | 75.5 (72.6–78.4) | 39.0 (33.1–45.3) | 42.0 (37.5–46.3) |
| Race/ethnicity | White | 1569 | 16.3 | 76.2 (73.2–79.3) | 35.0 (30.2–41.0) | 43.8 (39.6–48.3) | 76.8 (74.0–79.7) | 35.9 (31.1–42.1) | 42.1 (37.5–46.7) |
| Race/ethnicity | Black | 846 | 19.3 | 72.0 (68.0–76.1) | 32.6 (27.5–39.6) | 44.9 (39.1–50.8) | 76.1 (72.3–80.0) | 41.5 (34.9–49.8) | 46.6 (41.2–51.8) |
| Race/ethnicity | Asian | 153 | 16.3 | 78.1 (67.1–87.2) | 47.8 (30.3–64.3) | 45.0 (29.6–56.8) | 78.0 (67.1–87.5) | 41.7 (26.7–63.5) | 45.9 (29.8–59.0) |
| Race/ethnicity | Other | 457 | 15.8 | 76.1 (70.7–81.7) | 35.6 (27.4–46.6) | 41.6 (32.9–50.0) | 75.6 (70.0–80.8) | 37.1 (27.4–48.4) | 42.3 (33.7–50.0) |
| Race/ethnicity | Unknown | 768 | 22.3 | 74.0 (69.8–77.8) | 39.0 (33.2–46.7) | 48.7 (43.2–54.2) | 74.3 (70.3–78.5) | 45.0 (37.9–53.2) | 47.9 (42.2–53.5) |
| Clinical context | Emergency | 1971 | 15.7 | 76.6 (73.9–79.2) | 31.9 (28.1–36.6) | 43.7 (39.6–47.5) | 77.6 (75.0–79.8) | 38.6 (33.6–44.3) | 43.2 (39.3–47.1) |
| Clinical context | Inpatient | 2203 | 24.4 | 70.8 (68.3–73.2) | 39.3 (35.7–43.4) | 48.0 (44.7–51.4) | 70.4 (67.9–72.8) | 41.7 (37.5–46.2) | 46.9 (43.4–50.3) |
| Clinical context | Outpatient | 1059 | 6.6 | 75.4 (69.5–81.5) | 20.2 (14.4–29.2) | 29.2 (21.7–36.2) | 81.4 (75.7–86.3) | 32.4 (23.2–45.3) | 31.1 (23.9–38.3) |
| Clinical context | Procedural | 209 | 21.5 | 77.2 (69.8–84.8) | 44.0 (32.0–60.8) | 55.4 (43.7–66.7) | 76.6 (69.6–83.4) | 43.8 (30.8–60.5) | 42.3 (29.1–53.9) |

AUROC, AUPRC, and F1 are reported with 95% confidence intervals. INJIL, subendocardial injury in inferolateral leads; ISCLA, ischemic in lateral leads.

Table C.5: Subgroup performance of the single-predictor method for NORM and ILBBB in the external test set.

| Category | Subgroup | n | Prevalence (%) | NORM: AUROC | NORM: AUPRC | NORM: F1 Score | ILBBB: AUROC | ILBBB: AUPRC | ILBBB: F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| Age groups | 18–59 | 4270 | 12.2 | 81.5 (79.4–83.3) | 37.1 (33.1–41.5) | 44.6 (41.6–47.7) | 73.7 (71.3–76.0) | 33.7 (30.3–38.0) | 40.4 (37.7–43.2) |
| Age groups | 60–69 | 3637 | 15.7 | 78.5 (76.5–80.3) | 40.8 (36.9–45.3) | 46.2 (43.4–48.9) | 71.6 (69.3–73.9) | 37.6 (33.9–41.8) | 43.4 (40.5–46.1) |
| Age groups | 70–79 | 3761 | 18.0 | 77.7 (76.1–79.5) | 39.3 (36.1–43.3) | 43.7 (41.3–46.0) | 70.5 (68.4–72.7) | 36.0 (33.1–39.7) | 39.9 (37.4–42.5) |
| Age groups | 80+ | 4349 | 17.2 | 74.9 (73.1–76.7) | 34.2 (31.3–37.6) | 38.4 (36.4–40.4) | 67.0 (64.9–69.2) | 29.3 (26.8–32.4) | 36.1 (33.9–38.5) |
| Sex | Female | 7222 | 11.2 | 74.2 (72.4–75.9) | 22.6 (19.9–25.8) | 32.4 (30.4–34.5) | 65.3 (63.0–67.6) | 19.5 (17.1–22.2) | 29.8 (27.4–32.0) |
| Sex | Male | 8795 | 19.4 | 77.1 (75.9–78.4) | 42.2 (39.6–44.9) | 45.4 (43.6–47.2) | 71.6 (70.0–73.0) | 39.6 (36.9–42.3) | 43.1 (41.2–45.0) |
| Race/ethnicity | Hispanic | 638 | 14.7 | 81.7 (77.2–85.8) | 39.0 (30.4–48.8) | 45.8 (39.3–52.6) | 68.6 (63.1–74.1) | 30.3 (23.0–39.1) | 39.1 (31.8–46.2) |
| Race/ethnicity | White | 11688 | 15.5 | 77.0 (75.9–77.9) | 35.2 (33.2–37.5) | 41.7 (40.4–43.1) | 69.4 (68.2–70.6) | 31.1 (29.0–33.4) | 39.4 (38.0–40.9) |
| Race/ethnicity | Black | 1845 | 17.3 | 79.6 (77.2–81.9) | 44.3 (39.1–50.2) | 46.2 (42.4–49.8) | 72.6 (69.8–75.4) | 40.7 (35.7–46.4) | 43.4 (39.6–47.2) |
| Race/ethnicity | Asian | 414 | 12.6 | 74.7 (69.4–80.0) | 26.1 (18.4–36.6) | 37.2 (28.5–45.1) | 69.0 (63.0–75.2) | 24.2 (17.4–33.7) | 35.0 (27.1–43.4) |
| Race/ethnicity | Other | 524 | 13.0 | 73.7 (68.1–79.1) | 22.5 (15.9–32.1) | 34.4 (27.5–41.4) | 66.2 (60.3–72.5) | 21.2 (15.3–29.9) | 32.9 (26.2–39.6) |
| Race/ethnicity | Unknown | 908 | 18.8 | 75.5 (71.4–79.5) | 39.5 (33.1–46.7) | 41.8 (36.8–46.7) | 66.2 (61.8–70.6) | 34.9 (29.7–41.1) | 38.0 (33.1–42.7) |
| Clinical context | Emergency | 9170 | 14.6 | 76.3 (75.1–77.5) | 33.6 (31.4–36.0) | 40.6 (39.1–42.2) | 69.2 (67.7–70.6) | 31.1 (28.9–33.7) | 39.2 (37.7–40.7) |
| Clinical context | Urgent | 3028 | 20.9 | 78.7 (76.9–80.5) | 45.4 (41.5–49.2) | 49.0 (46.5–51.6) | 71.2 (69.2–73.2) | 40.8 (36.9–45.0) | 44.9 (42.2–47.5) |
| Clinical context | Observation | 2173 | 19.6 | 79.6 (77.7–81.7) | 49.0 (44.0–54.2) | 51.3 (48.4–54.1) | 73.9 (71.7–76.0) | 44.9 (40.0–50.0) | 48.6 (45.5–51.5) |
| Clinical context | Surgical Same Day | 1022 | 6.5 | 70.5 (65.5–75.5) | 13.2 (9.2–19.6) | 21.7 (16.5–27.2) | 67.3 (61.8–72.9) | 12.2 (8.4–18.2) | 21.2 (16.3–26.5) |
| Clinical context | Elective | 624 | 9.3 | 73.9 (68.7–79.2) | 19.1 (13.6–28.4) | 28.9 (22.7–35.0) | 67.0 (60.9–73.0) | 16.0 (11.6–24.0) | 26.0 (20.7–31.7) |

AUROC, AUPRC, and F1 are reported with 95% confidence intervals. NORM, normal ECG; ILBBB, incomplete left bundle branch block.

Table C.6: Subgroup performance of the single-predictor method for INJIL and ISCLA in the external test set.
INJIL ISCLA Subgroup n Prev alence (%) AUR OC AUPR C F1 Score A UROC AUPR C F1 Score Age groups 18–59 4270 12.2 73.1 (71.0–75.1) 25.7 (22.4–29.6) 42.2 (39.1–45.5) 61.0 (58.6–63.4) 23.2 (20.3–26.6) 32.4 (30.0–34.8) 60–69 3637 15.7 72.4 (70.4–74.5) 28.9 (25.6–32.8) 45.0 (42.4–47.6) 61.3 (59.0–63.6) 26.2 (23.4–29.6) 35.6 (33.2–38.1) 70–79 3761 18.0 71.6 (69.8–73.5) 29.4 (26.9–32.4) 43.5 (41.1–45.9) 62.7 (60.6–64.9) 25.9 (23.6–29.1) 33.2 (30.7–35.5) 80+ 4349 17.2 69.9 (68.0–71.7) 27.3 (25.1–30.2) 39.2 (36.9–41.3) 62.2 (60.2–64.4) 24.7 (22.7–27.4) 32.0 (29.8–34.3) Sex F emale 7222 11.2 67.1 (65.4–68.8) 18.7 (16.4–21.5) 31.6 (29.7–33.6) 58.1 (55.7–60.5) 16.3 (14.2–18.9) 26.3 (23.9–28.7) Male 8795 19.4 73.8 (72.6–75.1) 34.5 (32.3–36.9) 44.5 (42.7–46.3) 65.6 (64.0–67.1) 33.0 (30.6–35.5) 39.5 (37.7–41.4) Race/ethnicit y Hispanic 638 14.7 75.8 (71.3–80.1) 30.7 (23.9–39.2) 45.9 (39.5–52.2) 62.3 (56.4–68.2) 27.1 (20.3–35.8) 34.8 (27.6–42.2) White 11688 15.5 71.7 (70.7–72.7) 29.8 (28.0–31.8) 41.7 (40.4–43.0) 62.0 (60.8–63.1) 27.3 (25.5–29.2) 34.9 (33.5–36.3) Blac k 1845 17.3 74.6 (72.3–76.8) 37.0 (32.6–41.7) 45.1 (41.5–48.8) 64.5 (61.6–67.5) 34.2 (29.9–38.8) 38.8 (35.0–42.7) Asian 414 12.6 69.2 (63.6–74.6) 20.1 (14.2–28.8) 36.7 (28.0–45.5) 62.9 (56.8–69.4) 20.4 (14.5–29.2) 32.8 (25.1–40.9) Other 524 13.0 68.8 (63.0–74.7) 17.7 (12.6–25.4) 34.5 (27.5–41.7) 61.4 (55.4–68.0) 16.8 (12.3–24.3) 30.8 (23.7–38.1) Unkno wn 908 18.8 69.5 (65.2–73.5) 31.5 (26.4–37.2) 40.8 (35.9–45.7) 60.7 (56.0–65.2) 28.5 (24.2–33.5) 35.9 (30.9–40.7) Clinical context Emergency 9170 14.6 71.6 (70.3–72.8) 28.2 (26.4–30.2) 40.6 (39.1–42.2) 62.2 (60.8–63.6) 26.4 (24.6–28.3) 35.8 (34.3–37.3) Urgen t 3028 20.9 74.6 (72.9–76.3) 39.2 (35.8–42.9) 49.0 (46.5–51.6) 64.2 (62.2–66.3) 36.3 (33.0–39.8) 44.9 (42.2–47.5) Observ ation 2173 19.6 74.7 (72.8–76.5) 42.6 (38.2–47.1) 51.3 (48.4–54.1) 65.4 (63.3–67.7) 39.7 (35.3–44.1) 48.6 (45.5–51.5) Surgical Same Day 1022 6.5 66.9 (61.9–72.0) 12.3 (8.6–18.1) 21.7 (16.5–27.2) 
61.5 (56.0–67.1) 12.0 (8.4–17.8) 21.2 (16.3–26.5) Electiv e 624 9.3 70.9 (66.0–75.9) 18.1 (12.9–26.8) 28.9 (22.7–35.0) 62.6 (56.6–68.7) 16.9 (12.1–25.2) 26.0 (20.7–31.7) A UROC, A UPRC, and F1 are rep orted with 95% confidence in terv als. INJIL, subendo cardial injury in inferolateral leads; ISCLA, ischemic in lateral leads. 43 References A ttia, Z. I., Kapa, S., Lop ez-Jimenez, F., McKie, P . M., Ladewig, D. J., Satam, G., Pellikka, P . A., Enriquez-Sarano, M., Noseworth y , P . A., Munger, T. M. et al. (2019a) Screening for cardiac con tractile dysfunction using an artificial intelligence–enabled electro cardiogram. Natur e me dicine , 25 , 70–74. A ttia, Z. I. et al. (2019b) Prosp ective v alidation of a deep learning electro cardiogram algo- rithm for the detection of left v en tricular systolic dysfunction. J. Car diovasc. Ele ctr. , 30 , 668–674. Carter, R. E., Johnson, P . W., Strom, J. B., W aks, J. W., Krumerman, A., F erric k, K. J., DeRaad, R., Stein b erg, B. A., Wieczorek, M. A., Cruz, J. et al. (2026) Multisite, external v alidation of an ai-enabled ecg algorithm for detection of lo w ejection fraction. JA CC: A dvanc es , 5 , 102537. Chen, T. (2016) Xgb o ost: A scalable tree bo osting system. Cornel l University . Cardio v ascular Committee of China Medical W omens Asso ciation, Asia Heart Rh ythm So ci- et y , T. C. o. R. C. M. o. C. A. f. P ., Medical Devices T echnology Exc hange, C. E. C. T. F. o. t. S. f. C. and of the Primary Diagnostic T erminology of Electro cardiogram, E. S. (2023) Chinese exp ert consensus statement on the standardized Chinese and English primary diagnostic terminology of electro cardiogram. Chin. Cir culation J. , 38 , 141–145. Ciampi, Q. and Villari, B. (2007) Role of echocardiography in diagnosis and risk stratification in heart failure with left ven tricular systolic dysfunction. Car diovascular ultr asound , 5 , 34. Diao, X., Xu, W., Cheng, H., Zhou, Y., Liu, Y., Huo, Y., Lu, J., Huang, J., He, J., Liu, F. et al. 
(2025) Sp eed-tr: a self-distilled and pre-trained transformer mo del for enhanced ecg detection of tricuspid regurgitation. npj Digital Me dicine , 8 , 650. Elias, P . and Finer, J. (2025) Ec honext: A dataset for detecting ec ho cardiogram-confirmed structural heart disease from ecgs. Gao, Z., Y urk, D. and Abu-Mostafa, Y. S. (2025) Machine learning with scarce data: Ejection fraction prediction using PLAX view. In Me dic al Imaging with De ep L e arning . URL: https://openreview.net/forum?id=JEN5FzeFZj . Go w, B., P ollard, T., Nathanson, L. A., Johnson, A., Mo o dy , B., F ernandes, C., Green baum, N., W aks, J. W., Eslami, P ., Carb onati, T., Chaudhari, A., Herbst, E., Moukheib er, D., Berk owitz, S., Mark, R. and Horng, S. (2023) MIMIC-IV-ECG: Diagnostic Electro cardio- gram Matched Subset. PhysioNet . URL: https://doi.org/10.13026/4nqg- sb35 . V ersion 1.0. Hann un, A. Y., Ra jpurkar, P ., Haghpanahi, M., Tison, G. H., Bourn, C., T urakhia, M. P . and Ng, A. Y. (2019) Cardiologist-level arrhythmia detection and classification in ambulatory electro cardiograms using a deep neural net work. Natur e me dicine , 25 , 65–69. Heidenreic h, P . A., Bozkurt, B., Aguilar, D., Allen, L. A., Byun, J. J., Colvin, M. M., Desw al, A., Drazner, M. H., Dunla y , S. M., Evers, L. R. et al. (2022) 2022 aha/acc/hfsa guideline 44 for the management of heart failure: executive summary: a rep ort of the american college of cardiology/american heart asso ciation join t committee on clinical practice guidelines. Journal of the A meric an Col le ge of Car diolo gy , 79 , 1757–1780. Hughes, J. W., Somani, S., Elias, P ., T o oley , J., Rogers, A. J., P oterucha, T., Haggerty , C. M., Salerno, M., Ouyang, D., Ashley , E. et al. (2024) Simple mo dels vs. deep learning in detecting lo w ejection fraction from the electro cardiogram. Eur op e an He art Journal-Digital He alth , 5 , 427–434. ISO Central Secretary . 
(2009) Health informatics – Standard comm unication proto col – Part 91064: Computer-assisted electro cardiograph y. Jiang, A., Huang, C., Cao, Q., Xu, Y., Zeng, Z., Chen, K., Zhang, Y. and W ang, Y. (2024) Self-sup ervised anomaly detection pretraining enhances long-tail ecg diagnosis. arXiv pr eprint arXiv:2408.17154 . Joglar, J. A., Chung, M. K., Armbruster, A. L., Benjamin, E. J., Ch you, J. Y., Cronin, E. M., Desw al, A., Ec khardt, L. L., Goldb erger, Z. D., Gopinathannair, R. et al. (2024) 2023 acc/aha/accp/hrs guideline for the diagnosis and management of atrial fibrillation: a rep ort of the american college of cardiology/american heart asso ciation joint committee on clinical practice guidelines. Journal of the A meric an Col le ge of Car diolo gy , 83 , 109–279. Johnson, A., Bulgarelli, L., P ollard, T., Gow, B., Mo o dy , B., Horng, S., Celi, L. A. and Mark, R. (2024) MIMIC-IV. PhysioNet . URL: https://doi.org/10.13026/kpb9- mt58 . V ersion 3.1. Johnson, A., Pollard, T., Horng, S., Celi, L. A. and Mark, R. (2023a) MIMIC-IV-Note: Deiden tified free-text clinical notes. PhysioNet . URL: https://doi.org/10.13026/ 1n74- ne17 . V ersion 2.2. Johnson, A. E., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T. J., Hao, S., Mo o dy , B., Gow, B. et al. (2023b) Mimic-iv, a freely accessible electronic health record dataset. Scientific data , 10 , 1. Khera, R. (2024) Ai-enabled diagnosis from an electro cardiogram image: the next frontier of inno v ation in a cen tury-old technology . Kligfield, P ., Gettes, L. S., Bailey , J. J., Childers, R., Deal, B. J., Hanco c k, E. W., v an Herp en, G., Kors, J. A., Macfarlane, P ., Mirvis, D. M., Pahlm, O., Rautaharju, P . and W agner, G. S. (2007) Recommendations for the standardization and interpretation of the electro cardiogram. Cir culation , 115 , 1306–1324. URL: https://www.ahajournals.org/ doi/abs/10.1161/CIRCULATIONAHA.106.180200 . Kusumoto, F. M., Schoenfeld, M. 
H., Barrett, C., Edgerton, J. R., Ellen b ogen, K. A., Gold, M. R., Goldschlager, N. F., Hamilton, R. M., Joglar, J. A., Kim, R. J. et al. (2019) 2018 acc/aha/hrs guideline on the ev aluation and managemen t of patients with bradycardia and cardiac conduction delay: a report of the american college of cardiology/american heart asso ciation task force on clinical practice guidelines and the heart rhythm society . Journal of the A meric an Col le ge of Car diolo gy , 74 , e51–e156. 45 K won, J.-M., Lee, S. Y., Jeon, K.-H., Lee, Y., Kim, K.-H., Park, J., Oh, B.-H. and Lee, M.-M. (2020) Deep learning–based algorithm for detecting aortic stenosis using electro car- diograph y . Journal of the A meric an He art A sso ciation , 9 , e014717. Lakshmanan, S. and Mbanze, I. (2023) A comparison of cardio v ascular imaging practices in africa, north america, and europ e: tw o faces of the same coin. Eur op e an He art Journal- Imaging Metho ds and Pr actic e , 1 , qyad005. Li, J., Aguirre, A. D., Junior, V. M., Jin, J., Liu, C., Zhong, L., Sun, C., Clifford, G., Brandon W esto ver, M. and Hong, S. (2025) An electro cardiogram foundation mo del built on ov er 10 million recordings. NEJM AI , 2 , AIoa2401033. Lundb erg, S. M., Erion, G. G. and Lee, S.-I. (2018) Consisten t individualized feature attri- bution for tree ensembles. arXiv pr eprint arXiv:1802.03888 . Lundb erg, S. M. and Lee, S.-I. (2017) A unified approac h to in terpreting model predictions. A dvanc es in neur al information pr o c essing systems , 30 . Na, Y., P ark, M., T ae, Y. and Joo, S. (2024) Guiding mask ed representation learning to capture spatio-temp oral relationship of electro cardiogram. In The Twelfth Interna- tional Confer enc e on L e arning R epr esentations . URL: https://openreview.net/forum? id=WcOohbsF4H . P oterucha, T. J., Jing, L., Ricart, R. P ., A djei-Mosi, M., Finer, J., Hartzel, D., Kelsey , C., Long, A., Ro cha, D., R uhl, J. A. et al. 
(2025) Detecting structural heart disease from electro cardiograms using ai. Natur e , 644 , 221–230. Qw en, :, Y ang, A., Y ang, B., Zhang, B., Hui, B., Zheng, B., Y u, B., Li, C., Liu, D., Huang, F., W ei, H., Lin, H., Y ang, J., T u, J., Zhang, J., Y ang, J., Y ang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Y ang, K., Y u, L., Li, M., Xue, M., Zhang, P ., Zh u, Q., Men, R., Lin, R., Li, T., T ang, T., Xia, T., Ren, X., Ren, X., F an, Y., Su, Y., Zhang, Y., W an, Y., Liu, Y., Cui, Z., Zhang, Z. and Qiu, Z. (2025) Qwen2.5 tec hnical rep ort. URL: https://arxiv.org/abs/2412.15115 . Rib eiro, A. H., Paixao, G. M., Lima, E. M., Horta Rib eiro, M., Pinto Filho, M. M., Gomes, P . R., Oliv eira, D. M., Meira Jr, W., Schon, T. B. and Rib eiro, A. L. P . (2021) Co de-15%: a large scale annotated dataset of12-lead ecgs. URL: https://doi.org/10.5281/zenodo. 4916206 . Rib eiro, A. H., Rib eiro, M. H., P aixão, G. M., Oliveira, D. M., Gomes, P . R., Canazart, J. A., F erreira, M. P ., Andersson, C. R., Macfarlane, P . W., Meira Jr, W. et al. (2020) Automatic diagnosis of the 12-lead ecg using a deep neural netw ork. Natur e c ommunic ations , 11 , 1760. Sa v arese, G., Becher, P . M., Lund, L. H., Seferovic, P ., Rosano, G. M. and Coats, A. J. (2022) Global burden of heart failure: a comprehensive and up dated review of epidemiology . Car diovascular r ese ar ch , 118 , 3272–3287. Shapley , L. S. et al. (1953) A v alue for n-person games. 46 Stro dthoff, N., Mehari, T., Nagel, C., Aston, P . J., Sundar, A., Graff, C., Kan ters, J. K., Ha verkamp, W., Dössel, O., Lo ewe, A. et al. (2023) Ptb-xl+, a comprehensiv e electrocar- diographic feature dataset. Scientific data , 10 , 279. Stro dthoff, N., W agner, P ., Sc haeffter, T. and Samek, W. (2020) Deep learning for ecg anal- ysis: Benc hmarks and insights from ptb-xl. IEEE journal of biome dic al and he alth infor- matics , 25 , 1519–1528. T ran, H. H.-V., Th u, A., F uertes, A., T wa yana, A. 
R., Mahadev aiah, A., Meh ta, K. A., James, M., Basta, M., W eissman, S., F rishman, W. H. et al. (2025) Electro cardiogram- based artificial in telligence for detection of low ejection fraction: A contemporary review. Car diolo gy in R eview , 10–1097. V an De Leur, R. R., Bos, M. N., T aha, K., Sammani, A., Y eung, M. W., V an Duijven b o den, S., Lambiase, P . D., Hassink, R. J., V an Der Harst, P ., Do evendans, P . A. et al. (2022) Impro ving explainabilit y of deep neural netw ork-based electro cardiogram interpretation using v ariational auto-enco ders. Eur op e an He art Journal-Digital He alth , 3 , 390–404. W agner, P ., Stro dthoff, N., Bousseljot, R.-D., Kreiseler, D., Lunze, F. I., Samek, W. and Sc haeffter, T. (2020) Ptb-xl, a large publicly a v ailable electro cardiography dataset. Scien- tific data , 7 , 1–15. Y ao, X., R ushlo w, D. R., Inselman, J. W., McCo y , R. G., Thac her, T. D., Behnk en, E. M., Bernard, M. E., Rosas, S. L., Akfaly , A., Misra, A. et al. (2021) Artificial intelligence– enabled electro cardiograms for iden tification of patients with lo w ejection fraction: a prag- matic, randomized clinical trial. Natur e me dicine , 27 , 815–819. Zheng, J., Ch u, H., Struppa, D., Zhang, J., Y acoub, S. M., El-Askary , H., Chang, A., Eh w- erhem uepha, L., Abudayy eh, I., Barrett, A. et al. (2020a) Optimal m ulti-stage arrhythmia classification approach. Scientific r ep orts , 10 , 2898. Zheng, J., Zhang, J., Danioko, S., Y ao, H., Guo, H. and Rako vski, C. (2020b) A 12-lead electro cardiogram database for arrhythmia researc h co vering more than 10,000 patients. Scientific data , 7 , 48. Zhou, Y., Y ang, Y., F an, X. and Zhao, W. (2025) Bridging performance gaps for ecg foun- dation mo dels: A p ost-training strategy . arXiv pr eprint arXiv:2509.12991 . 47
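The footnotes to Tables C.3–C.6 state that AUROC, AUPRC, and F1 are reported with 95% confidence intervals, but this appendix does not say how the intervals were obtained. The following is a minimal sketch of one common way to produce such per-subgroup numbers, assuming a percentile bootstrap over records; this is an assumption for illustration, not necessarily the authors' procedure. The record fields (`group`, `label`, `score`), the helper names, the decision threshold passed to `f1_at`, and the synthetic data are all hypothetical.

```python
import random
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as P(random positive outranks random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def f1_at(threshold):
    """Return an F1 metric that binarizes scores at a (hypothetical) threshold."""
    def f1(labels, scores):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return f1


def bootstrap_ci(labels, scores, metric, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap (assumed method): resample records with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # skip resamples that contain a single class
            continue
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    k = round(alpha / 2 * n_boot)  # number of resamples trimmed from each tail
    return stats[k], stats[n_boot - 1 - k]


def subgroup_report(records, metric, n_boot=200, seed=0):
    """Point estimate and CI of `metric` within each value of record['group']."""
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec["group"]].append(rec)
    report = {}
    for name, rows in by_group.items():
        ys = [r["label"] for r in rows]
        ss = [r["score"] for r in rows]
        ci = bootstrap_ci(ys, ss, metric, n_boot=n_boot, seed=seed)
        report[name] = (metric(ys, ss), ci)
    return report


# Synthetic demo data only; these are not values from the paper.
_rng = random.Random(42)
records = []
for group in ("Female", "Male"):
    for _ in range(120):
        label = 1 if _rng.random() < 0.2 else 0
        score = min(1.0, max(0.0, _rng.gauss(0.6 if label else 0.3, 0.15)))
        records.append({"group": group, "label": label, "score": score})

report = subgroup_report(records, auroc)
for name, (point, (lo, hi)) in report.items():
    print(f"{name}: AUROC {100 * point:.1f} ({100 * lo:.1f}-{100 * hi:.1f})")
```

The same `subgroup_report` call accepts any metric with the `(labels, scores)` signature, for example `f1_at(0.5)`. Small subgroups, such as the Asian and Procedural rows above, would be expected to yield wide intervals under this kind of resampling, which is consistent with the wide intervals reported for them.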