MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning



Min Zeng, Shuang Zhou, Zaifu Zhan, Rui Zhang*
Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA.
*Corresponding author. E-mail: ruizhang@umn.edu

Abstract

Medical language models must be updated as evidence and terminology evolve, yet sequential updating can trigger catastrophic forgetting. Although biomedical NLP has many static benchmarks, no unified, task-diverse benchmark exists for evaluating continual learning under standardized protocols, robustness to task order, and compute-aware reporting. We introduce MedCL-Bench, which streams ten biomedical NLP datasets spanning five task families and evaluates eleven continual learning strategies across eight task orders, reporting retention, transfer, and GPU-hour cost. Across backbones and task orders, direct sequential fine-tuning on incoming tasks induces catastrophic forgetting, causing update-induced performance regressions on prior tasks. Continual learning methods occupy distinct retention–compute frontiers: parameter isolation provides the best retention per GPU-hour, replay offers strong protection at higher cost, and regularization yields limited benefit. Forgetting is task-dependent, with multi-label topic classification most vulnerable and constrained-output tasks more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.

Large language models (LLMs) are increasingly used to support biomedical question answering, evidence retrieval, relation extraction, and document-level classification. Yet biomedical knowledge is not static: new findings, revised clinical evidence, and evolving therapeutic guidance continually change what models should know.
However, updating large models by full retraining is often computationally impractical, while repeatedly fine-tuning on new datasets can erode previously acquired capabilities through catastrophic forgetting [1, 2]. The result is a tension between plasticity (incorporating new knowledge) and stability (preserving prior competencies) that is especially consequential in biomedical settings, where outdated or inconsistent behavior can directly undermine downstream research and decision support. Continual learning (CL) has therefore emerged as a promising paradigm for enabling models to acquire new knowledge over time while mitigating degradation of prior competencies [3].

In the clinical domain, these challenges are sharpened by three practical constraints [4, 5]. First, privacy and data-governance constraints often limit the pooling of raw patient records across institutions, creating persistent "data walls" between sites [6, 7]. This motivates model-to-data workflows in which models are updated sequentially across hospitals (e.g., Hospital A → Hospital B) by transferring parameters rather than sensitive primary data [8, 9]. Second, clinical data are inherently sequential and longitudinal [10, 11]: patient information accrues over time through repeated encounters, tests, treatments, and evolving diagnoses, while labeled datasets and task definitions often emerge incrementally rather than as a single static corpus. Third, clinical practice exhibits continual knowledge drift: diagnostic standards, treatment guidelines, and pathogen profiles can change over time [12–14], requiring models to incorporate new evidence without repeated full retraining. These constraints translate directly into deployment risk. Sequential updates can introduce silent regressions [13, 15]: a model may improve on newly introduced data while degrading on previously validated capabilities.
In safety-critical workflows, such regressions increase the verification burden for clinicians and can undermine the reliability of downstream decision support, including tasks such as adverse-event monitoring or drug–drug interaction detection. This creates a practical need for rigorous evaluation of continual learning behavior under repeated updates.

Despite growing interest in both biomedical natural language processing (NLP) and continual learning, several questions remain unresolved in realistic biomedical deployment settings. In particular, it remains unclear how severe catastrophic forgetting is in biomedical NLP, whether CL methods reliably mitigate forgetting and deliver meaningful benefits, which classes of CL strategies are most effective for biomedical tasks, whether findings from general-domain CL transfer to settings with specialized biomedical language and heterogeneous data distributions, whether larger backbones consistently improve performance or instead exhibit non-monotonic, backbone-dependent trade-offs, and how these methods compare in terms of training cost, parameter efficiency, and update-time overhead. Biomedical NLP still lacks a unified, task-diverse benchmark for CL under a standardized training and evaluation protocol to answer these questions rigorously.

To address these gaps, we introduce MedCL-Bench (Fig. 1), a unified continual learning benchmark for biomedical NLP. MedCL-Bench is designed to answer three practical questions: (i) how severe catastrophic forgetting is in biomedical NLP, (ii) which continual learning strategies offer the best trade-offs between end-of-stream performance and training cost, and (iii) how sensitive conclusions are to task order and backbone choice under realistic resource constraints.
Specifically, MedCL-Bench provides a standardized benchmark suite and evaluation pipeline for controlled comparison of continual learning strategies in biomedical NLP. First, we curate and standardize ten public biomedical NLP datasets into a continual learning benchmark suite spanning five task families: biomedical question answering (PubMedQA, BioASQ), scientific fact checking (SciFact, PubHealth), relation extraction (GAD, ChemProt, DDI), document-level classification (PubMed_RCT, DRUGLIB), and multi-label topic classification of biomedical literature (LitCovid). Second, we evaluate sequential updates across eight pre-specified task orders with a unified preprocessing and evaluation protocol, enabling matched comparisons across continual learning strategies and backbone architectures. Third, we benchmark representative continual learning strategies—including naive fine-tuning, multi-task learning [16], regularization [17–19], rehearsal/gradient projection [20–22], generative replay [23–25], and parameter-efficient adaptation [26, 27]—and report end-of-stream performance together with compute- and parameter-efficiency measures (including GPU-hour cost) to characterize stability–efficiency trade-offs. Finally, we release the full benchmark codebase—preprocessing, training, and evaluation—to enable reproducible comparisons and further research.

Using this benchmark, we further find that retention is not uniform across biomedical tasks. Tasks with more constrained output structure, such as multiple-choice question answering and multi-class relation extraction, are comparatively robust to sequential updates, whereas multi-label topic classification with overlapping label sets is markedly more vulnerable to forgetting.
These patterns suggest that forgetting in biomedical continual learning depends not only on task difficulty, but also on task formulation and output structure.

[Figure 1 panels: (a) Motivation: sequential updates in the biomedical domain; (b) Problem: catastrophic forgetting; (c) MedCL-Bench: datasets and task families (10 datasets, 5 task families); (d) Overview: MedCL-Bench workflow; (e) Evaluation metrics; (f) Key questions: forgetting severity, method comparison, order sensitivity, compute efficiency, scaling behavior.]

Fig. 1 | Overview of MedCL-Bench.
(a) Biomedical knowledge and datasets evolve continuously (e.g., new literature and drug–disease relations), creating realistic sequential update streams—both across institutions (where data cannot be pooled) and within an institution over time. (b) Sequential fine-tuning can overwrite previously acquired capabilities (catastrophic forgetting), whereas CL aims to retain prior knowledge while learning new tasks. (c) MedCL-Bench comprises ten biomedical NLP datasets grouped into five task families (QA, fact checking, relation extraction, document classification, and multi-label topic classification). (d) Benchmark workflow: a pretrained backbone is updated sequentially on a task stream under multiple task orders and evaluated on all previously seen tasks after each stage. (e) CL metrics reported in this work: overall task performance (AP), backward transfer (BWT), and forward transfer (FWT). (f) Key questions addressed: forgetting severity, method comparison, order sensitivity, compute efficiency, and scaling/backbone dependence. Icons are sourced from Flaticon.com (full attributions in Supplementary Note 1).

Results

We evaluate continual learning on MedCL-Bench, a stream of ten public biomedical NLP datasets spanning five task types: biomedical question answering, scientific fact checking, relation extraction, document-level classification, and multi-label topic classification. A detailed overview of all datasets is provided in Extended Data Table 1. Tasks are presented sequentially under eight randomized task orders (Extended Data Table 2). Models are incrementally updated without access to future tasks. After each training stage, models are evaluated on all previously encountered tasks, enabling systematic assessment of both knowledge acquisition and retention.
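This update-then-evaluate protocol can be sketched as a loop that fills a stage-by-task score matrix; a minimal illustration, where `train_one_task` and `evaluate` are hypothetical stand-ins for the actual fine-tuning and scoring routines:

```python
def run_stream(model, tasks, train_one_task, evaluate):
    """Sequentially fine-tune on each task; after every stage, evaluate
    on all previously seen tasks (never on future tasks).

    Returns R, where R[i][j] is the score on task j after stage i.
    Entries with j > i stay 0.0, since task j is still unseen.
    """
    T = len(tasks)
    R = [[0.0] * T for _ in range(T)]
    for i, task in enumerate(tasks):
        model = train_one_task(model, task)   # no access to future tasks
        for j in range(i + 1):                # evaluate on all seen tasks
            R[i][j] = evaluate(model, tasks[j])
    return R
```

The lower-triangular matrix R is the raw material for all retention and transfer diagnostics reported below.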
Unless stated otherwise, results use the T5-base backbone; additional backbones are considered only in the scaling experiments.

Across all experiments, we report three continual learning metrics: average performance (AP; mean of per-task accuracies over all tasks after completing a task order), backward transfer (BWT; less negative indicates less forgetting), and forward transfer (FWT; less negative indicates better transfer). Table 1 summarizes AP/BWT/FWT for each method under each of the eight task orders.

Overall performance across methods

Using Table 1, we highlight the main trends across methods. Multi-task learning (MULTI) provides an empirical upper bound, achieving consistently high AP (~76%) across orders when all tasks are jointly optimized. Among continual learning methods, ADAPTER and TCL attain the strongest and most consistent AP (72.01–73.27% and 69.75–71.37%, respectively), closely approaching the upper bound. GEM also achieves high AP but is more order-sensitive (66.83–73.69% across orders), whereas REPLAY improves over naïve sequential fine-tuning (VANILLA) yet remains lower overall (58.05–63.55%). VANILLA denotes standard sequential fine-tuning with no CL mechanism.

For forgetting, BWT values reveal a sharp separation between method families. VANILLA shows severe forgetting with consistently negative BWT across all orders, whereas regularization baselines (EWC, L2) only partially mitigate forgetting and still yield negative BWT. In contrast, memory-based approaches—especially GEM and REPLAY—substantially mitigate forgetting, with BWT values markedly closer to zero. FWT is consistently negative across methods, indicating limited forward transfer. VANILLA forgets most (BWT down to −57.69), while replay/constraint methods improve retention (e.g., GEM: −6.88 to −1.33). Notably, OLORA remains highly forgetting-prone (BWT −43.92 to −29.33), underscoring that stability requires explicit retention mechanisms, not only parameter-efficient updates.

Table 1 | Overall performance on MedCL-Bench. For each method, we report average task performance (AP), backward transfer (BWT), and forward transfer (FWT) across ten biomedical NLP tasks and eight randomized task orders. Higher is better. "Train. Para." is the number of trainable parameters; "/" marks metrics not defined for that method.

Method  | Train. Para. | Metric | Order 1 | Order 2 | Order 3 | Order 4 | Order 5 | Order 6 | Order 7 | Order 8
VANILLA | 222M  | AP  | 23.38  | 36.24  | 19.37  | 34.45  | 22.87  | 34.30  | 33.32  | 24.21
        |       | BWT | -52.07 | -31.72 | -57.69 | -41.95 | -49.95 | -41.78 | -42.80 | -50.65
        |       | FWT | -22.75 | -21.22 | -26.05 | -23.55 | -28.19 | -23.23 | -32.98 | -27.80
EWC     | 222M  | AP  | 51.17  | 50.96  | 41.34  | 48.82  | 48.13  | 45.35  | 43.16  | 56.19
        |       | BWT | -17.07 | -25.10 | -33.82 | -27.31 | -25.32 | -30.72 | -31.65 | -19.65
        |       | FWT | -22.10 | -21.51 | -25.92 | -23.49 | -28.00 | -23.68 | -32.88 | -27.81
L2      | 222M  | AP  | 53.31  | 61.42  | 61.03  | 55.73  | 51.86  | 57.33  | 62.19  | 69.89
        |       | BWT | -23.87 | -14.06 | -14.01 | -21.45 | -24.09 | -19.82 | -10.26 | -5.19
        |       | FWT | -21.51 | -20.35 | -26.16 | -22.48 | -27.96 | -22.10 | -32.25 | -27.23
LAMOL   | 222M  | AP  | 54.01  | 54.65  | 47.66  | 50.92  | 49.55  | 54.33  | 51.19  | 54.21
        |       | BWT | -20.73 | -20.97 | -25.88 | -25.28 | -24.27 | -17.96 | -19.66 | -19.95
        |       | FWT | -22.31 | -20.96 | -26.36 | -23.36 | -28.09 | -22.79 | -32.81 | -27.76
GEM     | 222M  | AP  | 69.77  | 71.35  | 70.88  | 71.10  | 71.17  | 72.06  | 66.83  | 73.69
        |       | BWT | -6.88  | -4.75  | -3.52  | -4.42  | -4.15  | -4.29  | -4.85  | -1.33
        |       | FWT | -21.76 | -20.37 | -26.26 | -22.93 | -27.78 | -22.53 | -32.30 | -27.57
AGEM    | 222M  | AP  | 45.13  | 50.98  | 50.30  | 59.18  | 48.68  | 52.88  | 48.09  | 59.16
        |       | BWT | -32.21 | -24.05 | -26.41 | -17.90 | -26.88 | -24.13 | -26.53 | -15.93
        |       | FWT | -22.59 | -21.17 | -26.33 | -23.33 | -27.81 | -23.46 | -32.81 | -27.69
REPLAY  | 222M  | AP  | 58.05  | 61.93  | 62.57  | 63.55  | 58.11  | 58.79  | 58.87  | 60.89
        |       | BWT | -15.10 | -10.81 | -9.41  | -9.12  | -16.43 | -5.29  | -12.68 | -6.52
        |       | FWT | -22.38 | -21.28 | -26.31 | -23.33 | -27.94 | -23.23 | -33.01 | -27.65
OLORA   | 2.5M  | AP  | 39.06  | 45.26  | 46.84  | 35.05  | 39.31  | 44.45  | 39.32  | 47.55
        |       | BWT | -36.62 | -34.18 | -30.98 | -43.92 | -29.33 | -33.55 | -37.16 | -29.49
        |       | FWT | -22.45 | -20.47 | -27.03 | -22.57 | -29.65 | -23.12 | -35.41 | -28.31
ADAPTER | 37.4M | AP  | 72.55  | 72.01  | 73.27  | 72.43  | 72.36  | 72.54  | 72.69  | 72.02
        |       | BWT | /      | /      | /      | /      | /      | /      | /      | /
        |       | FWT | -22.92 | -21.31 | -26.60 | -23.57 | -28.05 | -23.72 | -32.91 | -27.72
TCL     | 17.9M | AP  | 70.26  | 70.27  | 70.62  | 70.63  | 71.14  | 71.37  | 70.18  | 69.75
        |       | BWT | /      | /      | /      | /      | /      | /      | /      | /
        |       | FWT | -22.71 | -21.23 | -26.52 | -23.38 | -27.76 | -23.77 | -32.64 | -27.31
MULTI   | 222M  | AP  | 75.88  | 75.76  | 75.15  | 76.04  | 76.50  | 76.20  | 76.97  | 76.12
        |       | BWT | /      | /      | /      | /      | /      | /      | /      | /
        |       | FWT | /      | /      | /      | /      | /      | /      | /      | /

Order robustness and statistical reliability

Figure 2 quantifies the sensitivity of final AP to task-order permutations. Fig. 2a reports mean final AP across eight orders with 95% bootstrap confidence intervals obtained by resampling task orders (n=8); MULTI provides an empirical upper bound with a narrow interval. Among CL methods, ADAPTER and TCL show the highest mean AP with the tightest intervals, whereas GEM achieves comparable mean AP but with wider intervals, indicating residual order sensitivity. REPLAY improves over VANILLA but remains lower on average.

Complementing the CI-based view, Fig. 2b summarizes order sensitivity using the standard deviation of final AP across the eight orders. ADAPTER, TCL, and MULTI show the smallest observed variability, whereas GEM and REPLAY exhibit intermediate sensitivity. Several baselines (notably VANILLA and L2) have the largest standard deviations, indicating that their outcomes can vary substantially under different task permutations. Per-order final AP values are shown in Supplementary Note 2.
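For reference, the three metrics reported in Table 1 can be computed from the stage-by-task accuracy matrix produced by the evaluation protocol. A minimal sketch under the standard definitions; the zero-shot baseline `b` used for FWT, and all names, are our assumptions:

```python
import numpy as np

def cl_metrics(R, b=None):
    """AP, BWT, FWT from R, where R[i, j] is accuracy on task j after
    training stage i (T x T, lower triangle filled).

    AP  = mean accuracy over all tasks after the final stage.
    BWT = mean over earlier tasks of (final accuracy - accuracy just
          after that task was learned); less negative = less forgetting.
    FWT = mean over later tasks of (accuracy just before the task is
          trained - zero-shot baseline b); less negative = better transfer.
    """
    R = np.asarray(R, dtype=float)
    T = R.shape[0]
    b = np.zeros(T) if b is None else np.asarray(b, dtype=float)
    ap = R[-1].mean()
    bwt = np.mean(R[-1, :-1] - np.diag(R)[:-1])
    fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])
    return ap, bwt, fwt
```

Under these definitions BWT is exactly zero for methods that never revisit or overwrite task-specific parameters, which is why the "/" entries appear for parameter-isolation methods in Table 1.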
Treating task order as a matched block (n=8), paired exact sign-flip tests with Holm correction confirm that ADAPTER and TCL outperform REPLAY, whereas L2 does not (Supplementary Note 3).

[Figure 2 panels: (a) Final AP across 8 task orders (mean ± 95% bootstrap CI); (b) Order sensitivity: standard deviation of final AP across orders (lower = more robust).]

Fig. 2 | Order robustness and statistical reliability on MedCL-Bench (T5-base). (a) Mean final AP across eight randomized task orders with 95% bootstrap confidence intervals for the mean, obtained by resampling task orders (n=8). (b) Order sensitivity measured as the standard deviation (s.d.) of final AP across orders (lower indicates stronger robustness). Together, these panels summarize both average performance and sensitivity to task-order permutations.

Forgetting Dynamics Across Task Orders

To understand how task order affects stability during training, we examine stage-wise trajectories. Fig. 3a shows AP_t, the mean accuracy over tasks seen up to stage t, under four task orders (Orders 1–4); trajectories for Orders 5–8 are provided in Supplementary Note 4. This metric provides an intuitive view of the stability–plasticity trade-off: sharp drops indicate substantial forgetting and interference with previously learned tasks, whereas flat trajectories indicate robust retention.

Across all orders, VANILLA exhibits recurrent collapses as training progresses.
Although the drops occur at different transition points depending on the permutation, the overall pattern remains consistent: performance on previously learned tasks deteriorates under long-horizon sequential updates and only partially recovers.

Memory-based methods provide the most consistent stabilization in these trajectories. In particular, REPLAY (experience replay) and GEM maintain comparatively smooth trajectories across orders, with markedly reduced drops when new tasks arrive. Notably, REPLAY interleaves a memory buffer of past samples during training, whereas GEM additionally enforces gradient constraints—projecting updates to avoid increasing loss on stored examples—at the cost of extra computation. Regularization-based approaches (EWC, L2) provide only partial protection and still exhibit noticeable degradations at several transitions.

Transition-level diagnosis of order sensitivity

While the forgetting curves in Fig. 3a summarize the stage-wise evolution of AP_t, they do not identify which task switches cause abrupt changes. We therefore analyze transition shock, defined as ΔAP_t = AP_{t+1} − AP_t (in percentage points), where AP_t averages performance over all tasks observed up to stage t. Fig. 3b reports transition shocks for Orders 1–4; Orders 5–8 are provided in Supplementary Note 5.

Across task orders, memory-based methods (REPLAY, GEM) and parameter-isolation methods (ADAPTER, TCL) tend to reduce the magnitude of negative transition shocks, whereas sequential fine-tuning (VANILLA) and regularization baselines (EWC, L2) exhibit larger drops at multiple switches, indicating stronger interference when new tasks are introduced. This transition-level view aligns with the aggregate trends in Table 1: methods with larger and more frequent negative shocks typically show more negative BWT (greater forgetting).
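The transition-shock diagnostic is straightforward to reproduce from a stage-wise AP trajectory; a minimal sketch (function names and the toy stage labels are ours):

```python
import numpy as np

def transition_shocks(ap_trajectory):
    """ΔAP_t = AP_{t+1} - AP_t (percentage points) for each consecutive
    pair of stages; large negative values flag task switches that
    interfere with previously learned tasks."""
    return np.diff(np.asarray(ap_trajectory, dtype=float))

def worst_transition(ap_trajectory, stage_names):
    """Return (from_stage, to_stage, shock) for the most negative shock."""
    shocks = transition_shocks(ap_trajectory)
    k = int(np.argmin(shocks))
    return stage_names[k], stage_names[k + 1], float(shocks[k])
```

Applied per method and per order, this yields exactly the cells of the heatmaps in Fig. 3b.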
Finally, while FWT captures transfer to a task before it is trained, ΔAP_t reflects the net change after learning the next task (acquisition minus interference), so strong FWT does not necessarily imply small transition shocks. Taken together, these trajectory- and transition-level diagnostics show that method stability can depend on task permutations, motivating evaluation across multiple orders rather than relying on a single sequence. We next examine whether forgetting differs systematically across task families.

[Figure 3a: forgetting curves (seen-task average accuracy, %, vs. training stage) for Orders 1–4; each panel lists that order's task sequence, e.g., Order 1: BioASQ → GAD → Pubmed_RCT → SciFact → DRUGLIB → LitCovid → DDI → PubMedQA → PubHealth → ChemProt.]

(a) Forgetting curves under four task orders. Trajectories of seen-task average accuracy across stages for Orders 1–4.

Fig. 3 | Order-dependent forgetting dynamics in MedCL-Bench. (a) Forgetting curves. (b) Transition shock heatmaps (next page).
[Figure 3b: transition shock heatmaps (ΔAP across task transitions, percentage points) for Orders 1–4; rows = methods, columns = task transitions t → t+1; cooler colors indicate interference.]

(b) Transition shock heatmaps under four task orders. ΔAP_t = AP_{t+1} − AP_t (percentage points); cooler indicates interference.

Task-family differences in forgetting

To measure how much performance on each task degrades after subsequent updates, we compute per-task forgetting from the per-stage evaluations underlying Fig. 3a. Specifically, forgetting on task t is defined as Δ_t = a_t^post − a_t^end (percentage points), where a_t^post is the accuracy on task t evaluated immediately after training on t (i.e., after stage t) and a_t^end is the accuracy on the same task evaluated after completing the full 10-task stream (i.e., after the final stage).
We then group the ten datasets into five task families using a fixed dataset-to-family mapping (Supplementary Note 6) and visualize the resulting forgetting distributions across task orders for a representative subset of methods (Fig. 4). For clarity, we omit MULTI because it is trained jointly and does not yield the sequential trajectory required to define forgetting, and we omit ADAPTER and TCL because their task-specific parameter isolation can yield near-zero forgetting, which would compress the scale and obscure differences among non-isolation methods.

Across task orders, we observe clear family-level differences: MultiLabel tends to incur the largest forgetting, whereas QA and RE typically exhibit smaller losses, particularly for replay-based and other stability-oriented methods (e.g., REPLAY, GEM, L2). Because Δ can be negative (e.g., backward transfer or evaluation variability), we use the clipped measure max(Δ, 0) as our primary forgetting metric, and report the raw (unclipped) distributions in Supplementary Note 7.

[Figure 4: box plots of family-level forgetting distributions (percentage points, clipped at 0) by method, for the families QA, FactCheck, RE, DocCls, and MultiLabel.]

Fig. 4 | Task-family forgetting distributions by method. For each task, forgetting is measured as Δ_t = a_t^post − a_t^end (percentage points), i.e., the accuracy difference from immediately after learning the task to the end of the 10-task stream. Tasks are grouped into five families and distributions are aggregated across task orders. Boxes summarize the distribution over (order, task) instances and points show individual observations. We plot the clipped forgetting max(Δ, 0) to isolate performance loss without allowing gains (e.g., backward transfer) to offset forgetting.
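The per-task forgetting measure and its clipped variant can be sketched as follows (a minimal illustration; the matrix convention R[i, j] = accuracy on task j after stage i is our assumption):

```python
import numpy as np

def per_task_forgetting(R, clip=True):
    """Δ_t = a_t^post - a_t^end per task, where a_t^post = R[t, t]
    (accuracy right after learning task t) and a_t^end = R[-1, t]
    (accuracy after the full stream). Positive Δ means lost accuracy;
    clipping at 0 keeps gains (backward transfer) from offsetting loss."""
    R = np.asarray(R, dtype=float)
    delta = np.diag(R) - R[-1]
    return np.maximum(delta, 0.0) if clip else delta
```

Grouping the resulting per-task values by the dataset-to-family mapping then yields the family-level distributions plotted in Fig. 4.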
Scaling to LLMs

We next examine whether our findings generalize across backbone architectures, moving from an encoder–decoder model (T5-base) to decoder-only LLMs of different sizes: Qwen-0.6B and Qwen-4B. Unless noted otherwise, scaling results use a single pre-specified task order (Order 1) and 5 epochs per task for tractability; the main benchmark uses 10 epochs per task (Table 1). We use the same unified task formulation and evaluation protocol as in the main experiments to ensure comparability across backbones. To contextualize scaling trends, Extended Data Table 3 reports the final AP and the number of trainable parameters for each method under each backbone ("Train. Para." counts parameters with requires_grad=True). Fig. 5a visualizes the corresponding method-wise AP after completing the 10-task stream.

Backbone scaling exhibits method-dependent and non-monotonic effects

Extended Data Table 3 and Fig. 5a show that scaling effects are strongly method-dependent and can be non-monotonic. Replacing T5 with a small decoder-only backbone (Qwen-0.6B) improves several baselines (e.g., VANILLA: 23.2 → 41.43; EWC: 30.45 → 44.98) but substantially degrades others (e.g., GEM: 68.83 → 38.22; ADAPTER: 68.43 → 59.67; TCL: 65.32 → 57.26), indicating that architectural changes alone do not uniformly mitigate catastrophic forgetting.

Scaling further to Qwen-4B yields clearer improvements for many methods, particularly regularization-based approaches (EWC, L2) and replay-based training (REPLAY, LAMOL). In contrast, the multi-task upper bound increases only modestly (MULTI: 75.08 → 77.89), suggesting that additional capacity primarily reduces interference in the sequential setting rather than dramatically raising the joint-training ceiling.
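The trainable-parameter counts follow the requires_grad=True convention noted above, which in PyTorch can be reproduced with a one-line helper; a minimal sketch (the helper name is ours):

```python
import torch.nn as nn

def trainable_params(model: nn.Module) -> int:
    """Number of parameters with requires_grad=True. Frozen backbone
    weights are excluded, so adapter-style methods report only their
    added task-specific capacity."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Freezing a tensor (`p.requires_grad_(False)`) removes it from the count, which is why parameter-isolation methods such as ADAPTER and TCL report far fewer trainable parameters than full fine-tuning of the same backbone.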
Ov erall, backbone scaling reshapes the relative ranking of con tin ual learning strategies rather than providing uniform gains across metho ds. P arameter-efficien t metho ds b enefit unev enly from scaling P arameter-efficient approaches exhibit nuanced scaling b eha vior. ADAPTER and TCL impro ve from Qwen-0.6B to Qw en-4B but do not reco v er the relative strength observed on T5, consisten t with the fact that these metho ds freeze most bac kb one parameters and rely on limited task-specific capacit y . As the backbone scales, the fraction of train- able parameters b ecomes increasingly small, p oten tially turning the adapter module in to a bottleneck that limits effectiv e adaptation. By contrast, the extremely compact OLORA shows only marginal improv emen t and remains substantially w eaker than repla y- or adapter-based methods, suggesting that aggressive parameter compression can undermine robustness to in terference ev en on larger backbones. Gradien t-pro jection metho ds are highly architecture-sensitiv e Gradien t-pro jection-based GEM exhibits pronounced sensitivity to bac kb one archi- tecture. While GEM is among the strongest m ethods on the enco der–deco der T5 backbone, its p erformance drops sharply on deco der-only mo dels (38.22% on Qw en-0.6B and 55.60% on Qw en-4B), falling b elo w repla y- and regularization-based approac hes on the same bac kb ones. A plausible explanation is that gradien t-pro jection constrain ts can b e more restrictive under deco der-only training dynamics, limiting effectiv e up dates compared to the enco der–deco der setting. W e lea ve a mec hanistic analysis of this architecture sensitivit y to future work. Computational cost and efficiency trade-offs W e measure training cost in GPU-hours, computed as w all-clo c k time m ultiplied b y the n umber of GPUs used. Fig. 5b rep orts the absolute training cost of V ANILLA for eac h backbone, while Figs. 
5c–5e plot the average performance versus training-cost overhead normalized by the corresponding VANILLA run. Full per-method training costs for each backbone are reported in Supplementary Note 8. MULTI incurs a moderate overhead relative to VANILLA, ranging from roughly 2× on T5-base to ~6× on Qwen-4B. In contrast, replay-based methods are consistently more expensive due to interleaving memory samples, requiring roughly 4–9× the GPU-hours of sequential fine-tuning across backbones. On T5-base, GEM is also substantially costlier (>6×), reflecting its additional per-step gradient constraints.

Taken together, these experiments reveal a nuanced scaling picture. (i) Moving from T5 to a small decoder-only LLM (Qwen-0.6B) does not automatically improve performance in biomedical NLP and can even degrade performance for some methods. (ii) Scaling further to Qwen-4B substantially mitigates catastrophic forgetting for many methods, particularly replay-based and regularization-based approaches. (iii) Algorithmic design remains crucial: replay-based methods and adapter-style parameter isolation provide consistently strong stability–efficiency trade-offs, whereas GEM shows pronounced backbone sensitivity in our experiments.

[Figure 5 panels omitted. Panel (b) values: VANILLA training cost per backbone is 0.79 (T5), 15.14 (Qwen-0.6B), and 63.33 (Qwen-4B) GPU-hours.]

Fig. 5 | Backbone scaling reshapes performance and efficiency trade-offs (Order 1). (a) Final AP after the 10-task stream on T5-base, Qwen-0.6B and Qwen-4B. (b) Absolute training cost of VANILLA in GPU-hours for each backbone. (c–e) AP versus relative training cost for T5-base, Qwen-0.6B and Qwen-4B, where cost is normalized by the corresponding VANILLA run on the same backbone.

Discussion

In this study, we introduced MedCL-Bench, a continual learning benchmark that streams ten biomedical NLP datasets spanning five task families under a unified task formulation and evaluation protocol. Motivated by the need to update biomedical NLP models over time without retaining or retraining on all historical data, we use MedCL-Bench to characterize how CL strategies trade off adaptation to new tasks against retention of previously learned capabilities. Across task-order permutations, VANILLA shows substantial regression on earlier tasks, whereas memory- and parameter-isolation strategies yield more stable behavior, with regularization-based baselines providing more limited protection. More broadly, a central question is whether CL methods can reliably update biomedical models over evolving datasets while mitigating catastrophic forgetting, and which strategies offer the best trade-offs among end-of-stream performance, order robustness, parameter efficiency, and compute.
MedCL-Bench is designed to address this question by evaluating methods across multiple task orders and complementary diagnostics that capture both end-of-stream outcomes and within-stream dynamics; we summarize the key findings below.

First, MedCL-Bench exposes large and systematic differences in stability across CL strategies. Across eight randomized task orders, VANILLA exhibits severe catastrophic forgetting, whereas replay/constraint-based methods and parameter-isolation approaches substantially improve retention (Table 1). Forward transfer remains limited across methods on MedCL-Bench, underscoring that mitigating interference is the dominant challenge under long biomedical task streams.

Second, robustness to task-order permutations is a necessary reporting dimension rather than a secondary diagnostic. Higher AP does not necessarily imply greater order robustness: some methods achieve competitive averages yet remain order-sensitive, whereas ADAPTER and TCL combine high average performance with consistently low across-order variability and the narrowest uncertainty intervals (Fig. 2). Several baselines show pronounced variability across orders, so conclusions drawn from a single permutation can be unreliable. Paired matched-order tests further show that key method differences persist across permutations under correction (Supplementary Note 3), supporting the use of task orders as matched blocks for statistical reliability.

Third, forgetting can be concentrated at specific task transitions. We quantify these drops with the transition shock ΔAP_t (Fig. 3b), and larger or more frequent negative shocks are consistent with more negative BWT (Table 1). This transition-level view complements aggregate metrics by identifying when stability mechanisms succeed or fail during the stream.
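A transition-level diagnostic of this kind can be computed directly from the task-wise accuracy matrix. The paper's exact definition of ΔAP_t accompanies Fig. 3b; the sketch below is illustrative only and assumes one plausible formulation, namely the change in mean accuracy on previously seen tasks across each transition. The function name and toy numbers are hypothetical:

```python
def transition_shock(R):
    """Per-transition change in mean accuracy on earlier tasks (assumed form).

    R[t][i] = accuracy on task i after training through task t (i <= t),
    0-indexed. The shock at transition t compares the mean accuracy over
    tasks 0..t-1 before and after task t is learned.
    """
    shocks = []
    for t in range(1, len(R)):
        prev = sum(R[t - 1][i] for i in range(t)) / t  # before learning task t
        curr = sum(R[t][i] for i in range(t)) / t      # after learning task t
        shocks.append(curr - prev)
    return shocks

# Toy 3-task stream: task 0 drops sharply once task 1 is learned.
R = [
    [90.0],
    [60.0, 85.0],
    [58.0, 80.0, 88.0],
]
print(transition_shock(R))  # [-30.0, -3.5]
```

Under this formulation, a large negative entry pinpoints the transition at which a stability mechanism failed, complementing the end-of-stream BWT average.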
Fourth, forgetting is heterogeneous across task families, highlighting clinically relevant failure modes that can be masked by overall averages. When grouping datasets into five task families, we observe consistent family-level differences in forgetting, with some families exhibiting systematically larger loss across orders (Fig. 4). This heterogeneity suggests that benchmark reporting should include task-family diagnostics in addition to a single averaged score, especially in biomedical settings where failure on specific task types may be unacceptable even if the overall mean appears competitive.

Fifth, scaling to decoder-only LLMs reshapes the relative ranking of CL methods and reveals pronounced architecture dependence. Moving from an encoder–decoder backbone (T5-base) to decoder-only backbones (Qwen-0.6B/4B) does not uniformly improve continual learning; instead, scaling effects are method-dependent and can be non-monotonic (Extended Data Table 3; Fig. 5a). Regularization- and replay-based approaches benefit more consistently from increased decoder-only capacity, whereas parameter-efficient and gradient-projection methods exhibit stronger backbone dependence. In particular, GEM degrades on decoder-only models in our setting, indicating that the relative ranking of CL strategies can be backbone-dependent and motivating backbone-aware validation before deployment.

Sixth, stability gains come with a clear efficiency trade-off that should be reported alongside performance. Replay/constraint-based methods often deliver strong retention and lower transition shocks, but can incur substantial GPU-hour and memory overhead due to buffer maintenance and additional optimization constraints.
In contrast, parameter-efficient approaches reduce the number of trainable parameters and can be more practical under limited compute, yet their benefits may saturate or become capacity-bottlenecked as backbones and task streams scale (Supplementary Note 8). These results highlight that method selection in biomedical settings should consider not only end-of-stream AP and robustness, but also the stability–compute–parameter trade-off under realistic update budgets.

Together, these results support two deployment-relevant recommendations. (i) When continual updates are expected, evaluate and select methods under multiple task orders and include uncertainty, since single-order reporting can be misleading. (ii) Choose methods based on both stability and compute: replay/constraint-based approaches offer strong retention but may incur substantial GPU-hour overhead, while parameter-efficient approaches reduce trainable parameters yet can become capacity-bottlenecked as backbones scale (Supplementary Note 8).

This study has several limitations. First, while the main benchmark evaluates multiple task orders, scaling experiments use a single representative order and fewer training epochs for tractability; extending scaling analyses to multiple permutations and longer horizons will strengthen generality. Second, our largest decoder-only model is limited to 4B parameters; evaluating larger biomedical LLMs may reveal additional scaling regimes. Third, we focus on accuracy under a unified task formulation; future work should incorporate additional deployment-facing criteria (e.g., calibration, robustness, or clinically weighted errors). Finally, exploring hybrid strategies that combine selective replay with parameter isolation, as well as principled memory sizing and sampling, may further improve the stability–compute trade-off.
In summary, MedCL-Bench fills a key gap in biomedical NLP by providing a unified continual learning benchmark that spans five task families, standardizes training and evaluation, and evaluates methods under multiple pre-specified task orders with complementary diagnostics. Using MedCL-Bench, we answer under-explored questions that motivate continual updating in practice: which CL strategies best prevent regressions under long biomedical task streams, how sensitive conclusions are to task-order permutations, and whether forgetting and interference differ systematically across task families. We further show that these conclusions depend on backbone architecture and are shaped by a clear stability–efficiency trade-off, motivating backbone-aware and budget-aware method selection. Overall, our findings indicate that CL outcomes are jointly governed by algorithm design, task-order permutations, task-family heterogeneity, and backbone architecture, providing practical guidance and more robust evaluation standards for updating biomedical language models under realistic constraints.

Methods

Tasks and datasets

MedCL-Bench benchmarks CL on ten publicly available biomedical NLP datasets spanning five task categories: biomedical question answering (BioASQ, PubMedQA), scientific fact checking (SciFact, PubHealth), relation extraction (GAD, ChemProt, DDI), document-level classification (PubMed-RCT, DRUGLIB), and multi-label topic classification of biomedical literature (LitCovid). We adopt the released train/validation/test splits provided by the original datasets or the Zenodo benchmark release when applicable, and only construct missing splits when they are not provided. For SciFact, the official test split is unlabeled; we therefore use the original validation split as the test set and create a new validation split from the training data.
For DRUGLIB, which provides train/test splits only, we create a validation split by sampling 10% of the training set. For BioASQ (Task 7b), we retain only yes/no questions with gold answers in {yes, no}, use the first snippet as context, and create a deterministic train/validation split by hashing question IDs; the golden enriched set is used as the test set. For PubMed-RCT, we apply light text cleaning and low-information filtering within each split, and (for controlled experiments) subsample up to 1,000 training sentences per section label. For PubHealth, we convert the released parquet files into our unified JSONL format, keeping the claim text and the four-way labels (true/false/mixture/unproven) after filtering malformed entries. For GAD, we keep the provided splits and rewrite each example into a unified binary MCQ format, where the input asks whether a gene–disease relation is present and provides two options (A: has relation, B: no relation); we map the original labels to A/B and keep the placeholder-marked text (e.g., @GENE, @DISEASE) as-is. For the remaining datasets (PubMedQA, ChemProt, DDI, and LitCovid), we directly use the released train/validation/test splits from the Zenodo benchmark release (record 14025500) as-is.

Experimental setup

Model backbones

Unless stated otherwise, all main benchmark experiments use T5-base as the pretrained backbone (encoder–decoder). For scaling experiments, we additionally evaluate two decoder-only backbones, Qwen-0.6B and Qwen-4B.

Continual learning protocol

MedCL-Bench evaluates continual learning by exposing a pretrained backbone to a stream of T = 10 tasks. Tasks are presented sequentially under eight pre-specified task-order permutations. We fix the set of orders and use the same orders for all methods to enable matched comparisons across algorithms.
For the main benchmark results, we aggregate performance over the eight orders and report order sensitivity/uncertainty (e.g., s.d. and bootstrap CIs) to guard against order-specific conclusions. For the LLM-scaling experiments, we use a single order (Order 1) for computational tractability, as detailed in the Results section ("Scaling to LLMs").

Data curation and split caps

To ensure a controlled and computationally consistent setting, we cap each dataset split on-the-fly to at most 1,000 training instances, 500 validation instances, and 500 test instances. When a split exceeds the cap, we apply fixed-seed stratified subsampling within that split to preserve label proportions. For datasets that do not provide all official splits (e.g., missing validation or a labeled test split), we construct the missing split(s) following the dataset-specific procedures described above.

Unified input/output format

We cast all datasets into a unified discriminative classification format. Each instance is represented as a single text input (the sentence field), and the model predicts a label from a task-specific closed label set. For QA tasks, the input provides a question and supporting context (and may optionally enumerate answer options), and the model predicts one answer label (BioASQ: yes/no; PubMedQA: yes/no/maybe). For relation and classification tasks (e.g., ChemProt, DDI, GAD, PubMed-RCT, DRUGLIB, SciFact, PubHealth), the model predicts one categorical label. For LitCovid, labels are multi-valued and represented as a semicolon-separated set; we evaluate predictions by exact set match (subset accuracy) after splitting labels by ";" in an order-insensitive manner.

Evaluation metrics

MedCL-Bench spans heterogeneous biomedical task types whose original papers use different evaluation measures (e.g., accuracy, micro-/macro-F1).
Directly aggregating such task-specific metrics would make CL summaries ill-defined and can introduce scale inconsistencies when averaging across tasks and task orders. We therefore cast all datasets into a unified classification setting and use accuracy as the primary metric, enabling consistent cross-task comparison and statistically coherent continual learning aggregates. This design follows common practice in multi-task and continual learning benchmarks [26, 28, 29] that enforce a shared output space (e.g., mapping heterogeneous tasks to a unified discriminative objective or to end-to-end generation) to support comparable aggregate measures. All results are reported as percentages. For single-label tasks, an example is correct if the predicted label matches the gold label. For LitCovid (multi-label), we use subset accuracy (exact match): an example is counted as correct if and only if the predicted label set exactly matches the gold label set (order-insensitive, split by ";").

Additionally, to obtain overall continual learning comparisons, we follow standard definitions [20, 26, 28, 29] based on the task-wise accuracy matrix. Let $R_{t,i}$ denote the accuracy on task $i$ after training up to task $t$ (with $i \le t$) in the sequence, with $T$ total tasks. In our benchmark, $T = 10$.

(1) Average Performance (AP) is the average accuracy across all tasks after learning all the tasks:

$$\mathrm{AP} = \frac{1}{T} \sum_{i=1}^{T} R_{T,i}.$$

(2) Backward Transfer (BWT) [20, 30] quantifies the impact of new learning on previous tasks:

$$\mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( R_{T,i} - R_{i,i} \right).$$

(3) Forward Transfer (FWT) measures the influence of previously learned tasks on future tasks before they are trained, relative to the initial pretrained (zero-shot) baseline:

$$\mathrm{FWT} = \frac{1}{T-1} \sum_{i=2}^{T} \left( R_{i-1,i} - R_{0,i} \right),$$

where $R_{0,i}$ denotes the performance on task $i$ of the initial pretrained model before learning any tasks (zero-shot baseline).
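The three definitions map directly onto the accuracy matrix $\{R_{t,i}\}$. A minimal plain-Python sketch (0-indexed; function names and toy numbers are illustrative only), with the LitCovid subset-accuracy rule included:

```python
def subset_accuracy(pred: str, gold: str) -> bool:
    """LitCovid-style exact set match over semicolon-separated labels."""
    labels = lambda s: {x.strip() for x in s.split(";") if x.strip()}
    return labels(pred) == labels(gold)

def cl_metrics(R, R0):
    """Compute AP, BWT, FWT from the task-wise accuracy matrix.

    R[t][i] = accuracy on task i after training through task t (0-indexed);
    entries with i > t are evaluations on not-yet-trained tasks, which FWT
    requires. R0[i] = zero-shot accuracy of the pretrained model on task i.
    """
    T = len(R)
    ap = sum(R[T - 1][i] for i in range(T)) / T
    bwt = sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)
    fwt = sum(R[i - 1][i] - R0[i] for i in range(1, T)) / (T - 1)
    return ap, bwt, fwt

# Toy 3-task example (accuracies in %):
R = [
    [80.0, 40.0, 30.0],
    [70.0, 85.0, 35.0],
    [66.0, 75.0, 90.0],
]
R0 = [50.0, 30.0, 25.0]
print(cl_metrics(R, R0))  # (77.0, -12.0, 10.0)
```

Note that FWT needs each task evaluated before it is trained, so the matrix must also store the off-diagonal entries with $i > t$ rather than only the lower triangle used by AP and BWT.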
Baselines

We compare representative baselines spanning sequential fine-tuning, multi-task learning, regularization, rehearsal/gradient projection, and generative replay. All methods share the same backbone, preprocessing, task orders, and evaluation protocol.

Sequential fine-tuning (VANILLA) trains on tasks one by one with standard fine-tuning and no explicit mechanism to mitigate forgetting.

Multi-task learning: MULTI jointly trains on the union of all tasks and serves as a non-continual reference.

Regularization-based: EWC [17] adds a Fisher-based penalty to prevent changing parameters important for previous tasks. L2 anchors parameters to the previous-task solution via an $\ell_2$ penalty.

Rehearsal/gradient projection: REPLAY [31] maintains an episodic memory of past examples and mixes them with current-task data. GEM [20] and AGEM [21] use episodic memory to constrain updates and reduce interference with past tasks.

Generative replay: LAMOL [25] performs replay by augmenting training with pseudo-samples generated from a model snapshot taken prior to learning the current task.

Parameter-efficient adaptation: ADAPTER [26] inserts lightweight bottleneck adapters into the backbone and freezes all original backbone parameters, training only the adapter (and other explicitly enabled lightweight) parameters. TCL [27] and OLORA [32] are additional parameter-efficient baselines that restrict training to a small set of task-conditioned lightweight parameters (e.g., task embeddings / low-rank updates) while keeping the backbone frozen, following their standard practice.

Hyperparameter control

Unless otherwise specified, we keep shared optimization hyperparameters (e.g., epochs, learning rate, batch sizes, early stopping) consistent across baselines for a given backbone.
Method-specific hyperparameters (e.g., the episodic memory size M and the LAMOL augmentation ratio ρ) are fixed across methods whenever they represent a shared resource budget, to ensure fair comparisons. For rehearsal-based methods (REPLAY/GEM/AGEM), we use an episodic memory of size M = 5 per task in our implementation. For LAMOL, we set percentage_LAMOL = 0.1, meaning that for each task we generate pseudo-samples amounting to 10% of its training set using the model snapshot taken before learning that task. For ADAPTER on T5, we use a bottleneck size of 48 and update only adapter-related parameters. For Qwen-based ADAPTER, we follow the same adapter-only training protocol and use a bottleneck size of 128/512 for the 0.6B and 4B backbones, respectively. We use OLoRA with rank r = 8 and α = 16.

Implementation details

All methods are implemented in PyTorch with HuggingFace Transformers and DeepSpeed. For each backbone, we use the same optimizer and training budget (epochs, learning rate, and batch sizes) across methods, and report results from the final checkpoint. After finishing each task, we evaluate the model on all tasks seen so far to form the accuracy matrix $\{R_{t,i}\}$ used to compute AP/BWT/FWT. The experiments were run on NVIDIA A100 40GB GPUs.

Use of large language models

ChatGPT was used for language polishing; all scientific content and final text were authored and verified by the authors.

Data Availability

All datasets used in this work are publicly available.
• PubMedQA: https://zenodo.org/records/14025500
• BioASQ (Task 7b): the official training set (Training 7b) and test gold annotations (7b golden enriched) are available from the BioASQ Participants Area Datasets page (registration required for downloading the training set): https://participants-area.
bioasq.org/datasets/
• PubHealth: https://huggingface.co/datasets/ImperialCollegeLondon/health_fact
• SciFact: https://github.com/allenai/scifact (also mirrored at https://huggingface.co/datasets/allenai/scifact)
• GAD: https://huggingface.co/datasets/bigbio/gad
• ChemProt: https://zenodo.org/records/14025500
• DDI: https://zenodo.org/records/14025500
• PubMed-RCT: https://raw.githubusercontent.com/Franck-Dernoncourt/pubmed-rct/master/PubMed_20k_RCT
• DRUGLIB: https://archive.ics.uci.edu/dataset/461/drug%2Breview%2Bdataset%2Bdruglib%2Bcom
• LitCovid: https://zenodo.org/records/14025500

Code Availability

The code for MedCL-Bench, including dataset preprocessing scripts, continual learning configurations, and baseline implementations, has been deposited in Zenodo for peer review and will be made publicly available upon publication.

References

[1] McCloskey, M., Cohen, N.J.: Catastrophic interference in connectionist networks: The sequential learning problem. In: Psychology of Learning and Motivation, vol. 24, pp. 109–165. Elsevier (1989)
[2] He, F., Fei, R., Krull, J.E., Yu, Y., Zhang, X., Wang, X., Cheng, H., Gao, M., Su, L., Chen, Y., et al.: Harnessing the power of single-cell large language models with parameter-efficient fine-tuning using scPEFT. Nature Machine Intelligence, 1–16 (2025)
[3] Mendez, J.A., Eaton, E.: How to reuse and compose knowledge for a lifetime of tasks: A survey on continual learning and functional composition. Trans. Mach. Learn. Res. 2023 (2023)
[4] Kiyasseh, D., Zhu, T., Clifton, D.: A clinical deep learning framework for continually learning from cardiac signals across diseases, time, modalities, and institutions. Nature Communications 12(1), 4221 (2021)
[5] Lee, C.S., Lee, A.Y.: Clinical applications of continual learning machine learning. The Lancet Digital Health 2(6), 279–281 (2020)
[6] Guinney, J., Saez-Rodriguez, J.: Alternative models for sharing confidential biomedical data. Nature Biotechnology 36(5), 391–392 (2018) https://doi.org/10.1038/nbt.4128
[7] A question of trust for AI research in medicine. Nature Machine Intelligence 6, 739 (2024) https://doi.org/10.1038/s42256-024-00880-0. Editorial
[8] Bergquist, T., et al.: Piloting a model-to-data approach to enable predictive analytics in health care through patient mortality prediction. Journal of the American Medical Informatics Association 27(9), 1393–1400 (2020) https://doi.org/10.1093/jamia/ocaa083
[9] McMahan, H.B., Moore, E., Ramage, D., Hampson, S., Arcas, B.: Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1273–1282 (2017). https://proceedings.mlr.press/v54/mcmahan17a.html
[10] Abul-Husn, N.S., Kenny, E.E.: Personalized medicine and the power of electronic health records. Cell 177(1), 58–69 (2019) https://doi.org/10.1016/j.cell.2019.02.039
[11] Huguet, N., Kaufmann, J., O'Malley, J., Angier, H., Hoopes, M., DeVoe, J.E., Marino, M.: Using electronic health records in longitudinal studies: estimating patient attrition. Medical Care 58, 46–52 (2020)
[12] Sahiner, B., Chen, W., Samala, R.K., Petrick, N.: Data drift in medical machine learning: implications and potential remedies. British Journal of Radiology 96(1150), 20220878 (2023) https://doi.org/10.1259/bjr.20220878
[13] Finlayson, S.G., Subbaswamy, A., Singh, K., et al.: The clinician and dataset shift in artificial intelligence. New England Journal of Medicine 385(3), 283–286 (2021) https://doi.org/10.1056/NEJMc2104626
[14] Lasko, T.A., et al.: Why do probabilistic clinical models fail to transport between sites. npj Digital Medicine (2024) https://doi.org/10.1038/s41746-024-01037-4
[15] Kore, A., Abbasi Bavil, E., Subasri, V., et al.: Empirical data drift detection experiments on real-world medical imaging data. Nature Communications 15, 1887 (2024) https://doi.org/10.1038/s41467-024-46142-w
[16] Caruana, R.: Multitask learning. Machine Learning 28, 41–75 (1997)
[17] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), 3521–3526 (2017)
[18] Zenke, F., Poole, B., Ganguli, S.: Continual learning through synaptic intelligence, pp. 3987–3995 (2017)
[19] Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., Tuytelaars, T.: Memory aware synapses: Learning what (not) to forget. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 139–154 (2018)
[20] Lopez-Paz, D., Ranzato, M.: Gradient episodic memory for continual learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)
[21] Chaudhry, A., Ranzato, M., Rohrbach, M., Elhoseiny, M.: Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420 (2018)
[22] Mi, F., Chen, L., Zhao, M., Huang, M., Faltings, B.: Continual learning for natural language generation in task-oriented dialog systems, pp. 3461–3474 (2020)
[23] Zhao, Y., Zheng, Y., Tian, Z., Gao, C., Yu, B., Yu, H., Li, Y., Sun, J., Zhang, N.L.: Prompt conditioned VAE: Enhancing generative replay for lifelong learning in task-oriented dialogue. (2022)
[24] Zeng, M., Yang, H., Xue, W., Liu, Q., Guo, Y.: Dirichlet continual learning: Tackling catastrophic forgetting in NLP. In: The 40th Conference on Uncertainty in Artificial Intelligence (2024)
[25] Sun, F.-K., Ho, C.-H., Lee, H.-Y.: LAMOL: Language modeling for lifelong language learning. arXiv preprint arXiv:1909.03329 (2019)
[26] Madotto, A., Lin, Z., Zhou, Z., Moon, S., Crook, P., Liu, B., Yu, Z., Cho, E., Fung, P., Wang, Z.: Continual learning in task-oriented dialogue systems. (2021)
[27] Zeng, M., Yang, H., Chen, X., Guo, Y.: Task-wrapped continual learning in task-oriented dialogue systems. In: Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3173–3183 (2025)
[28] Zhang, Y., Wang, X., Yang, D.: Continual sequence generation with adaptive compositional modules. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3653–3667 (2022)
[29] Zeng, M., Chen, X., Yang, H., Guo, Y.: Sparse adapter fusion for continual learning in NLP. arXiv preprint arXiv:2602.02502 (2026)
[30] Zhu, Q., Li, B., Mi, F., Zhu, X., Huang, M.: Continual prompt tuning for dialog state tracking, pp. 1124–1137 (2022)
[31] Robins, A.: Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7(2), 123–146 (1995) https://doi.org/10.1080/09540099550039318
[32] Wang, X., Chen, T., Ge, Q., Xia, H., Bao, R., Zheng, R., Zhang, Q., Gui, T., Huang, X.-J.: Orthogonal subspace learning for language model continual learning. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 10658–10671 (2023)
