Using Unsupervised Domain Adaptation Semantic Segmentation for Pulmonary Embolism Detection in Computed Tomography Pulmonary Angiogram (CTPA) Images


Authors: Wen-Liang Lin, Yun-Chien Cheng

Department of Mechanical Engineering, College of Engineering, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Corresponding authors: lightneil9.en10@nycu.edu.tw, yccheng@nycu.edu.tw

Abstract

While deep learning has demonstrated considerable promise in computer-aided diagnosis for pulmonary embolism (PE), practical deployment in Computed Tomography Pulmonary Angiography (CTPA) is often hindered by "domain shift" and the prohibitive cost of expert annotations. To address these challenges, an unsupervised domain adaptation (UDA) framework is proposed, utilizing a Transformer backbone and a Mean-Teacher architecture for cross-center semantic segmentation. The primary focus is placed on enhancing pseudo-label reliability by learning deep structural information within the feature space. Specifically, three modules are integrated and designed for this task: (1) a Prototype Alignment (PA) mechanism to reduce category-level distribution discrepancies; (2) Global and Local Contrastive Learning (GLCL) to capture both pixel-level topological relationships and global semantic representations; and (3) an Attention-based Auxiliary Local Prediction (AALP) module designed to reinforce sensitivity to small PE lesions by automatically extracting high-information slices from Transformer attention maps. Experimental validation conducted on cross-center datasets (FUMPE and CAD-PE) demonstrates significant performance gains. In the […] task, the IoU increased from 0.1152 to 0.4153, while the […] task saw an improvement from 0.1705 to 0.4302.
Furthermore, the proposed method achieved a 69.9% Dice score in the […] cross-modality task on the MMWHS dataset without utilizing any target-domain labels for model selection, confirming its robustness and generalizability for diverse clinical environments.

Keywords: Deep learning, Pulmonary embolism, Vision Transformer, Computed Tomography Pulmonary Angiography, Unsupervised domain adaptation, Semantic segmentation

1. Introduction

Pulmonary embolism (PE) is a life-threatening cardiovascular condition requiring prompt diagnosis and immediate clinical intervention. Computed Tomography Pulmonary Angiography (CTPA) is currently regarded as the clinical gold standard for PE diagnosis [1] due to its superior spatial resolution and diagnostic accuracy. However, manual interpretation of CTPA scans remains a formidable challenge for radiologists; PE lesions, particularly those at the sub-segmental level, are often remarkably small and can be easily misidentified or overlooked amidst complex pulmonary vascular structures. To mitigate the risks of misdiagnosis and reduce the workload of clinicians, deep learning-based Computer-Aided Diagnosis (CAD) systems have shown significant progress in automating PE detection. Early studies employing Convolutional Neural Networks (CNNs), such as those by Cano-Espinosa et al. [2] and Long et al. [3], demonstrated that direct segmentation strategies and probability-driven anchor mechanisms could effectively detect sub-segmental emboli within single-center datasets. Similarly, Trongmetheerat et al. [4] integrated attention mechanisms into U-Net architectures to enhance the sensitivity of detecting minute lesions in small vessels. However, these supervised approaches rely on the Independent and Identically Distributed (IID) assumption.
In clinical reality, medical images are subject to significant "domain shift" caused by variations in scanner protocols, contrast timing, and patient demographics [5]. Consequently, models trained on source domains often fail to generalize to unseen target domains (out-of-distribution data). While retraining models on target data is a theoretical solution, the prohibitive cost of pixel-level annotation by radiologists renders fully supervised transfer learning impractical. To address this, Unsupervised Domain Adaptation (UDA) has become a key research direction that seeks to transfer knowledge learned from a labeled source domain to an unlabeled target domain without requiring target-domain annotations [6]. Extensive research has explored various approaches for UDA. These methods can generally be grouped into three categories: image-level alignment, adversarial learning, and self-training. Image-level approaches span from CycleGAN-based translation methods [7-9], which can be computationally prohibitive, to lightweight alternatives such as Fast Fourier Transform (FFT)-based style transfer [10-14]. In parallel, feature- or output-level adversarial alignment [15-17] has been widely explored, yet it is often difficult to train reliably due to the sensitivity of min-max optimization. These limitations have motivated increasing interest in self-training, where Mean-Teacher [18-19] frameworks are frequently adopted for their relative stability. Nevertheless, a central bottleneck remains: pseudo-labels on the target domain are inevitably noisy, leading to error accumulation in highly class-imbalanced scenarios like PE segmentation. Beyond the choice of adaptation strategy, the backbone architecture also plays a critical role. Many UDA pipelines were originally developed on standardized CNN backbones.
However, CNN-based models have increasingly been outperformed by State-of-the-Art (SOTA) Transformer architectures [20-22]. By leveraging self-attention to model long-range semantic dependencies, Transformers can overcome the locality of CNN receptive fields and often exhibit stronger generalization [23-24]. Building on this trend, DAFormer [25] was among the first to integrate a Transformer backbone into a Mean-Teacher UDA framework, achieving SOTA performance on the […] benchmark. In medical imaging, MA-UDA [26] further advanced this direction by explicitly aligning representations at both the pixel and attention levels to mitigate cross-modality domain shift. Its key innovation, the Meta Attention mechanism, establishes hierarchical correlations across multi-head attention maps, enabling efficient multi-level alignment without introducing multiple discriminators as required in adversarial approaches. In the current landscape of medical UDA, MAPSeg [27] represents the peak of non-adversarial SOTA, achieving an impressive 81.32% Dice score on […] tasks. By abandoning unstable adversarial learning for a robust Mean-Teacher mechanism, it successfully integrates anatomical priors via a Global-Local Collaboration (GLC) module. Notably, MAPSeg established a new benchmark by proposing a fair model selection mechanism, addressing the common but clinically impractical practice of relying on target-domain labels for validation. However, despite these breakthroughs, significant barriers remain. First, MAPSeg's reliance on 3D CNNs and 3D VAEs imposes excessive computational overhead and memory demands, raising the threshold for hardware-constrained clinical environments. Second, and most critically for PE detection, the GLC module employs random cropping strategies.
Given that PE lesions occupy a minute fraction of the volume, random sampling frequently yields background-only patches, leading to ineffective anatomical alignment and the loss of critical lesion-related context. To bridge these gaps, this study proposes a Transformer-based Mean-Teacher UDA framework tailored for cross-center PE detection. Unlike traditional CNNs, a Mix Vision Transformer (MiT) backbone [22] is employed to capture long-range semantic dependencies while preserving high-resolution details for small lesions. The primary contribution lies in the integration and design of three feature-space alignment modules to enhance pseudo-label reliability: (1) Prototype Alignment (PA) to minimize category-level distribution shifts; (2) Global and Local Contrastive Learning (GLCL), augmented with a momentum queue, to capture topological relationships and global semantics without large-batch requirements; and (3) an Attention-based Auxiliary Local Prediction (AALP) module. Unlike random cropping, AALP utilizes Transformer attention maps to explicitly extract lesion-rich regions, enforcing anatomical consistency between local and global views.

The main contributions of this study are summarized as follows:
(1) A computationally efficient Transformer-based Mean-Teacher UDA framework is developed for CTPA PE segmentation, balancing performance and resource constraints.
(2) Three feature-space alignment modules (PA, GLCL, AALP) are integrated and designed to actively improve pseudo-label quality and address the challenge of detecting small PE lesions.
(3) An Attention-based Auxiliary Local Prediction (AALP) module is proposed to replace random cropping, significantly enhancing sensitivity to tiny objects.
(4) Experiments on cross-center (FUMPE, CAD-PE) and cross-modality (MMWHS) datasets demonstrate that the proposed method effectively mitigates domain shift and achieves robust performance under a strictly unsupervised model selection setting.

2. Related Work

2.1. Unsupervised Domain Adaptation

UDA aims to minimize the expected risk on the target domain by aligning the source and target distributions, as theoretically bounded by Ben-David et al. [28]. Current strategies fall into three categories:
(1) Image-Level Alignment: These methods reduce visual discrepancies in the input space. While CycleGAN-based approaches [7-9] successfully align textures, they entail high memory usage and training instability due to the requirement of multiple generators and discriminators. Alternatively, Fast Fourier Transform (FFT)-based methods [10-14] decompose images into amplitude (style) and phase (structure). By swapping the low-frequency amplitude spectrum of source images with that of target images, style transfer can be achieved efficiently without learnable parameters. Similarly, histogram matching aligns intensity distributions. Given the computational constraints in clinical settings, this study adopts efficient FFT and histogram-matching techniques for preliminary input-level alignment.
(2) Feature-Level Adversarial Alignment: Inspired by GANs, methods like AdaptSegNet [15] and CyCADA [16] employ domain discriminators to enforce feature invariance. Although effective, adversarial training is notoriously unstable and sensitive to hyperparameters. Consequently, this study opts for a more stable self-training paradigm.
(3) Self-Training and Mean-Teacher: Self-training generates pseudo-labels for the unlabeled target domain to guide learning [10, 25, 27, 29-37].
The Mean-Teacher framework [18-19], utilizing Temporal Ensembling, has become a standard for stabilizing these pseudo-labels. However, the quality of pseudo-labels remains a bottleneck. While threshold-based filtering and uncertainty estimation [38] are common, they are passive strategies. Recent works have explored meta-learning [39] for noise correction, but computational costs remain high. This study focuses on improving pseudo-labels through active structural learning in the feature space.

2.2. Structural Learning in Feature Space

To enhance feature discriminability, Prototype Alignment (PA) and Contrastive Learning have been introduced to UDA. PA aligns class-specific centroids between domains [12], reducing category-level shifts. However, PA alone may fail for class-imbalanced data like PE, where centroids are biased by background noise. Contrastive learning pulls positive pairs closer while pushing negatives apart. Liu et al. [13] proposed a Dual Contrastive framework combining global and local contrastive losses to capture both semantic layout and pixel-level topology. A limitation of such approaches is the dependency on large batch sizes for sufficient negative samples. To mitigate this, this study incorporates a momentum queue (MoCo) [40] into the GLCL module, enabling effective contrastive learning under limited hardware resources.

2.3. Context-Aware Modeling and Limitations of Random Crop

Integrating global and local context is vital for ensuring anatomical plausibility. MAPSeg introduced a Global-Local Collaboration (GLC) module that enforces consistency between global features and local patches. While achieving SOTA results, MAPSeg relies on random cropping to select local patches.
In PE detection, where lesions are sparse and small, random cropping predominantly selects background regions, rendering the local-global alignment ineffective. To address this critical flaw, this study proposes the Attention-based Auxiliary Local Prediction (AALP) module, which leverages the Transformer's inherent attention maps to intelligently locate and crop lesion-informative regions, ensuring meaningful context alignment.

3. Material and Methods

3.1. Datasets for UDA Training

To evaluate the generalizability of the proposed UDA framework in clinical scenarios, two types of adaptation tasks were conducted: bidirectional cross-center adaptation using CTPA images and cross-modality adaptation using cardiac scans.

3.1.1. CTPA Datasets

For the cross-center adaptation experiments, two publicly available CTPA datasets with expert semantic annotations for PE were utilized: the Ferdowsi University of Mashhad's Pulmonary Embolism (FUMPE) dataset [41] and the Computer Aided Detection for Pulmonary Embolism Challenge (CAD-PE) dataset [42]. Specifically, the experiments in this research were conducted using image sequences from 33 selected patients from the FUMPE dataset and 91 patients from the CAD-PE dataset. A critical aspect of the experimental design is the patient-wise partitioning of data, rather than slice-wise partitioning, to better reflect authentic clinical conditions and prevent data leakage. Specifically, the training and validation sets for FUMPE and CAD-PE were partitioned into 29/4 and 80/11 patients, respectively. In the UDA training phase, the model has access to annotations only from the source domain, while the target domain remains entirely unlabeled.
For instance, in the FUMPE→CAD-PE task, the framework learns from the labeled FUMPE training set and performs domain alignment using the unlabeled CAD-PE training set, with performance finally evaluated on the CAD-PE validation set.

3.1.2. Cardiac CT-MRI Dataset

To ensure a fair comparison with existing medical UDA literature, we conduct cross-modality adaptation experiments on the Multi-Modality Whole Heart Segmentation (MMWHS) dataset [43]. MMWHS comprises 20 MRI scans and 20 CT scans with annotations for seven cardiac anatomical structures. Following the prevalent standards in UDA research, this study focuses on four key structures: Myocardium of the Left Ventricle (MYO), Left Atrium Blood Cavity (LAC), Left Ventricle Blood Cavity (LVC), and Ascending Aorta (AA). To ensure direct comparability with SOTA methods, we adopt the same training/validation split protocol as specified in [8].

3.2. Experimental Procedure

The overall experimental procedure of this study is illustrated in Fig. 1. Following the dataset partitioning strategies detailed in Section 3.1, the proposed UDA training framework is implemented to develop the semantic segmentation model.

Figure 1. Experiment process

To comprehensively evaluate the performance and clinical utility of the system, the following tests were conducted:
(1) Validation of the UDA framework's effectiveness on the MMWHS dataset for cross-modality adaptation.
(2) Assessment of the performance and adaptability of the Transformer-based architecture in UDA tasks.
(3) Validation of the UDA framework on CTPA datasets, specifically evaluating the performance gain in detecting minute pulmonary embolism lesions through feature-space structural learning.
(4) Ablation studies on each critical UDA component to verify their individual contributions.
(5) A fair comparison between the proposed UDA training framework and existing SOTA methods.

3.3. Data Preprocessing

The raw CTPA data is provided in DICOM or Nearly Raw Raster Data (NRRD) format, where intensities are expressed in Hounsfield Units (HU) to reflect tissue radiation absorption. The original HU range, typically spanning from -1000 to 3000, often results in poor contrast between PE lesions and surrounding pulmonary structures. To address this, an empirical HU window is applied to clip the intensity range to [-200, 500], followed by normalization to [0, 1]. Furthermore, as the pulmonary artery is typically situated at the image center, the images are cropped to a resolution of […] to reduce computational redundancy and overhead. The visual enhancement achieved through this HU windowing is illustrated in Fig. 2. For the MMWHS dataset, the preprocessing protocols strictly adhere to the standards established in [8], including random cropping to a size of […] to ensure the direct comparability of experimental results.

Figure 2. (a) Before applying a HU window (b) After applying a HU window

3.4. UDA Training Methods

3.4.1. Problem Definition

Given a source-domain image set $X_s = \{x_s^i\}_{i=1}^{N_s}$ with its corresponding pixel-level ground-truth labels $Y_s = \{y_s^i\}_{i=1}^{N_s}$, and a related target-domain image set $X_t = \{x_t^j\}_{j=1}^{N_t}$, where $N_s$ and $N_t$ represent the total sample counts for each respective domain, our objective is to construct a robust semantic segmentation model $f_\theta$. Conventional models optimized solely on source-domain data frequently experience a substantial drop in predictive accuracy when applied to the target environment, a consequence of the domain-shift effect. UDA aims to mitigate this discrepancy without relying on target-domain annotations.
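The HU windowing and normalization step of Section 3.3 is simple enough to pin down in code; the sketch below is a minimal NumPy version (the function name and the tiny example slice are illustrative, not from the paper):

```python
import numpy as np

def hu_window(volume, lo=-200.0, hi=500.0):
    """Clip raw Hounsfield Units to the empirical [-200, 500] PE window
    and rescale the result linearly to [0, 1]."""
    clipped = np.clip(volume.astype(np.float32), lo, hi)
    return (clipped - lo) / (hi - lo)

# Example: a synthetic 2x2 slice spanning the full raw HU range.
slice_hu = np.array([[-1000.0, -200.0],
                     [150.0, 3000.0]])
norm = hu_window(slice_hu)  # values now lie in [0, 1]
```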
By implementing mechanisms such as image-level translation or feature-space alignment to minimize distributional variances, UDA facilitates the migration of source-learned knowledge to the target domain. This process ultimately enhances the model's diagnostic efficacy in specialized clinical target-domain applications.

3.4.2. Overall UDA Training Framework

The conceptual configuration of the developed UDA training framework is presented in Fig. 3, where the green and red lines identify the information flow for the source and target domains, respectively. The overall framework is structured into three essential phases. The process starts with style-transfer data augmentation during the initial processing to mitigate visual discrepancies across the domains. Subsequently, a Mean-Teacher self-training scheme with consistency regularization is employed; the teacher network produces pseudo-labels for unlabeled target images, providing supervisory signals to the student model despite the absence of target-domain annotations. Finally, to further optimize feature representation, three specialized modules for feature-space structural learning are integrated into the training process. These modules are designed to mine semantic structures within the latent space, which improves pseudo-label reliability and enhances sensitivity to the fine boundaries of small lesions. Detailed formulations and implementation details are provided in the following subsections.

Figure 3. Overall UDA training framework

3.4.3. Segmentation Network Architecture

As established in [25], Transformer-based models exhibit superior performance and flexibility over classical CNNs when applied to UDA semantic segmentation. Consequently, the Mix Vision Transformer (MiT-B5), pretrained on ImageNet [44], serves as the encoder in our architecture.
The decoder utilizes a Feature Pyramid Network (FPN) following the framework outlined in [45]. The FPN architecture is particularly critical for enhancing the detection of small lesions, as it effectively integrates high-resolution details and rich semantic information from shallow and deep layers, respectively. Finally, these combined feature maps are processed by a segmentation head to produce pixel-level class predictions.

3.4.4. Style Transfer and Data Augmentation

In this study, the image sets generated through data augmentation are denoted by $\{\hat{x}_s, \hat{x}_t, x_{s\to t}, x_{t\to s}\}$. The sets $\{\hat{x}_s, \hat{x}_t\}$ consist of source and target images processed via conventional augmentation strategies, such as Gaussian blurring, rotations, and random pixel dropout. Conversely, $\{x_{s\to t}, x_{t\to s}\}$ denote images processed via the style-transfer modules (Fourier-based augmentation or Histogram Matching), which directly minimize input-space domain disparities. For Fast Fourier Transform (FFT) Data Augmentation: As established in prior UDA research [10-11], Fourier-based style transfer has demonstrated significant effectiveness in enhancing data diversity and improving performance across various UDA tasks. This technique relies on the decomposition of frequency-domain images into their respective phase and amplitude constituents. Generally, the amplitude component captures stylistic attributes like texture and intensity, whereas the phase component maintains the essential semantic and structural layout of the image.
For a source image $x_s \in \mathbb{R}^{H \times W \times C}$ (where $H$, $W$, and $C$ signify height, width, and channel dimensions), the Fourier transform $\mathcal{F}$ is utilized to obtain the amplitude spectrum $A_s$ and the phase spectrum $P_s$ as follows:

$\mathcal{F}(x_s)(u,v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} x_s(h,w)\, e^{-j2\pi\left(\frac{hu}{H} + \frac{wv}{W}\right)}$  (1)

where $(h,w)$ are coordinates in the spatial domain, $(u,v)$ are coordinates in the frequency domain, and $j$ is the imaginary unit ($j^2 = -1$). Similarly, the amplitude $A_t$ and phase $P_t$ of a target image $x_t$ are obtained. Subsequently, a binary mask $M_\beta$ is defined to isolate the central low-frequency components (set to 1) while setting all remaining frequencies to 0, thereby enabling the source image to be rendered in the target style. The area of this region is controlled by the parameter $\beta$. The style-transfer operation is formulated as:

$A_{s\to t} = M_\beta \odot A_t + (1 - M_\beta) \odot A_s$  (2)

Here, the symbol $\odot$ signifies the Hadamard (element-wise) product. This strategy ensures the preservation of high-frequency source structure while integrating low-frequency target characteristics, thereby expanding the diversity of the training data. Finally, the hybridized amplitude $A_{s\to t}$ is combined with the original phase $P_s$ via the inverse Fourier transform $\mathcal{F}^{-1}$ to reconstruct the stylized image $x_{s\to t}$:

$x_{s\to t} = \mathcal{F}^{-1}(A_{s\to t}, P_s)$  (3)

For Histogram Matching: This classical image-processing technique maps the intensity distribution of a source image to that of a target image to achieve alignment. For $x_s$ and $x_t$, the pixel intensities are first normalized into $L$ discrete bins within the range $[0, L-1]$. Letting $v_i$ be the intensity value of the $i$-th bin, the probability distribution $p_s(v_i)$ is defined as:

$p_s(v_i) = \frac{n_i}{N}$  (4)

where $n_i$ is the number of pixels with intensity $v_i$, and $N$ is the image size.
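The amplitude-swap style transfer of Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is a minimal single-channel version; the function name and the convention of $\beta$ as a fractional half-width of the swapped central band are assumptions, not the paper's exact implementation:

```python
import numpy as np

def fft_style_transfer(x_s, x_t, beta=0.1):
    """Render source image x_s in the low-frequency 'style' of x_t.

    x_s, x_t: 2-D arrays of equal shape. beta controls the half-width of
    the swapped central low-frequency band as a fraction of the image size.
    """
    fs = np.fft.fftshift(np.fft.fft2(x_s))  # shift so low freqs are central
    ft = np.fft.fftshift(np.fft.fft2(x_t))
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_t = np.abs(ft)

    h, w = x_s.shape
    b_h, b_w = int(beta * h), int(beta * w)
    ch, cw = h // 2, w // 2
    mask = np.zeros((h, w), dtype=bool)           # M_beta of Eq. (2)
    mask[ch - b_h:ch + b_h + 1, cw - b_w:cw + b_w + 1] = True

    amp_mix = np.where(mask, amp_t, amp_s)        # Eq. (2): hybrid amplitude
    hybrid = amp_mix * np.exp(1j * pha_s)         # keep source phase
    return np.fft.ifft2(np.fft.ifftshift(hybrid)).real  # Eq. (3)

# Sanity check: transferring an image into its own style is an identity map.
rng = np.random.default_rng(0)
img = rng.random((16, 16))
out = fft_style_transfer(img, img, beta=0.25)
```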
A similar distribution $p_t(v_i)$ is calculated for the target domain. By computing the cumulative distribution functions (CDF) for both domains, a mapping function is established to adjust the intensity distribution of the source image to match the target, resulting in the stylized output $x_{s\to t}$.

3.4.5. Mean Teacher and Consistency Regularization

To mitigate the domain shift between the source and target domains, we adopt a self-training strategy based on the Mean-Teacher paradigm [18]. This iterative framework generates online pseudo-labels to supervise learning on unlabeled target images. The framework consists of a teacher model $f_{\theta'}$ and a student model $f_\theta$, which share identical architectures. Instead of being updated by backpropagation, the teacher's weights are adjusted at every training step $t$ through the Exponential Moving Average (EMA) of the student's weights:

$\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha)\theta_t$  (5)

where $\alpha$ is a hyperparameter set between 0.99 and 0.999 according to the training stage. This mechanism is designed to stabilize the optimization process, ensuring the generation of reliable, high-fidelity pseudo-labels as the model converges. During the pseudo-label generation phase, the teacher model produces a probability map for the target image. While conventional methods often employ a fixed threshold to convert these maps into binary pseudo-labels, the teacher model is prone to noise in the early stages of training. In the context of pulmonary embolism detection, where class imbalance is severe and lesions occupy a minute fraction of the image, a fixed threshold can lead to confirmation bias, as the model tends to classify ambiguous regions as background. To mitigate this, an entropy-based dynamic filtering mechanism is adopted.
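The EMA update of Eq. (5) is applied parameter-wise after every student optimization step; a minimal sketch (names illustrative) shows how the teacher drifts toward the student without backpropagation:

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha=0.99):
    """Eq. (5): theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t,
    applied to every parameter tensor of the teacher."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: a single 3-element "parameter" pulled toward a fixed student.
teacher = [np.zeros(3)]
student = [np.ones(3)]
for _ in range(5):
    teacher = ema_update(teacher, student, alpha=0.9)
# After n steps toward a fixed student of 1, each entry equals 1 - alpha**n.
```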
The entropy for each pixel is calculated as follows:

$H(x_t) = -\sum_{c} p(c \mid x_t) \log p(c \mid x_t)$  (6)

where $p(c \mid x_t)$ represents the predicted probability of target image $x_t$ belonging to class $c$. By minimizing this entropy, the model is driven to make high-confidence predictions on unlabeled target data. In practice, pixels are ranked by their entropy values, and only the top 80% with the lowest entropy (highest confidence) are retained as valid pseudo-labels for consistency-loss calculation. The remaining 20% are masked as unreliable, preventing learning stagnation caused by overly rigid thresholds in early iterations. Furthermore, a consistency-regularization strategy utilizing strong and weak data augmentations is integrated to enhance performance. Specifically, the student and teacher models receive strongly and weakly augmented versions of the input, respectively, and are required to produce consistent predictions. Both the source-domain segmentation loss $L_{seg}$ and the target-domain consistency loss $L_{con}$ are calculated using Dice Loss, defined as:

$L_{Dice} = 1 - \frac{2\sum_i p_i y_i}{\sum_i p_i + \sum_i y_i}$  (7)

where $p_i$ is the predicted probability from the student model and $y_i$ represents the target label. For $L_{seg}$, $y_i$ is the source-domain ground truth, whereas for $L_{con}$, it denotes the teacher-generated online pseudo-label. Dice Loss is particularly effective for medical image segmentation as it is less sensitive to class imbalance, stabilizing the training process dominated by background pixels. Finally, while filtering low-confidence pseudo-labels is essential, it may not be sufficient to eliminate all noise.
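The entropy filter of Eq. (6) and the Dice loss of Eq. (7) both reduce to a few NumPy operations; the sketch below uses a quantile cut to keep the lowest-entropy 80% of pixels as described above (function names are illustrative):

```python
import numpy as np

def entropy_mask(prob, keep_ratio=0.8):
    """Eq. (6): per-pixel entropy over the class axis (axis 0) of a
    (C, H, W) probability map; keep the keep_ratio fraction of pixels
    with the LOWEST entropy as reliable pseudo-labels."""
    eps = 1e-8
    ent = -np.sum(prob * np.log(prob + eps), axis=0)   # (H, W)
    thresh = np.quantile(ent, keep_ratio)
    return ent <= thresh                               # True = reliable

def dice_loss(pred, target, eps=1e-8):
    """Eq. (7): 1 - 2*sum(p*y) / (sum(p) + sum(y))."""
    inter = np.sum(pred * target)
    return 1.0 - 2.0 * inter / (np.sum(pred) + np.sum(target) + eps)

# Toy map: three confident pixels, one maximally uncertain pixel.
prob = np.zeros((2, 2, 2))
prob[:, 0, 0] = [0.99, 0.01]
prob[:, 0, 1] = [0.99, 0.01]
prob[:, 1, 0] = [0.99, 0.01]
prob[:, 1, 1] = [0.5, 0.5]
mask = entropy_mask(prob, keep_ratio=0.75)
```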
To enable the model to actively learn the structural information of the feature space and mine correlations between domains, this study introduces three specialized feature-learning modules: Prototype Alignment (PA), Global and Local Contrastive Learning (GLCL), and Attention-based Auxiliary Local Prediction (AALP). These modules optimize feature representation from different perspectives, thereby improving pseudo-label quality and overall segmentation efficacy.

3.4.6. Prototype Alignment (PA)

The Prototype Alignment (PA) mechanism implemented in this study aims to resolve the feature-distribution shift commonly encountered in unsupervised domain adaptation. By performing class-level centroid alignment, the framework promotes the clustering of features with identical semantic labels, such as pulmonary vascular structures, within the latent space across both domains. The procedure starts with the extraction of latent representations $f$ from the penultimate layer of the encoder. For the source and target domains, the model utilizes the source ground truth $y_s$ and the target online pseudo-labels $\hat{y}_t$, respectively, to extract class-specific feature sets $\Phi_c$. The global category prototype $\mu_c$ is defined as the centroid of all pixel-wise feature vectors within that category. Notably, as previously discussed, the 20% of target pixels identified as noisy via high entropy values are strictly excluded from the target prototype calculation to ensure representative accuracy:

$\mu_c = \frac{1}{|\Phi_c|} \sum_{i \in \Phi_c} f_i$  (8)

where $i$ denotes the spatial index of the flattened feature map, and $f_i$ represents the $D$-dimensional feature vector at that location. Given that pulmonary embolism lesions are extremely sparse, a single training batch may not provide stable category-level information.
To mitigate this, a momentum update strategy is employed to maintain the global prototypes:

$\mu_c \leftarrow (1 - \eta)\,\mu_c + \eta\,\mu_c^{batch}$  (9)

where $\eta$ is the momentum coefficient (set to 0.01) and $\mu_c^{batch}$ is the prototype calculated from the current batch. Finally, a cross-domain global prototype-alignment loss is established by reducing the Euclidean distance between the source prototype $\mu_c^s$ and the target prototype $\mu_c^t$:

$L_{PA} = \sum_c \left\| \mu_c^s - \mu_c^t \right\|_2$  (10)

This optimization compels the model to align the feature centroids of both domains, ensuring robust category-level consistency.

3.4.7. Global and Local Contrastive Learning (GLCL)

As illustrated in Fig. 4, this study implements a Global and Local Contrastive Learning (GLCL) module, building upon the Dual Contrastive framework proposed by Liu et al. [13]. The primary objective of GLCL is to force the network to decouple structural semantics from superficial style variations (e.g., CT vs. MRI). Since contrastive learning benefits from abundant negative samples but large batches are memory-intensive, this study incorporates Momentum Contrast (MoCo) [40] in the global branch to increase negative-sample diversity with modest GPU memory cost. For Local Contrastive Learning (LCL): The LCL module focuses on capturing structural information, such as the contours and fine details of lesions. Based on the premise that geometric relationships between neighboring pixels should remain invariant across styles, a Local Projection Head consisting of two convolutional layers is utilized to map latent features into a contrastive space $z \in \mathbb{R}^{H' \times W' \times D'}$, where $H' \times W'$ and $D'$ denote the spatial resolution and projection dimension, respectively.
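The prototype machinery of Section 3.4.6 amounts to a masked average, a momentum blend, and a distance. A minimal NumPy sketch, assuming the momentum form $\mu \leftarrow (1-\eta)\mu + \eta\,\mu^{batch}$ (the exact direction of the blend is not fully recoverable from the text):

```python
import numpy as np

def class_prototype(features, mask):
    """Eq. (8): centroid of the D-dim feature vectors belonging to one class.
    features: (D, N) flattened feature map; mask: boolean array of length N."""
    return features[:, mask].mean(axis=1)

def momentum_update(proto_global, proto_batch, eta=0.01):
    """Eq. (9), assumed form: mu <- (1 - eta) * mu + eta * mu_batch."""
    return (1.0 - eta) * proto_global + eta * proto_batch

def pa_loss(proto_s, proto_t):
    """Eq. (10): Euclidean distance between source and target prototypes."""
    return np.linalg.norm(proto_s - proto_t)

# Toy example: D=2 features over N=3 pixels, two of which belong to the class.
feats = np.array([[1.0, 3.0, 5.0],
                  [2.0, 4.0, 6.0]])
mask = np.array([True, False, True])
proto = class_prototype(feats, mask)   # centroid of columns 0 and 2
```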
For a feature vector f_i at spatial index i in the source feature map F^S, the positive sample f_i^{+} is defined as the most similar vector at the corresponding location in the stylized map F^{S \to T}, determined via a cosine similarity matrix. All other spatial locations in F^{S \to T} are treated as the negative sample set \{ f_j^{-} \}. The source and target local contrastive losses, \mathcal{L}_{LCL}^{S} and \mathcal{L}_{LCL}^{T}, are formulated as:

\mathcal{L}_{LCL}^{S} = -\frac{1}{HW} \sum_{i} \log \frac{\exp(\mathrm{sim}(f_i, f_i^{+})/\tau)}{\exp(\mathrm{sim}(f_i, f_i^{+})/\tau) + \sum_{j} \exp(\mathrm{sim}(f_i, f_j^{-})/\tau)}    (11)

\mathcal{L}_{LCL}^{T} = -\frac{1}{HW} \sum_{i} \log \frac{\exp(\mathrm{sim}(f_i, f_i^{+})/\tau)}{\exp(\mathrm{sim}(f_i, f_i^{+})/\tau) + \sum_{j} \exp(\mathrm{sim}(f_i, f_j^{-})/\tau)}    (12)

where \tau is the temperature parameter, and in Eq. (12) the anchors are taken from the target map F^T with positives and negatives from F^{T \to S}. The total local loss \mathcal{L}_{LCL} is the average of the two: \mathcal{L}_{LCL} = \frac{1}{2}(\mathcal{L}_{LCL}^{S} + \mathcal{L}_{LCL}^{T}).

Global Contrastive Learning (GCL) with MoCo: While LCL targets "contours and details," GCL aims to capture the "skeleton and layout" of the image. Features are mapped via a Global Projection Head (Conv + MaxPooling) to a feature vector z. To maximize discriminative power without requiring excessive batch sizes, a First-In-First-Out (FIFO) queue is maintained to store historical negative features. To ensure cross-batch consistency, negative samples are generated by the teacher model and stored in the queue. If B denotes the number of source-domain images within a single batch, the source global contrastive loss \mathcal{L}_{GCL}^{S} is formulated as:

\mathcal{L}_{GCL}^{S} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(z_i^{S}, z_i^{S \to T})/\tau)}{D_i}    (13)

D_i = \exp(\mathrm{sim}(z_i^{S}, z_i^{S \to T})/\tau) + \sum_{j \neq i} \exp(\mathrm{sim}(z_i^{S}, z_j^{S})/\tau) + \sum_{q \in \mathcal{Q}^{T \to S}} \exp(\mathrm{sim}(z_i^{S}, q)/\tau)    (14)

In this formulation, z_i^{S} serves as the anchor (the feature vector of the i-th source image), while z_i^{S \to T} is the corresponding stylized feature, treated as the positive sample.
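Both the local and global branches share the same InfoNCE form; a minimal NumPy sketch for a single anchor, using the paper's temperature τ = 0.04 (array names are illustrative):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.04):
    """InfoNCE: -log softmax score of the positive among positive + negatives,
    with cosine similarity scaled by a temperature tau."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

rng = np.random.default_rng(0)
a = rng.standard_normal(32)
# a perfectly aligned positive against dissimilar negatives -> tiny loss
loss_easy = info_nce(a, a, [rng.standard_normal(32) for _ in range(8)])
# negatives exactly as similar as the positive -> large loss
loss_hard = info_nce(a, a, [a] * 8)
```

The loss is minimized by pulling the anchor toward its stylized positive while pushing it away from all negatives, which is precisely the "same semantics, different style" invariance the module targets.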
The negative samples consist of z_j^{S} (j \neq i), representing other source images with different semantics within the same batch, and the queue set \mathcal{Q}^{T \to S}, representing target-to-source stylized features. Similarly, for the target domain, assuming the same number of images B, the target global contrastive loss \mathcal{L}_{GCL}^{T} is calculated using the feature sets \{ z_i^{T}, z_i^{T \to S}, \text{negatives} \} as the anchor, positive sample, and negative samples, respectively. The total global loss is the average of the two: \mathcal{L}_{GCL} = \frac{1}{2}(\mathcal{L}_{GCL}^{S} + \mathcal{L}_{GCL}^{T}). Finally, the overall contrastive loss is calculated as:

\mathcal{L}_{GLCL} = \lambda \mathcal{L}_{GCL} + (1 - \lambda) \mathcal{L}_{LCL}    (15)

where \lambda is set to 0.5. This dual-level approach ensures the model remains robust to global stylistic shifts while maintaining high sensitivity to local pathological structures.

3.4.8. Attention-based Auxiliary Local Prediction (AALP)

To integrate contextual semantic information, this study adopts the Global-Local Collaboration (GLC) module proposed in [27]. The GLC module facilitates the learning of anatomical priors by aligning global and local predictions, compelling the model to understand long-range spatial relationships rather than over-relying on isolated local features. In clinical scenarios such as PE detection, this constraint is paramount; without global context, a model might erroneously predict lesions or vascular structures in anatomically impossible locations, leading to significant false positives. The foundational architecture of this integration (previously referred to as GLC) is illustrated in Fig. 5. Let S(\cdot) be the student network in the Mean-Teacher framework. The local features F_l are derived from a local patch x_l passing through the student network.
For global features, a binary mask M is used to select the corresponding spatial regions: F_g^{M} = F(x_g) \odot M, where x_g is the downsampled global image and F(\cdot) represents the feature-extraction operation. Subsequently, the local features F_l and the masked global features F_g^{M} are concatenated along the channel dimension. This fused representation is then processed by the decoder D(\cdot) to produce the final segmentation result:

\hat{y}_l = D([F_l, F_g^{M}])    (16)

where [\cdot, \cdot] denotes the concatenation operation. To prevent local overfitting and ensure anatomical consistency, the distance between F_l and F_g^{M} is minimized using cosine-similarity regularization:

\mathcal{L}_{cos} = 1 - \frac{\langle F_l, F_g^{M} \rangle}{\| F_l \| \, \| F_g^{M} \|}    (17)

Figure 4. Overview of the GLCL module

Despite its theoretical advantages, the original GLC module relies on a random-cropping strategy to obtain local patches. For CTPA datasets, where PE lesions are extremely minute, random cropping frequently yields background-only patches, failing to provide the model with effective lesion-related semantics. To resolve this, this study proposes Attention-based Auxiliary Local Prediction (AALP), which utilizes the Transformer's self-attention mechanism to perform saliency-guided cropping. The methodology for extracting local patches through the attention mechanism is developed with reference to the approach described in [46]. Specifically, to capture the contribution of various image blocks to the final semantic segmentation, this module leverages the self-attention mechanism of the deep Transformer layers to extract saliency features. We focus on the attention outputs from the final two layers of the model.
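The saliency extraction from these attention maps, formalized in Eq. (18), can be sketched as follows (a minimal NumPy sketch; the aggregation axis is one plausible reading of the paper's "second dimension" fusion, and all names are illustrative):

```python
import numpy as np

def attention_saliency(attn_layers):
    """Fuse the final two layers' attention into a per-block saliency vector
    and return the blocks whose saliency exceeds the mean (candidate regions).

    attn_layers: list of [heads, N, N] post-softmax attention matrices.
    Aggregating the attention each block *receives* is one plausible
    interpretation of summing along the matrix's second dimension.
    """
    saliency = np.zeros(attn_layers[0].shape[-1])
    for attn in attn_layers[-2:]:                 # final two layers only
        saliency += attn.mean(axis=0).sum(axis=0)  # average heads, sum rows
    candidates = np.flatnonzero(saliency > saliency.mean())
    return saliency, candidates

# toy example: every query attends more strongly to block 3
N, heads = 8, 4
a = np.full((heads, N, N), 1.0 / N)
a[:, :, 3] = 5.0 / N
a /= a.sum(axis=-1, keepdims=True)  # keep rows valid attention distributions
sal, cand = attention_saliency([a, a])
```

The paper then applies Connected Component Analysis over the candidate blocks (a routine such as scipy.ndimage.label would serve) and crops around the centroid of the largest connected region.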
For a Transformer block with h heads and a d-dimensional feature space, the attention matrix A^l for the l-th block is computed via the dot product of the queries Q^l and keys K^l:

A^l = \mathrm{softmax}\!\left( \frac{Q^l (K^l)^{\top}}{\sqrt{d_h}} \right)    (18)

where d_h = d/h is the per-head dimension. The resulting N \times N matrix describes the semantic correlation between N image blocks, where the element A^l_{ij} represents the strength of the relationship between the i-th and j-th blocks. To quantify the specific contribution of each image block, a dimension-aggregation operation is performed by fusing features along the second dimension of the matrix to generate a 1D vector v^l \in \mathbb{R}^{N}. The final attention saliency map S is then formed by superimposing the v^l vectors from these final two layers. Blocks with attention values exceeding the mean attention \mu_A are identified as effective candidate regions. Finally, Connected Component Analysis (CCA) is applied to locate the largest connected area within these regions, whose centroid serves as the center for cropping the most informative local patch.

The total loss for the AALP module, \mathcal{L}_{AALP}, is the sum of the source- and target-domain losses, \mathcal{L}_{AALP}^{S} and \mathcal{L}_{AALP}^{T}, respectively:

\mathcal{L}_{AALP} = \mathcal{L}_{AALP}^{S} + \mathcal{L}_{AALP}^{T}    (19)

The individual losses for each domain are defined as:

\mathcal{L}_{AALP}^{S} = \mathcal{L}_{Dice}\big( D([F_l^{S}, F_g^{M,S}]),\, y_l^{S} \big) + \alpha\, \mathcal{L}_{cos}^{S}    (20)

\mathcal{L}_{AALP}^{T} = \mathcal{L}_{Dice}\big( D([F_l^{T}, F_g^{M,T}]),\, \hat{y}_l^{T} \big) + \alpha\, \mathcal{L}_{cos}^{T}    (21)

where \mathcal{L}_{Dice} denotes the Dice loss, y_l^{S} is the source-domain local patch annotation, \hat{y}_l^{T} is the target-domain online pseudo-label, and \alpha weights the cosine-regularization term. The associated hyperparameters follow the established protocols.

3.4.9. Overall Objective Function

In summary, the training process of the proposed UDA framework is guided by an overall objective function that integrates the individual contributions of the segmentation, consistency, and structural-learning modules. The total loss \mathcal{L}_{total} is formulated as follows:

\mathcal{L}_{total} = \mathcal{L}_{seg} + \mathcal{L}_{cons} + \lambda_{PA} \mathcal{L}_{PA} + \lambda_{GLCL} \mathcal{L}_{GLCL} + \lambda_{AALP} \mathcal{L}_{AALP}    (22)

where \lambda_i \in \{ \lambda_{PA}, \lambda_{GLCL}, \lambda_{AALP} \} denotes the hyperparameters that balance the weight of each loss component during the optimization process.

3.5. Implementation Details

The loss-weighting hyperparameters \{ \lambda_{PA}, \lambda_{GLCL}, \lambda_{AALP} \} used for UDA training are fixed across experiments, and the temperature \tau is set to 0.04. The segmentation network was optimized using AdamW, with learning rates of 6e-5 and 6e-4 for the encoder and decoder, respectively, and a weight decay of 5e-4. We adopted cosine annealing for learning-rate scheduling. Models were trained for 120 epochs with a batch size of 8; the first 50 epochs served as a warm-up stage using only source-domain images. To mitigate the effect of random initialization, each experiment was repeated with five different random seeds, and results are reported as the mean ± standard deviation across runs.

Figure 5. Architecture of the GLC module (redrawn from [27])

3.6. Equipment

All experiments were conducted on a workstation featuring an ASUS Z790-A GAMING WIFI 6E motherboard, an Intel Core i9-13900K CPU, and an MSI GeForce RTX 4090 GAMING X TRIO (24 GB) GPU.

4. Experimental Results

4.1. Evaluation Metrics

To evaluate the segmentation performance of the proposed model on target-domain data, Intersection over Union (IoU) and Dice Score are employed as the primary metrics.
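Both metrics, defined formally below, follow directly from pixel-wise confusion counts; a minimal NumPy sketch over binary lesion masks (names are illustrative):

```python
import numpy as np

def iou_dice(pred, gt):
    """Lesion-class IoU and Dice from binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # lesion pixels correctly predicted
    fp = np.logical_and(pred, ~gt).sum()   # background predicted as lesion
    fn = np.logical_and(~pred, gt).sum()   # ground-truth lesion pixels missed
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice

pred = np.array([[1, 1, 0], [0, 0, 0]])
gt = np.array([[1, 0, 0], [1, 0, 0]])
iou, dice = iou_dice(pred, gt)  # tp=1, fp=1, fn=1 -> IoU=1/3, Dice=1/2
```

Note that Dice is always at least as large as IoU for the same prediction, which is why the two datasets' scores in this paper are not directly comparable.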
The calculation formulas are as follows:

\mathrm{IoU} = \frac{TP}{TP + FP + FN}    (23)

\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}    (24)

where TP (True Positive) denotes lesion pixels correctly predicted as lesions; FP (False Positive) denotes background or other-class pixels incorrectly predicted as lesions; and FN (False Negative) denotes ground-truth lesion pixels that the model failed to detect.

Regarding the results for the CTPA dataset, IoU is used as the evaluation metric, and only the scores for the pulmonary embolism category are recorded. This is because the pixel distribution between pulmonary embolism lesions and the background in CTPA images exhibits significant class imbalance; including the background class would fail to accurately reflect the model's detection efficacy. Conversely, for the MMWHS cross-modal heart dataset, we follow standard practice and use the Dice score. Dice is computed for each of the selected anatomical structures, and the final performance is reported as the mean Dice across categories.

Finally, for the selection of the optimal model, this study adopts the approach described in [27], which selects the best model without accessing target-domain labels. By combining the supervised performance on the source domain with the consistency score between the target-domain predictions and pseudo-labels, this strategy simulates the real-world clinical scenario where target-domain annotations are unavailable. The selection score is defined as follows:

S_{select} = S_{src} + S_{cons}    (25)

where S_{src} represents the performance on the source-domain validation set (IoU for the CTPA dataset and Dice Score for the MMWHS dataset, respectively), and S_{cons} represents the consistency score between the target-domain predictions and the generated pseudo-labels (similarly using IoU or Dice Score).

4.2. Effectiveness of the Prototype Alignment Module on the MMWHS Dataset (CT → MRI)

This section evaluates the Prototype Alignment (PA) module on the MMWHS dataset (CT → MRI). Experiments utilize an ImageNet-pretrained MiT-B5 encoder, an FPN decoder, and the model-selection criteria defined in Section 4.1. Table 1 shows that the source-only (CT) MiT-B5 model achieves only 19.4% average Dice on the target domain (MRI). Compared to the 84.9% supervised upper bound, this significant gap confirms a severe domain shift. This discrepancy stems from fundamental differences in physical imaging mechanisms, resulting in distinct brightness and texture features that hinder direct model migration.

We define the baseline as a Mean-Teacher framework incorporating style transfer (FFT or Histogram Matching) and consistency regularization. Using the FFT-based baseline, performance reaches 62.0%, indicating that FFT effectively aligns input-space styles while the Mean Teacher establishes initial decision boundaries via entropy-filtered pseudo-labels. Integrating the PA module further improves the Dice score to 64.7% (a 2.7% gain), supporting our hypothesis that prototype alignment functions in the deep feature space. By minimizing inter-domain prototype distances, features of the same category cluster more closely. This process effectively corrects noisy pseudo-labels from the Mean Teacher, enhancing cross-modality segmentation precision.

Table 1. Effectiveness of the PA module on the MMWHS dataset (CT → MRI)

4.3. Effectiveness of the Attention-based Auxiliary Local Prediction Module on the MMWHS Dataset (CT → MRI)

This section investigates the effectiveness of the Attention-based Auxiliary Local Prediction (AALP) module on the MMWHS dataset (CT → MRI).
To enhance the model's ability to capture anatomical structures, we introduced the local auxiliary prediction module. Table 2 illustrates the impact of different cropping strategies on model performance. The local auxiliary prediction module with a random-cropping strategy increased the average Dice score from 62.0% to 67.2%, a 5.2% gain. This indicates that learning global and local contextual correlations between features allows the model to effectively capture vital anatomical structural information in medical images. Furthermore, applying cosine constraints to global and local features compels the model to distinguish relationships between neighboring pixels, thereby enhancing the robustness of pseudo-labels against interference.

The proposed Attention-based Auxiliary Local Prediction (AALP) further pushes performance to 68.1%. The key to this additional 0.9% gain lies in precise positioning. As described in the Methodology chapter, we utilize the Transformer's attention matrix to calculate the global mean attention \mu_A, automatically locating high-attention regions (such as the main heart structure). This ensures that every local patch fed into the auxiliary network contains rich semantic information, avoiding interference from background noise and allowing the model to focus on optimizing the edge details of anatomical structures.

Table 2. Effectiveness of the AALP module on the MMWHS dataset (CT → MRI)

4.4. Effectiveness of the Global and Local Contrastive Learning Module on the MMWHS Dataset (CT → MRI)

This section investigates the effectiveness of the Global and Local Contrastive Learning (GLCL) module on the MMWHS dataset (CT → MRI). The results are summarized in Table 3.
First, this study explores the impact of different style-transfer data augmentations on contrastive learning. Experimental results show that the Dice score using Histogram Matching is 0.6% lower than that using FFT. This discrepancy is likely because Histogram Matching forcibly performs grayscale mapping through a cumulative distribution function (CDF) to align with the target domain. This process often induces a "staircase effect" at points of pixel-intensity discontinuity. In medical imaging, such digitization distortions can form artifacts similar to actual lesions, negatively affecting the discriminative power of features in contrastive learning.

In terms of contrastive strategies, introducing Global Contrastive Learning (GCL) alone improves the Dice score to 69.2%. GCL forces the network to ignore surface-level style deviations and focus on capturing global structure and semantic information by pulling together images with "the same semantics but different styles." While this allows the network to understand the spatial layout of anatomical structures from a global perspective and more accurately locate the overall extent of organs, GCL projects the entire image into a single global feature, which tends to lose pixel-level information. In contrast, Local Contrastive Learning (LCL) aligns pixels of the same style, enabling the model to understand the relationship between each pixel and its neighbors. This leads to more accurate organ-boundary delineation and improves the contour clarity of small objects. The results confirm that combining GCL and LCL yields a complementary effect, further increasing the Dice score from 69.2% to 69.5%. Finally, the GCL in this experiment utilizes MoCo, which uses a queue to store historical negative samples.
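The MoCo-style queue amounts to a fixed-size FIFO buffer of teacher features; a minimal NumPy sketch (the capacity and feature dimension are illustrative):

```python
import numpy as np

class NegativeQueue:
    """Fixed-size FIFO buffer of historical negative features (MoCo-style)."""

    def __init__(self, capacity, dim):
        self.buf = np.zeros((capacity, dim))
        self.capacity, self.ptr, self.size = capacity, 0, 0

    def enqueue(self, batch):
        # overwrite the oldest slots with the newest teacher features
        for vec in batch:
            self.buf[self.ptr] = vec
            self.ptr = (self.ptr + 1) % self.capacity
            self.size = min(self.size + 1, self.capacity)

    def negatives(self):
        return self.buf[:self.size]

q = NegativeQueue(capacity=4, dim=2)
q.enqueue(np.arange(12).reshape(6, 2))  # 6 features into a 4-slot queue
# the two oldest features are evicted; only the 4 newest remain as negatives
```

Because the queue decouples the number of negatives from the batch size, it delivers the diversity of a large batch at the memory cost of a small one, which is the trade-off the next paragraph quantifies.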
This approach maintains stable feature-learning quality and model performance while significantly reducing the requirements for hardware computational resources and GPU memory.

Table 3. Effectiveness of the GLCL module on the MMWHS dataset (CT → MRI)

4.5. Ablation Study of Key Modules on the MMWHS Dataset

To quantify the contribution of each proposed module to cross-modality adaptation and to verify their complementarity, we conducted a step-by-step ablation experiment on the MMWHS dataset. Table 4 presents the average Dice-score performance under different component configurations. First, using Mean Teacher as the baseline architecture, the Dice scores are 62.0% for the CT → MRI task and 70.3% for the MRI → CT task. After introducing the Prototype Alignment (PA) module, performance improved to 64.7% (+2.7%) and 72.1% (+1.8%), respectively. This result confirms that performing initial category-centroid alignment in the feature space effectively corrects the confirmation bias generated by the Mean Teacher.

Significant improvements were observed upon further integrating the Attention-based Auxiliary Local Prediction (AALP). In the CT → MRI task, the Dice score jumped from 64.7% to 69.3%. In the MRI → CT task, the increase was even more substantial, soaring from 72.1% to 79.5%. This indicates that global feature alignment (PA) alone is insufficient to capture fine anatomical boundaries. The AALP module utilizes the attention mechanism for precise localization of the main heart structure, forcing the model to learn high-resolution semantic features within local patches, thereby solving the detail loss common in cross-modality transitions. Finally, the addition of Global and Local Contrastive Learning (GLCL) served as the ultimate optimization.
The model achieved optimal performance of 69.9% and 80.2% in the two tasks, respectively. By pulling similar features closer and pushing dissimilar features apart, GLCL further eliminates boundary ambiguity, yielding segmentation results with more complete geometric structures.

Table 4. Ablation study of key modules on the MMWHS dataset

4.6. Effectiveness of the Proposed UDA Method on the PE Dataset

To validate the performance of the proposed UDA framework in real-world clinical scenarios, this study conducted bidirectional cross-center adaptation experiments between two pulmonary embolism datasets from different source centers, FUMPE and CAD-PE. Table 5 presents the performance of the model in terms of the IoU metric. First, observing the baseline performance without any adaptation (w/o adaptation): when the model is trained on FUMPE and directly inferred on CAD-PE, the IoU is only 0.1152; conversely, when trained on CAD-PE and inferred on FUMPE, the IoU is only 0.1705. Compared to the theoretical upper bounds of supervised learning (FUMPE: 0.5208 / CAD-PE: 0.4895), this significant performance degradation confirms that even within the same modality (CTPA), significant domain shift still occurs due to differences in contrast-injection timing, scanning parameters, and patient populations between hospitals, preventing direct model migration.

Upon introducing the UDA method proposed in this study, which includes PA, AALP, and GLCL, the model demonstrated remarkable adaptation capability. For FUMPE → CAD-PE, the IoU significantly increased from 0.1152 to 0.4153, a gain of 0.3001. For CAD-PE → FUMPE, the IoU increased from 0.1705 to 0.4302, an increase of 0.2597.
These results indicate that our method successfully reduces the distribution gap between cross-center images, proving that the model has established correct lesion-feature representations on the unlabeled target domain.

Table 5. Effectiveness of the proposed UDA method on the PE dataset

4.7. Visualization of Attention-based Local Patch Extraction

To verify whether the Attention-based Auxiliary Local Prediction (AALP) module proposed in this study can effectively replace random cropping, Fig. 6 showcases examples of local patches localized and cropped by the AALP module based on the attention matrix. The green areas in the figure indicate the ground-truth locations of the pulmonary embolism lesions. In traditional random-cropping strategies, because pulmonary embolism lesions account for a very low proportion of the entire CTPA image, randomly selected patches have a high probability of containing only non-informative background. This makes it difficult for the auxiliary network to learn positive-sample features. In contrast, the results in Fig. 6 demonstrate that almost every local patch extracted under the guidance of the attention mechanism contains pulmonary embolism lesions. This confirms that our strategy of utilizing attention weights for localization can effectively filter out low-information background regions, ensuring that every piece of data fed into the auxiliary network possesses learnable information.

4.8. Comparison with State-of-the-Art Methods

Table 6 compares the proposed method with current mainstream SOTA methods on the MMWHS dataset (CT → MRI).
It is noteworthy that the experimental setup in this study adopts more stringent validation criteria (indicated by ※ in the table), in which no target-domain labels are used during the model-selection and validation phases. Even without labels to select the optimal weights, our method achieves a SOTA-level score of 69.9%, demonstrating high reliability in real-world clinical scenarios with unlabeled data.

Although Table 6 shows that MAPSeg (80.3%) and FSUDA-V2 (75.5%) achieve higher absolute scores than this study, the computational complexity and hardware-resource thresholds behind these models must be considered. FSUDA-V2 utilizes a complex ensemble strategy that requires pre-training multiple teacher models and combining contrastive learning to distill knowledge into a student model, resulting in a cumbersome training process and substantially increased parameter counts and memory requirements. MAPSeg is based on a 3D CNN architecture, which inherently has much higher computational demands than the 2D architecture used in this study. Furthermore, it requires a Variational Auto-Encoder (VAE) pre-trained on large-scale datasets to extract prior features, which not only significantly increases training time but also imposes strict requirements on hardware computing power. Regarding MA-UDA (68.7%), as a representative of adversarial learning, its training process is highly unstable, and the literature indicates that it requires high-end NVIDIA V100-level GPUs for training, which presents a significant hardware burden for general clinical laboratories. In contrast, the method proposed in this study requires only a single encoder-decoder architecture paired with efficient feature-space structural learning.
Under the hardware condition of a single NVIDIA GeForce RTX 4090 (24 GB VRAM), we achieved high-precision segmentation near 70% based on a 2D network. This indicates that this study successfully achieved a balance between UDA segmentation performance and computational cost, providing a solution better suited for deployment in resource-constrained medical scenarios.

Table 6. Comparison with SOTA methods

5. Conclusions

This study presents a Transformer-based UDA framework integrating Prototype Alignment (PA), Attention-based Auxiliary Local Prediction (AALP), and Global and Local Contrastive Learning (GLCL) to mitigate domain shift in medical imaging. Experimental results show that the proposed multi-level alignment significantly improves performance: the MMWHS (CT → MRI) Dice score increased from 19.4% to 69.9%, while the PE cross-center IoU demonstrated substantial bidirectional gains, rising from 0.1152 to 0.4153 (+0.3001) for FUMPE → CAD-PE and from 0.1705 to 0.4302 (+0.2597) for CAD-PE → FUMPE. Notably, the AALP module effectively localizes minute lesions via self-attention, surpassing traditional random cropping. Compared to SOTA methods, our 2D architecture achieves competitive accuracy using only a single RTX 4090, offering a cost-effective and robust solution for real-world clinical deployment.

References

[1] M. T. Lu et al., "Axial and reformatted four-chamber right ventricle-to-left ventricle diameter ratios on pulmonary CT angiography as predictors of death after acute pulmonary embolism," American Journal of Roentgenology, vol. 198, no. 6, pp. 1353-1360, 2012.

[2] C. Cano-Espinosa, M. Cazorla, and G. González, "Computer aided detection of pulmonary embolism using multi-slice multi-axial segmentation," Applied Sciences, vol. 10, no. 8, p. 2945, 2020.

[3] K.
Long et al., "Probability-based Mask R-CNN for pulmonary embolism detection," Neurocomputing, vol. 422, pp. 345-353, 2021.

Figure 6. Attention-based local patch samples

[4] T. Trongmetheerat, K. Sukprasert, K. Netiwongsanon, T. Leeboonngam, and K. Sumetpipat, "Segment-based and Patient-based Segmentation of CTPA Image in Pulmonary Embolism using CBAM ResU-Net," in Proceedings of the 13th International Conference on Advances in Information Technology, 2023, pp. 1-7.

[5] H. Guan and M. Liu, "Domain adaptation for medical image analysis: a survey," IEEE Transactions on Biomedical Engineering, vol. 69, no. 3, pp. 1173-1185, 2021.

[6] X. Liu et al., "Deep unsupervised domain adaptation: A review of recent advances and perspectives," APSIPA Transactions on Signal and Information Processing, vol. 11, no. 1, 2022.

[7] C. Chen, Q. Dou, H. Chen, J. Qin, and P.-A. Heng, "Synergistic image and feature adaptation: Towards cross-modality domain adaptation for medical image segmentation," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, no. 01, pp. 865-872.

[8] C. Chen, Q. Dou, H. Chen, J. Qin, and P. A. Heng, "Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation," IEEE Transactions on Medical Imaging, vol. 39, no. 7, pp. 2494-2505, 2020.

[9] W. Ji and A. C. Chung, "Unsupervised domain adaptation for medical image segmentation using transformer with meta attention," IEEE Transactions on Medical Imaging, vol. 43, no. 2, pp. 820-831, 2023.

[10] Y. Yang and S. Soatto, "FDA: Fourier domain adaptation for semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4085-4095.

[11] Z. Zhou, L. Qi, and Y.
Shi, "Generalizable medical image segmentation via random amplitude mixup and domain-specific image restoration," in European Conference on Computer Vision, 2022: Springer, pp. 420-436.

[12] W. Feng, L. Ju, L. Wang, K. Song, X. Zhao, and Z. Ge, "Unsupervised domain adaptation for medical image segmentation by selective entropy constraints and adaptive semantic alignment," in Proceedings of the AAAI Conference on Artificial Intelligence, 2023, vol. 37, no. 1, pp. 623-631.

[13] S. Liu, S. Yin, L. Qu, M. Wang, and Z. Song, "A structure-aware framework of unsupervised cross-modality domain adaptation via frequency and spatial knowledge distillation," IEEE Transactions on Medical Imaging, vol. 42, no. 12, pp. 3919-3931, 2023.

[14] S. Liu, S. Yin, L. Qu, and M. Wang, "Reducing domain gap in frequency and spatial domain for cross-modality domain adaptation on medical image segmentation," in Proceedings of the AAAI Conference on Artificial Intelligence, 2023, vol. 37, no. 2, pp. 1719-1727.

[15] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker, "Learning to adapt structured output space for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7472-7481.

[16] J. Hoffman et al., "CyCADA: Cycle-consistent adversarial domain adaptation," in International Conference on Machine Learning, 2018: PMLR, pp. 1989-1998.

[17] X. Lai et al., "DecoupleNet: Decoupled network for domain adaptive semantic segmentation," in European Conference on Computer Vision, 2022: Springer, pp. 369-387.

[18] A. Tarvainen and H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," Advances in Neural Information Processing Systems, vol. 30, 2017.

[19] K.
Sohn et al., "FixMatch: Simplifying semi-supervised learning with consistency and confidence," Advances in Neural Information Processing Systems, vol. 33, pp. 596-608, 2020.

[20] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.

[21] Z. Liu et al., "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012-10022.

[22] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 12077-12090, 2021.

[23] M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. Shahbaz Khan, and M.-H. Yang, "Intriguing properties of vision transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 23296-23308, 2021.

[24] S. Paul and P.-Y. Chen, "Vision transformers are robust learners," in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, vol. 36, no. 2, pp. 2071-2081.

[25] L. Hoyer, D. Dai, and L. Van Gool, "DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9924-9935.

[26] W. Ji and A. C. Chung, "Unsupervised domain adaptation for medical image segmentation using transformer with meta attention," IEEE Transactions on Medical Imaging, vol. 43, no. 2, pp. 820-831, 2023.

[27] X.
Zhang e t al., "MAPSeg: U nified U nsupervised Domain Ada ptation for He teroge neous Medic al Image Segmentation Based on 3D Masked Autoe ncodi ng and Pseudo-La beling, " in Procee dings of the IE EE/CVF Confer ence on Comput er Vision and Pattern Recognition, 2 024, pp. 58 51-5862. [28] S. B en- David, J. Blitze r, K. Cra mmer, A. Kule sza, F. P er eira , and J. W. V aughan, "A theor y of lea rning from diff ere nt domains," Mac hine lea rning, vo l. 79, no. 15 1, pp. 151-175, 201 0. [29] H. Ma, X. Lin, Z. W u, and Y. Y u, " Coarse- to -fine domain ada ptive semantic se gmen tation wi th photometric a lignment a nd cate gory-ce nter re gulariza tion," in Proc ee dings of the IEEE /C VF conf ere nc e on c omputer vision and pa ttern re cognition, 2021, pp. 4051-406 0. [30] T. Lei, D. Zha ng, X. Du, X. W ang, Y. W an, a nd A. K. Na ndi, "Semi-supervise d medical i mage segmenta tion using adve rsar ial consistency le arning and dynamic c onvolution networ k, " IEEE transa ctions on medica l im aging, vo l. 42, no. 5, pp. 126 5-1277, 2022. [31] Z. Zhao, F. Zhou, K. Xu , Z. Zeng, C. Guan, a nd S. K. Zhou, "LE-UDA: Labe l-ef ficient unsuper vised domain ada ptation for medic al i mage segme ntation," IEEE T ransa ctions on Medical I maging, vol. 42, no. 3, pp. 633-646, 2022. [32] F. Yu, M. Zhang, H. D ong, S . Hu, B. Don g, and L. Zhang, "Da st: Un super vised domain ada ptat ion in semantic se gmentation based on dis cr imina tor attention and se lf-training," in Proc eedings of the AAAI Confer ence on Artificia l Intelligence , 2021, vol. 35, no . 12, pp. 10754-1076 2. [33] C. S . Perone , P. B al lester , R. C . Barr os, and J. Cohen-Ada d, "Unsupervised do main a daptation for medica l im aging se g mentation wi th self-ensembling, " Neur oImage , vol. 194, pp. 1-11, 2019. [34] L. Hoyer , D. Da i, and L. 
Van G ool, " HRDA : Context-awa re high-resolution do main-adaptive semantic se gmentation, " i n Europe an Confe renc e on Comput er Vis ion, 2022: Spr inger , pp. 372-391. [35] H. Wu, Z. W ang, Y. Song, L. Ya ng, and J. Q in, "C ross-pa tch dense contra sti ve le arning for semi- supervise d segmentation of c ellular nuc lei in histopatho logic images, " i n Proce edings of the IEEE/CVF conf ere nce on c omputer vision and pa ttern re cognition, 2022, pp. 11666-11675. [36] H. Yao, X . Hu, and X . Li, " Enhanc ing pse udo label qua lit y for semi-supervised do m ain-ge nera lized medica l im age segmenta tion," in Proc ee dings of the AAAI Confer ence on Artificia l Intelligence , 2022, vol. 36, no. 3, pp. 30 99-3107. [37] M. Chen, Z. Zheng, Y. Y ang, a nd T.- S. C hua, "Pipa: Pixe l-and patc h-wise self- supervise d learning f or domain ada ptative semantic se gmentat ion, " i n Proce edings of the 31st A CM Inter national Confer ence on Multimedia , 2023, p p. 1905-1914. [3 8] Y. Wa ng et al., "Semi-supe rvised sema ntic segmenta tion using unre liable pseudo-labe ls," in Proce edings of the I EEE/CVF c onfer ence on computer vision and patter n rec ogniti o n, 2022, pp . 4248-4257. [3 9] X. Guo, C. Ya ng, B. Li, and Y . Yuan, "Metacor rec tion: Domain-a war e meta loss cor re ction for unsupe rvised domain a daptation in semantic segmenta tion," in Proc ee dings of the IEEE/CV F conf ere nc e on c omputer vision and pa ttern re cognition, 2021, pp. 3927-393 6. [40] K. He, H . Fan, Y. Wu, S. Xie, a nd R. G irshick, "Mom en t um contra st for unsu per vised visual re prese ntation learning," in Proc ee dings of the IEEE/CVF conf ere nce on c omputer vision and pa ttern re cognition, 2020, pp. 9729-9738. [4 1] M. Masoudi, H.- R. Pourrez a, M. Saa datmand- Tar zjan, N. E ftekha ri, F. S . Zar gar, a nd M. P. R ad, " A new da taset of computed-tomography a ngiography images f or computer- aided de tection of pulmonar y embolism," Scientific da ta, vol. 5 , no. 
1, pp. 1 -9, 2018. [4 2] G. Gonz á lez e t al., " Compute r aide d detec tion for pulmonary e mbolism challe nge (CAD-PE), " arX iv pre print arXiv:2003.13440, 2 020. [4 3] X. Zhua ng and J. Shen, "Multi-scale patch a nd multi - modality at lase s for whole he ar t segmentation of MRI," Me dical image a nalysis, vo l. 31, pp. 77-87, 2016. [4 4] J. Deng, W . Dong, R. Socher, L.-J. Li, K . Li, and L. Fei-Fei, "Imagene t: A large-sc ale hie rarchica l im age databa se," in 2009 IEEE conf er ence on computer vision and pa ttern re cognition, 2009: Ieee , pp. 248-255. [45] A. Kirillov, K. He , R. Gir shick, and P. Dollá r, " A unified a rchitec ture f or instance and sema ntic segmenta tion," in CVP R, 2017. [46] S. Zheng, G. Wa ng, Y. Yua n, and S. Huang, "Fine-gr ained image classifica tion based on TinyVi t object loca tion and gra ph convolut ion network, " Journal of Visua l C om munication and Ima ge Repre sentation, vol. 100, p. 1 04120, 202 4.