Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation



Yuanfan Zheng¹, Kunyu Peng²,³, Xu Zheng⁴, Kailun Yang¹,*
¹Hunan University  ²Karlsruhe Institute of Technology  ³INSAIT, Sofia University "St. Kliment Ohridski"  ⁴HKUST(GZ)
*Corresponding author (e-mail: kailun.yang@hnu.edu.cn).

Abstract

Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360° scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting and propose the Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework, which trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and the semantic uncertainty arising from previously unseen classes. To this end, we propose Euler-Margin Attention (EMA), which introduces an angular margin to enhance viewpoint-invariant semantic representation, while performing amplitude and phase modulation to improve generalization toward unseen classes. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The code is available at https://github.com/zyfone/EDA-PSeg.

1. Introduction

Panoramic vision [1, 31] provides an omnidirectional perspective with a 360° Field of View (FoV), enabling occlusion-aware and seamless scene perception [4, 5, 32]. Cross-domain Panoramic Segmentation (CPS) [37, 60, 69] addresses the domain shift induced by FoV differences between conventional pinhole and panoramic images. In the Unsupervised Domain Adaptation (UDA) setting [50], CPS models are trained on labeled pinhole images (source domain) and adapted to unlabeled 360° panoramic images (target domain). By alleviating annotation costs and overcoming the limited FoV of pinhole cameras, CPS facilitates robust wide-angle semantic scene understanding for applications in autonomous driving [37, 55] and robotics [2, 51].

Figure 1. Extrapolative Domain Adaptation (EDA) extends beyond the training FoV and known semantic categories, facilitating the transfer of knowledge from pinhole supervision to unlabeled 360° perception. It addresses both cross-view geometric distortions and semantic extrapolation to unknown categories.

Most existing CPS methods [62, 69, 72] operate under a closed-set assumption, where all test instances belong to the same classes observed during training. Under this setting, these methods generally adopt two complementary strategies to handle domain shifts across camera views. The first strategy focuses on FoV alignment [68, 69], aiming to reduce distortions introduced by variations in camera FoV. Representative approaches address this issue through geometric projection adaptation [68] or sliding-window patch processing [62, 69, 72]. The second strategy focuses on semantic alignment between the source and target views [61, 72], adapting category prototype features to match the category distribution in closed-set scenarios.
Current CPS methods perform well in controlled environments but often struggle in open-world scenarios [9, 10]. This limitation is particularly critical in autonomous driving, where vehicles frequently encounter unseen objects outside their field of view, posing significant safety risks. While Open-Set Domain Adaptation (OSDA) [13, 15] offers potential solutions, existing approaches such as BUS [9] and UniMAP [10] rely on constructing category prototypes for pixel-level mapping or prototype weight scaling. Panoramic images suffer from style inconsistencies and geometric distortions [68], making them ill-suited for existing pinhole-domain Open-Set UDA methods designed for pixel-level domain alignment. The challenge is further amplified under diverse weather conditions [35, 57, 65], where adapting from one or multiple weather conditions to diverse adverse weather scenarios disrupts cross-domain alignment.

In this paper, we formulate an open-set cross-domain panoramic semantic segmentation task, which aims to enable models to generalize to unseen categories while adapting to diverse FoV scenes under varying weather conditions. To the best of our knowledge, this is the first study to address this challenging yet practical problem, which is crucial for achieving comprehensive and reliable scene understanding in unconstrained real-world environments. To this end, we propose Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg), as shown in Fig. 1. This novel framework emphasizes extrapolating knowledge learned from conventional pinhole-view images to omnidirectional 360° panoramic scenes, thereby facilitating both cross-view generalization and open-set domain adaptation. Specifically, EDA-PSeg consists of two core components: the Graph Matching Adapter (GMA) and Euler-Margin Attention (EMA).
GMA samples and synthesizes graph nodes to model high-order class relations, overcoming the limitations of pixel-level prototype representations, and performs open-set graph matching with regularization to separate known and unknown classes. EMA employs an Euler formula-based transformation to project features into an angle-aware embedding space. It achieves amplitude and phase modulation through learnable parameters, enabling adaptive adjustment of the amplitude distribution and phase scaling according to the semantic angle for cross-view generalization.

The framework is evaluated under diverse domain shifts, including camera and weather conditions. We conduct extensive experiments on four benchmarks spanning five datasets, encompassing Camera Shift and Weather Shift. Our method consistently outperforms existing approaches, achieving an improvement of 3.39% mIoU on the Cityscapes → DensePASS benchmark. Furthermore, we systematically evaluate and benchmark existing CPS methods alongside OSDA methods under the EDA-PSeg setting. These results not only verify the effectiveness of our approach in cross-view generalization but also demonstrate its superior generalization across open-set and adverse weather conditions.

Our main contributions are summarized as follows:
• We formulate a practical cross-view setting for Open-Set UDA panoramic semantic segmentation and propose Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg), a framework that simultaneously mitigates FoV-induced geometric distortions and enables semantic extrapolation to unseen viewpoints.
• We propose the Graph Matching Adapter (GMA) module, which models high-order semantics to align known classes while effectively distinguishing novel categories.
• We further propose the Euler-Margin Attention (EMA) module, which enhances semantic-angle representations and improves cross-view generalization to FoV shifts and novel classes.
• Extensive experiments on multiple benchmarks, including Cityscapes, ACDC, DensePASS, GTA, and SynPASS, demonstrate that our method achieves state-of-the-art performance under diverse domain shifts, establishing an effective baseline for Open-Set UDA panoramic segmentation.

2. Related Work

Cross-Domain Panoramic Segmentation (CPS). Existing CPS methods [37, 60, 69] adopt UDA settings to bridge the domain gap between synthetic or real pinhole images and 360° panoramic images. Most approaches focus on geometric field-of-view alignment and semantic prototype alignment. For distortion mitigation, CFA [68] introduces distortion-aware attention to capture pixel distribution discrepancies, while DPPASS [69] employs cross-domain and intra-projection training with tangent projections. Trans4PASS [61] further addresses geometric distortions through deformable patch embeddings and MLP-based adaptation. In terms of semantic alignment, OmniSAM [72] leverages SAM [26] for prototype alignment, and GoodSAM [62] incorporates SAM boundary priors within a knowledge distillation framework. Additional studies [63, 69, 72] enhance semantic consistency via mutual prototypical adaptation and pseudo-label refinement. Broader frameworks [22, 31, 54, 56] extend these ideas to foundation models and representation learning [23, 24, 61, 70], with OPS [67] providing further complementary advances. In the context of source-free domain adaptation, where source data is unavailable, recent methods [4, 5, 62, 63, 71] emphasize self-training strategies and integration with SAM [26]. In this paper, we propose EDA-PSeg, the first framework designed for Open-Set UDA CPS.

Positional Encoding for Self-Attention.
Positional encoding [7] is a fundamental component of self-attention, allowing the model to capture semantic structures in both natural language processing [25, 64] and computer vision [11, 17]. Depending on how positional information is represented, existing schemes can be broadly categorized as absolute [49] or relative [41]. Absolute positional encoders assign unique position-dependent representations using sinusoidal or learned embeddings, while relative encoders model only positional offsets between query and key tokens to achieve translation invariance. CFA [68], for example, introduces a distortion-aware attention mechanism for the CPS task that leverages absolute positional encoding. Meanwhile, recent large language models, such as LLaMA, adopt Rotary Positional Embeddings (RoPE) [46], which integrate both absolute and relative positional information. Building on this idea, EulerFormer [47] unifies semantic and positional representations in Euler space. In this work, we propose Euler-Margin Attention, a module for semantic extrapolation that robustly generalizes across cross-view geometric distortions via angle-margin projection and amplitude and phase modulation in complex vector space.

Figure 2. Illustration of the proposed EDA-PSeg, which tackles cross-view and semantic extrapolation. The framework incorporates two key components: the Graph Matching Adapter (GMA), which constructs a high-order graph to align domain-shared class graph nodes, and the Euler-Margin Attention (EMA), which models features with the Euler formula to enhance angle invariance under unseen viewpoints.

Cross-domain Graph Matching. Graph matching establishes topological correspondences among features by jointly optimizing node and edge affinities, thus enabling structured reasoning over relational dependencies.
Recent advances in graph-based methods have substantially improved segmentation [8, 33] through enhanced semantic alignment and boundary-aware feature interactions. Moreover, cross-domain graph matching has shown remarkable effectiveness across diverse tasks, including cross-domain named entity recognition [66], medical image analysis [16, 36], and object detection [6, 28, 29, 34]. Despite significant progress under the closed-set setting, existing cross-domain graph matching methods remain in their infancy in the open-set scenario. In this paper, we remodel closed-set graph matching to support open-set graph matching. This enables the handling of unknown-class nodes, bridging the gap for graph matching.

3. Method

3.1. Overview

The architecture of the proposed EDA-PSeg is illustrated in Fig. 2, highlighting its main components and processing pipeline. We consider a labeled source domain of pinhole images $\mathcal{D}_s = \{(x_s^i, y_s^i)\}_{i=1}^{B}$ and an unlabeled target domain of panoramic images $\mathcal{D}_t = \{x_t^i\}_{i=1}^{B}$. Following the architectural overview, we define the label spaces for both domains. The source label space $\mathcal{Y}_s$ contains the known classes, while the target label space is defined as $\mathcal{Y}_t = \mathcal{Y}_s \cup \mathcal{Y}_u$, where $\mathcal{Y}_u$ denotes additional unknown classes. Random cropping is applied to source and mixed samples for label and pseudo-label supervision, while source-only and target-only crops are used for graph sampling. The cropped images are then fed into an encoder-decoder network, followed by the proposed Euler-Margin Attention (EMA). EMA constructs an Euler-Margin projection to embed features into a bounded angular space, mitigating cross-view geometric distortions while preserving the panoramic semantic structure. Moreover, EMA performs amplitude and phase modulation to enhance feature separability between known and unknown classes.
The features are then passed to the Graph Matching Adapter (GMA), which performs graph matching to geometrically decouple known classes while pushing unknown classes apart, thereby facilitating semantic extrapolation to unseen categories.

3.2. Graph Matching Adapter (GMA)

Given source $\mathcal{D}_s$ and target $\mathcal{D}_t$ domains, the encoder-decoder network extracts node features $V_{s,t} \in \mathbb{R}^{B \times N \times d}$, where $N = H \times W$ and $d$ is the feature channel dimension. The $s,t$ subscripts are omitted unless explicitly specified.

Node Sampling. We perform local node sampling guided by confidence, entropy, and prototype distance to select nodes representing local semantics, and subsequently aggregate them into class-wise global prototypes to form robust global semantic representations. Given the (flattened) node features $V$ from the encoder and decoder, we first apply the classifier head to obtain per-pixel predictions. For each pixel, we compute a confidence score $p_i$ and an uncertainty measure $H(p_i)$. For any base class index $b$, we define the positive set $S^{(b)}_{\mathrm{pos}} = \{V_i \mid p_i > \tau_p \wedge H(p_i) < \tau_e\}$ and the negative set $S^{(b)}_{\mathrm{neg}} = \{V_i \mid p_i \leq \tau_p \wedge H(p_i) < \tau_e\}$, where $\tau_p$ and $\tau_e$ denote confidence and entropy thresholds, respectively, determined using percentile strategies with 0.5. We aggregate prototypes as $G^{(b)} = \omega^{(b)} + \sigma$, where $\sigma$ is Gaussian noise and $\omega^{(b)}$ is the mean of the features sharing the same label. To refine node selection, we retain the $K$ nearest nodes to the prototype, defined as $\hat{S}^{(b)}_{\mathrm{pos,neg}} = \{V_i \mid d_i \leq d^{(K)},\ d_i = \|V_i - \omega^{(b)}\|\}$, where $d_i$ denotes the distance of the $i$-th node to the current prototype mean $\omega^{(b)}$. For a novel class index $n$, nodes are sampled based on similar confidence but distinct entropy values, specifically those satisfying $H(p_i) < \tau_m$ for positive and $H(p_i) > \tau_m$ for negative samples, where $\tau_m$ is the median entropy of the distribution $H(p)$.
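The confidence/entropy-guided sampling described above can be sketched as follows. This is a minimal single-class illustration with hypothetical names: the entropy here is a binary surrogate of the full class-distribution entropy, both thresholds default to the 50th percentile (mirroring the percentile strategy in the text), and the Gaussian noise on the prototype is omitted for determinism.

```python
import numpy as np

def sample_nodes(feats, probs, tau_p=None, tau_e=None, k=4):
    """Select positive/negative nodes for one class by confidence and entropy,
    then keep the K nodes nearest to the class prototype (names are ours).

    feats: (N, d) flattened node features; probs: (N,) max-class confidence.
    """
    eps = 1e-12
    # Binary surrogate for the per-pixel prediction entropy H(p_i).
    ent = -(probs * np.log(probs + eps) + (1 - probs) * np.log(1 - probs + eps))
    tau_p = np.percentile(probs, 50) if tau_p is None else tau_p
    tau_e = np.percentile(ent, 50) if tau_e is None else tau_e

    # Positive: confident AND low-entropy; negative: unconfident AND low-entropy.
    pos = (probs > tau_p) & (ent < tau_e)
    neg = (probs <= tau_p) & (ent < tau_e)

    # Prototype = mean of positive features (the paper additionally adds noise).
    proto = feats[pos].mean(axis=0)

    # Refinement: keep the K positives closest to the prototype.
    dist = np.linalg.norm(feats[pos] - proto, axis=1)
    keep = np.argsort(dist)[:k]
    return feats[pos][keep], proto, feats[neg]
```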
This procedure yields the sets $\hat{S}^{(n)}_{\mathrm{pos,neg}}$ and their corresponding prototypes $G^{(n)}$. Finally, by concatenating the base and novel class nodes along with their corresponding prototypes, we obtain the set $S_\varepsilon$ as follows:

$$S_\varepsilon = \left\{ \hat{S}^{(b)}_{\mathrm{pos,neg}},\ G^{(b)},\ \hat{S}^{(n)}_{\mathrm{pos,neg}},\ G^{(n)} \right\}, \tag{1}$$

where $\hat{S}^{(b),(n)}_{\mathrm{pos,neg}}$ represents pixel-level nodes capturing class diversity, and $G^{(b),(n)}$ denotes the node means representing the global class distribution in the batch data. We perform both local and global class sampling to facilitate graph matching for dense prediction, while avoiding excessive nodes that could lead to computational redundancy. We then update the global memory bank $M$ using the candidate node samples $S_\varepsilon$ via an exponential moving average with parameter $\alpha$, which maintains a stable representation by weighting the new mean sample vector $V_i$ against the existing memory, as $M \leftarrow (1 - \alpha) M + \alpha \frac{1}{|S_c|} \sum_{i \in S_c} V_i$, applied independently for each class-specific set $S_c$.

Graph Generation. The Graph Generation module first identifies shared categories between the source and target domains, fills in missing categories using the memory bank $M$, and then builds node and graph affinities via self-attention [49]. Given a candidate node set $S_\varepsilon$, missing-class nodes are completed using statistics from both source and target domains. The global memory bank $M$ serves as a proxy, and supplementary nodes are generated as $V_c = M_c + \mathcal{N}(\mu, \sigma)$, where $\mathcal{N}(\mu, \sigma)$ is a Gaussian with mean $\mu$ from the current-domain memory and standard deviation $\sigma$ from the counterpart domain. The generated nodes $V_c$ are appended to $S_\varepsilon$, which is then updated via multi-head self-attention as $S_\varepsilon \leftarrow \mathrm{Softmax}\!\left(\frac{S_\varepsilon W_q (S_\varepsilon W_k)^\top}{\sqrt{d_k}}\right)(S_\varepsilon W_v) + S_\varepsilon$ to integrate global dependencies within and across domains.
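The exponential-moving-average memory update above, $M \leftarrow (1-\alpha)M + \alpha \cdot \mathrm{mean}(S_c)$, can be sketched in a few lines (a minimal illustration; the dictionary-based interface and names are ours):

```python
import numpy as np

def update_memory(memory, class_nodes, alpha=0.1):
    """EMA update of a per-class global memory bank:
    M_c <- (1 - alpha) * M_c + alpha * mean(nodes of class c).

    memory: {class_id: (d,) vector}; class_nodes: {class_id: (n_c, d) array}.
    Each class-specific set is updated independently, as in the text.
    """
    for c, nodes in class_nodes.items():
        mean_c = np.mean(nodes, axis=0)
        memory[c] = (1.0 - alpha) * memory[c] + alpha * mean_c
    return memory
```

A small $\alpha$ keeps the bank stable across noisy batches; $\alpha = 1$ would overwrite it with the current batch mean.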
The resulting node set $S_\varepsilon$, together with the edge affinity $\xi = F_{\mathrm{drop}}\!\left(\mathrm{Softmax}\!\left[S_\varepsilon W_q (S_\varepsilon W_k)^\top\right]\right)$, forms the semantic entities and topological structure of the graph, respectively.

Graph Matching and Regularization. This part comprises three loss components: the open-set graph matching loss aligns classes shared across domains, the graph edge affinity loss preserves structural consistency, and the unknown regularization loss encourages separation of unknown classes. From the candidate node set $S_\varepsilon$, we extract the source and target nodes, $V_s$ and $V_t$, respectively. Based on these nodes, we compute a node matching matrix $A = \mathrm{Sinkhorn}(\mathrm{InstNorm}(\phi(V_s, V_t)))$, where $\phi(\cdot,\cdot)$ is a learnable affinity function, $\mathrm{InstNorm}(\cdot)$ normalizes the affinities, and Sinkhorn enforces approximate double stochasticity. We then construct the matching labels from the source-domain labels and the target-domain pseudo labels. Specifically, we define the open-set matching label, ignoring the unknown class, as $M = (H_s H_t^\top) \odot \big((H_s e_{\mathrm{unk}} = 0)(H_t e_{\mathrm{unk}} = 0)^\top\big)$, where $H_s$ and $H_t$ denote the one-hot matrices of the source and target domain node sets, $e_{\mathrm{unk}}$ is the unit vector corresponding to the unknown class, and $\odot$ represents element-wise multiplication. The graph matching loss for the node set $S_\varepsilon$ is defined as follows:

$$\ell_{\mathrm{graph}} = \underbrace{\frac{1}{|S_\varepsilon|} \sum_{(i,j) \in S_\varepsilon} (A_{ij} - M_{ij})^2}_{\text{Graph Matching Loss}} + \underbrace{\frac{1}{|A|} \left\| \xi_s A - A \xi_t \right\|_1}_{\text{Graph Edge Loss}} + \underbrace{\frac{\beta}{|K_t||U_t|} \left\| \tilde{V}_t^{K_t} (\tilde{V}_t^{U_t})^\top \right\|_F^2 + \frac{\beta}{|K_s||U_s|} \left\| \tilde{V}_s^{K_s} (\tilde{V}_s^{U_s})^\top \right\|_F^2}_{\text{Unknown-aware Regularization Loss}}, \tag{2}$$

where $K_{s,t}$ and $U_{s,t}$ denote the known and unknown node sets, $\xi$ represents the edge affinities, $\tilde{V}$ denotes the unit-normalized nodes, $\beta$ is the corresponding weight, and $\|\cdot\|_1$ and $\|\cdot\|_F$ denote the $\ell_1$ and Frobenius norms, respectively.
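The Sinkhorn step and the first term of Eq. (2) can be sketched as follows. This is a minimal NumPy illustration under our own naming: the learnable affinity $\phi$ and the InstNorm step are omitted, and `sinkhorn` simply alternates row/column normalization of the exponentiated affinities, which drives the matrix toward double stochasticity.

```python
import numpy as np

def sinkhorn(affinity, n_iters=50):
    """Alternating row/column normalization of exp(affinity), yielding an
    approximately doubly stochastic node matching matrix A."""
    a = np.exp(affinity - affinity.max())  # numerically stabilized
    for _ in range(n_iters):
        a /= a.sum(axis=1, keepdims=True)  # rows sum to ~1
        a /= a.sum(axis=0, keepdims=True)  # columns sum to ~1
    return a

def matching_loss(A, M):
    """Squared-error Graph Matching term of Eq. (2) between the matching
    matrix A and the open-set matching label M."""
    return np.mean((A - M) ** 2)
```

A perfectly matched pair (A equal to its label) yields zero loss, while off-diagonal mass in A is penalized quadratically.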
3.3. Euler-Margin Attention (EMA)

We propose Euler-Margin Attention (EMA) for semantic extrapolation between base and novel classes under unseen viewpoints. EMA projects channel features into a complex vector space through an angular-margin transformation to alleviate field-of-view distortions, and jointly modulates amplitude and phase distributions to enhance class separability.

Figure 3. (a) EulerFormer [47] fails to build a semantic angle constraint, limiting generalization to unseen views. (b) Our method employs Euler-Margin Projection to constrain the angle within an interval, and utilizes Amplitude and Phase Modulation to adjust the class distribution for known/unknown class separation.

Standard Euler Formula. We first revisit the Euler formula and define the relevant terms for feature representation, following existing works [46, 47]. Let a feature tensor be $V \in \mathbb{R}^{B \times N \times d}$, represented in rectangular form as $r + i s$, where $r = V_{::2}$ denotes the real components corresponding to even channels and $s = V_{1::2}$ denotes the imaginary components corresponding to odd channels. The transformation based on Euler's formula is then applied as follows:

$$r + i s = \Lambda e^{i\theta}, \quad \Lambda = \sqrt{r^2 + s^2}, \quad \theta = \mathrm{atan2}(s, r), \tag{3}$$

where $\Lambda$ denotes the magnitude, $\theta$ is the phase angle, and $\mathrm{atan2}(\cdot,\cdot)$ is the two-argument arctangent function. As shown in Fig. 3 (a), existing methods struggle to generalize across views and in open-world scenarios because the angular distribution of the same class shifts across views, and the angles of known and unknown classes overlap. To address these issues, we propose the Euler-Margin Attention (EMA) module, as illustrated in Fig. 3 (b). EMA constrains the angle within a limited range, ensuring intra-class compactness to reduce geometric FoV distortions.
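The even/odd channel decomposition of Eq. (3) and its inverse can be sketched as follows (a minimal NumPy illustration; function names are ours):

```python
import numpy as np

def euler_decompose(v):
    """Split channels into real (even-index) and imaginary (odd-index) parts
    and return the amplitude and phase of Eq. (3): r + i*s = Lambda * e^{i*theta}."""
    r, s = v[..., 0::2], v[..., 1::2]
    amp = np.sqrt(r ** 2 + s ** 2)    # Lambda = sqrt(r^2 + s^2)
    phase = np.arctan2(s, r)          # theta in (-pi, pi]
    return amp, phase

def euler_reconstruct(amp, phase):
    """Invert the decomposition back to interleaved real/imaginary channels."""
    r, s = amp * np.cos(phase), amp * np.sin(phase)
    out = np.empty(r.shape[:-1] + (2 * r.shape[-1],))
    out[..., 0::2], out[..., 1::2] = r, s
    return out
```

The round trip is lossless, which is what makes the amplitude/phase view a pure re-parameterization of the feature tensor.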
Moreover, it introduces learnable parameters to adjust the amplitude (class importance) and phase (class direction) distributions, thereby improving open-set generalization under unseen viewpoints.

Euler-Margin Projection. Given an input feature tensor $V \in \mathbb{R}^{B \times N \times d}$, we first apply a channel-wise reordering operator $\pi(\cdot)$, which sorts channels by value in descending order: $\pi(V) = [V_1, V_2, \ldots, V_d]$ s.t. $V_{i+1} < V_i$. The function $\pi(\cdot)$ is realized as a mapping through a soft permutation matrix, which keeps the operation differentiable for backpropagation. The reordered channels are then partitioned into real and imaginary components as $\mathrm{Re}(V) := \pi(V)_{::2}$ and $\mathrm{Im}(V) := \pi(V)_{1::2}$. The Euler-Margin Projection is defined as $F(V) = F([\pi(V)_{::2}, \pi(V)_{1::2}]) = \Lambda \cdot e^{i\theta}$, where $\Lambda$ and $\theta$ denote the amplitude and phase of the complex feature. The channel reordering constrains the angle $\theta$ within the complex vector space, thus mitigating cross-viewpoint discrepancy. Subsequently, we introduce learnable weights to perform amplitude and phase modulation, thereby adjusting the known class distribution and improving class separation.

Figure 4. Feature distributions of the car category visualized via t-SNE across multiple datasets, illustrating the effects of Camera Types (Pinhole, Panoramic, Real, and Synthetic) and varying Weather Conditions (Fog, Rain, Cloudy, Sunny, Night, and Snow).

Amplitude and Phase Modulation. We transform the features $V$ into a complex vector space, resulting in query $V_q$ and key embeddings $V_k$. The resulting self-attention dot product is then expressed via the Euler formula as follows:

$$V_q^\top V_k = (\Lambda_q \odot \Lambda_k)^\top\, \mathrm{Re}\!\left[\exp\!\left(i(\theta_q - \theta_k)\right)\right], \tag{4}$$

where $\Lambda$ and $\theta$ represent the amplitude and phase. We then introduce learnable scaling and bias factors to modulate both the amplitude and phase components.
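As a numerical sanity check of Eq. (4): for interleaved real/imaginary channels, the amplitude-phase form $(\Lambda_q \odot \Lambda_k)^\top \mathrm{Re}[\exp(i(\theta_q - \theta_k))]$ equals the ordinary real dot product of the two feature vectors. The values below are arbitrary.

```python
import numpy as np

# Two feature vectors with interleaved (real, imaginary) channel pairs.
q = np.array([3.0, 4.0, 1.0, 2.0])   # complex pairs: 3+4i, 1+2i
k = np.array([1.0, 0.0, 2.0, 2.0])   # complex pairs: 1+0i, 2+2i

rq, sq = q[0::2], q[1::2]
rk, sk = k[0::2], k[1::2]
amp_q, th_q = np.hypot(rq, sq), np.arctan2(sq, rq)
amp_k, th_k = np.hypot(rk, sk), np.arctan2(sk, rk)

# Eq. (4): amplitude product weighted by the cosine of the phase difference.
euler_score = np.sum(amp_q * amp_k * np.cos(th_q - th_k))
# Ordinary real dot product q^T k of the interleaved vectors.
plain_score = np.dot(q, k)
assert np.isclose(euler_score, plain_score)
```

The modulation in the next equation then perturbs exactly this identity: scaling the amplitudes and shifting the phase difference re-weights how strongly aligned channels contribute to the attention score.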
The resulting modulated attention score, denoted $E_{\mathrm{Euler}}$, is as follows:

$$E_{\mathrm{Euler}} = \underbrace{\left(e^{2\delta_1} (\Lambda_q \odot \Lambda_k)\right)^\top}_{\text{Amplitude Modulation}} \mathrm{Re}\!\Big[\exp\!\big(i \cdot \underbrace{[\delta_2 \cdot (\theta_q - \theta_k) + b]}_{\text{Phase Modulation}}\big)\Big], \tag{5}$$

where $\delta_1$ is a learnable exponential weight used to scale the amplitude as $\exp(2\delta_1)$, $\delta_2$ is a learnable scale for the semantic angle $(\theta_q - \theta_k)$, and $b$ is the bias for the phase. The amplitude characterizes the importance of a feature, while the phase determines its semantic direction. The learnable factors in Eq. (5) are optimized for open-set generalization.

3.4. Model Optimization

The training objective of our method is defined as follows:

$$\mathcal{L}_{\mathrm{total}} = \ell_{\mathrm{seg}} + \ell_{\mathrm{mixup}} + \gamma \cdot \ell_{\mathrm{graph}}, \tag{6}$$

where $\ell_{\mathrm{seg}}$ is the supervised loss on the source domain, $\ell_{\mathrm{mixup}}$ is the pseudo-label loss for mixup training between source and target domains, and $\ell_{\mathrm{graph}}$ denotes the loss of the proposed GMA module, with $\gamma$ serving as its weight.

Table 1. Comparisons under Camera Shift {Pin2Pan, Real2Real} for Cityscapes → DensePASS (C2D) open-set domain adaptation in outdoor scenarios. * indicates an experiment based on DAFormer [18].

| Method | Road | S.walk | Build. | Wall | Fence | Tr.light | Veget. | Terrain | Sky | Car | Bus | M.cycle | Bicycle | Common | Private | H-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSBP* [44] | 39.29 | 24.94 | 84.90 | 34.11 | 40.23 | 27.70 | 78.34 | 16.00 | 26.52 | 53.05 | 21.89 | 51.76 | 34.34 | 41.00 | 7.01 | 11.97 |
| UAN* [58] | 46.78 | 26.59 | 82.61 | 36.57 | 43.02 | 18.97 | 79.09 | 21.09 | 43.51 | 65.06 | 13.30 | 48.97 | 20.31 | 41.99 | 0.93 | 1.81 |
| UniOT* [21] | 63.59 | 31.37 | 86.79 | 42.35 | 38.04 | 18.90 | 79.82 | 37.86 | 81.45 | 68.89 | 37.91 | 59.10 | 31.67 | 52.13 | 0.00 | 0.00 |
| DAFormer [18] | 65.37 | 34.95 | 85.75 | 35.89 | 37.61 | 30.46 | 79.16 | 36.36 | 84.46 | 69.77 | 31.61 | 62.26 | 37.07 | 53.13 | 0.00 | 0.00 |
| MIC [20] | 51.04 | 15.66 | 84.28 | 36.06 | 35.34 | 30.55 | 76.44 | 19.56 | 73.52 | 72.81 | 28.31 | 55.48 | 29.70 | 46.83 | 0.00 | 0.00 |
| HRDA [19] | 68.84 | 34.89 | 85.71 | 38.39 | 39.64 | 28.12 | 79.58 | 38.74 | 90.78 | 68.23 | 29.37 | 59.62 | 32.53 | 53.42 | 0.00 | 0.00 |
| BUS (SAM) [9] | 60.63 | 28.46 | 82.04 | 29.68 | 32.47 | 26.11 | 77.73 | 24.96 | 94.46 | 68.02 | 39.87 | 43.37 | 35.38 | 49.47 | 3.10 | 5.84 |
| Ours (SAM) | 74.86 | 34.20 | 85.90 | 40.40 | 43.52 | 31.04 | 79.87 | 41.27 | 94.84 | 78.85 | 32.83 | 63.92 | 37.06 | 56.81 | 18.86 | 28.32 |

Table 2. Comparisons under Camera Shift {Syn2Real, Pan2Pan} for SynPASS → DensePASS (S2D) open-set domain adaptation in outdoor scenarios. * indicates an experiment based on DAFormer [18].

| Method | Road | S.walk | Build. | Wall | Fence | Pole | Tr.light | Tr.sign | Veget. | Terrain | Sky | Person | Car | Common | Private | H-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSBP* [44] | 34.68 | 0.20 | 23.01 | 3.17 | 2.37 | 0.74 | 0.00 | 0.27 | 25.58 | 4.69 | 56.97 | 0.17 | 3.91 | 11.98 | 5.77 | 7.79 |
| UAN* [58] | 51.01 | 2.22 | 33.48 | 1.52 | 7.33 | 7.73 | 0.00 | 0.59 | 29.75 | 12.07 | 58.77 | 10.0 | 6.72 | 17.01 | 0.83 | 1.59 |
| UniOT* [21] | 47.14 | 2.73 | 32.15 | 1.64 | 5.12 | 7.93 | 0.00 | 0.20 | 27.71 | 1.01 | 50.80 | 21.97 | 2.24 | 15.44 | 0.00 | 0.00 |
| DAFormer [18] | 52.20 | 4.44 | 34.42 | 3.24 | 10.16 | 5.80 | 0.00 | 0.12 | 56.07 | 6.33 | 63.67 | 37.26 | 2.31 | 21.23 | 0.00 | 0.00 |
| MIC [20] | 48.23 | 3.6 | 46.16 | 0.04 | 10.52 | 4.77 | 0.00 | 0.20 | 25.68 | 3.52 | 65.18 | 3.27 | 2.83 | 16.46 | 0.00 | 0.00 |
| HRDA [19] | 53.18 | 7.29 | 36.77 | 3.04 | 9.35 | 5.09 | 0.00 | 0.09 | 46.23 | 5.41 | 61.59 | 17.58 | 3.09 | 19.13 | 0.00 | 0.00 |
| BUS (SAM) [9] | 42.27 | 4.27 | 12.06 | 0.05 | 8.40 | 10.07 | 0.00 | 0.00 | 0.79 | 1.68 | 80.69 | 0.58 | 0.22 | 12.39 | 1.73 | 3.04 |
| Ours (SAM) | 70.19 | 35.28 | 58.34 | 8.02 | 0.68 | 20.7 | 0.00 | 0.57 | 75.39 | 30.67 | 92.27 | 45.86 | 17.95 | 35.07 | 7.48 | 12.33 |

4. Experiments

4.1. Benchmark Setup

Datasets. We conduct comprehensive experiments on multiple datasets to evaluate the proposed method under both geometric and semantic domain shifts. DensePASS [37] is a real-world panoramic dataset covering diverse city scenes collected with panoramic sensors. It provides 2,000 unlabeled images for transfer optimization and 100 labeled images for evaluation. SynPASS [61] is a synthetic panoramic dataset rendered under various weather conditions, including Cloudy, Foggy, Rainy, Sunny, and Night, containing 9,080 panoramic images in total. For pinhole-domain studies, we adopt three widely used benchmarks. Cityscapes [12] is a real-world dataset with 2,975 training and 500 validation images annotated with 19 semantic categories. GTA5 [42] is a large-scale synthetic dataset consisting of 24,966 images generated from the GTA-V game engine, providing pixel-level labels consistent with Cityscapes.
ACDC [45] is a real-world dataset that emphasizes adverse weather (Foggy, Night, Rainy, and Snowy), including 1,600 training and 406 validation images.

Benchmark Settings. Our domain adaptation evaluation encompasses two complementary scenarios. Camera Shift addresses transfer between pinhole and panoramic images (Pin ↔ Pan) and between synthetic and real domains (Syn → Real). Weather Shift investigates adaptation across six diverse conditions: Fog, Rain, Cloudy, Sunny, Night, and Snow. Furthermore, we assess each setting in an open-set configuration where the common class ratio controls the semantic overlap (e.g., C2D: 68.4%, S2D: 48.1%, G2S: 48.1%, S2A: 48.1%). As shown in Fig. 4, we analyze the domain gap among the datasets. Real datasets such as Cityscapes [12] and DensePASS [37] exhibit higher variance than synthetic datasets (GTA [42], SynPASS [61]), while the adverse weather dataset ACDC [45] has weaker class representation compared to the other real datasets. The supplementary material provides complete details of the benchmark protocols, dataset splits, and category settings.

Evaluation Metrics. We adopt the mean Intersection-over-Union (mIoU) to evaluate segmentation quality on the base (Common) and novel (Private) classes. In addition, we report the H-Score between the mIoU of the base classes and the IoU of the novel classes, following common open-set evaluation practice. This metric jointly reflects performance on both the shared and novel semantic spaces.

Implementation Details. Our implementation follows the standard configuration of DAFormer [18]. We employ MiT-B5 [52] as the backbone, initialized with ImageNet-1K pre-trained weights. All models are trained for 40k iterations using 512 × 512 random crops on both the source (pinhole) and target (panoramic) domains. During testing, panoramic images are evaluated at their original resolution.
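The H-Score used in the evaluation protocol above is, per the usual open-set convention, the harmonic mean of the Common mIoU and the Private IoU (in percent); this is consistent with the reported tables, e.g., Common 56.81 and Private 18.86 in Table 1 give an H-Score of 28.32.

```python
def h_score(common_miou, private_iou):
    """Harmonic mean of known-class mIoU and novel-class IoU (values in %).

    Returns 0 when either term is 0, so a method that ignores novel classes
    scores 0 regardless of its closed-set performance.
    """
    if common_miou + private_iou == 0:
        return 0.0
    return 2.0 * common_miou * private_iou / (common_miou + private_iou)
```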
To improve pseudo-label quality in the target domain, we utilize MobileSAM [59] for mask refinement, following the same refinement strategy as in the prior Open-Set UDA method [9].

Table 3. Comparisons under Camera Shift (Pin2Pan, Syn2Syn) and Weather Shift {Sunny} → {Sunny, Cloudy, Fog, Rain, Night} for GTA → SynPASS (G2S) open-set domain adaptation in outdoor scenarios. * indicates an experiment based on DAFormer [18].

| Method | Road | S.walk | Build. | Wall | Fence | Pole | Tr.light | Tr.sign | Veget. | Terrain | Sky | Person | Car | Common | Private | H-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSBP* [44] | 75.30 | 35.06 | 40.29 | 0.00 | 11.27 | 15.14 | 8.60 | 21.40 | 43.69 | 47.07 | 57.05 | 32.83 | 43.37 | 33.16 | 0.00 | 0.00 |
| UAN* [58] | 67.23 | 26.45 | 31.54 | 0.06 | 6.25 | 19.58 | 20.71 | 18.37 | 44.69 | 39.72 | 27.94 | 22.41 | 25.22 | 26.94 | 0.73 | 1.42 |
| UniOT* [21] | 88.10 | 57.96 | 56.18 | 0.07 | 14.98 | 19.95 | 18.89 | 14.41 | 42.46 | 47.94 | 87.44 | 16.95 | 61.36 | 40.52 | 0.00 | 0.00 |
| DAFormer [18] | 95.69 | 51.33 | 58.01 | 0.02 | 9.67 | 22.30 | 8.77 | 14.88 | 53.18 | 43.83 | 88.08 | 32.80 | 64.98 | 41.81 | 0.00 | 0.00 |
| MIC [20] | 97.38 | 59.72 | 53.17 | 0.52 | 11.56 | 20.57 | 24.20 | 19.46 | 56.61 | 44.62 | 84.15 | 28.69 | 64.90 | 43.51 | 0.00 | 0.00 |
| HRDA [19] | 91.13 | 54.68 | 49.92 | 0.24 | 10.33 | 20.03 | 20.70 | 13.25 | 41.23 | 37.74 | 80.60 | 29.12 | 67.33 | 39.72 | 0.00 | 0.00 |
| BUS (SAM) [9] | 95.72 | 0.30 | 74.25 | 0.98 | 0.14 | 20.37 | 21.78 | 9.65 | 54.91 | 44.92 | 96.27 | 28.08 | 72.51 | 39.99 | 7.97 | 13.29 |
| Ours (SAM) | 94.50 | 50.42 | 73.20 | 0.98 | 20.97 | 15.86 | 12.07 | 17.11 | 54.33 | 46.84 | 96.14 | 27.75 | 74.32 | 44.96 | 10.20 | 16.63 |

Table 4. Comparisons under Camera Shift (Syn2Real, Pan2Pin) and Weather Shift {Fog, Rain, Night, Sunny, Cloudy} → {Fog, Rain, Night, Sunny, Snow} for SynPASS → ACDC (S2A) open-set domain adaptation in outdoor scenarios. * indicates DAFormer-based results [18].

| Method | Road | S.walk | Build. | Wall | Fence | Pole | Tr.light | Tr.sign | Veget. | Terrain | Sky | Person | Car | Common | Private | H-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSBP* [44] | 52.79 | 0.89 | 6.72 | 0.80 | 0.41 | 0.04 | 0.00 | 0.00 | 32.79 | 0.00 | 73.7 | 0.21 | 0.14 | 12.96 | 5.02 | 7.24 |
| UAN* [58] | 67.50 | 7.27 | 26.60 | 0.97 | 1.26 | 14.20 | 0.00 | 0.00 | 50.96 | 13.10 | 67.25 | 0.23 | 0.34 | 19.21 | 0.13 | 0.26 |
| UniOT* [21] | 57.98 | 2.00 | 20.36 | 0.03 | 2.09 | 4.51 | 0.00 | 0.00 | 59.15 | 0.51 | 87.66 | 8.53 | 4.52 | 19.03 | 0.00 | 0.00 |
| MIC [20] | 62.32 | 10.60 | 60.77 | 0.00 | 1.17 | 3.18 | 0.00 | 0.00 | 60.02 | 4.71 | 80.46 | 26.67 | 11.82 | 24.75 | 0.00 | 0.00 |
| DAFormer [18] | 66.49 | 6.90 | 30.34 | 0.07 | 3.12 | 1.96 | 0.00 | 0.00 | 75.47 | 19.46 | 93.97 | 33.34 | 0.01 | 25.47 | 0.00 | 0.00 |
| HRDA [19] | 66.40 | 8.37 | 31.28 | 2.05 | 1.66 | 5.77 | 0.00 | 0.00 | 57.23 | 19.00 | 75.79 | 37.27 | 14.81 | 24.59 | 0.00 | 0.00 |
| BUS (SAM) [9] | 71.86 | 10.95 | 16.77 | 0.00 | 0.03 | 0.03 | 0.00 | 0.00 | 17.25 | 9.54 | 82.61 | 0.00 | 0.00 | 16.08 | 3.70 | 6.01 |
| Ours (SAM) | 52.53 | 9.24 | 54.58 | 10.4 | 0.28 | 24.39 | 0.00 | 0.00 | 52.65 | 18.95 | 83.16 | 32.05 | 54.04 | 30.17 | 9.18 | 14.08 |

4.2. Performance Analysis across Benchmarks

C2D. As shown in Tab. 1, we report performance for Cityscapes → DensePASS under the Camera Shift setting. The results indicate that our proposed method achieves higher mIoU on both common and private categories, suggesting that it can distinguish between known and unknown categories. Additionally, the observed improvements in H-Score imply an enhanced capability to identify private categories. These results also suggest that, when transferring from pinhole to panoramic images, the GMA module better captures inter-category relationships while remaining robust to geometric distortions.

S2D. As shown in Tab. 2, the open-set performance on SynPASS → DensePASS under Camera Shift reflects the challenging nature of synthetic-to-real transfer. Our method consistently outperforms existing approaches on important object categories such as Person and Car, clearly demonstrating that variations and distortions inherent in different panoramic datasets can significantly degrade the quality of object class representations.
The proposed GMA and EMA modules are specifically designed to address these distortions and are shown to mitigate their negative impact. Compared to the existing SAM-based method (BUS), our method shows clear advantages in mIoU for both common and private categories.

G2S. As shown in Tab. 3, the open-set adaptation results for GTA → SynPASS under Camera Shift and Weather Shift highlight the difficulty of transferring from a synthetic pinhole to a synthetic panoramic domain. Our method outperforms the second-best approach by 3.15% in mIoU, demonstrating that the proposed EMA and GMA modules mitigate image distortions within the same simulation scenario and under adverse weather conditions. Our method performs comparably to existing open-set methods for the Wall class.

S2A. As shown in Tab. 4, the open-set adaptation results for SynPASS → ACDC under the combined Camera Shift and Weather Shift settings demonstrate model performance in diverse conditions, encompassing both adverse and clear scenes. Despite variations in weather and viewing angles, our method consistently and significantly outperforms existing approaches, particularly in recognizing unknown categories. Specifically, it achieves a 4.16% mIoU improvement on private classes and a 6.84% H-Score increase over the OSBP method, indicating that the proposed modules generalize robustly under diverse weather conditions and FoV distortions. Moreover, in the Pole category, our method delivers markedly and consistently superior results compared to existing open-set and closed-set methods.

4.3. Ablation Study

Module Ablation. As shown in Tab. 5, we conduct an ablation study to evaluate the contributions of the GMA and EMA modules in our framework. Exp. ①, the baseline without either module, achieves the lowest performance, indicating the necessity of both components. Introducing GMA alone (Exp.
②) improves both common and private feature representations, demonstrating its effectiveness in aligning shared semantics under FoV shifts. Similarly, incorporating EMA alone (Exp. ③) enhances viewpoint-invariant representation, boosting overall segmentation performance.

Ablation analysis of GMA. As shown in Tab. 6, we ablate the GMA loss components of unknown-class learning, graph matching, and affinity, defined in Eq. (2). Removing the graph matching term notably degrades performance, confirming its importance, while the unknown-class component further improves the private-class mIoU and H-Score.

Sensitivity of GMA. As shown in Tab. 7, we analyze the sensitivity of the loss weight γ. The results indicate that γ = 1.0 favors the private-class mIoU and improves the H-Score but hurts the common-class mIoU, whereas γ = 0.05 has the opposite effect. To achieve a balanced trade-off, we set γ = 0.1 in Eq. (6).

Effectiveness of EMA. As shown in Tab. 8, we evaluate the effectiveness of the EMA module against the attention used in panoramic segmentation [61] and EulerFormer [47]. Our method consistently outperforms these baselines in both common- and private-class mIoU, with a particularly large improvement in H-Score, demonstrating its superior ability in novel class discovery and closed-set classification.

Figure 5. t-SNE visualization of source and target domains in EDA-PSeg. (a) Prototype-based features leave the unknown class relatively mixed. (b) Ours yields a relatively separable unknown class.

Domain Alignment Analysis. As illustrated in Fig. 5, we visualize the prototypes and GMA features using t-SNE [39]. For each class, features are randomly sampled, where circles represent the source domain and triangles denote the target domain. The accompanying histograms show the sample frequency across bin intervals, with distinct colors corresponding to different categories.
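A visualization in the spirit of Fig. 5 can be produced with a short t-SNE sketch; the 64-dimensional Gaussian toy features below are illustrative stand-ins for the actual sampled EDA-PSeg features, not the real model outputs:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy stand-ins for randomly sampled per-class features from the two domains
# (circles = source, triangles = target in Fig. 5); dims are illustrative.
source_feats = rng.normal(0.0, 1.0, size=(100, 64))
target_feats = rng.normal(0.5, 1.0, size=(100, 64))
feats = np.concatenate([source_feats, target_feats])

# Project to 2-D for plotting
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
print(emb.shape)  # (200, 2)
```

Plotting then amounts to a scatter of `emb` with marker 'o' for the first 100 rows and '^' for the rest, colored by class.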
Compared with prototype class alignment, GMA consistently exhibits a more distinct separation of unknown classes.

Table 5. Ablation study of the components in our framework. Cityscapes → DensePASS (C2D).
Exp. GMA EMA Common Private H-Score
① - - 52.56 8.57 14.74
② ✓ - 55.15 14.67 23.18
③ - ✓ 56.12 13.00 21.11
④ ✓ ✓ 56.81 18.86 28.32

Table 6. Ablation analysis of the GMA module. Cityscapes → DensePASS (C2D).
Exp. Method Common Private H-Score
① GMA w/o Unknown 54.72 7.78 13.62
② GMA w/o Matching 52.75 4.76 8.73
③ GMA w/o Affinity 54.06 11.59 19.09
④ GMA (Full) 55.15 14.67 23.18

Table 7. Sensitivity of the GMA module loss to the factor γ. Cityscapes → DensePASS (C2D).
Exp. Weight γ Common Private H-Score
① - 52.56 8.57 14.74
② 1.00 47.16 15.45 23.28
③ 0.10 55.15 14.67 23.18
④ 0.05 54.57 7.90 13.80
⑤ 0.01 51.72 4.59 8.43

Table 8. Effectiveness analysis of the EMA module, added at the same position as in the baseline with the same number of layers. Cityscapes → DensePASS (C2D).
Exp. Method Common Private H-Score
① Baseline 52.56 8.57 14.74
② Self-Attention [49] 55.45 10.95 18.28
③ EulerFormer [47] 55.09 7.20 12.74
④ Deformable MLP [61] 55.89 7.68 13.51
⑤ Euler-Margin (Ours) 56.12 13.00 21.11

5. Conclusion

In this work, we investigate a novel and challenging task setting and propose Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg). EDA-PSeg aims to enable local-crop transfer learning from limited pinhole viewpoints to full 360° panoramic scenes while simultaneously discovering novel classes. To address the challenges of unseen-viewpoint open-set class learning and field-of-view distribution discrepancies, we introduce two key components, Euler-Margin Attention (EMA) and the Graph Matching Adapter (GMA). Extensive experiments on the Camera Shift and Weather Shift benchmarks demonstrate that our approach achieves state-of-the-art performance across diverse resolutions and viewing geometries.

Limitations.
Random cropping introduces sampling sensitivity, occasionally leading to slight training instability. Graph matching increases both model parameters and computational cost, while the Euler-Margin Attention mechanism adds further architectural complexity. Future work will explore strategies to improve efficiency and stability.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62473139), in part by the Hunan Provincial Research and Development Project (Grant No. 2025QK3019), in part by the State Key Laboratory of Autonomous Intelligent Unmanned Systems (opening project number ZZKF2025-2-10), and in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - SFB 1574 - 471687386. This research was partially funded by the Ministry of Education and Science of Bulgaria (support for INSAIT, part of the Bulgarian National Roadmap for Research Infrastructure).

References

[1] Hao Ai, Zidong Cao, and Lin Wang. A survey of representation learning, optimization strategies, and applications for omnidirectional vision. International Journal of Computer Vision, 2025.
[2] Alberto Bacchin, Leonardo Barcellona, Sepideh Shamsizadeh, Emilio Olivastri, Alberto Pretto, and Emanuele Menegatti. PanNote: An automatic tool for panoramic image annotation of people's positions. In ICRA, 2024.
[3] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In CVPR, 2018.
[4] Yihong Cao, Jiaming Zhang, Hao Shi, Kunyu Peng, Yuhongxuan Zhang, Hui Zhang, Rainer Stiefelhagen, and Kailun Yang. Occlusion-aware seamless segmentation. In ECCV, 2024.
[5] Yihong Cao, Jiaming Zhang, Xu Zheng, Hao Shi, Kunyu Peng, Hang Liu, Kailun Yang, and Hui Zhang. Unlocking constraints: Source-free occlusion-aware seamless segmentation. In ICCV, 2025.
[6] Chaoqi Chen, Jiongcheng Li, Hong-Yu Zhou, Xiaoguang Han, Yue Huang, Xinghao Ding, and Yizhou Yu. Relation matters: Foreground-aware graph-based relational reasoning for domain adaptive object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[7] Pu-Chin Chen, Henry Tsai, Srinadh Bhojanapalli, Hyung Won Chung, Yin-Wen Chang, and Chun-Sung Ferng. A simple and effective positional encoding for transformers. In EMNLP, 2021.
[8] Wenting Chen, Jie Liu, Tianming Liu, and Yixuan Yuan. Bi-VLGM: Bi-level class-severity-aware vision-language graph matching for text guided medical image segmentation. International Journal of Computer Vision, 2025.
[9] Seun-An Choe, Ah-Hyung Shin, Keon-Hee Park, Jinwoo Choi, and Gyeong-Moon Park. Open-set domain adaptation for semantic segmentation. In CVPR, 2024.
[10] Seun-An Choe, Keon-Hee Park, Jinwoo Choi, and Gyeong-Moon Park. Universal domain adaptation for semantic segmentation. In CVPR, 2025.
[11] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584, 2019.
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[13] Zhen Fang, Jie Lu, Feng Liu, Junyu Xuan, and Guangquan Zhang. Open set domain adaptation: Theoretical bound and algorithm. IEEE Transactions on Neural Networks and Learning Systems, 2021.
[14] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
[15] Chuanxing Geng, Sheng-Jun Huang, and Songcan Chen. Recent advances in open set recognition: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[16] Zhibin He, Wuyang Li, Tianming Liu, Xiang Li, Junwei Han, Tuo Zhang, and Yixuan Yuan. GAGM: Geometry-aware graph matching framework for weakly supervised gyral hinge correspondence. Medical Image Analysis, 2025.
[17] Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In ECCV, 2024.
[18] Lukas Hoyer, Dengxin Dai, and Luc Van Gool. DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In CVPR, 2022.
[19] Lukas Hoyer, Dengxin Dai, and Luc Van Gool. HRDA: Context-aware high-resolution domain-adaptive semantic segmentation. In ECCV, 2022.
[20] Lukas Hoyer, Dengxin Dai, Haoran Wang, and Luc Van Gool. MIC: Masked image consistency for context-enhanced domain adaptation. In CVPR, 2023.
[21] JoonHo Jang, Byeonghu Na, Dong Hyeok Shin, Mingi Ji, Kyungwoo Song, and Il-Chul Moon. Unknown-aware domain adversarial learning for open-set domain adaptation. In NeurIPS, 2022.
[22] Alexander Jaus, Kailun Yang, and Rainer Stiefelhagen. Panoramic panoptic segmentation: Insights into surrounding parsing for mobile agents via unsupervised contrastive learning. IEEE Transactions on Intelligent Transportation Systems, 2023.
[23] Jing Jiang, Sicheng Zhao, Jiankun Zhu, Wenbo Tang, Zhaopan Xu, Jidong Yang, Guoping Liu, Tengfei Xing, Pengfei Xu, and Hongxun Yao. Multi-source domain adaptation for panoramic semantic segmentation. Information Fusion, 2025.
[24] Jing Jiang, Jiankun Zhu, Zhaopan Xu, Xi Chen, Sicheng Zhao, and Hongxun Yao. Gaussian constrained diffeomorphic deformation network for panoramic semantic segmentation. In ICASSP, 2025.
[25] Guolin Ke, Di He, and Tie-Yan Liu. Rethinking positional encoding in language pre-training. arXiv preprint arXiv:2006.15595, 2020.
[26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In ICCV, 2023.
[27] Jingjing Li, Zhiqi Yu, Zhekai Du, Lei Zhu, and Heng Tao Shen. A comprehensive survey on source-free domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[28] Wuyang Li, Xinyu Liu, and Yixuan Yuan. SCAN++: Enhanced semantic conditioned adaptation for domain adaptive object detection. IEEE Transactions on Multimedia, 2023.
[29] Wuyang Li, Xinyu Liu, and Yixuan Yuan. SIGMA++: Improved semantic-complete graph matching for domain adaptive object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[30] Jian Liang, Ran He, and Tieniu Tan. A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision, 2025.
[31] Xin Lin, Xian Ge, Dizhe Zhang, Zhaoliang Wan, Xianshun Wang, Xiangtai Li, Wenjie Jiang, Bo Du, Dacheng Tao, Ming-Hsuan Yang, and Lu Qi. One flight over the gap: A survey from perspective to panoramic vision. arXiv preprint arXiv:2509.04444, 2025.
[32] Longliang Liu, Miaojie Feng, Junda Cheng, Jijun Xiang, Xuan Zhu, and Xin Yang. PriOr-Flow: Enhancing primitive panoramic optical flow with orthogonal view. In ICCV, 2025.
[33] Yifan Liu, Wuyang Li, Jie Liu, Hui Chen, and Yixuan Yuan. GRAB-Net: Graph-based boundary-aware network for medical point cloud segmentation. IEEE Transactions on Medical Imaging, 2023.
[34] Yabo Liu, Jinghua Wang, Chao Huang, Yaowei Wang, and Yong Xu. CIGAR: Cross-modality graph reasoning for domain adaptive object detection. In CVPR, 2023.
[35] Ziwei Liu, Zhongqi Miao, Xingang Pan, Xiaohang Zhan, Dahua Lin, Stella X. Yu, and Boqing Gong. Open compound domain adaptation.
In CVPR, 2020.
[36] Xingguo Lv, Xingbo Dong, Liwen Wang, Jiewen Yang, Lei Zhao, Bin Pu, Zhe Jin, and Xuejun Li. Test-time domain generalization via universe learning: A multi-graph matching approach for medical image segmentation. In CVPR, 2025.
[37] Chaoxiang Ma, Jiaming Zhang, Kailun Yang, Alina Roitberg, and Rainer Stiefelhagen. DensePASS: Dense panoramic semantic segmentation via unsupervised domain adaptation with attention-augmented context exchange. In ITSC, 2021.
[38] Shijie Ma, Fei Zhu, Xu-Yao Zhang, and Cheng-Lin Liu. ProtoGCD: Unified and unbiased prototype learning for generalized category discovery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[39] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
[40] Sarthak Maharana, Baoming Zhang, Leonid Karlinsky, Rogerio Feris, and Yunhui Guo. BATCLIP: Bimodal online test-time adaptation for CLIP. In ICCV, 2025.
[41] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 2020.
[42] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, 2016.
[43] Luigi Riz, Cristiano Saltori, Yiming Wang, Elisa Ricci, and Fabio Poiesi. Novel class discovery meets foundation models for 3D semantic segmentation. International Journal of Computer Vision, 2025.
[44] Kuniaki Saito, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada. Open set domain adaptation by backpropagation. In ECCV, 2018.
[45] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In ICCV, 2021.
[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
[47] Zhen Tian, Wayne Xin Zhao, Changwang Zhang, Xin Zhao, Zhongrui Ma, and Ji-Rong Wen. EulerFormer: Sequential user behavior modeling with complex vector attention. In SIGIR, 2024.
[48] Wilhelm Tranheden, Viktor Olsson, Juliano Pinto, and Lennart Svensson. DACS: Domain adaptation via cross-domain mixed sampling. In WACV, 2021.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[50] Garrett Wilson and Diane J. Cook. A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology, 2020.
[51] Sheng Wu, Fei Teng, Hao Shi, Qi Jiang, Kai Luo, Kaiwei Wang, and Kailun Yang. QuaDreamer: Controllable panoramic video generation for quadruped robots. In CoRL, 2025.
[52] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, José M. Álvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
[53] Gezheng Xu, Li Yi, Pengcheng Xu, Jiaqi Li, Ruizhi Pu, Changjian Shui, Charles Ling, A. Ian McLeod, and Boyu Wang. Unraveling the mysteries of label noise in source-free domain adaptation: Theory and practice. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[54] Jiayue Xu, Chao Xu, Jianping Zhao, Cheng Han, and Hua Li. Mamba4PASS: Vision Mamba for panoramic semantic segmentation. Displays, 2025.
[55] Kailun Yang, Jiaming Zhang, Simon Reiß, Xinxin Hu, and Rainer Stiefelhagen. Capturing omni-range context for omnidirectional segmentation. In CVPR, 2021.
[56] Kailun Yang, Xinxin Hu, Yicheng Fang, Kaiwei Wang, and Rainer Stiefelhagen.
Omnisupervised omnidirectional semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 2022.
[57] Kai Yao, Zhaorui Tan, Zixian Su, Xi Yang, Jie Sun, and Kaizhu Huang. SCMix: Stochastic compound mixing for open compound domain adaptation in semantic segmentation. IEEE Transactions on Neural Networks and Learning Systems, 2025.
[58] Kaichao You, Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan. Universal domain adaptation. In CVPR, 2019.
[59] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight SAM for mobile applications. arXiv preprint, 2023.
[60] Jiaming Zhang, Kailun Yang, Chaoxiang Ma, Simon Reiß, Kunyu Peng, and Rainer Stiefelhagen. Bending reality: Distortion-aware transformers for adapting to panoramic semantic segmentation. In CVPR, 2022.
[61] Jiaming Zhang, Kailun Yang, Hao Shi, Simon Reiß, Kunyu Peng, Chaoxiang Ma, Haodong Fu, Philip H. S. Torr, Kaiwei Wang, and Rainer Stiefelhagen. Behind every domain there is a shift: Adapting distortion-aware vision transformers for panoramic semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[62] Weiming Zhang, Yexin Liu, Xu Zheng, and Lin Wang. GoodSAM: Bridging domain and capacity gaps via segment anything model for distortion-aware panoramic semantic segmentation. In CVPR, 2024.
[63] Weiming Zhang, Yexin Liu, Xu Zheng, and Lin Wang. GoodSAM++: Bridging domain and capacity gaps via segment anything model for panoramic semantic segmentation. arXiv preprint arXiv:2408.09115, 2024.
[64] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
[65] Yuyang Zhao, Zhun Zhong, Zhiming Luo, Gim Hee Lee, and Nicu Sebe. Source-free open compound domain adaptation in semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology, 2022.
[66] Junhao Zheng, Haibin Chen, and Qianli Ma. Cross-domain named entity recognition via graph matching. In ACL (Findings), 2022.
[67] Junwei Zheng, Ruiping Liu, Yufan Chen, Kunyu Peng, Chengzhi Wu, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. Open panoramic segmentation. In ECCV, 2024.
[68] Xu Zheng, Tianbo Pan, Yunhao Luo, and Lin Wang. Look at the neighbor: Distortion-aware unsupervised domain adaptation for panoramic semantic segmentation. In ICCV, 2023.
[69] Xu Zheng, Jinjing Zhu, Yexin Liu, Zidong Cao, Chong Fu, and Lin Wang. Both style and distortion matter: Dual-path unsupervised domain adaptation for panoramic semantic segmentation. In CVPR, 2023.
[70] Xu Zheng, Pengyuan Zhou, Athanasios V. Vasilakos, and Lin Wang. Semantics, distortion, and style matter: Towards source-free UDA for panoramic segmentation. In CVPR, 2024.
[71] Xu Zheng, Peng Yuan Zhou, Athanasios V. Vasilakos, and Lin Wang. 360SFUDA++: Towards source-free UDA for panoramic segmentation by learning reliable category prototypes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[72] Ding Zhong, Xu Zheng, Chenfei Liao, Yuanhuiyi Lyu, Jialei Chen, Shengyang Wu, Linfeng Zhang, and Xuming Hu. OmniSAM: Omnidirectional segment anything model for UDA in panoramic semantic segmentation. In ICCV, 2025.
[73] Fei Zhu, Shijie Ma, Zhen Cheng, Xu-Yao Zhang, Zhaoxiang Zhang, and Cheng-Lin Liu. Open-world machine learning: A review and new outlooks. arXiv preprint arXiv:2403.01759, 2024.

In this supplementary material, we provide three sections to complement the main manuscript. Section A offers a detailed description of the proposed task setting and compares it with related settings. Section B presents additional experiments to demonstrate the effectiveness of the proposed module and includes an analysis of its sensitivity coefficients. Section C discusses the limitations of our method and provides an outlook on the societal implications.

Sec. A: Clarification and Discussion
• Task Clarification
• Benchmark Setup
• Implementation Details

Sec. B: Quantitative Comparison
• Further Analysis
• Sensitivity Analysis
• Model Efficiency
• Visualization Analysis

Sec. C: Limitations and Outlook
• Societal Implications
• Future Research Directions
• Limitations and Potential Solutions

Sec. A: Clarification and Discussion

A.1. Task Clarification

Clarification of the setting. As illustrated in Fig. 6, the figure depicts the conceptual differences among three domain adaptation settings: Closed-Set Domain Adaptation, Open-Set Domain Adaptation, and Panoramic Open-Set Domain Adaptation. (a) Closed-Set Domain Adaptation: The source domain and the target domain share an identical set of classes (for example, cats and dogs). The objective is to reduce the domain discrepancy under the assumption that all labels are fully shared. (b) Open-Set Domain Adaptation: The target domain contains additional unknown categories that do not appear in the source domain. The goal is to properly align the shared classes while identifying and filtering out target samples belonging to unknown categories.
(c) Panoramic Open-Set Domain Adaptation (PODA): This setting extends the open-set scenario to a cross-modal context, where the source domain provides pinhole images and the target domain provides panoramic images. The target domain contains both known and unknown classes, posing challenges from both camera-geometry and semantic differences.

Clarification of the domain shift. From Table 9, it is clear that conventional Unsupervised Domain Adaptation (UDA) methods are unable to handle open-set categories or adapt to diverse weather conditions. While Panoramic UDA extends standard UDA by explicitly modeling variations in the field of view (FoV), it still cannot handle unknown categories or environmental shifts. Open-Set Domain Adaptation focuses exclusively on unseen classes, and Open Compound Domain Adaptation aims to handle domain shifts caused by varying weather conditions. However, both neglect the FoV discrepancies inherent to panoramic imagery. In contrast, the proposed PODA framework simultaneously addresses open-set recognition, weather-related domain shifts, and FoV variations. By jointly modeling these three critical factors, PODA achieves superior generalization and robustness, enabling more reliable cross-domain perception in complex real-world driving scenarios.

A.2. Technical Clarifications

A.3. Benchmark Setup

Dataset and class configuration. To evaluate open-set cross-domain semantic segmentation in a controlled yet diverse manner, we construct a benchmark spanning real-to-real, synthetic-to-real, and synthetic-to-synthetic transfers. For the source and target domains, we define domain-shared (base) and domain-specific (private) classes, where the latter appear only in one domain and thus represent unforeseen categories encountered during deployment. Weather conditions follow the same convention, capturing whether environmental factors are shared or exclusive to a single domain.
Figure 6. Conceptual comparison of the domain adaptation settings.

The benchmark includes four representative transfers, CityScapes → DensePASS, SynPASS → DensePASS, GTA → SynPASS, and SynPASS → ACDC, covering varying degrees of semantic and environmental mismatch. In the real-to-real case, both domains share sunny scenes but include human and vehicle categories that remain private. Synthetic-to-real transfers introduce additional private traffic participants while maintaining shared weather. Synthetic-to-synthetic transfer expands this gap further by adding private meteorological conditions such as cloud, fog, and rain, along with fine-grained static and dynamic categories exclusive to the target. The adverse weather scenario emphasizes extreme condition shifts, where fog and rain remain shared while cloud, sunny, night, and snow are domain-specific. Together, these configurations simultaneously expose category- and weather-level discrepancies, forming a unified and reproducible benchmark for assessing robustness under open-set domain shifts.

Input resolution configuration. For all four open-set benchmark settings, the source and target datasets exhibit diverse image resolutions, reflecting their original collection protocols. To ensure consistency during training, all images are cropped to a unified size of 512 × 512. Specifically, CityScapes images are originally 1024 × 512, DensePASS images are 2048 × 400, SynPASS images are 2048 × 1024, GTA images are 1280 × 720, and ACDC images are 960 × 540. This unified crop strategy balances
the varying aspect ratios and resolutions across datasets, providing a consistent input size for network training while preserving sufficient spatial context for semantic segmentation.

A.4. Implementation Details

We present the details in Tab. 12, including the pipeline, the auxiliary network, and the training configuration.

Pipeline. Our proposed framework, EDA-PSeg, follows a two-branch pipeline designed for open-set unsupervised domain adaptive semantic segmentation. It consists of a source-domain supervised branch and a target-domain self-training branch, both of which are processed under a unified data augmentation and preprocessing scheme. For the source domain, each image and its annotation are loaded and resized before being cropped to a fixed resolution of 512 × 512. Standard data augmentations, such as random horizontal flipping, are applied to enhance robustness. All images are normalized using ImageNet statistics (mean = [123.675, 116.28, 103.53], std = [58.395, 57.12, 57.375]) and padded to maintain consistent tensor sizes. For the target domain, a similar preprocessing strategy is adopted, but with adaptive scale resizing and random cropping to accommodate the scale variation and perspective distortion common in real-world scenes. Both source and target domains share identical normalization and padding configurations to ensure consistent feature distributions across domains. During training, the model alternately samples mini-batches from both domains following the DACS [48] (Domain Adaptation via Cross-domain mixed Sampling) strategy. The source-domain batches provide supervised signals, while the target-domain batches are assigned pseudo-labels that are progressively refined using our auxiliary MobileSAM module (described below). At inference time, each test image undergoes deterministic evaluation using a single-scale forward pass with resizing and normalization.
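The shared preprocessing described above (fixed 512 × 512 crop, random horizontal flip, ImageNet normalization) can be sketched in NumPy as follows; the function name is illustrative and padding/adaptive resizing are omitted:

```python
import numpy as np

# ImageNet statistics used for both domains (from Sec. A.4)
MEAN = np.array([123.675, 116.28, 103.53])
STD = np.array([58.395, 57.12, 57.375])

def preprocess(img, crop=512, rng=None):
    """Random crop x crop patch, random horizontal flip, normalization.

    `img` is an (H, W, 3) uint8 array; returns a (3, crop, crop) float array.
    """
    rng = rng or np.random.default_rng()
    h, w, _ = img.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    patch = img[top:top + crop, left:left + crop].astype(np.float64)
    if rng.random() < 0.5:                  # random horizontal flip
        patch = patch[:, ::-1]
    patch = (patch - MEAN) / STD            # per-channel normalization
    return patch.transpose(2, 0, 1)         # HWC -> CHW for the network

# A Cityscapes-sized dummy image (1024 x 512, W x H)
x = preprocess(np.zeros((512, 1024, 3), dtype=np.uint8))
print(x.shape)  # (3, 512, 512)
```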
Multi-scale and flip testing are disabled for computational consistency. This unified pipeline ensures stable and reproducible adaptation performance while preserving domain-level consistency.

Auxiliary Network. Following existing open-set domain adaptation approaches, we adopt MobileSAM [59] as an auxiliary network to refine pseudo-labels during training. MobileSAM is a lightweight and parameter-efficient variant of the Segment Anything Model (SAM) [26]. It is employed solely to distinguish foreground and background regions, improving the reliability of pseudo-labels in the target domain. Specifically, for each target image, MobileSAM generates segmentation masks corresponding to potential foreground classes. Class frequencies are then computed from these masks, and the dominant category is selected as the valid pseudo-label class.

Training Configuration. We adopt DAFormer [18] as the baseline architecture for UDA semantic segmentation. The model employs a MiT-B5 [52] backbone combined with a DAFormerHead decoder variant, modified to predict 14 semantic categories (13 closed-set classes plus one open-set/unknown class). Training follows the DACS [48] self-training framework with pseudo-labeling and feature consistency losses. The temporal ensembling coefficient is set to α = 0.999 to stabilize the teacher-student updates. The feature distance loss is disabled, so adaptation relies primarily on pseudo-label-based consistency rather than ImageNet feature alignment. Pseudo-label generation ignores the top 15 and bottom 120 pixels of the image to reduce label noise near image borders. A rare-class sampling mechanism is incorporated to alleviate class imbalance, using min_pixels = 3000 and class_temp = 0.01 to prioritize under-represented categories during source sampling.
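The mask-based refinement described above (dominant pseudo-label class per MobileSAM mask) amounts to a majority vote inside each mask region; the function name and boolean-mask format below are illustrative assumptions, not the released implementation:

```python
import numpy as np

def refine_pseudo_labels(pseudo, masks):
    """Overwrite each mask region with its dominant pseudo-label class.

    `pseudo`: (H, W) integer map of model pseudo-labels.
    `masks`: iterable of (H, W) boolean masks from the auxiliary segmenter.
    """
    refined = pseudo.copy()
    for m in masks:
        classes, counts = np.unique(pseudo[m], return_counts=True)
        if classes.size:
            refined[m] = classes[np.argmax(counts)]  # dominant category wins
    return refined

pseudo = np.array([[0, 0, 1],
                   [0, 2, 1],
                   [2, 2, 2]])
mask = np.array([[True, True, False],
                 [True, True, False],
                 [False, False, False]])
print(refine_pseudo_labels(pseudo, [mask]))
# [[0 0 1]
#  [0 0 1]
#  [2 2 2]]
```

The stray '2' inside the mask is voted out, which is the intended smoothing effect of the refinement step.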
The model is optimized using the AdamW optimizer with a base learning rate of 6 × 10⁻⁵, while the learning rate of the decoder head is multiplied by 10. A linear warm-up followed by polynomial decay is applied for stable convergence. Training runs for 40,000 iterations, and evaluation is performed every 4,000 iterations. Performance is evaluated using both the H-score and mIoU, which together measure segmentation quality and open-set recognition accuracy. The checkpoint achieving the highest mIoU on the validation set is selected as the best.

Table 9. Comparison of domain adaptation settings with the domain shift.

  Domain Adaptation Setting                      Label Shift  Weather Shift  FoV Shift
  Unsupervised Domain Adaptation [14]            ✗            ✗              N/A
  Panoramic Domain Adaptation [70]               ✗            ✗              ✓
  Open-Set Domain Adaptation [13]                ✓            ✗              N/A
  Open Compound Domain Adaptation [35]           ✗            ✓              N/A
  Panoramic Open-Set Domain Adaptation (Ours)    ✓            ✓              ✓

Table 10. Base and private classes used in the PAN2PAN, SYN2SYN, and SYN2REAL experiments. Shared (base) classes are common to both the source and target domains, while private classes appear only in either the source or the target domain. Weather conditions follow the same convention: shared conditions occur in both domains, while domain-specific conditions occur in only one.

Open-Set CityScapes → DensePASS
  Base: 'road', 'sidewalk', 'building', 'wall', 'fence', 'traffic light', 'vegetation', 'terrain', 'sky', 'car', 'bus', 'motorcycle', 'bicycle'.
  Private (CityScapes/DensePASS): 'pole', 'traffic sign', 'person', 'rider', 'truck', 'train'.
  Weather: Sunny → Sunny

Open-Set SynPASS → DensePASS
  Base: 'road', 'sidewalk', 'building', 'wall', 'fence', 'pole', 'traffic light', 'traffic sign', 'vegetation', 'terrain', 'sky', 'person', 'car'.
  Private (SynPASS): 'other', 'ground', 'bridge', 'railtrack', 'groundrail', 'static', 'dynamic', 'water'.
  Private (DensePASS): 'bus', 'truck', 'train', 'motorcycle', 'bicycle', 'rider'.
  Weather: Sunny → Sunny

Open-Set GTA → SynPASS
  Base: 'road', 'sidewalk', 'building', 'wall', 'fence', 'pole', 'traffic light', 'traffic sign', 'vegetation', 'terrain', 'sky', 'person', 'car'.
  Private (GTA): 'rider', 'truck', 'bus', 'train', 'motorcycle', 'bicycle'.
  Private (SynPASS): 'other', 'ground', 'bridge', 'railtrack', 'groundrail', 'static', 'dynamic', 'water'.
  Weather: Sunny → Sunny, Cloud, Fog, Rain, Night

Open-Set SynPASS → ACDC
  Base: 'road', 'sidewalk', 'building', 'wall', 'fence', 'pole', 'traffic light', 'traffic sign', 'vegetation', 'terrain', 'sky', 'person', 'car'.
  Private (SynPASS): 'other', 'ground', 'bridge', 'railtrack', 'groundrail', 'static', 'dynamic', 'water'.
  Private (ACDC): 'rider', 'truck', 'bus', 'train', 'motorcycle', 'bicycle'.
  Weather: Fog, Rain, Night, Sunny, Cloud → Fog, Rain, Night, Sunny, Snow

Table 11. Image resolutions for the different datasets involved in the four open-set settings. Crop size is unified to 512 × 512 for training.

  Setting                            Source Image Size (W × H)   Target Image Size (W × H)   Crop Size (W × H)
  Open-Set CityScapes → DensePASS    CityScapes: 1024 × 512      DensePASS: 2048 × 400       512 × 512
  Open-Set SynPASS → DensePASS       SynPASS: 2048 × 1024        DensePASS: 2048 × 400       512 × 512
  Open-Set GTA → SynPASS             GTA: 1280 × 720             SynPASS: 2048 × 1024        512 × 512
  Open-Set SynPASS → ACDC            SynPASS: 2048 × 1024        ACDC: 960 × 540             512 × 512

Table 12. Key settings of the proposed EDA-PSeg framework.

  Component        Setting
  Framework        Two-branch open-set UDA: source-supervised + target self-training.
  Backbone         DAFormer with MiT-B5 encoder and Graph decoder.
  Auxiliary Net    MobileSAM for pseudo-mask refinement.
  Input Size       512 × 512 (crop, resize, flip, normalize).
  Normalization    ImageNet mean = [123.7, 116.3, 103.5], std = [58.4, 57.1, 57.4].
  Pseudo-labels    Refined by MobileSAM [59].
  Teacher Update   EMA with α = 0.999.
  Optimizer        AdamW, LR = 6 × 10⁻⁵ (decoder × 10).
  Schedule         Warmup + polynomial decay, 40k iterations.
  Sampling         DACS [48] with rare-class focus (min_pixels = 3000).
  Evaluation       H-score & mIoU.
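For reference, the H-score used in our evaluation is the harmonic mean of the common (closed-set) and private (open-set) scores. A minimal sketch (our own helper, not the released evaluation code):

```python
def h_score(common, private):
    """Harmonic mean of the closed-set (common) and open-set (private)
    scores; it is high only when both quantities are simultaneously
    strong, penalizing methods that trade one off against the other."""
    if common + private == 0:
        return 0.0
    return 2.0 * common * private / (common + private)
```

For example, the baseline row of Tab. 13 (Common 52.56, Private 8.57) gives 2 · 52.56 · 8.57 / 61.13 ≈ 14.74, matching the reported H-Score.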
All experiments are performed on a single-GPU setup.

Table 13. Ablation analysis of the EMA module on Cityscapes → DensePASS (C2D).

  Exp.  Method                          Common  Private  H-Score
  ①     Baseline                        52.56   8.57     14.74
  ①     Self-Att                        55.45   10.95    18.28
  ①     Self-Att + Margin Projection    55.64   15.06    23.71
  ②     EMA w/o Margin Projection       55.03   7.56     13.30
  ③     EMA w/o Learnable Scale         55.50   13.07    21.16
  ④     EMA w/o Learnable Bias          55.59   7.35     12.99
  ⑤     EMA w/o Learnable Magnitude     54.04   5.20     9.48
  ⑥     EMA (Full)                      56.12   13.00    21.11

Sec. B: Quantitative Comparison

B.1. Further Analysis

As shown in Tab. 13, we evaluate the effectiveness of the EMA module. The core of our design is the Margin Projection, an enhancement to the self-attention mechanism that consistently improves performance. In the initial comparison (experiments ①), progressively adding attention-based components improves both Common and Private results, indicating that attention enhances domain-shared representations. Introducing Margin Projection notably increases Private accuracy and achieves the highest H-Score among the non-EMA variants, confirming the benefit of margin-based separation for target-specific learning.

In the EMA ablation, removing Margin Projection (②) leads to a clear performance drop, underscoring its importance for maintaining target-domain feature separability. The absence of the Learnable Scale (③) has a minor effect, while omitting the Learnable Bias (④) causes a marked decline, highlighting the need for bias correction to address domain shifts. Finally, removing the Learnable Magnitude (⑤) results in the most severe degradation, demonstrating that the learnable magnitude is crucial for stable feature representation and overall performance.

Table 14. Comparison of the number of parameters (M), FLOPs (G), MACs (G), and test time per image (ms). Experiments are conducted on a single RTX 4090 GPU and an AMD Ryzen 9 5950X CPU.
  Method       #Params   FLOPs     MACs      Time
  OSBP [44]    85.15 M   59.45 G   29.73 G   26.86 ms
  UAN [58]     85.29 M   59.45 G   29.73 G   26.67 ms
  UniOT [21]   85.22 M   59.45 G   29.73 G   27.03 ms
  DMLP [61]    90.15 M   59.45 G   29.73 G   27.12 ms
  MIC [20]     85.68 M   14.86 G   7.43 G    26.62 ms
  DAF [18]     85.15 M   59.45 G   29.73 G   27.76 ms
  HRDA [19]    85.69 M   14.86 G   7.43 G    26.25 ms
  BUS [9]      85.68 M   14.86 G   7.43 G    26.06 ms
  Ours         86.47 M   59.45 G   29.73 G   28.09 ms

Table 15. Sensitivity analysis of the self-attention layers in the EMA module. Note that, since the Euler encoding in EMA requires both real and imaginary components, the dimensionality must be even. Results are reported on Cityscapes → DensePASS (C2D).

  Exp.  Layers  Common  Private  H-Score
  ①     -       52.56   8.57     14.74
  ②     2       56.12   13.00    21.11
  ③     4       54.80   9.94     16.83
  ④     8       54.99   6.04     10.88

Table 16. Sensitivity analysis under different threshold values. Results are reported on Cityscapes → DensePASS (C2D).

  Exp.  Threshold  Common  Private  H-Score
  ①     0.4        53.03   1.66     3.22
  ②     0.5        56.02   6.17     11.11
  ③     0.6        56.81   18.86    28.32
  ④     0.7        54.59   9.66     16.41

B.2. Sensitivity Analysis

Sensitivity of attention layers. As shown in Tab. 15, we conduct a sensitivity analysis to evaluate the effect of the number of self-attention layers in the EMA module. A moderate attention depth notably improves adaptation: two layers yield the highest H-Score, with consistent gains across both common and private classes. This configuration strengthens the angle-space coupling induced by the Euler encoding, enabling the model to capture domain-invariant and domain-specific patterns more effectively. However, deeper stacks overfit to source-domain priors, reducing target adaptability and overall balance. These results indicate that the EMA module performs best with limited yet sufficient attention depth, while excessive layers hinder generalization.

Figure 7. Qualitative comparison between our method and existing OSDA and UDA approaches (columns: Input Image, Ground Truth, DAFormer, BUS, UniOT, Ours).

Figure 8. Illustration of the match matrix derived from the graph affinity and the matching ground truth in the GMA module.

Sensitivity of threshold. As shown in Tab. 16, we analyze the sensitivity of the threshold used in generating pseudo-labels for private classes during source-target mixup training. With the threshold set to 0.4 (①), the model exhibits the lowest mIoU and H-Score for both common and private classes, indicating severe confusion between the two categories. Increasing the threshold to 0.5 (②) leads to a clear improvement in segmentation performance, particularly reflected in a higher H-Score. Further raising the threshold to 0.6 (③) enhances the recall of unknown classes while also improving known-class mIoU. However, raising the threshold to 0.7 (④) reduces both common and private mIoU, suggesting that an overly strict confidence filter impairs pseudo-label generation.

B.3. Model Efficiency

As shown in Tab. 14, we evaluate efficiency in terms of parameters (M), FLOPs (G), MACs (G), and test time per image (ms), and compare these metrics with existing methods. OSBP [44], UAN [58], UniOT [21], and DMLP [61] have about 85 M to 90 M parameters and 59.45 G FLOPs, requiring around 27 ms per image, indicating relatively high complexity. In contrast, MIC [20], DAF [18], HRDA [19], and BUS [9] are more lightweight, with only 14.86 G FLOPs and faster inference. Our method operates at a similar computational scale to the former group, with a slightly higher parameter count (86.47 M) and inference time (28.09 ms).

Figure 9. t-SNE visualization of data from seen and unseen viewpoints. Panels show category pairs such as road & sidewalk, wall & fence, sky & car, car & bus, building & terrain, car & motorcycle, motorcycle & bicycle, traffic light & vegetation, and road & sky.

B.4. Visualization Analysis

Qualitative Results. As illustrated in Fig. 7, we present a qualitative comparison of open-set UDA segmentation results against the OSDA [13] methods BUS [9] and UniOT [21], as well as the UDA method DAFormer [18], to evaluate the effectiveness of our proposed approach. Compared with existing methods, our approach delivers improved segmentation across both known and unknown classes. In particular, it achieves superior open-set performance relative to DAFormer [18], demonstrates greater robustness to small open-set objects than BUS [9], and substantially mitigates the misidentification of closed-set classes and private-class foreground objects frequently observed in UniOT [21].

Graph Match Results. As shown in Fig. 8, we visualize the matching matrix from the learnable affinity and its corresponding ground truth within the proposed GMA module for both the source and target domains. The highlighted regions in the ground truth indicate the nodes that should be matched. The predicted matching matrix closely aligns with the ground truth, demonstrating the effectiveness of the proposed GMA module in open-set graph matching.

Seen and Unseen Viewpoints. As shown in Fig. 9, we perform visualization experiments on both the seen dataset Cityscapes [12] and the unseen dataset DensePASS [37], covering stuff classes [3] (e.g., road, sidewalk, sky) and thing classes [3] (e.g., car, bus, motorcycle). Within each group, categories tend to share certain similarities, which is reflected in their overlapping or intersecting distributions in the t-SNE visualization. In contrast, when comparing categories across the stuff and thing groups, their t-SNE embeddings typically show a significant separation, indicating substantial feature-level differences between the two groups.
For objects viewed from both seen and unseen perspectives, domain shift occurs across categories due to variations in the field of view.

Qualitative results for the EMA module. As shown in Fig. 10, we analyze the EMA module qualitatively, using Principal Component Analysis (PCA) to reveal regions with consistent color or brightness, while the L2 norm highlights areas receiving model attention. In the PCA projection, the proposed EMA method better highlights thing classes such as buildings, maintains smoother and less noisy representations in stuff classes like the sky, and achieves higher cross-view consistency in vegetation regions. In the Euclidean (L2) norm visualization, the baseline suffers from limited activation and reduced brightness in geometrically distorted regions, whereas our method effectively overcomes this issue and better highlights objects at long ranges and under wide viewpoints.

Sec. C: Limitations and Outlook

C.1. Societal Implications

As illustrated in Fig. 11, our proposed Extrapolative Domain Adaptive Panoramic Segmentation framework delivers significant societal benefits across four domains: quadruped robotics, autonomous driving, assistive navigation for people with visual impairments, and drone-based perception. By enhancing semantic and panoramic perception, our method enables quadruped robots to operate reliably in hazardous environments, supports autonomous vehicles in making safer decisions under diverse conditions, assists visually impaired users with real-time spatial awareness, and strengthens drone perception for environmental monitoring, precision agriculture, and infrastructure inspection.
Figure 10. Qualitative results for EMA. We visualize feature representations using Principal Component Analysis (PCA) projections and the Euclidean (L2) norm, comparing the MobileSAM + DAFormer baseline with our Euler-Margin Attention (EMA) on panoramic inputs.

C.2. Future Research Directions

In the future, we plan to pursue two main research directions. The first focuses on methodological improvements, aiming to increase the performance, efficiency, and robustness of our current approach. The second involves application-oriented extensions, where we intend to adapt and apply our methods to broader or more complex real-world scenarios. Building upon the first direction, we note that private-class pseudo-label training is currently threshold-based; future work could replace it with a threshold-free private-class mechanism, in line with existing research [38, 43, 73]. Despite progress in related areas, test-time adaptation [30, 40] and source-free domain adaptation [27, 53] have received limited attention in open-world vision for panoramic images. In the future, we will further expand the applicability of this method. Experiments show that the technique maintains stable performance under different fields of view, which confirms its adaptability to various imaging conditions.
Subsequent work will explore its applications to panoramic, fisheye, wide-angle, and pinhole images and further extend it to real-world scenes.

C.3. Limitations and Potential Solutions

In edge-computing scenarios or environments with limited computational resources, our model still exhibits limits in parameter scalability and inference efficiency. This limitation primarily arises from the adoption of the cross-domain open-set graph matching mechanism and the Euler-Margin Attention module. Although these components enhance generalization to unseen views, they inevitably introduce additional parameter overhead and computational cost during inference. We outline two potential solutions to the above issues. On the model-architecture side, we consider adopting a lightweight self-attention mechanism integrated with our proposed Euler-interval projection and amplitude-and-phase modulation, aiming to preserve representational expressiveness while reducing parameter complexity. To enhance computational efficiency, we plan to apply quantization to accelerate inference and reduce memory consumption.

Figure 11. Societal implications of panoramic vision technology.
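As a concrete illustration of the quantization direction, the sketch below shows a generic symmetric int8 post-training scheme; it is a standard technique sketched under our own naming, not a component of EDA-PSeg:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: store int8 values plus a
    single float scale so that weights ≈ q * scale, roughly quartering
    memory relative to float32 storage."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor for (or during) inference."""
    return q.astype(np.float32) * scale
```

Per-channel scales and quantization-aware fine-tuning would typically recover additional accuracy, at the cost of a slightly more involved pipeline.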
