GeoPE: A Unified Geometric Positional Embedding for Structured Tensors

Reading time: 5 minutes

📝 Original Info

  • Title: GeoPE: A Unified Geometric Positional Embedding for Structured Tensors
  • ArXiv ID: 2512.04963
  • Date: 2025-12-04
  • Authors:
      • Yupu Yao, University of Electronic Science and Technology of China, Chengdu, China (yypseek123@gmail.com)
      • Bowen Yang, Fudan University, Shanghai, China (bwyangseek@gmail.com)

📝 Abstract

Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
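The "false sequential proximity" the abstract describes is easy to see concretely. This small sketch (illustrative, not from the paper; the 4x4 grid size is an arbitrary choice) flattens a 2D patch grid row-major, the way a ViT does, and compares sequence distance to true spatial distance for two patches at a row boundary:

```python
import math

# Illustrative sketch: flattening a 2D patch grid into a 1D sequence makes
# the last patch of one row and the first patch of the next row adjacent
# in sequence, despite being far apart spatially.
H = W = 4  # a 4x4 grid of patch coordinates
coords = [(r, c) for r in range(H) for c in range(W)]  # row-major flatten

a = coords.index((0, W - 1))  # last patch of row 0
b = coords.index((1, 0))      # first patch of row 1

seq_dist = abs(a - b)                # distance along the 1-D sequence
spatial_dist = math.hypot(1, W - 1)  # true 2-D Euclidean distance

print(seq_dist)                # 1: immediate sequence neighbors
print(round(spatial_dist, 2))  # 3.16: spatially far apart
```

A purely sequential encoding such as 1D RoPE sees these two patches as maximally close, which is exactly the mismatch GeoPE is designed to remove.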

💡 Deep Analysis

Deep Dive into GeoPE: A Unified Geometric Positional Embedding for Structured Tensors.


📄 Full Content

GEOPE: A UNIFIED GEOMETRIC POSITIONAL EMBEDDING FOR STRUCTURED TENSORS

Yupu Yao, University of Electronic Science and Technology of China, Chengdu, China (yypseek123@gmail.com)
Bowen Yang, Fudan University, Shanghai, China (bwyangseek@gmail.com)

ABSTRACT

Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.

1 INTRODUCTION

The Transformer (Vaswani et al., 2017) has emerged as the backbone of large language models due to its capacity to capture global dependencies and generalize across modalities. However, the Transformer lacks an inherent mechanism for sequence order (Devlin et al., 2019; Raffel et al., 2020; Shaw et al., 2018). Conventional positional encodings such as Absolute Positional Encodings (APE) (Devlin et al., 2019; Chen et al., 2021) and Relative Positional Encodings (RPE) (Liu et al., 2021; Park et al., 2022; Wu et al., 2021) inject position information but often face trade-offs between flexibility and complexity.

Rotary Positional Encoding (RoPE) (Su et al., 2024) overcomes these limitations by rotating query and key vectors in a 2D plane, providing attention with strong length generalization (Jiang et al., 2023; Touvron et al., 2023; Yao, 2024). With Transformers increasingly applied to vision tasks, researchers have explored extending RoPE to two dimensions (Fang et al., 2024; Lu et al., 2024a;b). However, standard Vision Transformers (ViT) (Dosovitskiy et al., 2020) process images by flattening 2D grids into 1D sequences. This operation creates a geometric mismatch where spatially distant patches (e.g., at row edges) become immediate sequence neighbors. Existing 2D methods often adopt axis-wise designs, processing horizontal and vertical encodings independently or via mixed frequencies (Chu et al., 2024). For instance, Heo et al. (2024) partition the embedding space to allow independent rotations per axis. Nevertheless, because these axes are not geometrically coupled, such approaches struggle to decouple the false sequential proximity created by flattening from true spatial locality, effectively leaving the weak cross-axis interaction of high-dimensional RoPEs unresolved. The challenge of modeling this coupling is amplified in multi-modal learning (Dao et al., 2024; Yin et al., 2025; Shu et al., 2023). Some works extend RoPE to higher dimensions via Lie group/algebra frameworks (Appendix B). For example, Liu & Zhou (2025) formalize RoPE using a maximal abelian subalgebra (MASA) and introduce cross-dimensional interactions through orthogonal basis changes. However, this can overly constrain representations or incur high computational costs. Comminiello et al. (2024) argue that hypercomplex algebras provide essential inductive biases for multidimensional structures. Alternatively, Ostmeier et al. learn dense skew-symmetric matrices to build rotation operators, yet this remains computationally expensive and lacks theoretical guarantees for efficient spatial reconstruction.

We propose Geometric Positional Embedding (GeoPE), which extends RoPE's 2D complex-plane rotations to 3D Euclidean space using quaternions to strictly model coupled rotations in structured tensors (Section 3.3). Unlike independent axial methods, GeoPE treats spatial dimensions as a unified geometric entity. To overcome the non-commutativity of quaternion multiplication and ensure a consistent spatial prior, we construct a unified rotational operator by computing the symmetric mean in the logarithmic tangent space (Section 3.2). We also propose a linear variant for direct relative encoding (Section 3.4). This method enriches self-attention with a geometrically meaningful understanding of space, thereby fostering superior spatial reasoning and shape awareness (Section 4). Experiments (Section 5) show that GeoPE achieves significant performance gains in classification, detection, and segmentation, while retaining strong extrapolation properties.

2 RELATED WORK

Position
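The two core obstacles named above, non-commutativity of 3D rotations and the need for a symmetric operator, can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the axis assignments (row about x, column about y), the frequency, and the function names are assumptions for illustration. It shows why composing per-axis rotations directly is order-dependent, and how averaging their logarithms in the tangent space (the Lie algebra so(3)) before a single exponential yields one order-free rotation:

```python
import numpy as np

def rodrigues(rvec):
    """exp map: so(3) rotation vector -> SO(3) rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# Per-axis rotations: row position rotates about the x-axis, column
# position about the y-axis (axes and frequency are illustrative).
freq = 0.1
log_row = 2 * freq * np.array([1.0, 0.0, 0.0])  # row index 2
log_col = 5 * freq * np.array([0.0, 1.0, 0.0])  # column index 5

# Direct composition is order-dependent: rotations about distinct axes
# do not commute, so "row then column" differs from "column then row".
print(np.allclose(rodrigues(log_row) @ rodrigues(log_col),
                  rodrigues(log_col) @ rodrigues(log_row)))  # False

# Averaging in the logarithmic tangent space and exponentiating once
# gives a single operator with no ordering ambiguity.
R = rodrigues(0.5 * (log_row + log_col))
print(np.allclose(R @ R.T, np.eye(3)))  # True: still a valid rotation
```

The averaged-log construction is trivially symmetric in the two axes (vector addition commutes even though matrix multiplication does not), which is the consistency property the paper attributes to its unified rotational operator.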

…(Full text truncated)…

📸 Image Gallery

Frame.png · Implementation.png · acc_resolution.png · attention_distance.png · heatmap.png · tangent.png · texture.png

Reference

This content is AI-processed based on ArXiv data.
