📝 Original Info
- Title: GeoPE: A Unified Geometric Positional Embedding for Structured Tensors
- ArXiv ID: 2512.04963
- Date: 2025-12-04
- Authors: Yupu Yao (University of Electronic Science and Technology of China, Chengdu, China; yypseek123@gmail.com); Bowen Yang (Fudan University, Shanghai, China; bwyangseek@gmail.com)
📝 Abstract
Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
📄 Full Content
GEOPE: A UNIFIED GEOMETRIC POSITIONAL EMBEDDING FOR STRUCTURED TENSORS
Yupu Yao
University of Electronic Science and Technology of China
Chengdu, China
yypseek123@gmail.com
Bowen Yang
Fudan University
Shanghai, China
bwyangseek@gmail.com
ABSTRACT
Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non-commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
1 INTRODUCTION
The Transformer (Vaswani et al., 2017) has emerged as the backbone of large language models due to its capacity to capture global dependencies and generalize across modalities. However, the Transformer lacks an inherent mechanism for sequence order (Devlin et al., 2019; Raffel et al., 2020; Shaw et al., 2018). Conventional positional encodings such as Absolute Positional Encodings (APE) (Devlin et al., 2019; Chen et al., 2021) and Relative Positional Encodings (RPE) (Liu et al., 2021; Park et al., 2022; Wu et al., 2021) inject position information but often face trade-offs between flexibility and complexity. Rotary Positional Encoding (RoPE) (Su et al., 2024) overcomes these limitations by rotating query and key vectors in a 2D plane, providing attention with strong length generalization (Jiang et al., 2023; Touvron et al., 2023; Yao, 2024).
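RoPE's core mechanism can be sketched in a few lines: each pair of feature channels is rotated by an angle proportional to the token's position, so the query-key inner product depends only on the relative offset. The NumPy sketch below is illustrative only, not the paper's implementation:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate feature-channel pairs of x by position-dependent angles (1D RoPE)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per channel pair
    theta = pos * freqs                        # rotation angle for each pair
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

# Relative-position property: <R(m)q, R(n)k> depends only on m - n.
q, k = np.random.randn(8), np.random.randn(8)
s1 = rope_rotate(q, 5) @ rope_rotate(k, 3)    # offset 2
s2 = rope_rotate(q, 12) @ rope_rotate(k, 10)  # offset 2
assert np.isclose(s1, s2)
```

The assertion holds for any query and key: rotating both by their absolute positions leaves only the relative offset in the attention score.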
With Transformers increasingly applied to vision tasks, researchers have explored extending RoPE to two dimensions (Fang et al., 2024; Lu et al., 2024a;b). However, standard Vision Transformers (ViT) (Dosovitskiy et al., 2020) process images by flattening 2D grids into 1D sequences. This operation creates a geometric mismatch in which spatially distant patches (e.g., at row edges) become immediate sequence neighbors. Existing 2D methods often adopt axis-wise designs, processing horizontal and vertical encodings independently or via mixed frequencies (Chu et al., 2024). For instance, Heo et al. (2024) partition the embedding space to allow independent rotations per axis. Nevertheless, because these axes are not geometrically coupled, such approaches struggle to decouple the false sequential proximity created by flattening from true spatial locality, leaving the weak cross-axis interaction of high-dimensional RoPEs unresolved.
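The mismatch is easy to quantify on a toy grid (an illustrative example, not from the paper): the last patch of one row and the first patch of the next are sequence neighbors after flattening, yet sit at opposite ends of the image.

```python
import numpy as np

H = W = 4                # a 4x4 grid of patches
a = (0, W - 1)           # (row, col): end of row 0
b = (1, 0)               # start of row 1

seq_dist = abs((a[0] * W + a[1]) - (b[0] * W + b[1]))  # distance in the flattened sequence
euclid_dist = np.hypot(a[0] - b[0], a[1] - b[1])       # true 2D Euclidean distance

print(seq_dist)     # 1: "neighbors" after flattening
print(euclid_dist)  # sqrt(10) ~ 3.16: actually far apart spatially
```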
The challenge of modeling this coupling is amplified in multi-modal learning (Dao et al., 2024; Yin
et al., 2025; Shu et al., 2023). Some works extend RoPE to higher dimensions via Lie group/algebra
frameworks (Appendix B). For example, Liu & Zhou (2025) formalizes RoPE using a maximal
abelian subalgebra (MASA) and introduces cross-dimensional interactions through orthogonal basis
changes. However, this can overly constrain representations or incur high computational costs.
Comminiello et al. (2024) argues that hypercomplex algebras provide essential inductive biases for
multidimensional structures. Alternatively, Ostmeier et al. learn dense skew-symmetric matrices to
build rotation operators, yet this remains computationally expensive and lacks theoretical guarantees
for efficient spatial reconstruction.
We propose Geometric Positional Embedding (GeoPE), which extends RoPE's 2D complex-plane rotations to 3D Euclidean space using quaternions to strictly model coupled rotations in structured tensors (Section 3.3). Unlike independent axial methods, GeoPE treats spatial dimensions as a unified geometric entity. To overcome the non-commutativity of quaternion multiplication and ensure a consistent spatial prior, we construct a unified rotational operator by computing the symmetric mean in the logarithmic tangent space (Section 3.2). We also propose a linear variant for direct relative encoding (Section 3.4). This method enriches self-attention with a geometrically meaningful understanding of space, thereby fostering superior spatial reasoning and shape awareness (Section 4). Experiments (Section 5) show that GeoPE achieves significant performance gains in classification, detection, and segmentation, while retaining strong extrapolation properties.
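The motivation for averaging in the tangent space can be illustrated with a toy quaternion sketch. The per-axis generators and coordinates below are hypothetical choices for illustration, not the paper's exact parameterization: composing per-axis rotations by quaternion multiplication is order-dependent, whereas averaging the corresponding rotation vectors in the Lie algebra and exponentiating once yields a single operator that is symmetric in the two axes.

```python
import numpy as np

def quat_exp(v):
    """Map a rotation vector v (in the Lie algebra so(3)) to a unit quaternion."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = v / theta
    return np.concatenate([[np.cos(theta / 2)], np.sin(theta / 2) * axis])

def quat_mul(p, q):
    """Hamilton product of two quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

# Hypothetical per-axis generators: x-position rotates about e_x, y about e_y.
gx, gy = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
px, py = 0.7, 0.3  # toy patch coordinates (radians)

naive_xy = quat_mul(quat_exp(px * gx), quat_exp(py * gy))
naive_yx = quat_mul(quat_exp(py * gy), quat_exp(px * gx))
unified = quat_exp((px * gx + py * gy) / 2.0)  # mean in the Lie algebra, then exp

print(np.allclose(naive_xy, naive_yx))  # False: direct composition is order-dependent
```

The averaged rotation vector is symmetric in the two axes by construction, so the resulting operator imposes no arbitrary axis ordering.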
2 RELATED WORK
Position
…(Full text truncated)…
This content is AI-processed based on ArXiv data.