A Computationally Efficient Multidimensional Vision Transformer
Authors: Alaa El Ichi, Khalide Jbilou, F. Dufrenois
A. Elichi∗, K. Jbilou∗, and F. Dufrenois†

Abstract. Vision Transformers have achieved state-of-the-art performance in a wide range of computer vision tasks, but their practical deployment is limited by high computational and memory costs. In this paper, we introduce a novel tensor-based framework for Vision Transformers built upon the Tensor Cosine Product (c-product). By exploiting multilinear structures inherent in image data and the orthogonality of cosine transforms, the proposed approach enables efficient attention mechanisms and structured feature representations. We develop the theoretical foundations of the tensor cosine product, analyze its algebraic properties, and integrate it into a new c-product-based Vision Transformer architecture (TCP-ViT). Numerical experiments on standard classification and segmentation benchmarks demonstrate that the proposed method achieves a uniform 1/C parameter reduction (where C is the number of channels) while maintaining competitive accuracy.

1 Introduction

Vision Transformers (ViTs) [3] have recently attracted considerable interest as a strong alternative to convolutional neural networks, owing to their ability to capture long-range dependencies via self-attention. Unlike convolutional architectures, which operate on local receptive fields with shared weights, ViTs facilitate global information exchange across the entire image from the earliest network layers. This global modeling capability has enabled ViTs to achieve state-of-the-art performance across a wide range of computer vision tasks, including image classification, object detection, and semantic segmentation.

Despite these successes, conventional ViT models suffer from several intrinsic limitations. Images are first divided into patches that are subsequently flattened into vectors and processed as independent tokens.
This vectorization step disrupts the inherent multidimensional structure of image data and neglects correlations across spatial and channel dimensions. In addition, the self-attention mechanism exhibits quadratic computational and memory complexity with respect to the number of tokens, rendering ViTs costly to train and deploy on high-resolution images. These factors significantly constrain their scalability and practicality in resource-limited environments.

To alleviate these issues, extensive research has been devoted to the development of efficient Vision Transformer variants. Proposed approaches include sparse and axial attention mechanisms, low-rank and kernel-based attention formulations, as well as hybrid architectures that combine convolutional and transformer components [28, 29, 30]. Although these methods reduce computational overhead, they generally retain vectorized representations and do not explicitly exploit the multilinear structure of visual data. Consequently, important spatial and spectral relationships may remain underutilized.

Recent advances in tensor numerical linear algebra provide an alternative and mathematically principled approach to addressing these challenges. Tensors offer natural representations for multidimensional data and support algorithms that preserve structural properties while reducing redundancy [11, 14, 17]. In particular, tensor products based on orthogonal transforms, such as Fourier and cosine transforms, have demonstrated strong effectiveness in large-scale inverse problems, regularization of ill-posed systems, and high-dimensional scientific computing [1, 5]. These techniques benefit from solid theoretical foundations and efficient implementations enabled by fast transform algorithms.

Motivated by these developments, we propose an efficient Vision Transformer framework based on the tensor cosine product (c-product).
By operating directly on tensor-valued representations and employing cosine-transform-based tensor products, the proposed method preserves spatial structure, reduces computational complexity, and avoids the use of complex-valued arithmetic. The cosine transform is particularly well suited for vision applications due to its strong energy compaction properties and favorable boundary behavior [21].

The tensor cosine product enables a structured formulation of attention mechanisms in which interactions between image patches are computed within a multilinear transform domain. This formulation leads to more efficient attention computation, improved memory utilization, and a natural integration of numerical linear algebra techniques into modern deep learning architectures [7].

The main contributions of this work are summarized as follows:
• We formally define the tensor cosine product and establish its fundamental algebraic and orthogonality-preserving properties.
• We introduce a c-product-based self-attention mechanism that preserves tensor structure while reducing computational and memory complexity.
• We design a cosine-product Vision Transformer (TCP-ViT) that operates directly on tensor-valued image patches, achieving a uniform 1/C parameter reduction across all linear components.
• We provide theoretical complexity analysis and numerical experiments demonstrating the efficiency and competitiveness of the proposed approach.

∗ Université du Littoral Côte d'Opale, LMPA, 50 rue F. Buisson, 62228 Calais-Cedex, France.
† LISIC, 50 rue F. Buisson, Université du Littoral Côte d'Opale, 62228 Calais-Cedex, France.

The remainder of this paper is structured as follows. Section 2 reviews related work on efficient Vision Transformers and tensor methods in deep learning. Section 3 introduces the discrete cosine transform and the tensor cosine product with its algebraic properties.
Section 4 recalls the standard Vision Transformer architecture and establishes the notation used throughout. Section 5 presents the tensorized TCP-ViT architecture. Section 6 provides the parameter efficiency and computational complexity analyses. Section 7 reports numerical experiments. Section 8 concludes the paper.

2 Related Work

2.1 Efficient Vision Transformers. The original ViT [3, 13] demonstrated that pure transformer architectures can match or exceed CNNs on image classification when trained on large-scale datasets. However, its quadratic attention cost motivated numerous efficient variants. DeiT [28] introduced knowledge distillation and data-efficient training strategies. Swin Transformer [19] employs shifted window attention to achieve linear complexity in the number of tokens. Axial attention [29] factorizes the full attention along spatial axes. These approaches reduce computational cost but retain the vectorized token representation and do not explicitly leverage the multilinear structure of image data.

2.2 Tensor Methods in Deep Learning. Tensor decompositions have been widely applied to compress neural networks. Tucker decomposition [16] and tensor-train (TT) [23] formats have been used to factorize weight matrices in fully connected and convolutional layers, achieving significant parameter reductions. LASER [25] showed that applying SVD to individual weight matrices and removing components corresponding to the smallest singular values can improve LLM reasoning, particularly when compressing the feed-forward network. More recently, the t-product framework [14, 15] has enabled structured tensor operations based on the discrete Fourier transform (DFT), with applications to tensor regression, principal component analysis [10, 4], and inverse problems [6, 5].
The present work adopts the cosine-transform variant of the t-product, which operates entirely in the real domain and avoids the complex arithmetic inherent to DFT-based methods.

2.3 Tensor-based Attention Compression. The most closely related work is TensorLLM [9], which tensorises multi-head attention weights by stacking the per-head query, key, value, and output matrices into a 4D tensor and applying Tucker decomposition with shared factor matrices across heads. While both our approach and TensorLLM exploit the multi-head structure of attention, the two methods differ fundamentally in three respects. First, TensorLLM is a post-hoc compression technique: it decomposes pre-trained weights via an approximate low-rank factorization, which introduces approximation error controlled by the chosen multilinear ranks. In contrast, our c-product formulation is a native algebraic reformulation: the model is trained directly with the c-product, and the resulting factorization is exact, with parameter reduction arising from the block-diagonal structure rather than from rank truncation. Second, TensorLLM requires hyperparameter selection for the multilinear ranks and produces variable compression ratios depending on the chosen ranks, whereas the c-product yields a uniform, architecture-independent parameter reduction that is determined solely by the number of channels. Third, the cross-channel coupling mechanism differs: TensorLLM learns shared factor matrices at additional computational cost, while the c-product couples channels implicitly through the fixed orthogonal DCT matrix Φ_C at zero parametric cost.

In a related but distinct direction, Su et al. [26] proposed DctViT, a hybrid CNN–Transformer architecture that employs the DCT for feature map compression within a dedicated convolutional block (DAD block).
Their approach applies the DCT as a signal processing tool to reduce spatial redundancy in intermediate representations, while retaining a standard Transformer encoder with conventional matrix multiplications. In contrast, the present work uses the DCT as the algebraic foundation of a tensor product (⋆_c) that replaces all linear projections in the Transformer encoder, yielding a principled and uniform 1/C parameter reduction across the entire architecture without any convolutional component.

3 The Cosine Product and Its Properties

We begin by introducing the notation and definitions that will be used throughout the paper.

3.1 Notation

Scalars are denoted by lowercase letters (a), vectors by bold lowercase letters (x), matrices by uppercase letters (X), and third-order tensors by calligraphic letters (𝒳). The notation and symbols used throughout this paper are summarized in Table 3.1.

Table 3.1: Summary of notation used throughout the paper.

Symbol | Description
H_img, W_img | Image height and width
C | Number of image channels (tube dimension)
P | Patch side length
N = H_img W_img / P² | Number of patches (tokens)
d = P² | Spatial embedding dimension per frontal slice
d_eff = d · C | Effective (flattened) embedding dimension
H | Number of attention heads
d_h = d / H | Per-head dimension (TCP-ViT); d_h^Std = d_eff / H (Std-ViT)
r_ff | FFN expansion ratio (d_ff = r_ff · d)
L | Number of Transformer layers
⋆_c | Tensor cosine product
𝒳^(k) | k-th frontal slice of 𝒳
𝒳̂^(k) | k-th frontal slice of DCT₃(𝒳)
𝒳^⊤c | Tensor c-transpose (Definition 3.2)

A third-order tensor is denoted by 𝒳 ∈ R^{m×n×C}, where C denotes the size along the third mode (the tube dimension). The k-th frontal slice is written 𝒳^(k) ∈ R^{m×n} for k = 1, …, C.

3.2 Discrete Cosine Transform

The orthogonal DCT-II matrix Φ_C ∈ R^{C×C} has entries defined by [8]

\[
(\Phi_C)_{j,k} =
\begin{cases}
\sqrt{1/C}, & j = 0,\\
\sqrt{2/C}\,\cos\dfrac{\pi(2k+1)j}{2C}, & j = 1,2,\dots,C-1,
\end{cases}
\qquad j,k = 0,1,\dots,C-1. \tag{3.1}
\]

It satisfies the orthogonality property

\[
\Phi_C^{\top}\Phi_C = I_C, \qquad \Phi_C\Phi_C^{\top} = I_C. \tag{3.2}
\]

For a third-order tensor 𝒳 ∈ R^{m×n×C}, the DCT along the third mode is defined as

\[
\widehat{\mathcal{X}} = \mathrm{DCT}_3(\mathcal{X}) = \mathcal{X} \times_3 \Phi_C, \tag{3.3}
\]

where ×₃ denotes the mode-3 product. The k-th frontal slice of the transformed tensor is denoted 𝒳̂^(k) for k = 1, …, C. The inverse transform is

\[
\mathcal{X} = \mathrm{IDCT}_3(\widehat{\mathcal{X}}) = \widehat{\mathcal{X}} \times_3 \Phi_C^{\top}. \tag{3.4}
\]

The DCT possesses several properties that make it particularly suitable for tensor-based vision transformers:
• Energy compaction: Most of the signal energy is concentrated in the low-frequency coefficients, enabling effective low-rank approximation and compression.
• Real-valued orthogonality: Unlike the Fourier transform, the DCT produces real-valued orthogonal coefficients, simplifying computations in neural networks.
• Fast computation: Using the fast cosine transform, the DCT can be computed in O(C log C) operations per tube, and O(mnC log C) for a third-order tensor along the third mode.
• Invertibility: The DCT is perfectly invertible, ensuring no information loss when transforming between the spatial and cosine domains.

3.3 Tensor Cosine Product

Definition 3.1 (Tensor Cosine Product). Let 𝒜 ∈ R^{m×n×C} and ℬ ∈ R^{n×ℓ×C}. The tensor cosine product (c-product) is defined as

\[
\mathcal{C} = \mathcal{A} \star_c \mathcal{B} = \mathrm{IDCT}_3\bigl(\mathrm{DCT}_3(\mathcal{A}) \cdot \mathrm{DCT}_3(\mathcal{B})\bigr) \in \mathbb{R}^{m \times \ell \times C}, \tag{3.5}
\]

where the matrix product "·" is performed slice-wise in the transform domain: 𝒞̂^(k) = 𝒜̂^(k) ℬ̂^(k) for k = 1, …, C. This product allows efficient computation via fast DCT algorithms, orthogonality-preserving operations for numerical stability, and structured low-rank approximations.

3.4 Basic Tensor Operations

We summarize the fundamental definitions of third-order tensor operations under the c-product framework, extending the t-product algebra introduced by Kilmer et al. [14].

Definition 3.2 (Tensor c-transpose). Let 𝒜 ∈ R^{m×n×C}. The c-transpose 𝒜^⊤c ∈ R^{n×m×C} is the tensor whose frontal slices in the DCT domain satisfy

\[
\widehat{(\mathcal{A}^{\top_c})}^{(k)} = \bigl(\widehat{\mathcal{A}}^{(k)}\bigr)^{\top}, \qquad k = 1,\dots,C. \tag{3.6}
\]

Equivalently, 𝒜^⊤c = IDCT₃(DCT₃(𝒜)^{⊤_slice}), where ⊤_slice transposes each frontal slice independently.

Definition 3.3 (Identity tensor, orthogonality, invertibility). Let 𝒜 ∈ R^{m×m×C}. Then:
(i) The identity tensor ℐ_{m,C} ∈ R^{m×m×C} is defined in the transform domain by requiring that every frontal slice of ℐ̂_{m,C} = DCT₃(ℐ_{m,C}) equals the m×m identity matrix I_m; equivalently, ℐ_{m,C} = IDCT₃(ℐ̂_{m,C}). With this definition, 𝒜 ⋆_c ℐ_{m,C} = ℐ_{m,C} ⋆_c 𝒜 = 𝒜 for every 𝒜 ∈ R^{m×m×C}.
(ii) A tensor 𝒬 ∈ R^{m×m×C} is c-orthogonal if 𝒬^⊤c ⋆_c 𝒬 = 𝒬 ⋆_c 𝒬^⊤c = ℐ_{m,C}.
(iii) A tensor 𝒬 is f-orthogonal if each frontal slice 𝒬̂^(k) of its DCT transform is an orthogonal matrix.
(iv) A tensor 𝒟 ∈ R^{m×m×C} is f-diagonal if all frontal slices of 𝒟̂ are diagonal matrices.
(v) A tensor 𝒜 ∈ R^{m×m×C} is invertible if there exists 𝒜^{−1} ∈ R^{m×m×C} such that 𝒜 ⋆_c 𝒜^{−1} = ℐ_{m,C}.

Definition 3.4 (f-symmetry and f-positive-definiteness). A square tensor 𝒜 ∈ R^{m×m×C} is called f-symmetric if all frontal slices 𝒜̂^(k) are symmetric matrices. It is f-positive-definite (resp. f-positive-semidefinite) if each 𝒜̂^(k) is symmetric positive-definite (resp. symmetric positive-semidefinite).

These concepts enable a consistent generalization of classical matrix operations to third-order tensors while preserving properties such as orthogonality, symmetry, and positive definiteness in the DCT domain. They form the algebraic foundation for the TCP-ViT architecture developed in Section 5.

4 Standard Vision Transformer

Before introducing the tensorized architecture, we recall the standard Vision Transformer [3] and establish the notation that will serve as the basis for the c-product generalization.
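As a concrete companion to Definitions 3.1–3.3, the c-product machinery can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the authors' implementation; it assumes that `scipy.fft.dct`/`idct` with `type=2, norm="ortho"` realize the orthonormal Φ_C of Eq. (3.1), and it builds the identity tensor directly in the transform domain as in Definition 3.3(i).

```python
import numpy as np
from scipy.fft import dct, idct

def dct3(X):
    """DCT along the third (tube) mode: X x_3 Phi_C (orthonormal DCT-II)."""
    return dct(X, type=2, axis=2, norm="ortho")

def idct3(Xhat):
    """Inverse transform: Xhat x_3 Phi_C^T."""
    return idct(Xhat, type=2, axis=2, norm="ortho")

def c_product(A, B):
    """Definition 3.1: slice-wise matrix products in the DCT domain."""
    Chat = np.einsum("mnk,nlk->mlk", dct3(A), dct3(B))
    return idct3(Chat)

def c_transpose(A):
    """Definition 3.2: transpose each frontal slice in the DCT domain."""
    return idct3(dct3(A).transpose(1, 0, 2))

def identity_tensor(m, C):
    """Definition 3.3(i): every DCT-domain frontal slice equals I_m."""
    return idct3(np.repeat(np.eye(m)[:, :, None], C, axis=2))
```

Because Φ_C is orthogonal, `dct3` preserves Frobenius norms, and `c_product(A, identity_tensor(n, C))` recovers `A` up to floating-point error.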
All operations in this section are matrix-based and operate on vectorized patch representations.

4.1 Patch Embedding

Let I ∈ R^{H_img×W_img×C} denote an input image with C channels. The image is partitioned into N = H_img W_img / P² non-overlapping patches of size P × P. Each patch is flattened into a vector of dimension d_eff = P²C and projected to an embedding space via a learnable matrix W_E ∈ R^{d_eff×d_eff}:

\[
X_0 = [\,x_{\mathrm{cls}};\, x_1;\, \dots;\, x_N\,] + E_{\mathrm{pos}} \in \mathbb{R}^{(N+1)\times d_{\mathrm{eff}}}, \tag{4.1}
\]

where x_cls ∈ R^{d_eff} is a learnable classification token and E_pos ∈ R^{(N+1)×d_eff} is a learnable positional embedding.

4.2 Scaled Dot-Product Attention

Given an input X ∈ R^{(N+1)×d_eff}, the query, key, and value matrices are obtained via linear projections:

\[
Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V, \tag{4.2}
\]

where W_Q, W_K, W_V ∈ R^{d_eff×d_h} are learnable weight matrices and d_h = d_eff / H is the per-head dimension. The scaled dot-product attention is defined as

\[
\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_h}}\right)V \in \mathbb{R}^{(N+1)\times d_h}. \tag{4.3}
\]

The softmax is applied row-wise, so that each row of the attention matrix sums to one.

4.3 Multi-Head Self-Attention

With H attention heads, the multi-head self-attention (MHSA) is

\[
\mathrm{MHSA}(X) = [\,O_1;\, \dots;\, O_H\,]\,W_O \in \mathbb{R}^{(N+1)\times d_{\mathrm{eff}}}, \tag{4.4}
\]

where each head output O_h = Attention(XW_{Q,h}, XW_{K,h}, XW_{V,h}) ∈ R^{(N+1)×d_h}, the concatenation [·] is along the column dimension, and W_O ∈ R^{d_eff×d_eff} is a learnable output projection.

4.4 Feed-Forward Network and Layer Normalization

Each Transformer layer applies a two-layer feed-forward network (FFN) with expansion ratio r_ff and GELU activation:

\[
\mathrm{FFN}(X) = \varphi(XW_1)\,W_2, \tag{4.5}
\]

where W_1 ∈ R^{d_eff×d_ff}, W_2 ∈ R^{d_ff×d_eff}, d_ff = r_ff · d_eff, and φ denotes the element-wise GELU activation. Layer normalization (LN) is applied with learnable scale and shift vectors γ, β ∈ R^{d_eff}.
A full Transformer block maps X_{ℓ−1} to X_ℓ via:

\[
Y_\ell = X_{\ell-1} + \mathrm{MHSA}(\mathrm{LN}(X_{\ell-1})), \tag{4.6}
\]
\[
X_\ell = Y_\ell + \mathrm{FFN}(\mathrm{LN}(Y_\ell)). \tag{4.7}
\]

4.5 Parameter Count

For a single Transformer layer operating at dimension δ = d_eff, the parameter count (excluding biases) is:

\[
\Theta^{\mathrm{Std}}_{\mathrm{layer}}(\delta)
= \underbrace{4\delta^2}_{\text{MHSA: } W_Q, W_K, W_V, W_O}
+ \underbrace{2 r_{\mathrm{ff}}\delta^2}_{\text{FFN: } W_1, W_2}
+ \underbrace{4\delta}_{\text{LN: } \gamma,\beta\ (\times 2)}
= (4 + 2 r_{\mathrm{ff}})\delta^2 + 4\delta. \tag{4.8}
\]

This O(δ²) scaling is the fundamental cost that the c-product framework reduces by a factor of C, as shown in Section 6.

5 The TCP-ViT Architecture

We now lift each component of the standard ViT to the tensor setting by replacing matrix multiplication with the tensor cosine product ⋆_c. The key idea is to represent each patch as a matrix X_i ∈ R^{d×C} (with d = P²) rather than as a flattened vector x_i ∈ R^{d_eff} (with d_eff = P²C). All operations are then performed via the c-product, which couples the C channels through the fixed orthogonal DCT at zero parametric cost.

5.1 Tensor Operations for Vision Transformers

Definition 5.1 (t-Linear). Given an input tensor 𝒳 ∈ R^{N×d×C} and a weight tensor 𝒲 ∈ R^{d×d′×C}, the tensor linear projection is

\[
\mathrm{t\text{-}Linear}(\mathcal{X};\mathcal{W}) = \mathcal{X} \star_c \mathcal{W} \in \mathbb{R}^{N \times d' \times C}. \tag{5.1}
\]

This operation generalizes the matrix multiplication XW to the tensor setting. In the DCT domain, it decomposes into C independent matrix multiplications: 𝒳̂^(k) 𝒲̂^(k) for k = 1, …, C. The cross-channel coupling is handled implicitly by the DCT/IDCT pair at zero parametric cost.

Definition 5.2 (t-Softmax). Given a tensor 𝒮 ∈ R^{N×N×C}, the tensor softmax is defined slice-wise in the transform domain:

\[
\mathrm{t\text{-}Softmax}(\mathcal{S})_{ijk} = \frac{\exp(\widehat{\mathcal{S}}_{ijk})}{\sum_{\ell=1}^{N}\exp(\widehat{\mathcal{S}}_{i\ell k})}, \qquad \forall\, i,k, \tag{5.2}
\]

where 𝒮̂^(k) denotes the k-th frontal slice in the DCT domain. The normalization is performed independently for each token i and each frontal slice k in the transform domain.
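A minimal NumPy sketch of Definitions 5.1 and 5.2 may help fix ideas. It again assumes `scipy.fft`'s orthonormal DCT-II plays the role of Φ_C; since the definition of t-Softmax leaves implicit whether the result is mapped back to the spatial domain, this sketch applies the slice-wise softmax in the DCT domain and then transforms back, which is one natural reading.

```python
import numpy as np
from scipy.fft import dct, idct

def dct3(X):   # X x_3 Phi_C
    return dct(X, type=2, axis=2, norm="ortho")

def idct3(X):  # X x_3 Phi_C^T
    return idct(X, type=2, axis=2, norm="ortho")

def t_linear(X, W):
    """Definition 5.1: C independent products Xhat^(k) What^(k) per slice."""
    Yhat = np.einsum("ndk,dek->nek", dct3(X), dct3(W))
    return idct3(Yhat)

def t_softmax(S):
    """Definition 5.2: row-wise softmax on each DCT-domain frontal slice."""
    Shat = dct3(S)
    Shat = Shat - Shat.max(axis=1, keepdims=True)  # numerical stabilization
    E = np.exp(Shat)
    return idct3(E / E.sum(axis=1, keepdims=True))
```

In the transform domain, each row of each frontal slice of the t-Softmax output sums to one, mirroring the row-stochastic property of the matrix softmax slice by slice.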
Since the C slices are decoupled in the DCT domain, the attention weights of each frequency component are computed independently, yielding a block-diagonal structure in the attention operator.

Definition 5.3 (t-Attention). Given query, key, and value tensors 𝒬, 𝒦, 𝒱 ∈ R^{N×d_h×C}, the tensor scaled attention is

\[
\mathrm{t\text{-}Attention}(\mathcal{Q},\mathcal{K},\mathcal{V}) = \mathrm{t\text{-}Softmax}\!\left(\frac{1}{\sqrt{d_h}}\,\mathcal{Q}\star_c\mathcal{K}^{\top_c}\right)\star_c\mathcal{V} \in \mathbb{R}^{N\times d_h\times C}, \tag{5.3}
\]

where 𝒦^⊤c ∈ R^{d_h×N×C} is the c-transpose (Definition 3.2).

Definition 5.4 (t-MHSA). Given an input 𝒳 ∈ R^{N×d×C}, H attention heads with d_h = d/H, and learnable weight tensors 𝒲_{Q,h}, 𝒲_{K,h}, 𝒲_{V,h} ∈ R^{d×d_h×C}, h = 1, …, H, the tensor multi-head self-attention is

\[
\mathrm{t\text{-}MHSA}(\mathcal{X}) = [\,\mathcal{O}_1;\,\dots;\,\mathcal{O}_H\,]_{(2)} \star_c \mathcal{W}_O \in \mathbb{R}^{N\times d\times C}, \tag{5.4}
\]

where 𝒲_O ∈ R^{d×d×C}, [·]_(2) denotes concatenation along mode 2, and each head output is

\[
\mathcal{O}_h = \mathrm{t\text{-}Attention}\bigl(\mathcal{X}\star_c\mathcal{W}_{Q,h},\ \mathcal{X}\star_c\mathcal{W}_{K,h},\ \mathcal{X}\star_c\mathcal{W}_{V,h}\bigr) \in \mathbb{R}^{N\times d_h\times C}.
\]

Correspondence with the standard ViT. In the standard ViT (Section 4.3), each projection W_{Q,h} ∈ R^{d_eff×d_h} operates on the full flattened vector of dimension d_eff = dC, requiring d_eff · d_h = dC · d_h parameters per head per projection. In the c-product formulation, the weight tensor 𝒲_{Q,h} ∈ R^{d×d_h×C} contains only d · d_h · C entries, but crucially these entries operate on the reduced dimension d per slice rather than the full dimension dC. This yields a 1/C parameter reduction per projection, as formalized in Section 6.

Definition 5.5 (t-LayerNorm). Given 𝒳 ∈ R^{N×d×C} and learnable tensors γ, β ∈ R^{1×d×C}, the tensor layer normalization is applied slice-wise in the transform domain:

\[
\mathrm{t\text{-}LN}(\widehat{\mathcal{X}})_{ijk} = \gamma_{1jk}\cdot\frac{\widehat{\mathcal{X}}_{ijk}-\widehat{\mu}_{ik}}{\widehat{\sigma}_{ik}+\varepsilon} + \beta_{1jk}, \tag{5.5}
\]

where, for each token i and frontal slice k in the DCT domain:

\[
\widehat{\mu}_{ik} = \frac{1}{d}\sum_{j=1}^{d}\widehat{\mathcal{X}}_{ijk}, \qquad
\widehat{\sigma}_{ik} = \sqrt{\frac{1}{d}\sum_{j=1}^{d}\bigl(\widehat{\mathcal{X}}_{ijk}-\widehat{\mu}_{ik}\bigr)^2}.
\]
Normalization is computed for each token–slice pair (i, k) independently in the transform domain, preserving the full tensor structure. Since the DCT is an orthogonal transform, ∥Φ_C x∥₂ = ∥x∥₂, the norms of the activations are preserved across domains, ensuring comparable conditioning in each slice.

Definition 5.6 (t-FFN). Given 𝒳 ∈ R^{N×d×C} and weight tensors 𝒲_1 ∈ R^{d×d_ff×C}, 𝒲_2 ∈ R^{d_ff×d×C} with d_ff = r_ff · d, the tensor feed-forward network is

\[
\mathrm{t\text{-}FFN}(\mathcal{X}) = \varphi(\mathcal{X}\star_c\mathcal{W}_1)\star_c\mathcal{W}_2 \in \mathbb{R}^{N\times d\times C}, \tag{5.6}
\]

where φ denotes the element-wise GELU activation.

5.2 Patch Tensorization

Let I ∈ R^{H_img×W_img×C} be an input image. We partition it into N = H_img W_img / P² non-overlapping patches and represent each patch as a matrix X_i ∈ R^{P²×C}, where P² = d is the number of spatial positions within the patch and C is the number of channels. The full set of patches is stacked as a third-order tensor:

\[
\mathcal{X} \in \mathbb{R}^{N\times d\times C}, \qquad d = P^2. \tag{5.7}
\]

Mode 1 indexes the N patches, mode 2 the d spatial positions within each patch, and mode 3 the C channels.

Comparison with the standard ViT. The standard ViT flattens each patch to a vector x_i ∈ R^{d_eff} with d_eff = dC = P²C, yielding a matrix X ∈ R^{N×d_eff}. The TCP-ViT preserves the two-dimensional structure d × C within each patch, enabling the c-product to exploit cross-channel correlations algebraically.

5.3 Tensor Embedding

A learnable classification tensor 𝒯_cls ∈ R^{1×d×C} is prepended along mode 1, and a positional tensor ℰ ∈ R^{(N+1)×d×C} is added:

\[
\mathcal{X}_0 = [\,\mathcal{T}_{\mathrm{cls}};\,\mathcal{X}\,]_{(1)} + \mathcal{E} \in \mathbb{R}^{(N+1)\times d\times C}. \tag{5.8}
\]

5.4 TCP-ViT Transformer Block

Each of the L TCP-ViT blocks maps 𝒳_{ℓ−1} ∈ R^{(N+1)×d×C} to 𝒳_ℓ ∈ R^{(N+1)×d×C} via:

\[
\mathcal{Y}_\ell = \mathcal{X}_{\ell-1} + \mathrm{t\text{-}MHSA}(\mathrm{t\text{-}LN}(\mathcal{X}_{\ell-1})), \tag{5.9}
\]
\[
\mathcal{X}_\ell = \mathcal{Y}_\ell + \mathrm{t\text{-}FFN}(\mathrm{t\text{-}LN}(\mathcal{Y}_\ell)).
\]
(5.10)

Every operation maps R^{(N+1)×d×C} to R^{(N+1)×d×C}, ensuring that the tensor structure is preserved through all L layers without any flattening or reshaping.

5.5 Algorithmic Description

Algorithm 1: TCP-ViT Forward Pass
Require: Image I ∈ R^{H_img×W_img×C}, L blocks, weight tensors {𝒲^ℓ_{Q,h}, 𝒲^ℓ_{K,h}, 𝒲^ℓ_{V,h}, 𝒲^ℓ_O, 𝒲^ℓ_1, 𝒲^ℓ_2}_{ℓ,h}
 1: 𝒳 ← Patchify(I) ∈ R^{N×d×C}                      ▷ d = P², N = H_img W_img / P²
 2: 𝒳_0 ← [𝒯_cls; 𝒳]_(1) + ℰ                        ▷ ∈ R^{(N+1)×d×C}
 3: for ℓ = 1 to L do
 4:   𝒳̄ ← t-LN(𝒳_{ℓ−1})                             ▷ normalize once, reuse for all heads
 5:   for h = 1 to H do
 6:     𝒬_h ← 𝒳̄ ⋆_c 𝒲^ℓ_{Q,h}                       ▷ ∈ R^{(N+1)×d_h×C}
 7:     𝒦_h ← 𝒳̄ ⋆_c 𝒲^ℓ_{K,h}
 8:     𝒱_h ← 𝒳̄ ⋆_c 𝒲^ℓ_{V,h}
 9:     𝒮_h ← (1/√d_h) 𝒬_h ⋆_c 𝒦_h^⊤c               ▷ ∈ R^{(N+1)×(N+1)×C}
10:     𝒪_h ← t-Softmax(𝒮_h) ⋆_c 𝒱_h                ▷ ∈ R^{(N+1)×d_h×C}
11:   end for
12:   𝒴_ℓ ← 𝒳_{ℓ−1} + [𝒪_1; …; 𝒪_H]_(2) ⋆_c 𝒲^ℓ_O   ▷ residual + output projection
13:   ℋ ← φ(t-LN(𝒴_ℓ) ⋆_c 𝒲^ℓ_1)                    ▷ ∈ R^{(N+1)×d_ff×C}
14:   𝒳_ℓ ← 𝒴_ℓ + ℋ ⋆_c 𝒲^ℓ_2                       ▷ ∈ R^{(N+1)×d×C}
15: end for
16: return 𝒳_L

Remark 1 (Lossless factorization). Unlike pruning, quantization, or low-rank approximation methods, the c-product factorization introduces no approximation error. Since Φ_C^⊤ Φ_C = I_C, the mapping 𝒳 ↦ IDCT₃(𝒲̂^(k) · 𝒳̂^(k)) is exact for any choice of learned weights 𝒲̂^(k). The parameter reduction arises from structural constraints on the weight tensor (block-diagonal structure in the DCT domain), not from information discarding.

6 Theoretical Analysis

6.1 Parameter Efficiency

We compare the parameter counts of the TCP-ViT and the standard ViT. The TCP-ViT operates on tensors 𝒳 ∈ R^{N×d×C}, while the standard ViT operates on the flattened representation X ∈ R^{N×dC}. Both architectures process the same amount of information per token.

6.1.1 Multi-Head Self-Attention. With H heads and per-head dimension d_h = d/H, Table 6.1 compares the weight dimensions.
TCP-ViT count: The H heads contribute 3H · d d_h C = 3d²C parameters (since H d_h = d), plus d²C for 𝒲_O:

\[
\Theta^{\text{c-ViT}}_{\mathrm{MHSA}} = 4d^2C. \tag{6.1}
\]

Standard count: 3H · (dC)(d_h C) = 3d²C², plus (dC)² = d²C² for W_O:

\[
\Theta^{\mathrm{Std}}_{\mathrm{MHSA}} = 4d^2C^2. \tag{6.2}
\]

Ratio: Θ^{c-ViT}_MHSA / Θ^{Std}_MHSA = 1/C.

Table 6.1: MHSA weight dimensions. Each projection is listed per head; the total count sums over all H heads.

Projection | TCP-ViT weight tensor | Std-ViT weight matrix
W_{Q,h} | R^{d×d_h×C} | R^{dC×d_h C}
W_{K,h} | R^{d×d_h×C} | R^{dC×d_h C}
W_{V,h} | R^{d×d_h×C} | R^{dC×d_h C}
W_O | R^{d×d×C} | R^{dC×dC}

6.1.2 Feed-Forward Network. With expansion ratio r_ff and hidden dimension d_ff = r_ff · d, Table 6.2 compares the weight dimensions.

Table 6.2: FFN weight dimensions for TCP-ViT and standard ViT.

Projection | TCP-ViT weight tensor | Std-ViT weight matrix
W_1 | R^{d×d_ff×C} | R^{dC×d_ff C}
W_2 | R^{d_ff×d×C} | R^{d_ff C×dC}

TCP-ViT count:

\[
\Theta^{\text{c-ViT}}_{\mathrm{FFN}} = 2 r_{\mathrm{ff}} d^2 C. \tag{6.3}
\]

Standard count:

\[
\Theta^{\mathrm{Std}}_{\mathrm{FFN}} = 2 r_{\mathrm{ff}} d^2 C^2. \tag{6.4}
\]

Ratio: Θ^{c-ViT}_FFN / Θ^{Std}_FFN = 1/C.

6.1.3 Layer Normalization. Each t-LayerNorm has learnable tensors γ, β ∈ R^{1×d×C}. A TCP-ViT block uses two normalizations:

\[
\Theta^{\text{c-ViT}}_{\mathrm{LN}} = 2 \times 2 \cdot dC = 4dC. \tag{6.5}
\]

The standard ViT has γ, β ∈ R^{dC} per normalization:

\[
\Theta^{\mathrm{Std}}_{\mathrm{LN}} = 2 \times 2 \cdot dC = 4dC. \tag{6.6}
\]

Ratio: 1 (normalization parameters scale linearly, so there is no quadratic gain).

Total per layer:

\[
\Theta^{\text{c-ViT}}_{\mathrm{layer}} = (4 + 2 r_{\mathrm{ff}}) d^2 C + 4dC, \tag{6.7}
\]
\[
\Theta^{\mathrm{Std}}_{\mathrm{layer}} = (4 + 2 r_{\mathrm{ff}}) d^2 C^2 + 4dC. \tag{6.8}
\]

Over L layers, the dominant quadratic terms yield:

\[
\frac{\Theta^{\text{c-ViT}}}{\Theta^{\mathrm{Std}}} \xrightarrow{\ d \gg 1\ } \frac{1}{C}. \tag{6.9}
\]

This result is independent of the depth L, the number of heads H, and the FFN expansion ratio r_ff: the c-product framework yields a uniform 1/C parameter reduction across all linear components. The 1/C reduction originates from a fundamental algebraic property of the c-product.
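The per-layer counts are easy to check numerically. The following sketch evaluates Eqs. (6.7)–(6.8) directly and confirms that the ratio tends to 1/C as d grows, with the linear 4dC LayerNorm term keeping it slightly above 1/C at finite d.

```python
def theta_tcp_layer(d, C, r_ff):
    """Eq. (6.7): per-layer TCP-ViT parameter count (biases excluded)."""
    return (4 + 2 * r_ff) * d**2 * C + 4 * d * C

def theta_std_layer(d, C, r_ff):
    """Eq. (6.8): per-layer standard ViT parameter count at dimension dC."""
    return (4 + 2 * r_ff) * (d * C) ** 2 + 4 * d * C

def ratio(d, C, r_ff=4):
    """Per-layer parameter ratio TCP-ViT / Std-ViT."""
    return theta_tcp_layer(d, C, r_ff) / theta_std_layer(d, C, r_ff)
```

For the small configuration used later in Section 7 (d = 16, C = 3, r_ff = 4) the per-layer ratio is about 0.338, already close to 1/3; at d = 256 it is within 5 × 10⁻⁴ of 1/3.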
In the standard ViT, every learnable linear map operates on the full flattened vector of dimension dC, so that each weight matrix has O(d²C²) entries. The c-product framework replaces each such matrix by a third-order weight tensor 𝒲 ∈ R^{d×d×C} with O(d²C) entries. The cross-channel coupling that, in the standard setting, requires O(C) additional parameters per connection is handled implicitly by the fixed orthogonal DCT matrix Φ_C and its inverse Φ_C^⊤, which introduce zero learnable parameters. Since Φ_C^⊤ Φ_C = I_C, this factorization is exact: no information is lost, and no approximation is introduced.

Table 6.3: Per-layer parameter comparison between TCP-ViT and standard ViT. Input tensor 𝒳 ∈ R^{N×d×C}, standard input X ∈ R^{N×dC}. Expansion ratio r_ff, H heads, d_h = d/H, d_ff = r_ff d.

Component | TCP-ViT tensor size | TCP-ViT params | Std params | Ratio
W_{Q,h} (×H) | R^{d×d_h×C} | d²C | d²C² | 1/C
W_{K,h} (×H) | R^{d×d_h×C} | d²C | d²C² | 1/C
W_{V,h} (×H) | R^{d×d_h×C} | d²C | d²C² | 1/C
W_O | R^{d×d×C} | d²C | d²C² | 1/C
MHSA total | — | 4d²C | 4d²C² | 1/C
W_1 | R^{d×d_ff×C} | r_ff d²C | r_ff d²C² | 1/C
W_2 | R^{d_ff×d×C} | r_ff d²C | r_ff d²C² | 1/C
FFN total | — | 2r_ff d²C | 2r_ff d²C² | 1/C
γ, β (×2) | R^{1×d×C} each | 4dC | 4dC | 1
Total per layer | — | (4+2r_ff)d²C | (4+2r_ff)d²C² | 1/C

6.2 Computational Complexity

The FLOPs for a single TCP-ViT layer consist of C independent matrix multiplications at dimension d, plus the DCT/IDCT overhead:

\[
F^{\text{TCP-ViT}}_{\mathrm{layer}} = C \cdot F_{\mathrm{layer}}(N, d) + O(NdC\log C), \tag{6.10}
\]

where F_layer(N, δ) = (8δ² + 2r_ff δ²)N + 4N²δ is the FLOPs of a standard layer at dimension δ. The standard layer operates at dimension dC:

\[
F^{\mathrm{Std}}_{\mathrm{layer}} = F_{\mathrm{layer}}(N, dC). \tag{6.11}
\]

Let α = (8 + 2r_ff)d denote the per-token projection cost coefficient.
The exact FLOPs ratio, neglecting the O(C log C) transform overhead, is:

\[
\frac{F^{\text{c-ViT}}}{F^{\mathrm{Std}}} = \frac{\alpha + 4N}{\alpha C + 4N}. \tag{6.12}
\]

This ratio approaches 1/C in the projection-dominated regime (α ≫ N, i.e., large embedding dimension d) and approaches 1 in the attention-dominated regime (N ≫ α, i.e., long sequences).

Remark 2 (Attention complexity). The attention computation within each frontal slice remains O(N²d_h), i.e., quadratic in the number of tokens N. The c-product framework reduces complexity with respect to the embedding dimension (from dC to d per slice), not the sequence length. This distinction is important: the TCP-ViT does not achieve sub-quadratic attention in N.

6.3 Training Considerations

In TCP-ViT, each learnable weight tensor 𝒲 ∈ R^{d×d′×C} is updated through the c-product structure. Consider a single t-Linear layer (Definition 5.1): 𝒴 = 𝒳 ⋆_c 𝒲, where 𝒳 ∈ R^{N×d×C} and 𝒴 ∈ R^{N×d′×C}. By the chain rule applied to the c-product (Definition 3.1), the gradients of the loss L with respect to the weight and input tensors are

\[
\frac{\partial L}{\partial \mathcal{W}} = \mathcal{X}^{\top_c} \star_c \frac{\partial L}{\partial \mathcal{Y}} \in \mathbb{R}^{d\times d'\times C}, \tag{6.13}
\]
\[
\frac{\partial L}{\partial \mathcal{X}} = \frac{\partial L}{\partial \mathcal{Y}} \star_c \mathcal{W}^{\top_c} \in \mathbb{R}^{N\times d\times C}, \tag{6.14}
\]

where 𝒳^⊤c and 𝒲^⊤c denote the c-transposes (Definition 3.2). Since the c-product is built upon the orthogonal DCT matrix Φ_C, the mode-3 transform preserves Frobenius norms; consequently, the Frobenius norms of the gradient tensors (6.13)–(6.14) are identical in the spatial and transform domains. The conditioning of the optimization landscape is therefore preserved across all frontal slices, ensuring that no slice suffers from vanishing or exploding gradients relative to the others.

By the slice-wise property of the c-product (Definition 3.1), the C frontal slices of the weight tensor are updated independently in the transform domain, with cross-channel coupling occurring implicitly through the fixed orthogonal matrices Φ_C and Φ_C^⊤ at the encoder boundaries.
This algebraic structure is analogous to group convolutions, where independent filter groups process channel subsets. The key difference is that the c-product coupling is exact and parameter-free: the Φ_C/Φ_C^⊤ pair ensures that inter-channel information is preserved globally despite per-slice independence, without introducing any additional learnable parameters.

7 Numerical Experiments

The primary contribution of this work is the algebraic framework developed in Sections 3–6. The experiments in this section serve as a proof of concept, with two objectives: to verify that the theoretical 1/C parameter reduction (Eq. (6.9)) is realized in practice, and to assess whether the tensorized model retains competitive accuracy despite the structural constraints imposed by the block-diagonal c-product formulation. We evaluate on classification and segmentation tasks using controlled, small-scale configurations that isolate the effect of the c-product tensorization. Large-scale validation (e.g., ImageNet, deeper architectures) is left to future work.

An important methodological point must be emphasized. The c-product tensorization is not a specific architecture: it is a general algebraic strategy that can be applied to any Transformer-based model. Given any standard Transformer operating on flattened tokens X ∈ R^{N×dC}, the c-product reformulation replaces every linear projection by the tensor cosine product on the unflattened representation 𝒳 ∈ R^{N×d×C}, yielding a 1/C parameter reduction with no other architectural change. This strategy applies identically to DeiT, Swin, or any other Transformer variant. For this reason, comparing TCP-ViT against a different architecture (e.g., DeiT or Swin) would conflate two independent factors: the base architecture and the tensorization strategy. The scientifically rigorous comparison is between a given Transformer and its c-product counterpart, with all other design choices held constant.
This is exactly the protocol we adopt. Concretely, we define Std-ViT as a vanilla Vision Transformer with the same depth L, number of heads H, expansion ratio r_ff, and training protocol as TCP-ViT. The only difference is the token representation: Std-ViT flattens each patch to x_i ∈ R^{d_eff} with d_eff = P²C, while TCP-ViT represents each patch as a tensor X_i ∈ R^{d×C} and processes it via the c-product ⋆_c. Any observed performance difference therefore isolates the effect of the c-product tensorization.

7.1 Image Classification Datasets. We evaluate on three standard benchmarks:
• CIFAR-10 [18]: 60,000 colour images (32×32) across 10 categories (50K train / 10K test).
• SVHN [22]: over 600,000 digit images (32×32) from Google Street View (10 classes).
• STL-10 [2]: 13,000 labelled images (96×96, resized to 32×32) across 10 classes (5K train / 8K test).
To validate architectural properties under controlled conditions, both models are also trained on identically subsampled subsets: 10,000 training and 2,000 test samples per dataset.

Architecture. Both models process 32×32×3 images with patch size P = 4, yielding N = 64 patches. Each patch is a tensor X_i ∈ R^{16×3}.
• Std-ViT flattens each patch to d_eff = P²C = 48 and processes it through a single encoder.
• TCP-ViT represents each patch as X_i ∈ R^{d×C} with d = P² = 16 and C = 3, and processes it via the c-product ⋆_c.
Both use L = 4 layers, H = 4 heads, and expansion ratio r_ff = 4. This yields 119,194 parameters for Std-ViT and 43,114 for TCP-ViT, a ratio of 0.362× (theoretical: 1/C = 0.333×). Both models are trained for 150 epochs (subsampled) or 200 epochs (full) using AdamW [20] (lr = 0.01, weight decay = 0.01) with cosine annealing, batch size 256, gradient clipping at norm 1.0, and mixed-precision training. Data augmentation consists of random cropping (padding 4) and random horizontal flipping.
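The two token representations differ only by a reshape. A small illustrative sketch of the patch extraction for the setup above (helper names are ours, not the paper's):

```python
import numpy as np

img = np.random.rand(32, 32, 3)                      # one 32 x 32 x 3 image
P = 4
g = 32 // P                                          # 8 patches per side, N = 64
patches = img.reshape(g, P, g, P, 3).transpose(0, 2, 1, 3, 4)
X = patches.reshape(-1, P * P, 3)                    # TCP-ViT: X_i in R^{16 x 3}
x = X.reshape(-1, P * P * 3)                         # Std-ViT:  x_i in R^{48}
print(X.shape, x.shape)                              # (64, 16, 3) (64, 48)
```

Std-ViT projects the 48-dimensional vectors with dense matrices; TCP-ViT keeps the 16×3 structure and applies the c-product along the channel mode.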
Images are normalized using per-dataset statistics for CIFAR-10 and ImageNet statistics for SVHN and STL-10. We report top-1 classification accuracy on the test set. Throughout all experimental tables, ∆ denotes the absolute performance difference

∆ = metric(TCP-ViT) − metric(Std-ViT), (7.1)

expressed in percentage points. A positive ∆ indicates that TCP-ViT outperforms Std-ViT; a negative ∆ indicates a performance loss. Best results per row are shown in bold.

TABLE 7.1 Classification accuracy on subsampled datasets (10K train / 2K test). TCP-ViT uses 0.362× the parameters of Std-ViT (theoretical: 1/C = 0.333×).

Dataset   | Std-ViT (%) | TCP-ViT (%) | ∆      | Std-ViT Params | TCP-ViT Params
CIFAR-10  | 61.8        | 63.6        | +1.8   | 119,194        | 43,114
SVHN      | 49.8        | 63.8        | +14.0  | 119,194        | 43,114
STL-10    | 52.5        | 55.9        | +3.4   | 119,194        | 43,114
Parameter ratio: 0.362× (theoretical: 0.333×)

Table 7.1 reports the results. Under the subsampled protocol, TCP-ViT matches or exceeds Std-ViT on all three datasets: +1.8% on CIFAR-10, +14.0% on SVHN, and +3.4% on STL-10. These gains are attributed to implicit regularization: Std-ViT, with 2.76× more parameters, exhibits significant overfitting (training loss near zero, diverging validation loss), while TCP-ViT's reduced capacity constrains the model and yields a more favorable bias–variance trade-off under limited data. The large gain on SVHN (+14.0%) warrants particular attention. SVHN images contain cluttered backgrounds and variable lighting, which amplify overfitting in the over-parameterized Std-ViT.

TABLE 7.2 Classification accuracy on full CIFAR-10 (50K train / 10K test, 200 epochs). TCP-ViT retains 93.3% of performance with 36.2% of the parameters.

Dataset          | Std-ViT (%) | TCP-ViT (%) | ∆     | Param ratio
CIFAR-10 (full)  | 78.7        | 73.4        | −5.3  | 0.362×

Table 7.2 reports the full-scale result.
On the complete CIFAR-10, Std-ViT achieves 78.7% while TCP-ViT reaches 73.4% (∆ = −5.3%). This represents 93.3% performance retention with only 36.2% of the parameters, a trade-off that is consistent with standard model compression methods, which typically report 3–8% accuracy degradation for 2–3× compression. The accuracy gap confirms that when sufficient training data is available, the strict block-diagonal structure in the DCT domain may discard useful cross-frequency interactions that a dense layer can capture. The c-product decomposes the channel dimension using the DCT, replacing the P²C-dimensional joint representation with C independent P²-dimensional representations. Each frontal slice captures a distinct frequency component: slice k = 1 corresponds to the channel mean (low-frequency), while subsequent slices encode inter-channel contrasts (higher-frequency). This preserves discriminative features while eliminating the redundancy of the standard flattening R^{P²×C} → R^{P²C}.

7.2 Semantic Segmentation We evaluate on the Oxford-IIIT Pet dataset [24], which contains 7,349 images of 37 cat and dog breeds with pixel-level trimap annotations. Following standard practice, we convert to binary segmentation (foreground vs. background), discarding boundary pixels (label 255). The dataset is split 80/20 into training (5,879) and validation (1,470) sets. Both models share an identical CNN decoder (4 upsampling stages). The backbone processes 128×128 images with patch size P = 8, yielding N = 256 patches. Std-ViT uses d_eff = P²C = 192; TCP-ViT operates on C = 3 slices of dimension d = 64. Both use L = 4 layers, H = 4 heads, r_ff = 2. After IDCT reconstruction, TCP-ViT outputs features of dimension dC = 192, matching the Std-ViT decoder input.

Training protocol. We used: 150 epochs, AdamW (lr = 5×10⁻⁴, weight decay 0.01), cosine annealing, cross-entropy loss (ignore index 255), batch size 16, gradient clipping at norm 1.0, mixed-precision. Input resized to 128×128 with ImageNet normalization. Best validation mIoU reported. Metrics: mean IoU (mIoU), mean Dice (mDice), pixel accuracy (PA), and per-class IoU/Dice.

TABLE 7.3 Parameter breakdown for segmentation. The Transformer encoder achieves 0.338×, closely matching the theoretical 1/C = 0.333×.

Component                | Std-ViT    | TCP-ViT  | Ratio
Total                    | 1,570,978  | 751,586  | 0.478×
Backbone                 | 1,275,072  | 455,680  | 0.357×
Transformer Encoder(s)   | 1,188,480  | 402,048  | 0.338×
Embeddings & Tokens      | 86,592     | 53,632   | 0.619×
CNN Decoder              | 295,906    | 295,906  | 1.000×

Table 7.3 presents the parameter breakdown. The Transformer encoder achieves a compression ratio of 0.338×, closely matching the theoretical 1/C = 1/3 ≈ 0.333× from Eq. (6.9). The slight deviation arises from bias terms and LayerNorm parameters, which scale linearly. The CNN decoder is identical (1.000×), confirming it operates on reconstructed features after IDCT. Figure 7.1 provides a visual comparison.

FIG. 7.1. Parameter distribution. (a) Total parameters: TCP-ViT uses 0.478× of Std-ViT. (b) Stacked breakdown. (c) Component-level comparison.

TABLE 7.4 Segmentation performance on Oxford-IIIT Pet. TCP-ViT retains 98.9% of the mIoU with 2.09× fewer parameters. ∆ denotes the absolute difference (TCP-ViT − Std-ViT).

Metric                  | Std-ViT    | TCP-ViT  | ∆
mIoU (%)                | 87.8       | 86.8     | −1.0
mDice (%)               | 93.5       | 92.9     | −0.6
Pixel Acc. (%)          | 94.2       | 93.6     | −0.6
IoU: Background         | 84.0       | 82.7     | −1.3
IoU: Foreground (Pet)   | 91.7       | 90.9     | −0.8
Dice: Background        | 91.3       | 90.5     | −0.8
Dice: Foreground (Pet)  | 95.7       | 95.2     | −0.4
Parameters              | 1,570,978  | 751,586  | 0.478×

Table 7.4 summarizes the results. TCP-ViT achieves 86.8% mIoU versus 87.8% for Std-ViT (∆ = −1.0%) with 2.09× fewer parameters. The degradation is small and consistent across all metrics.
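The dilution of the overall ratio by the shared, untensorized decoder follows directly from the component counts in Table 7.3; a quick consistency check:

```python
backbone_std, backbone_tcp = 1_275_072, 455_680   # tensorized backbone (0.357x)
decoder = 295_906                                  # shared CNN decoder, untouched
total_std = backbone_std + decoder                 # 1,570,978
total_tcp = backbone_tcp + decoder                 # 751,586
print(round(total_tcp / total_std, 3))             # 0.478
```

The backbone enjoys the near-1/C reduction, but the fixed decoder cost pulls the total ratio up to 0.478×, which is exactly the figure reported in Table 7.3.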
The gap is smaller on the foreground class (∆IoU = −0.8%) than on the background (∆IoU = −1.3%), suggesting that the c-product processing preserves object-level semantics effectively while losing marginal precision on less structured regions. Figure 7.2 shows the training dynamics over 150 epochs. Both models converge at comparable rates, with TCP-ViT exhibiting slightly higher losses. The mIoU gap stabilizes around 1% after epoch 50. Neither model shows significant overfitting.

FIG. 7.2. Training dynamics. (a) Training loss, (b) validation loss, (c) mIoU. TCP-ViT (blue) closely tracks Std-ViT (red) with a stable ∼1% gap.

Figure 7.3 shows the normalized confusion matrices. Both models achieve high true positive rates (>90% foreground, >82% background).

FIG. 7.3. Normalized confusion matrices. Both models show strong diagonal dominance.

Figure 7.4 presents predictions on validation images. Both models produce visually similar masks. TCP-ViT occasionally shows slightly less precise boundaries in regions with complex backgrounds.

FIG. 7.4. Qualitative results. Columns: input, ground truth, Std-ViT, TCP-ViT. Green overlay: foreground. Per-image mIoU shown on each prediction.

Figure 7.5 visualizes error maps. Errors are concentrated along object boundaries for both models, with TCP-ViT showing marginally more boundary errors.

FIG. 7.5. Error maps. Green = correct, red = error, gray = ignored. Errors are concentrated at boundaries for both models.

Figure 7.6 presents the parameter–performance trade-off. TCP-ViT reduces parameters by 2.09× while sacrificing only 1.0% mIoU, yielding a compression efficiency of η = ∆mIoU / ∆params = 1.0% / 52.2% ≈ 0.019, i.e., less than 0.02% mIoU lost per 1% parameter reduction.

7.3 Discussion We synthesize the experimental findings along four axes: parameter efficiency, data-regime sensitivity, task generality, and architectural interpretation.

Parameter efficiency. The theoretical analysis of Section 6.1 predicts a uniform 1/C parameter reduction in all linear projections. The experiments confirm this prediction closely: the Transformer encoder compression ratio is 0.362× for classification and 0.338× for segmentation, both near the theoretical 1/C = 1/3 ≈ 0.333× (Table 6.3 and Table 7.3). The small deviations are fully accounted for by bias terms and LayerNorm parameters, which scale linearly in dC and thus do not benefit from the 1/C reduction. When task-specific components are shared (e.g., the CNN decoder in segmentation), the overall compression ratio is diluted to 0.478×, but the Transformer backbone ratio remains near its theoretical value. This confirms that the c-product reduction is exact and architecture-agnostic within the encoder.

Data-regime sensitivity. The results reveal a clear interaction between model capacity and data availability. Under the subsampled protocol (10K training samples), TCP-ViT outperforms Std-ViT on all three datasets (∆ = +1.8% on CIFAR-10, +14.0% on SVHN, +3.4% on STL-10; Table 7.1). This is consistent with classical bias–variance analysis: the over-parameterized Std-ViT (119K parameters for 10K samples) overfits severely, while TCP-ViT's reduced capacity acts as an implicit regularizer, constraining the hypothesis space and improving generalization. The effect is most pronounced on SVHN, where cluttered backgrounds and variable lighting amplify overfitting. By contrast, on the full CIFAR-10 (50K samples), Std-ViT achieves 78.7% versus 73.4% for TCP-ViT (∆ = −5.3%; Table 7.2). With sufficient data, the additional 2.76× parameters of Std-ViT enable it to capture cross-frequency interactions that the block-diagonal c-product structure cannot represent.
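The compression-efficiency figure η reported alongside Figure 7.6 can be reproduced from the totals in Table 7.3 and the mIoU scores in Table 7.4:

```python
mIoU_std, mIoU_tcp = 87.8, 86.8                    # Table 7.4
params_std, params_tcp = 1_570_978, 751_586        # Table 7.3 totals
d_miou = mIoU_std - mIoU_tcp                       # 1.0 point lost
d_params = 100 * (1 - params_tcp / params_std)     # 52.2% of params removed
print(round(d_miou / d_params, 3))                 # eta ≈ 0.019
```

So roughly 0.02 mIoU points are lost per percentage point of parameters removed, which is the trade-off quoted in the text.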
Nevertheless, TCP-ViT retains 93.3% of the accuracy with only 36.2% of the parameters, a trade-off comparable to standard compression methods that typically report 3–8% degradation for 2–3× compression.

Task generality: classification vs. dense prediction. The segmentation experiment demonstrates that the c-product framework extends beyond classification to dense prediction. On the Oxford-IIIT Pet benchmark, TCP-ViT achieves 86.8% mIoU versus 87.8% for Std-ViT (∆ = −1.0%; Table 7.4), retaining 98.9% of the performance. The accuracy gap is notably smaller than for full-scale classification (−1.0% vs. −5.3%), which can be attributed to two factors: (i) the segmentation task operates at higher resolution (128×128 vs. 32×32), increasing the token count and thus the relative benefit of parameter efficiency; and (ii) the binary segmentation objective may be less sensitive to fine-grained cross-frequency interactions than 10-class classification. The error analysis (Figure 7.5) reveals that degradation is concentrated at object boundaries, consistent with the loss of inter-slice coupling in the DCT domain.

FIG. 7.6. Efficiency trade-off: parameters vs. mIoU. The dashed arrow indicates the compression direction.

Architectural interpretation. The c-product imposes a block-diagonal structure in the DCT domain: each of the C frontal slices is processed by an independent linear map, with cross-channel coupling handled exclusively by the fixed orthogonal matrices Φ_C and Φ_C^⊤. This structure has both advantages and limitations. On the positive side, it provides a principled and exact (lossless) parameter reduction, preserved gradient norms due to orthogonality, and a natural decomposition into interpretable frequency components (slice k = 1 captures the channel mean, higher slices capture inter-channel contrasts).
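The claim that the first slice captures the channel mean can be checked directly; a minimal sketch, assuming the orthonormal DCT-II convention for Φ_C:

```python
import numpy as np

C = 3
n = np.arange(C)
Phi = np.sqrt(2.0 / C) * np.cos(np.pi * (2 * n + 1) * n.reshape(-1, 1) / (2 * C))
Phi[0] /= np.sqrt(2.0)                      # orthonormal DCT-II matrix Phi_C
x = np.array([0.2, 0.5, 0.8])               # one pixel's RGB channel values
coeffs = Phi @ x
# First coefficient is sqrt(C) times the channel mean (the low-frequency slice);
# the remaining coefficients encode inter-channel contrasts.
assert np.isclose(coeffs[0], np.sqrt(C) * x.mean())
```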
On the negative side, the strict block-diagonal constraint prevents the model from learning arbitrary cross-frequency interactions, which limits expressiveness when sufficient data is available. This trade-off is analogous to group convolutions, where splitting channels into independent groups reduces parameters at the cost of inter-group communication.

Limitations and future directions. Several limitations of the current work should be acknowledged.
1. Sequential implementation. The current implementation processes the C frontal slices sequentially. A parallel implementation or multi-stream CUDA kernels is required to realize wall-clock speedups proportional to the theoretical 1/C FLOPs ratio.
2. Task-head dilution. Shared task-specific components (e.g., the CNN decoder in segmentation) are not tensorized and therefore dilute the overall compression from 1/C to a higher value. Tensorizing the decoder is a natural extension.
3. Accuracy gap on large datasets. The −5.3% gap on full-scale CIFAR-10 suggests that the strict block-diagonal structure discards useful cross-frequency interactions. Future work could explore learnable inter-slice coupling (e.g., a lightweight mixing layer between slices) or partial c-product formulations that relax the block-diagonal constraint while preserving most of the parameter savings.
4. Scale of experiments. The current evaluation uses small-scale ViT configurations (L = 4, d_eff ≤ 192) on low-resolution benchmarks (32×32 to 128×128). Validation on larger-scale settings (e.g., ImageNet at 224×224 with deeper architectures) is needed to confirm the scalability of the c-product framework.

8 Conclusion We developed a rigorous algebraic framework for tensorizing Vision Transformers based on the tensor cosine product.
The central theoretical result is that replacing every linear projection in a standard Transformer by the c-product yields a provable and uniform 1/C parameter reduction that is exact (lossless), architecture-agnostic, and independent of the network depth, number of heads, or FFN expansion ratio. The reduction arises from the block-diagonal structure induced by the fixed orthogonal DCT matrix Φ_C, which handles cross-channel coupling at zero parametric cost while preserving gradient norms. Proof-of-concept experiments on classification (CIFAR-10, SVHN, STL-10) and segmentation (Oxford-IIIT Pet) benchmarks confirmed that the theoretical compression ratios are achieved in practice (0.362× and 0.338× vs. the theoretical 0.333×), the tensorized model retains competitive accuracy, and the c-product acts as an implicit regularizer in low-data regimes. These results validate the algebraic framework and motivate further investigation.

Several directions remain open. On the theoretical side, extensions to learnable inter-slice coupling and partial c-product formulations could relax the block-diagonal constraint while preserving most of the parameter savings. On the practical side, parallel GPU implementations, large-scale validation on ImageNet with deeper architectures, application to other Transformer variants (Swin, DeiT), and extensions to higher-order tensors for video and volumetric data constitute natural next steps. The framework is particularly promising for hyperspectral imaging, where C ranges from tens to hundreds of spectral bands and the 1/C reduction becomes correspondingly more significant.

REFERENCES
[1] Z. Bai, S. M. Cioaca, and L. Reichel, Multilinear systems and tensor Krylov methods, Numer. Linear Algebra Appl., vol. 27, no. 2, e2286, 2020.
[2] A. Coates, A. Ng, and H. Lee, An analysis of single-layer networks in unsupervised feature learning, in Proc. AISTATS, 2011, pp. 215–223.
[3] A. Dosovitskiy et al., An image is worth 16×16 words: Transformers for image recognition at scale, in Proc. ICLR, 2021.
[4] F. Dufrenois, A. El Ichi, and K. Jbilou, Multilinear discriminant analysis using tensor-tensor products, J. Math. Model., vol. 11, no. 1, pp. 83–101, 2023.
[5] M. El Guide, A. El Ichi, and K. Jbilou, Discrete cosine transform LSQR and GMRES methods for multidimensional ill-posed problems, J. Math. Model., vol. 10, no. 1, pp. 21–37, 2022.
[6] A. El Hachimi, K. Jbilou, A. Ratnani, and L. Reichel, A tensor bidiagonalization method for higher-order singular value decomposition with applications, Numer. Linear Algebra Appl., vol. 31, no. 2, e2530, 2024.
[7] R. Falcao, T. Santos, and M. Carvalho, Efficient Transformers via structured tensor decomposition, IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 5, pp. 2034–2047, 2022.
[8] G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed. Baltimore, MD, USA: Johns Hopkins Univ. Press, 2013.
[9] Y. Gu, W. Zhou, G. Iacovides, and D. Mandic, TensorLLM: Tensorising multi-head attention for enhanced reasoning and compression in LLMs, arXiv:2501.15674, 2025.
[10] M. Hached, K. Jbilou, C. Koukouvinos, and M. Mitrouli, A multidimensional principal component analysis via the c-product Golub–Kahan–SVD, Mathematics, vol. 9, no. 11, 1249, 2021.
[11] F. L. Hitchcock, The expression of a tensor or a polyadic as a sum of products, J. Math. Phys., vol. 6, no. 1–4, pp. 164–189, 1927.
[12] F. E. Keddous, A. Llanza, N. Shvai, and A. Nakib, Vision transformers inference acceleration based on adaptive layer normalization, Neurocomputing, vol. 610, p. 128524, 2024. doi: 10.1016/j.neucom.2024.128524.
[13] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, Transformers in vision: A survey, ACM Comput. Surv., vol. 54, no. 10, pp. 1–41, 2022.
[14] M. E. Kilmer and C. D. Martin, Factorization strategies for third-order tensors, Linear Algebra Appl., vol. 435, pp. 641–658, 2011.
[15] M. E. Kilmer, K. Braman, N. Hao, and R. C. Hoover, Third-order tensors as operators on matrices: A theoretical and computational framework with applications in imaging, SIAM J. Matrix Anal. Appl., vol. 34, no. 1, pp. 148–172, 2013.
[16] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, Compression of deep convolutional neural networks for fast and low power mobile applications, in Proc. ICLR, 2016.
[17] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Rev., vol. 51, no. 3, pp. 455–500, 2009.
[18] A. Krizhevsky, Learning multiple layers of features from tiny images, Tech. Rep., Univ. Toronto, 2009.
[19] Z. Liu et al., Swin Transformer: Hierarchical vision transformer using shifted windows, in Proc. ICCV, 2021, pp. 10012–10022.
[20] I. Loshchilov and F. Hutter, Decoupled weight decay regularization, in Proc. ICLR, 2019.
[21] L. Martin and B. Smith, Orthogonal transform-based methods for large-scale image processing, J. Comput. Appl. Math., vol. 339, pp. 620–638, 2018.
[22] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, Reading digits in natural images with unsupervised feature learning, in NIPS Workshop Deep Learn. Unsupervised Feature Learn., 2011.
[23] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, Tensorizing neural networks, in Proc. NeurIPS, 2015, pp. 442–450.
[24] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, Cats and dogs, in Proc. CVPR, 2012, pp. 3498–3505.
[25] P. Sharma, J. T. Ash, and D. Misra, The truth is in there: Improving reasoning in language models with layer-selective rank reduction, in Proc. ICLR, 2024.
[26] K. Su, L. Cao, B. Zhao, N. Li, D. Wu, X. Han, and Y. Liu, DctViT: Discrete cosine transform meets vision transformers, Neural Netw., vol. 172, p. 106139, 2024.
[27] Z. Tan, W. Wang, and C. Shan, Vision transformers are active learners for image copy detection, Neurocomputing, vol. 587, p. 127687, 2024. doi: 10.1016/j.neucom.2024.127687.
[28] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, Training data-efficient image transformers and distillation through attention, in Proc. ICML, 2021, pp. 10347–10357.
[29] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen, Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation, in Proc. ECCV, 2020, pp. 108–126.
[30] J. Guo, K. Han, H. Wu, Y. Tang, X. Chen, Y. Wang, and C. Xu, CMT: Convolutional neural networks meet vision transformers, in Proc. CVPR, 2022, pp. 12175–12185.