A Hardware-Friendly Co-Exploration Framework for Tensor-Decomposition-Based Neural Networks


📝 Abstract

High-order tensor decomposition has been widely adopted to obtain compact deep neural networks for edge deployment. However, existing studies focus primarily on its algorithmic advantages, such as accuracy and compression ratio, while overlooking hardware deployment efficiency. Such hardware-unaware designs often obscure the potential latency and energy benefits of tensorized models. Although several works attempt to reduce computational cost by optimizing the contraction sequence based on the number of multiply-accumulate operations, they typically neglect the underlying hardware characteristics, resulting in suboptimal real-world performance. We observe that the contraction path, hardware architecture, and dataflow mapping are tightly coupled and must be optimized jointly to maximize deployment efficiency on real devices. To this end, we propose a co-exploration framework that unifies these dimensions within a single design space for efficient training and inference of tensorized neural networks on edge platforms. The framework formulates a latency-oriented search objective and solves it via a global latency-driven exploration across the unified design space to achieve end-to-end model efficiency. The optimized configurations are implemented on a configurable FPGA kernel, achieving up to 4x lower inference latency and 3.85x lower training latency compared with the dense baseline.


📄 Content

Deep neural networks (DNNs) have achieved remarkable success in image classification [7,10,19], object detection [20], and video recognition [1]. However, their rapidly growing computational and memory demands pose significant challenges for deployment on resource-constrained hardware such as FPGAs. To mitigate these limitations, a variety of model compression techniques have been proposed to reduce parameter redundancy and computational overhead, including quantization [13,25,26,27], pruning [2,3,12,24], and tensor decomposition [8,9,14,16]. Among these methods, tensor decomposition offers an especially promising solution, achieving orders-of-magnitude reductions in model parameters while preserving accuracy [30,31,32]. By representing high-dimensional weight tensors as sequences of low-rank tensor cores, tensor decomposition significantly reduces storage requirements without compromising model performance. This property enables the deployment of large-scale DNNs on lightweight edge devices, facilitating low-latency and energy-efficient real-world applications.
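To make the compression effect concrete, the following sketch counts parameters for a dense layer versus a tensor-train (TT) factorization of the same weights. The mode sizes and TT ranks below are illustrative choices, not values from the paper:

```python
# Hedged sketch: parameter count of a dense weight matrix vs. a
# tensor-train (TT) factorization of the same weights. Shapes and
# ranks are illustrative, not taken from the paper.

def tt_param_count(mode_sizes, ranks):
    """Parameters in a TT representation whose k-th core has shape
    (r_{k-1}, n_k, r_k); ranks has length len(mode_sizes)+1 with
    boundary ranks fixed to 1."""
    return sum(ranks[k] * n * ranks[k + 1]
               for k, n in enumerate(mode_sizes))

# A 1024x1024 dense layer reshaped into 5 modes, where each TT core
# carries one input mode and one output mode of size 4 (4*4 = 16).
modes = [16, 16, 16, 16, 16]
ranks = [1, 8, 8, 8, 8, 1]

dense = 1024 * 1024                  # 1,048,576 parameters
tt = tt_param_count(modes, ranks)    # 3,328 parameters
print(dense, tt, round(dense / tt))  # ~315x compression
```

With these illustrative ranks the TT form stores roughly 300x fewer parameters, which is the "orders-of-magnitude" regime the text refers to.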

Although tensorized neural networks (TNNs) have demonstrated ultra-low model sizes, prior algorithm-level designs often overlook the actual hardware efficiency, thereby failing to achieve true acceleration and energy benefits in real deployments.

Recent studies have shown that the contraction sequence can greatly impact the computational cost of TNNs. Gu et al. [5] partially addressed this issue by searching for contraction paths that minimize the number of multiply-accumulate (MAC) operations, while Tian et al. [22] adopted a fixed bi-directional contraction path to enhance intra-sequence parallelism. These approaches mainly focus on reducing theoretical MAC, without considering real-time execution latency. More recently, Zhang et al. [33] introduced a mapping-aware contraction sequence search algorithm that incorporates hardware considerations into sequence optimization. However, their design supports only sequential contraction paths starting from the input node, whose search space ignores both the intra-sequence parallelism inherent in tensorized structures and the tensor-core-primary contraction order, thereby often leading to suboptimal paths.
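The sensitivity of MAC count to contraction order is easy to see even for a three-matrix chain. The sketch below compares the two parenthesizations of the same product; the shapes are illustrative, not taken from any of the cited works:

```python
# Hedged sketch: MAC cost of two parenthesizations of one
# three-matrix contraction chain. A dense matmul of an (m,k) by a
# (k,n) matrix costs m*k*n multiply-accumulates.
shapes = {"A": (2, 64), "B": (64, 8), "C": (8, 64)}

def matmul_macs(left, right):
    m, k = left
    _, n = right
    return m * k * n, (m, n)   # (MAC cost, resulting shape)

# Path 1: (A @ B) @ C
c1, ab = matmul_macs(shapes["A"], shapes["B"])
c2, _ = matmul_macs(ab, shapes["C"])
path1 = c1 + c2                # 1024 + 1024 = 2048 MACs

# Path 2: A @ (B @ C)
c3, bc = matmul_macs(shapes["B"], shapes["C"])
c4, _ = matmul_macs(shapes["A"], bc)
path2 = c3 + c4                # 32768 + 8192 = 40960 MACs

print(path1, path2)            # a 20x gap from ordering alone
```

Theoretical MAC gaps of this size are what motivates path search, but as the text notes, the lowest-MAC path is not necessarily the lowest-latency one on real hardware.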

To overcome these limitations, we propose a comprehensive design space exploration (DSE) and architecture generation framework that jointly optimizes the contraction paths, hardware architecture, and dataflow for end-to-end TNN training and inference on edge devices.

We first formulate a latency-oriented configuration search objective to identify the optimal combination of design parameters that minimizes overall execution latency. To improve search efficiency, the contraction path search space is constrained by a MAC-guided sequence exploration algorithm, which prunes redundant high-cost paths while preserving promising low-MAC candidates. Next, we simulate the latency of all feasible configuration combinations within the unified search space and employ a global latency-driven search algorithm to select the model-level optimal configuration. Through this comprehensive DSE framework, our approach surpasses prior local (layer-wise) search methods, achieving superior end-to-end hardware efficiency and overall model speedup. Finally, to validate the effectiveness of our framework, we deploy the optimized designs on an FPGA platform and evaluate their actual runtime performance.
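The two-stage flow described above can be sketched as a toy search: MAC-guided pruning first narrows the path candidates, then a global latency-driven pass picks the best (path, hardware configuration) pair. The candidate lists, pruning threshold, and latency model below are illustrative stand-ins, not the paper's actual simulator:

```python
import itertools

# Illustrative candidates: contraction path -> MAC count, and
# hardware configuration -> number of processing elements (PEs).
paths = {"p0": 2048, "p1": 2500, "p2": 40960}
hw_configs = {"small": 64, "large": 256}

def simulated_latency(macs, pes):
    # Toy latency model (assumption, not the paper's simulator):
    # compute-bound cycles plus a fixed per-config fill/drain overhead.
    overhead = {64: 10, 256: 40}[pes]
    return macs / pes + overhead

# Stage 1: MAC-guided pruning keeps paths within 2x of the best MAC,
# discarding redundant high-cost candidates early.
best_mac = min(paths.values())
survivors = {p: m for p, m in paths.items() if m <= 2 * best_mac}

# Stage 2: global latency-driven selection over the joint space of
# surviving paths and hardware configurations.
best = min(
    ((p, hw, simulated_latency(m, pes))
     for (p, m), (hw, pes) in itertools.product(
         survivors.items(), hw_configs.items())),
    key=lambda t: t[2],
)
print(best)   # chosen (path, hardware config, simulated latency)
```

The key design point mirrored here is that the winner is chosen by simulated latency over the joint space, not by MAC count alone; the pruning stage only bounds how much of that space must be simulated.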

Paper Contributions. Our contributions can be summarized as follows:

• We propose a comprehensive design space exploration framework that jointly explores contraction paths, hardware architectures, and dataflow to minimize end-to-end latency for tensorized neural networks (TNNs) on edge devices.

• We develop a global latency-driven search algorithm that efficiently evaluates and selects configuration combinations across layers and hardware settings from a whole-model perspective.

• We design a parameterized GEMM kernel on FPGA and validate the optimized design through real-world implementation, demonstrating significant improvements in latency and hardware efficiency.

A tensor is a high-dimensional data structure [11], and a tensor with $d$ dimensions (or modes) can be represented as $\mathcal{A} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$, where $n_k$ is the size of mode $k$. For clarity, tensor operations can also be visualized using graph representations. As illustrated in Fig. 1 (a)-(b), a $d$-way tensor is represented by a node with $d$ edges, where a matrix corresponds to a 2-way tensor.

Tensor contraction refers to the operation that multiplies two tensors along a shared mode, effectively eliminating that mode and producing a new tensor. Consider tensors $\mathcal{A} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ and $\mathcal{B} \in \mathbb{R}^{m_1 \times \cdots \times m_l}$.

We use $\times_s^t$ to denote the contraction between the $s$-th mode of $\mathcal{A}$ and the $t$-th mode of $\mathcal{B}$, where the dimensions match ($n_s = m_t$). The resulting tensor can be written as

$$(\mathcal{A} \times_s^t \mathcal{B})_{i_1, \dots, i_{s-1}, i_{s+1}, \dots, i_d,\, j_1, \dots, j_{t-1}, j_{t+1}, \dots, j_l} = \sum_{k=1}^{n_s} \mathcal{A}_{i_1, \dots, i_{s-1}, k, i_{s+1}, \dots, i_d}\, \mathcal{B}_{j_1, \dots, j_{t-1}, k, j_{t+1}, \dots, j_l}.$$

Fig. 1(c) and (d) illustrate examples of tensor contractions between 2-way and 3-way tensors.
