TT-Edge: A Hardware-Software Co-Design for Energy-Efficient Tensor-Train Decomposition on Edge AI
The growing demands of distributed learning on resource-constrained edge devices underscore the importance of efficient on-device model compression. Tensor-Train Decomposition (TTD) offers high compression ratios with minimal accuracy loss, yet repeated singular value decompositions (SVDs) and matrix multiplications can impose significant latency and energy costs on low-power processors. In this work, we present TT-Edge, a hardware-software co-designed framework aimed at overcoming these challenges. By splitting SVD into two phases, bidiagonalization and diagonalization, TT-Edge offloads the most compute-intensive tasks to a specialized TTD-Engine. This engine integrates tightly with an existing GEMM accelerator, thereby curtailing the frequent matrix-vector transfers that often undermine system performance and energy efficiency. Implemented on a RISC-V-based edge AI processor, TT-Edge achieves a 1.7× speedup compared to a GEMM-only baseline when compressing a ResNet-32 model via TTD, all while reducing overall energy usage by 40.2%. Notably, these gains come with only a 4% increase in total power and minimal hardware overhead, enabled by a lightweight design that reuses GEMM resources and employs a shared floating-point unit. Our experimental results on both FPGA prototypes and post-synthesis power analysis at 45 nm demonstrate that TT-Edge effectively addresses the latency/energy bottlenecks of TTD-based compression in edge environments.
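To make the compute pattern concrete, the following is a minimal NumPy sketch of TT-SVD-style compression, the sequence of truncated SVDs on successive tensor unfoldings whose cost motivates the TTD-Engine. It is illustrative only: the function name `tt_svd`, the `max_rank` and `eps` parameters, and the use of a dense `np.linalg.svd` are assumptions for exposition, not the paper's bidiagonalization/diagonalization split or hardware mapping.

```python
import numpy as np

def tt_svd(tensor, max_rank, eps=1e-10):
    """Decompose a d-way tensor into TT cores via repeated truncated SVDs.

    Each loop iteration performs one SVD on the current unfolding; these
    SVDs (and the surrounding matrix products) are the hot spots that a
    TTD-style accelerator would target.
    """
    dims = tensor.shape
    d = len(dims)
    cores = []
    rank_prev = 1
    # First unfolding: (r0 * n1) x (n2 * ... * nd), with r0 = 1.
    unfolding = tensor.reshape(rank_prev * dims[0], -1)
    for k in range(d - 1):
        # Truncated SVD of the current unfolding.
        U, S, Vt = np.linalg.svd(unfolding, full_matrices=False)
        rank = min(max_rank, int(np.sum(S > eps)))
        # k-th TT core: shape (r_{k-1}, n_k, r_k).
        cores.append(U[:, :rank].reshape(rank_prev, dims[k], rank))
        # Carry the remaining factor forward and re-unfold for the next mode.
        unfolding = (np.diag(S[:rank]) @ Vt[:rank, :]).reshape(
            rank * dims[k + 1], -1)
        rank_prev = rank
    # Last core absorbs whatever remains of the tensor.
    cores.append(unfolding.reshape(rank_prev, dims[-1], 1))
    return cores

# Example: compress a random 8x8x8x8 tensor with TT-ranks capped at 4.
cores = tt_svd(np.random.rand(8, 8, 8, 8), max_rank=4)
print([c.shape for c in cores])
```

In a layer-compression setting, `tensor` would be a reshaped weight matrix of a convolutional or fully connected layer, and the resulting cores replace the original weights at a fraction of the parameter count.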