Leveraging ASIC AI Chips for Homomorphic Encryption

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Homomorphic Encryption (HE) provides strong data privacy for cloud services, but at the cost of prohibitive computational overhead. While GPUs have emerged as a practical platform for accelerating HE, there remains an order-of-magnitude energy-efficiency gap compared to specialized (but expensive) HE ASICs. This paper explores an alternate direction: leveraging existing AI accelerators, such as Google’s TPUs with their coarse-grained compute and memory architectures, to offer a path toward ASIC-level energy efficiency for HE. However, this architectural paradigm creates a fundamental mismatch with state-of-the-art HE algorithms designed for GPUs. These algorithms rely heavily on (1) high-precision (32-bit) integer arithmetic, which on a TPU must run on the low-throughput vector unit, leaving the high-throughput low-precision (8-bit) matrix engine (MXU) idle, and (2) fine-grained data permutations that are inefficient on the TPU’s coarse-grained memory subsystem. Consequently, porting GPU-optimized HE libraries to TPUs results in severe resource under-utilization and performance degradation. To tackle these challenges, we introduce CROSS, a compiler framework that systematically transforms HE workloads to align with the TPU’s architecture. CROSS makes two key contributions: (1) Basis-Aligned Transformation (BAT), a novel technique that converts high-precision modular arithmetic into dense, low-precision (INT8) matrix multiplications, unlocking the TPU’s MXU for HE, and (2) Memory-Aligned Transformation (MAT), which eliminates costly runtime data reordering by embedding the reordering into compute kernels through offline parameter transformation. On a TPU v6e, CROSS achieves higher throughput per watt on NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar, establishing the AI ASIC as the state-of-the-art efficient platform for HE operators. Code: https://github.com/EfficientPPML/CROSS


💡 Research Summary

The paper addresses the long‑standing performance and energy‑efficiency challenges of Homomorphic Encryption (HE) by exploring the use of existing AI ASICs, specifically Google’s Tensor Processing Units (TPUs), as a commodity platform for HE acceleration. While GPUs have become the de‑facto accelerator for HE, they still lag behind dedicated HE ASICs by an order of magnitude in performance‑per‑watt. The authors identify two fundamental mismatches that prevent GPU‑optimized HE libraries from efficiently exploiting TPU architectures: (1) HE algorithms rely heavily on 32‑bit modular integer arithmetic, which on a TPU would be forced onto the low‑throughput Vector Processing Unit (VPU), leaving the high‑throughput 8‑bit Matrix Multiply Unit (MXU) idle; and (2) HE kernels such as the Number Theoretic Transform (NTT) require fine‑grained data shuffling and transposition, operations that are costly on the TPU’s coarse‑grained memory subsystem.
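For intuition on why a matrix engine is a plausible target at all: the NTT can be written as a dense matrix-vector product over Z_q with the twiddle-factor matrix. The NumPy sketch below uses toy parameters (n = 8, q = 17, omega = 9; chosen for illustration only, real HE schemes use far larger rings), not the paper's implementation:

```python
import numpy as np

# Toy parameters: q is prime with q ≡ 1 (mod n), omega is a primitive
# n-th root of unity mod q (9^8 ≡ 1 mod 17). Illustrative only.
q = 17
n = 8
omega = 9

# The NTT as a dense matrix-vector product: W[i][j] = omega^(i*j) mod q.
# This dense-matmul view is what a matrix engine can execute natively,
# in contrast to the butterfly formulation with fine-grained shuffles.
W = np.array([[pow(omega, i * j, q) for j in range(n)] for i in range(n)])

def ntt(x):
    # x_hat[i] = sum_j omega^(i*j) * x[j]  (mod q)
    return (W @ np.asarray(x)) % q
```

The butterfly (FFT-style) formulation of the same transform needs the fine-grained shuffles that the TPU's memory subsystem handles poorly; the matrix form trades them for dense arithmetic.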

To bridge this gap, the authors introduce CROSS, a compiler framework that systematically rewrites HE workloads to align with TPU micro‑architectural strengths. CROSS implements two novel transformations:

  1. Basis‑Aligned Transformation (BAT) – This technique reformulates high‑precision modular arithmetic as dense, low‑precision (INT8) matrix‑vector multiplications. By converting sparse, high‑precision left‑hand matrices (e.g., twiddle‑factor matrices in the NTT) into dense, half‑size INT8 matrices, BAT eliminates redundant work on zero‑filled entries and keeps the MXU’s massive systolic array fully utilized.

  2. Memory‑Aligned Transformation (MAT) – MAT removes runtime data reordering (transpose, shuffle) by embedding these permutations into the compute kernels at compile time. It represents each required permutation as a transformation matrix, pre‑multiplies it with known parameters (e.g., twiddle factors), and generates a layout‑invariant kernel that incurs no explicit memory movement during execution.
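The arithmetic idea underlying BAT can be illustrated with a simplified limb-decomposition sketch (my own toy code, not the paper's kernel; CROSS additionally folds the modular reduction and weighting into the dense matrix itself): a high-precision multiply decomposes into an outer product of 8-bit limbs, and dense low-precision products of exactly this shape are what the MXU executes.

```python
import numpy as np

def to_limbs(x, n_limbs=4):
    """Split an integer into base-256 (8-bit) limbs, least significant first."""
    return np.array([(x >> (8 * k)) & 0xFF for k in range(n_limbs)], dtype=np.int64)

def limb_mul(a, b, n_limbs=4):
    """Multiply two 32-bit integers via INT8-sized partial products.

    The outer product of limb vectors is the piece that, batched over many
    coefficients, becomes a dense low-precision matmul on a matrix engine.
    Recombination (and, in real HE, modular reduction) happens afterward.
    """
    A, B = to_limbs(a, n_limbs), to_limbs(b, n_limbs)
    P = np.outer(A, B)  # every 8-bit x 8-bit partial product
    # Recombine: a*b = sum_{k,m} A[k] * B[m] * 256^(k+m)
    return sum(int(P[k, m]) << (8 * (k + m))
               for k in range(n_limbs) for m in range(n_limbs))
```

A modular reduction of the recombined result (here it would be a plain `% q`) completes the high-precision modular multiply.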

The authors evaluate CROSS on a real Google TPU v6e chip. Results show that the NTT throughput on TPU exceeds that of the state‑of‑the‑art GPU implementation WarpDrive on an NVIDIA A100 by 1.43×, and the throughput‑per‑watt of key HE operators surpasses OpenFHE, WarpDrive, FIDESlib, FAB, HEAP, and Cheddar by factors of 451×, 7.81×, 1.83×, 1.31×, 1.86×, and 1.15× respectively. These gains place the TPU on par with specialized HE ASICs in energy efficiency, with the remaining performance gap (3‑33×) attributed to the lack of a dedicated data‑shuffle engine, limited support for custom moduli, and overall memory/compute capacity.

Beyond the empirical results, the paper contributes a systematic characterization of the architectural mismatches between HE workloads and AI accelerators, introduces BAT and MAT as generalizable compiler‑level solutions, and demonstrates that commodity AI ASICs can serve as a cost‑effective, high‑efficiency platform for privacy‑preserving computation without any hardware modifications. The work opens a new research direction at the intersection of AI hardware and cryptographic acceleration, suggesting that future AI ASIC designs could incorporate modest extensions (e.g., lightweight shuffle units, flexible modulus support) to further close the gap with purpose‑built HE chips.
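The permutation-folding at the heart of MAT is easy to picture as linear algebra. The sketch below is illustrative only (a random stand-in for the twiddle-factor matrix and an arbitrary permutation, neither taken from the paper): expressing the required shuffle as a permutation matrix and pre-multiplying it into the compile-time constants leaves a runtime kernel that is a plain matmul with no explicit data movement.

```python
import numpy as np

n, q = 8, 17
rng = np.random.default_rng(0)

# Stand-ins for known compile-time parameters and the runtime shuffle.
W = rng.integers(0, q, size=(n, n))   # e.g., a twiddle-factor matrix
perm = rng.permutation(n)             # the fine-grained reordering to avoid
P = np.eye(n, dtype=np.int64)[perm]   # permutation as a matrix: P @ x == x[perm]

# Offline (compile time): fold the permutation into the known constants.
W_folded = (W @ P) % q

def kernel(x):
    # Runtime: one dense matmul, no shuffle or transpose of the data.
    return (W_folded @ x) % q
```

Since `(W @ P) @ x == W @ (P @ x) == W @ x[perm]`, the folded kernel computes the same result as shuffling the input and then multiplying, but the shuffle has been paid for once, offline.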

