Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration

Reading time: 5 minutes
...

📝 Original Info

  • Title: Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration
  • ArXiv ID: 2512.18345
  • Date: 2025-12-20
  • Authors: Wonseok Choi (최원석), Hyunah Yu (유현아), Jongmin Kim (김종민), Hyesung Ji (지혜성), Jaiyoung Park (박재영), Jung Ho Ahn (안정호)
  • Affiliation: Seoul National University (서울대학교)

📝 Abstract

Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, due to its high compute and memory demands, extensive acceleration research has been pursued across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. We focus on on-chip cache behavior, and show that the dominant kernels remain bound by memory bandwidth despite a high-bandwidth L2 cache, exposing a persistent memory wall. We further discover that the overall CKKS pipeline throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Our approach delivers consistent speedups across various CKKS workloads. On an RTX 5090, we reduce the bootstrapping latency for 32,768 complex numbers to 15.2ms with Theodosian, and further to 12.8ms with additional algorithmic optimizations, establishing new state-of-the-art GPU performance to the best of our knowledge.

💡 Deep Analysis

📄 Full Content

Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration
Wonseok Choi, Hyunah Yu, Jongmin Kim, Hyesung Ji, Jaiyoung Park, and Jung Ho Ahn
Seoul National University
{wonseok.choi, yhyuna, jongmin.kim, kevin5188, jeff1273, gajh}@snu.ac.kr

Abstract—Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, due to its high compute and memory demands, extensive acceleration research has been pursued across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. We focus on on-chip cache behavior, and show that the dominant kernels remain bound by memory bandwidth despite a high-bandwidth L2 cache, exposing a persistent memory wall. We further discover that the overall CKKS pipeline throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Our approach delivers consistent speedups across various CKKS workloads. On an RTX 5090, we reduce the bootstrapping latency for 32,768 complex numbers to 15.2ms with Theodosian, and further to 12.8ms with additional algorithmic optimizations, establishing new state-of-the-art GPU performance to the best of our knowledge.

I. INTRODUCTION

Fully homomorphic encryption (FHE) has long been considered the “holy grail” of cryptography [47], enabling arbitrary computation on encrypted data without ever exposing the underlying plaintext. Among various FHE schemes, CKKS [15] has emerged as a promising solution for privacy-preserving machine learning (ML) due to its native support for fixed-point arithmetic and efficient plaintext structure for vector operations. However, despite its theoretical promise, the computational overhead of FHE remains a major barrier to its practical deployment, often resulting in slowdowns of 2–6 orders of magnitude compared to plaintext execution [17].

To bridge this performance gap, recent research has turned to GPUs as the primary acceleration platform [17], [23], [33], [35]. GPUs offer massive parallelism and high memory bandwidth, making them a compelling platform for the large-degree polynomial arithmetic inherent to FHE. State-of-the-art implementations, such as Cheddar [17], have demonstrated remarkable speedups over baseline implementations based on CPUs [6], [8], [49] or even FPGAs [4], [30], [60].

Prior GPU studies have sought acceleration opportunities from various directions. First, numerous studies have focused on optimizing compute-intensive FHE kernels, particularly the number-theoretic transform (NTT) [23], [25], [33], [41]. This line of work includes TensorFHE [25], WarpDrive [23], and Neo [33], which repurpose otherwise idle data pipelines, such as tensor cores and FP64 units, on NVIDIA GPUs to accelerate integer operations required in FHE, thereby achieving additional arithmetic throughput. On top of these compute optimizations, several studies [5], [17], [23], [33], [35] apply kernel fusion to reduce memory transfers and kernel launch overheads. Cheddar is a prominent example, reporting some of the fastest execution times along with HEAAN2 [20], a proprietary GPU library.
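To make the bandwidth argument concrete, the sketch below shows a single radix-2 butterfly stage of an NTT over one RNS prime in CUDA. It is a minimal illustration, not the kernel used in Cheddar, Theodosian, or any of the cited libraries; the 30-bit modulus, twiddle layout, and indexing are placeholder assumptions. Each thread moves roughly 20 bytes (two coefficients plus a twiddle in, two coefficients out) while executing only a handful of integer instructions, which is why even heavily tuned NTT stages tend to sit against the memory-bandwidth roofline.

```cuda
#include <cstdint>

// Illustrative 30-bit NTT-friendly prime (998244353 = 119 * 2^23 + 1).
// Real RNS-CKKS libraries pick word-sized primes per level; this is a placeholder.
__constant__ uint32_t kQ = 998244353u;

__device__ __forceinline__ uint32_t mul_mod(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) % kQ);   // generic reduction, not Barrett/Montgomery-tuned
}
__device__ __forceinline__ uint32_t add_mod(uint32_t a, uint32_t b) {
    uint32_t s = a + b;
    return (s >= kQ) ? s - kQ : s;
}
__device__ __forceinline__ uint32_t sub_mod(uint32_t a, uint32_t b) {
    return (a >= b) ? a - b : a + kQ - b;
}

// One radix-2 Cooley-Tukey butterfly stage of an in-place NTT.
// `stage_twiddles` holds one twiddle per butterfly group of this stage.
// Each thread reads two coefficients and one twiddle, performs a few integer ops,
// and writes two coefficients back: low arithmetic intensity by construction.
__global__ void ntt_stage(uint32_t* poly, const uint32_t* stage_twiddles,
                          int n, int half /* butterfly distance at this stage */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n / 2) return;

    int group = i / half;              // which butterfly group this thread handles
    int j     = i % half;              // offset inside the group
    int lo    = group * 2 * half + j;  // index of the "upper" coefficient
    int hi    = lo + half;             // index of the "lower" coefficient

    uint32_t w = stage_twiddles[group];
    uint32_t u = poly[lo];
    uint32_t v = mul_mod(poly[hi], w);

    poly[lo] = add_mod(u, v);
    poly[hi] = sub_mod(u, v);
}
```

Production libraries fuse several such stages through shared memory to cut DRAM round-trips, but the per-coefficient traffic of the remaining stages keeps the kernel memory-bound, which is the behavior the paper's analysis quantifies.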
Despite these efforts, the research community still lacks a systematic analysis of the current bottlenecks in GPU-based FHE execution, which leaves optimization opportunities untapped. We address this gap through a comprehensive microarchitectural study spanning individual kernels to full FHE workloads.

Our thorough analysis unveils a dominant memory wall in GPU-based FHE execution, driven by limitations in the GPU memory hierarchy. While prior studies [5], [25], [35], [41] have focused on DRAM, we show that highly optimized GPU FHE libraries, such as Cheddar [17], require a deeper analysis of the full memory hierarchy. We identify that, even for the most compute-intensive operations such as NTT, prior compute-centric optimizations have culminated in highly tuned kernels that are constrained by memory bandwidth, despite the increased L2 bandwidth of modern GPUs. We also highlight the limits of memory-centric optimizations, such as kernel fusion, which stem from the small shared memory (L1 cache) capacity. Ultimately, both bandwidth and capacity limitations, rooted in current memory technology, emerge as fundamental barriers to FHE performance on GPUs.

Based on this analysis, we introduce Theodosian,¹ a set of microarchitectural optimizations that improve effective memory throughput and overall hardware utilization. Rather than relying solely on further kernel-level tuning, we focus on managing data movement and execution to better utilize the GPU memory hierarchy. First, we propose an L2-aware multi-polynomial caching strategy that batches operations ...
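The excerpt is cut off at this point, so the details of the multi-polynomial caching strategy are not available here. As a rough, hedged illustration of the general pattern of batching elementwise work across several residue polynomials in one launch, so that a shared operand is fetched from DRAM once and then re-served from L2, consider the sketch below; the data layout, batch size, and launch configuration are assumptions and are not taken from the paper.

```cuda
#include <cstdint>

// Same illustrative 30-bit prime as in the NTT sketch; a placeholder, not a paper parameter.
__constant__ uint32_t kQb = 998244353u;

__device__ __forceinline__ uint32_t mulmod_q(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) % kQb);
}

// Batched elementwise modular multiply: one shared operand polynomial `shared_op`
// (e.g., a plaintext or key limb reused across ciphertext limbs) is applied to
// `batch` residue polynomials laid out contiguously as polys[b * n + i].
// A single launch over the whole batch lets the shared operand be streamed from
// DRAM roughly once and re-read from L2 afterwards, instead of being re-fetched
// by many small per-polynomial kernel calls.
__global__ void batched_mul_shared_operand(uint32_t* __restrict__ polys,
                                           const uint32_t* __restrict__ shared_op,
                                           int n, int batch) {
    long long total = (long long)n * batch;
    // Grid-stride loop over every (polynomial, coefficient) pair in the batch.
    for (long long idx = blockIdx.x * (long long)blockDim.x + threadIdx.x;
         idx < total;
         idx += (long long)gridDim.x * blockDim.x) {
        int i = (int)(idx % n);   // coefficient index; this is where shared_op is reused
        polys[idx] = mulmod_q(polys[idx], shared_op[i]);
    }
}

// Hypothetical launch: 32 residue polynomials of degree 2^16 handled in one call.
// The batch size that actually keeps the working set L2-resident depends on the GPU.
void launch_example(uint32_t* d_polys, const uint32_t* d_shared_op) {
    const int n = 1 << 16, batch = 32;
    batched_mul_shared_operand<<<1024, 256>>>(d_polys, d_shared_op, n, batch);
}
```

Whether this matches Theodosian's actual caching strategy cannot be confirmed from the truncated text; the sketch only shows why keeping a batch's shared data within L2 reduces DRAM traffic per operation.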

Reference

This content is AI-processed based on open access ArXiv data.
