📝 Original Info
- Title: Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration
- ArXiv ID: 2512.18345
- Date: 2025-12-20
- Authors: Wonseok Choi (최원석), Hyunah Yu (유현아), Jongmin Kim (김종민), Hyesung Ji (지혜성), Jaiyoung Park (박재영), Jung Ho Ahn (안정호)
- Affiliation: Seoul National University (서울대학교)
📝 Abstract
Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, due to its high compute and memory demands, extensive acceleration research has been pursued across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. We focus on on-chip cache behavior, and show that the dominant kernels remain bound by memory bandwidth despite a high-bandwidth L2 cache, exposing a persistent memory wall. We further discover that the overall CKKS pipeline throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Our approach delivers consistent speedups across various CKKS workloads. On an RTX 5090, we reduce the bootstrapping latency for 32,768 complex numbers to 15.2ms with Theodosian, and further to 12.8ms with additional algorithmic optimizations, establishing new state-of-the-art GPU performance to the best of our knowledge.
📄 Full Content
Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration
Wonseok Choi, Hyunah Yu, Jongmin Kim, Hyesung Ji, Jaiyoung Park, and Jung Ho Ahn
Seoul National University
{wonseok.choi, yhyuna, jongmin.kim, kevin5188, jeff1273, gajh}@snu.ac.kr
Abstract—Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, due to its high compute and memory demands, extensive acceleration research has been pursued across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. We focus on on-chip cache behavior, and show that the dominant kernels remain bound by memory bandwidth despite a high-bandwidth L2 cache, exposing a persistent memory wall. We further discover that the overall CKKS pipeline throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Our approach delivers consistent speedups across various CKKS workloads. On an RTX 5090, we reduce the bootstrapping latency for 32,768 complex numbers to 15.2ms with Theodosian, and further to 12.8ms with additional algorithmic optimizations, establishing new state-of-the-art GPU performance to the best of our knowledge.
I. INTRODUCTION
Fully homomorphic encryption (FHE) has long been considered the “holy grail” of cryptography [47], enabling arbitrary computation on encrypted data without ever exposing the underlying plaintext. Among various FHE schemes, CKKS [15] has emerged as a promising solution for privacy-preserving machine learning (ML) due to its native support for fixed-point arithmetic and efficient plaintext structure for vector operations. However, despite its theoretical promise, the computational overhead of FHE remains a major barrier to its practical deployment, often resulting in slowdowns of 2–6 orders of magnitude compared to plaintext execution [17].
To bridge this performance gap, recent research has turned to GPUs as the primary acceleration platform [17], [23], [33], [35]. GPUs offer massive parallelism and high memory bandwidth, making them a compelling platform for the large-degree polynomial arithmetic inherent to FHE. State-of-the-art implementations, such as Cheddar [17], have demonstrated remarkable speedups over baseline implementations based on CPUs [6], [8], [49] or even FPGAs [4], [30], [60].
Prior GPU studies have sought acceleration opportunities from various directions. First, numerous studies have focused on optimizing compute-intensive FHE kernels, particularly the number-theoretic transform (NTT) [23], [25], [33], [41]. This line of work includes TensorFHE [25], WarpDrive [23], and Neo [33], which repurpose otherwise idle data pipelines, such as tensor cores and FP64 units, on NVIDIA GPUs to accelerate integer operations required in FHE, thereby achieving additional arithmetic throughput. On top of these compute optimizations, several studies [5], [17], [23], [33], [35] apply kernel fusion to reduce memory transfers and kernel launch overheads. Cheddar is a prominent example, reporting some of the fastest execution times along with HEAAN2 [20], a proprietary GPU library.
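As a rough illustration of the kernel-fusion idea referenced above (not code from Cheddar or any cited library), the CUDA sketch below contrasts two unfused elementwise passes over an RNS limb with a single fused pass. The 32-bit coefficients and naive `%` reduction are simplifying assumptions; production FHE libraries typically use word-size primes with Barrett or Montgomery reduction.

```cpp
// Illustrative CUDA sketch only (compiled with nvcc): fusing two elementwise
// modular passes into one avoids round-tripping the intermediate through
// DRAM and replaces two kernel launches with one.
#include <cstddef>
#include <cstdint>

__global__ void modmul(const uint32_t* a, const uint32_t* b,
                       uint32_t* out, uint32_t q, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = (uint32_t)(((uint64_t)a[i] * b[i]) % q);
}

__global__ void modadd(const uint32_t* a, const uint32_t* b,
                       uint32_t* out, uint32_t q, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        uint64_t s = (uint64_t)a[i] + b[i];
        out[i] = (uint32_t)(s >= q ? s - q : s);
    }
}

// Fused multiply-then-add: the product of a[i] and b[i] never leaves
// registers before being added to c[i].
__global__ void fused_modmuladd(const uint32_t* a, const uint32_t* b,
                                const uint32_t* c, uint32_t* out,
                                uint32_t q, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        uint64_t t = ((uint64_t)a[i] * b[i]) % q;  // stays in registers
        uint64_t s = t + c[i];
        out[i] = (uint32_t)(s >= q ? s - q : s);
    }
}
```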
Despite these efforts, the research community still lacks a systematic analysis of the current bottlenecks in GPU-based FHE execution, and hence leaves optimization opportunities untapped. We address this gap through a comprehensive microarchitectural study spanning from individual kernels to full FHE workloads.
Our thorough analysis unveils a dominant memory wall in GPU-based FHE execution, driven by limitations in the GPU memory hierarchy. While prior studies [5], [25], [35], [41] have focused on DRAM, we show that highly optimized GPU FHE libraries, such as Cheddar [17], require a deeper analysis of the full memory hierarchy. We identify that, even for the most compute-intensive operations such as NTT, prior compute-centric optimizations have culminated in highly tuned kernels that are constrained by memory bandwidth, despite the increased L2 bandwidth of modern GPUs. We also highlight the limits of memory-centric optimizations, such as kernel fusion, stemming from the small shared memory (L1 cache) capacity. Ultimately, both bandwidth and capacity limitations, rooted in current memory technology, emerge as fundamental barriers to FHE performance on GPUs.
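To make the memory-wall argument concrete, a back-of-the-envelope roofline check compares a kernel's arithmetic intensity against the ridge points implied by peak throughput and bandwidth. The figures in the sketch below are placeholder assumptions, not measured RTX 5090 numbers, and the elementwise intensity is a rough estimate rather than a profiled value.

```cpp
// Roofline-style sanity check with assumed hardware parameters: a kernel is
// memory-bound whenever its ops/byte fall below peak_ops / bandwidth.
#include <cstdio>

int main() {
    // Placeholder hardware parameters (edit to match the target GPU).
    const double peak_int_ops   = 80e12;   // integer ops/s, assumed
    const double dram_bandwidth = 1.7e12;  // bytes/s, assumed
    const double l2_bandwidth   = 8.0e12;  // bytes/s, assumed

    // Ridge points: minimum ops/byte needed to be compute-bound at each level.
    const double ridge_dram = peak_int_ops / dram_bandwidth;
    const double ridge_l2   = peak_int_ops / l2_bandwidth;

    // Rough intensity of one elementwise modular pass over 64-bit limbs:
    // a handful of ops per coefficient against tens of bytes moved.
    const double elementwise_intensity = 5.0 / 24.0;  // assumed

    std::printf("compute-bound needs > %.1f ops/B from DRAM, > %.1f ops/B from L2\n",
                ridge_dram, ridge_l2);
    std::printf("elementwise FHE pass: ~%.2f ops/B -> memory-bound at both levels\n",
                elementwise_intensity);
    return 0;
}
```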
Based on this analysis, we introduce Theodosian, a set of microarchitectural optimizations that improve effective memory throughput and overall hardware utilization. Rather than relying solely on further kernel-level tuning, we focus on managing data movement and execution to better utilize the GPU memory hierarchy. First, we propose an L2-aware multi-polynomial caching strategy that batches operations
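A minimal sketch of what an L2-aware multi-polynomial batching scheme could look like follows, purely as an illustration rather than Theodosian's actual strategy: the idea is to size each batched pass so that the combined working set of the processed residue polynomials stays within a budgeted fraction of L2. Every name, parameter, and number below is an assumption.

```cpp
// Hypothetical L2-aware batch sizing (not the paper's scheme): choose how
// many residue polynomials (limbs) to process per pass so their combined
// working set fits a fraction of the L2 cache, letting shared operands be
// reused from cache instead of DRAM.
#include <algorithm>
#include <cstddef>
#include <cstdint>

struct GpuCacheInfo {
    size_t l2_bytes;  // e.g. queried with cudaDeviceGetAttribute using
                      // cudaDevAttrL2CacheSize
};

// Each limb has `n` 64-bit coefficients, and one pass touches
// `streams_per_poly` such arrays (inputs + outputs) per limb.
size_t l2_aware_batch(const GpuCacheInfo& gpu, size_t n,
                      size_t streams_per_poly, double l2_budget = 0.75) {
    const size_t bytes_per_poly = n * sizeof(uint64_t) * streams_per_poly;
    const size_t budget = static_cast<size_t>(gpu.l2_bytes * l2_budget);
    return std::max<size_t>(1, budget / bytes_per_poly);
}

// Usage sketch: with n = 65536 coefficients and 3 arrays touched per limb,
// a hypothetical 96 MB L2 budgeted at 75% admits 48 limbs per batched pass.
```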