PD-Swap: A Prefill-Decode Disaggregated FPGA LLM Accelerator

Reading time: 6 minutes

📝 Original Info

  • Title: PD-Swap: A Prefill-Decode Disaggregated FPGA LLM Accelerator
  • ArXiv ID: 2512.11550
  • Date: 2025-12-12
  • Authors: Yifan Zhang, Zhiheng Chen, Ye Qiao, Sitao Huang

📝 Abstract

Aggressively quantized large language models (LLMs), such as BitNet-style 1.58-bit Transformers with ternary weights, make it feasible to deploy generative AI on low-power edge FPGAs. However, as prompts grow to tens of thousands of tokens, edge hardware performance drops sharply with sequence length due to quadratic prefill cost and rapidly increasing KV-cache bandwidth demands, making inference latency at longer context lengths a first-order system concern. Recent studies on LLMs expose a fundamental prefill-decode asymmetry: prefill is compute-bound and dominated by dense matrix-matrix operations, whereas decoding is memory-bandwidth-bound and dominated by KV-cache traffic. A static accelerator must provision resources and a single dataflow for both regimes, leading to duplicated attention logic, underutilized fabric, and tight LUT/URAM limits that cap model size and usable context. We propose a prefill-decode disaggregated LLM accelerator, PD-Swap, that uses Dynamic Partial Reconfiguration (DPR) to time-multiplex the attention module on edge FPGAs. The core table-lookup ternary matrix multiplication and weight-buffering engines remain static, while the attention subsystem is a reconfigurable partition with two phase-specialized architectures: a compute-heavy, token-parallel prefill engine and a bandwidth-optimized, KV-cache-centric decoding engine. A roofline-inspired model and design space exploration jointly optimize reconfigurable-region size and parallelism under reconfiguration and routability constraints, while reconfiguration latency is hidden by computation latency. PD-Swap achieves up to 27 tokens/s decoding throughput, outperforming prior state-of-the-art works by 1.3×-2.1× (larger gains at longer context lengths), without extra area cost.
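The table-lookup ternary MatMul at the heart of the static region exploits a simple arithmetic identity: with weights restricted to {-1, 0, +1}, every product is just the activation, its negation, or zero. The sketch below shows only that identity in plain Python; the packed lookup-table layout the paper's FPGA engine uses to precompute group sums is not modeled here, and all values are illustrative.

```python
def ternary_matmul(x, w):
    """Multiply an activation matrix by ternary weights {-1, 0, +1}.

    Each product x[i][k] * w[k][j] is +x, -x, or 0, so the inner loop
    needs only additions, subtractions, and skips -- no multipliers.
    Hardware table-lookup engines go further and precompute sums over
    small weight groups; this sketch shows only the arithmetic identity.
    """
    rows, inner, cols = len(x), len(w), len(w[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            xv = x[i][k]
            for j in range(cols):
                wkj = w[k][j]
                if wkj == 1:
                    out[i][j] += xv      # weight +1: add the activation
                elif wkj == -1:
                    out[i][j] -= xv      # weight -1: subtract it
                # weight 0: contributes nothing, skipped entirely
    return out

x = [[1.0, 2.0, -3.0],
     [0.5, -1.0, 4.0]]
w = [[1, -1],
     [0, 1],
     [-1, 0]]
result = ternary_matmul(x, w)
```

Because the zero weights are skipped outright, sparsity in the ternary matrix directly reduces work, which is part of why 1.58-bit models map so cheaply onto LUT fabric.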

📄 Full Content

PD-Swap: Prefill-Decode Logic Swapping for End-to-End LLM Inference on Edge FPGAs via Dynamic Partial Reconfiguration

Yifan Zhang, Zhiheng Chen, Ye Qiao, and Sitao Huang
{yifanz58,zhihenc5,yeq6,sitaoh}@uci.edu
University of California, Irvine, Irvine, California, USA

Abstract

Aggressively quantized large language models (LLMs), such as BitNet-style 1.58-bit Transformers with ternary weights, make it feasible to deploy generative AI on low-power edge FPGAs. However, as prompts grow to tens of thousands of tokens, edge hardware performance drops sharply with sequence length due to quadratic prefill cost and rapidly increasing KV-cache bandwidth demands, making inference latency at longer context lengths a first-order system concern. Recent studies on LLMs expose a fundamental prefill-decode asymmetry: prefill is compute-bound and dominated by dense matrix-matrix operations, whereas decoding is memory-bandwidth-bound and dominated by KV-cache traffic. A static accelerator must provision resources and a single dataflow for both regimes, leading to duplicated attention logic, underutilized fabric, and tight LUT/URAM limits that cap model size and usable context. We propose a prefill-decode disaggregated LLM accelerator, PD-Swap, that uses Dynamic Partial Reconfiguration (DPR) to time-multiplex the attention module on edge FPGAs. The core table-lookup ternary matrix multiplication and weight-buffering engines remain static, while the attention subsystem is a reconfigurable partition with two phase-specialized architectures: a compute-heavy, token-parallel prefill engine and a bandwidth-optimized, KV-cache-centric decoding engine. A roofline-inspired model and design space exploration jointly optimize reconfigurable-region size and parallelism under reconfiguration and routability constraints, while reconfiguration latency is hidden by computation latency. PD-Swap achieves up to 27 tokens/s decoding throughput, outperforming prior state-of-the-art works by 1.3×-2.1× (larger gains at longer context lengths), without extra area cost.

1 Introduction

Transformer-based large language models (LLMs) underpin many modern AI services, but their computation, memory, and bandwidth demands clash with the strict power and cost budgets of edge devices. Quantization is a key enabler for on-device LLM inference: BitNet-style 1.58-bit models show that ternary weights ({−1, 0, +1}) can approach full-precision accuracy while drastically reducing model size and replacing multiplications with low-cost operations. Combined with the reconfigurability and fine-grained parallelism of FPGAs, such models offer a promising path toward privacy-preserving, low-latency LLM inference at the edge.

Recent works [1, 2] have implemented end-to-end LLM accelerators on edge FPGAs that accelerate both prefill and autoregressive decoding on chip under tight power budgets and achieve competitive tokens/s compared to INT8 and FP16 designs. However, study of these end-to-end accelerators with both prefill and decode reveals a fundamental prefill-decode asymmetry. Prefill processes the entire prompt in parallel and is dominated by matrix-matrix operations, making it compute-bound and constrained by the LUT/URAM budget and timing closure. Decoding generates one token at a time, repeatedly accessing the KV cache and weights; its arithmetic intensity drops sharply and performance becomes dominated by DDR bandwidth, which is quickly saturated even with 4-bit quantization. A static edge accelerator must therefore provision hardware and a single dataflow for both regimes, duplicating attention logic, control, and buffering, and limiting model size, frequency, and usable context length.
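The intensity gap behind this asymmetry is easy to make concrete with a small roofline calculation. Every number below (model width, prompt length, packed weight size, fabric peak, DDR bandwidth) is an illustrative assumption, not a figure from the paper; the point is only that matrix-matrix prefill lands far above the ridge point while matrix-vector decode lands far below it.

```python
def attainable(ai, peak_ops, mem_bw):
    """Roofline bound: min of the compute roof and the bandwidth slope."""
    return min(peak_ops, ai * mem_bw)

# Illustrative edge-FPGA budget (assumed, not from the paper)
peak_ops = 1e12            # 1 TOP/s of ternary add capacity
mem_bw = 19.2e9            # ~19 GB/s of DDR bandwidth
ridge = peak_ops / mem_bw  # ops/byte where the two roofs meet (~52)

d, S = 2048, 4096          # hidden size and prompt length (assumed)
w_bytes = d * d * 0.25     # ~2 bits per ternary weight when packed

# Prefill: all S prompt tokens reuse each loaded weight (matrix-matrix)
prefill_ops = 2 * d * d * S
prefill_bytes = w_bytes + S * d * 2          # weights + fp16 activations
prefill_ai = prefill_ops / prefill_bytes     # ~1900 ops/byte

# Decode: each token re-streams the weights for one vector (matrix-vector)
decode_ops = 2 * d * d
decode_bytes = w_bytes + d * 2
decode_ai = decode_ops / decode_bytes        # ~8 ops/byte

assert prefill_ai > ridge > decode_ai        # compute- vs memory-bound
```

Under these assumptions decode's attainable rate is capped by `decode_ai * mem_bw`, orders of magnitude below the compute roof, which is why decoding throughput tracks DDR bandwidth rather than fabric size.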
Modern FPGAs, including AMD Zynq and Versal families, support Dynamic Function Exchange (DFX), a vendor-integrated form of dynamic partial reconfiguration that allows part of the fabric to be reprogrammed while the rest continues to operate. In the DFX flow, the design is split into a static region and one or more reconfigurable partitions (RPs) that can host multiple reconfigurable modules (RMs) loaded via partial bitstreams. For modest RP sizes, reconfiguration can complete in milliseconds. Recent works have explored DPR-based FPGA accelerators for CNNs and small-scale neural networks on edge devices [3-5]. However, these designs mainly target vision workloads with static computation patterns and do not address the highly asymmetric and dynamic compute and memory characteristics of autoregressive LLM inference.

The prefill-decode asymmetry in LLM inference is a natural fit for logic swapping on edge FPGAs. In our design, the ternary table-lookup MatMul and weight-buffering engines, which are shared by both phases, reside in the static region, while the attention subsystem is implemented as a reconfigurable partition with two RMs: a

…(Full text truncated)…
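The claim that DPR latency can be hidden reduces to a timing check: loading the partial bitstream must finish within the window of useful work the static region can overlap it with. A minimal sketch of that check, where the bitstream size, configuration-port bandwidth, and overlap window are all assumed illustrative values rather than the paper's measurements:

```python
def reconfig_hidden(bitstream_bytes, cfg_port_bw, overlap_compute_s):
    """DPR latency is 'free' when the partial bitstream loads within
    the window of useful work still running in the static region.
    Returns (hidden?, reconfiguration time in seconds)."""
    t_reconfig = bitstream_bytes / cfg_port_bw
    return t_reconfig <= overlap_compute_s, t_reconfig

# Illustrative: a 4 MB partial bitstream over a 400 MB/s configuration
# port, overlapped with ~50 ms of remaining prefill compute (assumed).
hidden, t = reconfig_hidden(4e6, 400e6, overlap_compute_s=0.05)
```

This is the constraint a design-space exploration has to respect: a larger reconfigurable partition buys more phase-specialized parallelism but grows the bitstream, and once its load time exceeds the overlappable compute, the swap starts costing end-to-end latency.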

Reference

This content is AI-processed based on ArXiv data.
