Beating vDSP: A 138 GFLOPS Radix-8 Stockham FFT on Apple Silicon via Two-Tier Register-Threadgroup Memory Decomposition
Authors: Mohamed Amine Bergach
Mohamed Amine Bergach, Illumina (mbergach@illumina.com)

Abstract -- We present an optimized Fast Fourier Transform (FFT) implementation for Apple Silicon GPUs, achieving 138.45 GFLOPS for N = 4096 complex single-precision transforms -- a 29% improvement over Apple's highly optimized vDSP/Accelerate baseline (107 GFLOPS). Our approach is grounded in a two-tier local memory model that formally characterizes the Apple GPU's 208 KiB register file as the primary data-resident tier and the 32 KiB threadgroup memory as an exchange-only tier, extending the decomposition framework established in a 2015 PhD thesis on Intel integrated-GPU FFT for radar processing. We implement and evaluate radix-4 and radix-8 split-radix Stockham kernels in Metal Shading Language (MSL), demonstrating that the radix-8 decimation-in-time butterfly with 512 threads yields the best performance. We further present the first investigation of Apple's simdgroup_matrix 8×8 hardware MMA for FFT butterfly computation, and report the counter-intuitive finding that on Apple GPUs, threadgroup memory barriers are inexpensive (~2 cycles) while scattered threadgroup access patterns are the true bottleneck. Our multi-size implementation supports N = 256 through N = 16384, using a four-step decomposition for sizes exceeding the 32 KiB threadgroup memory limit. All kernels are validated against vDSP reference outputs.

Index Terms -- FFT, Apple Silicon, GPU computing, Metal, radar signal processing, Stockham algorithm, simdgroup matrix

I. INTRODUCTION

The Fast Fourier Transform (FFT) remains a fundamental computational kernel in radar signal processing, particularly in Synthetic Aperture Radar (SAR) systems where range and azimuth compression require millions of FFT operations per second [1].
GPU acceleration of FFT has been extensively studied for NVIDIA architectures [2]-[4], yet Apple Silicon GPUs -- found in the M-series system-on-chip (SoC) family powering an increasing number of deployed computing platforms -- remain largely unexplored for high-performance FFT.

This work extends a 2015 PhD thesis [5], [6] that established an optimal FFT decomposition framework for Intel IvyBridge/Haswell integrated GPUs in the context of radar processing. The thesis's core principle is: given a GPU execution unit with local memory of size M bytes, identify the largest FFT of size B points whose working set fits entirely in M. This block size B becomes the fundamental building unit. On Intel's architecture (circa 2015), the binding constraint was ~2 KiB of EU local memory, limiting the local FFT to B = 2^10 = 1024 points. On Apple Silicon (M1 onwards), the architectural landscape is fundamentally different: the GPU provides a two-tier local storage hierarchy comprising a 208 KiB register file and 32 KiB threadgroup (shared) memory, unified physical memory eliminating CPU-GPU transfer overhead, and 32-wide SIMD groups with hardware shuffle instructions.

A. Contributions

1) Two-tier local memory model for FFT on Apple GPUs: a formal characterization mapping the 208 KiB register file (Tier 1, data-resident) and the 32 KiB threadgroup memory (Tier 2, exchange-only) to the decomposition framework of [5], with empirically measured bandwidth and latency parameters.
2) Radix-8 split-radix DIT Stockham kernel achieving 138.45 GFLOPS (1.78 µs/FFT) at N = 4096, batch 256 -- a 29% speedup over Apple's vDSP/Accelerate (107 GFLOPS).
3) Empirical finding that on Apple GPUs, threadgroup memory barriers are cheap (~2 cycles) while scattered threadgroup access is expensive, contradicting the common GPU optimization heuristic of minimizing barriers.
4) First investigation of simdgroup_matrix (Apple's 8×8 hardware MMA) for FFT radix-8 DFT computation, with analysis of why data-marshaling overhead negates the MMA advantage for single-FFT-per-threadgroup configurations.
5) Four-step FFT decomposition for sizes N = 8192 and N = 16384, which exceed the 32 KiB threadgroup memory capacity.
6) Multi-size validation from N = 256 through N = 16384 against vDSP reference outputs.

All source code, kernels, and benchmarks are available under the MIT license at https://github.com/aminems/AppleSiliconFFT.

II. BACKGROUND

A. FFT and the Cooley-Tukey Factorization

An N-point DFT computes $X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk}$, where $W_N = e^{-2\pi i/N}$. The Cooley-Tukey factorization for $N = N_1 \cdot N_2$ yields:

$F_N = (F_{N_1} \otimes I_{N_2}) \cdot T_N \cdot (I_{N_1} \otimes F_{N_2})$   (1)

where $T_N$ is a diagonal twiddle-factor matrix and $\otimes$ denotes the Kronecker product [7], [8].

B. Stockham Autosort FFT

The Stockham autosort formulation [9] eliminates the bit-reversal permutation of classical Cooley-Tukey by absorbing the index permutation into the natural stage addressing. Each stage reads from one buffer and writes to another with permuted indices, producing correctly ordered output without an explicit reordering step. This out-of-place-per-stage approach is well suited to GPU shared memory, where coalesced access patterns are critical [10]. All kernels in this work use the Stockham formulation, consistent with VkFFT [11] and other modern GPU FFT implementations.

C. The 2015 Thesis Framework

The 2015 thesis [5] formalized optimal FFT stage grouping for Intel IvyBridge/Haswell integrated GPUs (OpenCL). For $N = 2^n$ with maximum local block $B = 2^b$, the decomposition requires $L = \lceil n/b \rceil$ levels, with $L - 1$ global-memory transpositions. The thesis demonstrated that maximizing B (and thus minimizing L) minimizes the dominant cost: global memory bandwidth.
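The Stockham stage structure described above can be made concrete with a small sketch. This is plain Python over lists (not the paper's MSL kernels), using a radix-2 decimation-in-frequency formulation; `stockham_fft` is an illustrative name:

```python
import cmath

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (decimation in frequency).
    Each stage reads buffer a and writes buffer b with permuted
    indices, so the final output is already in natural order --
    no separate bit-reversal pass is needed."""
    n = len(x)
    a, b = list(x), [0j] * n
    half, s = n // 2, 1          # half: butterflies per block; s: stride
    while half >= 1:
        for p in range(half):
            wp = cmath.exp(-1j * cmath.pi * p / half)   # W_{2*half}^p
            for q in range(s):
                u = a[q + s * p]
                v = a[q + s * (p + half)]
                b[q + s * (2 * p)] = u + v
                b[q + s * (2 * p + 1)] = (u - v) * wp
        a, b = b, a               # ping-pong buffers between stages
        half, s = half // 2, s * 2
    return a
```

Against a direct DFT this matches to floating-point accuracy; the paper's kernels apply the same idea with radix-4/radix-8 stages operating in threadgroup memory.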
Key parameters of the 2015 work:
• EU local memory: ~2 KiB, limiting B <= 2^10
• SIMD width: 8 (AVX CPU) / 4-8 (GPU)
• Radix mix: radix-2/4/8 for GPU, pure radix-8 for CPU (AVX)
• Discrete memory model with significant CPU-GPU transfer overhead

This paper extends the framework to Apple Silicon, where the memory hierarchy, SIMD width, and unified memory model are fundamentally different.

D. Radar Signal Processing Context

SAR processing relies heavily on batched 1D FFTs: range compression applies N_r-point FFTs across azimuth lines, and azimuth compression applies N_a-point FFTs across range bins, with N_r, N_a typically 2^10-2^14 [1]. The batched nature provides natural parallelism well suited to GPU execution, with batch sizes of 256-16384 independent transforms.

TABLE I: APPLE M1 GPU COMPUTE PARAMETERS

Parameter                     | Value
------------------------------|--------------------
GPU cores                     | 8
ALUs per core                 | 128
FP32 FLOPs/cycle/core         | 256 (128 FMA)
FP16 FLOPs/cycle/core         | 512
SIMD group width              | 32 threads
Max threads/threadgroup       | 1024
Max SIMD groups/threadgroup   | 32
GPRs per thread               | up to 128 × 32-bit
Register file per threadgroup | 208 KiB
Threadgroup memory            | 32 KiB
Unified DRAM bandwidth        | 68 GB/s
GPU clock                     | 1278 MHz

III. APPLE SILICON GPU HARDWARE MODEL

We target the Apple M1 GPU (8 cores, 1278 MHz) as our primary evaluation platform. The architectural parameters derive from official Apple documentation [12], [13], community reverse engineering [14]-[16], and our own microbenchmarks.

A. GPU Compute Architecture

Each Apple GPU core contains 128 ALUs organized as 4 execution pipelines, each dispatching 32-wide SIMD instructions. Table I summarizes the key parameters.

B. Two-Tier Local Memory Model

The central architectural insight for FFT design on Apple GPUs is the two-tier local memory hierarchy.
Unlike Intel's EU model (a single local memory level of ~2 KiB), Apple GPUs provide two distinct on-chip storage tiers:

• Tier 1 -- Registers + SIMD shuffle: 208 KiB register file (private per thread, up to 128 × 32-bit GPRs). Cross-thread exchange within a 32-thread SIMD group via simd_shuffle at ~1-2 cycles.
• Tier 2 -- Threadgroup memory: 32 KiB shared across all threads in the threadgroup; the only mechanism for data exchange between SIMD groups. Access latency ~2-4 cycles.

The 208 KiB figure refers to the register file, not threadgroup memory -- a point of confusion in some sources. Threadgroup memory is implemented via tile memory in Apple's TBDR (Tile-Based Deferred Rendering) architecture [13]. Imageblock access draws from the same 32 KiB pool, providing no additional capacity [13].

FFT implication: the kernel design must maximize work in registers and SIMD shuffles (Tier 1), using threadgroup memory (Tier 2) only for inter-SIMD-group data exchange. This inverts the typical GPU FFT strategy of keeping data resident in shared memory.

C. Empirical Memory Characterization

We measure threadgroup memory bandwidth and related parameters via microbenchmarks on the M1. Table II summarizes the results.

TABLE II: MEASURED MEMORY SUBSYSTEM PERFORMANCE (M1 GPU)

Metric                             | Measured
-----------------------------------|------------------
Threadgroup memory BW (sequential) | 688 GB/s
Threadgroup memory BW (strided)    | 217 GB/s
SIMD shuffle throughput (float2)   | 262 GB/s
Register-threadgroup copy BW       | 407-420 GB/s
Optimal thread count (butterfly)   | 1024
Occupancy drop threshold           | ~128 GPRs/thread

TABLE III: HARDWARE COMPARISON, INTEL IVYBRIDGE EU VS. APPLE M1 GPU

Parameter            | Intel EU    | Apple M1 GPU
---------------------|-------------|-------------
SIMD width           | 8-16        | 32
Local/shared memory  | ~2 KiB      | 32 KiB
Register file        | ~2 KiB      | 208 KiB
Max local FFT (FP32) | 2^10        | 2^12
Memory model         | Discrete    | Unified
Transfer overhead    | Significant | Zero
DRAM bandwidth       | 25.6 GB/s   | 68 GB/s

The 3.2× bandwidth difference between sequential and strided threadgroup access is a key finding: access pattern matters far more than barrier count. This directly informed our kernel design decisions (Section V).

D. Comparison with 2015 Thesis Hardware

Table III compares the Apple M1 GPU with the Intel IvyBridge GPU used in the 2015 thesis. The Apple GPU offers 16× more shared memory, ~100× more register file, 4× wider SIMD, and zero CPU-GPU transfer overhead.

IV. OPTIMAL FFT DECOMPOSITION

A. Extending the Thesis Framework

Following [5], we seek the largest FFT size B computable within a single threadgroup dispatch (no device-memory exchange between stages). For Apple GPUs:

Threadgroup memory constraint: 32 KiB = 32,768 bytes. With complex float32 at 8 bytes/element:

$B_{max} = \lfloor 32{,}768 / 8 \rfloor = 4096 = 2^{12}$   (2)

This uses the register-tiled Stockham approach, in which a single threadgroup buffer is reused across stages; the classic double-buffered Stockham would limit B <= 2^11.

Register-resident extension: with 256 threads × 32 complex elements/thread in registers, B can reach 2^13 (FP32) or 2^14 (FP16), using threadgroup memory only as a narrow exchange pipe. This trades occupancy for local FFT size -- acceptable for batch radar processing.
B. Decomposition for N > B

For FFT sizes exceeding the single-threadgroup limit (N > 4096), we apply the four-step FFT method [7]:

$F_N = (F_{N_1} \otimes I_{N_2}) \cdot T_N \cdot P \cdot (F_{N_2} \otimes I_{N_1})$   (3)

where P is a stride permutation (transpose) and $N = N_1 \cdot N_2$ with $N_2 \le B$. Each sub-FFT of size N_2 is computed within a single threadgroup; the transpose between steps passes through device memory. On Apple Silicon, the unified memory architecture reduces the transpose cost compared to discrete-GPU systems: data passes through the System Level Cache (SLC, 8-96 MB depending on chip variant) rather than requiring explicit DMA transfers.

TABLE IV: RADIX ANALYSIS FOR APPLE GPU (128 GPRS/THREAD)

Radix | FLOPs/bfly | GPRs | Stages (N=4096) | Barriers
2     | 10         | 8    | 12              | ~22
4     | 34         | 18   | 6               | ~10
8     | 94         | 38   | 4               | ~6
16    | 214        | 78   | 3               | ~4

C. Radix Selection

We analyze radix-k butterflies for register pressure and arithmetic efficiency on Apple GPUs (Table IV). Radix-8 uses only 30% of the register budget while requiring just 4 stages for N = 4096. Radix-16 is feasible but consumes 61% of registers, leaving insufficient room for twiddle factors and compiler temporaries. Radix-32 exceeds the 128-GPR budget and causes register spilling.

The simdgroup_matrix hardware further favors radix-8: Apple's MMA operates on 8×8 matrices exclusively, making radix-8 the unique radix that maps to a single MMA tile dimension (Section V-C).

D. Synthesis Rules

We formalize the decomposition rules extending [5]:
1) Single-threadgroup (N <= 4096): compute entirely in one threadgroup dispatch, using radix-4 or radix-8 Stockham stages through threadgroup memory: ⌈log_k N⌉ stages, each with one barrier pair.
2) Four-step (4096 < N <= 2^14): decompose N = N_1 × N_2 with N_2 <= 4096. Two threadgroup dispatches with an intermediate device-memory transpose; twiddle factors are applied during the transpose.
3) Multi-level four-step (N > 2^14): recursive application with SLC-resident intermediates for N up to the SLC capacity.

V. IMPLEMENTATION

All kernels are written in Metal Shading Language (MSL) as compute shaders, compiled at runtime by Metal's shader compiler, and dispatched via the Metal Compute API from a Swift host application. Results are validated against Apple's vDSP fft_zop reference implementation.

A. Radix-4 Stockham Kernel (N = 4096)

Our baseline kernel uses 6 radix-4 Stockham passes with 1024 threads and a single 32 KiB threadgroup buffer (float2 buf[4096]). Key optimizations:

1) Single sincos per butterfly: twiddle factors w_2 = w_1^2 and w_3 = w_1^3 are computed via complex multiplication from a single sincos call, reducing transcendental-function evaluations by 3×.
2) Device-memory bypass: pass 0 reads directly from device memory (no initial load-to-threadgroup-then-barrier), and the final pass writes directly to device output, eliminating two barriers.
3) Fully unrolled passes: all 6 passes are statically unrolled with compile-time constant strides, enabling the Metal compiler to optimize register allocation and instruction scheduling.

The kernel achieves 113.6 GFLOPS at batch 256 (2.16 µs/FFT) with 10 threadgroup barriers in total.

B. Radix-8 Split-Radix DIT Kernel (N = 4096)

The radix-8 kernel uses 4 passes with 512 threads; each thread processes 8 elements per pass. The radix-8 butterfly is implemented as a split-radix DIT decomposition:

$\mathrm{DFT}_8 = \text{radix-2}\left(\mathrm{DFT}_4^{\mathrm{even}},\ \mathrm{DFT}_4^{\mathrm{odd}} \cdot W_8\right)$   (4)

This reduces the butterfly from ~320 FLOPs (naive 8×8 complex matrix-vector product) to ~52 real additions and 12 real multiplications per butterfly. The twiddle application between stages uses the same single-sincos chain as the radix-4 kernel, extended to 7 twiddle factors (w_1 through w_7) via successive complex multiplications.

Thread count: 512 threads (16 SIMD groups), not 1024.
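The single-sincos twiddle chain used by both kernels can be sketched in Python (complex multiplies standing in for MSL float2 arithmetic; `twiddle_chain` is an illustrative name, not from the released kernels):

```python
import cmath

def twiddle_chain(j, n, count):
    """Build w^1..w^count for w = W_n^j from ONE transcendental
    evaluation: each further twiddle is a complex multiply,
    mirroring the single-sincos chain in the radix-4/8 kernels."""
    w1 = cmath.exp(-2j * cmath.pi * j / n)   # the lone sincos/exp call
    chain = [w1]
    for _ in range(count - 1):
        chain.append(chain[-1] * w1)         # w^{k+1} = w^k * w^1
    return chain
```

For the radix-8 kernel, count = 7 yields w_1 through w_7; the radix-4 kernel uses count = 3.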
At 8 elements/thread, each thread uses ~38 GPRs for butterfly data, leaving ample room for twiddles and temporaries. The reduced thread count avoids the occupancy cliff at ~128 GPRs/thread while maintaining sufficient parallelism.

Performance: 138.45 GFLOPS at batch 256 (1.78 µs/FFT) -- our best result, 29% above vDSP.

C. simdgroup_matrix MMA Kernel

We investigate mapping the radix-8 DFT to Apple's 8×8 hardware MMA (simdgroup_float8x8). The DFT_8 matrix $F_8$ is constant, with entries $F_8[j,k] = W_8^{jk}$. A complex 8×8 multiply decomposes into 4 real MMA operations:

$Y_{re} = F_{8,re} \cdot X_{re} - F_{8,im} \cdot X_{im}$   (5)
$Y_{im} = F_{8,re} \cdot X_{im} + F_{8,im} \cdot X_{re}$   (6)

The MMA hardware achieves ~102 FFMA32/cycle versus ~25 for scalar SIMD -- a 4× ALU-utilization advantage [15], [17]. However, the complex DFT via real MMA requires ~3.4× more total FLOPs than the direct split-radix butterfly, yielding a net estimated speedup of only ~1.2× for FP32.

Practical challenge: the simdgroup_matrix data distribution across 32 threads (2 elements/thread for 8×8 = 64 elements) requires careful data marshaling between the Stockham layout in threadgroup memory and the MMA tile layout. This marshaling -- loading from threadgroup memory into MMA tiles and storing back -- adds overhead that negates the ALU advantage for single-FFT-per-threadgroup configurations.

TABLE V: MULTI-SIZE KERNEL CONFIGURATION

N    | Threads | Passes (radix-4) | Threadgroup mem
256  | 64      | 4                | 2 KiB
512  | 128     | 4 + 1 (radix-2)  | 4 KiB
1024 | 256     | 5                | 8 KiB
2048 | 512     | 5 + 1 (radix-2)  | 16 KiB
4096 | 1024    | 6                | 32 KiB

For batched operation (8+ simultaneous FFTs per threadgroup), the MMA approach becomes more attractive as the matrix-multiply dimensions align naturally. This is relevant for SAR processing, where hundreds of range lines provide the batch dimension, but our current single-FFT kernel does not benefit.
D. Multi-Size Kernels

We implement single-threadgroup kernels for N = 256, 512, 1024, 2048, 4096 using radix-4 Stockham with size-appropriate thread counts (Table V). For N = 512 and N = 2048 (not pure powers of 4), we append a final radix-2 pass. For N = 8192 and N = 16384, we use the four-step decomposition through device memory:

N = 8192:  N_1 = 2, N_2 = 4096   (7)
N = 16384: N_1 = 4, N_2 = 4096   (8)

Each step dispatches the N = 4096 single-threadgroup kernel on sub-problems, with an intermediate transpose kernel applying twiddle factors.

E. SIMD Shuffle Experiment

We also implemented a variant using simd_shuffle for intra-SIMD-group radix-32 sub-FFTs, eliminating threadgroup memory for the lower 5 bits of the FFT index. This achieved only 61.5 GFLOPS -- 56% below the radix-8 Stockham kernel. The cause: the SIMD shuffle approach requires scattered threadgroup-memory access patterns for the inter-SIMD-group exchange stages, which incur the 3.2× bandwidth penalty identified in Table II.

This result confirms our two-tier model: minimizing threadgroup memory traffic is more important than minimizing threadgroup memory barriers.

VI. EXPERIMENTAL RESULTS

A. Experimental Setup

All measurements are performed on an Apple M1 (8 GPU cores, 1278 MHz, 68 GB/s DRAM) running macOS. Each benchmark executes 1000 iterations after a 100-iteration warmup, reporting the median GPU time from Metal's command-buffer timestamps. GFLOPS are computed as 5N log2 N × batch / time. The baseline is Apple's vDSP/Accelerate framework (vDSP_fft_zop), which uses the AMX coprocessor and NEON SIMD internally and represents Apple's own best-effort FFT optimization [18].

TABLE VI: PERFORMANCE AT N = 4096, BATCH 256 (APPLE M1)

Kernel               | GFLOPS | µs/FFT | vs. vDSP
vDSP/Accelerate      | 107.0  | 2.29   | 1.00×
Radix-4 Stockham     | 113.6  | 2.16   | 1.06×
Radix-8 Stockham     | 138.45 | 1.78   | 1.29×
SIMD shuffle variant | 61.5   | 3.99   | 0.57×

TABLE VII: MULTI-SIZE PERFORMANCE AT OPTIMAL BATCH (APPLE M1)

N     | Decomposition   | GFLOPS | µs/FFT
256   | Single TG       | 53     | 0.29
512   | Single TG       | 66     | 0.42
1024  | Single TG       | 83     | 0.49
2048  | Single TG       | 97     | 0.85
4096  | Single TG (R-8) | 138.45 | 1.78
8192  | Four-step       | 112    | 3.80
16384 | Four-step       | 103    | 8.87

B. N = 4096 Performance Comparison

Table VI presents the main performance comparison at N = 4096. The radix-8 kernel's 29% advantage over vDSP is notable because vDSP is Apple's own heavily optimized implementation that uses the AMX matrix coprocessor internally. Our GPU kernel achieves this speedup by exploiting the GPU's 256 FP32 FLOPs/cycle × 8 cores = 2048 FLOPs/cycle peak throughput, versus AMX's ~350 GFLOPS per CPU core [19].

C. Multi-Size Results

Table VII presents performance across FFT sizes. Performance increases with N up to the single-threadgroup limit (N = 4096), reflecting improved arithmetic intensity. The four-step kernels (N = 8192, N = 16384) show a performance drop due to the device-memory transpose but remain above 100 GFLOPS.

D. Scaling with Batch Size

Fig. 1 illustrates performance scaling with batch size for the radix-8 kernel at N = 4096. The GPU kernel requires batch >= 64 to saturate the GPU cores and exceed the vDSP baseline. For small batches (<= 16), vDSP's lower dispatch overhead gives it an advantage.

E. Barrier vs. Access Pattern Analysis

A critical experimental finding is the relative cost of threadgroup memory barriers versus access-pattern quality. Table VIII compares two design points. The SIMD shuffle variant uses fewer barriers but achieves only 44% of the radix-8 kernel's throughput, because the inter-SIMD exchange requires scattered threadgroup access.
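As a consistency check of the 5N log2 N convention defined in Section VI-A, the throughput figures in Table VI can be recovered from the per-transform times; a small Python sketch (`fft_gflops` is an illustrative helper, not part of the benchmark harness):

```python
import math

def fft_gflops(n, us_per_fft):
    """Throughput under the 5*N*log2(N) FLOP-count convention,
    given the measured time for one transform in microseconds."""
    flops = 5 * n * math.log2(n)            # FLOPs per N-point complex FFT
    return flops / (us_per_fft * 1e-6) / 1e9

# Table VI entries (N = 4096): µs/FFT -> GFLOPS
print(round(fft_gflops(4096, 1.78), 1))     # radix-8 Stockham, ~138
print(round(fft_gflops(4096, 2.29), 1))     # vDSP baseline, ~107
```

The small residual differences from the tabulated values reflect rounding of the reported timestamps.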
Each barrier costs only ~2 GPU cycles on Apple hardware, while a scattered access pattern reduces effective threadgroup bandwidth by 3.2×.

Architectural insight: on Apple GPUs, barriers synchronize via a lightweight hardware mechanism (consistent with the TBDR architecture's tile-synchronization hardware), making them nearly free. The threadgroup memory bank structure, however, heavily penalizes non-sequential access patterns.

[Fig. 1: Performance scaling with batch size (N = 4096, M1). The GPU kernel saturates at batch ~128, exceeding vDSP (107 GFLOPS) at batch >= 64.]

TABLE VIII: BARRIER COUNT VS. ACCESS PATTERN IMPACT (N = 4096)

Design              | Barriers | TG Access  | GFLOPS
Radix-8 Stockham    | 6        | Sequential | 138.45
SIMD shuffle hybrid | 4        | Scattered  | 61.5

VII. DISCUSSION

A. Comparison with 2015 Thesis

Table IX compares the key results of this work with the 2015 thesis [5]. The 4× increase in local FFT size (from 2^10 to 2^12) directly follows from the 16× increase in shared memory (2 KiB to 32 KiB). The thesis's prediction that maximizing the local block size minimizes global-memory passes is confirmed: our N = 4096 single-threadgroup kernel avoids all device-memory traffic for intermediate results.

The unified memory architecture validates the thesis's cost-model simplification: the CPU-GPU transfer term that dominated the 2015 model drops to zero, making the theoretically optimal decomposition achievable in practice.

B. Architectural Insights

Barriers are cheap, scattered access is expensive. This contradicts the conventional GPU optimization wisdom (from NVIDIA/AMD) of minimizing synchronization barriers. On the Apple GPU's TBDR architecture, the tile-synchronization hardware makes barriers nearly free; kernel designers should instead focus on ensuring sequential (coalesced) threadgroup memory access patterns.

Higher radix is better, up to register limits.
Radix-8 outperforms radix-4 by 22% (138.45 vs. 113.6 GFLOPS) despite performing more arithmetic per butterfly. The fewer passes reduce total threadgroup memory traffic, which is the true bottleneck. However, radix-16 would consume 61% of registers, likely causing performance degradation from reduced occupancy.

TABLE IX: COMPARISON WITH 2015 PHD THESIS RESULTS

Metric              | 2015 (Intel GPU) | This work (M1)
Max local FFT       | 2^10             | 2^12 (4×)
Local memory used   | ~2 KiB           | 32 KiB (16×)
Register file       | ~2 KiB           | 208 KiB (104×)
Best GFLOPS         | ~20              | 138.45 (7×)
vs. vendor baseline | > MKL            | > vDSP (1.29×)
Radix strategy      | Mixed 2/4/8      | Pure radix-8
Transfer overhead   | Dominant cost    | Zero (unified)

Thread count matters. VkFFT's Metal backend artificially limits threads to 256 and reports optimal performance at 64 threads [11]. Our kernels use 512 (radix-8) or 1024 (radix-4) threads. The optimal thread count depends on the per-thread register footprint: the radix-4 kernel's light register pressure allows 1024 threads, while the radix-8 kernel's heavier footprint is best served at 512.

C. simdgroup_matrix for FFT: Current Limitations

Our investigation of simdgroup_matrix for the radix-8 DFT reveals a mismatch between the MMA hardware's strengths and the single-FFT-per-threadgroup execution model:

1) Data marshaling overhead: loading data from the Stockham layout in threadgroup memory into 8×8 MMA tiles and storing results back requires format conversion that consumes cycles.
2) FLOP inflation: a complex 8×8 multiply via real MMA requires 4 operations where the split-radix DIT butterfly needs only ~64 real FLOPs -- a 3.4× arithmetic overhead.
3) Batch dimension: MMA is designed for matrix-matrix products. For single-FFT execution, the "batch" dimension of the 8×8 input matrix is degenerate, underutilizing the hardware.

The MMA approach becomes attractive for batched execution (8+ simultaneous FFTs), where the matrix dimensions align naturally.
This is directly applicable to SAR range processing (hundreds of azimuth lines as the batch dimension) and represents a promising direction for future work.

D. Implications for Radar Processing

The 1.78 µs/FFT at N = 4096 enables real-time processing rates. For a SAR system with N_r = 4096 range bins and 256-line azimuth blocks:

$T_{range} = 256 \times 1.78\ \mu s \approx 456\ \mu s$   (9)

This leaves substantial headroom for azimuth compression, matched filtering, and other pipeline stages within typical SAR frame times of 10-100 ms. The unified memory architecture further benefits the pipeline by eliminating data staging between CPU-side control and GPU-side processing.

VIII. RELATED WORK

VkFFT [11], [20] is the only production-quality FFT library with native Metal support. It uses runtime code generation to produce Stockham autosort kernels across six GPU backends. On Apple Silicon, VkFFT limits threads to 256 and identifies threadgroup memory bandwidth as the primary bottleneck. Our work confirms this finding and shows that higher thread counts (512-1024) with architecture-aware radix selection yield significantly better performance.

tcFFT [4] demonstrated that FFT butterflies can be reformulated as matrix multiplications on NVIDIA Tensor Cores, achieving 1.29-3.24× speedups over cuFFT in FP16. Their radix-16 approach maps to NVIDIA's 16×16 Tensor Core tiles. We adapt this concept to Apple's 8×8 simdgroup_matrix, finding that the smaller tile size makes radix-8 the natural choice but that data-marshaling overhead limits single-FFT performance.

TurboFFT [3] presents architecture-aware FFT optimizations for NVIDIA A100/T4 with fault tolerance. Their register-level optimization strategy parallels our two-tier model, though the specific architectural parameters (shared memory size, warp width, tensor core dimensions) differ significantly.
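The complex-to-real decomposition behind the matrix-multiply FFT approaches above (Eqs. (5)-(6) in Section V-C) is straightforward to verify numerically. A minimal Python sketch, with plain lists standing in for the real 8×8 MMA tiles; `dft8_direct` and `matvec` are illustrative helpers, not part of the released code:

```python
import cmath

def dft8_direct(x):
    """Direct 8-point complex DFT for reference."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / 8) for n in range(8))
            for k in range(8)]

def matvec(M, v):
    """Real 8x8 matrix-vector product (stand-in for one real MMA)."""
    return [sum(M[j][k] * v[k] for k in range(8)) for j in range(8)]

def dft8_via_real_mma(x):
    """DFT_8 as four real matrix products:
       Y_re = F_re X_re - F_im X_im,  Y_im = F_re X_im + F_im X_re."""
    F = [[cmath.exp(-2j * cmath.pi * j * k / 8) for k in range(8)]
         for j in range(8)]
    Fre = [[F[j][k].real for k in range(8)] for j in range(8)]
    Fim = [[F[j][k].imag for k in range(8)] for j in range(8)]
    xre, xim = [c.real for c in x], [c.imag for c in x]
    yre = [a - b for a, b in zip(matvec(Fre, xre), matvec(Fim, xim))]
    yim = [a + b for a, b in zip(matvec(Fre, xim), matvec(Fim, xre))]
    return [complex(r, i) for r, i in zip(yre, yim)]
```

The two formulations agree to floating-point accuracy; the ~3.4× FLOP inflation discussed in Section VII-C is visible in the four full matrix products versus the split-radix butterfly's shared subexpressions.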
MetalFFT [21] was an attempt at a Metal-native FFT by Philip Turner that was abandoned after failing to outperform vDSP/Accelerate. Our work succeeds where MetalFFT did not, primarily through the radix-8 split-radix butterfly and an architecture-informed thread/memory configuration.

FFTW [22] and cuFFT [2] represent the state of the art for CPU and NVIDIA GPU FFT respectively, but neither targets the Apple Silicon GPU.

ThunderMittens [17] ported the ThunderKittens Tensor Core framework from NVIDIA's 16×16 tiles to the Apple GPU's 8×8 simdgroup_matrix, providing the first systematic characterization of MMA performance on Apple Silicon. We build on their performance measurements in our MMA kernel analysis.

IX. CONCLUSION AND FUTURE WORK

We have presented a high-performance FFT implementation for the Apple Silicon GPU that achieves 138.45 GFLOPS at N = 4096 -- a 29% improvement over Apple's vDSP/Accelerate. The key enabler is a two-tier local memory model that maps the 208 KiB register file and the 32 KiB threadgroup memory to distinct roles in the FFT decomposition, extending the framework of a 2015 PhD thesis from Intel integrated GPUs to Apple Silicon.

Our empirical findings -- that barriers are cheap and scattered access is expensive on Apple GPUs, that radix-8 outperforms radix-4 despite more arithmetic, and that simdgroup_matrix MMA does not yet benefit single-FFT kernels -- provide actionable guidance for GPU kernel optimization on Apple Silicon beyond FFT.

A. Future Work

SAR pipeline integration: fusing FFT with windowing, matched filtering, and corner-turn operations within a single Metal render/compute pass to exploit data locality in the 32 KiB threadgroup memory. No published kernel-fused SAR pipeline exists on any GPU platform.

Batched simdgroup_matrix FFT: processing 8+ simultaneous FFTs per threadgroup using MMA radix-8 butterflies. SAR range processing naturally provides this batch dimension.
The estimated FP32 speedup of 1.2× (2.4× for FP16) over scalar SIMD makes this attractive for throughput-oriented workloads.

Mixed-precision FFT: exploiting the Apple GPU's native FP16 support (2× throughput, free conversion) for radar applications where sufficient SNR permits reduced precision. The 32 KiB threadgroup memory would support local FFTs up to 2^13 at FP16.

AMX coprocessor integration: using the CPU-side AMX matrix unit for batched small-FFT (N <= 16) processing as matrix multiplies, with the GPU handling larger FFTs -- enabled by unified memory with zero-copy sharing.

Larger Apple Silicon chips: the M4 Max (40 GPU cores, 546 GB/s bandwidth) should scale performance roughly in proportion to core count for batched workloads, potentially exceeding 500 GFLOPS for batched N = 4096 FFTs.

REPRODUCIBILITY

All source code, Metal compute shaders, benchmark harnesses, and validation scripts are publicly available at https://github.com/aminems/AppleSiliconFFT under the MIT license. The repository includes build instructions for reproducing all results on any Apple Silicon Mac (M1 or later).

ACKNOWLEDGMENT

The author acknowledges the open-source reverse-engineering efforts of Dougall Johnson (Apple GPU ISA), Alyssa Rosenzweig (Asahi Linux GPU driver), and Philip Turner (Metal benchmarks) that made the architectural characterization in this work possible.

REFERENCES

[1] "The challenge of onboard SAR processing: A GPU opportunity," in Computational Science -- ICCS 2020, ser. Lecture Notes in Computer Science, vol. 12139, 2020, pp. 46-59.
[2] NVIDIA Corporation, cuFFT Library User's Guide, 2024. [Online]. Available: https://docs.nvidia.com/cuda/cufft/index.html
[3] S. Wu, Y. Zhai, J. Liu, J. Huang, Z. Jian, H. Dai, S. Di, F. Cappello, and Z. Chen, "TurboFFT: Co-designed high-performance and fault-tolerant fast Fourier transform on GPUs," in Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '25). ACM, 2025.
[4] S. Li and Y. Cheng, "tcFFT: A fast half-precision FFT library for NVIDIA tensor cores," in IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2021.
[5] M. A. Bergach, "Adaptation du calcul de la transformée de Fourier rapide sur une architecture mixte CPU/GPU intégrée," Ph.D. dissertation, Université Nice Sophia Antipolis, 2015. INRIA; in French. [Online]. Available: https://theses.hal.science/tel-01245958
[6] M. A. Bergach, E. Kofman, R. de Simone, S. Tissot, and M. Syska, "Efficient FFT mapping on GPU for radar processing application: modeling and implementation," arXiv preprint, 2015. [Online]. Available: https://arxiv.org/abs/1505.08067
[7] C. F. Van Loan, Computational Frameworks for the Fast Fourier Transform, ser. Frontiers in Applied Mathematics. SIAM, 1992, no. 10.
[8] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297-301, 1965.
[9] T. G. Stockham, Jr., "High-speed convolution and correlation," MIT Lincoln Laboratory, Tech. Rep. AFCRL-66-55, 1966.
[10] N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli, "High performance discrete Fourier transforms on graphics processors," in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
[11] D. Tolmachev, "VkFFT -- a performant, cross-platform and open-source GPU FFT library," IEEE Access, vol. 11, pp. 12039-12058, 2023.
[12] Apple Inc., Metal Shading Language Specification, Version 4, 2024. [Online]. Available: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf
[13] Apple Inc., Metal Feature Set Tables, 2024. [Online]. Available: https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
[14] D. Johnson, "Apple G13 GPU architecture reference," online documentation. [Online]. Available: https://dougallj.github.io/applegpu/docs.html
[15] P. Turner, "metal-benchmarks: Apple GPU microarchitecture," GitHub repository. [Online]. Available: https://github.com/philipturner/metal-benchmarks
[16] A. Rosenzweig, "Dissecting the Apple GPU, part 3," blog post, 2022. [Online]. Available: https://alyssarosenzweig.ca/blog/asahi-gpu-part-3.html
[17] Stanford Hazy Research, "ThunderMittens for your ThunderKittens," blog post, 2024. [Online]. Available: https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx
[18] Apple Inc., "vDSP.FFT -- Accelerate framework," Apple Developer Documentation. [Online]. Available: https://developer.apple.com/documentation/accelerate/vdsp/fft
[19] J. Zhou, "Performance analysis of the Apple AMX matrix accelerator," M.Eng. thesis, Massachusetts Institute of Technology, 2025.
[20] D. Tolmachev, "VkFFT: Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal fast Fourier transform library," GitHub repository, 2020. MIT License. [Online]. Available: https://github.com/DTolm/VkFFT
[21] P. Turner, "MetalFFT: 1d, 2d, and 3d variations of fast Fourier transforms," GitHub repository, 2022. Abandoned; could not outperform vDSP/Accelerate. [Online]. Available: https://github.com/philipturner/metal-fft
[22] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, pp. 216-231, 2005.