Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap
Reading time: 5 minutes
📝 Original Info
Title: Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap
ArXiv ID: 2512.10236
Date: 2025-12-11
Authors: Shagnik Pal¹,², Shaizeen Aga¹, Suchita Pati¹, Mahzabeen Islam¹, Lizy K. John² (¹ Advanced Micro Devices Inc., ² University of Texas at Austin; contact: shagnik@utexas.edu and others)
📝 Abstract
As both ML training and inference are increasingly distributed, parallelization techniques that shard (divide) ML model state and inputs across the GPUs of a distributed system are often deployed. With such techniques, there is a high prevalence of data-dependent communication and computation operations where communication is exposed, leaving as high as 1.7× ideal performance on the table. Prior works harness the fact that ML model state and inputs are already sharded, and employ careful overlap of individual computation/communication shards. While such coarse-grain overlap is promising, in this work we instead make a case for finer-grain compute-communication overlap, which we term FiCCO, where we argue for finer-granularity, one-level-deeper overlap than at shard level, to unlock compute/communication overlap for a wider set of network topologies, finer-grain dataflow, and more. We show that FiCCO opens up a wider design space of execution schedules than possible at shard level alone. At the same time, decomposition of ML operations into smaller operations (done in both shard-based and finer-grain techniques) causes operation-level inefficiency losses. To balance the two, we first present a detailed characterization of these inefficiency losses, then present a design space of FiCCO schedules, and finally overlay the schedules with concomitant inefficiency signatures. Doing so helps us design heuristics that frameworks and runtimes can harness to select bespoke FiCCO schedules based on the nature of the underlying ML operations. Finally, to further minimize contention inefficiencies inherent with operation overlap, we offload communication to GPU DMA engines. We evaluate several scenarios from realistic ML deployments and demonstrate that our proposed bespoke schedules deliver up to 1.6× speedup and our heuristics provide accurate guidance in 81% of unseen scenarios.
💡 Deep Analysis
📄 Full Content
Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap
Shagnik Pal¹,², Shaizeen Aga¹, Suchita Pati¹, Mahzabeen Islam¹, Lizy K. John²
¹Advanced Micro Devices Inc., ²The University of Texas at Austin
shagnik@utexas.edu, {shaizeen.aga, suchita.pati, mahzabeen.islam}@amd.com, ljohn@ece.utexas.edu
Abstract—As both ML training and inference are increasingly distributed, parallelization techniques that shard (divide) ML model state and inputs, generally into the number of GPUs of a distributed system, are often deployed. With such techniques, there is a high prevalence of on-critical-path data-dependent communication and computation operations where communication is exposed, leaving as high as 1.7× ideal performance on the table. To recover this lost performance, prior works harness the fact that ML model state and inputs are already sharded and employ careful overlap of individual computation/communication shards when possible. While such coarse-grain overlap is promising, in this work, we instead make a case for finer-grain compute-communication overlap which we term FiCCO, where we argue for finer-granularity, one-level deeper overlap than at shard-level, to unlock compute/communication overlap for a wider set of network topologies, finer-grain dataflow and more.

We show that FiCCO opens up a wider design space of execution schedules than possible at shard-level alone. At the same time, decomposition of ML operations into smaller operations (done in both shard-based and finer-grain techniques) causes operation-level inefficiency losses. To balance the two, we first present a detailed characterization of these inefficiency losses, then present a design space of FiCCO schedules, and finally overlay the schedules with concomitant inefficiency signatures. Doing so helps us design heuristics that frameworks and runtimes can harness to select bespoke FiCCO schedules based on the nature of underlying ML operations. Finally, to further minimize contention inefficiencies inherent with operation overlap, we offload communication to GPU DMA engines. We evaluate several scenarios from realistic ML deployments and demonstrate that our proposed bespoke schedules deliver up to 1.6× speedup and our heuristics provide accurate guidance in 81% of unseen scenarios.
Index Terms—Finer-grain overlap, GPUs, ML, DMAs
I. INTRODUCTION
The steep and continual increase of compute and memory needs of ML [41] has led to increased reliance on distributed computing over multiple GPUs. For instance, training Llama3 models involved close to 16K GPUs [15]. Such distributed setups deploy various ML model parallelization strategies [26], [33], [35], [60] which shard ML model state and inputs over participating GPUs, necessitating communication collectives such as all-gather amongst GPUs to communicate model state (e.g., activations) at periodic intervals.
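To make the sharding concrete, the sketch below (a minimal PyTorch-style illustration; the function name `shard_weights` and the column-wise split are our own choices, not taken from the paper) shows how a tensor-parallel strategy might give each GPU an equal slice of a layer's weights, which is what later forces collectives such as all-gather onto the execution path.

```python
# Minimal sketch, assuming PyTorch tensors; `shard_weights` and the column-wise
# split are illustrative choices, not taken from the paper.
import torch

def shard_weights(weight: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    """Give each of `world_size` GPUs an equal column slice of a layer's weights.
    Once state is sharded like this, downstream ops that need the full tensor
    require collectives (e.g., all-gather) to reassemble it."""
    assert weight.shape[1] % world_size == 0  # assume evenly divisible for simplicity
    cols = weight.shape[1] // world_size
    return weight[:, rank * cols:(rank + 1) * cols].contiguous()
```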
[Fig. 1: Speedup with finer-grain decomposition of data-dependent communication and computation (FiCCO). The figure contrasts serial execution of a collective (C) and a dependent GEMM (G) with FiCCO execution, where fine-grain GEMM fragments (G0–G3) on the GPU overlap collective fragments (C0–C2) offloaded to DMA engines; it also marks (1) the resulting inefficiencies, (2) the FiCCO search space, and (3) FiCCO heuristics.]

While communication can be hidden in the shadow of independent computation where possible, said communication can be exposed otherwise. An example of the former is the fully-sharded data parallel [60] technique, where weights are partitioned across GPUs and the communication of the next layer's weights can be overlapped with the computation of the current layer. Several examples of the latter are highly prevalent in ML and include tensor-sequence parallelism [31] and context parallelism [35], wherein communication on the critical path feeds into a data-dependent computation. In such cases, exposed communication leaves as high as 1.7× ideal performance on the table, and addressing this is the focus of this work.
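As a concrete illustration of such an exposed, on-critical-path collective, the following minimal sketch (assuming `torch.distributed` has already been initialized across the participating GPUs, e.g. via torchrun; the function name and shapes are ours for illustration) serializes an all-gather with the data-dependent GEMM that consumes its output.

```python
# Minimal sketch, assuming torch.distributed is initialized across the GPUs.
# The GEMM cannot start until the all-gather completes, so the collective is
# fully exposed on the critical path.
import torch
import torch.distributed as dist

def serial_allgather_then_gemm(local_acts: torch.Tensor,
                               weight: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    gathered = torch.empty(world_size * local_acts.shape[0], local_acts.shape[1],
                           device=local_acts.device, dtype=local_acts.dtype)
    dist.all_gather_into_tensor(gathered, local_acts)  # exposed communication
    return gathered @ weight                           # data-dependent computation
```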
To address the above challenge, since ML parallelization techniques already shard ML models and inputs (e.g., tensor parallelism shards the model weights of a single layer equally across GPUs), prior works [2], [24] overlap computation and communication at shard granularity (shard-level) to deliver speedups. However, such coarse-grain shard-based techniques manifest a severe limitation: they harness peer-to-peer communication operations (i.e., a GPU communicating with only one other GPU at a time) which, while suitable for switch-based GPU networks with flexible bandwidth allocation, leave network links idle on direct-connection based GPU networks, delivering considerably lower performance (up to 3.9× lower). Further, as they inherently operate at shard granularity, they limit the granularity of subsequent operations.
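For contrast, the following sketch captures the spirit of such shard-level, coarse-grain overlap (a ring-style schedule written against PyTorch point-to-point primitives; it is our illustration of the idea, not the implementation of the cited works): at each step a GPU exchanges one shard with a single peer while computing on the shard it already holds.

```python
# Minimal sketch of shard-level (coarse-grain) overlap, assuming a ring-style
# schedule built on PyTorch point-to-point ops; illustrative only.
import torch
import torch.distributed as dist

def ring_overlap_gemm(local_shard: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    rank, world = dist.get_rank(), dist.get_world_size()
    send_to, recv_from = (rank + 1) % world, (rank - 1) % world
    cur, outputs = local_shard, [None] * world

    for step in range(world):
        shard_idx = (rank - step) % world  # shard currently resident on this rank
        if step < world - 1:
            # Exchange the next shard with exactly one peer at a time...
            recv_buf = torch.empty_like(cur)
            reqs = dist.batch_isend_irecv([dist.P2POp(dist.isend, cur, send_to),
                                           dist.P2POp(dist.irecv, recv_buf, recv_from)])
        # ...while computing the GEMM on the shard already held locally.
        outputs[shard_idx] = cur @ weight
        if step < world - 1:
            for r in reqs:
                r.wait()
            cur = recv_buf
    return torch.cat(outputs, dim=0)  # equals (all-gathered activations) @ weight
```

Because every transfer here is a whole shard exchanged with a single neighbor, only one link per GPU carries traffic at any time, which is the bandwidth under-utilization on direct-connection topologies noted above.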
We observe in this work that finer-grain compute-communication overlap, which we term FiCCO, wherein communication is decomposed at one level deeper granularity (i.e., transfer sizes one-eighth that of shard-based overlap in an eight-GPU system), allows overcoming the above-discussed limitations.
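The sketch below conveys the finer-grain idea under our own illustrative assumptions (PyTorch-style Python; function and variable names are ours, not the paper's): each shard's transfer is further split into smaller chunks, chunk copies are issued on a dedicated copy stream (asynchronous copies of this kind are typically serviced by the GPU's DMA/copy engines rather than its compute units), and the GEMM on chunk k overlaps the transfer of chunk k+1.

```python
# Minimal sketch of finer-grain overlap, under illustrative assumptions: the
# remote shard is split into chunks, chunk copies run on a dedicated copy
# stream (async copies are typically serviced by the GPU's DMA/copy engines),
# and the GEMM on chunk k overlaps the copy of chunk k+1. Not the paper's code.
import torch

def chunked_overlap_gemm(remote_shard: torch.Tensor, weight: torch.Tensor,
                         n_chunks: int = 8) -> torch.Tensor:
    copy_stream = torch.cuda.Stream()
    chunks = remote_shard.chunk(n_chunks, dim=0)
    staged, outputs = [None] * n_chunks, [None] * n_chunks

    with torch.cuda.stream(copy_stream):            # prefetch the first chunk
        staged[0] = chunks[0].to(weight.device, non_blocking=True)

    for k in range(n_chunks):
        # Wait only for copies enqueued so far (through chunk k), not future ones.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if k + 1 < n_chunks:
            with torch.cuda.stream(copy_stream):    # next copy overlaps this GEMM
                staged[k + 1] = chunks[k + 1].to(weight.device, non_blocking=True)
        staged[k].record_stream(torch.cuda.current_stream())  # allocator safety
        outputs[k] = staged[k] @ weight             # data-dependent GEMM on chunk k
    return torch.cat(outputs, dim=0)
```

Issuing the copies on a separate stream is what lets a runtime hand them to DMA engines and keep the compute units free for the GEMM fragments, in line with the DMA offload this work argues for.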