Placement Semantics for Distributed Deep Learning: A Systematic Framework for Analyzing Parallelism Strategies
📝 Original Paper Info
- Title: Placement Semantics for Distributed Deep Learning: A Systematic Framework for Analyzing Parallelism Strategies
- ArXiv ID: 2601.02311
- Date: 2026-01-05
- Authors: Deep Pankajbhai Mehta
📝 Abstract
Training large language models requires distributing computation across many accelerators, yet practitioners select parallelism strategies (data, tensor, pipeline, ZeRO) through trial and error because no unified systematic framework predicts their behavior. We introduce placement semantics: each strategy is specified by how it places four training states (parameters, optimizer, gradients, activations) across devices using five modes (replicated, sharded, sharded-with-gather, materialized, offloaded). From placement alone, without implementation details, we derive memory consumption and communication volume. Our predictions match published results exactly: ZeRO-3 uses 8x less memory than data parallelism at 1.5x communication cost, as reported in the original paper. We prove two conditions (gradient integrity, state consistency) are necessary and sufficient for distributed training to match single-device results, and provide composition rules for combining strategies safely. The framework unifies ZeRO Stages 1-3, Fully Sharded Data Parallel (FSDP), tensor parallelism, and pipeline parallelism as instances with different placement choices.
💡 Summary & Analysis
1. **Systematic Placement Semantics**: This paper introduces a systematic framework to analyze the memory and communication costs of parallelism strategies in distributed deep learning. It defines how each state (parameters, optimizer state, gradients, activations) is placed across devices, allowing precise cost derivation.
2. **Cost Derivation Rules**: The paper presents rules that derive memory and communication costs from placement specifications. For example, ZeRO Stage 3 reduces per-device memory to 140 GB but increases communication volume by 1.5 times.
3. **Composition Calculus**: A method for combining parallelism strategies into new systems is provided, enabling prediction and optimization of complex distributed training setups.
📄 Full Paper Content (ArXiv Source)
We address a gap between practice and theory in distributed deep learning. Training a 70-billion parameter model requires approximately 1120 GB of memory for model state alone, far exceeding the 80 GB capacity of current GPUs. Practitioners must distribute this state across devices using parallelism strategies: data parallelism (DP), ZeRO/Fully Sharded Data Parallel (FSDP), tensor parallelism (TP), pipeline parallelism (PP), and expert parallelism.
Each strategy is described through its implementation: communication operations, data structure layouts, and runtime optimizations. This implementation-centric view makes it difficult to answer fundamental questions:
- What precisely distinguishes ZeRO Stage 2 from Stage 3?
- Given a new configuration, how much memory will each device use?
- When can we safely combine tensor parallelism with pipeline parallelism?
- What properties must hold for distributed training to match single-device results?
Our framework answers these questions precisely. For example, we show that ZeRO Stage 2 and Stage 3 differ in exactly one placement choice (parameters: replicated vs. sharded-with-gather). From this difference alone, we derive that Stage 3 reduces memory from 1120 GB to 140 GB per device, an $`8\times`$ reduction, while increasing communication by $`1.5\times`$. These predictions match the original ZeRO paper exactly.
Our Contribution
We introduce placement semantics, a systematic framework that answers these questions. The framework rests on three ideas:
Training state is the primitive. We identify four states that every training configuration manages: parameters $`\Theta`$, optimizer state $`\Omega`$, gradients $`G`$, and activations $`A`$. These are the fundamental objects of distributed training.
Placement is the specification. For each state, we define its placement: which devices hold which portions. We formalize five placement modes with precise semantics: replicated ($`R`$), sharded ($`S`$), sharded-with-gather ($`S^*`$), materialized ($`M`$), and offloaded ($`O`$). The five modes arise because sharding has two variants: pure sharding where each device uses only its local shard, and sharded-with-gather where shards are temporarily reassembled for computation. This distinction is critical: it separates ZeRO Stage 2 (pure sharding of gradients) from Stage 3 (sharded-with-gather for parameters). We restrict to these five modes as they cover all strategies in current practice; intermediate modes (e.g., $`k`$-way replication for $`1 < k < N`$) are straightforward extensions.
Costs derive from placement. Given a placement specification, we derive memory and communication through formal rules. This is our key technical result: implementation details are unnecessary for resource prediction.
What Is New
While prior work describes specific systems, we contribute systematic foundations that enable reasoning across systems:
- Systematic placement semantics with precise definitions of modes (Section 3)
- Derivation rules computing memory and communication from specifications (Section 4)
- Correctness conditions with proofs of necessity and sufficiency (Section 5)
- Composition calculus for combining strategies (Section 6)
Prior work describes systems. We provide a systematic framework in which those systems are instances. The relationship is analogous to computational complexity theory versus specific algorithms: complexity theory provides tools to analyze any algorithm, while algorithm papers describe specific solutions.
Validation. We validate our framework against published results from the ZeRO paper. Our derivation rules predict the same memory reduction ($`8\times`$) and communication overhead ($`1.5\times`$) reported by the original authors, confirming that placement specifications capture real system behavior (Section 7).
Background
We establish notation and review what consumes memory during training. We use standard terminology: FP16 and FP32 denote 16-bit and 32-bit floating-point formats respectively; SGD denotes stochastic gradient descent; NVMe denotes Non-Volatile Memory Express storage.
| State | Count | Precision | Memory |
|---|---|---|---|
| Parameters $`\Theta`$ | $`P`$ | FP16 | 140 GB |
| Master weights | $`P`$ | FP32 | 280 GB |
| Optimizer $`\Omega`$ (Adam $`m`$, $`v`$) | $`2P`$ | FP32 | 560 GB |
| Gradients $`G`$ | $`P`$ | FP16 | 140 GB |
| Model state total | | | 1120 GB |
Memory requirements for training a 70B parameter model with Adam optimizer using mixed-precision training. Following the ZeRO paper’s accounting, we include FP32 master weights. For derivation purposes, we group master weights with optimizer state, giving $`|\Omega| = 12P`$ bytes total.
Training State
A training step transforms parameters $`\Theta_t`$ to $`\Theta_{t+1}`$ using a batch of data. This requires maintaining four state tensors.
Parameters $`\Theta \in \mathbb{R}^P`$ are the model weights. For a transformer with $`L`$ layers and hidden dimension $`H`$, the parameter count is approximately $`P \approx 12LH^2`$.[^1] Each attention layer contributes $`4H^2`$ parameters (query, key, value, output projections) and each feed-forward layer contributes $`8H^2`$ parameters (two matrices with $`4\times`$ expansion).
Optimizer state $`\Omega`$ contains auxiliary values maintained by the optimizer. Adam stores first moment $`m \in \mathbb{R}^P`$ and second moment $`v \in \mathbb{R}^P`$, giving $`|\Omega| = 2P`$. These are stored in FP32 for numerical stability .
Gradients $`G \in \mathbb{R}^P`$ are the derivatives $`\nabla_\Theta \mathcal{L}`$ computed during backpropagation.
Activations $`A`$ are intermediate values from the forward pass needed for gradient computation. Their size depends on batch size $`B`$, sequence length $`S`$, and architecture details.
Memory Accounting
Table 1 shows concrete memory requirements following the ZeRO paper’s mixed-precision accounting. The key observation is that optimizer state dominates: Adam requires $`2P`$ values in FP32, which is $`8P`$ bytes versus $`2P`$ bytes for FP16 parameters. Including FP32 master weights (required for mixed-precision training stability), the total is 16 bytes per parameter.
**Remark 1** (Memory Accounting Convention). Throughout this paper, we use the ZeRO paper’s convention of 16 bytes per parameter: 2 bytes (FP16 parameters) + 2 bytes (FP16 gradients) + 4 bytes (FP32 master weights) + 8 bytes (FP32 optimizer state). When we write $`|\Theta|`$, $`|\Omega|`$, $`|G|`$, we refer to memory footprint in bytes, not parameter count. Specifically: $`|\Theta| = 2P`$ bytes, $`|G| = 2P`$ bytes, and $`|\Omega| = 12P`$ bytes (master weights + Adam states).
Communication Primitives
Distributed training uses collective communication operations. We use standard cost models. For $`N`$ devices and tensor size $`|T|`$:
All-Reduce aggregates (sums) $`T`$ across devices and distributes the result to all. Using the ring algorithm, each device sends and receives $`2 \cdot \frac{N-1}{N} \cdot |T|`$ bytes.
Reduce-Scatter aggregates $`T`$ and distributes disjoint shards. Device $`i`$ receives shard $`i`$ of the sum. Cost: $`\frac{N-1}{N} \cdot |T|`$ bytes per device.
All-Gather collects shards and distributes the complete tensor to all. Cost: $`\frac{N-1}{N} \cdot |T|`$ bytes per device.
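The cost model above is easy to mechanize. Below is a minimal sketch (in Python; the function name and structure are ours, not from the paper) that returns per-device bytes for the three collectives under the ring-algorithm model.

```python
def collective_bytes(op: str, tensor_bytes: float, n_devices: int) -> float:
    """Per-device bytes sent/received for a collective over a tensor of
    `tensor_bytes` bytes across `n_devices` devices (ring-algorithm cost model)."""
    scale = (n_devices - 1) / n_devices
    if op == "all_reduce":                      # reduce-scatter phase + all-gather phase
        return 2 * scale * tensor_bytes
    if op in ("reduce_scatter", "all_gather"):  # a single phase
        return scale * tensor_bytes
    raise ValueError(f"unknown collective: {op}")

# FP16 gradients of a 70B-parameter model (2 bytes/param) on 8 devices:
grad_bytes = 2 * 70e9
print(collective_bytes("all_reduce", grad_bytes, 8) / 1e9)      # 245.0 GB
print(collective_bytes("reduce_scatter", grad_bytes, 8) / 1e9)  # 122.5 GB
```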
Placement Semantics
We now present the systematic framework. We begin with intuition, then give precise definitions.
Intuition
Consider training on $`N = 8`$ devices. For parameters $`\Theta`$, we have choices:
- Every device stores a full copy (data parallelism)
- Each device stores $`1/8`$ of the parameters (ZeRO Stage 3)
- No device stores parameters persistently; gather them when needed (FSDP with aggressive sharding)
Each choice has different memory and communication implications. We formalize these choices as placement modes.
Placement Modes
**Definition 1** (Placement Mode). Let $`X`$ be a state tensor of size $`|X|`$ distributed across $`N`$ devices indexed $`\{0, \ldots, N-1\}`$. A placement mode $`\pi`$ specifies, for each device $`i`$, what portion of $`X`$ device $`i`$ stores persistently. We define five modes:
**Replicated ($`R`$):** Every device stores the complete tensor.

``` math
\pi_R(X, i) = X \quad \text{for all } i \in \{0, \ldots, N-1\}
```

**Sharded ($`S`$):** The tensor is partitioned into $`N`$ contiguous shards; device $`i`$ stores shard $`i`$.

``` math
\pi_S(X, i) = X\left[\frac{i \cdot |X|}{N} : \frac{(i+1) \cdot |X|}{N}\right]
```

**Sharded with Gather ($`S^*`$):** Like $`S`$, but before each use the full tensor is reconstructed via All-Gather, used, then the non-local portions are discarded. This captures ZeRO-3/FSDP parameter handling.

``` math
\pi_{S^*}(X, i) = X\left[\frac{i \cdot |X|}{N} : \frac{(i+1) \cdot |X|}{N}\right] \text{ (persistent)}, \quad X \text{ (transient during use)}
```

**Materialized ($`M`$):** No device stores $`X`$ persistently. When $`X`$ is needed, it is reconstructed from other state, used, then discarded. This applies to intermediate values like activations that can be recomputed.

``` math
\pi_M(X, i) = \emptyset \quad \text{(persistent storage)}
```

**Offloaded ($`O`$):** The tensor is stored in CPU memory or NVMe, transferred to GPU when needed.

``` math
\pi_O(X, i) = \emptyset \quad \text{(GPU memory)}
```
**Example 1** (Data Parallelism). In data parallelism (DP), all model state is replicated: $`\pi_\Theta = R`$, $`\pi_\Omega = R`$, $`\pi_G = R`$. Each device holds full parameters, full optimizer state, and computes full gradients. After local gradient computation, an All-Reduce synchronizes gradients across devices.

**Example 2** (ZeRO Stage 3). ZeRO Stage 3 shards everything: $`\pi_\Theta = S^*`$, $`\pi_\Omega = S`$, $`\pi_G = S`$. Parameters are sharded across devices; before each layer’s computation, an All-Gather reconstructs the full parameters, which are then discarded after use. Optimizer state and gradients remain sharded throughout.
| Strategy | $`\pi_\Theta`$ | $`\pi_\Omega`$ | $`\pi_G`$ | $`\pi_A`$ |
|:---|:--:|:--:|:--:|:--:|
| Data Parallel (DP) | $`R`$ | $`R`$ | $`R`$ | $`R`$ |
| ZeRO Stage 1 | $`R`$ | $`S`$ | $`R`$ | $`R`$ |
| ZeRO Stage 2 | $`R`$ | $`S`$ | $`S`$ | $`R`$ |
| ZeRO Stage 3 / FSDP | $`S^*`$ | $`S`$ | $`S`$ | $`R`$ |
| ZeRO-Offload | $`O`$ | $`O`$ | $`S`$ | $`R`$ |
| Tensor Parallel (TP, intra-layer) | $`S`$ | $`S`$ | $`S`$ | $`S`$ |
| Pipeline Parallel (PP, inter-layer) | $`S`$ | $`S`$ | $`S`$ | $`R`$ |
Placement specifications for common parallelism strategies. For tensor and pipeline parallelism, placement applies per layer or stage. $`S^*`$ denotes sharded-with-gather before computation. DP = Data Parallelism, TP = Tensor Parallelism, PP = Pipeline Parallelism.
Figure 1: The five placement modes. Top: per-device GPU memory cost ($`s`$ = tensor size, $`N`$ = device count). Bottom: example uses in common strategies. $`S^*`$ (sharded-with-gather) is the key innovation in ZeRO-3/FSDP: parameters are sharded for storage but gathered transiently for computation.
## Placement Specification
**Definition 2** (Placement Specification). A placement specification is a tuple $`\Pi = (\pi_\Theta, \pi_\Omega, \pi_G, \pi_A)`$ where each $`\pi_X \in \{R, S, S^*, M, O\}`$ specifies the placement mode for state $`X`$.
A parallelism strategy is fully determined by its placement specification. Table 2 shows specifications for known strategies, and Figure 1 illustrates the five modes with their memory costs.
**Remark 2**. The materialized mode ($`M`$) does not appear in Table 2 because common strategies store all states persistently. However, $`M`$ enables modeling activation checkpointing, where activations are recomputed rather than stored.
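Definition 2 lends itself to a direct encoding as data. The following sketch (Python; the enum, class, and dictionary names are our own illustration, not an API from the paper or any framework) captures a few rows of Table 2 as placement specifications.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    R = "replicated"
    S = "sharded"
    S_STAR = "sharded_with_gather"
    M = "materialized"
    O = "offloaded"

@dataclass(frozen=True)
class Placement:
    params: Mode       # pi_Theta
    optimizer: Mode    # pi_Omega
    grads: Mode        # pi_G
    activations: Mode  # pi_A

# A few placement specifications from Table 2
STRATEGIES = {
    "DP":     Placement(Mode.R,      Mode.R, Mode.R, Mode.R),
    "ZeRO-1": Placement(Mode.R,      Mode.S, Mode.R, Mode.R),
    "ZeRO-2": Placement(Mode.R,      Mode.S, Mode.S, Mode.R),
    "ZeRO-3": Placement(Mode.S_STAR, Mode.S, Mode.S, Mode.R),
}
```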
# Derivation Rules
We now present our main technical contribution: rules that derive memory
and communication from placement specifications.
## Preliminaries
**Definition 3** (Reconstruction Unit). Let $`s_{\text{unit}}`$ denote the size of the smallest unit that can be independently reconstructed during sharded-with-gather operations. For transformer models, this typically corresponds to one layer: $`s_{\text{unit}} = 12H^2 \cdot \text{bytes\_per\_param}`$ for a standard transformer layer with hidden dimension $`H`$. The choice of $`s_{\text{unit}}`$ is an implementation decision that trades memory for communication granularity.
## Memory Derivation
**Theorem 1** (Memory from Placement). Let $`\Pi = (\pi_\Theta, \pi_\Omega, \pi_G, \pi_A)`$ be a placement specification for $`N`$ devices. The per-device GPU memory is:

``` math
M(\Pi) = \mu(\pi_\Theta, |\Theta|) + \mu(\pi_\Omega, |\Omega|) + \mu(\pi_G, |G|) + \mu(\pi_A, |A|)
```

where $`\mu : \{R, S, S^*, M, O\} \times \mathbb{R}^+ \to \mathbb{R}^+`$ is defined as:

``` math
\begin{aligned}
\mu(R, s) &= s \\
\mu(S, s) &= s/N \\
\mu(S^*, s) &= s/N + s_{\text{unit}} \\
\mu(M, s) &= s_{\text{unit}} \\
\mu(O, s) &= 0
\end{aligned}
```

where $`s_{\text{unit}}`$ is defined in Definition 3.
Proof. We prove each case from Definition 1.

Case $`R`$: By the definition of $`\pi_R`$, $`\pi_R(X, i) = X`$ for all $`i`$. Each device stores the full tensor, so per-device memory is $`|X| = s`$.

Case $`S`$: By the definition of $`\pi_S`$, device $`i`$ stores $`X[i|X|/N : (i+1)|X|/N]`$, which has size $`|X|/N = s/N`$.

Case $`S^*`$: Device $`i`$ persistently stores shard $`i`$ (size $`s/N`$). During computation, the full tensor is gathered transiently. If gathering happens one unit at a time (as is standard practice), peak memory is $`s/N + s_{\text{unit}}`$. Note: pipelined implementations that overlap gather with computation may require $`2 \cdot s_{\text{unit}}`$ transient memory.

Case $`M`$: Persistent storage is $`\emptyset`$. However, during computation, the tensor must be reconstructed. If reconstruction happens one unit at a time (e.g., one layer), peak transient memory is $`s_{\text{unit}}`$.

Case $`O`$: GPU memory is $`\emptyset`$. The tensor resides in CPU/NVMe, contributing 0 to GPU memory. ◻
**Example 3** (Memory Calculation). For a 70B model ($`P = 70 \times 10^9`$) with $`N = 8`$ devices, using 16 bytes per parameter (see Remark 1):
**Data Parallel** ($`R, R, R, R`$):

``` math
\begin{aligned}
M_{\text{DP}} &= \mu(R, |\Theta|) + \mu(R, |\Omega|) + \mu(R, |G|) + \mu(R, |A|/N) \\
&= 2P + 12P + 2P + |A|/8 = 16P + |A|/8
\end{aligned}
```

In bytes: $`16 \times 70 \times 10^9 = 1120`$ GB per device (excluding activations).

**ZeRO Stage 3** ($`S^*, S, S, R`$):

``` math
\begin{aligned}
M_{\text{Z3}} &= \mu(S^*, 2P) + \mu(S, 12P) + \mu(S, 2P) + \mu(R, |A|/N) \\
&= 2P/N + 12P/N + 2P/N + |A|/8 = 16P/N + |A|/8
\end{aligned}
```

In bytes for $`N = 8`$: $`16P/8 = 2P = 140`$ GB per device, an $`8\times`$ reduction.
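The calculation in Example 3 is mechanical, so it can be scripted. The sketch below implements the $`\mu`$ function of Theorem 1 under the 16-bytes-per-parameter accounting of Remark 1 (the function and variable names are ours; activations and the transient $`s_{\text{unit}}`$ term are ignored for brevity).

```python
def mu(mode: str, size_bytes: float, n: int, s_unit: float = 0.0) -> float:
    """Per-device memory in bytes for one state tensor (Theorem 1)."""
    return {
        "R": size_bytes,                 # replicated
        "S": size_bytes / n,             # sharded
        "S*": size_bytes / n + s_unit,   # shard plus transiently gathered unit
        "M": s_unit,                     # materialized on demand
        "O": 0.0,                        # offloaded to CPU/NVMe
    }[mode]

def model_state_memory(spec: dict, P: float, n: int) -> float:
    """Sum of mu over parameters (2P bytes), optimizer state (12P), gradients (2P)."""
    sizes = {"params": 2 * P, "optim": 12 * P, "grads": 2 * P}
    return sum(mu(spec[k], sizes[k], n) for k in sizes)

P, N = 70e9, 8
specs = {
    "DP":     {"params": "R",  "optim": "R", "grads": "R"},
    "ZeRO-1": {"params": "R",  "optim": "S", "grads": "R"},
    "ZeRO-2": {"params": "R",  "optim": "S", "grads": "S"},
    "ZeRO-3": {"params": "S*", "optim": "S", "grads": "S"},
}
for name, spec in specs.items():
    print(name, model_state_memory(spec, P, N) / 1e9, "GB")
# DP 1120.0, ZeRO-1 385.0, ZeRO-2 262.5, ZeRO-3 140.0 (GB per device, model state only)
```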
## Communication Derivation
**Theorem 2** (Communication from Placement). Let $`\Pi`$ be a placement specification. The communication volume per training step is determined by the state transitions required for the forward-backward-update cycle:

1. If $`\pi_G = R`$ and gradients are computed locally, synchronization requires All-Reduce:

``` math
C^R_{\text{sync}} = 2 \cdot \frac{N-1}{N} \cdot |G|
```

2. If $`\pi_G = S`$, synchronization uses Reduce-Scatter:

``` math
C^S_{\text{sync}} = \frac{N-1}{N} \cdot |G|
```

3. If $`\pi_\Theta = S^*`$, parameters must be gathered before use. For forward and backward passes:

``` math
C_{\text{gather}} = 2 \cdot \frac{N-1}{N} \cdot |\Theta|
```
Proof. **Part 1:** With $`\pi_G = R`$, each device computes local gradients $`G_i`$ on its data shard. For correctness (Theorem 3), the final gradient must be $`G = \frac{1}{N}\sum_i G_i`$. All-Reduce computes this sum and distributes it to all devices. The ring All-Reduce algorithm requires each device to send $`(N-1)/N \cdot |G|`$ bytes in the scatter phase and $`(N-1)/N \cdot |G|`$ bytes in the gather phase, totaling $`2(N-1)/N \cdot |G|`$.

**Part 2:** With $`\pi_G = S`$, we need the sum but each device only needs its shard. Reduce-Scatter computes the sum and distributes shard $`i`$ to device $`i`$. This requires only the scatter phase of ring All-Reduce: $`(N-1)/N \cdot |G|`$ bytes.

**Part 3:** With $`\pi_\Theta = S^*`$, parameters are sharded for storage but must be complete for computation. Before each layer’s forward pass, All-Gather reconstructs parameters from shards. The same reconstruction is needed in the backward pass for gradient computation. Each All-Gather costs $`(N-1)/N \cdot |\Theta|`$, and two are needed, giving $`2(N-1)/N \cdot |\Theta|`$. ◻
**Example 4** (Communication Calculation). For $`P = 70 \times 10^9`$ parameters and $`N = 8`$ devices (gradients in FP16, so $`|G| = 2P`$ bytes):

**Data Parallel:** Only gradient synchronization.

``` math
C_{\text{DP}} = 2 \cdot \frac{7}{8} \cdot 2P = 3.5P \approx 245 \text{ GB per device}
```

**ZeRO Stage 3:** Gradient sync (Reduce-Scatter) + parameter gather ($`2\times`$ All-Gather).

``` math
C_{\text{Z3}} = \frac{7}{8} \cdot 2P + 2 \cdot \frac{7}{8} \cdot 2P = 5.25P \approx 368 \text{ GB per device}
```

ZeRO Stage 3 communicates **1.5$`\times`$ more** than data parallelism but uses **8$`\times`$ less memory**.
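Example 4 can likewise be reproduced in a few lines. This is an illustrative sketch of the per-device communication volumes of Theorem 2 (names are ours; it assumes FP16 gradients and parameters, so $`|G| = |\Theta| = 2P`$ bytes).

```python
def comm_volume(param_mode: str, grad_mode: str, P: float, n: int) -> float:
    """Per-device communication bytes per training step (Theorem 2)."""
    g = theta = 2 * P                 # FP16 gradients and parameters
    scale = (n - 1) / n
    vol = 2 * scale * g if grad_mode == "R" else scale * g  # All-Reduce vs Reduce-Scatter
    if param_mode == "S*":
        vol += 2 * scale * theta      # All-Gather in forward and backward passes
    return vol

P, N = 70e9, 8
c_dp = comm_volume("R", "R", P, N)    # data parallelism
c_z3 = comm_volume("S*", "S", P, N)   # ZeRO Stage 3
print(c_dp / 1e9, c_z3 / 1e9, c_z3 / c_dp)  # 245.0 GB, 367.5 GB, ratio 1.5
```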
## The Fundamental Trade-off
**Corollary 1** (Memory-Communication Trade-off). For strategies using modes $`\{R, S, S^*\}`$, the relationship between memory reduction and communication overhead depends on which state is sharded:

1. Sharding optimizer state ($`R \to S`$) reduces memory with no communication increase (updates are local).
2. Sharding gradients ($`R \to S`$) reduces memory and reduces communication (Reduce-Scatter vs. All-Reduce).
3. Sharding parameters ($`R \to S^*`$) reduces memory but increases communication by $`2 \cdot \frac{N-1}{N} \cdot |\Theta|`$ (two All-Gathers per step).
Proof. **Part 1:** Optimizer state is only accessed during the update step. With sharding, each device updates its local shard using its local gradient shard. No cross-device communication is needed for the optimizer itself.

**Part 2:** Gradient synchronization changes from All-Reduce (cost $`2 \cdot \frac{N-1}{N} \cdot |G|`$) to Reduce-Scatter (cost $`\frac{N-1}{N} \cdot |G|`$), a $`2\times`$ reduction.

**Part 3:** With $`\pi_\Theta = S^*`$, parameters must be gathered before forward and backward passes. Each All-Gather costs $`\frac{N-1}{N} \cdot |\Theta|`$, and two are needed per step. ◻
This explains the design of ZeRO stages: Stage 1 shards optimizer state
(free memory reduction), Stage 2 additionally shards gradients (further
memory reduction with communication benefit), and Stage 3 shards
parameters (maximum memory reduction but additional communication cost).
# Correctness Conditions
We formalize when distributed training produces correct results.
**Definition 4** (Semantic Equivalence). A distributed training configuration is semantically equivalent to single-device training if it produces the same sequence of parameter updates $`\Theta_0, \Theta_1, \Theta_2, \ldots`$, up to floating-point differences arising from reduction order.
**Theorem 3** (Gradient Integrity). For semantic equivalence, the gradient used for the parameter update at step $`t`$ must equal

``` math
G_t = \frac{1}{B} \sum_{j=1}^{B} \nabla_\Theta \mathcal{L}(x_j, \Theta_t),
```

where $`B`$ is the global batch size and $`\{x_j\}_{j=1}^{B}`$ are the training samples.
Proof. Single-device SGD computes exactly this gradient and updates $`\Theta_{t+1} = \Theta_t - \eta \cdot \text{optimizer}(G_t, \Omega_t)`$. If the distributed configuration produces a different gradient $`G'_t \neq G_t`$, then $`\Theta'_{t+1} \neq \Theta_{t+1}`$, violating semantic equivalence.

Conversely, if $`G_t`$ satisfies the condition above, the update matches single-device training. ◻
Gradient integrity violations:

- Missing samples: device fails to contribute its gradients
- Duplicate samples: same sample processed by multiple devices
- Incorrect normalization: dividing by local batch size instead of global (see the sketch below)
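The normalization pitfall is the easiest to trip over in practice. The following is a minimal, hedged sketch of one common way to obtain the global-batch mean with `torch.distributed`; it assumes the process group, model, and local backward pass already exist and are not shown.

```python
import torch
import torch.distributed as dist

def synchronize_gradients(model: torch.nn.Module) -> None:
    """Each device holds the MEAN gradient over its local micro-batch of size b = B/N.
    Summing across devices and dividing by the world size yields the global-batch
    mean required by gradient integrity (Theorem 3)."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum of N local means
            p.grad.div_(world_size)                        # mean over the global batch
```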
Theorem 4 (State Consistency). For semantic equivalence, whenever a state tensor is accessed or communicated, all participating devices must hold values that are bitwise identical (up to floating-point associativity) and use identical data types.
Proof. Suppose devices hold inconsistent values. Consider parameters: if device 0 has $`\Theta^{(0)}`$ and device 1 has $`\Theta^{(1)} \neq \Theta^{(0)}`$, they compute different gradients for the same input, violating gradient integrity.
For data types: if device 0 reduces in FP32 and device 1 reduces in FP16, rounding differs, producing inconsistent results. ◻
State consistency violations:

- Stale parameters: device uses outdated copy after an update
- Type mismatch: different devices use different precisions
- Reduction order dependence: non-deterministic reduction without proper handling
**Theorem 5** (Necessity and Sufficiency). Under the following assumptions, gradient integrity and state consistency are jointly necessary and sufficient for semantic equivalence:

1. Deterministic operations: all arithmetic operations produce identical results given identical inputs.
2. Consistent initialization: all devices begin with identical $`\Theta_0`$ and $`\Omega_0`$.
3. Synchronous execution: all devices complete each step before any begins the next.
Proof. Necessity: Shown in Theorems 3 and 4. Violating either condition causes the parameter trajectory to diverge from single-device training.
Sufficiency: We prove by induction on training steps.
Base case: At $`t = 0`$, all devices have identical $`\Theta_0`$ and $`\Omega_0`$ by assumption (2). State consistency holds.
Inductive step: Assume at step $`t`$, all devices have consistent $`\Theta_t`$ and $`\Omega_t`$. We show step $`t + 1`$ produces consistent $`\Theta_{t+1}`$.
1. Each device $`i`$ computes local gradients $`G_t^{(i)}`$ on its data shard of size $`b = B/N`$. The local gradient is the mean over local samples: $`G_t^{(i)} = \frac{1}{b}\sum_{j \in \text{shard}_i} \nabla_\Theta \mathcal{L}(x_j, \Theta_t)`$. By assumption (1), identical parameters and inputs yield identical gradients.
2. Gradient synchronization via All-Reduce computes $`\sum_i G_t^{(i)}`$. Since each local gradient is already the mean over $`b`$ local samples, and there are $`N`$ devices each contributing such a mean, dividing the All-Reduce sum by $`N`$ yields the correct global mean gradient $`G_t = \frac{1}{B}\sum_{j=1}^{B}\nabla_\Theta \mathcal{L}(x_j, \Theta_t)`$. By assumption (3), all devices complete this before proceeding.
3. By gradient integrity, the synchronized result equals the global-batch gradient of Theorem 3.
4. By state consistency, all devices see the same $`G_t`$, $`\Theta_t`$, $`\Omega_t`$.
5. The optimizer update $`\Theta_{t+1} = f(\Theta_t, G_t, \Omega_t)`$ produces identical results on all devices by assumption (1).
Therefore, $`\Theta_{t+1}`$ is consistent and matches single-device training. ◻
**Remark 3**. Assumption (1) may be violated by non-deterministic GPU operations (e.g., atomics in reductions). Frameworks provide deterministic modes that satisfy this assumption at some performance cost.
Composition Calculus
Large-scale training combines multiple strategies. We formalize valid compositions. For example, tensor parallelism within a node can be combined with data parallelism across nodes: TP handles intra-layer distribution while DP handles gradient averaging.
**Definition 5** (Composition). Let $`\Pi_1`$ and $`\Pi_2`$ be placement specifications over device groups $`D_1`$ and $`D_2`$. The composition $`\Pi_1 \otimes \Pi_2`$ applies $`\Pi_1`$ within each subset of $`D_1`$ and $`\Pi_2`$ across subsets.
**Theorem 6** (Tensor-Data Composition). Tensor parallelism (degree $`T`$) composes with data parallelism (degree $`D`$) on $`N = T \times D`$ devices when:

1. TP groups consist of devices $`\{iT, iT + 1, \ldots, (i+1)T - 1\}`$ for $`i \in \{0, \ldots, D-1\}`$
2. TP communication (per-layer) completes before DP gradient sync
3. DP gradient sync aggregates across TP groups, not within
Proof. We verify gradient integrity and state consistency.
Gradient integrity: Within each TP group, devices hold different parameter shards but process the same data, computing partial gradients. TP communication (All-Reduce on activations) ensures correct forward/backward computation. Across TP groups, different data shards are processed. DP gradient sync (All-Reduce across groups) averages gradients, satisfying the gradient integrity condition of Theorem 3.
State consistency: TP groups maintain consistent sharded parameters through synchronized updates. DP groups maintain consistent replicated state through gradient sync. The separation (TP within, DP across) ensures no conflicts. ◻
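To make the grouping condition of Theorem 6 concrete, the sketch below computes, for $`N = T \times D`$ ranks, the TP groups (contiguous ranks, kept intra-node) and the DP groups (corresponding ranks across TP groups). This is only an illustrative layout computation with names of our choosing; production frameworks construct the analogous process groups through their own APIs.

```python
def tp_dp_groups(world_size: int, tp_degree: int):
    """Partition ranks 0..world_size-1 into TP groups {iT, ..., (i+1)T - 1}
    and DP groups of corresponding ranks across TP groups (Theorem 6)."""
    assert world_size % tp_degree == 0
    dp_degree = world_size // tp_degree
    tp_groups = [list(range(i * tp_degree, (i + 1) * tp_degree)) for i in range(dp_degree)]
    dp_groups = [[i * tp_degree + j for i in range(dp_degree)] for j in range(tp_degree)]
    return tp_groups, dp_groups

tp, dp = tp_dp_groups(world_size=8, tp_degree=4)
print(tp)  # [[0, 1, 2, 3], [4, 5, 6, 7]]     TP collectives stay within a node
print(dp)  # [[0, 4], [1, 5], [2, 6], [3, 7]]  DP gradient sync runs across TP groups
```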
**Theorem 7** (Pipeline-Data Composition). Pipeline parallelism ($`K`$ stages) composes with data parallelism ($`D`$ replicas) on $`N = K \times D`$ devices when:

1. Each stage $`k`$ is replicated $`D`$ times across devices $`\{kD, kD + 1, \ldots, (k+1)D - 1\}`$
2. Gradient sync is per-stage: All-Reduce only among replicas of the same stage
3. Activation transfer is between corresponding stages in the same pipeline
Proof. Gradient integrity: Each stage $`k`$ has parameters $`\Theta^{(k)}`$. Replicas of stage $`k`$ process different data shards and compute local gradients. Per-stage All-Reduce averages these, satisfying gradient integrity for $`\Theta^{(k)}`$.
State consistency: Within a pipeline, stages hold disjoint parameters (no overlap). Within a DP group, replicas hold identical parameters (synchronized by All-Reduce). No device needs to reconcile conflicting placements. ◻
Invalid Compositions
**Proposition 1** (TP Across Slow Interconnect). Tensor parallelism across devices with interconnect latency $`\alpha`$ incurs per-step latency overhead $`O(L \cdot \alpha)`$ where $`L`$ is the number of layers.
Proof. Each layer requires at least one synchronous collective (All-Reduce or All-Gather) for TP. With $`L`$ layers, this adds $`L`$ latency terms to the critical path. If $`\alpha`$ is large (e.g., cross-node Ethernet vs. intra-node NVLink), this dominates compute time. ◻
This explains why TP is restricted to intra-node communication in practice.
**Remark 4** (Three-Way Composition). Production systems commonly use TP $`\otimes`$ PP $`\otimes`$ DP (3D parallelism). This composes validly when: (1) TP is innermost (intra-node), (2) PP is middle (inter-node within a pipeline), and (3) DP is outermost (across pipeline replicas). The correctness follows by applying Theorems 6 and 7 hierarchically.
Application: Strategy Selection
We demonstrate how the framework guides practical decisions.
Algorithm [alg:select] (strategy selection) takes as input the model size $`P`$, per-device memory $`M_d`$, device count $`N`$, and interconnect type, and outputs a placement specification $`\Pi`$. It first computes the model-state footprint $`M_{\text{model}} \leftarrow 16P`$ bytes, selects full replication $`(R, R, R, R)`$ when that footprint fits within the per-device budget, falls back to $`(S^*, S, S, R)`$ when only the sharded footprint fits, and otherwise adds tensor parallelism within the node, returning the composed specification.
Note: The thresholds (0.7, 0.3) in Algorithm [alg:select] are illustrative heuristics leaving headroom for activations and runtime allocations. Practitioners should adjust based on measured activation sizes and framework overhead.
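Because the algorithm environment did not survive extraction cleanly, the sketch below is our reconstruction of a selection heuristic consistent with the surrounding description and the 0.7/0.3 headroom thresholds; the paper's exact Algorithm [alg:select] may differ in its decision rule.

```python
def select_strategy(P: float, device_mem: float, n_devices: int,
                    fast_intranode: bool) -> str:
    """Illustrative placement selection (not the paper's exact algorithm).
    P: parameter count; device_mem: bytes per device. The 0.7/0.3 factors
    leave headroom for activations and runtime allocations."""
    model_state = 16 * P                        # bytes, per Remark 1
    if model_state <= 0.7 * device_mem:
        return "DP: (R, R, R, R)"               # replicated state fits on one device
    if model_state / n_devices <= 0.3 * device_mem:
        return "ZeRO-3/FSDP: (S*, S, S, R)"     # sharded state fits with headroom
    if fast_intranode:
        return "ZeRO-3 composed with intra-node tensor parallelism"
    return "infeasible without more devices or offloading"

# A 7B model on 8 x 80 GB devices: replicated state (112 GB) does not fit,
# sharded state (14 GB per device) does.
print(select_strategy(P=7e9, device_mem=80e9, n_devices=8, fast_intranode=True))
```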
Validation Against Published Results
We validate our derivation rules against published numbers from the ZeRO paper.
Memory validation. The ZeRO paper uses mixed-precision accounting: 2 bytes (FP16 params) + 2 bytes (FP16 gradients) + 4 bytes (FP32 master weights) + 8 bytes (FP32 optimizer state) = 16 bytes per parameter. Our framework uses this same accounting (see Table 1 and Remark 1).
For data parallelism, ZeRO reports $`16P`$ bytes per device. Our framework: $`\mu(R, 2P) + \mu(R, 2P) + \mu(R, 12P) = 16P`$ bytes, matching exactly.
For ZeRO Stage 3 with $`N`$ devices, the ZeRO paper reports $`16P/N`$ bytes per device. Our framework predicts: $`16P/N`$ bytes (plus transient memory for gathered parameters), again matching.
Communication validation. The ZeRO paper reports that ZeRO Stage 3 requires 1.5$`\times`$ the communication volume of data parallelism. Our framework computes:
- Data parallelism: $`C_{\text{DP}} = 2 \cdot \frac{N-1}{N} \cdot |G| \approx 2|G|`$ for large $`N`$
- ZeRO Stage 3: $`C_{\text{Z3}} = \frac{N-1}{N} \cdot |G| + 2 \cdot \frac{N-1}{N} \cdot |\Theta| \approx |G| + 2|\Theta| = 3|G|`$ (since $`|\Theta| = |G|`$ in FP16)
The ratio $`3|G|/2|G| = 1.5`$, matching the published 1.5$`\times`$ overhead.
We note this validation compares analytical predictions with published analytical results, not runtime measurements. The match demonstrates our framework captures the same cost model used by ZeRO authors. Empirical validation with profiling tools would strengthen these results but is outside our theoretical scope.
Verification protocol: Given a configuration, verify correctness by the following checks (a code sketch follows the list):

1. Gradient integrity check: run an identical batch on 1 device and on $`N`$ devices. Compare gradients: $`\|G_1 - G_N\|/\|G_1\| < 10^{-5}`$.
2. State consistency check: after any collective, verify that all devices hold identical checksums.
3. Trajectory check: train for 100 steps on 1 device and on $`N`$ devices with the same seed. The final loss difference should be $`< 10^{-4}`$.
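A hedged sketch of checks 1 and 2 with `torch.distributed`; it assumes the process group is already initialized and that the same global batch and seed are used on both paths (helper names are ours).

```python
import torch
import torch.distributed as dist

def check_state_consistency(model: torch.nn.Module) -> bool:
    """Check 2: after a collective, every rank should hold identical parameters.
    Compare a cheap per-parameter checksum against rank 0's."""
    checksums = torch.stack([p.detach().float().sum() for p in model.parameters()])
    reference = checksums.clone()
    dist.broadcast(reference, src=0)           # rank 0's checksums
    return torch.allclose(checksums, reference)

def check_gradient_integrity(g_single: torch.Tensor, g_distributed: torch.Tensor,
                             tol: float = 1e-5) -> bool:
    """Check 1: relative difference between the 1-device gradient and the
    synchronized N-device gradient for the same global batch."""
    return ((g_single - g_distributed).norm() / g_single.norm()).item() < tol
```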
Related Work
Parallelism systems. Data parallelism was systematized by Li et al. and scaled by Goyal et al. ZeRO introduced state sharding, implemented in DeepSpeed. FSDP provides PyTorch-native sharding. Megatron-LM established tensor parallelism patterns; Korthikanti et al. extended them to sequence parallelism. GPipe introduced synchronous pipelines; PipeDream explored asynchronous variants; Narayanan et al. developed efficient schedules. GShard and Switch Transformer established expert parallelism.
Automatic parallelism. Alpa formulates parallelism selection as an optimization problem with cost models for memory and communication; it focuses on search algorithms rather than semantic foundations. Galvatron similarly optimizes parallelism configurations using profiling-based cost models. Unity jointly optimizes algebraic transformations and parallelization using formally verified graph substitutions; it uses theorem provers to verify correctness of individual transformations, while our work proves correctness conditions for the overall training procedure. Our work differs by providing a declarative framework where strategies are specified by placement rather than discovered by search; the two approaches are complementary.
Memory optimization. Mixed precision training reduces memory via lower precision. Activation checkpointing trades compute for memory. FlashAttention optimizes attention memory via recomputation.
Scaling studies. Scaling laws guide capacity allocation. Training reports for GPT-3, PaLM, and LLaMA describe practical configurations.
| Capability | Prior Work | This Paper |
|---|:--:|:--:|
| Describes specific system | ✓ | – |
| Systematic placement definitions | – | ✓ |
| Derives costs from specification | – | ✓ |
| Proves correctness conditions | – | ✓ |
| Composition calculus with proofs | – | ✓ |
| Covers arbitrary new strategies | – | ✓ |
Comparison of our contribution versus prior work.
Distinction from our work. Table 3 summarizes the key differences. Prior work describes specific systems, empirical findings, or search-based optimization. We provide a systematic framework with definitions, derivation rules, and proofs. ZeRO describes an implementation; we provide semantics in which ZeRO Stages 1, 2, and 3 are instances differing only in placement specification. This enables systematic reasoning about properties that no single system paper addresses.
Limitations
Our framework assumes synchronous training. Asynchronous methods (e.g., PipeDream’s weight stashing) introduce staleness that requires additional formalization.
We assume homogeneous devices. Heterogeneous systems (mixing GPU types) require per-device capability modeling that our current framework does not capture.
The derivation rules give asymptotic costs. Implementation constants (kernel launch overhead, memory allocator behavior) affect actual performance but are outside our scope. We model communication volume, not time; overlap between communication and computation is an implementation optimization not captured by our framework.
We model memory and communication but not compute time. A complete resource model would require operation-level analysis.
Expert parallelism (Mixture-of-Experts) requires extending the framework to handle conditional routing, where different inputs activate different parameter subsets. This extension is future work.
Sequence parallelism fits our framework as $`\pi_A = S`$ for activations, with corresponding communication for activation sharding. Context parallelism (ring attention) requires modeling communication patterns within the attention operator, an extension we leave to future work.
Activation checkpointing is orthogonal to our framework: it reduces $`|A|`$ through recomputation but does not change the placement mode of activations. The framework applies unchanged with the reduced $`|A|`$.
Gradient accumulation (processing multiple micro-batches before synchronization) is a straightforward extension: communication costs are amortized over accumulation steps, reducing effective communication by a factor equal to the number of accumulation steps.
Conclusion
We introduced placement semantics, a systematic framework for distributed training. The framework defines five placement modes, derives memory and communication from specifications, proves correctness conditions, and provides composition rules.
By formalizing distributed training, we enable: (1) precise comparison of strategies via their specifications; (2) prediction of resource requirements without implementation; (3) verification of correctness via explicit conditions; (4) principled composition of strategies.
We hope this framework aids practitioners in understanding existing systems and researchers in designing new parallelism strategies with formal guarantees.
Acknowledgments
We thank the developers of PyTorch, DeepSpeed, and Megatron-LM for open-source implementations that informed this analysis.
References
Tom Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. In NeurIPS, pages 1877–1901, 2020.
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv:1604.06174, 2016.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. PaLM: Scaling language modeling with Pathways. JMLR, 24(240):1–113, 2023.
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024.
William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 23(120):1–39, 2022.
Priya Goyal, Piotr Dollár, Ross Girshick, et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training compute-optimal large language models. In NeurIPS, 2022.
Yanping Huang, Youlong Cheng, Ankur Bapna, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. In NeurIPS, pages 103–112, 2019.
Jared Kaplan, Sam McCandlish, Tom Henighan, et al. Scaling laws for neural language models. arXiv:2001.08361, 2020.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Vijay Korthikanti, Jared Casper, Sangkug Lym, et al. Reducing activation recomputation in large transformer models. In MLSys, 2023.
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, et al. GShard: Scaling giant models with conditional computation and automatic sharding. In ICLR, 2021.
Shen Li, Yanli Zhao, Rohan Varma, et al. PyTorch Distributed: Experiences on accelerating data parallel training. PVLDB, 13(12):3005–3018, 2020.
Xupeng Miao, Yujie Wang, Youhe Jiang, et al. Galvatron: Efficient transformer training over multiple GPUs using automatic parallelism. In VLDB, 2023.
Paulius Micikevicius, Sharan Narang, Jonah Alben, et al. Mixed precision training. In ICLR, 2018.
Deepak Narayanan, Aaron Harlap, Amar Phanishayee, et al. PipeDream: Generalized pipeline parallelism for DNN training. In SOSP, pages 1–15, 2019.
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In SC, 2021.
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC, 2020.
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD, pages 3505–3506, 2020.
Alexander Sergeev and Mike Del Balso. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799, 2018.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, et al. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv:1909.08053, 2019.
Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
Colin Unger, Zhihao Jia, Wei Wu, et al. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In OSDI, pages 267–284, 2022.
Yanli Zhao, Andrew Gu, Rohan Varma, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. PVLDB, 16(12):3848–3860, 2023.
Lianmin Zheng, Zhuohan Li, Hao Zhang, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In OSDI, pages 559–578, 2022.
A Note of Gratitude
The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.

[^1]: This approximation holds for large $`H`$ and omits embedding parameters, which add approximately $`V \cdot H`$ where $`V`$ is vocabulary size. For a 70B model with typical vocabulary, embeddings contribute roughly 1–2% of total parameters.