Hierarchical Zeroth-Order Optimization for Deep Neural Networks
Zeroth-order (ZO) optimization has long been favored for its biological plausibility and its capacity to handle non-differentiable objectives, yet its computational complexity has historically limited its application in deep neural networks. Challenging the conventional paradigm that gradients propagate layer-by-layer, we propose Hierarchical Zeroth-Order (HZO) optimization, a novel divide-and-conquer strategy that decomposes the depth dimension of the network. We prove that HZO reduces the query complexity from $O(ML^2)$ to $O(ML \log L)$ for a network of width $M$ and depth $L$, representing a significant leap over existing ZO methodologies. Furthermore, we provide a detailed error analysis showing that HZO maintains numerical stability by operating near the unitary limit ($L_{lip} \approx 1$). Extensive evaluations on CIFAR-10 and ImageNet demonstrate that HZO achieves competitive accuracy compared to backpropagation.
💡 Research Summary
The paper tackles a long‑standing limitation of zeroth‑order (ZO) optimization for deep neural networks: the “curse of dimensionality” that makes ZO prohibitively expensive as network depth grows. While prior works have reduced the query complexity by compressing perturbations along the width dimension (e.g., neuron‑wise or low‑rank perturbations), they still suffer from a quadratic dependence on depth, $O(L^{2})$, which blocks scaling to modern deep architectures.
To overcome this, the authors introduce Hierarchical Zeroth‑Order (HZO) optimization, a divide‑and‑conquer framework that recursively splits the network along its depth axis. A deep network $F(x)=f_{L}\circ\cdots\circ f_{1}(x)$ is viewed as a composition of subnetworks $N_{i:j}$. For any subnetwork with more than one layer, HZO bisects it at the midpoint $k$, forming left $N_{i:k}$ and right $N_{k+1:j}$ parts. The right part’s Jacobian $J=\partial a_{j}/\partial a_{k}$ is estimated via a bidirectional finite‑difference scheme that requires two forward passes per perturbed neuron. The global loss gradient (target signal) $T_{j}=\partial\mathcal{L}/\partial a_{j}$ is propagated backward to the intermediate layer by $T_{k}=J^{\top}T_{j}$. The recursion continues until a single‑layer subnetwork is reached, at which point a classic delta‑rule update $\Delta W_{i}=-\eta\,T_{i}a_{i-1}^{\top}$ is applied.
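A minimal NumPy sketch of this recursion may help make the control flow concrete. It assumes fully connected layers with a tanh nonlinearity; the function names (`hzo`, `estimate_jacobian`, `run`) are illustrative, not the authors' code:

```python
import numpy as np

def phi(x):
    """Layerwise nonlinearity (tanh keeps the Lipschitz constant <= 1)."""
    return np.tanh(x)

def run(layers, a, i, j):
    """Forward pass through subnetwork N_{i:j} (1-indexed layers i..j)."""
    for l in range(i, j + 1):
        a = phi(layers[l - 1] @ a)
    return a

def estimate_jacobian(layers, i, j, a_in, eps=1e-4):
    """Bidirectional finite differences for J = d a_j / d a_{i-1}:
    two forward passes per perturbed input coordinate."""
    m = a_in.size
    J = np.zeros((run(layers, a_in, i, j).size, m))
    for d in range(m):
        e = np.zeros(m); e[d] = eps
        J[:, d] = (run(layers, a_in + e, i, j)
                   - run(layers, a_in - e, i, j)) / (2 * eps)
    return J

def hzo(layers, acts, i, j, T_j, eta):
    """Hierarchical zeroth-order update of subnetwork N_{i:j}.
    acts[l] is the activation a_l (acts[i-1] feeds layer i); T_j = dL/da_j."""
    if i == j:                                         # leaf: delta-rule update
        layers[i - 1] -= eta * np.outer(T_j, acts[i - 1])
        return
    k = (i + j) // 2                                   # bisect at the midpoint
    J = estimate_jacobian(layers, k + 1, j, acts[k])   # right part's Jacobian
    T_k = J.T @ T_j                                    # propagate target signal
    hzo(layers, acts, k + 1, j, T_j, eta)              # recurse into right part
    hzo(layers, acts, i, k, T_k, eta)                  # recurse into left part
```

A caller would run one forward pass to cache the activations `acts`, compute the loss gradient at the output, and invoke `hzo(layers, acts, 1, L, T_L, eta)`; all parameter updates then happen at the single-layer leaves.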
Theoretical analysis shows that the total cost $T(L)$, measured in layer evaluations, satisfies the recurrence $T(L)=2T(L/2)+C_{\text{Jacobian}}(L/2)$. Estimating the Jacobian of a subnetwork of depth $L/2$ and width $M$ takes $2M$ forward passes, each through $L/2$ layers, for a cost of $ML$; the recursive subproblems at each level contribute the same total, so the per‑level cost is $ML$, independent of the recursion depth. Summing over the $\log_{2}L$ levels yields a query complexity of $O(ML\log L)$, a dramatic improvement over the $O(ML^{2})$ or $O(ML^{2}\log L)$ complexities of previous ZO methods.
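The recurrence can be tallied numerically. The sketch below (constants and naming are illustrative, not from the paper) counts layer evaluations for power-of-two depths and compares them against the $ML\log_2 L$ bound:

```python
import math

def layer_evals(L, M):
    """T(L) = 2*T(L/2) + C_Jacobian(L/2), where the Jacobian of a depth-L/2,
    width-M subnetwork costs 2M forward passes of L/2 layers each = M*L."""
    if L == 1:
        return 0                    # leaf: delta-rule update, no extra queries
    return 2 * layer_evals(L // 2, M) + 2 * M * (L // 2)

M = 64
for L in (8, 64, 512):
    t = layer_evals(L, M)
    print(f"L={L:4d}  T(L)={t:9d}  M*L*log2(L)={M * L * int(math.log2(L)):9d}")
```

For power-of-two depths the recurrence resolves exactly to $ML\log_2 L$, which is the near-linear-in-depth behavior the paper claims.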
Error analysis reveals that ZO gradient estimates inherently suffer from an exponential factor $L_{\text{lip}}^{L}$, where $L_{\text{lip}}$ is the Lipschitz constant of layerwise mappings. Consequently, stability requires $L_{\text{lip}}\approx1$, a condition naturally satisfied by architectures with residual connections or orthogonal initialization. The authors prove that HZO’s hierarchical bisection does not amplify this error; the overall estimation error remains $O(L^{2}L_{\text{lip}})$, matching that of a monolithic ZO perturbation.
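The exponential factor is easy to see numerically: a per-layer Lipschitz constant $g$ compounds to $g^{L}$ over depth $L$, so even small deviations from 1 are decisive. A toy illustration (values chosen for clarity, not taken from the paper):

```python
# Why stability requires L_lip ≈ 1: a per-layer gain g compounds
# multiplicatively, so perturbation noise in a ZO estimate is damped
# (g < 1), preserved (g = 1), or amplified exponentially (g > 1).
L = 50
for g in (0.95, 1.00, 1.05):
    print(f"L_lip = {g:.2f}  ->  amplification over {L} layers: {g**L:.4f}")
```

At depth 50, a gain of 1.05 already amplifies by more than an order of magnitude while 0.95 attenuates by the same factor, which is why residual connections or orthogonal initialization (keeping $g\approx 1$) matter for ZO estimates.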
For convolutional networks, the authors propose Spatial Parallel Perturbation (SPP). By exploiting the locality of convolutional kernels, they identify a set of spatially disjoint positions whose receptive fields do not overlap. Perturbations at these positions can be evaluated in parallel, reducing the per‑layer Jacobian estimation cost from $O(H\times W)$ to a constant $O(R^{2})$, where $R$ is the receptive‑field size. This makes HZO viable for high‑resolution inputs such as ImageNet.
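One plausible way to realize the disjoint-position selection is a stride-$R$ grid: positions on the same grid are at least $R$ apart, so their $R\times R$ receptive fields cannot overlap, and the $R^{2}$ shifted grids together cover the whole feature map. A sketch under that assumption (the helper name is hypothetical):

```python
def disjoint_positions(H, W, R):
    """Partition an H x W feature map into R*R groups of positions whose
    R x R receptive fields are pairwise disjoint within each group.
    Each group can be perturbed in a single forward pass, so a full
    sweep costs O(R^2) passes instead of O(H*W)."""
    groups = []
    for dy in range(R):
        for dx in range(R):
            groups.append([(y, x)
                           for y in range(dy, H, R)
                           for x in range(dx, W, R)])
    return groups
```

For an $8\times 8$ map with $R=2$ this yields 4 groups of 16 positions each: 4 parallel forward passes replace 64 sequential ones.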
Empirically, HZO is evaluated on CIFAR‑10 using ResNet‑32, ResNet‑110, and WideResNet‑28‑10, achieving test accuracies within 0.5–2 % of standard back‑propagation while cutting memory footprint by roughly 30–50 %. On ImageNet, a ResNet‑50 trained for 90 epochs with HZO reaches a top‑1 accuracy of 76.2 % versus 76.7 % for back‑propagation, confirming competitive performance at scale. The measured query count aligns with the $O(ML\log L)$ prediction, and the method remains stable when the network operates near the unitary limit.
In summary, HZO offers a principled solution to the depth‑related bottleneck of zeroth‑order learning. By hierarchically propagating target signals and estimating Jacobians locally, it reduces query complexity to near‑linear in depth, preserves gradient fidelity, and can be extended to convolutional layers via spatial parallelism. The work bridges biological plausibility (gradient‑free learning) with practical scalability, though it relies on networks that maintain $L_{\text{lip}}\approx1$ and careful tuning of the perturbation magnitude $\epsilon$. Future research may explore adaptive schemes for $\epsilon$, integration with other ZO acceleration techniques, and broader applications to neuromorphic hardware.