Simplifying Multi-Task Architectures Through Task-Specific Normalization

Multi-task learning (MTL) aims to leverage shared knowledge across tasks to improve generalization and parameter efficiency, yet balancing resources and mitigating interference remain open challenges. Architectural solutions often introduce elaborate task-specific modules or routing schemes, increasing complexity and overhead. In this work, we show that normalization layers alone are sufficient to address many of these challenges. Simply replacing shared normalization with task-specific variants already yields competitive performance, questioning the need for complex designs. Building on this insight, we propose Task-Specific Sigmoid Batch Normalization (TS$σ$BN), a lightweight mechanism that enables tasks to softly allocate network capacity while fully sharing feature extractors. TS$σ$BN improves stability across CNNs and Transformers, matching or exceeding performance on NYUv2, Cityscapes, CelebA, and PascalContext, while remaining highly parameter-efficient. Moreover, its learned gates provide a natural framework for analyzing MTL dynamics, offering interpretable insights into capacity allocation, filter specialization, and task relationships. Our findings suggest that complex MTL architectures may be unnecessary and that task-specific normalization offers a simple, interpretable, and efficient alternative.


💡 Research Summary

The paper tackles two fundamental challenges in multi‑task learning (MTL): how to share a backbone efficiently while preventing harmful interference between tasks. Traditional solutions often resort to elaborate task‑specific adapters, routing networks, or cross‑attention modules, which increase parameter count, computational cost, and implementation complexity. The authors propose a radically simpler approach: rely solely on normalization layers to achieve task differentiation and capacity allocation.

First, they replace the conventional shared Batch Normalization (BN) with Task‑Specific BN (TSBN), assigning each task its own scale (γ) and shift (β) parameters while keeping the running mean and variance shared. This minimal modification already yields substantial gains across a variety of vision benchmarks, demonstrating that per‑task affine transformations are enough to let each task “see” a slightly different feature distribution.
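The TSBN idea can be illustrated with a minimal sketch. This is an assumption-laden toy implementation (the class and parameter names are illustrative, not the authors' code): running statistics are shared across tasks, while each task indexes its own γ/β row.

```python
import numpy as np

class TaskSpecificBN:
    """Toy Task-Specific BatchNorm: shared running statistics,
    per-task affine (gamma, beta) parameters. Illustrative only."""

    def __init__(self, num_channels, num_tasks, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        # Normalization statistics are shared by all tasks.
        self.running_mean = np.zeros(num_channels)
        self.running_var = np.ones(num_channels)
        # Each task owns one row of affine parameters.
        self.gamma = np.ones((num_tasks, num_channels))
        self.beta = np.zeros((num_tasks, num_channels))

    def __call__(self, x, task, training=True):
        # x: (batch, channels)
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        # Only this line differs per task.
        return self.gamma[task] * x_hat + self.beta[task]
```

Because only the affine row is task-specific, the extra cost is `num_tasks × 2 × num_channels` parameters per layer, with no change to the shared feature extractor.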

Building on this insight, the authors introduce Task‑Specific Sigmoid Batch Normalization (TSσBN). In addition to per‑task γ/β, TSσBN learns a channel‑wise gating vector gₜ for each task, computed as a sigmoid of a small set of learnable parameters wₜ. The normalized activation x̂ is multiplied element‑wise by gₜ before the affine transform, effectively allowing each task to softly select which channels it wants to use (g≈1) and which to suppress (g≈0). The sigmoid ensures smooth gradients and a continuous spectrum between full sharing and complete isolation, while the extra parameters constitute only 1–2 % of the total model size.
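A minimal sketch of the gating step, under the same caveats (names like `w`, `gamma`, `beta` are assumptions; batch statistics are used throughout for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TaskSpecificSigmoidBN:
    """Toy TS-sigma-BN: the per-task gate g_t = sigmoid(w_t) scales the
    normalized activation channel-wise before the per-task affine
    transform. Illustrative sketch, not the authors' implementation."""

    def __init__(self, num_channels, num_tasks, eps=1e-5):
        self.eps = eps
        # w = 0 gives g = sigmoid(0) = 0.5: channels start half-open.
        self.w = np.zeros((num_tasks, num_channels))
        self.gamma = np.ones((num_tasks, num_channels))
        self.beta = np.zeros((num_tasks, num_channels))

    def __call__(self, x, task):
        # Shared normalization (batch statistics).
        x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + self.eps)
        # Soft channel selection in (0, 1); differentiable everywhere.
        g = sigmoid(self.w[task])
        return self.gamma[task] * (g * x_hat) + self.beta[task]
```

Driving wₜ strongly negative shuts a channel off for that task while leaving it available to the others, which is exactly the soft capacity-allocation behavior described above.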

The method is evaluated on both convolutional (ResNet‑50/101) and transformer (ViT‑Base) backbones across four major MTL benchmarks: NYUv2 (depth, surface normals, indoor segmentation), Cityscapes (semantic segmentation + depth), CelebA (40 facial attributes), and PascalContext (dense semantic segmentation). TSσBN consistently matches or exceeds the performance of state‑of‑the‑art MTL architectures such as MTAN, PAD‑Net, Cross‑Stitch, and others, while adding negligible FLOPs and only a tiny parameter overhead. For example, on NYUv2 the proposed model improves mean IoU by 2.3 % over a strong shared‑BN baseline and reduces RMSE on depth prediction by 4 %.

Beyond raw performance, the learned gates provide a natural lens for interpreting MTL dynamics. Visualizing gₜ reveals filter specialization: tasks that require geometric reasoning (depth, normals) allocate high gates to channels that capture shape cues, whereas semantic segmentation emphasizes texture‑oriented channels. By computing cosine similarity between gate vectors of different tasks, the authors construct a task relationship matrix that quantifies how much two tasks rely on the same feature subspace. In NYUv2, depth and normals show a similarity of 0.92, confirming their shared geometric nature, while segmentation diverges with a similarity of 0.65. Moreover, during later training stages some channels receive low gates across all tasks, indicating that the network automatically prunes redundant capacity.
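The task-relationship analysis reduces to a cosine-similarity matrix over the per-task gate vectors. A short sketch (the helper name is an assumption):

```python
import numpy as np

def task_similarity(gates):
    """Cosine similarity between per-task gate vectors.

    gates: array of shape (num_tasks, num_channels), e.g. the learned
    sigmoid gates of one layer, or gates concatenated across layers.
    Returns a (num_tasks, num_tasks) matrix with 1.0 on the diagonal.
    """
    norm = gates / np.linalg.norm(gates, axis=1, keepdims=True)
    return norm @ norm.T
```

Tasks that load on the same channels score near 1 (as depth and normals do on NYUv2), while tasks using disjoint channel subsets score near 0.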

The paper also discusses limitations and future directions. TSσBN currently operates at the channel level; extending the gating mechanism to spatial positions, attention heads, or even to the normalization statistics themselves could yield finer‑grained control. The authors note that gate initialization and learning schedules influence convergence speed, suggesting that more sophisticated optimization schemes might further improve stability. Finally, while the experiments focus on computer‑vision tasks, the underlying principle—task‑specific normalization as a lightweight, interpretable capacity allocator—should be applicable to NLP, speech, and multimodal domains.

In conclusion, the work challenges the prevailing belief that sophisticated architectural components are necessary for effective MTL. By demonstrating that normalization alone, when made task‑specific and equipped with a soft gating mechanism, can deliver state‑of‑the‑art results, the authors open a new design space where simplicity, efficiency, and interpretability coexist. TSσBN offers a compelling recipe: keep the backbone fully shared, introduce only a handful of per‑task parameters, and let the model learn how to allocate its representational power. This paradigm shift could streamline future MTL research and deployment, especially in resource‑constrained settings where parameter budget and explainability are paramount.

