Multi-task learning (MTL) aims to leverage shared knowledge across tasks to improve generalization and parameter efficiency, yet balancing resources and mitigating interference remain open challenges. Architectural solutions often introduce elaborate task-specific modules or routing schemes, increasing complexity and overhead. In this work, we show that normalization layers alone are sufficient to address many of these challenges. Simply replacing shared normalization with task-specific variants already yields competitive performance, questioning the need for complex designs. Building on this insight, we propose Task-Specific Sigmoid Batch Normalization (TSσBN), a lightweight mechanism that enables tasks to softly allocate network capacity while fully sharing feature extractors. TSσBN improves stability across CNNs and Transformers, matching or exceeding performance on NYUv2, Cityscapes, CelebA, and PascalContext, while remaining highly parameter-efficient. Moreover, its learned gates provide a natural framework for analyzing MTL dynamics, offering interpretable insights into capacity allocation, filter specialization, and task relationships. Our findings suggest that complex MTL architectures may be unnecessary and that task-specific normalization offers a simple, interpretable, and efficient alternative.
Multi-task learning (MTL) trains a single model to solve multiple tasks jointly, leveraging shared representations to improve generalization and computational efficiency. Despite many successes, MTL remains difficult to understand and control. Core challenges include task interference, where competing gradients from divergent task requirements disrupt joint training (Zhang et al., 2022); capacity allocation, where shared and task-specific resources must be balanced to avoid dominance (Maziarz et al., 2019; Newell et al., 2019); and task similarity, where the degree of relatedness determines how tasks should interact (Standley et al., 2020). Existing approaches typically address only one of these issues. Optimization-based methods focus on mitigating interference by reweighting losses or modifying gradients (Yu et al., 2020; Navon et al., 2022). Soft-sharing architectures attempt to disentangle capacity by adding task-specific modules on top of a shared backbone, but in doing so often introduce significant design complexity in deciding how modules should interact (Misra et al., 2016; Liu et al., 2019). Neural architecture search methods learn to partition networks based on data-driven estimates of task-relatedness (Guo et al., 2020; Sun et al., 2020).
In this work, we argue that normalization layers, and in particular batch normalization (BN) (Ioffe, 2015), are a sufficient and highly effective solution for all the aforementioned challenges in MTL. Our motivation stems from the following observations. First, while neural networks are heavily over-parameterized, existing approaches struggle to resolve task conflicts (Shi et al., 2023), indicating a failure to utilize the available network capacity optimally. Second, BN has proven to be highly expressive: not only does it stabilize and accelerate training (Santurkar et al., 2018; Bjorck et al., 2018), but it also demonstrates remarkable standalone performance when used on random feature extractors (Rosenfeld & Tsotsos, 2019; Frankle et al., 2021) and can leverage features not explicitly optimized for a specific task (Zhao et al., 2024). Third, BN can learn to ignore unimportant features (Frankle et al., 2021) or be explicitly regularized to produce structured sparsity (Liu et al., 2017; Suteu & Guo, 2022). This can be leveraged for MTL when unrelated tasks cannot fully share all features without interference and require disentanglement.
Fourth, normalization layers are extremely parameter-efficient, taking up typically less than 0.5% of a model’s size. This makes them particularly suitable as lightweight universal adapters for applications where models need to scale to multiple tasks (Rebuffi et al., 2017;Bilen & Vedaldi, 2017).
Lastly, while conditional BN layers have been explored in settings with domain shift (Wallingford et al., 2022; Xie et al., 2023; Chang et al., 2019; Deng et al., 2023), these methods focus on the issue of mismatched normalization statistics and use task-specific BN as a domain-alignment tool. Our focus is different: we study single-domain MTL, where all tasks share the same input distribution and normalization does not become a failure mode. In this setting, we show that task-specific BN provides a simple way to modulate representations via its affine parameters, turning it from a normalization module into a lightweight mechanism for capacity allocation and interference reduction. Using BN as the sole mechanism for modulation and interpretability, rather than for domain alignment, remains largely unexplored.
Motivated by these observations, we propose a minimalist soft-sharing approach to MTL, where feature extractors are fully shared and only normalization layers are task-specific. Unlike prior soft-sharing architectures that add complex modules or routing schemes, our design isolates normalization as the sole mechanism for balancing tasks. Building on σBN (Suteu & Guo, 2022), we introduce lightweight task-specific gates that modulate feature usage with negligible overhead, making the approach broadly compatible, easy to implement, and resilient to task imbalance. Beyond performance and efficiency, the learned σBN parameters naturally form a task-filter importance matrix, enabling a structured analysis of capacity allocation, filter specialization, and task relationships, providing an interpretable view of MTL that is largely absent in prior work.
• A minimal MTL baseline. We show that simply replacing shared normalization with task-specific BatchNorm (TSBN) already delivers competitive performance out-of-the-box, questioning the necessity of elaborate task-specific modules or routing schemes.
• An extended design with sigmoid normalization. We introduce TSσBN, which improves stability and scale across CNNs and transformers. This variant achieves superior performance on nearly all benchmarks while remaining parameter-efficient.
• An interpretable analysis framework. The use of σBN further provides a natural lens for analyzing MTL dynamics. By interpreting learned feature importances, we obtain structured insights into capacity allocation, filter specialization, and task relationships.
Soft parameter sharing methods tackle MTL interference architecturally by introducing task-specific modules to a shared backbone. Design options include replicating backbones (Misra et al., 2016; Ruder et al., 2019), adding attention mechanisms (Liu et al., 2019; Maninis et al., 2019), low-rank adaptation modules (Liu et al., 2022b; Agiza et al., 2024), or allowing cross-talk at the decoder level (Xu et al., 2018; Vandenhende et al., 2020b). However, these methods rely on task-specific feature extractors to avoid negative transfer, at the cost of forgoing the multi-task inductive bias. Furthermore, adding task-specific capacity scales poorly with many tasks (Strezoski et al., 2019) and requires extensive code modifications that hinder adaptation to new architectures. Although BatchNorm is present in many of these systems, it is embedded in larger task-specific designs. In contrast, our method isolates BatchNorm as the sole soft-sharing mechanism, showing that it is a sufficient solution for competitive MTL while challenging unnecessary complexity.
Neural Architecture Search (NAS) methods reduce task interference by choosing which parameters to share among tasks as hard-partitioned sub-networks. Some approaches use probabilistic sampling (Sun et al., 2020; Bragman et al., 2019; Maziarz et al., 2019; Newell et al., 2019) or explicit branching/grouping strategies based on task affinities (Vandenhende et al., 2020a; Guo et al., 2020; Bruggemann et al., 2020; Standley et al., 2020; Fifty et al., 2021). Others use hypernetworks (Raychaudhuri et al., 2022; Aich et al., 2023), which learn to generate MTL architectures conditioned on user preferences. While our method also models task relationships and capacity allocation, it does so without architecture search, relying solely on static modulation via normalization layers.
Mixture-of-Experts (MoE) methods address task interference by dynamically routing inputs to specialized experts, enabling flexible capacity allocation among tasks (Ma et al., 2018; Hazimeh et al., 2021; Tang et al., 2020). More recent work extends MoE designs to large-scale transformer architectures for vision and language tasks (Fan et al., 2022; Chen et al., 2023; Ye & Xu, 2023; Yang et al., 2024). Although effective, these methods rely on dynamic, per-sample routing that increases architectural and training complexity. In contrast, our approach provides a static and lightweight form of soft partitioning, achieving similar benefits with minimal changes to the wrapped backbone.
Parameter-efficient fine-tuning (PEFT) is a popular approach for adapting large pre-trained models without updating the full backbone. Single-task PEFT methods such as Adapters (He et al., 2021), BitFit (Zaken et al., 2022), VPT (Jia et al., 2022), Compacter (Karimi Mahabadi et al., 2021), and LoRA-style updates add small task-specific modules or low-rank layers while keeping most weights frozen. Extending these ideas to MTL requires managing several task-specific adapters at once.
Recent PEFT-MTL methods address this by generating adapter weights through hypernetworks or decompositions, as in HyperFormer (Mahabadi et al., 2021), Polyhistor (Liu et al., 2022b), and MTLoRA (Agiza et al., 2024). However, these methods still rely on additional task-specific capacity, which parallels traditional soft-parameter sharing and scales poorly with the number of tasks. In contrast, we modulate the shared capacity directly through BN, without adding new feature extractors.
Domain-specific normalization has become a common technique in settings with domain shift, where shared BatchNorm fails because domains have different input distributions. In these cases, separate BN statistics or layers are required to maintain stable normalization (Li et al., 2016; Zajac et al., 2019; Chang et al., 2019). The same motivation appears in several areas. In meta-learning, TaskNorm (Bronskill et al., 2020) adapts BN statistics per episode to handle changes in input distribution. In continual learning, CLBN (Xie et al., 2023) stores task-specific BN parameters to avoid catastrophic forgetting from normalization drift. In conditional or multi-modal models, BN and LayerNorm are adjusted to match modality-specific statistics (Michalski et al., 2019; Zhao et al., 2024). In multi-domain MTL (Bilen & Vedaldi, 2017; Mudrakarta et al., 2019; Wallingford et al., 2022; Deng et al., 2023), task-specific BN is used as an adapter for tasks from different domains. In contrast, our work targets single-domain MTL, where all tasks share the same input and normalization does not fail. In this case, task-specific BN is not needed for statistical correction. Instead, we focus on its affine parameters as a basis for task-specific feature modulation, and extend this idea with a reparameterization and optimization scheme tailored to reduce interference and allocate capacity.

Figure 1 (caption): Cross-Stitch Networks (Misra et al., 2016) and MTAN (Liu et al., 2019) incorporate additional feature extractors, which leads to scalability challenges as the number of tasks increases. Task-Specific σBN networks introduce only task-specific normalization layers, offering a highly parameter-efficient solution.
Batch normalization is a cornerstone for deep CNNs due to its versatility, efficiency, and wide-ranging benefits, including improved training stability for faster convergence (Santurkar et al., 2018; Bjorck et al., 2018), regularization effects (Luo et al., 2019), and the orthogonalization of representations (Daneshmand et al., 2021). BN operates in two key steps, normalization followed by an affine transformation:

BN(x) = γ · (x − μ_B) / √(σ²_B + ε) + β.
The normalization step standardizes input activations using the mini-batch mean μ_B and variance σ²_B, while the affine transformation applies channel-specific learnable parameters, γ and β, to re-scale and shift the normalized activations. During inference, BN relies on population statistics collected during training via running estimates. When the test distribution differs from the training set, these statistics can become mismatched and significantly degrade model performance (Summers & Dinneen, 2020). Because of this, many BN variants aim to improve the normalization step itself by adjusting μ and σ to handle distribution changes, domain shift, meta-learning episodes, or multi-modal inputs. For a survey on normalization approaches we refer to Huang et al. (2023).
In single-domain MTL, all tasks share the same input distribution, so the normalization component of BN does not need adjustment. Instead, we focus on the affine transformation applied post-normalization. These parameters represent only a small fraction of the network, yet they have substantial expressive power, as shown by studies demonstrating high performance when training BN alone (Frankle et al., 2021). In this work, we build on a variation of BN originally introduced to determine feature importance in structured pruning: Sigmoid Batch Normalization (σBN) (Suteu & Guo, 2022) replaces the affine transformation with a single bounded scaler:

σBN(x) = σ(γ) · (x − μ_B) / √(σ²_B + ε).
Using a single bounded scaler per feature has little impact on performance, but enables targeted regularization and improves interpretability. These properties make σBN especially attractive for multi-task learning, where understanding how tasks share limited capacity is critical. In this setting, σ(γ) acts as a static soft gate that can down-weight or disable features. This implicit static gating contrasts with soft-sharing models, which explicitly partition capacity, and MoE methods, which route features dynamically through task-specific gates. Furthermore, this formulation can be extended to other normalization layers (Ba et al., 2016), as we show in experiments on transformers. Using σBN as the only task-specific components, we create a parameter-efficient framework that sustains performance while providing tools to analyze and influence capacity allocation and task relationships.
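For concreteness, a σBN layer of this form can be sketched in a few lines of PyTorch. The sketch below is illustrative rather than our exact implementation: the module name `SigmoidBN2d` is hypothetical, and it assumes γ is initialized at zero so that σ(γ) = 0.5.

```python
import torch
import torch.nn as nn

class SigmoidBN2d(nn.Module):
    """Sketch of sigmoid batch normalization: standard BN statistics,
    with the affine transform replaced by a single bounded gate sigma(gamma)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        # Plain BN without its own affine parameters; the gate is supplied below.
        self.bn = nn.BatchNorm2d(num_features, eps=eps, momentum=momentum, affine=False)
        # gamma = 0  =>  sigma(gamma) = 0.5 at initialization.
        self.gamma = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):
        x_hat = self.bn(x)                      # (x - mu_B) / sqrt(var_B + eps)
        gate = torch.sigmoid(self.gamma)        # bounded per-channel scale in (0, 1)
        return x_hat * gate.view(1, -1, 1, 1)
```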
TSσBN networks are constructed by replacing every shared Batch Normalization layer with task-specific σBN layers, as illustrated in Figure 1. This design allows each task t to normalize and modulate the outputs z of shared convolutional layers:

ŷ_t = σ(γ_t) · (z − μ_B) / √(σ²_B + ε),
enabling better disentanglement of representations and reduced task interference. Unlike prior methods that introduce additional task-specific capacity, TSσBN keeps all convolutions shared, preserving the multi-task learning inductive bias toward generalizable representations. While domain-specific BN has been used reactively in domain adaptation (Chang et al., 2019) to handle distribution shifts, our work is the first to use it proactively as a standalone mechanism in single-input scenarios.
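The conversion of a shared backbone can then be sketched as wrapping every BatchNorm layer with a small module that holds one σBN instance per task and dispatches on the currently active task. This minimal sketch builds on the `SigmoidBN2d` example above; the class and function names and the `active_task` switching mechanism are our own illustrative choices.

```python
import torch.nn as nn

class TaskSpecificSigmaBN(nn.Module):
    """Drop-in replacement for a shared BatchNorm2d: one SigmoidBN2d per task.
    All convolutions stay shared; only normalization/gating is task-specific."""

    def __init__(self, num_features, num_tasks):
        super().__init__()
        self.layers = nn.ModuleList(SigmoidBN2d(num_features) for _ in range(num_tasks))
        self.active_task = 0  # set externally before each task's forward pass

    def forward(self, x):
        return self.layers[self.active_task](x)


def convert_to_tsbn(module, num_tasks):
    """Recursively swap every BatchNorm2d in a backbone for TaskSpecificSigmaBN."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, TaskSpecificSigmaBN(child.num_features, num_tasks))
        else:
            convert_to_tsbn(child, num_tasks)
    return module
```

Under this sketch, each mini-batch would be forwarded once per task with `active_task` set accordingly before computing that task's loss.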
Task interference. Conflicting gradient updates between tasks are a central challenge in MTL, often measured by negative cosine similarity (Zhao et al., 2018; Yu et al., 2020; Shi et al., 2023). Figure 2 (left) shows the gradient similarity distribution for shared convolutional parameters: in hard parameter sharing, the distribution is nearly uniform, meaning roughly half of all updates conflict. MTAN (Liu et al., 2019) partially alleviates this issue by introducing task-specific convolutions. In contrast, TSσBN yields a sharp, zero-centered distribution with low variance, indicating gradients are mostly orthogonal. This mirrors optimization-based methods that explicitly enforce orthogonality (Yu et al., 2020; Suteu & Guo, 2019), yet TSσBN achieves it through a lightweight architectural change.
Figure 2 (middle) further supports this: on CelebA, task representations form well-separated clusters, illustrating reduced interference. A full analysis across all tasks is provided in Appendix A.
Parameter Efficiency. Task-Specific σBN is highly parameter-efficient since it does not introduce additional feature extractors like related soft parameter sharing architectures. At the extreme end, approaches such as Single-Task Learning or Cross-Stitch networks duplicate the entire backbone for each new task. TSσBN, on the other hand, duplicates only σBN layers, whose parameters comprise a small fraction of the total model size. Figure 2 (right) shows how different approaches scale with additional tasks.
TSσBN adds an insignificant amount of new parameters, allowing it to scale to any number of tasks.
Discriminative Learning Rates. We increase the learning rate of σBN parameters by a fixed multiple (α_σBN = 10²) relative to other parameters, allowing them to allocate filters before the shared layers undergo significant updates. This accelerates specialization and ensures capacity allocation occurs early in training. A further advantage of σBN is its robustness to high learning rates: the sigmoid dampens gradients, making training stable across scales, whereas vanilla BN is more sensitive and requires careful tuning. The approach parallels transfer learning, where deeper layers are updated more aggressively to drive adaptation (Howard & Ruder, 2018; Vlaar & Leimkuhler, 2022). We provide ablations on how higher learning rates improve performance and filter allocation.
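In practice, this amounts to placing the σBN gates in a separate optimizer parameter group with a scaled learning rate. The sketch below assumes the `SigmoidBN2d` naming from the earlier example and uses the Adam base learning rate of our NYUv2 setup as a placeholder.

```python
import torch

def build_optimizer(model, base_lr=1e-4, sigma_bn_multiplier=100.0):
    """Give sigma-BN gate parameters a higher learning rate than the rest of the model."""
    gate_params, other_params = [], []
    for name, param in model.named_parameters():
        # In the sketch above, the gates are the 'gamma' parameters of SigmoidBN2d modules.
        (gate_params if name.endswith("gamma") else other_params).append(param)
    return torch.optim.Adam([
        {"params": other_params, "lr": base_lr},
        {"params": gate_params, "lr": base_lr * sigma_bn_multiplier},
    ])
```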
A key advantage of the TSσBN design is the ability to quantify filter allocation through task-filter importance matrices. Since each σBN layer introduces a dedicated scaling parameter γ_{t,i} per task and filter, we construct a task-filter importance matrix I ∈ ℝ^{T×F}, where each entry I_{t,i} captures the importance task t assigns to filter i. Applying the sigmoid function to the raw scaling parameters, I_{t,i} = σ(γ_{t,i}), ensures that values remain within [0, 1], facilitating interpretability and comparability across tasks, layers, and models. Using this representation, TSσBN enables a principled analysis of MTL dynamics, including capacity allocation, task relationships, and filter specialization.
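Assembling the matrix is straightforward under the earlier sketches: the sigmoid of every task's γ parameters is concatenated across all σBN layers into one row per task. This is a sketch assuming the hypothetical `TaskSpecificSigmaBN` module introduced above.

```python
import torch

def task_filter_importance(model, num_tasks):
    """Build the task-filter importance matrix I of shape (T, F), where
    I[t, i] = sigma(gamma_{t,i}) is the importance task t assigns to filter i."""
    rows = []
    for t in range(num_tasks):
        gates = [
            torch.sigmoid(m.layers[t].gamma.detach())
            for m in model.modules()
            if isinstance(m, TaskSpecificSigmaBN)
        ]
        rows.append(torch.cat(gates))
    return torch.stack(rows)
```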
One of the central challenges in multi-task learning is understanding how model capacity is allocated among competing tasks. The TSσBN task-filter importance matrix I can directly quantify the total capacity of a task t as the normalized sum of the importances it assigns to filters, C_t = (1/F) Σ_{i=1}^{F} I_{t,i}. This measure provides an overall assessment of the resources required for each task; however, it does not account for task relationships or shared capacity. A high absolute capacity does not necessarily imply that a task monopolizes filters, as it may rely heavily on shared generic filters.
We apply an orthogonal projection-based decomposition to differentiate between task-specific and shared capacity. Given the set of task importance vectors {I_1, I_2, …, I_T}, we decompose each task's capacity into an independent component and a shared component. Let A be the matrix formed by stacking all task importance vectors except I_t. The projection of I_t onto the subspace spanned by the other tasks is given by the projection matrix P_A:

P_A = A^⊤ (A A^⊤)^{-1} A.
The shared component Î_t = P_A I_t and the independent component I_t^⊥ = I_t − Î_t of I_t can therefore be defined so that I_t^⊥ is orthogonal to the subspace spanned by the other task importance vectors. To derive a capacity decomposition consistent with the original measure, we define the independent and shared capacities as scaled versions of the total capacity, C_t^⊥ = C_t · ‖I_t^⊥‖₂ / ‖I_t‖₂ and Ĉ_t = C_t · ‖Î_t‖₂ / ‖I_t‖₂. Because in this formulation the components are orthogonal, the L₂ norm satisfies the Pythagorean theorem, yielding (C_t)² = (C_t^⊥)² + (Ĉ_t)². This guarantees that a task's total capacity is preserved while providing an interpretable split between shared and independent resource usage.
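A small numerical sketch of this decomposition, assuming the (T, F) importance matrix from the earlier example and using a pseudoinverse for the projection onto the other tasks' span:

```python
import torch

def decompose_capacity(I):
    """Split each task's capacity C_t into shared and independent parts
    so that C_t^2 = (shared_t)^2 + (independent_t)^2."""
    T, _ = I.shape
    total = I.mean(dim=1)                              # C_t: normalized sum of importances
    shared, independent = [], []
    for t in range(T):
        others = torch.cat([I[:t], I[t + 1:]])         # A: rows are the other tasks' vectors
        coeffs = torch.linalg.pinv(others.T) @ I[t]    # least-squares coefficients
        proj = others.T @ coeffs                       # shared component, lies in span(A)
        resid = I[t] - proj                            # independent component, orthogonal to span(A)
        shared.append(total[t] * proj.norm() / I[t].norm())
        independent.append(total[t] * resid.norm() / I[t].norm())
    return torch.stack(shared), torch.stack(independent)
```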
Using our framework, we analyze task capacity allocation after training as shown in Figure 3. For both SegNet and DeepLabV3 architectures, we find that most capacity is shared among tasks without a single task dominating. For a more detailed analysis on the effects of task difficulty and similarity on capacity allocation, we refer to Appendix E. Overall, this view offers interpretability into the interaction between tasks and can be a powerful tool in real-world applications where relationships are not known a priori.
A desirable feature for any multi-task learning model is the ability to derive task relationships, as this can help gauge interference between tasks and provide insights into the joint optimization process.
To showcase this, we use the CelebA dataset, containing 40 binary facial attribute tasks, allowing us to explore complex task relationships and hierarchies via TSσBN. Moreover, because these attributes are semantically interpretable (e.g., “Smiling”, “Mouth Slightly Open”), they enable meaningful qualitative assessments of the learned relationships.
To derive task relationships, we compute the pairwise cosine similarity between the task importance vectors I_t ∈ ℝ^F, yielding a T × T similarity matrix with values ranging from 0 (orthogonal filter usage) to 1 (identical usage). We use this as the basis for constructing distance matrices to identify task clusters and hierarchical relationships that reflect the model's capacity allocation.
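As a sketch, computing this similarity matrix from the (T, F) importance matrix requires only a row normalization and a matrix product:

```python
import torch

def task_similarity(I):
    """Pairwise cosine similarity between task importance vectors (T x T).
    Entries lie in [0, 1] because all importances are non-negative."""
    I_unit = I / I.norm(dim=1, keepdim=True)
    return I_unit @ I_unit.T
```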
To assess the stability of the task relationships derived from our model, we focus on the consistency of task hierarchies across multiple training runs. Specifically, we evaluate the similarity matrices obtained from seven independently trained models with different initializations. We compute the pairwise Spearman rank correlation between similarity matrices to determine whether the relative task orderings are robust to such variations. Our results show that the task hierarchies are highly stable, with an average Spearman correlation of 0.8 across all model pairs. We further assess the resulting relationships by aggregating the representative task clusters from the seven runs via co-occurrence matrices and hierarchical clustering. The identified clusters exhibit semantic coherence, suggesting a correlation with the spatial proximity of facial attributes. For instance, tasks related to hair characteristics (e.g., Bangs, Blond Hair) form a distinct cluster, whereas facial hair attributes (e.g., Goatee, Mustache) are grouped separately. More details about the procedure and resulting task clusters can be found in Appendix C.

A different way to analyze multi-task learning is from an individual filter perspective. Using the task-filter matrix, we can gauge each task's reliance on a filter to determine whether the resource is specialized or generic. We define a filter as specialized for a particular task if its normalized task-filter importance exceeds a threshold τ. We set τ = 0.5 to signify that the filter predominantly contributes to a single task rather than being shared among multiple tasks. Formally, let σ(γ_{t,i}) denote the importance of filter i for task t. A filter i is deemed specialized for task t if σ(γ_{t,i}) / Σ_{t'=1}^{T} σ(γ_{t',i}) > τ.
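A minimal sketch of this criterion over the importance matrix, with τ = 0.5 as in our experiments:

```python
import torch

def specialized_filters(I, tau=0.5):
    """Boolean (T, F) mask: filter i is specialized for task t if task t's share
    of the filter's total importance across tasks exceeds tau."""
    shares = I / I.sum(dim=0, keepdim=True)   # normalize each filter's importance over tasks
    return shares > tau
```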
We prune the top 200 most important filters per task to test our definitions of specialization and importance. If accurate, removing a task’s specialized filters should degrade its performance more than others. Figure 4 (right) confirms this: diagonal elements, representing self-impact, show significantly larger drops than off-diagonals, supporting our hypothesis.
Next, we examine where specialized filters occur across the network. Figure 4 (left) shows the percentage of specialized filters per layer from different runs. Specialization increases with network depth, indicating that early layers are more shared while deeper layers become task-specific. This mirrors findings in single-task learning (Yosinski et al., 2015), where lower layers encode general features, and aligns with branching-based NAS heuristics (Bruggemann et al., 2020; Vandenhende et al., 2020a; Guo et al., 2020), which assign specialized layers to later stages. Our method for quantifying specialization and task similarity offers an alternative perspective for NAS strategies.
We evaluate TSσBN across a wide range of MTL settings, covering three CNN architectures (trained from scratch and pretrained) and two vision transformer architectures over four standard MTL datasets: NYUv2 (Silberman et al., 2012), Cityscapes (Cordts et al., 2016), CelebA (Liu et al., 2015), and PascalContext (Chen et al., 2014). We follow established protocols from prior work (Liu et al., 2019; Ban & Ji, 2024; Lin & Zhang, 2023; Yang et al., 2024; Agiza et al., 2024) for training, evaluation, and metric reporting. TSσBN achieves comparable or superior performance to related and state-of-the-art methods while maintaining better resource efficiency. We refer to Appendix F for additional details on TSσBN integration, datasets, protocols, and baselines.
Convolutional Neural Networks. We evaluate TSσBN on CNNs in two settings: models trained from scratch and initialized from pretrained backbones. For models trained from scratch, we follow standard protocols on NYUv2 (3-task) using SegNet (Badrinarayanan et al., 2017) as in Liu et al. (2019), and on Cityscapes (3-task) using DeepLabV3 (Chen, 2017) following Liu et al. (2022a). We also evaluate on CelebA, which contains 40 binary classification tasks, and adopt the CNN architecture used in Liu et al. (2024); Ban & Ji (2024). For pretrained CNNs, we integrate TSσBN into LibMTL (Lin & Zhang, 2023) using DeepLabV3 with a pretrained ResNet50 backbone on NYUv2 (3-task) and Cityscapes (2-task). This allows comparison to a wide range of recent MTL baselines under a consistent framework.
Vision Transformers. We evaluate TSσBN on two transformer-based MTL setups that reflect the current state of the art: MoE-style modulation and parameter-efficient adapter-based methods. Both settings use pretrained Vision Transformer backbones with CNN-based fusion or downsampling modules before task-specific decoders. For recent MoE MTL methods, we follow the MLoRE protocol (Yang et al., 2024) on PascalContext (5-task). We use a pretrained ViT-S backbone (Dosovitskiy et al., 2021) and fine-tune the entire model. We also evaluate TSσBN on the MTLoRA benchmark (Agiza et al., 2024), which focuses on parameter-efficient MTL. This setup uses a partially frozen Swin-T (Liu et al., 2021c) backbone on PascalContext (4-task). We compare against a wide range of LoRA- and adapter-based models reported in MTLoRA. To showcase compatibility, we also evaluate TSσBN with added task-generic (shared) LoRA (r = 16) adapters.
Multi-task evaluation. Following Maninis et al. (2019), to evaluate a multi-task model we compute the average per-task performance gain or drop relative to a baseline B specified in the top row of the results tables: Δm% = (1/T) Σ_{t=1}^{T} (−1)^{δ_t} (M_{m,t} − M_{B,t}) / M_{B,t} × 100, where M_{m,t} is the performance of a model m on a task t, and δ_t is an indicator variable that is 1 if a lower value shows better performance for the metric of task t. All results are presented as an average over three independent runs. Additionally, we report parameters (P) and FLOPs (F) relative to the baseline.
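For reference, a small sketch of this metric, with task metrics and the lower-is-better indicators passed in as plain lists:

```python
def delta_m_percent(model_metrics, base_metrics, lower_is_better):
    """Average relative per-task gain/drop of model m w.r.t. baseline B, in percent."""
    total = 0.0
    for M_m, M_B, lower in zip(model_metrics, base_metrics, lower_is_better):
        sign = -1.0 if lower else 1.0          # (-1)^{delta_t}: flip sign when lower is better
        total += sign * (M_m - M_B) / M_B * 100.0
    return total / len(model_metrics)
```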
Baselines. Across all experiments, we compare TSσBN to a set of standard and protocol-specific multi-task baselines. The most common reference points are Single-Task Learning (STL), which trains a separate model for each task, and Hard Parameter Sharing (HPS), which shares the entire backbone with equal task weights. We also include TSBN, the multi-task equivalent of domain-specific BN, which simply duplicates BN layers without our reparameterization and optimization changes. Each experimental setting includes additional baselines that follow the protocol and architecture family, reflecting standard practice in prior work and ensuring fair comparisons. For completeness, we also report results for multi-task optimization methods in Appendix G.

We note that even the simpler TSBN variant (without sigmoid and differential learning rates) delivers competitive performance out of the box, suggesting that complex architectures may be unnecessarily over-engineered. Overall, TSσBN achieves the best balance of accuracy, efficiency, and simplicity, consistently outperforming specialized MTL architectures across CNNs and transformers, while scaling to many-task regimes.

We analyze the impact of different learning rate multipliers applied to the σBN layers, focusing on their effect on the distribution of scaling parameters γ_t and overall model performance.
Figure 5 illustrates how varying the α_σBN multiplier influences the distribution of σ(γ_t) values across all filters. A more detailed task-wise breakdown is provided in the Appendix. Higher learning rates induce greater parameter variance, increasing expressivity. Since σ(γ_t) is initialized at 0.5, lower learning rates result in minimal divergence, with α_σBN = 1 being excluded as it shows almost no differentiation between tasks. At α_σBN = 100, we see a substantial spread in σ(γ_t) values across the full [0, 1] range, allowing tasks to choose and specialize on subsets of filters. However, an extreme learning rate of α_σBN = 10³ leads to a highly polarized distribution, where filter importances collapse to a binary mask, effectively enforcing a hard-partitioning regime. These findings highlight how σBN learning rates control the degree of task-specific capacity allocation, influencing both representation disentanglement and network adaptability.
We further analyze the impact of different learning rate multipliers on MTL performance in Table 6. For TSBN, moderate multipliers yield small gains, but performance collapses at high rates. In contrast, σBN consistently benefits from larger multipliers across values, indicating that the sigmoid activation is essential both for unlocking greater improvements and for robustness.

A well-known challenge in multi-task learning is the discrepancy in loss scales and, consequently, gradient magnitudes across tasks, which can lead to task dominance and suboptimal performance. Many existing approaches rely on manual tuning or specialized optimization strategies for dynamic weighting. Our method is highly robust to perturbations of loss scales without any additional changes.
To evaluate the robustness of our method to loss weight perturbations, we conduct a series of experiments on NYUv2 by varying the weight of each task. Specifically, we scale each task loss by factors of {0.5, 1.5, 2.0} while maintaining the default weight of 1.0 for the remaining tasks. The distribution of relative performances under these perturbations is visualized in Figure 5. TSσBN shows the lowest variance under loss scale perturbations, indicating robustness to task dominance and improved optimization stability.
We present TSσBN, a simple soft-sharing mechanism for multi-task learning that relies only on task-specific normalization layers. Using a sigmoid-gated reparameterization and differential learning rates, our method turns BN from a normalization module into a stable and expressive tool for capacity allocation and interference reduction.
Across convolutional and transformer architectures, TSσBN achieves competitive or superior performance while using substantially fewer parameters. Notably, it matches or outperforms state-of-the-art MoE-style and PEFT-based MTL methods without adding routing modules, experts, or adapters. The learned gates also provide a direct view of model behavior, yielding interpretable measures of capacity allocation, filter specialization, and task relationships.
Overall, our results show that lightweight, normalization-driven designs can replace much heavier mechanisms while offering clearer interpretability. We hope this encourages a reevaluation of complexity in MTL and promotes simple, transparent alternatives.
To further investigate task interference, we expand on the analysis presented in Section 4 and provide a more comprehensive view of gradient conflicts across all task pairs for NYUv2. Specifically, in Figure 7 we plot the distribution of cosine similarities between gradients for every task pair across the shared parameters of the SegNet backbone.
In addition to the methods discussed in the main paper, we include Task-Specific Batch Normalization (TSBN) as a baseline. Interestingly, TSBN alone is sufficient to induce a mode around orthogonality, demonstrating that normalization can already reduce some degree of task interference. However, incorporating σBN significantly amplifies this effect, further increasing the number of near-orthogonal gradients and reducing interference. This highlights the role of σBN in not only mitigating conflicts but also improving gradient disentanglement across tasks.
It is important to note that the presented gradient distributions are measured after one epoch of training over the training set. As training progresses, we observe that the differences between methods become less pronounced. Regardless of the initial distribution, all approaches gradually converge toward a bell-shaped distribution centered around orthogonality. This suggests that while early-stage interference may impact optimization dynamics, multi-task models eventually adjust to reduce conflicts over time.
A notable exception is observed in MTAN, which produces more aligned gradients specifically for the semantic segmentation and surface normal estimation task pair. Despite this alignment, we do not observe a corresponding performance gain. This suggests that while reducing conflicts is beneficial, not all aligned gradients lead to improved task synergy, underscoring the notion that mitigating interference alone does not guarantee optimal performance.
We extend Figure 2 from Section 4 by visualizing encoder representations for all 40 tasks in the CelebA setting. As before, we use t-SNE to project the high-dimensional representations into a more interpretable space. Each data point is assigned representations for every task due to the nature of the soft parameter sharing paradigm, resulting in multiple embeddings per sample. In Figure 8, we observe that most tasks form well-separated clusters, though a few outliers exhibit some degree of overlap.
We utilize the CelebA dataset to identify relationships and hierarchies among the 40 binary classification tasks of facial attributes. We compute pairwise cosine similarities between task importance vectors, producing a task similarity matrix S ∈ ℝ^{T×T} with entries S_{t,t'} = (I_t · I_{t'}) / (‖I_t‖₂ ‖I_{t'}‖₂), which serves as the foundation for identifying task clusters and hierarchies. Crucially, for these relationships to be useful, they must be robust; unstable hierarchies would offer little insight into model behavior or optimization dynamics. We find the relationships from TSσBN to be highly stable, with an average Spearman rank correlation of 0.8 between similarity matrices from seven independent training runs.
For a qualitative assessment of task relationships, we compute representative clusters of tasks from the seven runs. To achieve this, we construct a co-occurrence matrix that captures the frequency with which each pair of tasks appears in the same cluster. This co-occurrence matrix effectively aggregates clustering information from all runs, highlighting task pairs that consistently exhibit strong relationships regardless of initialization. We then apply hierarchical clustering directly to this matrix to identify a representative cluster of tasks that frequently co-occur.
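A sketch of this aggregation step, assuming per-run cluster labels have already been obtained from each run's similarity matrix; the number of representative clusters is a hypothetical parameter here, not the value used in our experiments.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def representative_clusters(run_labels, num_clusters=8):
    """Aggregate per-run clusterings into a co-occurrence matrix and re-cluster it.
    run_labels: list of (T,) integer arrays, one cluster assignment per run."""
    T = len(run_labels[0])
    co = np.zeros((T, T))
    for labels in run_labels:
        co += (labels[:, None] == labels[None, :]).astype(float)
    co /= len(run_labels)                       # fraction of runs in which tasks i, j co-cluster
    dist = 1.0 - co                             # convert co-occurrence into a distance
    Z = linkage(dist[np.triu_indices(T, k=1)], method="average")  # condensed distance form
    return fcluster(Z, t=num_clusters, criterion="maxclust")
```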
The identified clusters exhibit apparent semantic coherence, as shown in Table 5. Since these clusters are derived from filter-usage-based relationships, tasks grouped together tend to rely on similar specialized filters within the network. This suggests that the model internally organizes tasks based on shared feature representations. Notably, the clustering patterns appear to correlate with the spatial proximity of facial attributes. For instance, tasks related to hair characteristics (e.g., Bangs, Blond Hair) form a distinct cluster. In contrast, facial hair attributes (e.g., Goatee, Mustache) are grouped separately, indicating that the network leverages localized feature detectors. This spatial coherence reinforces the idea that task relationships emerge from shared activations of filters sensitive to specific facial regions, reflecting the model's ability to capture both semantic and structural commonalities across tasks.
We extend the ablation study from Section 7.1, investigating the impact of discriminative learning rates for σBN layers. Specifically, we apply a higher learning rate to BN parameters, allowing them to adapt more rapidly to the shared convolutional layers before those layers undergo significant updates. This adjustment is controlled by a multiplier applied to the model’s base learning rate.
In this more detailed analysis, we examine the importance distributions of filters per task across different learning rate multipliers. Figure 9 presents the resulting distributions for four multiplier values: 10⁰, 10¹, 10², and 10³. As the multiplier increases, the variance of filter importance distributions grows, leading to progressively softer filter allocations. At a multiplier of 1, BN parameters remain close to their initialization, resulting in near-uniform filter sharing across tasks, similar to hard parameter sharing. On the opposite extreme, a multiplier of 10³ effectively induces a binary filter mask, resembling a hard partitioning approach. Notably, σBN plays a crucial role in stabilizing this process, as its sigmoid activation mitigates potential gradient explosion. We use α_σBN = 10² in all our experiments.
To further investigate MTL capacity allocation using the TSσBN framework, we conduct a synthetic experiment designed to systematically control task difficulty and relationships. Specifically, we modify the NYUv2 dataset by removing the surface normals estimation task and replacing it with a noisy variant of the depth estimation task. We generate a family of datasets where the additional depth task is corrupted by Gaussian noise of increasing variance. Formally, given the original depth labels D, we construct synthetic tasks D̃_ξ = D + ε with ε ∼ N(0, ξ · Var(D)),
where ξ controls the level of corruption as a scaling factor on the original depth task's variance. Using TSσBN, we analyze how model capacity is allocated between shared and task-specific components, as well as how task relationships change, by computing cosine similarity over task importance vectors.
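A sketch of how such a corrupted task could be generated, under the assumption that the noise variance is ξ times the variance of the original depth labels; the function name is illustrative.

```python
import torch

def make_noisy_depth(depth, xi):
    """Return a corrupted copy of the depth labels with Gaussian noise whose
    variance is xi times the variance of the original depth map."""
    noise_std = (xi * depth.var()).sqrt()
    noise = torch.randn(depth.shape, device=depth.device) * noise_std
    return depth + noise
```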
In Figure 10 we plot the decomposed task capacities and pairwise similarities for datasets with ξ ranging between [0, 3]. As expected, when ξ is low, the original and noisy depth tasks exhibit strong alignment, reinforcing high shared capacity. However, as ξ increases, the similarity between the tasks decreases, and their filter allocations become more distinct, with independent capacity increasing. This aligns with our hypothesis that related tasks co-adapt to share resources, whereas unrelated tasks require greater specialization. Overall, this experiment highlights how TSσBN automatically balances shared and independent capacity in response to increasing task difficulty and lower task similarity.
Hardware. Experiments on NYUv2 and Cityscapes were run on an NVIDIA RTX 3090 GPU. Due to higher memory requirements, CelebA (40 tasks) and transformer-based models were trained on an NVIDIA A100 GPU.
NYUv2. We follow the setup of Liu et al. (2019; 2024) for base architecture, training configuration, and evaluation metrics. A multi-task SegNet is used, with both encoder and decoder shared across tasks and lightweight task-specific heads composed of two convolutional layers. All methods are trained with Adam (lr = 10⁻⁴), using a step schedule that halves the learning rate at epoch 100.
Training runs for 200 epochs with a batch size of 4.
Cityscapes. Following Liu et al. (2022a), we use DeepLabV3 with a ResNet-50 backbone and task-specific ASPP decoders, which account for most of the parameters. Optimization is performed with SGD (lr = 10⁻², weight decay = 10⁻⁴, momentum = 0.9) for 200 epochs using a CosineAnnealing scheduler and a batch size of 4. For TSσBN layers, weight decay is disabled.
CelebA. We adopt the configuration from Liu et al. (2024); Ban & Ji (2024), using a shared CNN backbone with task-specific linear classifiers. Models are trained for 15 epochs with Adam (lr = 3 × 10⁻⁴) and a batch size of 256.
Implementation. Converting pretrained BN layers into σBN depends on their weights. A network trained from scratch may learn a purely linear transformation, but converting an affine layer to linear is not possible unless β = 0. To avoid conversion shock, we copy the pretrained biases but keep them frozen during training. In ResNet-50 pretrained on ImageNet, most BN scale parameters (γ) fall within (0,1), allowing them to be represented by the sigmoid function. We therefore apply the inverse sigmoid to initialize σBN scales, ensuring consistency with the pretrained distribution.
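This conversion can be sketched as follows, building on the earlier `SigmoidBN2d` example and assuming its forward pass is extended to add the frozen bias after the gate; the helper names are illustrative.

```python
import torch

def logit(p, eps=1e-4):
    """Inverse sigmoid, clamping so pretrained scales stay inside (0, 1)."""
    p = p.clamp(eps, 1.0 - eps)
    return torch.log(p / (1.0 - p))

def init_from_pretrained_bn(sigma_bn, pretrained_bn):
    """Initialize a sigma-BN layer from a pretrained BatchNorm2d."""
    with torch.no_grad():
        # sigma(gamma) reproduces the pretrained scale at the first forward pass.
        sigma_bn.gamma.copy_(logit(pretrained_bn.weight))
        # Reuse pretrained running statistics so normalization is unchanged initially.
        sigma_bn.bn.running_mean.copy_(pretrained_bn.running_mean)
        sigma_bn.bn.running_var.copy_(pretrained_bn.running_var)
    # Copied but frozen bias, added after the gate in the extended forward pass.
    sigma_bn.register_buffer("beta", pretrained_bn.bias.detach().clone())
```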
Table (caption): Comparison of various multi-task architectures within the LibMTL framework using DeepLabV3 with a pretrained ResNet-50 backbone on NYUv2 (3-task) and Cityscapes (2-task). TSσBN achieves the best overall performance while being the most parameter-efficient.
We provide comprehensive experimental details in the main text in Section 6 and Appendix F, including datasets, architectures, training protocols, and evaluation metrics.