When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer

Reading time: 5 minutes

📝 Original Info

  • Title: When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer
  • ArXiv ID: 2602.17144
  • Date: 2026-02-19
  • Authors: Not stated in the provided paper metadata. (Please check the original paper if needed.)

📝 Abstract

Learning to Defer (L2D) enables a classifier to abstain from predictions and defer to an expert, and has recently been extended to multi-expert settings. In this work, we show that multi-expert L2D is fundamentally more challenging than the single-expert case. With multiple experts, the classifier's underfitting becomes inherent, which seriously degrades prediction performance, whereas in the single-expert setting it arises only under specific conditions. We theoretically reveal that this stems from an intrinsic expert identifiability issue: learning which expert to trust from a diverse pool, a problem that is absent in the single-expert case and that renders existing underfitting remedies ineffective. To tackle this issue, we propose PiCCE (Pick the Confident and Correct Expert), a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence. PiCCE effectively reduces multi-expert L2D to a single-expert-like learning problem, thereby resolving multi-expert underfitting. We further prove its statistical consistency and ability to recover class probabilities and expert accuracies. Extensive experiments across diverse settings, including real-world expert scenarios, validate our theoretical results and demonstrate improved performance.

💡 Deep Analysis

📄 Full Content

In risk-critical machine learning tasks, misclassification can be fatal. Unlike ordinary machine learning pipelines that deploy models solely for prediction, the Learning to Defer (L2D) paradigm [16,25,1] aims to enhance system reliability by integrating human experts' decisions. This framework allows a system to defer the prediction of a sample to an expert if it deems the expert more capable of providing a correct prediction. Due to its practical importance, L2D has attracted significant attention in recent years, with substantial works extending the framework to more complex and realistic settings [9,38,5,39,20,40,37,6,30,22,12,13,29].

Despite the conceptual appeal of L2D, training such systems remains computationally challenging. The performance is typically evaluated by the overall system accuracy, an objective that is non-convex and discrete, making it difficult to optimize directly. To render the training tractable, the dominant approach in the literature relies on continuous surrogate losses. These methods are often designed to satisfy statistical consistency, ensuring that optimizing the surrogate asymptotically recovers the optimal L2D rule [25,38,3,15,33,32]. Beyond surrogate-based training, alternative strategies such as post-hoc methods have also been proposed to bypass the direct optimization of the deferral policy [26,17,19,24,23].
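To make the surrogate idea concrete, below is a minimal sketch of a widely used consistent surrogate for single-expert L2D: the model produces one score per class plus one "defer" score, and a cross-entropy-style loss rewards the defer score whenever the expert's prediction matches the label. This is an illustrative sketch in that spirit, not the exact loss of any one cited paper.

```python
import numpy as np

def l2d_ce_surrogate(logits, y, expert_pred):
    """Cross-entropy-style surrogate for single-expert learning to defer (sketch).

    logits: array of shape (n_classes + 1,); the last entry is the 'defer' score.
    Loss = -log p_y - 1[expert correct] * log p_defer,
    where p is a softmax over all n_classes + 1 scores.
    """
    z = logits - logits.max()                # numerical stability
    log_p = z - np.log(np.exp(z).sum())      # log-softmax
    loss = -log_p[y]                         # encourage predicting the true class
    if expert_pred == y:                     # reward deferring when the expert is right
        loss -= log_p[-1]
    return loss
```

With uniform logits over 3 classes plus the defer option, the loss is `log 4` when the expert is wrong and `2 log 4` when the expert is right, reflecting the extra deferral term.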

While the majority of existing L2D research focuses on the single-expert setting, real-world applications often involve a pool of experts with complementary skills. This shift significantly increases the difficulty of the learning problem. In the single-expert case, the system only needs to make a binary decision of whether to predict or defer. In the multi-expert setting, however, the system must determine not only when to defer but also which specific expert is the most reliable for a given input. To address this challenge, Verma et al. [39] extended classic single-expert methods [25,38] to the multi-expert domain. This approach was subsequently shown to be a special case of the unified consistency framework proposed by Mao et al. [19]. In addition to these surrogate-based methods, Mao et al. [17,21] developed two-stage methods to explicitly handle expert selection processes.
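The multi-expert extension replaces the single defer score with one deferral score per expert, and the loss sums a deferral term over every expert that is correct on the sample. The sketch below illustrates this structure (assumptions: a softmax over `n_classes + K` scores and an indicator-weighted sum, in the spirit of the multi-expert CE surrogate of Verma et al. [39]); this summed term is the "expert aggregation term" the paper later blames for classifier underfitting.

```python
import numpy as np

def multi_expert_ce(logits, y, expert_preds):
    """Multi-expert CE-style surrogate (illustrative sketch).

    logits: array of shape (n_classes + K,) — one score per class,
            plus one 'defer to expert j' score for each of K experts.
    Loss = -log p_y - sum_j 1[expert j correct] * log p_{defer_j}.
    The more experts are correct on a sample, the more mass the loss
    pushes onto deferral scores and away from the class scores.
    """
    z = logits - logits.max()
    log_p = z - np.log(np.exp(z).sum())      # log-softmax over classes + experts
    n_classes = len(logits) - len(expert_preds)
    loss = -log_p[y]
    for j, m in enumerate(expert_preds):
        if m == y:                           # aggregation term: one per correct expert
            loss -= log_p[n_classes + j]
    return loss
```

With uniform logits over 3 classes and 2 experts, the loss grows from `2 log 5` (one correct expert) to `3 log 5` (both correct): adding correct experts inflates the deferral side of the objective even though the classifier's task is unchanged.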

While surrogate-based methods have proven effective in standard single-expert L2D settings, classifier underfitting has been observed in a specific non-conventional scenario involving explicit deferral costs, where the weakened classifier significantly impairs the system's final accuracy and reliability. In this particular case, the issue arises from a redundant label-smoothing term induced by the cost parameter and has been successfully mitigated by specialized techniques [26,15]. Consequently, the standard cost-free setting is generally considered to be free from such issues.

Figure 1: Left: Illustration of underfitting when using the multi-expert CE surrogate loss proposed by Verma et al. [39] on ImageNet. We consider a MobileNet-v2 model and progressively introduce "dog experts", where each expert covers a domain consisting of 5 dog species, attaining 85% accuracy on its domain, 75% on the other dog species, and random guessing on the remaining classes. Since the experts have non-overlapping domains, adding more experts strictly increases the aggregate accuracy of the expert set. We report the test accuracy of both the system and the classifier. Right: An illustration of the degraded distribution for a 5-class classification task. We present the predicted class-posterior probabilities for an instance in descending order, before and after the introduction of three experts.

Current research in multi-expert L2D primarily focuses on selecting the optimal expert and typically operates under the conventional setting without extra deferral costs. However, we identify that classifier underfitting unexpectedly persists in this general multi-expert setting (as in Figure 1a) despite the complete absence of deferral costs. Our analysis attributes this to the expert aggregation term inherent to multi-expert objectives, rather than the cost-induced smoothing of single-expert cases. Since the underlying cause is fundamentally different, previous remedies are inapplicable, highlighting the need for alternative solutions.

To address these issues, we propose the Pick the Confident and Correct Expert (PiCCE) loss formulation, which mitigates the underfitting caused by the expert aggregation term by exploiting empirical ground-truth information. We further provide theoretical guarantees for PiCCE from both optimization and statistical efficiency perspectives. Our main contributions are:
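The exact PiCCE formulation is given in the paper; as a rough, hypothetical illustration of the selection idea its name suggests, the sketch below keeps, among the experts whose prediction matches the label, only the one with the highest confidence. Training against a single selected expert per sample would then look like the single-expert case, avoiding the summed aggregation term. The function name and confidence inputs are assumptions for illustration only.

```python
def pick_confident_correct_expert(expert_preds, expert_conf, y):
    """Hypothetical sketch of a PiCCE-style selection rule (not the paper's exact loss).

    expert_preds: list of predicted labels, one per expert.
    expert_conf:  list of confidence scores, one per expert.
    y:            ground-truth label for this sample.
    Returns the index of the most confident correct expert, or None if
    no expert is correct (in which case there is no reliable expert to defer to).
    """
    correct = [j for j, m in enumerate(expert_preds) if m == y]
    if not correct:
        return None
    return max(correct, key=lambda j: expert_conf[j])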

• In Section 3, we identify that classifier underfitting persists in the general multi-expert setting even without deferral costs, which contradicts the intuition from single-expert L2D. We theoretically attribute this phenomenon to the expert aggregation term, which fundamentally differs from the cost-induced smoothing of the single-expert case.

Reference

This content is AI-processed based on open access ArXiv data.
