When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer


Shuqi Liu¹*, Yuzhou Cao¹*, Lei Feng², Bo An¹, Luke Ong¹
¹Nanyang Technological University   ²Southeast University   *Equal Contribution

Abstract

Learning to Defer (L2D) enables a classifier to abstain from predictions and defer to an expert, and has recently been extended to multi-expert settings. In this work, we show that multi-expert L2D is fundamentally more challenging than the single-expert case. With multiple experts, the classifier's underfitting becomes inherent and seriously degrades prediction performance, whereas in the single-expert setting it arises only under specific conditions. We theoretically trace this to an intrinsic expert identifiability issue: learning which expert to trust from a diverse pool, a problem absent in the single-expert case that renders existing underfitting remedies ineffective. To tackle this issue, we propose PiCCE (Pick the Confident and Correct Expert), a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence. PiCCE effectively reduces multi-expert L2D to a single-expert-like learning problem, thereby resolving multi-expert underfitting. We further prove its statistical consistency and its ability to recover class probabilities and expert accuracies. Extensive experiments across diverse settings, including real-world expert scenarios, validate our theoretical results and demonstrate improved performance.

1 Introduction

In risk-critical machine learning tasks, misclassification can be fatal. Unlike ordinary machine learning pipelines that deploy models solely for prediction, the Learning to Defer (L2D) paradigm [16, 25, 1] aims to enhance system reliability by integrating human experts' decisions. This framework allows a system to defer the prediction of a sample to an expert if it deems the expert more capable of providing a correct prediction.
Due to its practical importance, L2D has attracted significant attention in recent years, with substantial works extending the framework to more complex and realistic settings [9, 38, 5, 39, 20, 40, 37, 6, 30, 22, 12, 13, 29].

Despite the conceptual appeal of L2D, training such systems remains computationally challenging. Performance is typically evaluated by the overall system accuracy, an objective that is non-convex and discrete, making it difficult to optimize directly. To render training tractable, the dominant approach in the literature relies on continuous surrogate losses. These methods are often designed to satisfy statistical consistency, ensuring that optimizing the surrogate asymptotically recovers the optimal L2D rule [25, 38, 3, 15, 33, 32]. Beyond surrogate-based training, alternative strategies such as post-hoc methods have also been proposed to bypass the direct optimization of the deferral policy [26, 17, 19, 24, 23].

While the majority of existing L2D research focuses on the single-expert setting, real-world applications often involve a pool of experts with complementary skills. This shift significantly increases the difficulty of the learning problem. In the single-expert case, the system only needs to make a binary decision of whether to predict or defer. In the multi-expert setting, however, the system must determine not only when to defer but also which specific expert is the most reliable for a given input. To address this challenge, Verma et al. [39] extended classic single-expert methods [25, 38] to the multi-expert domain. This approach was subsequently shown to be a special case of the unified consistency framework proposed by Mao et al. [19]. In addition to these surrogate-based methods, Mao et al. [17, 21] developed two-stage methods to explicitly handle the expert selection process.
While surrogate-based methods have proven effective in standard single-expert L2D settings, classifier underfitting has been observed in a specific non-conventional scenario involving explicit deferral costs, where the weakened classifier significantly impairs the system's final accuracy and reliability. In this particular case, the issue arises from a redundant label-smoothing term induced by the cost parameter and has been successfully mitigated by specialized techniques [26, 15]. Consequently, the standard cost-free setting is generally considered to be free from such issues.

Figure 1: Left (a): Illustration of underfitting when using the multi-expert CE surrogate loss proposed by Verma et al. [39] on ImageNet. We consider a MobileNet-v2 model and progressively introduce "dog experts", where each expert covers a domain consisting of 5 dog species, attaining 85% accuracy on its domain, 75% on the other dog species, and random guessing on the remaining classes. Since the experts have non-overlapping domains, adding more experts strictly increases the aggregate accuracy of the expert set. We report the test accuracy of both the system and the classifier. Right (b): An illustration of the degraded distribution for a 5-class classification task. We present the predicted class-posterior probabilities for an instance in descending order before and after the introduction of three experts.

Current research in multi-expert L2D primarily focuses on selecting the optimal expert and typically operates under the conventional setting without extra deferral costs. However, we identify that classifier underfitting unexpectedly persists in this general multi-expert setting (as in Figure 1a) despite the complete absence of deferral costs. Our analysis attributes this to the expert aggregation term inherent to multi-expert objectives rather than the cost-induced smoothing in single-expert cases.
Since the underlying cause is fundamentally different, previous remedies are inapplicable, highlighting the need for alternative solutions. To address these issues, we propose the Picking the Confident and Correct Expert (PiCCE) loss formulation, which mitigates the underfitting caused by the expert aggregation term by exploiting empirical/ground-truth information. We further provide theoretical guarantees for the proposed PiCCE from both optimization and statistical efficiency perspectives. Our main contributions are:

• In Section 3, we identify that classifier underfitting persists in the general multi-expert setting even without deferral costs, which contradicts the intuition from single-expert L2D. We theoretically attribute this phenomenon to the expert aggregation term, which fundamentally differs from the cost-induced label smoothing observed in prior literature.

• Given that the distinct cause of underfitting renders prior remedies ineffective, we introduce PiCCE in Section 4, a surrogate-based method that dynamically selects experts in a data-dependent manner. We demonstrate that it enjoys favorable optimization properties and resolves underfitting by significantly compressing the expert aggregation term.

• In Section 5, we analyze the statistical consistency of PiCCE. We prove that PiCCE is not only consistent but also capable of accurately recovering both the underlying class probabilities and expert accuracies.

• We experimentally verify the underfitting caused by expert aggregation in Section 6 and demonstrate the effectiveness of PiCCE using both synthetic and real-world experts.

2 Preliminaries

In this section, we first introduce the problem formulation of L2D with multiple experts and existing solutions, then review the issue of underfitting under the single-expert setting.
2.1 Problem Formulation and Existing Losses for L2D with Multiple Experts

Data Generating Distribution: Denote by $\mathcal{X} \subseteq \mathbb{R}^d$ the feature space, $\mathcal{Y} := [K]$ the label space, and $\mathcal{M} = \mathcal{Y}$ the expert prediction space. The expert prediction space for $J$ experts is $\mathcal{M}^J = [K]^J$, and $m_j$ is the prediction of the $j$-th expert for any $m \in \mathcal{M}^J$. The data generating distribution of L2D with multiple experts is $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y} \times \mathcal{M}^J$, and the density of this distribution is $p(x, y, m)$. We assume the training data set consists of $n$ i.i.d. samples drawn from $\mathcal{D}$.

Technical Notations: Let us write $\eta_y(x) := \Pr(Y = y \mid X = x)$. The $i$-th expert's prediction accuracy at point $x$ is denoted by $\mathrm{Acc}_i(x) := \Pr(M_i = Y \mid X = x)$. We define $\operatorname{argmax}_{i \in \mathcal{I}} \theta_i \in \mathcal{I}$ to break ties arbitrarily and deterministically, while $\operatorname{Argmax}$ denotes the original set of maximum indices. Define the augmented label space as $\mathcal{Y}^\perp = \mathcal{Y} \cup \{\perp_1, \cdots, \perp_J\}$, where $\perp_j$ refers to the option of deferring to the $j$-th expert.

Problem Setup: The aim of L2D with multiple experts is to learn a model $f: \mathcal{X} \to \mathcal{Y}^\perp$ with performance evaluated by the following target system loss [39]:

$$\ell^{\perp}_{01}(f(x), y, m) := \begin{cases} \mathbb{1}[f(x) \neq y], & \text{if } f(x) \in \mathcal{Y}, \\ \mathbb{1}[m_j \neq y], & \text{if } f(x) = \perp_j, \end{cases} \tag{1}$$

where $\mathbb{1}[f(x) \neq y]$ and $\mathbb{1}[m_j \neq y]$ represent the zero-one losses of the classifier and the $j$-th expert, respectively, taking value 1 if the corresponding prediction is incorrect. The risk minimization problem w.r.t. (1) is defined as below:

$$\min_f R^{\perp}_{01}(f) = \mathbb{E}_{p(x, y, m)}\big[\ell^{\perp}_{01}(f(x), y, m)\big], \tag{2}$$

where the target risk $R^{\perp}_{01}(f)$ is the misclassification error of the whole L2D system over $\mathcal{D}$. Its Bayes optimal solution is

$$f^*(x) = \begin{cases} \perp_{j^*}, & \text{if } \mathrm{Acc}_{j^*}(x) \geq \eta_{y^*}(x), \\ \operatorname{argmax}_{y \in [K]} \eta_y(x), & \text{else}, \end{cases} \tag{3}$$

where $y^* := \operatorname{argmax}_{y \in [K]} \eta_y(x)$ denotes the optimal label, and $j^* := \operatorname{argmax}_{j \in [J]} \mathrm{Acc}_j(x)$ is the most accurate expert.
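The optimal rule (3) is simple to apply once the class posterior and the expert accuracies are known. A minimal NumPy sketch with hypothetical values (the function name and numbers are ours, for illustration only):

```python
import numpy as np

def bayes_optimal_decision(eta, acc):
    """Bayes-optimal L2D decision (3): defer to the best expert j* when
    Acc_{j*}(x) >= max_y eta_y(x); otherwise predict the top class.
    Returns ("defer", j*) or ("predict", y*)."""
    y_star = int(np.argmax(eta))   # most probable label
    j_star = int(np.argmax(acc))   # most accurate expert
    if acc[j_star] >= eta[y_star]:
        return ("defer", j_star)
    return ("predict", y_star)

# Hypothetical 3-class posterior at some x, and two experts.
eta = np.array([0.6, 0.3, 0.1])
print(bayes_optimal_decision(eta, np.array([0.5, 0.4])))  # classifier wins
print(bayes_optimal_decision(eta, np.array([0.7, 0.4])))  # defer to expert 0
```
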
Intuitively, the optimal L2D system triggers deferral options when an expert outperforms the classifier, ensuring each decision is handled by the most competent entity.

Existing Loss Frameworks: Given the discontinuous nature of the multi-expert risk (2), which mirrors the single-expert case, surrogate losses have been proposed to make the optimization problem tractable. While Hemmer et al. [11] introduced a mixture-of-experts method, Verma et al. [39] showed that it is inconsistent and overcame this limitation by generalizing the single-expert CE [25] and OvA [38] surrogate losses to the multi-expert setting, which can be further unified into a general framework [17]:

$$\ell_\phi(\theta, y, m) := \phi(\theta, y) + \sum_{j=1}^{J} \mathbb{1}[m_j = y]\, \phi(\theta, j + K), \tag{4}$$

where $\phi: \mathbb{R}^{K+J} \times [K+J] \to \mathbb{R}_{\geq 0}$ is a multiclass loss function and $\theta \in \mathbb{R}^{K+J}$ denotes the score vector produced by the model for input $x$ via a scoring function $g: \mathcal{X} \to \mathbb{R}^{K+J}$, i.e., $g(x) = \theta$. For brevity, we omit the dependence of $\theta$ on $x$ in the rest of this paper, w.l.o.g., since the risk is defined pointwise as an expectation over $x$. Note that (4) is consistent provided that the multiclass surrogate loss $\phi$ is consistent [5, 19], i.e., minimizing the expected surrogate risk w.r.t. (4) recovers the Bayes optimal solution (3). The L2D system $f$ is implemented via the composition $\varphi \circ g$, i.e., $f(x) = (\varphi \circ g)(x)$, where $\varphi: \mathbb{R}^{K+J} \to \mathcal{Y}^\perp$ is a prediction link defined as:

$$\varphi(\theta) = \begin{cases} \operatorname{argmax}_y \theta_y, & \text{if } \operatorname{argmax}_y \theta_y \in [K], \\ \perp_{(\operatorname{argmax}_y \theta_y - K)}, & \text{else}. \end{cases} \tag{5}$$

2.2 Underfitting Issues in L2D

When $J = 1$, the above problem reduces to the single-expert L2D setting [25, 38, 3]. A specially studied case further incorporates a non-zero deferral cost $c$ [26, 15], reflecting the practical constraint that expert consultation often incurs additional computational or monetary expenses.
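As a concrete illustration of the framework (4) and the link (5), the multi-expert CE surrogate can be sketched as follows (NumPy, 0-indexed so class indices lie in [0, K) and the J defer scores occupy the last entries; a sketch, not the authors' implementation):

```python
import numpy as np

def ce_phi(theta, k):
    """Cross-entropy multiclass loss phi(theta, k) = -log softmax(theta)_k."""
    z = theta - theta.max()  # shift for numerical stability
    return -(z[k] - np.log(np.exp(z).sum()))

def multi_expert_surrogate(theta, y, m, K):
    """Surrogate (4): phi(theta, y) + sum_j 1[m_j = y] * phi(theta, j + K).
    theta has K + J entries; m holds the J expert predictions."""
    loss = ce_phi(theta, y)
    for j, m_j in enumerate(m):
        if m_j == y:  # the defer term is active only for correct experts
            loss += ce_phi(theta, K + j)
    return loss

def predict_link(theta, K):
    """Link (5): return a class index in [0, K) or ("defer", j)."""
    top = int(np.argmax(theta))
    return top if top < K else ("defer", top - K)

# K = 3 classes, J = 2 experts; the highest score sits on expert 1.
theta = np.array([2.0, 0.5, 0.1, 1.0, 3.0])
print(predict_link(theta, 3))  # ('defer', 1)
```
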
Under this formulation, compared with (1), the expert-related loss becomes $\mathbb{1}[m \neq y] + c$. Consequently, the optimal system defers only when the expert's accuracy exceeds the classifier's by at least a margin of $c$. However, the inclusion of $c > 0$ has been found to trigger classifier underfitting [26, 15]. This issue is largely rooted in the learning objective's structure: specifically, Narasimhan et al. [26] identified that in both CE-based [25] and OvA-based [38] surrogates, the presence of $c$ introduces a (redundant) label-smoothing term [36] that hampers the classifier's learning. Furthermore, Liu et al. [15] showed that this cost tends to induce a flattened label distribution that scales with the number of classes $K$, making the classifier more likely to incorrectly swap the optimal label with a competing one. To mitigate these $c$-induced underfitting issues, a post-hoc estimator was proposed by Narasimhan et al. [26]. Inspired by the theoretical success of the end-to-end approaches [25, 38], Liu et al. [15] further provided a one-stage loss framework that eliminates the redundant label-smoothing term by utilizing intermediate learning results.

3 More Experts, Worse Performance: A New Underfitting Challenge

In Section 2.2, we reviewed the underfitting issues under the single-expert setting with non-zero deferral costs. We now show that this underfitting problem persists in the multi-expert scenario even without additional deferral costs, and show that existing methods and seemingly plausible extensions based on them fail in this case.

3.1 Multiple Experts Can Cause Underfitting

We focus on the multi-expert L2D framework (4) in the standard setting without additional deferral costs. Under Bayes optimality (3), the system always selects the most accurate predictor from the pool of experts and the classifier. Consequently, a larger expert set should never lead to a decrease in optimal performance.
For instance, consider two expert sets $\mathcal{E}_1 \subset \mathcal{E}_2$; the best-in-set accuracy satisfies

$$\max_{j \in \mathcal{E}_2} \mathrm{Acc}_j(x) \geq \max_{j \in \mathcal{E}_1} \mathrm{Acc}_j(x),$$

which implies that the optimal L2D system induced from $\mathcal{E}_2$ will be better than or equal to the one from $\mathcal{E}_1$. However, the empirical evidence in Figure 1a reveals an unexpected inversion of this analysis: expanding the expert set induces underfitting in the base classifier, which subsequently leads to a decline in the overall system accuracy. This indicates that a larger and more capable expert set can actually compromise the learning process, causing the system to underperform as more experts are introduced. This finding is particularly notable because (4) is free from redundant label smoothing and should have avoided underfitting [26, 15]. We thus identify a novel multi-expert underfitting challenge that contradicts these existing conclusions.

How can a more capable expert set induce such underfitting? To uncover the underlying mechanism, we begin by analyzing the conditional risk w.r.t. (4):

$$R_{\ell_\phi \mid x}(\theta) = \sum_{y=1}^{K} \eta_y(x)\, \phi(\theta, y) + \sum_{j=1}^{J} \mathrm{Acc}_j(x)\, \phi(\theta, j + K).$$

The above risk is closely related to a $(K+J)$-class classification problem over a dummy distribution $\hat{p}$, where the augmented class probabilities for a given $x$ are $\hat{\eta}(x) = [\hat{p}(1 \mid x), \cdots, \hat{p}(K \mid x), \hat{p}(\perp_1 \mid x), \cdots, \hat{p}(\perp_J \mid x)]$, and

$$\hat{p}(y \mid x) = \frac{\eta_y(x)}{1 + A(x)}, \qquad \hat{p}(\perp_j \mid x) = \frac{\mathrm{Acc}_j(x)}{1 + A(x)}, \tag{6}$$

where the expert aggregation term $A(x) = \sum_{j=1}^{J} \mathrm{Acc}_j(x)$ captures the sum of the experts' accuracies. Observe that the expert aggregation term $A(x)$ scales as $O(J)$ with the number of experts. Unlike the single-expert case where $A(x) = O(1)$, this multi-expert term increasingly flattens the label distribution $\hat{p}(y \mid x)$.
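The flattening induced by $A(x)$ in (6) is easy to observe numerically. The sketch below uses a hypothetical 5-class posterior and $J$ identical experts with 80% accuracy (values are ours, not from the paper); the top-1 vs. runner-up margin of the dummy label distribution shrinks roughly as $O(1/J)$:

```python
import numpy as np

# Hypothetical 5-class posterior at some x, sorted descending.
eta = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

for J in [0, 1, 5, 20]:
    accs = np.full(J, 0.8)        # J experts, each 80% accurate
    A = accs.sum()                # aggregation term A(x) = O(J)
    p_hat = eta / (1.0 + A)       # dummy label probabilities from (6)
    margin = p_hat[0] - p_hat[1]  # top-1 vs runner-up margin
    print(f"J={J:2d}  A={A:5.1f}  margin={margin:.4f}")
```

The printed margins decrease monotonically in J, mirroring the degraded distribution illustrated in Figure 1b.
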
As Figure 1b illustrates, this flattening narrows the margin between the optimal label and its competitors, making the ground truth harder to identify and eventually triggering underfitting. In short, a flattened label distribution arises inevitably in multi-expert settings, in contrast to single-expert cases. Consequently, this inherent flattening reveals that existing losses like CE and OvA [39] will suffer from underfitting due to their intrinsic objective structure rather than external assumptions, e.g., deferral costs.

3.2 Failure of Merely Using Intermediate Results

As established in Section 3.1, underfitting in multi-expert L2D stems from the expert aggregation term rather than redundant label smoothing. Consequently, existing underfitting-resistant approaches [15] designed to eliminate label smoothing fail to resolve the underfitting inherent to the multi-expert setting, which necessitates new methods tailored for multi-expert underfitting.

Since the aggregation term $A(x) = O(J)$ is the primary cause of underfitting, a direct solution is to reduce the multi-expert system to a single-expert setup. To test this logic, consider an idealized scenario where the most accurate expert $j^*$ for an instance is known. In this case, comparing the classifier solely against $j^*$ substitutes $A(x)$ with $\mathrm{Acc}_{j^*}(x)$, which preserves Bayes optimality while immediately eliminating the label-flattening effect. Although $j^*$ is inaccessible in practice, existing work [15] suggests that intermediate learning results can serve as a proxy for the optimal expert, i.e., utilizing the model's predictions at intermediate stages of the training process instead of the true optimal expert $j^*$.
This motivates the Pick the Confident Expert method below, following the logic of [15]:

$$\tilde{\ell}^{\circ}_\phi(\theta, y, m) := \phi(\theta, y) + \mathbb{1}[m_{\hat{j}^*} = y]\, \phi(\theta, \hat{j}^* + K), \tag{7}$$

where $\hat{j}^* = \operatorname{argmax}_{j \in [J]} \theta_{j+K}$ is an estimator of the optimal expert, and the corresponding conditional risk is:

$$R_{\tilde{\ell}^{\circ}_\phi \mid x}(\theta) = \sum_{y=1}^{K} \eta_y(x)\, \phi(\theta, y) + \mathrm{Acc}_{\hat{j}^*}(x)\, \phi(\theta, \hat{j}^* + K).$$

By reducing the expert aggregation in (4) to a single estimated-optimal-expert term, this new formulation (7) induces a less flattened (dummy) label distribution $p'$:

$$p'(y \mid x) = \frac{\eta_y(x)}{1 + \mathrm{Acc}_{\hat{j}^*}(x)}, \qquad p'(\perp_j \mid x) = \frac{\mathrm{Acc}_j(x)}{1 + \mathrm{Acc}_{\hat{j}^*}(x)}. \tag{8}$$

However, despite its success in mitigating label flattening, this seemingly reasonable solution fails from an optimization perspective because it lacks continuity, a fundamental property of any viable surrogate loss. Specifically, the indicator term in (7) is non-constant and depends on the scores $\theta$ through the discrete estimator $\hat{j}^*$. Consequently, any shift in the selected expert $\hat{j}^*$ induces a step discontinuity, thereby precluding effective gradient-based optimization.

4 PiCCE: Using Both Intermediate and Empirical Results

According to the discussion in Section 3, the multi-expert L2D paradigm is naturally prone to underfitting compared with single-expert L2D, while the existing surrogate-based strategy for mitigating underfitting fails to achieve continuity, which violates an essential premise of surrogate loss design. In this section, we show that this new challenge can be effectively solved by exploiting ground-truth information, i.e., considering both the confidence and the empirical correctness of the experts in the surrogate loss design. The integration of ground-truth information not only enables continuous surrogates, but also greatly mitigates the unique phenomenon of underfitting caused by expert aggregation in multi-expert L2D.
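The step discontinuity of (7) can be reproduced in a few lines: nudging the score of a wrong expert past that of a correct one flips the indicator and makes the loss value jump. A NumPy sketch with hypothetical scores (an illustration of the failure mode, not a proposed method):

```python
import numpy as np

def ce_phi(theta, k):
    """Cross-entropy loss phi(theta, k) = -log softmax(theta)_k."""
    z = theta - theta.max()
    return -(z[k] - np.log(np.exp(z).sum()))

def pick_confident_loss(theta, y, m, K):
    """Loss (7): the defer term is kept only if the currently most
    confident expert happens to be correct -- a discontinuous rule."""
    j_hat = int(np.argmax(theta[K:]))  # estimator of the optimal expert
    loss = ce_phi(theta, y)
    if m[j_hat] == y:
        loss += ce_phi(theta, K + j_hat)
    return loss

# K = 2 classes, y = 0; expert 0 is correct, expert 1 is wrong.
m, y, K = [0, 1], 0, 2
for eps in [-0.01, 0.01]:  # nudge expert 1's score past expert 0's
    theta = np.array([1.0, 0.2, 0.5, 0.5 + eps])
    print(f"eps={eps:+.2f}  loss={pick_confident_loss(theta, y, m, K):.4f}")
```

An infinitesimal change in the scores switches the selected expert from the correct one to the wrong one, and the loss drops by a whole defer term: a step discontinuity that defeats gradient-based optimization.
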
4.1 Regulating Confident Experts with Ground Truth

To achieve a continuous surrogate framework, let us first analyze the reason for the discontinuity of (7). Revisiting the definition of the expert selector $\hat{j}^*$ in (7), it is noticeable that the selection ranges over the complete set of experts $[J]$, which encompasses both accurate and erroneous predictions. When the most confident expert shifts within this full candidate set, (7) may encounter step-like discontinuities due to the indicator function $\mathbb{1}[m_{\hat{j}^*} = y]$, if the correctness of the selected expert changes after the shift.

This observation raises a concern: is it really helpful to consider the whole expert set? According to the discussion above, choosing the full set $[J]$ indiscriminately contributes to the step-like discontinuities inherent in (7), which suggests the potential benefit of constraining the expert set. Furthermore, it is also intuitive to prune the expert set based on empirical observations of expert performance, thereby focusing the selection on high-accuracy experts. Therefore, a natural idea is to modify the expert selector by regulating the candidate set with empirical evidence of expert accuracy. Despite potential concerns regarding the cost of acquiring such evidence, the ground truth $y$ and the expert predictions $m_j$ serve as surprisingly straightforward sources. Specifically, they constitute the unbiased estimator $\mathbb{1}[m_j = y]$ of the expert accuracy $\mathrm{Acc}_j$. Moreover, this incurs zero additional cost, given that they are readily available within the standard L2D training set. Motivated by this empirical correctness evidence $\mathbb{1}[m_j = y]$, we proceed to construct a more compact expert selector: confining the expert set to the correct experts $\{j \in [J] : m_j = y\}$, and selecting the most confident expert in this set.
Compared with the Pick the Confident Expert method (7) that chooses from the full set $[J]$, our proposed new selector operates by Picking the Confident and Correct Expert (PiCCE), which induces the following surrogates:

Definition 1 (PiCCE) Denote by $[m = y]$ the set $\{j \in [J] : m_j = y\}$. For any multiclass loss $\phi: \mathbb{R}^{K+J} \times [K+J] \to \mathbb{R}_{\geq 0}$, our loss $\tilde{\ell}_\phi: \mathbb{R}^{K+J} \times \mathcal{Y} \times \mathcal{M}^J \to \mathbb{R}_{\geq 0}$ is:

$$\tilde{\ell}_\phi(\theta, y, m) = \phi(\theta, y) + \phi\Big(\theta,\ \operatorname{argmax}_{j \in [m = y]} \theta_{j+K} + K\Big). \tag{9}$$

The second term is 0 if $[m = y] = \emptyset$.

The most intuitive difference between PiCCE (9) and the discontinuous loss (7) is that (9) removes the multiplicative indicator term. However, this is not the result of manual exclusion, but a natural consequence of the transition of candidate expert sets: for any expert $j \in [m = y]$, the indicator term $\mathbb{1}[m_j = y]$ equals 1, and thus naturally integrates into the formulation (9). The benefit of this distinction is obvious since it removes the potential step-like discontinuities, which enables the construction of continuous surrogates:

Theorem 2 (Continuity of PiCCE) The proposed formulation (9) is continuous if $\phi$ is continuous and symmetric w.r.t. its last $J$ inputs, i.e., $P\boldsymbol{\phi}(\theta) = \boldsymbol{\phi}(P\theta)$ for all permutation matrices $P \in \mathbb{R}^{(K+J) \times (K+J)}$ with $P_{i,i} = 1$ for $i \in [K]$, where $\boldsymbol{\phi}(\theta) = [\phi(\theta, 1), \cdots, \phi(\theta, K+J)]^\top$.

The symmetry requirement on the base multiclass loss $\phi$ is mild, and it covers most multiclass losses used in existing multi-expert L2D losses [39, 19]. For example, the cross-entropy loss is a natural choice that satisfies the symmetry, and it is used in the multi-expert CE loss [39, 25]. Similarly, the base loss for the multi-expert OvA loss, as we will show in Section 5, also satisfies the symmetry.
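Definition 1 translates directly into code. Below is a sketch of PiCCE instantiated with the cross-entropy base loss (NumPy, 0-indexed; an illustration rather than the authors' implementation):

```python
import numpy as np

def ce_phi(theta, k):
    """Cross-entropy loss phi(theta, k) = -log softmax(theta)_k."""
    z = theta - theta.max()
    return -(z[k] - np.log(np.exp(z).sum()))

def picce_loss(theta, y, m, K):
    """PiCCE (9): classifier term plus the defer term of the most
    confident *correct* expert; the defer term is 0 if no expert is correct."""
    loss = ce_phi(theta, y)
    correct = [j for j, m_j in enumerate(m) if m_j == y]  # the set [m = y]
    if correct:
        j_sel = max(correct, key=lambda j: theta[K + j])  # most confident
        loss += ce_phi(theta, K + j_sel)
    return loss

# K = 2 classes, J = 3 experts; experts 0 and 2 are correct for y = 0.
theta = np.array([1.0, 0.2, 0.3, 0.9, 0.4])
print(picce_loss(theta, 0, [0, 1, 0], 2))  # defers to expert 2 (score 0.4 > 0.3)
```

Note that as $\theta$ varies, the selected expert can only switch between two correct experts at a point where their scores are tied, and there the two CE values coincide; the loss value therefore varies continuously, in line with Theorem 2.
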
4.2 Underfitting Resistance of PiCCE

While we have shown that the refinement of the candidate expert set yields continuity, which is beneficial from an optimization perspective, a more critical issue lies in its statistical efficiency, in particular its robustness against underfitting, which constitutes the primary target of our design.

In this section, we show that PiCCE is indeed underfitting-resistant by examining its dummy distribution. To begin with, we first analyze the risk formulation of PiCCE:

Lemma 3 (Risk of PiCCE) Suppose $\phi$ is symmetric w.r.t. its last $J$ inputs. Denote by $\sigma$ any permutation of $[J]$ such that $\theta_{\sigma_1 + K} \geq \cdots \geq \theta_{\sigma_J + K}$. Then the risk of PiCCE is:

$$R_{\tilde{\ell}_\phi \mid x}(\theta) = \sum_{y=1}^{K} \eta_y(x)\, \phi(\theta, y) + \sum_{j=1}^{J} A^j_\sigma(x)\, \phi(\theta, \sigma_j + K), \tag{10}$$

where $A^j_\sigma(x) = \Pr(M_{\sigma_{1:j-1}} \neq Y,\ M_{\sigma_j} = Y \mid X = x)$.

This risk is also a weighted sum of multiclass losses, which allows an analysis based on its corresponding $(K+J)$-class dummy distribution $\tilde{p}$. Specifically, its label counterpart ($y \in [K]$) can be formulated as:

$$\tilde{p}(y \mid x) = \frac{\eta_y(x)}{\sum_{y'=1}^{K} \eta_{y'}(x) + \sum_{j=1}^{J} A^j_\sigma(x)} = \frac{\eta_y(x)}{1 + \sum_{j=1}^{J} A^j_\sigma(x)}.$$

While the dummy distribution can be explicitly formulated, the denominator $1 + \sum_{j=1}^{J} A^j_\sigma(x)$ is less intuitive than those in (6) and (8). Specifically, the aggregation term $A(x)$ in (6) clearly scales as $O(J)$ because it represents the sum of expert accuracies, while the corresponding term $\mathrm{Acc}_{\hat{j}^*}(x)$ in (8) is simply $O(1)$. In contrast, the term $\sum_{j=1}^{J} A^j_\sigma(x)$ is significantly harder to quantify. While it superficially appears to be $O(J)$ as a summation of $J$ terms, the internal dependencies between the $A^j_\sigma(x)$ components suggest the potential for further simplification. This lack of transparency complicates the study of underfitting, particularly when quantifying the degree of label distribution flattening.
Fortunately, the lemma below shows that $\sum_{j=1}^{J} A^j_\sigma(x)$ can be simplified by exploiting these dependencies:

Lemma 4 For any permutation $\sigma$ and $x \in \mathcal{X}$:

$$1 + \sum_{j=1}^{J} A^j_\sigma(x) = 1 + \Pr\Big(\bigcup_{j \in [J]} \{M_j = Y\} \mid X = x\Big). \tag{11}$$

Compared with the dummy distribution (6), which severely flattens the label distribution since $A(x)$ can grow up to $J$, Lemma 4 indicates that PiCCE successfully resolves this issue by compressing the sum of expert accuracies $A(x) = O(J)$ into $\Pr\big(\bigcup_{j \in [J]} \{M_j = Y\} \mid X = x\big) = O(1)$, which leads to a dummy distribution that remains informative as the number of experts increases, and thus is free from the multi-expert underfitting issue.

In the next section, we further analyze the statistical consistency of PiCCE, i.e., whether the minimization of the risk (10) can recover the Bayes optimal solution for multi-expert L2D (3).

5 Consistency Guarantee

In Section 5.1, we first show that the classifier counterpart in the L2D system is guaranteed to be consistent, i.e., the label with the highest score is always the most probable label. Then we further justify the consistency of the whole L2D system under a mild condition. Our analyses focus on the following two multiclass losses:

$$\phi(\theta, y) = -\log \mathrm{softmax}(\theta)_y, \tag{12}$$

where $\mathrm{softmax}(\theta)_y = \frac{\exp(\theta_y)}{\sum_{y'=1}^{K+J} \exp(\theta_{y'})}$, and

$$\phi(\theta, y) = \begin{cases} -\log s(\theta_y) + \log\big(1 - s(\theta_y)\big), & y > K, \\ -\log s(\theta_y) - \sum_{y' \neq y} \log\big(1 - s(\theta_{y'})\big), & \text{else}, \end{cases} \tag{13}$$

where $s(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid function. When (12) and (13) are used in (4), they recover the multi-expert CE and OvA-log losses [39], respectively.

5.1 Consistency Analysis of the Classifier

We analyze the properties of the first $K$ dimensions of the optimal scoring function, i.e., the optimal classifier, which is characterized by the following lemma:

Lemma 5 (Consistency of Optimal Classifiers) When using (12) or (13) in (9), the corresponding risk (10) is minimizable for any $x \in \mathcal{X}$.
Furthermore, denote by $\theta^*$ any minimizer of (10) for a given $x \in \mathcal{X}$. When (12) or (13) is used as the multiclass loss $\phi$ in PiCCE (9), $\operatorname{argmax}_{y \in [K]} \theta^*_y \in \operatorname{Argmax}_{y \in [K]} \eta_y(x)$, and:

(A). When (12) is used in (9), denote by $\eta^*_y = \mathrm{softmax}(\theta^*)_y$ and $u^*_j = \mathrm{softmax}(\theta^*)_{K+j}$:

$$\eta^*_y = \eta_y(x)\big(1 - \tilde{V}(x)\big), \quad \forall y \in [K],$$

where $\tilde{V}(x) := \sum_{j=1}^{J} u^*_j = \frac{V(x)}{1 + V(x)}$ and $V(x) = \Pr\big(\cup_{j \in [J]} \{M_j = Y\} \mid X = x\big)$.

(B). When (13) is used in (9), $s(\theta^*_y) = \eta_y(x)$, $\forall y \in [K]$.

This lemma immediately establishes the consistency of PiCCE with respect to (12) and (13), implying that the optimal L2D system serves as an effective classifier. Furthermore, the method yields Fisher-consistent class probability estimators. Specifically, Lemma 5 (A) shows that when (12) is used for PiCCE, the term $\frac{\mathrm{softmax}(\theta)_y}{1 - \sum_{i=K+1}^{K+J} \mathrm{softmax}(\theta)_i}$ converges to the class probability $\eta_y(x)$ as $\theta$ approaches $\theta^*$. Analogously, Lemma 5 (B) demonstrates that $s(\theta_y)$ converges to $\eta_y(x)$ when the OvA-log loss (13) is employed.

Table 1: The mean of the system error (Err, rescaled to 0-100) and coverage (Cov) over 3 trials on the CIFAR-100 and ImageNet datasets.
CIFAR-100 (Err / Cov):

| Loss | # Exp | Animal Expert, Vanilla | Animal Expert, PiCCE | Overlapped Animal, Vanilla | Overlapped Animal, PiCCE |
|------|-------|------------------------|----------------------|----------------------------|--------------------------|
| CE   | 4     | 18.48 / 74.58          | 18.32 / 77.50        | 16.10 / 64.50              | 15.10 / 67.80            |
| CE   | 8     | 18.91 / 74.35          | 18.21 / 76.35        | 16.24 / 64.59              | 16.10 / 71.35            |
| CE   | 12    | 19.08 / 75.18          | 18.19 / 77.43        | 16.99 / 66.09              | 15.84 / 69.86            |
| CE   | 16    | 19.14 / 76.39          | 18.16 / 79.14        | 18.72 / 67.21              | 15.61 / 73.62            |
| CE   | 20    | 21.13 / 73.57          | 18.11 / 78.33        | 19.22 / 69.31              | 15.58 / 73.95            |
| OvA  | 4     | 19.46 / 83.72          | 19.10 / 86.43        | 17.07 / 79.28              | 16.58 / 83.85            |
| OvA  | 8     | 20.09 / 83.30          | 18.98 / 86.83        | 18.03 / 79.69              | 17.71 / 83.45            |
| OvA  | 12    | 20.70 / 82.67          | 18.86 / 86.89        | 18.45 / 79.36              | 17.77 / 84.93            |
| OvA  | 16    | 21.84 / 83.09          | 18.70 / 87.11        | 19.85 / 77.30              | 17.76 / 85.02            |
| OvA  | 20    | 22.91 / 77.71          | 18.63 / 86.69        | 20.95 / 72.72              | 17.74 / 86.19            |

ImageNet (Err / Cov):

| Loss | # Exp | Dog Expert, Vanilla | Dog Expert, PiCCE | Overlapped Dog, Vanilla | Overlapped Dog, PiCCE |
|------|-------|---------------------|-------------------|-------------------------|-----------------------|
| CE   | 4     | 42.74 / 89.08       | 42.10 / 89.85     | 41.56 / 88.37           | 41.23 / 89.16         |
| CE   | 8     | 44.16 / 88.71       | 41.96 / 89.91     | 42.65 / 87.99           | 41.72 / 89.23         |
| CE   | 12    | 46.36 / 88.50       | 41.50 / 89.78     | 44.37 / 87.85           | 41.77 / 89.14         |
| CE   | 16    | 47.01 / 88.45       | 41.40 / 89.99     | 45.17 / 87.54           | 41.79 / 88.94         |
| CE   | 20    | 48.00 / 88.51       | 41.32 / 90.01     | 46.37 / 87.34           | 42.07 / 88.80         |
| OvA  | 4     | 45.77 / 88.45       | 42.41 / 89.17     | 44.56 / 87.30           | 42.03 / 88.30         |
| OvA  | 8     | 52.62 / 86.66       | 42.11 / 89.29     | 52.05 / 85.56           | 42.15 / 88.51         |
| OvA  | 12    | 57.66 / 85.74       | 42.45 / 89.20     | 56.64 / 84.51           | 42.20 / 88.74         |
| OvA  | 16    | 62.16 / 84.67       | 42.17 / 89.12     | 59.43 / 83.68           | 42.03 / 88.70         |
| OvA  | 20    | 66.56 / 83.90       | 42.25 / 89.08     | 62.91 / 82.93           | 42.10 / 88.77         |

Table 2: The mean of the system error (Err, rescaled to 0-100) and coverage (Cov) over 3 trials on the MiceBone and Chaoyang datasets.
MiceBone (Err / Cov):

| # Exp | CE            | PiCCE-CE      | OvA           | PiCCE-OvA     |
|-------|---------------|---------------|---------------|---------------|
| 2     | 15.17 / 60.92 | 15.23 / 69.28 | 14.91 / 72.26 | 14.06 / 83.34 |
| 4     | 15.88 / 55.09 | 15.17 / 68.11 | 15.02 / 68.24 | 13.80 / 70.33 |
| 6     | 15.99 / 51.11 | 14.97 / 62.22 | 15.89 / 63.84 | 13.09 / 66.10 |
| 8     | 16.72 / 43.62 | 13.35 / 60.72 | 16.72 / 57.62 | 13.03 / 59.96 |

Chaoyang (Err / Cov):

| # Exp | CE           | PiCCE-CE     | OvA          | PiCCE-OvA    |
|-------|--------------|--------------|--------------|--------------|
| 2     | 1.32 / 53.86 | 1.22 / 63.05 | 1.51 / 32.33 | 1.46 / 37.76 |
| 4     | 1.75 / 51.68 | 1.60 / 55.86 | 1.90 / 20.37 | 1.56 / 27.76 |
| 6     | 2.48 / 45.99 | 1.75 / 52.02 | 2.04 / 14.39 | 1.31 / 23.58 |
| 8     | 3.35 / 41.23 | 2.04 / 50.90 | 2.48 / 13.17 | 1.02 / 20.38 |

5.2 Consistency Analysis of the L2D System

Based on the consistency of the classifier in Section 5.1, we now examine the consistency of the whole L2D system, i.e., whether the system can recognize the best expert and choose the better one between it and the optimal classifier. For any $\mathcal{M}' \subseteq \mathcal{M}$, $C_{\mathcal{M}'} := \{\exists j \in \mathcal{M}' : M_j = Y\}$ is the event that there exists a correct expert in $\mathcal{M}'$. We show that consistency holds based on the following mild condition:

Condition 1 (Information Advantage of the Optimal Expert) For any $x \in \mathcal{X}$, assume the optimal expert is unique, i.e., $|\operatorname{Argmax}_{j \in [J]} \mathrm{Acc}_j(x)| = 1$, and denote it as $j^*$. For any expert $j \neq j^*$ and any expert set $\mathcal{M}' \subseteq [J] \setminus \{j, j^*\}$:

$$\Pr\big(C_{\mathcal{M}' \cup \{j^*\}} \mid X = x\big) > \Pr\big(C_{\mathcal{M}' \cup \{j\}} \mid X = x\big). \tag{14}$$

In essence, this condition implies that the optimal expert yields the greatest improvement to the system's potential accuracy. This is a natural expectation in practical scenarios, as the optimal expert typically possesses predictive advantages that are not fully subsumed by the collective knowledge of the other experts. A fundamental example is the deterministic expert setting [17, 19], where each expert outputs a fixed, distinct label. In this case, incorporating the optimal expert (who predicts the most likely label) invariably expands the effective coverage of the expert pool.
We further demonstrate the validity of this condition in broader stochastic settings in Appendix A. Serving as a sufficient condition for the consistency of PiCCE, this assumption leads to the following theorem:

Theorem 6 (Consistency and Expert Accuracy Estimator) When Condition 1 holds and (12) or (13) is used as the multiclass loss $\phi$ in (9), for any $x \in \mathcal{X}$ and any $\theta^*$ minimizing (10), the prediction link $\varphi$ defined in (5) applied to $\theta^*$ reproduces the Bayes optimal decision $f^*(x)$, i.e.:

$$\varphi(\theta^*) = f^*(x), \quad \text{and} \quad \operatorname{Argmax}_{j \in [J]} \theta^*_{j+K} = \operatorname{Argmax}_{j \in [J]} \mathrm{Acc}_j(x) = \{j^*\}.$$

Furthermore:

(A). When (12) is used, $u^*_{j^*} = \mathrm{Acc}_{j^*}(x)\, \tilde{V}(x) / V(x)$.

(B). When (13) is used, $s(\theta^*_{j^*}) = \mathrm{Acc}_{j^*}(x)$.

This theorem formally validates the consistency of the L2D system driven by PiCCE. Beyond system-level consistency, it facilitates consistent estimation of the optimal expert's accuracy, which is a vital component for uncertainty quantification [38]. Notably, the OvA-log loss formulation allows PiCCE to natively extract expert accuracy via $\operatorname{argmax}_{j \in [J]} s(\theta_{K+j})$ without extra post-processing. This aligns with the insights of [38, 3], which highlighted the efficacy of OvA-based methods in modeling expert performance.

Figure 2: Classifier accuracy vs. number of experts on synthetic (CIFAR-100, ImageNet) and real-world expert datasets (MiceBone, Chaoyang). Panels: (a) CIFAR-100: Animal Expert; (b) ImageNet: Dog Expert; (c) MiceBone; (d) CIFAR-100: Overlapped Animal Expert; (e) ImageNet: Overlapped Dog Expert; (f) Chaoyang. Solid lines denote methods derived from (4), while dashed lines correspond to those derived from (9). Across all datasets, existing methods exhibit a performance drop as the number of experts increases, whereas PiCCE remains stable.

Finally, we remark that Condition 1 is merely a sufficient condition, not a necessary one.
This suggests that the theoretical guarantees of PiCCE likely extend to scenarios beyond these assumptions, pointing towards a promising direction for future exploration.

6 Experiments

In this section, we evaluate PiCCE in both synthetic and real-world expert settings to validate our theoretical findings and its empirical effectiveness in multi-expert L2D. Additional experimental details and results are provided in Appendix D.

6.1 Experimental Setup

Models and Datasets For synthetic experts, we conduct experiments on CIFAR-100 [14] and ImageNet [10]. For ImageNet, we consider a 120-class dog subset and construct synthetic domain-specialized dog experts under several controlled settings, including disjoint domains, overlapped domains, and varying accuracies. For CIFAR-100, a similar expert construction is adopted for biological-domain experts over 50 biological classes. Details are provided in Appendix D. Following previous works [38, 26], we use a 28-layer WideResNet [41] as the base model for CIFAR-100 and MobileNet-v2 [34] for ImageNet.

Baselines Our evaluation focuses on joint training methods, as post-hoc approaches require additional retraining. We evaluate combinations of the existing formulation (4) with cross-entropy and one-vs-all base losses, which correspond to the multi-expert CE and OvA surrogates. Our PiCCE method (9) instantiates the same base losses for a direct comparison, denoted PiCCE-CE and PiCCE-OvA, respectively.

6.2 Experimental Results

System Accuracy and Coverage In Table 1, we report the system error w.r.t. the target system loss (1) and the coverage, defined as the ratio of non-deferred samples, under two different expert patterns (disjoint/overlapped) using the multi-expert CE and OvA surrogate losses. The best results are highlighted in boldface.
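The two reported metrics can be computed from held-out predictions as follows. This is an illustrative sketch of the metric definitions (system error w.r.t. loss (1), and coverage as the ratio of non-deferred samples), not the paper's evaluation code; the encoding of deferral decisions is an assumption:

```python
import numpy as np

def system_error_and_coverage(decisions, class_preds, expert_preds, y):
    """decisions[i] = -1 -> the classifier predicts sample i;
    decisions[i] = j >= 0 -> sample i is deferred to expert j.
    Returns (error in 0-100, coverage in 0-100)."""
    decisions = np.asarray(decisions)
    keep = decisions == -1
    deferred_to = np.clip(decisions, 0, None)  # placeholder 0 where kept
    final = np.where(keep, class_preds,
                     expert_preds[np.arange(len(y)), deferred_to])
    err = 100.0 * float(np.mean(final != y))   # system error
    cov = 100.0 * float(np.mean(keep))         # ratio of non-deferred samples
    return err, cov

y = np.array([0, 1, 2, 0])
class_preds = np.array([0, 1, 0, 0])
expert_preds = np.array([[0, 1], [1, 1], [1, 2], [0, 0]])  # (n, J) expert labels
err, cov = system_error_and_coverage([-1, -1, 1, -1], class_preds, expert_preds, y)
```

Here the single deferred sample is routed to expert 1, who is correct, so the system error is 0 while coverage is 75.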
Across all settings, PiCCE consistently outperforms the vanilla CE and OvA formulations, achieving lower system error while maintaining higher coverage. Moreover, this improvement becomes more pronounced as the number of experts increases, in line with our theoretical analysis showing that existing surrogate formulations (4) suffer from more severe underfitting as the expert aggregation term grows. Overall, these results demonstrate that PiCCE yields more robust system-level performance in multi-expert L2D settings. Results for the varying-accuracy expert pattern, which exhibit consistent trends, are provided in Appendix D.

In Table 2, we further report the target system loss and coverage on datasets with real-world experts, i.e., MiceBone and Chaoyang. Consistent with the synthetic results, PiCCE achieves lower system error and higher coverage across different numbers of experts, indicating that our approach remains effective in practical settings with real human experts.

Classifier Accuracy From Figure 2, we observe that under both synthetic expert settings (CIFAR-100 and ImageNet) and real-world experts (MiceBone and Chaoyang), the classifier prediction accuracy of CE and OvA (solid lines) consistently degrades as the number of experts increases. This degradation becomes increasingly pronounced with more experts and is particularly severe for OvA-based formulations, whose classifier accuracy drops sharply across all expert settings. In contrast, methods derived from our PiCCE formulation (dashed lines) are insensitive to the size of the expert set in the L2D system, maintaining consistently higher classifier accuracy across both synthetic and real-world datasets.
These consistent trends across diverse settings indicate that the underfitting observed with existing multi-expert surrogate losses (4) is driven by the expert aggregation term, and that PiCCE effectively mitigates this underfitting by leveraging empirical information to avoid such aggregation.

7 Conclusion

In this work, we have provided both empirical evidence and theoretical insights into the inherent classifier underfitting challenges of multi-expert L2D. We showed that while underfitting in single-expert L2D requires additional assumptions, it arises naturally and inherently in the multi-expert setting due to the expert aggregation term. This fundamental difference renders existing L2D remedies ineffective. We address this issue by proposing PiCCE, a surrogate-based formulation that bypasses expert aggregation by regulating selection through ground-truth information. Our theoretical analysis and extensive experiments across synthetic and real-world benchmarks show that PiCCE effectively resolves multi-expert underfitting and consistently outperforms state-of-the-art methods.

References

[1] Bansal, G., Nushi, B., Kamar, E., Horvitz, E., and Weld, D. S. Is the most accurate AI the best teammate? Optimizing AI for teamwork. In AAAI, volume 35, pp. 11405–11414, 2021. Cited on page: 1
[2] Bartlett, P. L. and Wegkamp, M. H. Classification with a reject option using a hinge loss. J. Mach. Learn. Res., 9:1823–1840, 2008. Cited on page: 13
[3] Cao, Y., Mozannar, H., Feng, L., Wei, H., and An, B. In defense of softmax parametrization for calibrated and consistent learning to defer. In NeurIPS, 2023. Cited on pages: 1, 3, 8, and 13
[4] Charoenphakdee, N., Cui, Z., Zhang, Y., and Sugiyama, M. Classification with rejection based on cost-sensitive classification. In ICML, volume 139, pp. 1507–1517, 2021. Cited on page: 13
[5] Charusaie, M., Mozannar, H., Sontag, D. A., and Samadi, S.
Sample efficient learning of predictors that complement humans. In ICML, pp. 2972–3005, 2022. Cited on pages: 1, 3
[6] Charusaie, M.-A. and Samadi, S. A unifying post-processing framework for multi-objective learn-to-defer problems. In NeurIPS, 2024. Cited on page: 1
[7] Chow, C. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, 1970. doi: 10.1109/TIT.1970.1054406. Cited on page: 13
[8] Cortes, C., DeSalvo, G., and Mohri, M. Learning with rejection. In ALT, volume 9925, pp. 67–82, 2016. Cited on page: 13
[9] De, A., Okati, N., Zarezade, A., and Rodriguez, M. G. Classification under human assistance. In AAAI, volume 35, pp. 5905–5913, 2021. Cited on page: 1
[10] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255, 2009. Cited on page: 9
[11] Hemmer, P., Thede, L., Vössing, M., Jakubik, J., and Kühl, N. Learning to defer with limited expert predictions. In AAAI, pp. 6002–6011, 2023. Cited on page: 3
[12] Jitkrittum, W., Gupta, N., Menon, A. K., Narasimhan, H., Rawat, A. S., and Kumar, S. When does confidence-based cascade deferral suffice? In NeurIPS, 2023. Cited on page: 1
[13] Jitkrittum, W., Narasimhan, H., Rawat, A. S., Juneja, J., Wang, Z., Lee, C., Shenoy, P., Panigrahy, R., Menon, A. K., and Kumar, S. Universal model routing for efficient LLM inference. CoRR, abs/2502.08773, 2025. Cited on page: 1
[14] Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Cited on page: 9
[15] Liu, S., Cao, Y., Zhang, Q., Feng, L., and An, B. Mitigating underfitting in learning to defer with consistent losses. In AISTATS, volume 238, pp. 4816–4824, 2024. Cited on pages: 1, 2, 3, 4, and 5
[16] Madras, D., Pitassi, T., and Zemel, R. S. Predict responsibly: Improving fairness and accuracy by learning to defer. In NeurIPS, 2018.
Cited on page: 1
[17] Mao, A., Mohri, C., Mohri, M., and Zhong, Y. Two-stage learning to defer with multiple experts. In NeurIPS, 2023. Cited on pages: 1, 3, and 8
[18] Mao, A., Mohri, M., and Zhong, Y. Ranking with abstention. arXiv preprint arXiv:2307.02035, 2023. Cited on page: 13
[19] Mao, A., Mohri, M., and Zhong, Y. Principled approaches for learning to defer with multiple experts. In ISAIM, volume 14494, pp. 107–135, 2024. Cited on pages: 1, 3, 6, and 8
[20] Mao, A., Mohri, M., and Zhong, Y. Realizable h-consistent and Bayes-consistent loss functions for learning to defer. In NeurIPS, 2024. Cited on page: 1
[21] Mao, A., Mohri, M., and Zhong, Y. Mastering multiple-expert routing: Realizable h-consistency and strong guarantees for learning to defer. arXiv preprint arXiv:2506.20650, 2025. Cited on page: 1
[22] Montreuil, Y., Carlier, A., Ng, L. X., and Ooi, W. T. Adversarial robustness in two-stage learning-to-defer: Algorithms and guarantees. In ICML, 2025. Cited on page: 1
[23] Montreuil, Y., Carlier, A., Ng, L. X., and Ooi, W. T. Why ask one when you can ask k? Two-stage learning-to-defer to the top-k experts. CoRR, abs/2504.12988, 2025. Cited on page: 1
[24] Montreuil, Y., Heng, Y. S., Carlier, A., Ng, L. X., and Ooi, W. T. A two-stage learning-to-defer approach for multi-task learning. In ICML, 2025. Cited on page: 1
[25] Mozannar, H. and Sontag, D. A. Consistent estimators for learning to defer to an expert. In ICML, pp. 7076–7087, 2020. Cited on pages: 1, 3, 4, and 6
[26] Narasimhan, H., Jitkrittum, W., Menon, A. K., Rawat, A. S., and Kumar, S. Post-hoc estimators for learning to defer to an expert. In NeurIPS, 2022. Cited on pages: 1, 2, 3, 4, and 9
[27] Narasimhan, H., Menon, A. K., Jitkrittum, W., Gupta, N., and Kumar, S. Learning to reject meets long-tail learning. In ICLR, 2024. Cited on page: 13
[28] Narasimhan, H., Menon, A. K., Jitkrittum, W., and Kumar, S.
Plugin estimators for selective classification with out-of-distribution detection. In ICLR, 2024. Cited on page: 13
[29] Narasimhan, H., Jitkrittum, W., Rawat, A. S., Kim, S., Gupta, N., Menon, A. K., and Kumar, S. Faster cascades via speculative decoding. In ICLR, 2025. Cited on page: 1
[30] Palomba, F., Pugnana, A., Alvarez, J. M., and Ruggieri, S. A causal framework for evaluating deferring systems. In AISTATS, volume 258, pp. 2143–2151, 2025. Cited on page: 1
[31] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. Cited on page: 20
[32] Pugnana, A., Massidda, R., Giannini, F., Barbiero, P., Zarlenga, M. E., Pellungrini, R., Dominici, G., Giannotti, F., and Bacciu, D. Deferring concept bottleneck models: Learning to defer interventions to inaccurate experts. arXiv preprint, 2025. Cited on page: 1
[33] Ruggieri, S. and Pugnana, A. Things machine learning models know that they don't know. In AAAI, volume 39, pp. 28684–28693, 2025. Cited on page: 1
[34] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520, 2018. Cited on page: 9
[35] Schmarje, L., Grossmann, V., Zelenka, C., Dippel, S., Kiko, R., Oszust, M., Pastell, M., Stracke, J., Valros, A., Volkmann, N., and Koch, R. Is one annotation enough? A data-centric image classification benchmark for noisy and ambiguous label estimation. In NeurIPS, 2022. Cited on page: 21
[36] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826, 2016. Cited on page: 4
[37] Tailor, D., Patra, A., Verma, R., Manggala, P., and Nalisnick, E. T. Learning to defer to a population: A meta-learning approach. In AISTATS, volume 238, pp.
3475–3483, 2024. Cited on page: 1
[38] Verma, R. and Nalisnick, E. T. Calibrated learning to defer with one-vs-all classifiers. In ICML, pp. 22184–22202, 2022. Cited on pages: 1, 3, 4, 8, and 9
[39] Verma, R., Barrejón, D., and Nalisnick, E. T. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In AISTATS, volume 206, pp. 11415–11434, 2023. Cited on pages: 1, 2, 3, 4, 6, 7, and 13
[40] Wei, Z., Cao, Y., and Feng, L. Exploiting human-AI dependence for learning to defer. In ICML, 2024. Cited on page: 1
[41] Zagoruyko, S. and Komodakis, N. Wide residual networks. In BMVC, pp. 87.1–87.12, 2016. Cited on page: 9
[42] Zhang, Z., Nguyen, C., Wells, K., Do, T.-T., and Carneiro, G. Learning to complement with multiple humans. arXiv preprint arXiv:2311.13172, 2023. Cited on page: 21
[43] Zhu, C., Chen, W., Peng, T., Wang, Y., and Jin, M. Hard sample aware noise robust learning for histopathology image classification. IEEE Trans. Medical Imaging, 41(4):881–894, 2022. Cited on page: 21

A Condition 1 with Stochastic Experts

The stochastic expert setting is more complex than the deterministic expert case and learning-with-rejection tasks [7, 2, 8, 4, 3, 18, 27, 28]: the former assumes invariant experts with fixed decision rules, while the latter assumes constant rejection costs. We show in this section that Condition 1 holds in broader stochastic settings by giving the following examples:

Example 1 (Dominant Expert). Suppose j* is a dominant expert whose class-conditional accuracy is at least that of every other expert, i.e., Pr(M_{j*} = Y | Y = y, X = x) ≥ Pr(M_j = Y | Y = y, X = x) for all j and y, with strict inequality on at least one class y. Furthermore, suppose the expert predictions are conditionally independent given the label Y (a commonly used condition in previous multi-expert L2D studies [39]).
Then for any subset M ⊆ [J] and any expert j ∉ M:

Pr(C_{M∪{j}} | X = x) = Σ_{y=1}^K η_y(x) Pr(C_{M∪{j}} | Y = y, X = x)
  = Σ_{y=1}^K η_y(x) ( Pr(C_M | Y = y, X = x) + Pr(M_j = Y | Y = y, X = x) − Pr(C_M | Y = y, X = x) · Pr(M_j = Y | Y = y, X = x) )
  = Σ_{y=1}^K η_y(x) ( Pr(C_M | Y = y, X = x) + Pr(M_j = Y | Y = y, X = x)(1 − Pr(C_M | Y = y, X = x)) ),

where the factor 1 − Pr(C_M | Y = y, X = x) is > 0 when M does not include j*. Then for any j ≠ j*:

Pr(C_{M∪{j*}} | X = x) − Pr(C_{M∪{j}} | X = x)
  = Σ_{y=1}^K η_y(x) (Pr(M_{j*} = Y | Y = y, X = x) − Pr(M_j = Y | Y = y, X = x))(1 − Pr(C_M | Y = y, X = x)) > 0,

and thus Condition 1 holds. This example shows that Condition 1 can hold when expert predictions are stochastic, provided a dominant expert outperforms the others on every class. Furthermore, Condition 1 extends to milder scenarios. The following example demonstrates that the condition remains valid even when the optimal expert is not globally superior; we focus on the 3-class case for ease of presentation:

Example 2 (Expert Only Dominant at the Major Class). Suppose there are 3 classes, and the class-conditional distribution and the experts' class-conditional accuracies are as below:

Class y        η_y(x)   Pr(M_1=Y|Y=y,X=x)   Pr(M_2=Y|Y=y,X=x)   Pr(M_3=Y|Y=y,X=x)
y = 1 (Major)  0.80     0.90                0.60                0.60
y = 2 (Minor)  0.10     0.40                0.90                0.90
y = 3 (Minor)  0.10     0.40                0.90                0.90

Here Acc_1(x) = 0.80 > Acc_2(x) = Acc_3(x) = 0.66, and thus j* = 1. However, the class-conditional accuracy of M_1 is lower than those of M_2 and M_3 on the minor classes, so Example 1 cannot be applied. Nevertheless, we can show that Condition 1 still holds.
When j = 2 (the conclusion below holds symmetrically for j = 3 since M_2 and M_3 have the same distribution), take M = {3}:

Pr(C_{M∪{j*}} | X = x) − Pr(C_{M∪{j}} | X = x)
  = Σ_{y=1}^K η_y(x) (Pr(M_{j*} = Y | Y = y, X = x) − Pr(M_j = Y | Y = y, X = x))(1 − Pr(C_M | Y = y, X = x))
  = Σ_{y=1}^K η_y(x) (Pr(M_1 = Y | Y = y, X = x) − Pr(M_2 = Y | Y = y, X = x))(1 − Pr(M_3 = Y | Y = y, X = x))
  = 0.8 · (0.9 − 0.6)(1 − 0.6) + 0.1 · (0.4 − 0.9)(1 − 0.9) + 0.1 · (0.4 − 0.9)(1 − 0.9)
  = 0.8 · 0.3 · 0.4 − 2 · 0.1 · 0.5 · 0.1 = 0.086 > 0,

and thus Condition 1 holds. This example reveals that Condition 1 covers a broad range of scenarios where expert stochasticity interacts with the data distribution. Intuitively, the optimal expert only needs to outperform the others on the majority classes, while being allowed to be less accurate on rare classes. These examples underscore the generality of Condition 1. A promising direction for future work is to characterize necessary and sufficient conditions under which Condition 1 holds.

B Proofs of Conclusions in Section 4

B.1 Proof of Theorem 2

Proof. Fix θ_0 ∈ R^{K+J}. For every ε > 0 there exists σ > 0 such that for all θ ∈ B(θ_0, σ) = {v ∈ R^{K+J} : ∥θ_0 − v∥ < σ}:

|ℓ_ϕ(θ, y, m) − ℓ_ϕ(θ_0, y, m)|
  = | ϕ(θ, y) + ϕ(θ, argmax_{j∈[m=y]} e^{θ_{j+K}} + K) − ( ϕ(θ_0, y) + ϕ(θ_0, argmax_{j∈[m=y]} e^{θ_{0,j+K}} + K) ) |
  = | ϕ(θ, y) + ϕ(θ, j* + K) − (ϕ(θ_0, y) + ϕ(θ_0, j*_0 + K)) |
  = | (ϕ(θ, y) − ϕ(θ_0, y)) + (ϕ(θ, j* + K) − ϕ(θ_0, j*_0 + K)) |,

where [m = y] := {j ∈ [J] : m_j = y}, and j* and j*_0 denote the best expert under θ and θ_0, respectively. We complete the proof by considering the following two cases.

Case 1: j*_0 is unique. Denote the runner-up set by J̃*_0 := argmax_{j ∈ [m=y] \ {j*_0}} e^{θ_{0,j+K}}.
For any j̃*_0 ∈ J̃*_0, let ∆ := |θ_{0,j*_0+K} − θ_{0,j̃*_0+K}|. Then for σ small enough relative to ∆ we obtain j*_0 = j* and

|ℓ_ϕ(θ, y, m) − ℓ_ϕ(θ_0, y, m)| = |(ϕ(θ, y) − ϕ(θ_0, y)) + (ϕ(θ, j*_0 + K) − ϕ(θ_0, j*_0 + K))|
  ≤ |ϕ(θ, y) − ϕ(θ_0, y)| + |ϕ(θ, j*_0 + K) − ϕ(θ_0, j*_0 + K)| ≤ 2ε.

The last inequality holds since ϕ is continuous at θ_0.

Case 2: j*_0 is not unique and belongs to J*_0 := argmax_{j∈[m=y]} e^{θ_{0,j+K}}. Denote the runner-up set by J̃*_0 := argmax_{j ∈ [m=y] \ J*_0} e^{θ_{0,j+K}}. For any j̃*_0 ∈ J̃*_0, let ∆ := |θ_{0,j*_0+K} − θ_{0,j̃*_0+K}|. Then for σ small enough relative to ∆ we obtain j* ∈ J*_0 and

|ℓ_ϕ(θ, y, m) − ℓ_ϕ(θ_0, y, m)| = |(ϕ(θ, y) − ϕ(θ_0, y)) + (ϕ(θ, j* + K) − ϕ(θ_0, j*_0 + K))|
  ≤ |ϕ(θ, y) − ϕ(θ_0, y)| + |ϕ(θ, j* + K) − ϕ(θ_0, j*_0 + K)|
  = |ϕ(θ, y) − ϕ(θ_0, y)| + |ϕ(θ, j* + K) − ϕ(θ_0, j* + K) + ϕ(θ_0, j* + K) − ϕ(θ_0, j*_0 + K)|
  ≤ |ϕ(θ, y) − ϕ(θ_0, y)| + |ϕ(θ, j* + K) − ϕ(θ_0, j* + K)| + |ϕ(θ_0, j* + K) − ϕ(θ_0, j*_0 + K)|.

Because ϕ is symmetric w.r.t. its last J inputs, it follows that |ϕ(θ_0, j* + K) − ϕ(θ_0, j*_0 + K)| = 0. Thus

|ℓ_ϕ(θ, y, m) − ℓ_ϕ(θ_0, y, m)| ≤ |ϕ(θ, y) − ϕ(θ_0, y)| + |ϕ(θ, j* + K) − ϕ(θ_0, j* + K)| ≤ 2ε.

The last inequality holds since ϕ is continuous at θ_0. Combining the two cases concludes the proof.

B.2 Proof of Theorem 3

Proof.
Denote by Σ(θ) = {σ : θ̃_{σ_1} ≥ ··· ≥ θ̃_{σ_J}} the set of all descending permutations of θ̃, where θ̃_j := θ_{K+j}. We then obtain:

R_{ℓ_ϕ | X=x}(θ) = E_{Y,M | X=x}[ ϕ(θ, Y) + ϕ(θ, argmax_{j∈[M=Y]} e^{θ̃_j} + K) ]
  = Σ_{y=1}^K η_y ϕ(θ, y) + Σ_{σ∈Σ(θ)} Pr(σ) [ Pr(M_{σ_1} = Y | X=x) ϕ(θ, σ_1 + K) + Pr(M_{σ_1} ≠ Y, M_{σ_2} = Y | X=x) ϕ(θ, σ_2 + K) + ··· + Pr(M_{σ_{1:J−1}} ≠ Y, M_{σ_J} = Y | X=x) ϕ(θ, σ_J + K) ]
  = Σ_{y=1}^K η_y ϕ(θ, y) + Σ_{σ∈Σ(θ)} Pr(σ) [ Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) ϕ(θ, σ_j + K) ].

Since ϕ is symmetric in its last J inputs, for any σ ∈ Σ(θ) there exists a permutation matrix P̃ ∈ R^{J×J} such that

Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) ϕ(θ, σ_j + K)
  = [Pr(M_{σ_1} = Y | X=x), ···, Pr(M_{σ_{1:J−1}} ≠ Y, M_{σ_J} = Y | X=x)] [ϕ(θ, σ_1 + K), ···, ϕ(θ, σ_J + K)]^⊤
  = [Pr(M_{σ_1} = Y | X=x), ···, Pr(M_{σ_{1:J−1}} ≠ Y, M_{σ_J} = Y | X=x)] P̃ ϕ(θ̃)
  = [Pr(M_1 = Y, ∩_{j ∈ {i∈[J]: θ̃_i ≥ θ̃_1, i ≠ 1}} M_j ≠ Y | X=x), ···, Pr(M_J = Y, ∩_{j ∈ {i∈[J]: θ̃_i ≥ θ̃_J, i ≠ J}} M_j ≠ Y | X=x)] ϕ(θ̃),

where ϕ(θ̃) = [ϕ(θ, 1+K), ···, ϕ(θ, J+K)]^⊤. Therefore,

Σ_{σ∈Σ(θ)} Pr(σ) [ Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) ϕ(θ, σ_j + K) ]
  = ( Σ_{σ∈Σ(θ)} Pr(σ) ) [Pr(M_1 = Y, ∩_{j ∈ {i∈[J]: θ̃_i ≥ θ̃_1, i ≠ 1}} M_j ≠ Y | X=x), ···, Pr(M_J = Y, ∩_{j ∈ {i∈[J]: θ̃_i ≥ θ̃_J, i ≠ J}} M_j ≠ Y | X=x)] ϕ(θ̃)
  = Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) ϕ(θ, σ_j + K)   for any σ ∈ Σ(θ),

which completes the proof.

B.3 Proof of Theorem 4

Proof.

Σ_{y=1}^K η_y + Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) = 1 + Pr(C_{[J]} | X=x),

since the events {M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y} are disjoint and their union is C_{[J]}.

> 0. The last inequality is obtained since the log loss is a proper loss, and thus F(u) > F(ũ), which indicates that Ṽ(x) = V(x)/(1 + V(x)). Combining Step 1 and Step 2, we conclude the proof.
C.1.2 Proof of Lemma 5, (B).

Proof. For any θ ∈ R^{K+J}, we abuse notation and write s_i := s(θ_i), omitting the dependence on θ; we further denote u_j := s_{j+K}. According to the risk formulation in Theorem 3, the risk minimization problem can be formulated as:

min_{s,u}  −Σ_{y=1}^K η_y log s_y − Σ_{y=1}^K (1 − η_y) log(1 − s_y)   [=: F(s_{1:K})]
           −Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) log u_{σ_j}
           −Σ_{j=1}^J (1 − Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x)) log(1 − u_{σ_j})
s.t.  s_i, u_j ∈ [0, 1], ∀ i ∈ [K], j ∈ [J].

First of all, the minimizer is attainable since the feasible region is closed and the objective is continuous. Furthermore, the optimization problem is separable, since s_{1:K} and u_{1:J} are independent in both the objective and the feasible region. Notice that F(s_{1:K}) is minimized at s_y = η_y since s_y ∈ [0, 1], which concludes the proof.

C.2 Proof of Theorem 6

C.2.1 Proof of Theorem 6, (A).

Proof. Denote η̂_y = softmax(θ)_y, u_j = softmax(θ)_{K+j}, and let S be the set of u such that u_j ∈ [0, 1] and Σ_{j∈[J]} u_j = Ṽ(x).

• Step 1: We first prove that j* ∈ Argmax_{j∈[J]} u*_j, by contradiction. Assume j* ∉ Argmax_{j∈[J]} u*_j; then we can find a permutation σ with u*_{σ_1} ≥ ··· ≥ u*_{σ_J} and u*_{σ_1} > u*_{j*}. According to Lemma 5, the risk can be written as:

−Σ_{y=1}^K η_y log( η_y (1 − Ṽ(x)) ) − Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) log u*_{σ_j},

and we denote the second part by P(u*); we focus on it since the first part is unchanged for any u ∈ S. Define u' by u'_{σ_1} = u*_{j*}, u'_{j*} = u*_{σ_1}, and u'_j = u*_j for j ∉ {σ_1, j*}. Let j' be the index with σ_{j'} = j*. Then u'_{σ'_1} ≥ ··· ≥ u'_{σ'_J} for the permutation σ' with σ'_1 = j*, σ'_{j'} = σ_1, and σ'_j = σ_j for j ∉ {1, j'}.
Thus P(u') can be written as:

P(u') = −Σ_{j=1}^J Pr(M_{σ'_{1:j−1}} ≠ Y, M_{σ'_j} = Y | X=x) log u'_{σ'_j} = −Σ_{j=1}^J Pr(M_{σ'_{1:j−1}} ≠ Y, M_{σ'_j} = Y | X=x) log u*_{σ_j}.

Denote T(p) = −Σ_{j=1}^J p_j log u*_{σ_j}. Also denote p* = [Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x)]_{j=1}^J and p' = [Pr(M_{σ'_{1:j−1}} ≠ Y, M_{σ'_j} = Y | X=x)]_{j=1}^J, so that T(p*) = P(u*) and T(p') = P(u'). Denote the partial sums S*_j = Σ_{j'=1}^j p*_{j'} = Pr(C_{σ_{1:j}} | X=x) and S'_j = Σ_{j'=1}^j p'_{j'} = Pr(C_{σ'_{1:j}} | X=x). Since u* is not a constant vector, we can find j̃ such that u*_{σ_j̃} > u*_{σ_{j̃+1}}. Then we can decompose P(u*) − P(u') as:

P(u*) − P(u') = T(p*) − T(p') = −Σ_{j=1}^J (p*_j − p'_j) log u*_{σ_j}
  = Σ_{j=1}^{J−1} (S*_j − S'_j)(log u*_{σ_{j+1}} − log u*_{σ_j})   (summation by parts)
  = Σ_{j=1}^{J−1} (Pr(C_{σ_{1:j}} | X=x) − Pr(C_{σ'_{1:j}} | X=x)) (log u*_{σ_{j+1}} − log u*_{σ_j}),

where each first factor is < 0 according to Condition 1 and each second factor is ≤ 0 since u*_{σ_j} is in descending order, so every summand is ≥ 0, and

P(u*) − P(u') ≥ (Pr(C_{σ_{1:j̃}} | X=x) − Pr(C_{σ'_{1:j̃}} | X=x)) (log u*_{σ_{j̃+1}} − log u*_{σ_j̃}) > 0.

Then P(u*) > P(u'), which contradicts that u* is the minimizer.

• Step 2: We next prove that Argmax_{j∈[J]} u*_j = {j*}. Suppose |Argmax_{j∈[J]} u*_j| = J' > 1. Then there exists a permutation σ with σ_1 = j*, u*_{σ_1} = ··· = u*_{σ_{J'}} = u*_{j*}, and u*_{σ_{J'+1}} ≥ ··· ≥ u*_{σ_J}. Then P(u*) can be written as:

P(u*) = −Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) log u*_{σ_j}
  = −Σ_{j=1}^{J'} Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) log u*_{j*} − Σ_{j=J'+1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) log u*_{σ_j}.

Choose δ > 0 with δ < min{ (u*_{j*} − u*_{σ_{J'+1}})/u*_{j*}, (1 − u*_{j*})/u*_{j*} } =: b.
Define u' by u'_{σ_1} = (1 + δ) u*_{j*}, u'_{σ_{J'}} = (1 − δ) u*_{j*}, and all other entries equal to those of u*. In this case we still have u'_{σ_1} ≥ ··· ≥ u'_{σ_J}. Then P(u') can be written as:

P(u') = −Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) log u'_{σ_j}
  = −Σ_{j=J'+1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) log u*_{σ_j}
    − Pr(M_{σ_{1:J'−1}} ≠ Y, M_{σ_{J'}} = Y | X=x) log((1 − δ) u*_{j*}) − Pr(M_{j*} = Y | X=x) log((1 + δ) u*_{j*})
    + (terms for j = 2, ···, J'−1, unchanged from P(u*)).

Then:

P(u*) − P(u') = Pr(M_{σ_{1:J'−1}} ≠ Y, M_{σ_{J'}} = Y | X=x) log(1 − δ) + Pr(M_{j*} = Y | X=x) log(1 + δ).

Notice Pr(M_{σ_{1:J'−1}} ≠ Y, M_{σ_{J'}} = Y | X=x) ≤ Pr(M_{σ_{J'}} = Y | X=x) < Pr(M_{j*} = Y | X=x). Denote A := Pr(M_{j*} = Y | X=x) and B := Pr(M_{σ_{1:J'−1}} ≠ Y, M_{σ_{J'}} = Y | X=x). Then for any δ ∈ (0, min{b, (A−B)/(A+B)}):

P(u*) − P(u') = log( (1 − δ)^B (1 + δ)^A ) =: F(δ).

Notice that F'(δ) = A/(1+δ) − B/(1−δ) > 0 on (0, (A−B)/(A+B)) and F(0) = 0, hence F(δ) > 0 on (0, min{b, (A−B)/(A+B)}), and thus P(u*) > P(u') for such u', which has the unique argmax element j*. In conclusion, for any u* with multiple argmax elements we can always find a u' with unique argmax element j* that has lower risk, which concludes this step. Steps 1 and 2 together show that Argmax_{j∈[J]} u*_j is uniquely attained at j*.

• Step 3: Finally we prove u*_{j*} = Acc_{j*}(x)(1 − Ṽ(x)). By Step 1, j* ∈ Argmax_{j∈[J]} u*_j. Denote S_σ := {u ∈ S : u_{σ_1} ≥ ··· ≥ u_{σ_J}}; then u* ∈ S_σ for some σ with σ_1 = j*. It thus suffices to show that for any σ with σ_1 = j*, every u' ∈ Argmin_{u∈S_σ} P(u) satisfies u'_{j*} = Acc_{j*}(x)(1 − Ṽ(x)). Notice that σ_1 = j*, and thus Pr(M_{σ_1} = Y | X=x) > Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) for any j > 1.
We formulate the optimization problem as:

min_u  −Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) log u_{σ_j}
s.t.   Σ_{j=1}^J u_j = Ṽ(x),
       u_{σ_{j+1}} − u_{σ_j} ≤ 0, ∀ j ∈ [J−1].

The objective is convex and the constraints are affine, so the KKT conditions are necessary and sufficient, and we conduct a KKT analysis. For clean notation, let μ_{σ_0} = μ_{σ_J} = 0. The KKT conditions read:

  Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) = u_{σ_j} (λ + μ_{σ_{j−1}} − μ_{σ_j}),   (stationarity)
  μ_{σ_j} ≥ 0,   (dual feasibility)
  μ_{σ_j} (u_{σ_{j+1}} − u_{σ_j}) = 0.   (complementary slackness)

Summing the stationarity condition over j = 1, ···, J gives:

  V(x) = λ Ṽ(x) + Σ_{j=1}^J u_{σ_j} (μ_{σ_{j−1}} − μ_{σ_j}) = λ Ṽ(x) + Σ_{j=1}^J μ_{σ_j} (u_{σ_{j+1}} − u_{σ_j}),

where the last sum vanishes by complementary slackness. Hence λ = 1 + V(x). Applying the stationarity condition again at σ_1 = j* gives Acc_{j*}(x) = u_{j*} (1 + V(x) − μ_{j*}), i.e., μ_{j*} = 1 + V(x) − Acc_{j*}(x)/u_{j*}. Moreover, μ_{j*} = 0 since u_{σ_2} − u_{j*} < 0, and thus u_{j*} = Acc_{j*}(x)/(1 + V(x)) = Acc_{j*}(x)(1 − Ṽ(x)). We then conclude the proof since the softmax operator is order-preserving.

C.2.2 Proof of Theorem 6, (B).

Proof. Since θ*_{1:K} has been solved in Lemma 5, we focus on the following problem:

min_{u ∈ [0,1]^J} P(u) = −Σ_{j=1}^J Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x) log u_{σ_j} − Σ_{j=1}^J (1 − Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x)) log(1 − u_{σ_j}).

Denote by u* any minimizer of P(u), which exists according to Lemma 5.

• Step 1: We first prove that j* ∈ Argmax_{j∈[J]} u*_j. Denote by σ a descending permutation of u*. Suppose σ_1, ···, σ_i ∈ Argmax_{j∈[J]} u*_j and σ_{i*} = j* ∉ Argmax_{j∈[J]} u*_j. Then swapping u*_{σ_1} and u*_{j*} yields u', which has a descending permutation σ' with σ'_1 = j*.
Denote T(p) = −Σ_{j=1}^J p_j log u*_{σ_j} − Σ_{j=1}^J (1 − p_j) log(1 − u*_{σ_j}). Also denote p* = [Pr(M_{σ_{1:j−1}} ≠ Y, M_{σ_j} = Y | X=x)]_{j=1}^J, p' = [Pr(M_{σ'_{1:j−1}} ≠ Y, M_{σ'_j} = Y | X=x)]_{j=1}^J, S*_j = Σ_{j'=1}^j p*_{j'} = Pr(C_{σ_{1:j}} | X=x), and S'_j = Σ_{j'=1}^j p'_{j'} = Pr(C_{σ'_{1:j}} | X=x); then P(u*) = T(p*) and P(u') = T(p'). We have:

P(u*) − P(u') = T(p*) − T(p') = −Σ_{j=1}^J (p*_j − p'_j) log u*_{σ_j} − Σ_{j=1}^J (p'_j − p*_j) log(1 − u*_{σ_j})
  = Σ_{j=1}^{J−1} (S*_j − S'_j)(log u*_{σ_{j+1}} − log u*_{σ_j}) + Σ_{j=1}^{J−1} (S'_j − S*_j)(log(1 − u*_{σ_{j+1}}) − log(1 − u*_{σ_j})).   (summation by parts)

In the first sum, S*_j − S'_j = Pr(C_{σ_{1:j}} | X=x) − Pr(C_{σ'_{1:j}} | X=x) < 0 according to Condition 1 and log u*_{σ_{j+1}} − log u*_{σ_j} ≤ 0 since u*_{σ_j} is in descending order; in the second sum, S'_j − S*_j > 0 and log(1 − u*_{σ_{j+1}}) − log(1 − u*_{σ_j}) ≥ 0. Hence:

P(u*) − P(u')
  ≥ (Pr(C_{σ_{1:i*−1}} | X=x) − Pr(C_{σ'_{1:i*−1}} | X=x)) (log u*_{σ_{i*}} − log u*_{σ_{i*−1}})
    + (Pr(C_{σ'_{1:i*−1}} | X=x) − Pr(C_{σ_{1:i*−1}} | X=x)) (log(1 − u*_{σ_{i*}}) − log(1 − u*_{σ_{i*−1}})) > 0,

which means u' strictly decreases the risk, and thus concludes this step.

• Step 2: We prove that {j*} = Argmax_{j∈[J]} u*_j and u*_{j*} = Acc_{j*}(x). Notice that j* ∈ Argmax_{j∈[J]} u*_j:

– Suppose u*_{j*} = max_j u*_j < Acc_{j*}(x). Then we can increase it to u'_{j*} = Acc_{j*}(x) with u'_j = u*_j for j ≠ j*, which strictly decreases the risk due to the properness of the log loss.
– Suppose u*_{j*} = max_j u*_j > Acc_{j*}(x). Construct the following u':

  u'_j = Acc_{j*}(x) if j = j*;   u'_j = max{ max_{j∈{2,···,J}} Pr(M_{σ_{1:j−1}} ≠ Y, M_j = Y | X=x), u*_j } otherwise.

Then the ordering of u* and u' remains unchanged, and the risk strictly decreases according to the definition of the log loss. Furthermore, Acc_{j*}(x) > Acc_j(x) ≥ Pr(M_{σ_{1:j−1}} ≠ Y, M_j = Y | X=x), which means u'_{j*} > u'_{σ_2}.

By the discussion above, the claim of this step is proved. We then conclude the proof since the sigmoid function is invertible.

D Details of Experiments

We implemented all methods in PyTorch [31] and conducted all experiments on NVIDIA H200 GPUs.

D.1 Experimental Setup on Synthetic Expert Datasets

Optimizers All models are trained using SGD with momentum 0.9 and weight decay 5 × 10⁻⁴. The initial learning rate is set to 0.1 and scheduled by cosine annealing. Each model is trained for 200 epochs (batch size 128) on CIFAR-100 and 90 epochs (batch size 512) on ImageNet.

CIFAR-100: Animal Experts We consider a subset of CIFAR-100 consisting of 50 biological classes, e.g., mammals, fish, and insects. Following common practice, we refer to this subset as the biological (or "animal") subset for simplicity. We then simulate experts with varying levels of competence for CIFAR-100 in the following ways: 1) Animal Expert: each expert's domain includes 2 disjoint biological classes; accuracy is 94% on the domain, 75% on other biological species, and random on the rest. 2) Overlapped Animal Expert: we fix the overlap length to 5 and re-define overlapping expert domains for each expert count so that their union exactly covers all biological classes; accuracy follows the same setting as above.
3) Varying-Accuracy Animal Expert: each expert's domain includes 2 disjoint biological classes, with in-domain accuracy linearly increasing from 88% to 94%; accuracy on other animals is 75%, and random otherwise.

ImageNet: Dog Experts We consider a subset of ImageNet consisting of 120 dog classes, e.g., Labrador retriever and German shepherd. Based on this subset, we simulate domain-specialized dog experts with varying levels of competence for ImageNet in the following three ways: 1) Dog Expert: each expert's domain includes 5 disjoint dog classes; accuracy is 88% on the domain, 75% on other dog species, and random on the rest. 2) Overlapped Dog Expert: we fix the overlap length to 50 and re-define overlapping expert domains for each expert count so that their union exactly covers all dog classes; accuracy follows the same setting as above. 3) Varying-Accuracy Dog Expert: each expert's domain includes 5 disjoint dog classes, with in-domain accuracy linearly increasing from 75% to 88%; accuracy on other dog species is 75%, and random otherwise.

D.2 Experimental Setup on Real-world Expert Datasets

Datasets We further conduct experiments on two datasets with real-world experts, MiceBone [35] and Chaoyang [43], to demonstrate the practical effectiveness of our PiCCE method.

• The MiceBone dataset consists of 7,240 second-harmonic-generation microscopy images categorized into three classes: similar collagen fiber orientation, dissimilar collagen fiber orientation, and not of interest. The dataset contains annotations from 79 professional human annotators, with each image annotated by up to five of them. Among these annotators, only eight provide labels for the entire dataset. Following Zhang et al. [42], we consider these eight annotators as human experts. Details of their prediction performance are provided in Table 3.
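The synthetic expert recipes above can be sketched in a few lines. The accuracy values (94% / 75% / random) come from the text; the exact sampling scheme and function name are assumed implementation details, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_expert(y_true, domain, num_classes=100, p_domain=0.94,
                     p_other_bio=0.75, bio_classes=frozenset(range(50))):
    """Simulate one CIFAR-100 'Animal Expert': correct with probability 0.94
    on its domain, 0.75 on other biological classes, and a uniformly random
    guess elsewhere."""
    preds = np.empty_like(y_true)
    for i, y in enumerate(y_true):
        if y in domain:
            p_correct = p_domain
        elif y in bio_classes:
            p_correct = p_other_bio
        else:
            preds[i] = rng.integers(num_classes)  # random guess off-domain
            continue
        if rng.random() < p_correct:
            preds[i] = y
        else:  # a uniformly random *wrong* label
            wrong = rng.integers(num_classes - 1)
            preds[i] = wrong if wrong < y else wrong + 1
    return preds

labels = rng.integers(0, 100, size=20000)
preds = synthetic_expert(labels, domain={3, 4})
in_domain = np.isin(labels, [3, 4])
```

The empirical in-domain accuracy of the simulated expert then concentrates around 0.94, and its accuracy on other biological classes around 0.75.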
In our experiments, we vary the number of experts, choosing #Exp ∈ {2, 4, 6, 8}, and consider the first 2, 4, 6, and 8 experts according to the order in Table 3. The dataset is partitioned into five folds. We use the first four folds for training and the remaining fold for testing, resulting in 5,697 training images and 1,543 test images.

Table 3: Prediction accuracy (Acc, %) of the eight selected experts on the MiceBone dataset.

Expert ID   047    290    533    534    580    581    966    745
Train Acc   84.64  85.01  87.43  88.13  81.73  85.96  87.05  85.45
Test Acc    84.64  84.71  86.33  85.68  79.59  84.64  87.88  84.90

• The Chaoyang dataset consists of colon histopathology image patches collected at Chaoyang Hospital in China, categorized into four classes: normal, serrated, adenocarcinoma, and adenoma. Each image is annotated by three pathologists with prediction accuracies of 87%, 91%, and 99%, respectively. Following Zhang et al. [42], we use the first two pathologists as experts and exclude the near-oracle annotator. To evaluate settings with different expert cardinalities, we simulate additional experts. Specifically, for #Exp ∈ {4, 6, 8}, additional experts are introduced with settings identical to the available ones. Experts' predictions are generated independently.

Table 4: The mean of the system error (Err, rescaled to 0-100) and coverage (Cov) over 3 trials on the CIFAR-100 and ImageNet datasets. The best performance is highlighted in boldface.
(Expert pattern: Varying-Acc Animal Expert for CIFAR-100; Varying-Acc Dog Expert for ImageNet.)

                            Vanilla           PiCCE
Dataset     Method  #Exp    Err     Cov       Err     Cov
CIFAR-100   CE      4       18.96   75.34     18.40   77.16
                    8       19.56   76.59     18.38   77.76
                    12      20.18   75.31     18.21   78.09
                    16      22.46   76.10     18.13   78.65
                    20      22.56   76.13     18.09   78.45
            OvA     4       19.07   85.04     19.01   86.33
                    8       20.65   84.27     18.95   85.10
                    12      22.65   83.80     18.88   85.99
                    16      24.77   79.21     18.83   85.85
                    20      26.08   78.02     18.69   85.78
ImageNet    CE      4       42.58   89.21     41.87   90.02
                    8       44.27   88.82     41.28   89.93
                    12      46.28   88.27     41.18   90.01
                    16      47.02   88.71     41.21   90.07
                    20      47.98   87.81     41.39   90.10
            OvA     4       45.03   88.24     42.12   89.40
                    8       52.38   86.71     42.35   89.51
                    12      58.13   85.16     42.51   89.26
                    16      61.02   84.54     42.14   89.32
                    20      65.27   83.81     42.13   89.24

Models and Optimizers We use ResNet-18 as the base model for both the MiceBone and Chaoyang datasets. All models are trained using the AdamW optimizer with an initial learning rate of 3 × 10⁻⁴ and a weight decay of 5 × 10⁻⁴. Each model is trained for 100 epochs with a batch size of 128.

D.3 Additional Experimental Results

System Error and Coverage From Table 4, it can be seen that the PiCCE method consistently outperforms the multi-expert CE and OvA surrogate losses under the varying-accuracy synthetic expert setting on both CIFAR-100 and ImageNet, achieving lower system error together with substantially higher coverage. As the number of experts increases, the vanilla framework exhibits pronounced performance degradation, whereas PiCCE remains robust. Overall, these results show that PiCCE effectively improves the overall system performance in multi-expert L2D.

Classifier Error As shown in Figure 3, for both the varying-accuracy animal expert setting on CIFAR-100 and the varying-accuracy dog expert setting on ImageNet, the classifier accuracy of CE and OvA degrades steadily as the number of experts increases.
In particular, OvA exhibits a pronounced accuracy drop with more experts, while CE also suffers a consistent decline. In contrast, the PiCCE-based variants maintain stable classifier accuracy across different numbers of experts and consistently outperform their vanilla counterparts on both datasets. This observation further suggests that PiCCE effectively resolves the underfitting induced by expert aggregation in practical multi-expert L2D.

Figure 3: We report the classifier's accuracy on the CIFAR-100 and ImageNet datasets. (a) Varying-Accuracy Animal Expert. (b) Varying-Accuracy Dog Expert. Solid lines denote methods from the existing formulation (4), while dashed lines denote methods derived from our PiCCE formulation (9).
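The two evaluation metrics reported in Table 4 can be sketched as follows: the system error is the error of the final prediction (the classifier's label where it predicts itself, the chosen expert's label where it defers), and coverage is the fraction of samples the classifier keeps. This is an illustrative reconstruction of the standard L2D metrics; the function and variable names are ours, not from the paper's code.

```python
import numpy as np

def system_error_and_coverage(clf_preds, expert_preds, defer_to, y_true):
    """Compute system error and coverage for an L2D system.

    defer_to[i] = -1 means the classifier predicts sample i itself;
    otherwise defer_to[i] is the index of the expert handling sample i.
    expert_preds has shape (n_experts, n_samples).
    """
    keep = defer_to == -1
    # Final prediction: classifier's label if kept, chosen expert's otherwise.
    final = np.where(keep, clf_preds,
                     expert_preds[defer_to, np.arange(len(y_true))])
    err = 100.0 * np.mean(final != y_true)   # system error, rescaled to 0-100
    cov = 100.0 * np.mean(keep)              # fraction kept by the classifier
    return err, cov
```

Note that under this definition a higher coverage at equal system error is preferable, since fewer samples consume expert effort.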
