IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING

Mastering the Minority: An Uncertainty-guided Multi-Expert Framework for Challenging-tailed Sequence Learning

Ye Wang, Zixuan Wu, Lifeng Shen, Jiang Xie, Xiaoling Wang, Hong Yu, and Guoyin Wang

Abstract—Imbalanced data distribution remains a critical challenge in sequential learning, leading models to easily recognize frequent categories while failing to detect minority classes adequately. The Mixture-of-Experts model offers a scalable solution, yet its application is often hindered by parameter inefficiency, poor expert specialization, and difficulty in resolving prediction conflicts. To Master the Minority classes effectively, we propose the Uncertainty-based Multi-Expert fusion network (UME) framework. UME is designed with three core innovations: First, we employ Ensemble LoRA for parameter-efficient modeling, significantly reducing the trainable parameter count. Second, we introduce Sequential Specialization guided by Dempster-Shafer Theory (DST), which ensures effective specialization on the challenging-tailed classes. Finally, an Uncertainty-Guided Fusion mechanism uses DST's certainty measures to dynamically weigh expert opinions, resolving conflicts by prioritizing the most confident expert for reliable final predictions. Extensive experiments across four public hierarchical text classification datasets demonstrate that UME achieves state-of-the-art performance. We achieve a performance gain of up to 17.97% over the best baseline on individual categories, while reducing trainable parameters by up to 10.32%. The findings highlight that uncertainty-guided expert coordination is a principled strategy for addressing challenging-tailed sequence learning. Our code is available at https://github.com/CQUPTWZX/Multi-experts.

Index Terms—Sequence learning, Uncertainty-fusion, Multi-expert Learning, Hierarchical Text Classification.
I. INTRODUCTION

In sequence learning, imbalanced data distributions pose a major challenge, leading to situations where some categories are easily recognized while others are difficult to detect [1]. Consequently, models often achieve high accuracy on frequent categories but perform poorly on rare or minority ones. Overlooking these minority categories can have severe consequences. For example, in medical applications, rare diseases are typically underrepresented in data, and failing to detect them undermines the reliability of evidence-based diagnosis [2]. Similarly, in financial security, anomalous transactions and novel attack patterns occur infrequently, and neglecting their detection can expose systems to significant risks [3].

This work was sponsored by the National Natural Science Foundation of China (62306056, 62136002, 62221005). Ye Wang, Zixuan Wu, Lifeng Shen, Jiang Xie, and Hong Yu are with the Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education; School of Artificial Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China. Xiaoling Wang is with the School of Computer Science and Technology, East China Normal University, Shanghai 200062, China. Guoyin Wang is with the Chongqing Key Laboratory of Brain-Inspired Cognitive Computing and Educational Rehabilitation for Children with Special Needs, Chongqing Normal University, Chongqing 401331, China. Corresponding authors: Hong Yu (yuhong@cqupt.edu.cn) and Lifeng Shen (shenlf@cqupt.edu.cn).

[Figure 1 here. Panel (a): a three-level label tree used in the news recommendation scene (1st/2nd/3rd level categories, e.g., Root, Entertainment, Business, Movie, TV, Policy, Company, Finance, Management, Documentary, Action). Panel (b): long-tailed label distribution and challenging-tailed performance across the three levels.]
Fig. 1: As the classification hierarchy becomes deeper, the number of samples per label decreases significantly, leading to a pronounced challenging-tailed problem and a sharp drop in the classification performance of minority classes.

The problem caused by imbalanced data distributions becomes particularly pronounced in hierarchical classification tasks, where the data distribution is inherently long-tailed [4], [5]. Specifically, Figure 1(a) shows a three-level label tree from the news recommendation scene, while Figure 1(b) shows the corresponding distribution of hierarchical labels. The objective of sequence learning here is to predict the specific leaf nodes to which a given text belongs. Figure 1 clearly exhibits that the more specific categories are typically concentrated at the lower levels of the hierarchy, making them harder to learn due to limited training samples. If these categories are overlooked, many leaf nodes in the classification tree will not be correctly covered, which compromises the integrity of the hierarchy and causes the loss of critical fine-grained information [6]–[8].

The Mixture of Experts (MoE) framework has recently emerged as a key technique for addressing data imbalance [9], [10]. It leverages multiple specialized sub-models, each designed to capture distinct patterns or handle specific data subsets. However, the main challenge is to ensure proper coordination among these experts, which encourages sufficient specialization while maintaining global consistency. Determining how to assign data to the most suitable experts and how to aggregate their outputs remains a critical problem.

We propose that an effective multi-expert system must address three fundamental challenges. First, parameter inefficiency arises when combining several large experts, leading to a sharp increase in parameters and high training and memory costs.
Second, expert specialization is difficult to achieve. Without an explicit mechanism for task allocation, experts may converge to similar patterns, causing redundancy, overfitting, and poor coverage of data diversity. Third, expert opinion conflict appears when experts give inconsistent predictions, requiring a reliable strategy to handle these differences and produce stable results [11].

To address these challenges, we propose the Uncertainty-based Multi-Expert fusion network (UME). First, we tackle parameter inefficiency with Ensemble LoRA, which leverages Low-Rank Adaptation (LoRA) to enable lightweight multi-expert modeling. Each expert is fine-tuned with only a small number of trainable parameters. Second, to promote expert specialization, we propose a Sequential Specialization strategy based on Dempster-Shafer evidence theory (DST) [12]. In this strategy, experts are trained one by one rather than simultaneously, which encourages a natural division of labor. Later experts are specifically trained to focus on samples that earlier experts fail to classify correctly. As a result, experts develop distinct uncertainty profiles: early-stage experts show high uncertainty on difficult samples, while late-stage experts exhibit low uncertainty on the same samples they have been trained to master. Finally, UME integrates these signals through an Uncertainty-Guided Fusion mechanism, which dynamically assigns decision weights based on expert confidence and resolves conflicts by prioritizing the most certain expert for each sample.

The main contributions of this work are summarized as follows: (1) We propose an efficient and effective multi-expert fusion network named UME, which improves the recognition of challenging-tailed categories in sequential learning. (2) We introduce a sequential specialization mechanism to promote expert diversity.
Later-added experts focus more on challenging-tailed samples, identified through their predictive uncertainty. (3) To alleviate conflicts among experts, we leverage DST to dynamically integrate multiple experts based on their uncertainty, ensuring consistent and reliable decision fusion. (4) To enhance parameter efficiency, we incorporate low-rank experts, reducing the number of trainable parameters by 10.32% and 7.19% compared with the recent competitive baselines HiTIN (ACL'23) and HiAdv (COLING'24), respectively. Extensive experiments on four public datasets demonstrate that UME achieves state-of-the-art performance, particularly on challenging-tailed categories.

II. RELATED WORK

A. Hierarchical Text Classification

Hierarchical Text Classification (HTC) is an important research topic in natural language processing, where collections of documents contain hierarchically structured concepts. Current HTC methods fall into two main approaches, as defined by Silla and Freitas [13]. The first is local methods, which train specialized classifiers for specific nodes or levels without considering the entire hierarchy [14]. The second is global methods, which utilize the full hierarchical structure through penalties or hierarchy-aware models [15]. Early research mainly focuses on capturing local information, while recent studies emphasize leveraging global information by transferring knowledge across the hierarchy and incorporating it into predictions [16]. Additionally, with the advancement of pretrained language models, recent work continues to enhance the overall classification accuracy of HTC models. For example, Zhang et al. [17] enhance hierarchical text classification performance through multi-label negative supervision and asymmetric loss by constructing a Hierarchy-Aware and Label-Balanced model (HALB). To address the challenges of long-document classification, Liu et al.
[18] utilize an interactive graph network to align local and discourse-level textual features with label structure information. In contrast, Zhu et al. [19] present a strategy called Hierarchy-aware Information Lossless contrastive Learning (HILL) that improves HTC performance by retaining both semantic and syntactic information. Even though previous techniques have achieved good performance in HTC, training classification models for tailed labels remains more challenging than for head labels due to the long-tailed label distribution, which leads to underfitting of tailed labels. Xu et al. [20] propose the Label-Specific Feature Augmentation (LSFA) framework for text classification to improve tail-label representations. It enhances rare labels by generating positive feature-label pairs and transferring intra-class semantic variations from head labels using a prototype-based VAE. Furthermore, challenging the reliance on a fixed label order, Yan et al. [21] explore a random generative method to learn the label hierarchy, demonstrating that a predefined order may not always be optimal. However, their approach does not account for the impact of the hierarchy in addressing the underfitting of tailed labels. The hierarchy leverages multi-level semantic aggregation to build semantic bridges, enabling tail labels to learn richer context from head labels and enhance their representations. Thus, we propose to leverage multi-expert ensembles to improve classification accuracy by allowing experts to specialize in different tasks. By incorporating uncertainty-based weighting among experts, our method provides better support for challenging tail-label samples at the lower levels of the hierarchy.

B. Uncertainty-based Text Classification

In recent years, uncertainty-aware methods have been increasingly explored to enhance text classification performance. For example, Mukherjee et al.
[22] leverage Bayesian deep learning to estimate uncertainty for few-shot text classification. Li et al. [23] propose uncertainty-based criteria to identify hard samples during training, while Hu et al. [24] introduce an evidential uncertainty network for out-of-distribution detection. These studies demonstrate the effectiveness of Bayesian or evidential uncertainty modeling in improving robustness.

While Bayesian approaches provide a principled treatment of uncertainty, many sampling-based methods (e.g., Monte Carlo Dropout [25]) require repeated stochastic forward passes at inference time, leading to increased computational cost. In contrast, we employ Dempster–Shafer Theory to model uncertainty directly from output evidence. Crucially, DST explicitly represents conflict between experts, allowing us to distinguish ignorance from disagreement. This capability is essential for our sequential fusion strategy, yet remains underexplored in long-tailed hierarchical text classification.

C. Multi-Expert Text Classification

Multi-expert learning fuses multiple single experts' classification results to improve the final performance [9], [26]. For example, Zhao et al. [27] propose a multi-expert method that assigns different models to learn global and local features separately, allowing for better classification of long-tailed data using a hierarchical approach. Subsequently, an optimal decision threshold within the multi-expert method [28] is proposed, introducing a gating network to enhance classification performance. The Divide, Conquer, and Combine paradigm has also been explored in sequence learning. For instance, Wang et al. [29] utilize it for zero-shot dialogue tracking: they explicitly partition data into semantic clusters and combine experts via distance-based mapping.
In contrast, our UME framework adapts this paradigm for long-tailed classification with distinct mechanisms. Instead of semantic clustering, we employ Sequential Specialization to divide the problem by sample difficulty. Furthermore, rather than distance-based inference, we utilize Dempster-Shafer Theory to combine experts. This approach specifically addresses the issue of conflicting expert predictions in imbalanced scenarios.

However, for multi-expert learning in HTC, a problem arises when different experts hold differing opinions on the classification results, causing conflicts in the overall multi-expert ensemble outcome. Compared to previous approaches, the proposed method provides a more flexible solution for HTC, particularly in addressing the challenge of long-tailed category classification. Since experts in multi-expert networks may produce conflicting opinions, effectively integrating their predictions is crucial. To tackle this, we introduce an uncertainty-based multi-expert fusion framework that leverages DST for dynamic expert weighting.

III. METHODOLOGY

In this section, we formally elaborate on the proposed Uncertainty-based Multi-Expert fusion network (UME). We first introduce the task definition and the formal definitions of Evidence and Uncertainty in Section 3.1 (Preliminaries). Section 3.2 introduces how to obtain representations of the text and the label hierarchy, while Section 3.3 constructs the low-rank experts. In our multi-expert framework, each expert extracts evidence from the input and formulates a classification opinion. We then construct the Uncertainty-based Multi-Expert Fusion Network in Section 3.4. The main workflow is illustrated in Figure 2, with the fusion process comprising three key steps (detailed in Section 3.4): i) uncertainty quantification; ii) uncertainty combination; and iii) likelihood-based optimization.
Section 3.5 presents our training objective, which demonstrates how experts dynamically participate during training.

A. Preliminaries

In the HTC task, we are given a label set L = {l_1^1, l_2^1, ..., l_K^H} organized into a hierarchical structure, typically represented as a label tree. Here, H is the height of the hierarchical label tree, and K denotes the total number of labels in the set. The goal of HTC is to predict a subset of this label set for a given input text S = {c_1, c_2, ..., c_N}, where c_i (i = 1, ..., N) represents the input tokens, and N is the total number of tokens in the text. The task aims to assign the appropriate labels to the input text based on its content while considering the hierarchical relationships between labels. In the following, we provide the formal definitions of Evidence and Uncertainty used in the proposed fusion framework.

Definition 1. Evidence. In neural networks, logits are the outputs of the final layer in classification tasks and represent the model's raw predictions before applying functions like softmax. These logits reflect the model's confidence in each class. In Dempster-Shafer Theory (DST), evidence refers to the degree of support for a specific proposition, such as a sample belonging to a particular category. We treat logits as evidence because they provide a quantifiable measure of support for each class. Formally, we define evidence e = logits, where e = [e_1, e_2, ..., e_K] for a total of K classes. This mapping aligns well with DST's requirement for quantifiable support, as the logits directly represent the model's assessment of how likely a sample belongs to each category. By using logits as evidence, we bridge the outputs of neural networks with DST, enabling the integration of uncertainty modeling and evidence fusion into our framework.
This allows the model to better handle challenging classification tasks, such as those involving long-tailed distributions.

Definition 2. Uncertainty. Subjective Logic (SL) formalizes DST's notion of belief assignments over a frame of discernment as a Dirichlet distribution [30]. Hence, it allows one to use the principles of evidential theory to quantify belief masses and uncertainty through a well-defined theoretical framework. SL considers a frame by assigning belief mass b_k to the k-th class, k = 1, ..., K, thereby determining the overall uncertainty mass u. These K + 1 mass values are all non-negative and sum to 1, ensuring a complete and consistent distribution of belief and uncertainty. Uncertainty quantifies the trustworthiness of the classification results by distributing belief mass b_k = (α_k − 1)/S across possible labels. Here, α_k represents the Dirichlet parameter for the k-th class, and S = Σ_{k=1}^K (e_k + 1) is the Dirichlet strength, reflecting the total amount of evidence. A larger S indicates more accumulated evidence, reducing uncertainty and leading to more confident belief assignments. When no evidence is present (e_k = 0 for all k), the belief for each class is zero, and uncertainty reaches its maximum (i.e., u = 1). The Dirichlet parameters are calculated as α_k = e_k + 1, where e_k is the evidence for the k-th class; adding 1 prevents zero belief when no evidence is observed and ensures a smooth uncertainty estimate. Evidence e_k quantifies the model's support for each class. A higher evidence value e_k results in a larger Dirichlet parameter α_k, leading to increased belief mass b_k, which corresponds to greater confidence in the class prediction. Conversely, lower evidence implies greater uncertainty. The unallocated belief mass u, which represents uncertainty, is defined as:

u = 1 − Σ_{k=1}^K b_k = K / S,   (1)

where S = Σ_{k=1}^K α_k. Indeed, since b_k = (α_k − 1)/S, we have Σ_{k=1}^K b_k = (S − K)/S, which yields u = K/S. This ensures that:

u + Σ_{k=1}^K b_k = 1,   (2)

where u captures the system's lack of confidence in all categories. In summary, higher evidence leads to higher belief mass b_k and lower uncertainty u, whereas lower evidence results in increased uncertainty.

[Figure 2 here, showing the UME architecture: a frozen BERT encoder with trainable LoRA experts, per-expert evidence e^m and uncertainty u^m, the conflict metric C, dynamical weights w^m, uncertainty fusion, and a sigmoid prediction head over text and taxonomic-hierarchy inputs.]

Fig. 2: The structure of our proposed model UME. The uncertainty of every single expert is obtained for multi-expert learning. The multi-expert joint uncertainty is dynamically ensembled according to the degree of uncertainty among experts.

B. Representations of Text and Label Hierarchy

The text representation, denoted by H, is a hidden representation of the input text, while the key tokens' representation Ĥ refers to the representation related to the hierarchical structure, obtained by combining it with the graph encoder. We first utilize the Embedding layer of BERT [31]:

H = Embedding(x),   (3)

where H ∈ R^{n×d_h} is the hidden representation, d_h is the dimension of the hidden layer, and x is the input token sequence. Meanwhile, [32] is used as the graph encoder with the guidance of labels. Here, key tokens are constructed by masking unimportant tokens. Combining and contrasting the two representations injects hierarchical information into the BERT encoder. Following the approach of HGCLR [33], we adopt Gumbel-Softmax [34] instead of the standard Softmax function to make the sampling operation differentiable.
Tokens for which the probability of belonging to label y_i exceeds the threshold γ are kept as key tokens x̂:

x̂_i = x_i if Σ_{j∈y} P_ij > γ; else 0,   (4)

where P_ij represents the sampling probability. Similarly, the key tokens pass through the same embedding layer to obtain the corresponding representation Ĥ = Embedding(x̂).

C. Low-rank Experts

For the BERT encoder, both H and Ĥ are taken as inputs for calculating the token representations and hierarchy representations. Note that Ĥ is used to enhance H's representation via contrastive learning, which will be introduced in Equation (19). We freeze the parameters of the Transformer blocks in the BERT encoder and subsequently apply uncertainty fusion using the LoRA technique in the last FFN layer of the BERT encoder. Specifically, we fine-tune the FFN layer using M low-rank experts defined by:

h_out = W_0 h + ΔW h = W_0 h + Σ_{m=1}^M B_m A_m h,   (5)

where h denotes the input hidden state to the FFN layer, W_0 represents the original parameter matrix, and ΔW is the update to the original FFN layer. A_m and B_m denote the low-rank decomposition matrices of the m-th expert's update. A_m is initialized with random Gaussian values, while B_m is initialized to zero.

D. Uncertainty-based Multi-Expert Fusion Network

Given an input sample, each expert produces evidence scores for the hierarchical labels. These scores are converted into uncertainty-aware representations using Dempster–Shafer Theory. We then compute the prediction conflict and uncertainty between adjacent experts, which are used to dynamically determine each expert's contribution. Experts with high confidence dominate simple samples, while high-conflict samples are progressively routed to later experts for further refinement. Finally, the weighted evidence from all experts is fused to produce the final prediction.
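The low-rank expert layer of Section C, Eq. (5), can be sketched in a few lines. The NumPy version below uses toy dimensions and initialization scales that are illustrative assumptions, not the paper's settings; it shows why, with B_m initialized to zero, training starts exactly from the frozen model's behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, r, M = 16, 4, 3  # hidden size, LoRA rank, number of experts (toy values)

W0 = rng.normal(size=(d_h, d_h))  # frozen pretrained FFN weight
# Per-expert low-rank factors: A_m Gaussian-initialized, B_m zero-initialized (as in Eq. 5)
A = [rng.normal(scale=0.01, size=(r, d_h)) for _ in range(M)]
B = [np.zeros((d_h, r)) for _ in range(M)]

def ffn_forward(h):
    """h_out = W0 h + sum_m B_m A_m h  (Eq. 5)."""
    out = W0 @ h
    for Am, Bm in zip(A, B):
        out = out + Bm @ (Am @ h)  # each expert adds a rank-r update
    return out

h = rng.normal(size=d_h)
# With every B_m = 0 at initialization, the experts contribute nothing,
# so the output equals the frozen path exactly.
assert np.allclose(ffn_forward(h), W0 @ h)
```

Only the A_m and B_m factors (2·M·r·d_h parameters) would be trained; W0 stays frozen, which is the source of the parameter savings the paper reports.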
(i) Uncertainty quantification serves as the foundation for integrating multiple experts. We employ Dempster-Shafer Theory (DST) [30] to model hierarchical label prediction, distinguishing between belief mass b_k (evidence) and uncertainty u [35]. Specifically, we model the belief mass using a Dirichlet distribution D(·) [36]:

D(p | α) = (1/B(α)) Π_{k=1}^K p_k^{α_k − 1} for p ∈ S_K; 0 otherwise,   (6)

where α represents the distribution parameters, B(·) is the multivariate beta function, and S_K = {p | Σ_{k=1}^K p_k = 1, 0 ≤ p_k ≤ 1} denotes the K-dimensional unit simplex. Based on this distribution, evidence-based uncertainty is quantified according to Definitions 1 and 2.

(ii) Uncertainty fusion combines the uncertainties u = {u^1, ..., u^M} and evidence e = {e^1, ..., e^M} from all experts to dynamically weight their contributions. This mechanism adjusts expert influence based on sample complexity: it prioritizes the preceding experts for simple samples while leveraging the full ensemble for hard, long-tailed categories. First, we sequentially fuse the expert uncertainties u:

u = u^1 ⊕ u^2 ⊕ ... ⊕ u^M = Π_{m=1}^M u^m / Π_{m=1}^M (1 − C^m),   (7)

where C^m = Σ_{i≠j} b_i^m b_j^{m−1} represents the conflict metric between adjacent experts (with C^1 = 0), and b_i^m denotes the confidence of the m-th expert for the i-th category. A conflict metric of 0 indicates consistency. To promote diversity, we employ a Sequential Specialization strategy where subsequent experts explicitly focus on hard samples characterized by high predictive uncertainty. Based on the conflict C^m and uncertainty u^m, we define the dynamical weight for each expert:

w^{m+1} = w^m ⊕ u^m = (1/(1 − C^m)) w^m u^m.   (8)

Here, w^m quantifies the accumulated uncertainty passed from the previous experts to expert m (initialized as w^1 = 1, w^2 = u^1).
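Definitions 1–2 and Eqs. (7)–(8) can be sketched numerically. This toy version assumes evidence is clamped to be non-negative with a ReLU (the paper defines e = logits without stating a clamp) and uses made-up logits for two experts:

```python
import numpy as np

def belief_and_uncertainty(logits):
    """Map evidence to belief masses and uncertainty (Defs. 1-2, Eq. 1)."""
    e = np.maximum(logits, 0.0)   # non-negative evidence (assumption: ReLU on logits)
    alpha = e + 1.0               # Dirichlet parameters alpha_k = e_k + 1
    S = alpha.sum()               # Dirichlet strength
    b = (alpha - 1.0) / S         # belief mass per class
    u = len(e) / S                # uncertainty u = K / S
    return b, u

def conflict(b_prev, b_curr):
    """C^m = sum_{i != j} b_i^m b_j^{m-1}  (Eq. 7)."""
    return b_curr.sum() * b_prev.sum() - (b_curr * b_prev).sum()

# Toy example: expert 1 is uncertain, expert 2 is confident about class 0.
b1, u1 = belief_and_uncertainty(np.array([0.5, 0.4, 0.3]))
b2, u2 = belief_and_uncertainty(np.array([9.0, 0.0, 0.0]))
C2 = conflict(b1, b2)

# Dynamical weights, Eq. 8: w^1 = 1, w^2 = u^1, w^{m+1} = w^m u^m / (1 - C^m)
w1 = 1.0
w2 = u1
w3 = w2 * u2 / (1.0 - C2)

assert abs(b1.sum() + u1 - 1.0) < 1e-9  # belief masses and u sum to 1 (Eq. 2)
assert u2 < u1                          # more evidence -> lower uncertainty
assert 0.0 <= C2 < 1.0
```

The more evidence an expert accumulates, the larger its Dirichlet strength S and the smaller its uncertainty share K/S, which is exactly the signal the dynamical weights propagate to the next expert.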
These dynamical weights are then applied to aggregate the evidence from all M experts:

ê = Σ_{m=1}^M exp{w^m/η} · e^m / Σ_{m=1}^M exp{w^m/η},   (9)

where η is a temperature factor used to scale the weights via a Softmax-like operation. During inference, the combined evidence ê is converted into probabilities using the Sigmoid function:

σ(ê) = 1 / (1 + exp(−ê)).   (10)

Finally, labels with a probability σ(ê) exceeding a predefined threshold are selected as predictions. The following details how these dynamical weights guide the training process.

(iii) Likelihood-based Optimization. We propose to improve the uncertainties of each single expert and of the combined multi-expert system via likelihood-based optimization. Given the evidence e_i and the one-hot class vector y_i, the adjusted Dirichlet distribution D̃(p_i | e_i) is used as a prior for the multinomial likelihood P(y_i | p_i). The negative logarithm of the marginal likelihood is then given by

L_ml = −log [ ∫ Π_{k=1}^K p_ik^{y_ik} · (1/B(e_i)) Π_{k=1}^K p_ik^{e_ik − 1} dp_i ],   (11)

which encourages the correct class to receive more evidence. Additionally, to address the problem of high overall evidence on wrong labels, we introduce L_evidence, defined as

L_evidence = KL( D(p_i | α̃_i) ‖ D(p_i | 1) ),   (12)

where α̃_i = 1 + (1 − y_i) ⊙ e_i is the adjusted Dirichlet parameter. By reducing the difference between the target distribution and the adjusted distribution, the evidence for incorrect classes is diminished. Thus, the objective of a single expert is given by:

L_single = L_ml + λ(t) L_evidence,   (13)

where λ(t) = min{1, t/T} is the annealing factor (t is the current epoch). Intuitively, when t is larger, the effect of the KL divergence is fully applied. This ensures that experts can capture differentiated evidence in the early stages.
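The single-expert objective of Eqs. (11)–(13) can be sketched as follows. The paper states Eq. (11) as an integral and Eq. (12) as a KL divergence; the closed-form expressions below are the standard ones from the evidential deep learning literature and are therefore our assumption, as is the one-hot target:

```python
import math

def _digamma(x, h=1e-5):
    # Numerical derivative of log-gamma; adequate accuracy for this sketch.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def loss_single(evidence, y, t, T=10.0):
    """L_single = L_ml + lambda(t) * L_evidence  (Eqs. 11-13)."""
    K = len(y)
    alpha = [e + 1.0 for e in evidence]
    S = sum(alpha)
    # Eq. 11 in closed form: sum_k y_k (log S - log alpha_k)
    L_ml = sum(yk * (math.log(S) - math.log(ak)) for yk, ak in zip(y, alpha))
    # Eq. 12: KL( Dir(alpha_tilde) || Dir(1) ), alpha_tilde = 1 + (1 - y) * evidence
    at = [1.0 + (1.0 - yk) * ek for yk, ek in zip(y, evidence)]
    S_t = sum(at)
    L_ev = (math.lgamma(S_t) - math.lgamma(K)
            - sum(math.lgamma(a) for a in at)
            + sum((a - 1.0) * (_digamma(a) - _digamma(S_t)) for a in at))
    lam = min(1.0, t / T)  # annealing factor, Eq. 13
    return L_ml + lam * L_ev

y = [1.0, 0.0, 0.0]
# Evidence on the correct class lowers the loss; evidence on wrong classes raises it.
assert loss_single([8.0, 0.0, 0.0], y, t=0) < loss_single([1.0, 4.0, 0.0], y, t=0)
```

When all misplaced evidence is zero, α̃ collapses to the all-ones vector and the KL term vanishes, so the penalty only fires on evidence assigned to incorrect classes.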
Finally, the joint goal of dynamically learning evidence in the multi-expert network is given by:

L = Σ_{i=1}^N Σ_{m=1}^M 1{w_i^m > ε} L_single.   (14)

E. Training Objectives

Apart from the training loss in Equation (14), we introduce two losses to enhance the optimization of the multi-expert fusion network: 1) the classification loss L_C; and 2) the contrastive learning loss L_cl for enhanced text representation.

Classification loss L_C. The probability of text x_i appearing under label j is:

p_ij = sigmoid(W_c h_i + b_c)_j,   (15)

where W_c ∈ R^{k×d_h} and b_c ∈ R^k are the weights and biases, respectively. h = Encoder(H) and ĥ = Encoder(Ĥ) represent the text representations and hierarchical representations after passing through the encoder layer. We employ a binary cross-entropy loss function:

L_C^{ij} = −y_ij log(p_ij) − (1 − y_ij) log(1 − p_ij),   (16)

L_C = Σ_{i=1}^N Σ_{j=1}^k L_C^{ij},   (17)

where y_ij is the ground-truth label; when ĥ_i is used instead of h_i, the classification loss of the key sample is denoted L̂_C.

Contrastive learning loss L_cl. To calculate the contrastive learning loss, we construct negative samples for tokens and key tokens:

c_i = W_2 F(W_1 h_i),  ĉ_i = W_2 F(W_1 ĥ_i),   (18)

where W_1 ∈ R^{d_h×d_h}, W_2 ∈ R^{d_h×d_h}, F is the ReLU function, and d_h is the hidden dimension. We define z ∈ {c_i} ∪ {ĉ_i}, and the NT-Xent loss [37] is used to push positive and negative examples apart. The NT-Xent loss of z_i is calculated as:

L_cl^i = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1, k≠i}^{2N} exp(sim(z_i, z_k)/τ) ],   (19)

where sim is the cosine similarity function and τ is the temperature hyperparameter; z_i and z_j are a pair of real and key tokens. The contrastive loss is then averaged as:

L_cl = (1/2N) Σ_{i=1}^{2N} L_cl^i.   (20)

Thus, the objective of a single expert is updated from Equation (13) to:

L̃_single = L_ml + λ(t) L_evidence + L_C + L̂_C + L_cl.   (21)

Finally, the overall training objective is:

L_final = Σ_{i=1}^N Σ_{m=1}^M 1{w_i^m > ε} L̃_single.   (22)

TABLE I: Statistical details of the datasets. |Y| is the number of unique labels in each dataset; Avg(y_i) is the average number of labels per instance; Depth is the depth of the hierarchical structure. The last three columns give the number of instances in the training, validation, and test sets, respectively.

Dataset   |Y|  Avg(y_i)  Depth  #Train  #Dev    #Test
WOS       141  2.0       2      30,070  7,518   9,397
RCV1-V2   103  3.24      4      18,520  4,629   781,265
AAPD      54   2.41      2      43,872  10,968  1,000
BGC       146  3.01      4      58,800  14,700  18,394

IV. EXPERIMENTS

In this section, we conduct experiments to verify the effectiveness of the proposed multi-expert fusion network for hierarchical text classification. Specifically, we answer the following research questions:

RQ1: Does the proposed UME outperform the recent strong baselines?
RQ2: What is the performance on challenging-tailed samples?
RQ3: What are the effects of the number of experts?
RQ4: How do multiple experts handle challenging-tailed samples?
RQ5: What are the effects of the components and fusion strategies in UME?
RQ6: Is multiple-expert fusion still efficient?
RQ7: Can LLMs replace the proposed supervised framework in hierarchical text classification?

A. Datasets

We select four public HTC datasets for the experiments: Web of Science (WOS) [38], RCV1-V2 [39], AAPD [40], and BGC [1]. Specifically, WOS and AAPD contain abstracts of papers published in the Web of Science database and on arXiv, respectively, along with the corresponding subject categories. RCV1-V2 is a news classification corpus, while the BGC dataset consists of book introductions and metadata.

[1] The BGC dataset is available at: www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/blurb-genre-collection.html
WOS is used for single-path hierarchical text classification, while RCV1-V2, AAPD, and BGC include multi-path classification labels. Statistical details are listed in Table I.

We conduct a statistical analysis of the class distributions of the four benchmark datasets: AAPD, WOS, RCV1, and BGC. As illustrated in Figure 3, all datasets exhibit severe long-tailed characteristics with distinct degrees of imbalance. AAPD shows a moderate imbalance ratio (IR) of 49.0. In contrast, WOS and RCV1 reveal extreme data sparsity in the tail classes, where the minimum sample count (N_min) drops to 1, resulting in high IRs of 750.0 and 2682.0, respectively. BGC presents the most significant disparity, with an IR reaching 6854.0: the head class contains over 34,000 samples while the tail class has only 5. These statistics quantitatively confirm the intrinsic difficulty and the severe class imbalance present in these datasets.

TABLE II: The comparison of different models on WOS, RCV1-V2, AAPD, BGC, and NYT.
Model                    | WOS (Micro/Macro-F1)  | RCV1-V2               | AAPD                  | BGC                   | NYT
Hierarchy-Aware Models
HiAGM (ACL 2020)         | 85.11±1.4 / 80.02±0.7 | 83.17±1.8 / 62.35±1.9 | 75.54±2.1 / 58.86±1.2 | 76.61±0.9 / 57.27±0.7 | — / —
HTCInfoMax (NAACL 2021)  | 84.86±0.6 / 79.12±0.2 | 83.26±0.8 / 60.66±1.5 | 77.42±0.7 / 57.72±0.7 | 74.26±0.4 / 56.81±1.0 | — / —
HiMatch (ACL 2021)       | 85.04±0.9 / 80.37±1.3 | 84.42±0.9 / 63.99±1.1 | 76.70±0.3 / 59.86±0.5 | 76.63±0.8 / 57.87±1.0 | — / —
Pretrained Language Models
T5 (2019)                | 85.63±0.5 / 79.09±0.9 | 84.04±0.5 / 65.08±0.8 | 74.99±0.7 / 61.10±0.3 | 74.82±0.9 / 61.09±0.8 | — / —
RoBERTa (2019)           | 84.76±0.8 / 79.11±0.3 | 83.45±0.2 / 65.01±0.9 | 75.08±0.6 / 60.31±0.8 | 74.16±0.9 / 60.49±1.0 | — / —
HGCLR (ACL 2022)         | 86.75±0.3 / 80.93±1.0 | 86.40±0.7 / 68.32±0.5 | 78.90±0.3 / 62.75±0.4 | 78.34±0.8 / 64.96±0.7 | — / —
HPT (EMNLP 2022)         | 86.83±1.1 / 81.24±0.6 | 86.51±1.2 / 69.16±0.9 | 77.83±0.5 / 62.94±0.6 | 79.98±1.0 / 65.30±0.4 | — / —
HiTIN (ACL 2023)         | 87.52±0.2 / 81.96±0.4 | 86.79±0.6 / 69.65±0.8 | 78.60±0.3 / 63.22±0.4 | 79.98±0.6 / 65.08±0.5 | — / —
HJCL (EMNLP 2023)        | 87.67±0.4 / 81.45±0.3 | 86.27±0.4 / 69.40±0.9 | 78.55±0.8 / 63.25±0.6 | 80.36±0.4 / 66.60±0.2 | — / —
T5-InterMRC (IPM 2024)   | 86.04±0.7 / 80.73±0.5 | 84.96±1.0 / 66.05±0.4 | 77.22±0.9 / 61.78±1.1 | 75.83±0.8 / 63.11±0.7 | 74.89±0.7 / 65.14±0.8
HiAdv (COLING 2024)      | 87.05±0.2 / 80.97±1.0 | 86.71±0.2 / 69.29±0.4 | 78.94±0.3 / 62.52±0.8 | 79.60±0.9 / 65.63±0.8 | 79.05±0.3 / 69.13±0.5
HALB (KBS 2024)          | 86.75±1.0 / 81.44±0.8 | 86.14±0.9 / 68.38±0.8 | 78.10±0.4 / 63.10±0.5 | 80.05±0.4 / 66.21±0.6 | 79.09±0.2 / 69.20±0.3
HILL (NAACL 2024)        | 87.44±0.2 / 81.83±0.5 | 86.23±0.4 / 69.50±0.6 | 79.02±0.3 / 63.24±0.4 | 79.97±0.3 / 65.84±0.7 | 79.25±0.6 / 69.54±0.6
HiGen (EACL 2024)        | 87.39±0.2 / 81.65±0.2 | 86.53±0.4 / 69.24±0.3 | 78.71±0.5 / 63.09±0.5 | 80.39±0.3 / 66.54±0.6 | 78.62±0.5 / 68.81±0.6
UME (Ours)               | 87.71±0.4 / 82.29±0.3 | 86.88±0.4 / 69.70±0.2 | 79.21±0.2 / 63.46±0.3 | 80.46±0.6 / 66.74±0.4 | 79.41±0.3 / 69.76±0.5

Fig. 3: Class distribution analysis across the four benchmark datasets: (a) WOS, (b) RCV1-V2, (c) AAPD, (d) BGC. The histograms use a logarithmic scale to visualize the severe long-tailed nature and extreme class imbalance ratios (IR) present in WOS, RCV1-V2, AAPD, and BGC.

B. Baseline and Parameter Settings

We select 14 competitive baseline models for comparison, including HiAGM [15], HTCInfoMax [41], HiMatch [4], T5 [42], RoBERTa [43], HGCLR [33], HPT [44], HJCL [45], HiTIN [46], T5-InterMRC [47], HiAdv [48], HALB [17], HILL [19], and HiGen [49]. Specifically, HiAGM uses label-dependent prior probabilities to aggregate node information for global-level awareness. HTCInfoMax uses information maximization to optimize text-label relationships and label representations. HiMatch maps text and tags to a common embedding space for better matching. These three models are hierarchy-aware models and do not use pretrained language models. In addition, T5 unifies natural language tasks into a text-to-text format, simplifying the fine-tuning process. RoBERTa improves on BERT by increasing the model size and enhancing the training procedure.

TABLE III: Parameter descriptions and details.

Parameter  | Description
r          | LoRA rank. Default: 64
lr         | Learning rate
batch      | Batch size
epoch      | Number of training epochs. Default: 10
early-stop | Number of epochs before early stopping
update     | Gradient accumulation steps
warmup     | Warm-up steps
graph      | Whether to use the graph encoder. Default: True
multi      | Whether the task is multi-label classification. Should keep the default since all datasets are multi-label classifications. Default: True
thre       | Threshold for keeping tokens; denoted as gamma in the paper
wandb      | Whether to use wandb for logging
ta         | If the prefix weight ≤ ε, the loss of expert m on the sample is eliminated
eta        | Temperature factor that adjusts the sensitivity of the prefix weights
T5-InterMRC proposes an interpretable model based on T5 that enhances the explainability of evidence in the task. HiGen optimizes HTC performance through dynamic text representations and level-oriented loss functions. As shown in Table III, we present the detailed training parameters, including the LoRA rank, the number of epochs, and other hyperparameters, along with their meanings. Additionally, we conduct experiments to assess the impact of different LoRA rank sizes on performance. Furthermore, we use BERT as the text backbone and Graphormer to encode the classification hierarchy. Please refer to our GitHub repository for the complete implementation and reproducibility details.

C. Main Results on Hierarchical Text Classification (RQ1)

The main average results under five-fold cross-validation are summarized in Table II. In particular, UME achieves the highest performance across all datasets and overall averages.

TABLE IV: Performance of different models on challenging-tailed categories in the RCV1-V2 dataset. We report results in terms of win-lose counts.

Category | Count | Ours   | HiTIN  | HJCL   | HGCLR  | HPT
g157     | 1991  | 55.65% | 55.25% | 51.28% | 49.42% | 50.77%
c16      | 1871  | 58.04% | 54.62% | 57.96% | 53.82% | 55.41%
gwelf    | 1818  | 48.01% | 51.65% | 49.87% | 45.21% | 45.21%
e311     | 1658  | 79.79% | 61.46% | 58.74% | 61.82% | 59.18%
c331     | 1179  | 77.26% | 67.77% | 79.55% | 54.20% | 74.36%
e143     | 1172  | 77.39% | 65.61% | 66.20% | 65.87% | 65.31%
c313     | 1074  | 47.21% | 42.64% | 49.27% | 39.11% | 29.23%
e132     | 922   | 58.79% | 64.97% | 59.01% | 57.92% | 64.22%
gobit    | 831   | 72.20% | 57.16% | 58.18% | 47.65% | 54.11%
gtour    | 657   | 66.33% | 59.06% | 67.06% | 54.49% | 64.51%
e61      | 376   | 69.41% | 66.76% | 58.67% | 64.10% | 64.36%
e141     | 364   | 58.52% | 61.54% | 60.38% | 39.56% | 59.92%
gfas     | 307   | 2.61%  | 0.00%  | 0.00%  | 0.00%  | 0.00%
e142     | 192   | 50.93% | 46.35% | 51.25% | 46.88% | 49.98%
e313     | 108   | 4.63%  | 0.00%  | 2.78%  | 0.00%  | 0.00%
e312     | 52    | 3.85%  | 0.00%  | 0.00%  | 0.00%  | 0.00%
Win-Lose |       | 9-7    | 3-13   | 4-12   | 0-16   | 0-16
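As a rough sketch of why LoRA keeps each expert cheap: a rank-r adapter replaces a full d_out x d_in weight update with two thin matrices, so only r(d_in + d_out) parameters train. The dimensions below follow a BERT-base FFN up-projection, but the numbers are an illustrative parameter count, not the paper's exact configuration.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=64):
    """y = x @ (W + (alpha / r) * B @ A).T  -- W is frozen; only A and B are trained."""
    return x @ (W + (alpha / r) * (B @ A)).T

d_in, d_out, r = 768, 3072, 64          # BERT-base FFN up-projection shape
W = np.zeros((d_out, d_in))             # frozen pretrained weight (zeros for the demo)
A = np.random.randn(r, d_in) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection (zero-initialized)

full = d_out * d_in                     # parameters of a fully fine-tuned matrix
lora = r * (d_in + d_out)               # parameters of the rank-64 adapter
print(full, lora, lora / full)          # the adapter is roughly 10% of the full matrix
```

Because B is zero-initialized, the adapter starts as an identity perturbation: the forward pass initially reproduces the frozen model exactly.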
Compared to HiMatch, the best-performing hierarchy-aware model among those that do not use pretrained language models, UME achieves improvements of 3.83% in Micro-F1 and 8.87% in Macro-F1 on BGC. On one hand, UME leverages the rich semantic information from pretrained models, which enhances its generalization ability. On the other hand, the propagation of uncertainty across multiple experts helps identify challenging samples at the lower levels. Among the baselines, HJCL shows the best average performance; it uses supervised contrastive learning to provide more comprehensive training for classes with larger sample counts. In contrast, UME relies on a few experts that handle all samples and, when encountering challenging minority samples, activates more experts for ensemble classification. As a result, UME achieves an average improvement of 0.36% in Micro-F1 and 0.37% in Macro-F1 across the four datasets. Overall, UME consistently outperforms the other strong baselines in all F1-based evaluations, demonstrating the effectiveness of using multiple low-rank experts in hierarchical text classification (HTC). In Table ??, we repeat the experiments for the proposed method and the baseline HiTIN five times (with different random seeds) on the WOS and RCV1-V2 datasets. The improvements are statistically significant under a paired t-test at the 95% confidence level.

D. Results on Challenging-tailed Categories (RQ2)

We discuss the performance of the models on the challenging-tailed categories. Specifically, we select the categories with fewer than 2,000 samples in RCV1-V2, as shown in Table IV. Compared with HiTIN, HJCL, HGCLR, and HPT, our proposed UME outperforms them in overall win-lose counts, achieving the best performance in 9 of the 16 challenging-tailed categories.
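The significance check above can be reproduced with a standard paired t-test over the per-seed scores; the five scores below are made-up placeholders, not the paper's measurements.

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """t statistic of a paired t-test on two matched score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical Micro-F1 over five seeds for UME vs. HiTIN.
ume   = [87.6, 87.8, 87.7, 87.5, 87.9]
hitin = [87.4, 87.6, 87.5, 87.3, 87.6]
t = paired_t(ume, hitin)
# With n - 1 = 4 degrees of freedom, |t| > 2.776 is significant at the 95% level.
print(t)
```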
In detail, for categories such as "e311", "gtour", and "e61", UME achieves the best performance with markedly higher accuracy than the other models. In the "e311" category, UME reaches an accuracy of 79.79%, clearly surpassing all competitors.

TABLE V: Macro-F1 performance on the Top-N least frequent (tailed) labels across five datasets.

Dataset  | Top-N | HGCLR | HPT   | HiTIN | HJCL  | HILL  | Ours
WOS      | N=8   | 28.47 | 27.78 | 37.83 | 41.67 | 42.08 | 46.60
WOS      | N=16  | 48.67 | 49.04 | 53.16 | 57.67 | 51.78 | 62.66
WOS      | N=32  | 61.49 | 65.36 | 61.43 | 71.62 | 68.44 | 74.70
RCV1-V2  | N=8   | 18.92 | 16.88 | 20.24 | 22.40 | 18.38 | 26.93
RCV1-V2  | N=16  | 41.83 | 41.37 | 47.21 | 43.22 | 46.72 | 50.45
RCV1-V2  | N=32  | 56.67 | 54.77 | 57.98 | 57.20 | 61.46 | 63.69
AAPD     | N=8   | 28.35 | 36.50 | 38.95 | 37.60 | 40.46 | 41.71
AAPD     | N=16  | 45.47 | 44.69 | 51.90 | 49.83 | 49.97 | 53.49
AAPD     | N=32  | 54.76 | 54.82 | 57.69 | 56.59 | 57.63 | 59.11
BGC      | N=8   | 20.12 | 26.45 | 21.88 | 27.96 | 21.30 | 32.40
BGC      | N=16  | 45.83 | 45.33 | 41.68 | 37.89 | 36.27 | 46.78
BGC      | N=32  | 54.29 | 52.90 | 51.16 | 54.05 | 50.58 | 56.16
NYT      | N=8   | 9.38  | 12.83 | 13.52 | 12.82 | 14.31 | 15.36
NYT      | N=16  | 22.43 | 21.21 | 28.70 | 34.39 | 35.31 | 35.78
NYT      | N=32  | 48.36 | 45.21 | 47.39 | 50.53 | 50.23 | 50.75

Fig. 4: Confusion matrices highlighting the contrast between (a) HGCLR and (b) UME (multi-experts) in prediction accuracy.

Moreover, we find that all other baselines fail to correctly classify any samples from the "gfas" and "e312" categories (all baselines score 0 on them), whereas UME still works in these categories. In addition, we test the Macro-F1 of the samples corresponding to the Top-N least frequent tail labels across five datasets, as shown in Table V. Compared with HiTIN, HJCL, HGCLR, HPT, and HILL, we achieve state-of-the-art performance on the tail classes. On the WOS dataset, we achieve average improvements of 11.03%, 10.60%, and 9.03% on the Top-8, Top-16, and Top-32 least frequent tail labels, respectively. This demonstrates that uncertainty-based multi-expert fusion can alleviate the challenging-tailed label distribution in HTC tasks.
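Selecting the Top-N least frequent labels and averaging their per-label F1 (the Table V metric) can be sketched as below; the label names, counts, and F1 values are hypothetical.

```python
from collections import Counter

def tail_macro_f1(per_label_f1: dict, label_counts: Counter, n: int) -> float:
    """Macro-F1 restricted to the n least frequent labels."""
    tail = [label for label, _ in sorted(label_counts.items(), key=lambda kv: kv[1])[:n]]
    return sum(per_label_f1[label] for label in tail) / n

counts = Counter({"sports": 9000, "finance": 4000, "opera": 40, "falconry": 5})
f1 = {"sports": 0.95, "finance": 0.90, "opera": 0.42, "falconry": 0.10}
# Mean F1 of the two rarest labels ("falconry" and "opera": 0.10 and 0.42).
print(tail_macro_f1(f1, counts, n=2))
```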
In Figure 5, we present two real examples from the BGC dataset: one simple sample and one challenging sample. Compared to the baseline, UME yields higher softmax classification probabilities and produces the correct predictions. We further illustrate the reliability of the proposed UME by applying Dempster-Shafer evidence theory, a framework that helps the multiple experts reach accurate uncertainty estimates. We specifically analyze the conflict metrics to understand expert disagreement. For the simple sample, the conflict scores C_2 and C_3 remain low at 0.08 and 0.14, respectively, indicating high consensus among the experts. In contrast, the challenging sample exhibits a slight increase in C_2 to 0.11 and a sharp spike in C_3 to 0.42.

Fig. 5: Two examples illustrating the effectiveness of uncertainty-based multi-expert fusion. Compared with the baselines, our classification results are more reliable.

Fig. 6: The classification error rate distribution across different conflict metric (C_m) intervals. N denotes the number of samples in each bin.

This significant disagreement correctly signals the semantic ambiguity of the text. The fusion mechanism effectively handles this conflict to output confident and reliable classification results. As illustrated by our binning analysis in Figure 6, beyond the initial interval the classification error rate exhibits a strong positive correlation with the conflict metric C_m: the error rate escalates sharply from 4.9% in low-conflict regions to over 50% in high-conflict buckets (C_m > 0.6). The relatively higher error rate in the 0.0-0.2 range (26.8%) reflects a "confidently wrong" phenomenon, where the experts reach a unanimous but incorrect consensus due to shared biases or misleading features.
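Dempster's rule behind these conflict scores can be sketched for two experts over a two-class frame; the mass assignments below are illustrative, and the paper's exact evidence parameterization may differ.

```python
from itertools import product

def combine(m1: dict, m2: dict):
    """Dempster's rule for belief masses over frozenset focal elements.

    Returns the fused masses and the conflict K (total mass on pairs with
    empty intersection), which serves as a disagreement signal between experts.
    """
    fused, conflict = {}, 0.0
    for (x, mx), (y, my) in product(m1.items(), m2.items()):
        z = x & y
        if z:
            fused[z] = fused.get(z, 0.0) + mx * my
        else:
            conflict += mx * my
    return {z: v / (1 - conflict) for z, v in fused.items()}, conflict

A, B = frozenset("A"), frozenset("B")
theta = A | B  # the full frame (total ignorance)
m1 = {A: 0.6, B: 0.2, theta: 0.2}
m2 = {A: 0.5, B: 0.3, theta: 0.2}
fused, K = combine(m1, m2)
print(K)         # ~0.28, the mass assigned to conflicting evidence
print(fused[A])  # ~0.722 after normalization by 1 - K
```

When the two experts agree, K stays near zero and the fused mass sharpens; when they back different classes, K grows, which is exactly the signal the uncertainty-guided fusion exploits.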
This overall performance trend empirically validates C_m as a reliable indicator of prediction risk and justifies our use of Dempster-Shafer Theory. By identifying challenging samples where the expert evidence diverges, which typically occur in fine-grained or long-tailed categories, the UME framework can dynamically resolve conflicts and enhance decision reliability through its uncertainty-guided fusion mechanism.

Fig. 7: Bar charts showing the performance with varying numbers of experts on the four datasets: (a) WOS, (b) RCV1-V2, (c) AAPD, (d) BGC.

TABLE VI: Analysis of expert capacity and utilization on the WOS dataset as the number of experts (M) varies. The metrics include Tail Micro-F1 (computed on the bottom 20% least frequent classes), the average conflict of the final expert (C_last), and the proportion of samples routed to the final expert (defined as weight > 0.5).

Experts (M) | Tail Micro-F1 | Avg. Conflict (C_last) | Last Expert Util. (%)
3           | 76.21         | 0.1724                 | 20.26
4           | 75.32         | 0.1597                 | 1.80
5           | 74.05         | 0.1175                 | 1.28

E. Effects of the Number of Experts (RQ3)

We further observe that increasing the number of experts leads to higher average conflict and fewer samples routed to the later experts. In Figure 7, we present an analysis with different numbers of experts. As the number of experts increases, the Micro-F1 score first rises, peaking at three experts, and then gradually decreases. Furthermore, the observed improvements are consistent across multiple runs, as reflected by the reported standard deviations, suggesting these differences represent meaningful shifts in predictive capability rather than random chance. We attribute the peak at three experts to our Sequential Specialization strategy: since the datasets have hierarchical depths of only 2 to 4 levels, three experts naturally align with the head, medium, and tail categories, effectively saturating the model's capacity to capture valid semantic distinctions.
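The "Last Expert Util." column of Table VI can be computed directly from the per-sample fusion weights; the weight matrix below is a made-up toy, not data from the paper.

```python
import numpy as np

def last_expert_utilization(weights: np.ndarray, threshold: float = 0.5) -> float:
    """Fraction of samples whose final expert receives weight > threshold."""
    return float((weights[:, -1] > threshold).mean())

# Toy fusion weights for 5 samples and 3 experts (each row sums to 1).
w = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7],
              [0.5, 0.3, 0.2],
              [0.2, 0.2, 0.6],
              [0.6, 0.3, 0.1]])
print(last_expert_utilization(w))  # 0.4 -- two of five samples route to expert 3
```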
Consequently, adding experts beyond this point forces the model to overfit to residual noise or ambiguous outliers. These redundant experts introduce high conflict values during the DST fusion process, which slightly degrades global inference quality rather than providing further benefit.

As illustrated in Table VI, the model achieves optimal performance at M = 3. To investigate the performance degradation for M > 3, we analyze the Last Expert Utilization (the proportion of samples on which the final expert's weight exceeds 0.5) and the average conflict metric (C_last), which quantifies the prediction discrepancy between the final expert and its predecessor. At M = 3, the final expert is utilized on 20.26% of the samples, maintaining a significant conflict signal between the second and third experts (C_last = 0.1724). This indicates that the final expert receives sufficient discordant samples to learn effective representations for the tail classes, resulting in the highest Tail Micro-F1 of 76.21. However, as M increases to 4 and 5, the utilization rate of the final expert drops precipitously to 1.80% and 1.28%, respectively. Simultaneously, the conflict between the last two experts diminishes to 0.1175: the preceding experts have already resolved the majority of semantic ambiguities, leaving the final expert with minimal conflict signals and insufficient training data. Consequently, the additional experts fail to capture meaningful patterns and instead introduce parameter noise, leading to a decline in tail performance. Thus, M = 3 represents the optimal trade-off between expert specialization and data sufficiency.

Fig. 8: The participation percentage of each expert on the head, medium, and tail categories of RCV1-V2.

F.
Experts Participation Percentages (RQ4)

To analyze the involvement of each expert in the proposed model, we visualize the percentage of the dynamic weights {w^m} assigned to the different experts in Equation (8). These weights are used to predict samples from the head, medium, and tail classes of the hierarchy in the RCV1-V2 dataset, where the head, medium, and tail samples correspond to the 1st-, 2nd-, and 3rd-level categories, respectively. Figure 8 shows that the initial experts exhibit relatively balanced participation across the classes. However, as the number of experts increases, the later experts focus increasingly on the challenging-tailed samples at the bottom level. This observation aligns with our design: uncertainty is propagated from the preceding experts to the later experts, directing more attention to the challenging-tailed samples.

Fig. 9: Sensitivity analysis on WOS and RCV1-V2: (a) impact of the temperature factor η; (b) impact of the conflict metric threshold ϵ.

G. Component Analysis (RQ5)

We first analyze the contribution of each component. Specifically, we compare the full UME model against three variants: i) without L_evidence; ii) without L_evidence & L_ml; and iii) without L_evidence & L_ml & \hat{L}_C & L_cl. Ablation results are summarized in Table VII. As shown, each component plays a distinct role. The \hat{L}_C & L_cl component is the most influential because it integrates the hierarchical structure information. The L_ml component also significantly bolsters the model: it leverages prior evidence to delineate the objectives through the marginal likelihood. Conversely, L_evidence yields a modest impact; it improves performance by diminishing the evidence of incorrect labels and instilling greater uncertainty in erroneous classifications. Crucially, we verify whether the gains stem from our specific fusion design or merely from ensemble capacity.
To this end, we compare UME against two alternative fusion strategies. The first is Direct Integration, which simply averages the expert outputs. The second is Gated MoE, which employs a learnable gating network for expert routing. As shown in the bottom section of Table VII, both methods consistently underperform UME across all datasets; for example, Gated MoE achieves 86.74 on WOS versus 87.71 for UME. This decline indicates that standard routing mechanisms and simple averaging fail to coordinate sequentially trained experts effectively: they struggle to account for the specific uncertainty patterns of the residual experts that handle the hard tail samples. In contrast, UME's DST-guided fusion dynamically prioritizes experts based on confidence, indicating that the uncertainty-aware mechanism, not merely the number of experts, is the key driver of performance.

TABLE VII: Ablation study results (Micro-F1).

Ablation Study                              | WOS   | RCV1-V2 | AAPD  | BGC
w/o L_evidence                              | 87.52 | 86.61   | 78.86 | 79.66
w/o (L_evidence & L_ml)                     | 86.98 | 86.37   | 78.15 | 79.27
w/o (L_evidence & L_ml & \hat{L}_C & L_cl)  | 85.91 | 85.72   | 77.28 | 78.51
r.p. Direct Integration                     | 87.18 | 85.73   | 78.53 | 79.51
r.p. Gated MoE                              | 86.74 | 86.12   | 77.91 | 79.08
UME                                         | 87.71 | 86.88   | 79.21 | 80.46

To evaluate the robustness of UME, we conduct a comprehensive sensitivity analysis on the temperature factor η and the conflict metric threshold ϵ, which regulate the trade-off between early confident decisions and deeper expert involvement. As illustrated in Figure 9a, the temperature factor η modulates the sharpness of the fusion weights, achieving peak performance at η = 0.9 with Micro-F1 scores of 87.69 on WOS and 86.90 on RCV1-V2, while remaining highly stable across the entire interval. Similarly, the conflict threshold ϵ, analyzed in Figure 9b, determines when accumulated uncertainty triggers the participation of subsequent experts through our Sequential Specialization mechanism. Optimal results for this parameter are observed at ϵ = 0.5, yielding peak scores of 87.72 on WOS and 86.92 on RCV1-V2, with minimal performance fluctuations across the tested ranges. These findings empirically validate that the conflict metric C_m serves as a stable indicator for expert coordination, ensuring that UME achieves state-of-the-art results without exhaustive fine-tuning.

Fig. 10: Comparison of model parameters, training time, and Macro-F1 scores on the WOS dataset.

H. Computational Costs (RQ6)

In this section, we report the number of parameters, Macro-F1 scores, and classification accuracies of UME compared to T5 [42], HPT [44], HiTIN [46], and HiAdv [48] on the RCV1-V2 dataset. As shown in Figure 10, we report the number of trainable parameters and the training time for the model with three experts, the configuration that achieves the best and most stable performance. UME contains fewer parameters than the other models yet achieves the best performance. This is mainly because we modify only the FFN layers of the BERT encoder to build the multi-expert architecture; furthermore, we use LoRA to freeze most of the expert parameters and fine-tune the multi-expert model with low-rank matrices. Compared to T5, UME reduces parameters by 39.49% while improving the Macro-F1 score by 3.20%. Parameter explosion in multi-expert fusion networks typically prevents the direct fusion of expert networks. To address this, we employ low-rank experts to improve efficiency, increasing parameters by only 1.18% for each additional expert while achieving competitive performance. In total training time, UME is faster than all models except HiTIN: although our training time increases by 2.78%, we reduce parameters by 10.37% while achieving higher performance.
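The temperature factor's effect on fusion-weight sharpness can be illustrated with a simple softmax over per-expert certainty scores; this is a schematic of the general mechanism, not the paper's exact weighting formula, and the certainty values are hypothetical.

```python
import math

def fusion_weights(certainties, eta: float):
    """Softmax over expert certainty scores with temperature eta.

    Smaller eta sharpens the distribution (the most confident expert
    dominates); larger eta flattens it toward uniform averaging.
    """
    exps = [math.exp(c / eta) for c in certainties]
    total = sum(exps)
    return [e / total for e in exps]

certainty = [0.9, 0.5, 0.2]                   # hypothetical per-expert certainty
sharp = fusion_weights(certainty, eta=0.1)    # near one-hot on the first expert
smooth = fusion_weights(certainty, eta=10.0)  # close to uniform averaging
print(sharp[0] > 0.9, max(smooth) - min(smooth) < 0.05)
```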
Additionally, we measure inference speed and find no significant differences between the models.

TABLE VIII: Performance comparison between LLMs and our UME framework on the WOS and RCV1-V2 datasets. The LLMs are evaluated using the zero-shot prompt template shown in Table IX.

Model      | WOS Micro-F1 | WOS Macro-F1 | RCV1-V2 Micro-F1 | RCV1-V2 Macro-F1
GPT-4      | 59.74        | 37.80        | 56.12            | 36.54
Llama-3    | 61.91        | 27.33        | 54.67            | 28.50
UME (Ours) | 87.71        | 82.29        | 86.88            | 69.70

TABLE IX: Prompt template for hierarchical text classification.

Prompt Template:
You are an expert hierarchical text classifier. Your task is to classify the given text into the correct category path from the predefined taxonomy.
[Taxonomy Definition]
The categories are defined in a hierarchy formatted as: Level 1 > Level 2 > Level 3
- The symbol ">" represents the parent-child relationship.
- You must output the full path from the root to the specific leaf category.
[Allowed Category Paths]
(Insert your full list of category paths here, e.g., Technology > AI > NLP)
[Constraints]
1. Output ONLY the category path. No explanation, no preamble.
2. If the text fits multiple paths, output the most specific and relevant one.
3. Use the exact string format provided in the [Allowed Category Paths].
[Input Text]
{{ text content }}
[Prediction]

I. Comparison with LLMs (RQ7)

We further investigate the performance of Large Language Models (LLMs) on the HTC task to evaluate their zero-shot reasoning capabilities against our supervised approach. Table VIII presents the results of GPT-4 and Llama-3, evaluated using the standardized prompt template detailed in Table IX. As shown in Table VIII, GPT-4 achieves a Micro-F1 of 59.74 and a Macro-F1 of 37.80 on the WOS dataset, with similar results on RCV1-V2, where it records 56.12 Micro-F1 and 36.54 Macro-F1. Llama-3 performs comparably, recording 61.91 Micro-F1 and 27.33 Macro-F1 on WOS, and 54.67 Micro-F1 and 28.50 Macro-F1 on RCV1-V2.
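A light post-processing step can check whether an LLM's raw output respects the constraints in Table IX, i.e., that it exactly matches one of the allowed root-to-leaf paths; the taxonomy below is a hypothetical example, not one of the benchmark label sets.

```python
def validate_prediction(raw_output: str, allowed_paths: set) -> "str | None":
    """Return the predicted path if it is a legal taxonomy path, else None.

    Mirrors the prompt's constraints: the output must be the exact string of
    an allowed root-to-leaf path, with no preamble or explanation.
    """
    candidate = raw_output.strip()
    return candidate if candidate in allowed_paths else None

allowed = {
    "Technology > AI > NLP",
    "Technology > AI > Vision",
    "Science > Physics",
}
print(validate_prediction("Technology > AI > NLP\n", allowed))  # a valid full path
print(validate_prediction("NLP", allowed))                      # None: not a full path
```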
These results indicate that instruction-tuned LLMs still significantly underperform supervised fine-tuned models such as UME, which achieves a Micro-F1 exceeding 87 on WOS. The primary limitation is that the general reasoning capabilities of LLMs struggle to fully comprehend and align with the complex, structured dependencies of hierarchical label trees; they often fail to strictly adhere to the parent-child constraints required in HTC. Consequently, our proposed UME framework, which explicitly models hierarchy and uncertainty, proves both necessary and highly effective for achieving state-of-the-art precision.

V. CONCLUSION

This work proposes the Uncertainty-based Multi-Expert fusion network (UME) for challenging-tailed sequence learning. By integrating Ensemble LoRA, Sequential Specialization, and Uncertainty-Guided Fusion under Dempster-Shafer Theory, UME improves tail-class performance while maintaining a compact parameter footprint. Although sequential specialization introduces a modest increase in training time due to reduced parallelism, this overhead represents a deliberate efficiency-performance trade-off that enables the experts to focus on progressively harder samples. Experimental results demonstrate that this design yields substantial gains on minority classes with fewer trainable parameters.

In future work, we plan to further optimize training efficiency by exploring dynamic expert selection strategies that selectively activate experts based on uncertainty. Such mechanisms may alleviate sequential latency while preserving the effectiveness of uncertainty-guided specialization in resource-constrained scenarios.

REFERENCES

[1] Y. Zhang, B. Kang, B. Hooi, S. Yan, and J. Feng, "Deep long-tailed learning: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10795–10816, 2023.
[2] O. Rennie, "Navigating the uncommon: challenges in applying evidence-based medicine to rare diseases and the prospects of artificial intelligence solutions," Medicine, Health Care and Philosophy, vol. 27, no. 3, pp. 269–284, 2024.
[3] M. U. Hassan, M. H. Rehmani, and J. Chen, "Anomaly detection in blockchain networks: A comprehensive survey," IEEE Communications Surveys & Tutorials, vol. 25, no. 1, pp. 289–318, 2022.
[4] H. Chen, Q. Ma, Z. Lin, and J. Yan, "Hierarchy-aware label semantics matching network for hierarchical text classification," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021, pp. 4370–4379.
[5] E. Yu, W. Han, Y. Tian, and Y. Chang, "ToHRE: A top-down classification strategy with hierarchical bag representation for distantly supervised relation extraction," in Proceedings of COLING, 2020, pp. 1665–1676.
[6] T.-Y. Wu, P. Morgado, P. Wang, C.-H. Ho, and N. Vasconcelos, "Solving long-tailed recognition with deep realistic taxonomic classifier," in Computer Vision - ECCV 2020. Springer, 2020, pp. 171–189.
[7] L. Xiao, X. Zhang, L. Jing, C. Huang, and M. Song, "Does head label help for long-tailed multi-label text classification," in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 14103–14111.
[8] Y. Cao, J. Kuang, M. Gao, A. Zhou, Y. Wen, and T.-S. Chua, "Learning relation prototype from unlabeled texts for long-tail relation extraction," IEEE Transactions on Knowledge & Data Engineering, vol. 35, no. 02, pp. 1761–1774, 2023.
[9] Y. Liu, K. Zhang, Z. Huang, K. Wang, Y. Zhang, Q. Liu, and E. Chen, "Enhancing hierarchical text classification through knowledge graph integration," in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 5797–5810.
[10] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, and I. S.
Dhillon, "Taming pretrained transformers for extreme multi-label text classification," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3163–3171.
[11] O. Yaghubi Agreh and A. Ghaffari-Hadigheh, "Application of Dempster-Shafer theory in combining the experts' opinions in DEA," Journal of the Operational Research Society, vol. 70, no. 6, pp. 915–925, 2019.
[12] K. Zhao, L. Li, Z. Chen, R. Sun, G. Yuan, and J. Li, "A survey: Optimization and applications of evidence fusion algorithm based on Dempster-Shafer theory," Applied Soft Computing, vol. 124, p. 109075, 2022.
[13] C. N. Silla and A. A. Freitas, "A survey of hierarchical classification across different application domains," Data Mining and Knowledge Discovery, vol. 22, pp. 31–72, 2011.
[14] S. Banerjee, C. Akkaya, F. Perez-Sorrosal, and K. Tsioutsiouliklis, "Hierarchical transfer learning for multi-label text classification," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 6295–6300.
[15] J. Zhou, C. Ma, D. Long, G. Xu, N. Ding, H. Zhang, P. Xie, and G. Liu, "Hierarchy-aware global model for hierarchical text classification," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1106–1117.
[16] R. Plaud, M. Labeau, A. Saillenfest, and T. Bonald, "Revisiting hierarchical text classification: Inference and metrics," in Proceedings of the 28th Conference on Computational Natural Language Learning, 2024, pp. 231–242.
[17] J. Zhang, Y. Li, F. Shen, C. Xia, H. Tan, and Y. He, "Hierarchy-aware and label balanced model for hierarchical text classification," Knowledge-Based Systems, p. 112153, 2024.
[18] T. Liu, Y. Hu, J. Gao, Y. Sun, and B.
Yin, "Hierarchical multi-granularity interaction graph convolutional network for long document classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1762–1775, 2024.
[19] H. Zhu, J. Wu, R. Liu, Y. Hou, Z. Yuan, S. Li, Y. Pan, and K. Xu, "HILL: Hierarchy-aware information lossless contrastive learning for hierarchical text classification," in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 4731–4745.
[20] P. Xu, L. Xiao, B. Liu, S. Lu, L. Jing, and J. Yu, "Label-specific feature augmentation for long-tailed multi-label text classification," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, 2023, pp. 10602–10610.
[21] J. Yan, P. Li, H. Chen, J. Zheng, and Q. Ma, "Does the order matter? A random generative way to learn label hierarchy for hierarchical text classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 276–285, 2023.
[22] S. Mukherjee and A. Awadallah, "Uncertainty-aware self-training for few-shot text classification," Advances in Neural Information Processing Systems, vol. 33, pp. 21199–21212, 2020.
[23] B. Li, Z. Han, H. Li, H. Fu, and C. Zhang, "Trustworthy long-tailed classification," pp. 6970–6979, 2022.
[24] Y. Hu and L. Khan, "Uncertainty-aware reliable text classification," in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 628–636.
[25] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in International Conference on Machine Learning. PMLR, 2016, pp. 1050–1059.
[26] J. T. Pintas, L. A. Fernandes, and A. C. B. Garcia, "Feature selection methods for text classification: a systematic literature review," Artificial Intelligence Review, vol. 54, no. 8, pp.
6149–6200, 2021.
[27] H. Zhao, S. Guo, and Y. Lin, "Hierarchical classification of data with long-tailed distributions via global and local granulation," Information Sciences, vol. 581, pp. 536–552, 2021.
[28] A. Gutiérrez-López, F.-J. González-Serrano, and A. R. Figueiras-Vidal, "Optimum Bayesian thresholds for rebalanced classification problems using class-switching ensembles," Pattern Recognition, vol. 135, p. 109158, 2023.
[29] Q. Wang, L. Ding, Y. Cao, Y. Zhan, Z. Lin, S. Wang, D. Tao, and L. Guo, "Divide, conquer, and combine: Mixture of semantic-independent experts for zero-shot dialogue state tracking," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 2048–2061.
[30] A. Jsang, Subjective Logic: A Formalism for Reasoning Under Uncertainty. Springer Publishing Company, Incorporated, 2018.
[31] J. D. M.-W. C. Kenton and L. K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of NAACL-HLT, vol. 1, 2019, p. 2.
[32] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu, "Do transformers really perform badly for graph representation?" Advances in Neural Information Processing Systems, vol. 34, pp. 28877–28888, 2021.
[33] Z. Wang, P. Wang, L. Huang, X. Sun, and H. Wang, "Incorporating hierarchy into text encoder: a contrastive learning approach for hierarchical text classification," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2022, pp. 7109–7119.
[34] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," in International Conference on Learning Representations, 2017.
[35] M. Sensoy, L. Kaplan, and M.
Kandemir , “Evidential deep learning to quantify classification uncertainty , ” Advances in neural information pr ocessing systems , vol. 31, 2018. [36] S. K otz, N. Balakrishnan, and N. L. Johnson, Continuous multivariate distributions, V olume 1: Models and applications . John Wiley & Sons, 2004, v ol. 1. [37] T . Chen, S. Kornblith, M. Norouzi, and G. Hinton, “ A simple framework for contrasti ve learning of visual representations, ” in International confer ence on machine learning . PMLR, 2020, pp. 1597–1607. [38] K. Ko wsari, D. E. Bro wn, M. Heidarysafa, K. J. Meimandi, M. S. Gerber , and L. E. Barnes, “Hdltex: Hierarchical deep learning for text classification, ” in ICMLA . IEEE, 2017, pp. 364–371. [39] D. D. Lewis, Y . Y ang, T . Russell-Rose, and F . Li, “Rcv1: A new bench- mark collection for text categorization research, ” Journal of machine learning r esearch , vol. 5, no. Apr, pp. 361–397, 2004. [40] P . Y ang, X. Sun, W . Li, S. Ma, W . Wu, and H. W ang, “Sgm: Sequence generation model for multi-label classification, ” in Pr oceedings of the 27th International Conference on Computational Linguistics , 2018, pp. 3915–3926. [41] Z. Deng, H. Peng, D. He, J. Li, and S. Y . Philip, “Htcinfomax: A global model for hierarchical text classification via information maximization, ” in Pr oceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics , 2021, pp. 3259–3265. [42] C. Raffel, N. Shazeer , A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W . Li, and P . J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer , ” Journal of machine learning r esear ch , vol. 21, no. 140, pp. 1–67, 2020. [43] Y . Liu, “Roberta: A robustly optimized bert pretraining approach, ” arXiv pr eprint arXiv:1907.11692 , 2019. [44] Z. W ang, P . W ang, T . Liu, B. Lin, Y . Cao, Z. Sui, and H. 
W ang, “Hpt: Hierarchy-aware prompt tuning for hierarchical text classification, ” in Pr oceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , 2022, pp. 3740–3751. [45] S. C. L. Y u, J. He, V . G. Basulto, and J. Z. Pan, “Instances and labels: Hierarchy-aware joint supervised contrastive learning for hierarchical multi-label text classification, ” in The 2023 Conference on Empirical Methods in Natural Language Processing , 2023. [46] H. Zhu, C. Zhang, J. Huang, J. W u, and K. Xu, “Hitin: Hierarchy- aware tree isomorphism network for hierarchical text classification, ” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , 2023, p. 7809–7821. [47] B. Guan, X. Zhu, and S. Y uan, “ A t5-based interpretable reading com- prehension model with more accurate evidence training, ” Information Pr ocessing & Management , v ol. 61, no. 2, p. 103584, 2024. [48] Z. W ang, P . W ang, and H. W ang, “Utilizing local hierarchy with adversarial training for hierarchical text classification, ” in Pr oceedings of the 2024 J oint International Confer ence on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , 2024, pp. 17 326–17 336. [49] V . Jain, M. Rungta, Y . Zhuang, Y . Y u, Z. W ang, M. Gao, J. Skolnick, and C. Zhang, “Higen: Hierarchy-aw are sequence generation for hier- archical text classification, ” in Pr oceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (V olume 1: Long P apers) , 2024, pp. 1354–1368. Y e W ang Y e W ang recei ved the B.S. de gree in Microelectronics from Chongqing University of Posts and T elecommunications, Chongqing, China in 2011, the M.S. degree in Electrical Engineering from the University of T exas at Dallas, Richardson, TX, USA in 2014, and the Ph.D. de gree in Computer Engineering from T exas A&M University , College Station, TX, USA in 2019. 
He joined the School of Artificial Intelligence at Chongqing University of Posts and Telecommunications in 2020. His current research is mainly focused on natural language processing and computer vision.

Zixuan Wu
Zixuan Wu received the B.S. degree in Information and Computing Science from Chongqing University of Posts and Telecommunications, Chongqing, China, in 2022, and is pursuing the M.S. degree in Computer Science and Technology at the same university, with an expected graduation in 2025. His research is mainly focused on natural language processing and autonomous driving.

Lifeng Shen
Lifeng Shen (Member, IEEE) received the Ph.D. degree in Artificial Intelligence from the Hong Kong University of Science and Technology in 2024. He is an associate professor with the Department of Computer Science and Engineering, Chongqing University of Posts and Telecommunications (CQUPT). His current research interests include machine learning, deep learning, granular computing, and time-series modeling and their applications.

Jiang Xie
Jiang Xie received the M.S. and Ph.D. degrees in computer science from Chongqing University in 2015 and 2019, respectively. He is currently a lecturer with the College of Computer Science and Technology at Chongqing University of Posts and Telecommunications. His research interests include clustering analysis and data mining.

Xiaoling Wang
Xiaoling Wang (Member, IEEE) received the B.E., M.S., and Ph.D. degrees from Southeast University in 1997, 2000, and 2003, respectively. She is currently a Professor with the School of Computer Science at East China Normal University. Her research interests mainly include graph data processing and intelligent data analysis.
Hong Yu
Hong Yu (Member, IEEE) received the B.E. degree in physics from Nanchang Hangkong University, Nanchang, China, in 1994, the M.S. degree in signal and information processing from the Chongqing University of Posts and Telecommunications, Chongqing, China, in 1997, and the Ph.D. degree in computer software and theory from Chongqing University, Chongqing, China, in 2003. She worked with the University of Regina, Canada, as a visiting scholar during 2007–2008. She is currently a full professor with the Chongqing University of Posts and Telecommunications, Chongqing, China. She received a Chongqing Natural Science and Technology Award. She has published several books and many peer-reviewed research articles. Her paper was selected as one of the Top Articles in Outstanding S&T Journal of China. Her research interests include rough sets, industrial big data, data mining, knowledge discovery, granular computing, three-way clustering, three-way decisions, and intelligent recommendation.

Guoyin Wang
Guoyin Wang (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 1992, 1994, and 1996, respectively. He worked with the University of North Texas and the University of Regina, Canada, as a visiting scholar during 1998–1999. He worked with the Chongqing University of Posts and Telecommunications during 1996–2024, where he was a professor, the vice-president of the university, the director of the Chongqing Key Laboratory of Computational Intelligence, and the director of the Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education. He has been serving as the president of Chongqing Normal University since June 2024. He is the author of more than 10 books, the editor of dozens of proceedings of international and national conferences, and has more than 300 reviewed research publications. His research interests include rough sets, granular computing, machine learning, knowledge technology, data mining, neural networks, cognitive computing, etc. He was the president of the International Rough Set Society (IRSS) during 2014–2017, and a council member of the China Computer Federation (CCF) during 2008–2023. He is a vice-president of the Chinese Association for Artificial Intelligence (CAAI). He is a fellow of IRSS, CAAI, and CCF.