Path-Constrained Mixture-of-Experts
Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram†, Navdeep Jaitly†
Apple, †Google (work done at Apple)

Abstract. Sparse Mixture-of-Experts (MoE) architectures enable efficient scaling by activating only a subset of parameters for each input. However, conventional MoE routing selects each layer's experts independently, creating N^L possible expert paths for N experts across L layers. This far exceeds typical training set sizes, leading to statistical inefficiency: the model may not learn meaningful structure over such a vast path space. To constrain it, we propose PathMoE, which shares router parameters across consecutive layers. Experiments on 0.9B and 16B parameter models demonstrate consistent improvements in perplexity and on downstream tasks over independent routing, while eliminating the need for auxiliary load balancing losses. Analysis reveals that tokens following the same path naturally cluster by linguistic function, with PathMoE producing more concentrated groups, better cross-layer consistency, and greater robustness to routing perturbations. These results offer a new perspective for understanding MoE architectures through the lens of expert paths.

Correspondence: Zijin Gu (zijin@apple.com)
Date: March 20, 2026

1 Introduction

Scaling drives progress in deep learning, but dense models incur computational costs that grow linearly with parameter count. Mixture-of-Experts (MoE) architectures address this by activating only a subset of parameters for each input, decoupling model capacity from computational cost and enabling models with trillions of parameters within practical budgets (Fedus et al., 2022; Jiang et al., 2024).

Figure 1: Statistical inefficiency of independent routing. (a) Independent routing: N^L ≈ 10^29 possible paths versus a training set of ~10^11 tokens, so most paths remain unexplored. (b) Path-constrained routing: a constrained path space that the training data covers almost entirely.
MoE architectures scale neural networks efficiently by routing each token through a subset of experts at each layer. A token's journey through an MoE network can be viewed as an expert path, a sequence of expert selections (e_1, e_2, ..., e_L) across L layers (the top-1 expert is used). This path perspective reveals a fundamental challenge for conventional MoE routing with independent expert selection at each layer: with N experts per layer, there are N^L possible paths to learn; see Figure 2(a).¹

¹ Concretely, a model with 24 layers and 16 experts per layer has ≈10^29 possible paths, orders of magnitude more than the number of tokens reported in typical large model training runs: e.g., a compute-optimal 7B parameter model is trained on ~140B (~10^11) tokens (Hoffmann et al., 2022). Thus the vast majority of paths receive no learning signal during training.

Figure 2: Spectrum of routing constraints in MoE architectures. (a) Independent routing (no constraints): each layer has its own router r_i, creating N^L possible paths for N experts and L layers. (b) Block-wise parameter-shared routing, dubbed PathMoE: layers within a block share router parameters, keeping paths coherent within blocks. (c) Fully decision-shared routing (extreme constraint): all layers share one router and its decision, leaving only N possible paths.

This combinatorial explosion suggests that conventional MoE routing may be statistically inefficient (Figure 1). Yet these models train well in practice: we find that tokens following the same path naturally cluster by linguistic function, suggesting that emergent structure mitigates the combinatorial challenge.
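The arithmetic behind footnote 1 is easy to reproduce. A quick sketch (values taken from the footnote; the comparison is ours):

```python
# Path-space arithmetic from footnote 1: independent routing admits N^L paths.
N, L = 16, 24                 # experts per layer, MoE layers (the paper's 0.9B config)
paths = N ** L                # 16^24 = 2^96, roughly 7.9e28 (~10^29) possible paths
tokens = 140 * 10**9          # ~140B-token compute-optimal budget for a 7B model
print(f"paths ~ {paths:.1e}, tokens per path ~ {tokens / paths:.1e}")
```

Even if every token followed a distinct path, fewer than one path in 10^17 could receive any training signal, which is the statistical-inefficiency argument of Figure 1.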
To further improve sample efficiency and optimization, we argue for constraining the expert path space through architectural design. A naive approach is to share routing decisions across all layers (Figure 2(c)), forcing each token to use the same expert throughout the network. This collapses the path space to N paths (linear in the number of experts) but may hurt performance on complex tasks. A more flexible approach is to share router parameters across all layers, which does not force identical decisions but encourages consistency through a shared routing function. Recent work has shown this to be effective for speech recognition (Gu et al., 2025), where acoustic features follow structured path patterns. As text tokens carry more diverse syntactic and semantic patterns than speech frames, different tokens may require fundamentally different processing at different network depths.

We thus propose a middle ground: sharing router parameters across blocks of consecutive layers, dubbed PathMoE; see Figure 2(b). This creates an inductive bias where the same routing function is applied to gradually evolving token representations within each block. Since nearby layers process similar representations (thanks to residual connections), they naturally receive similar (though not identical) routing, encouraging path coherence without enforcing it, while allowing different blocks to adapt to the changing nature of representations.

We validate PathMoE on language modeling and find that it yields key benefits over conventional routing:

1. Consistent performance gains. Experiments across 0.9B and 16B parameter models show both improved accuracy on downstream tasks and lower perplexity.
2. No auxiliary load balancing loss. PathMoE maintains balanced expert utilization without auxiliary losses, eliminating an extra hyperparameter.
Through analysis we further confirm (i) improved cross-layer coordination: PathMoE achieves 31% higher routing consistency than conventional routing; and, as a result, (ii) better specialization with greater robustness: PathMoE achieves 11% lower routing entropy while being 22.5× more robust to routing perturbations. Beyond performance, our expert path view of MoE reveals that paths exhibit interpretable structure: tokens following the same path naturally cluster by linguistic function (punctuation, named entities, temporal expressions), with PathMoE producing more concentrated clusters than conventional routing.

2 Related Work

MoE Architectures. The concept of MoE dates back to Jacobs et al. (1991), but has seen renewed interest with the scaling of deep learning models. Shazeer et al. (2017) introduced sparsely-gated MoE layers for neural machine translation, demonstrating that conditional computation could dramatically increase model capacity with modest computational overhead. This work has been extended to large language models with architectures such as Switch Transformers (Fedus et al., 2022), GLaM (Du et al., 2022), and, recently, Mixtral (Jiang et al., 2024) and DeepSeek-MoE (Dai et al., 2024). While these architectures have demonstrated impressive scaling properties, they all employ independent routing at each layer, treating expert selection as a per-layer decision without considering cross-layer structure.

Routing Mechanisms in MoE. The routing function is critical to MoE performance, determining how inputs are distributed to experts. Classical approaches employ top-k routing with load balancing constraints (Shazeer et al., 2017; Lepikhin et al., 2020). Recent work has explored various improvements: Raposo et al. (2024) proposed mixture-of-depths, which dynamically adjusts computation depth; Dai et al.
(2022) introduced deterministic routing for training stability; Zhou et al. (2022) proposed expert-choice routing, where experts select tokens; and Puigcerver et al. (2023) investigated soft mixtures that combine all experts with learned weights. These approaches improve individual routing decisions but do not address cross-layer coordination, which we show is critical for expert path efficiency.

Expert Specialization and Redundancy. A growing body of work has investigated what experts learn and whether they develop meaningful specializations. Chi et al. (2022) showed that sparse MoE can suffer from representation collapse, where experts learn redundant functions. Yang et al. (2024) proposed compression techniques exploiting inter-expert redundancy. Zadouri et al. (2023) proposed methods to encourage expert diversity through regularization. Our PathMoE addresses these concerns architecturally: by encouraging tokens to follow consistent expert paths, experts naturally specialize for different token types without explicit regularization.

Cross-Layer Routing Coordination. While most MoE architectures employ independent routers per layer, recent work has begun exploring cross-layer coordination. Qiu et al. (2024) proposed a recurrent router that conditions each layer's routing on previous decisions. Gu et al. (2025) demonstrated that sharing router parameters across all layers improves speech recognition, where acoustic features follow structured patterns. Orthogonally, Dai et al. (2022) identified routing fluctuation across training steps and proposed distilling a stable router, addressing temporal rather than spatial coordination. Our PathMoE offers a middle ground between independent and fully shared routing: by sharing router parameters within blocks of consecutive layers, we encourage path consistency while allowing adaptation as representations evolve through the network.
3 Methodology

3.1 Preliminaries

MoE Routing. In a transformer model with MoE layers, each MoE layer contains N expert networks {F_1, ..., F_N} and a router r. Given an input token representation x for an MoE layer, the router computes expert probabilities p:

p = softmax[r(x)] = softmax[Wx], p ∈ R^N. (3.1)

Then the output y is a weighted sum over the top-k experts:

y = Σ_{i ∈ TopK(p)} p_i · F_i(x). (3.2)

Let us consider an MoE network with L layers and N experts per layer. For the purpose of analysis we also consider top-1 expert selection instead of top-k.

Auxiliary Load Balancing Loss. To prevent expert collapse, where only a few experts receive most tokens, MoE training typically adds an auxiliary loss (Fedus et al., 2022):

L_aux = α · N · Σ_{i=1}^N f_i · P_i, (3.3)

where f_i is the fraction of tokens routed to expert i, P_i is the average routing probability for expert i, and α is a weighting coefficient. This loss encourages uniform expert utilization but introduces a hyperparameter that needs tuning.

Independent Routing. In MoE with independent routing, the router r_l at layer l has its own learnable parameters W_l. We call this independent routing.

Expert Path. Let [N] = {1, ..., N}, let data follow the distribution x ∼ D, and let E_l be a random variable for the expert selection at layer l. Denote the expert selections across L layers as E = (E_1, ..., E_L), called an expert path, which is a random variable taking values in [N]^L. Each E_l follows a categorical distribution over N experts: E_l | x_l ∼ Categorical[softmax(W_l x_l)]. Let e = (e_1, ..., e_L) be a specific realization of E. To quantify the effective size of the expert path space utilized by the MoE network, we consider the entropy of the marginal distribution of expert paths over the data distribution D.

Routing Entropy.
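Eqs. (3.1)–(3.3) can be sketched in PyTorch as follows. This is our own minimal illustration, not the authors' implementation: the two-layer expert MLPs, the loop-based dispatch, and all names are ours, and with top-k routing we compute f_i over token-slots (each token counts once per selected expert).

```python
import torch
import torch.nn.functional as F

class MoELayer(torch.nn.Module):
    """Minimal sparse MoE layer with top-k routing, per Eqs. (3.1)-(3.3)."""
    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = torch.nn.Linear(d_model, n_experts, bias=False)  # W in Eq. (3.1)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, 4 * d_model),
                torch.nn.GELU(),
                torch.nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        p = F.softmax(self.router(x), dim=-1)  # Eq. (3.1): expert probabilities
        topv, topi = p.topk(self.k, dim=-1)    # top-k experts per token
        y = torch.zeros_like(x)
        for slot in range(self.k):             # Eq. (3.2): weighted sum over top-k
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    y[mask] += topv[mask, slot, None] * expert(x[mask])
        # Eq. (3.3): auxiliary load-balancing loss (the weight alpha is applied
        # by the caller); f counts the fraction of token-slots sent to each expert.
        f = torch.zeros(p.shape[-1]).index_add_(
            0, topi.flatten(), torch.ones(topi.numel()))
        f /= topi.numel()
        aux = p.shape[-1] * (f * p.mean(0)).sum()
        return y, aux

moe = MoELayer(d_model=64, n_experts=16, k=4)
y, aux = moe(torch.randn(8, 64))
```

Production implementations batch the expert computation instead of looping, but the routing math is the same.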
Let P_π(e) denote the marginal probability of observing a particular expert path e over the data distribution D, defined as

P_π(e) = E_{x∼D}[P(e | x)] = E_{x∼D}[P(E_1 = e_1, ..., E_L = e_L | x)].

The routing entropy is then defined as the Shannon entropy of this marginal distribution of expert paths over the data distribution D:

H(E) = H(E_1, ..., E_L) = −Σ_{e ∈ [N]^L} P_π(e) log P_π(e). (3.4)

A higher H(E) indicates that the model utilizes a more diverse set of expert combinations across the data. Computationally, the marginal distribution is intractable, thus in practice we compute an empirical routing entropy on finite data.

Expert Path for Independent Routing. For independent routing, each E_l follows a categorical distribution over N experts: E_l | x_l ∼ Categorical[softmax(W_l x_l)], where all W_l are independent matrices, as every layer has its own learnable parameters.

3.2 PathMoE: Block-wise Parameter-Shared Routing

PathMoE Routing. Compared to independent routing, PathMoE shares router parameters across blocks of consecutive layers. Given L MoE layers and block size B, we partition the layers into ⌈L/B⌉ blocks. All layers within a block share a single router:

r_l = r^(⌈l/B⌉)_shared, ∀ l ∈ {1, ..., L}. (3.5)

Expert Path for PathMoE. For PathMoE routing, each E_l still follows a categorical distribution over N experts. However, for any layer l′ in the same block b = ⌈l′/B⌉ we have E_{l′} | x_{l′} ∼ Categorical[softmax(W_b x_{l′})], where the W_b from different blocks are independent matrices.

3.3 Intuition behind PathMoE

Below, we provide an informal approximation of routing entropy to illustrate the intuition behind why PathMoE routing may reduce the effective size of the expert path space compared to independent routing.
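The block-wise sharing of Eq. (3.5) amounts to mapping layer l to router ⌈l/B⌉. A minimal sketch (function and variable names are ours, not the paper's):

```python
import math
import torch

def make_pathmoe_routers(n_layers: int, block_size: int, d_model: int, n_experts: int):
    """Eq. (3.5): layer l uses shared router ceil(l / B), so every block of B
    consecutive layers shares one set of router parameters."""
    n_blocks = math.ceil(n_layers / block_size)
    shared = torch.nn.ModuleList(
        torch.nn.Linear(d_model, n_experts, bias=False) for _ in range(n_blocks))
    # routers[l-1] is the router applied at layer l (1-indexed as in the paper);
    # (l - 1) // B is the 0-indexed equivalent of ceil(l / B).
    routers = [shared[(l - 1) // block_size] for l in range(1, n_layers + 1)]
    return shared, routers

shared, routers = make_pathmoe_routers(n_layers=24, block_size=4,
                                       d_model=64, n_experts=16)
```

With L = 24 and B = 4 this yields 6 distinct routers; layers 1–4 literally share one `Linear` module, so only ⌈L/B⌉ router weight matrices exist and gradients from all layers in a block accumulate into the same parameters.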
Subsequently, we compute the empirical routing entropy and observe that it is indeed lower for PathMoE than for independent routing, offering a potential explanation for the improved optimization of PathMoE.²

² Note that: (i) in the worst case, PathMoE can degenerate to uniform expert routing (e.g., W_b = 0); (ii) in the best case, independent routing can learn the same W_l within the same block.

Using the chain rule of entropy, the routing entropy can be decomposed and further approximated using a first-order Markov dependency (a meaningful approximation in residual networks):

H(E) = Σ_{l=1}^L H(E_l | E_1, ..., E_{l−1}) ≈ Σ_{l=1}^L H(E_l | E_{l−1}).

Let S be the set of layer indices that start a new block, H_indep(E) be the independent routing entropy, and H_PathMoE(E) be the PathMoE routing entropy. Then

ΔH = H_indep(E) − H_PathMoE(E) ≈ Σ_{l ∉ S} [H_indep(E_l | E_{l−1}) − H_PathMoE(E_l | E_{l−1})].

Using the definition of conditional entropy, H(E_l | E_{l−1}) = H(E_l) − I(E_l; E_{l−1}), where I(E_l; E_{l−1}) is the mutual information between consecutive layers, we have:³

ΔH ≈ Σ_{l ∉ S} [I_PathMoE(E_l; E_{l−1}) − I_indep(E_l; E_{l−1})].

Given the specific representations x_l and x_{l−1}, the routing decisions are sampled independently via the stochastic mechanism:

P(E_l, E_{l−1} | x_l, x_{l−1}) = P(E_l | x_l) P(E_{l−1} | x_{l−1}).

Let p^(l)_k(x_l) = P(E_l = k | x_l) be the probability function for expert k. The marginal joint distribution is defined as:

P(E_l = i, E_{l−1} = j) = E_{x∼D}[p^(l)_i(x_l) · p^(l−1)_j(x_{l−1})].

Then the layers are independent only if the routing probabilities are uncorrelated across the data.

Independent Routing.
Given that (i) the router parameters W_l and W_{l−1} are initialized independently and stay independent for some time during training, and (ii) due to residual connections representations evolve slowly, x_l ≈ x_{l−1}, the decision boundaries of the routing function at layer l are statistically orthogonal to those at layer l−1 across the data. Thus

P(E_l = i, E_{l−1} = j) ≈ P(E_l = i) P(E_{l−1} = j) = E_{x∼D}[p^(l)_i(x_l)] · E_{x∼D}[p^(l−1)_j(x_{l−1})] ⟹ I_indep(E_l; E_{l−1}) ≈ 0.

PathMoE Routing. As layers l and l−1 are within a shared block, the router parameters are identical: W_l = W_{l−1}. Due to residual connections, representations evolve slowly, x_l ≈ x_{l−1}, thus the routing probability functions are close: p^(l)_k(x_l) ≈ p^(l−1)_k(x_{l−1}) ≜ φ_k(x_l). Then the joint probability of selecting the same expert k at both layers is

P(E_l = k, E_{l−1} = k) ≈ E_{x∼D}[φ_k(x_l)²].

By Jensen's inequality, since f(z) = z² is strictly convex and the routing probability φ_k(x_l) varies across the data distribution, E_l and E_{l−1} are correlated and this correlation is concentrated on the diagonal:

E_{x∼D}[φ_k(x_l)²] > (E_{x∼D}[φ_k(x_l)])²
⟹ P(E_l = k, E_{l−1} = k) > P(E_l = k) · P(E_{l−1} = k)
⟹ P(E_l, E_{l−1}) ≠ P(E_l) · P(E_{l−1})
⟹ I_PathMoE(E_l; E_{l−1}) ≫ 0.

Given I_PathMoE(E_l; E_{l−1}) ≫ I_indep(E_l; E_{l−1}) ≈ 0, we obtain ΔH > 0. This shows that PathMoE may introduce strong inter-layer correlations in the marginal distribution of expert paths over the data distribution D, significantly constraining the effective expert path space.

³ Every entropy is defined on the marginal distribution of the corresponding random variable over the data distribution D.

Table 1: Main results comparing with baselines on Fineweb-100B, with a 0.9B total / 0.37B active MoE architecture.
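This argument predicts high mutual information between consecutive layers' expert choices when the router is shared and representations evolve slowly, and near-zero mutual information for independent routers. A small synthetic sketch of that prediction (ours, not an experiment from the paper: random untrained routers and Gaussian features standing in for token representations):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T = 32, 8, 20000          # feature dim, experts, "tokens"

X  = rng.normal(size=(T, d))                 # representations at layer l-1
Xn = X + 0.1 * rng.normal(size=(T, d))       # layer l: slow, residual-like drift

def top1(W, feats):
    """Top-1 expert index per token for router weights W (Eq. 3.1, argmax)."""
    return (feats @ W.T).argmax(axis=1)

def mutual_info_bits(a, b, n):
    """Empirical mutual information (bits) between two expert-index sequences."""
    joint = np.zeros((n, n))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(pa, pb)[nz])).sum())

W_shared = rng.normal(size=(N, d))           # PathMoE-style: one router, both layers
W1, W2   = rng.normal(size=(N, d)), rng.normal(size=(N, d))  # independent routers

mi_shared = mutual_info_bits(top1(W_shared, X), top1(W_shared, Xn), N)
mi_indep  = mutual_info_bits(top1(W1, X), top1(W2, Xn), N)
# Shared router on nearly-identical inputs: MI on the order of log2(N) bits.
# Independent routers: decision boundaries are statistically orthogonal, MI ~ 0.
```

The diagonal concentration from Jensen's inequality shows up directly: with a shared router, tokens that strongly prefer expert k at layer l−1 still prefer it at layer l.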
Throughput is reported per GPU; memory reports peak active GPU memory.

| Routing | ARC-E | BoolQ | HSwag | LAMBADA | OBQA | PIQA | SocIQA | WinoGr. | Avg. | PPL | Throughput (k tok/s/GPU) | Memory (GiB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Independent Routing* | | | | | | | | | | | | |
| Indep-MoE | 44.57 | 56.45 | 45.99 | 46.19 | 29.80 | 66.87 | 38.84 | 51.54 | 47.53 | 12.91 | 55.21 | 66.94 |
| Rand-MoE | 40.49 | 61.10 | 37.95 | 38.83 | 28.20 | 64.47 | 38.89 | 50.99 | 45.12 | 16.14 | 75.17 | 47.49 |
| X-MoE | 43.27 | 60.55 | 46.26 | 45.22 | 32.20 | 67.63 | 39.61 | 52.80 | 48.44 | 13.21 | 56.37 | 67.06 |
| *Recurrent Routing* | | | | | | | | | | | | |
| Recurrent-MoE | 44.07 | 54.50 | 46.05 | 47.12 | 30.80 | 67.74 | 38.79 | 53.67 | 47.84 | 12.92 | 54.65 | 69.26 |
| *Path-Constrained Routing* | | | | | | | | | | | | |
| MonoB8-MoE | 42.55 | 57.80 | 46.36 | 46.73 | 31.20 | 65.67 | 37.56 | 54.06 | 47.74 | 13.34 | 56.53 | 62.86 |
| LowRank-MoE | 44.36 | 59.97 | 46.33 | 46.11 | 31.80 | 66.65 | 39.20 | 52.96 | 48.42 | 12.98 | 53.84 | 66.96 |
| PathMoE | 45.50 | 55.44 | 47.82 | 47.89 | 32.40 | 66.21 | 38.49 | 53.43 | 48.40 | 12.92 | 55.71 | 66.74 |
| PathB8-MoE | 43.73 | 59.97 | 46.65 | 46.92 | 31.40 | 66.81 | 40.69 | 54.93 | 48.89 | 12.53 | 55.20 | 66.94 |
| PathB4-MoE | 44.70 | 60.40 | 47.95 | 49.45 | 31.60 | 66.32 | 40.94 | 55.64 | 49.62 | 12.29 | 55.67 | 66.83 |

Empirical Routing Entropy. We compute the empirical routing entropy over ~7M tokens for both independent routing and PathMoE (B = 4), using a network with L = 24 layers and N = 16 experts. PathMoE achieves an empirical entropy of 21.14 bits, while independent routing yields 22.20 bits: a 1-bit reduction that halves the effective path space. Smaller routing entropy for PathMoE means tokens concentrate into fewer expert paths, so each path receives more training signal, improving sample efficiency. As a result, in Section 7 we show that the average correlation between consecutive layers is 85.6% for PathMoE versus 62% for independent routing. Stronger inter-layer correlations between expert selections in PathMoE mean experts can specialize knowing what inputs to expect from coordinated predecessors, rather than handling arbitrary representations.⁴
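The empirical routing entropy reported above is the plug-in estimate of Eq. (3.4) over observed top-1 paths. A sketch of the metric (our illustration, not the paper's code):

```python
from collections import Counter
import math

def empirical_routing_entropy(paths):
    """Plug-in Shannon entropy (bits) of observed expert paths, Eq. (3.4).

    `paths` is an iterable of tuples (e_1, ..., e_L): the top-1 expert chosen
    at each of the L layers for one token.
    """
    counts = Counter(map(tuple, paths))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A 1-bit gap (22.20 vs. 21.14 bits) halves the effective path space,
# since the effective number of paths scales as 2**H(E).
```

With only finitely many tokens the estimate is a lower bound on the true entropy, which is why the paper computes it over a large (~7M token) sample.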
Together, these properties offer an explanation for why PathMoE would yield faster learning, more robust specialization, and graceful degradation under perturbations compared to independent routing, behavior we validate empirically in Section 7.

⁴ Detailed validation results are in Appendix A.

4 Experiments

4.1 Experimental Setup

We train and evaluate PathMoE based on the Transformer architecture. Model hyperparameter settings closely follow Qiu et al. (2024).

Training Details. Our main experiments use a 0.9B parameter model (0.37B active) with 16 experts and top-4 routing, trained on Fineweb-100B (Penedo et al., 2024) for 400k steps using the Llama 2 tokenizer. We remove the load balancing loss for PathMoE variants but keep weight α = 0.01 for all other models (see discussion in Section 4.3). Full hyperparameters are given in Appendix B.

4.2 Performance Comparison

Baseline Methods. We compare against three categories of routing approaches. Independent routing methods make routing decisions without cross-layer coordination: Indep-MoE employs independent routers at each layer, representing the standard MoE approach; Rand-MoE uses randomly initialized routers that are frozen during training, providing a lower bound on learned routing; X-MoE (Chi et al., 2022) routes tokens in a low-dimensional L2-normalized representation space with a learnable temperature to prevent representation collapse. Recurrent routing leverages historical routing information: Recurrent-MoE (Qiu et al., 2024) uses a shared GRU where each layer's router receives information from the previous layer's hidden state to avoid suboptimal token-expert combinations. Path-constrained routing methods encourage tokens to follow coherent paths.
We propose several approaches to constrain the expert path space. LowRank-MoE uses a shared base router across all layers plus a low-rank layer-specific perturbation: r_l(x) = r_shared(x) + W^low_l x, where W^low_l is a low-rank matrix. This encourages similar routing across layers while allowing layer-specific adaptation. For block-based methods, Bn denotes block size n; omitting the suffix means sharing across all layers. Mono-MoE shares routing decisions within blocks, forcing each token to use the same expert within each block. PathMoE shares routing parameters within blocks, encouraging but not enforcing consistent expert selection as representations evolve.

Evaluation Tasks. We evaluate on eight downstream tasks spanning diverse capabilities: ARC-Easy (Clark et al., 2018) (science questions), BoolQ (Clark et al., 2019) (boolean questions), HellaSwag (Zellers et al., 2019) (sentence completion), LAMBADA (Paperno et al., 2016) (language modeling), OpenBookQA (Mihaylov et al., 2018) (question answering), PIQA (Bisk et al., 2020) (physical reasoning), SocialIQA (Sap et al., 2019) (social commonsense), and WinoGrande (Sakaguchi et al., 2020) (commonsense reasoning). All results report accuracy (%); OpenBookQA, PIQA, ARC-Easy, and HellaSwag use length-normalized accuracy.

Results. Table 1 shows the performance of the various routing approaches (for each model, the checkpoint with the highest average accuracy among the final 5 checkpoints is selected). PathB4-MoE achieves the highest average performance (49.62%) and the best validation perplexity (12.29), outperforming all baselines on both metrics. PathB8-MoE and PathMoE also match or improve upon Indep-MoE's perplexity, demonstrating that block-wise shared routing improves both language modeling and downstream task performance.
Notably, PathMoE variants (sharing parameters) consistently outperform MonoB8-MoE (sharing decisions), confirming that encouraging path coherence, rather than enforcing it, provides a better inductive bias. See Appendix C.2 for results on the DCLM-Pro dataset. Furthermore, we find that block-wise shared routing provides orthogonal benefits: applying PathMoE on top of X-MoE improves average accuracy from 48.44% to 48.86%, demonstrating that cross-layer coordination complements representation-based routing enhancements (see Appendix C.3). Note that the performance differences exceed the standard deviation (~0.28%) across checkpoint selections.

Efficiency. We also report throughput (tokens per second per GPU) and peak memory usage during training. PathB4-MoE matches or slightly exceeds Indep-MoE's efficiency (55.67 vs. 55.21 k tok/s; 66.83 vs. 66.94 GiB), as the only architectural difference is sharing router parameters within blocks, which reduces the total parameter count while maintaining identical forward and backward pass computation. Methods like Recurrent-MoE and LowRank-MoE incur additional overhead due to their more complex routing mechanisms (54.65 and 53.84 k tok/s, respectively). This demonstrates that PathMoE's performance gains come without sacrificing training efficiency.

4.3 Load Balancing Without Auxiliary Losses

Conventional MoE training employs auxiliary load balancing losses to prevent expert collapse (Fedus et al., 2022). We find that PathMoE can be trained without auxiliary losses. Figure 3 compares the training dynamics of PathB4-MoE and Indep-MoE on HellaSwag and ARC-Easy, with and without load balancing losses (LBL). PathB4-MoE without LBL (dashed blue) achieves the best final accuracy on both tasks, while removing LBL from Indep-MoE leads to more erratic training dynamics.
This suggests that PathMoE is more robust to the removal of auxiliary losses, maintaining smooth convergence regardless of the load balancing setting. See full results in Appendix C.1.

4.4 Scaling to 16B Parameters

To verify that the benefits of path constraining persist at larger scales, we train 16B parameter models (16.2B total, 2.13B active) with 64 experts and top-6 routing on DCLM-Pro (Zhou et al., 2024) for 200k steps using the GPT-NeoX-20B tokenizer, without load balancing loss. We additionally evaluate on MMLU (Hendrycks et al., 2020), ARC-Challenge (Clark et al., 2018), CommonsenseQA (Talmor et al., 2019), and TriviaQA (Joshi et al., 2017) (5-shot exact match). Full hyperparameters are in Appendix B.

Figure 3: Training dynamics comparing PathB4-MoE and Indep-MoE with and without load balancing losses (accuracy vs. training tokens on HellaSwag and ARC-Easy). Curves are smoothed with an exponential moving average (weight = 0.6) for clarity.

Table 2 shows that PathMoE wins on 10 of 12 tasks across commonsense, language, and knowledge categories, with particularly strong gains on CommonsenseQA (+5.73%), ARC-Easy (+5.09%), and OpenBookQA (+3.80%).

Table 2: 16B model results on DCLM-Pro (16.2B total / 2.13B active).

| Category | Task | Indep-MoE | PathMoE |
|---|---|---|---|
| Commonsense | WinoGrande | 56.12 | 57.93 |
| | PIQA | 68.99 | 71.38 |
| | SocialIQA | 40.58 | 40.07 |
| | CommonsenseQA | 43.00 | 48.73 |
| Language | LAMBADA | 51.97 | 53.64 |
| | HellaSwag | 63.22 | 63.32 |
| | BoolQ | 63.06 | 64.01 |
| Knowledge | OpenBookQA | 35.80 | 39.60 |
| | ARC-Easy | 63.43 | 68.52 |
| | ARC-Challenge | 37.29 | 40.10 |
| | TriviaQA | 21.91 | 21.18 |
| | MMLU | 35.27 | 35.64 |
| | Average | 48.39 | 50.34 |

5 Emergent Expert Path Structure

Beyond performance gains, PathMoE reveals interpretable structure in how tokens are processed.
We find that expert paths are not arbitrary computational routes but develop meaningful specializations: tokens following the same path naturally cluster by linguistic function. Punctuation, named entities, temporal expressions, and other categories each tend to follow distinct paths through the network. This section examines this emergent structure: the distribution of paths (Section 5.1), the relationship between path concentration and performance (Section 5.2), and the token specialization patterns that arise (Section 5.3).

5.1 Expert Path Distribution

For each expert path e we can define its frequency freq(e) as the proportion of input tokens x whose representations (x_1, ..., x_L) across layers follow the path e:

freq(e) = |{x : π(x_1, ..., x_L) = e}| / |{x}|. (5.1)

To visualize how tokens are distributed across paths, we sort paths by frequency and compute the cumulative token coverage: the fraction of tokens covered by the top-K most frequent paths.

Figure 4: Cumulative token coverage as a function of the number of paths (ranked by frequency), for PathB4-MoE and Indep-MoE.

We analyze the empirical distribution of expert paths that emerges from PathMoE. Figure 4 compares the cumulative token coverage of PathB4-MoE and Indep-MoE. We observe that PathB4-MoE concentrates tokens into fewer paths than Indep-MoE, with the top paths covering a larger fraction of tokens. This concentration reflects the block-wise shared routing, which encourages tokens to follow consistent paths through the network.

5.2 Expert Path Concentration

To understand the relationship between path concentration and model performance, we evaluate models with different levels of path restriction.
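The coverage curve in Figure 4 follows directly from the path frequencies of Eq. (5.1). A sketch of the metric (our illustration; the function name is ours):

```python
from collections import Counter

def cumulative_token_coverage(paths, top_k):
    """Fraction of tokens covered by the top_k most frequent expert paths,
    the quantity plotted in Figure 4. `paths` holds one (e_1, ..., e_L)
    tuple per token."""
    top = Counter(map(tuple, paths)).most_common(top_k)
    return sum(count for _, count in top) / len(paths)
```

Sweeping `top_k` from 1 to the number of distinct paths traces out the curve; a model that concentrates tokens into fewer paths (PathB4-MoE in the paper's Figure 4) reaches high coverage at a smaller `top_k`.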
Notably, the dominant paths emerge very early in training: we train PathB8-MoE for only 5k steps (1.25% of total training), save a checkpoint, and identify the most frequent paths at this early stage. We then continue training with routing restricted to these top-10, top-100, or top-500 paths. Table 3 shows that restricting to only 10 paths significantly degrades performance (46.22%), 100 paths recovers most of the performance (48.18%), and 500 paths performs comparably to unrestricted routing (48.86% vs. 48.71%). This suggests the model naturally concentrates on a moderate number of specialized paths.

Table 3: Effect of expert path constraints on model performance.

| # Paths | ARC-E | BoolQ | HSwag | LAMBADA | OBQA | PIQA | SocIQA | WinoGr. | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| 10 | 40.45 | 59.54 | 39.34 | 40.87 | 30.40 | 66.81 | 37.82 | 54.54 | 46.22 |
| 100 | 43.81 | 56.79 | 47.04 | 45.10 | 30.80 | 67.46 | 40.23 | 54.22 | 48.18 |
| 500 | 44.74 | 59.42 | 48.24 | 47.58 | 31.20 | 66.00 | 39.97 | 53.75 | 48.86 |
| All | 44.82 | 56.48 | 47.76 | 46.42 | 31.60 | 66.76 | 39.41 | 56.43 | 48.71 |

5.3 Expert Path Token Specialization

Figure 5 reveals that expert paths develop distinct linguistic specializations without any supervision. Some paths predominantly process punctuation, others handle person names or speech verbs ("said", "explained"), and yet others specialize in temporal expressions or function words.

Figure 5: Token specialization of representative expert paths (punctuation, numbers, determiners, prepositions, verbs, conjunctions, adjectives, temporal words, person names, speech verbs, common verbs, pronouns). Each word cloud shows the most frequent tokens processed by a path specialized for that linguistic category.

This clustering by linguistic function emerges purely from the language modeling objective: the model learns to route syntactically or semantically similar tokens through shared computational pathways.
Notably, this specialization is also observed in Indep-MoE, suggesting it is a general property of MoE routing rather than specific to our method. However, PathMoE produces more concentrated token clusters within each path. Thus we observe an interpretable routing structure in which each expert path develops expertise for particular linguistic phenomena.

6 Ablation Studies

Top-k Routing Analysis. We investigate the effect of the top-k routing parameter across values of 1, 2, 4, 8, and 16. Figure 6(a) shows that top-2 routing achieves the best average accuracy (49.99%), consistent with Mixtral (Jiang et al., 2024). Performance degrades at both extremes: k=1 suffers from training instability and limited capacity, while k=16 dilutes expert contributions; both yield similar accuracy (~45.7%). The sweet spot at k=2 provides sufficient capacity while maintaining specialization pressure, with particularly strong gains on LAMBADA (+10.3%) and HellaSwag (+8.1%) compared to k=1.

Figure 6: Ablation studies on routing hyperparameters. (a) Effect of top-k routing: top-2 achieves the best average accuracy (circled). (b) Effect of block size: B4 achieves the best average accuracy (circled). Dashed lines show individual benchmarks (LAMBADA, OBQA, WinoGr., SocIQA, PIQA, ARC-E, HSwag, BoolQ); the solid black line shows the average.

Block Size Analysis. We investigate the effect of block size across values of 1, 2, 4, 8, 12, and 24 layers. Block size controls the trade-off between routing flexibility and cross-layer coordination: small blocks (B1, B2) provide insufficient coordination, as layers learn nearly independently, while large blocks (B12, B24) over-constrain the router to handle representations that change substantially across layers.
Figure 6 (b) confirms that intermediate sizes work best: B4 achieves the highest average accuracy (49.62%). We observe task-dependent preferences (language modeling and commonsense tasks peak at B4, physical reasoning at B8, and knowledge-intensive tasks at B24), suggesting that optimal block size may vary by domain.

7 Understanding PathMoE: Routing Consistency and Robustness

Beyond performance gains, understanding why PathMoE works can guide future MoE design. We examine two key properties: cross-layer routing consistency and robustness to perturbations.

7.1 Cross-Layer Routing Consistency

Since expert indices are arbitrary across layers (expert 0 at layer l has no inherent correspondence to expert 0 at layer l + 1), we first align indices by finding permutations that maximize cross-layer agreement. We then examine two complementary metrics. Path consistency measures expert reuse: we compute the average Jaccard similarity between expert sets at consecutive layers. Sustained engagement measures how long a token continues using a given aligned expert: we count runs of consecutive layers and report the fraction lasting ≥ X layers.

Figure 7  Cross-layer routing consistency. (a) Path consistency versus window size. (b) Sustained engagement versus minimum consecutive layers.

Figure 7 (a) shows that PathMoE maintains consistently higher path consistency (~79%) compared to Indep-MoE (~48%) across all window sizes from 2 to 12 layers. This gap persists even at non-block-aligned windows (3, 5, 7 layers), confirming the benefit stems from the architecture rather than block boundaries.
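The path-consistency metric is straightforward to compute once expert indices are aligned. A minimal sketch (illustrative names, top-2 routing assumed, alignment already applied):

```python
def path_consistency(expert_sets):
    """Average Jaccard similarity between the (aligned) expert sets
    a token selects at consecutive layers."""
    sims = [len(a & b) / len(a | b)
            for a, b in zip(expert_sets, expert_sets[1:])]
    return sum(sims) / len(sims)

# top-2 expert sets chosen by one token across four layers
sets = [{0, 5}, {0, 5}, {0, 3}, {3, 7}]
score = path_consistency(sets)  # (1 + 1/3 + 1/3) / 3 ≈ 0.56
```

Averaging this score over tokens, restricted to a sliding window of layers, yields the window-size curves in Figure 7 (a).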
Panel (b) measures how long tokens continue using the same expert, where PathMoE produces substantially more sustained engagements.

7.2 Expert Specialization and Routing Robustness

We examine the specialization-robustness trade-off to check whether PathMoE's routing creates brittle dependencies.

Routing Specialization. We measure specialization via routing entropy: for each token, we compute the entropy of its expert usage distribution across all layers, where lower entropy indicates more concentrated (specialized) routing. Figure 8 (a) shows that PathB4-MoE achieves consistently lower entropy than Indep-MoE across all layers, with 11% lower entropy at the final layer.

Routing Robustness. We test robustness by randomly permuting expert indices at each layer with probability p and measuring perplexity degradation. Figure 8 (b) shows that PathMoE is dramatically more robust despite being more specialized: at full permutation, Indep-MoE degrades by 5,328% versus only 237% for PathMoE. Thus, the two methods specialize differently.

Figure 8  Routing specialization and robustness. (a) Cumulative routing entropy across layers. (b) Perplexity degradation under expert permutation.

8 Conclusion

We introduced PathMoE, which constrains the combinatorial expert path space by sharing router parameters within blocks of consecutive layers. This simple modification yields three key benefits: (1) consistent performance gains on downstream tasks (+2.1% average accuracy on the 0.9B model and winning 10 of 12 tasks on the 16B model) without auxiliary load balancing losses; (2) improved cross-layer coordination (79% vs.
48% routing consistency); (3) more specialized yet more robust routing (11% lower entropy, 22.5× more robust to perturbations). Analysis reveals that expert paths develop interpretable linguistic specializations, with tokens clustering by function (punctuation, named entities, temporal expressions) purely from the language modeling objective. The block-wise design provides a middle ground between independent routing (too flexible, N^L paths) and fully shared routing (too rigid), encouraging path coherence while allowing adaptation across network depth. A limitation is that block-wise sharing assumes token-choice routing; preliminary experiments show no benefit for expert-choice routing, which already achieves consistency through stable expert preferences. Future work could explore learning the expert path predictors jointly with the main model or using the expert path structure for model compression.

Impact Statement

This paper introduces an architectural modification to Mixture-of-Experts routing that improves downstream model performance, cross-layer coordination, and model interpretability. By encouraging tokens to follow coherent expert paths, our approach reveals how MoE models naturally develop linguistic specializations, providing insights that could aid model understanding and debugging. The efficiency gains from eliminating auxiliary load balancing losses and achieving better parameter utilization could reduce the computational cost of training large models. However, our work focuses on routing mechanisms rather than capability scaling, and does not introduce risks beyond those inherent to large language models research.

Acknowledgements

We thank Ronan Collobert, Yizhe Zhang, Samira Abnar, Anastasiia Filippova, Shuangfei Zhai, Russ Webb and Barry Theobald for helpful discussions and feedback.

References

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al.
Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020.

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, et al. On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems, 35:34600–34613, 2022.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.

Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. StableMoE: Stable routing strategy for mixture of experts. arXiv preprint arXiv:2204.08396, 2022.

Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR, 2022.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

Zijin Gu, Tatiana Likhomanenko, and Navdeep Jaitly.
Omni-router: Sharing routing decisions in sparse mixture-of-experts for speech recognition. arXiv preprint arXiv:2507.05724, 2025.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc-Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, 2016.
Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024.

Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. arXiv preprint arXiv:2308.00951, 2023.

Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, and Jie Fu. Layerwise recurrent router for mixture-of-experts. arXiv preprint, 2024.

David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258, 2024.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019.

Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, and Bo Yuan.
MoE-I²: Compressing mixture of experts models through inter-expert pruning and intra-expert low-rank decomposition. arXiv preprint, 2024.

Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, and Sara Hooker. Pushing mixture of experts to the limit: Extremely parameter efficient MoE for instruction tuning. arXiv preprint arXiv:2309.05444, 2023.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, and Pengfei Liu. Programming every example: Lifting pre-training data quality like experts at scale. arXiv preprint arXiv:2409.17115, 2024.

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022.

A Empirical Routing Entropy

We provide detailed empirical routing entropy computation and expert path space statistics for both independent routing and PathMoE routing from Section 3. Table 4 reports comprehensive metrics for Indep-MoE and PathB4-MoE measured on 7.18M tokens.

Table 4  Detailed empirical routing entropy computation and expert path space statistics for Indep-MoE and PathB4-MoE (block size B = 4) with L = 24 layers and N = 16 experts. All entropy measurements are in bits. We use evaluation data with 7.18M tokens.

Empirical Metric                                       PathB4-MoE    Indep-MoE
Routing entropy H(E)                                   21.14 bits    22.20 bits
Routing decisions correlation for consecutive layers   85.6%         62%
Unique paths observed                                  5,109,282     6,263,708
Effective path space 2^H(π)                            2.31 × 10^6   4.82 × 10^6

B Training Details

Table 5 summarizes the hyperparameters for both model scales.

Table 5  Training hyperparameters for 0.9B and 16B models.
Hyperparameter        0.9B Model      16B Model

Model Architecture
Total parameters      0.9B            16.2B
Active parameters     0.37B           2.13B
Model dimension       1280            2048
FFN hidden dimension  768             2112
Layers                24              28
Attention heads       20              16
KV heads              4               16
Experts               16              64
Top-k routing         4               6

Training
Dataset               Fineweb-100B    DCLM-Pro
Tokenizer             Llama 2 (32k)   GPT-NeoX-20B
Sequence length       4096            4096
Global batch size     1024            2048
Training steps        400k            200k

Optimization
Optimizer             AdamW           AdamW
β1, β2                0.9, 0.95       0.9, 0.95
Weight decay          0.1             0.1
Peak learning rate    4.2 × 10^-4     2.0 × 10^-4
LR schedule           Cosine          Cosine
Warmup steps          2000            2000
Precision             BF16            BF16

The 0.9B model architecture follows Qiu et al. (2024). The 16B model architecture follows DeepSeek-MoE (Dai et al., 2024) without shared experts to isolate the effect of routing changes.

C Additional Results

C.1 Results on removing load balancing loss (0.9B Model)

We provide full experimental results on keeping and removing the load balancing loss (LBL) for Indep-MoE and PathB4-MoE.

Table 6  Performance comparison on Fineweb-100B dataset (0.9B total / 0.37B active).

Routing     LBL   ARC-E  BoolQ  HSwag  LAMBADA  OBQA   PIQA   SocIQA  WinoGr.  Avg.
Indep-MoE   0.01  44.57  56.45  45.99  46.19    29.80  66.87  38.84   51.54    47.53
Indep-MoE   0     43.18  58.29  47.05  47.55    28.80  65.94  39.36   53.67    47.98
PathB4-MoE  0.01  44.15  57.06  46.07  46.61    32.60  66.49  40.23   51.93    48.14
PathB4-MoE  0     44.70  60.40  47.95  49.45    31.60  66.32  40.94   55.64    49.62

C.2 Results on DCLM-Pro Dataset (0.9B Model)

We provide additional experimental results on the DCLM-Pro dataset using the 0.9B (total) parameter model. Table 7 presents the performance comparison across routing methods.

Table 7  Performance comparison on DCLM-Pro dataset (0.9B total / 0.37B active).

Routing      ARC-E  BoolQ  HSwag  LAMBADA  OBQA   PIQA   SocIQA  WinoGr.  Avg.
Indep-MoE    55.18  56.27  48.19  40.73    35.80  66.70  39.46   55.96    49.79
LowRank-MoE  58.84  59.42  48.26  40.68    33.00  66.54  40.69   54.38    50.22
PathMoE      57.49  60.06  48.30  40.69    34.40  67.25  39.87   56.99    50.63
PathB8-MoE   58.16  58.20  48.26  40.35    34.40  66.65  40.33   54.38    50.09

C.3 Orthogonal Benefits with Other Routing Methods

Block-wise shared routing provides orthogonal benefits that can be combined with other routing improvements. Table 8 shows the performance of applying PathMoE (block size 4) on top of X-MoE, which routes tokens in a low-dimensional normalized representation space. Load balancing losses are used for both models.

Table 8  Combining PathMoE with X-MoE on Fineweb-100B (0.9B total / 0.37B active). PathMoE improves X-MoE's average accuracy (standard deviation is ~0.28%).

Routing      ARC-E  BoolQ  HSwag  LAMB.  OBQA   PIQA   SocIQA  WinoGr.  Avg.
X-MoE        43.27  60.55  46.26  45.22  32.20  67.63  39.61   52.80    48.44
PathB4X-MoE  44.07  61.07  47.29  48.30  30.80  66.43  39.97   52.96    48.86

The improvement demonstrates that cross-layer coordination through block-wise shared routing complements representation-based routing enhancements, suggesting that PathMoE can be applied as a general technique on top of existing routing methods.

C.4 Token Category Definitions

For our path specialization analysis, we classify tokens into detailed linguistic categories. Table 9 provides the complete list of categories used in our analysis.

Table 9  Token categories used for path specialization analysis. Each category is defined by curated word lists or morphological patterns.
Category           Type     Examples
person_names       Lexical  Andrea, Richard, Oprah, Mary, John, Elizabeth
titles_roles       Lexical  secretary, minister, commander, winner, professor, CEO
speech_verbs       Lexical  said, explained, told, asked, claimed, announced
adverbs_discourse  Lexical  especially, actually, particularly, however, therefore
adverbs_manner     Lexical  quickly, carefully, successfully, directly, properly
adverbs_time       Lexical  now, today, recently, always, sometimes, currently
adverbs_other      Pattern  Words ending in -ly not in above lists
nationalities      Lexical  American, British, Chinese, European, Japanese
temporal_words     Lexical  Monday, January, morning, year, summer, afternoon
prepositions       Lexical  in, on, at, to, for, with, by, from, about
conjunctions       Lexical  and, or, but, because, although, while, if
determiners        Lexical  the, a, an, this, that, some, every, each
pronouns           Lexical  he, she, they, it, who, someone, anybody
quantifiers        Lexical  all, many, most, million, percent, several
common_verbs       Lexical  is, have, go, make, take, know, said, get
verbs              Pattern  Words ending in -ing or -ed (running, created)
adjectives         Lexical  good, new, important, political, economic, public
proper_nouns       Pattern  Capitalized words not in name list
abstract_nouns     Pattern  Words ending in -tion/-sion/-ness/-ment
agent_nouns        Pattern  Words ending in -er/-or (player, actor)
numbers            Pattern  Numeric digits (1, 2, 100, 2024)
ordinals           Pattern  Numeric ordinals (1st, 2nd, 3rd)
punctuation        Pattern  Punctuation marks (. , ! ? ; : ' " ( ))
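A rule-based classifier in the spirit of Table 9 can be sketched as follows. The word lists here are abridged placeholders rather than the curated lists used in the analysis, and the rule order (lexical lookups before morphological patterns) mirrors the table's Lexical/Pattern distinction:

```python
import re

# abridged stand-ins for the curated lexical lists of Table 9
LEXICAL = {
    "speech_verbs": {"said", "explained", "told", "asked"},
    "determiners": {"the", "a", "an", "this", "that"},
    "prepositions": {"in", "on", "at", "to", "for", "with"},
}

def categorize(token):
    """Assign a token to a linguistic category: curated word lists
    first, then morphological patterns, as in Table 9."""
    low = token.lower()
    for cat, words in LEXICAL.items():
        if low in words:
            return cat
    if re.fullmatch(r"\d+", token):
        return "numbers"
    if re.fullmatch(r"\d+(st|nd|rd|th)", token):
        return "ordinals"
    if low.endswith(("ing", "ed")):
        return "verbs"
    if low.endswith("ly"):
        return "adverbs_other"
    if re.fullmatch(r"[^\w\s]+", token):
        return "punctuation"
    return "other"
```

Applying such a classifier to the most frequent tokens of each expert path yields the per-path category profiles visualized in Figure 5.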