Mixture of Experts with Soft Nearest Neighbor Loss: Resolving Expert Collapse via Representation Disentanglement


Authors: Abien Fred Agarap, Arnulfo P. Azcarraga

Abien Fred Agarap (abien.agarap@dlsu.edu.ph), De La Salle University, Manila, Philippines
Arnulfo Azcarraga (arnulfo.azcarraga@dlsu.edu.ph), De La Salle University, Manila, Philippines

ABSTRACT

The Mixture-of-Experts (MoE) model uses a set of expert networks that specialize on subsets of a dataset under the supervision of a gating network. A common issue in MoE architectures is “expert collapse,” where overlapping class boundaries in the raw input feature space cause multiple experts to learn redundant representations, forcing the gating network into rigid routing to compensate. We propose an enhanced MoE architecture that applies a feature extractor network optimized with the Soft Nearest Neighbor Loss (SNNL) before feeding input features to the gating and expert networks. By pre-conditioning the latent space to minimize distances among class-similar data points, we resolve structural expert collapse, which results in experts learning highly orthogonal weights. We employ Expert Specialization Entropy and Pairwise Embedding Similarity to quantify this dynamic. We evaluate our approach across four benchmark image classification datasets (MNIST, FashionMNIST, CIFAR10, and CIFAR100), and we show that our SNNL-augmented MoE models develop structurally diverse experts, which allows the gating network to adopt a more flexible routing strategy. This paradigm significantly improves classification accuracy on the FashionMNIST, CIFAR10, and CIFAR100 datasets.
CCS CONCEPTS

• Computing methodologies → Learning latent representations; Ensemble methods; Semi-supervised learning settings

KEYWORDS

contrastive learning, ensemble learning, mixture-of-experts, representation learning

1 INTRODUCTION AND RELATED WORKS

An artificial neural network solves a given problem by learning to approximate the function that describes the relationship between dataset features 𝑥 and labels 𝑦 in a given training set. This is accomplished by training with backpropagation of its output errors [21] and adjusting the network weights based on such error signals. Neural networks can be used for a plethora of tasks such as image classification [15], language translation [6], and speech recognition [10]. A network learns to solve these problems by first learning the best representations of the input features, and then performing the given task on the resulting learned representations. This representation learning capability is among the advantages of neural networks over other machine learning algorithms.

1.1 Class Neighborhood Structure

A number of methods have been used to further improve the performance of neural networks on downstream problem-specific tasks, such as expressing high-dimensional data in a low-dimensional space in which the structure of a given dataset may be purposely learned. This structure encodes the class information of a dataset, indicating how class-similar data points cluster together in the projected space. Several techniques are popularly used for this purpose, such as principal components analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) [17], the triplet loss [5], and the soft nearest neighbor loss [1, 7, 22], which transform the input features so that they contain attributes primed for classification. These techniques capture the underlying class neighborhood structure of the input data.
However, each technique has its respective drawbacks. If the most salient features of the data lie on a nonlinear manifold, then a linear technique like PCA will not be able to fully capture the underlying structure of the data. Nonlinear techniques like t-SNE, the soft nearest neighbor loss, and the triplet loss, on the other hand, incur a relatively expensive computational cost. Moreover, t-SNE discovers different representations of the same input data depending on its hyperparameters, while the soft nearest neighbor loss and the triplet loss can be even slower to compute than t-SNE, depending on the available compute resources.

Despite these known drawbacks, it is still desirable to learn the underlying structure of a dataset because it reveals how the input features form clusters, and consequently, these clusters imply the class membership of the input features as per the clustering assumption in the semi-supervised learning literature [3]. Although nonlinear techniques are relatively slower and more computationally expensive, advances in computational hardware have made these costs negligible for those with the necessary resources.

1.2 Ensemble Learning

Aside from learning the class neighborhood structure of a dataset, we can improve the performance of neural networks by combining their outputs through averaging or summation [2]. This technique is known as ensemble learning, which enforces cooperation among neural networks toward a common goal. In this context, we define cooperation as the phenomenon where the neural networks in an ensemble contribute to the overall success of the group. The networks in an ensemble cooperate by compensating for one another's performance, thereby decreasing the error correlation of the group.
PCSC2026, April 2026, Davao, Philippines

In contrast, if a training set can be naturally divided into subsets, the group of neural networks can instead be encouraged to specialize on their own subsets rather than to cooperate among themselves. This approach is known as the mixture-of-experts model [13], which uses a gating network as a supervisor to choose which sub-network should be assigned to a subset of the dataset based on its predictive performance on that subset.

However, if the raw input features fed to the gating network contain highly entangled class boundaries, the gating network struggles to confidently partition the space. This often leads to “expert collapse,” where a single expert dominates the task or multiple experts learn redundant representations. In such cases, the gating network often resorts to hard-routing strategies that fail to generalize well on complex datasets.

1.3 Contributions

To address the issue of expert collapse, we propose an architectural pipeline that primes the input features to the MoE model using SNNL. Our core contributions are:

(1) Architectural Enhancement. We employ a feature extractor regularized with SNNL before the MoE routing phase, minimizing intra-class distances to simplify the gating network's partitioning task.

(2) Quantitative Specialization Metrics. We address the need for detailed analysis by introducing Expert Specialization Entropy and Pairwise Embedding Similarity to empirically measure expert divergence and routing flexibility.

(3) Statistical Rigor and Visual Proof. We conduct robust non-parametric statistical testing (Wilcoxon signed-rank test) and provide comprehensive manifold and embedding visualizations to show that SNNL prevents structural collapse, thus allowing experts to be utilized more collaboratively on complex benchmark datasets.
2 BACKGROUND

2.1 Feed-Forward Neural Network

The feed-forward neural network is the quintessential deep learning model used to approximate the function mapping between the dataset inputs and outputs, i.e. $y \approx f(\vec{x}; \theta)$. Its parameters $\theta$ are optimized to learn the best approximation of the input targets, which may either be a class label (classification) or a real value (regression).

$$f(\vec{x}) = f^{(n)}\left(\dots f^{(1)}(\vec{x})\right) \quad \text{(1)}$$

$$\ell_{ce}(y, f(x)) = -\sum_{i} y_i \log\left[f(x_i)\right] \quad \text{(2)}$$

To accomplish this, the model composes a number of nonlinear functions in the form of hidden layers, each of which learns a representation of the input features (see Eq. 1). The discrepancy between the approximation and the targets is then measured with an error function such as the cross entropy (see Eq. 2), which serves as the basis for optimization, usually through gradient-based learning.

For our experiments, we use a feed-forward neural network with a single hidden layer containing 128 units. The hidden layer weights were initialized with the Kaiming initializer [12] and used the ReLU [19] activation function, while the output layer was initialized with the Xavier initializer [9]. We use this architecture for the expert networks in our MoE models. For the gating network, we simply use a linear classifier to select which expert to use for particular input features.

2.2 Convolutional Neural Network

The convolutional neural network (CNN) is a neural network architecture that uses the convolution operator as the feature extractor in its hidden layers [8, 16]. Like feed-forward neural networks, CNNs compose hidden layer representations for a downstream task; however, the convolution operator allows them to learn better representations of the input data.

Figure 1: Architecture of the CNN-based feature extractor.
This module computes disentangled representations that are subsequently routed to the MoE gating network and experts.

In Figure 1, we illustrate the CNN-based feature extractor we use to prime the input features for the gating and expert networks across all our experiments. The network processes the raw input features through a sequence of two feature blocks to construct a disentangled latent representation. Each block consists of a 2D convolutional layer with a 3 × 3 kernel (32 filters in the first block and 64 filters in the second), immediately followed by 2D batch normalization, a ReLU activation function to introduce non-linearity, and a 2 × 2 max pooling layer to progressively downsample the spatial dimensions. After the second block, the resulting multi-channel feature maps are flattened into a 1D vector. This final embedding serves as the topologically structured, SNNL-optimized input fed directly into the downstream MoE gating network and expert modules.

2.3 Mixture-of-Experts

The Mixture-of-Experts (MoE) model consists of a set of 𝑛 “expert” neural networks 𝐸₁, …, 𝐸ₙ and a “gating” neural network 𝐺. Figure 2 illustrates the MoE model. The experts are chosen by the gating network to handle subsets of the dataset, where each subset corresponds to a sub-region of the data space. The output of this model is given by the following equation,

$$\hat{y} = \sum_{i=1}^{n} G(x)_i \, E_i(x) \quad \text{(3)}$$

Figure 2: The mixture-of-experts model is a system of expert and gating networks where each expert becomes a function of a subset of the input environment. The expert networks receive the same inputs and produce the same number of outputs.
The gating network also receives the same input as the expert networks, but its output is the probability of choosing a particular expert for a given input: 𝐺(𝑥)ᵢ is the probability output of the gating network choosing expert 𝐸ᵢ for a given input 𝑥. The gating network and the experts each have their own set of parameters 𝜃. In our classification experiments, we modified the inference function above as follows,

$$\hat{y} = E_{i^*}(x), \quad i^* = \arg\max_i G(x)_i \quad \text{(4)}$$

This modified inference function selects the best expert network 𝐸ᵢ as indicated by the highest gating network output. Without this modification, the model is akin to a traditional ensemble model, albeit one that outputs a weighted summation of expert outputs instead of a simple summation. Subsequently, we optimize the MoE model with the following error function,

$$\mathcal{L}_{moe}(x, y) = \frac{1}{b} \sum_{i=1}^{n} G(x)_i \cdot \ell_{ce}(y, E_i(x)) \quad \text{(5)}$$

where ℓ_ce is the cross entropy function measuring the difference between the target 𝑦 and the output of expert 𝐸ᵢ, while 𝐺(𝑥)ᵢ is the probability output for choosing expert 𝑖. The loss is then averaged over the 𝑏 samples in the batch.

In this system, each expert learns to specialize on the cases where it performs well, and it is made to ignore the cases on which it does not. With this learning paradigm, the experts become functions of sub-regions of the data space, and thus their learned weights differ highly from one another, as opposed to traditional ensemble models whose learners end up with almost identical weights.

3 RESOLVING EXPERT COLLAPSE WITH DISENTANGLEMENT

3.1 Soft Nearest Neighbor Loss

We define disentanglement as how close pairs of class-similar data points are to each other, relative to pairs of class-different data points, and we can measure this with the soft nearest neighbor loss (SNNL) function [7, 22].
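As a concrete sketch of the MoE formulation above (Eqs. 3–5), assuming PyTorch and illustrative input/expert sizes (the 128-unit expert hidden layer follows Sec. 2.1; the paper's Kaiming/Xavier initialization is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoE(nn.Module):
    """Mixture-of-experts with a linear gating network (Secs. 2.1, 2.3).
    Each expert: one 128-unit ReLU hidden layer; gate: a linear classifier."""
    def __init__(self, in_features: int, num_classes: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_features, 128), nn.ReLU(),
                          nn.Linear(128, num_classes))
            for _ in range(num_experts)])
        self.gate = nn.Linear(in_features, num_experts)

    def forward(self, x):
        g = torch.softmax(self.gate(x), dim=-1)            # G(x): (b, n)
        e = torch.stack([E(x) for E in self.experts], 1)   # (b, n, classes)
        return g, e

def moe_loss(g, e, y):
    """Eq. 5: gate-weighted cross entropy, averaged over the batch."""
    ce = torch.stack([F.cross_entropy(e[:, i], y, reduction="none")
                      for i in range(e.shape[1])], dim=1)  # (b, n)
    return (g * ce).sum(dim=1).mean()

def moe_predict(g, e):
    """Eq. 4: at inference, take the output of the argmax expert per sample."""
    best = g.argmax(dim=-1)                                # (b,)
    return e[torch.arange(e.shape[0]), best]               # (b, classes)

model = MoE(in_features=64, num_classes=10)
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
g, e = model(x)
loss = moe_loss(g, e, y)
preds = moe_predict(g, e)
```

Note that, as the paper discusses, dropping the argmax selection in `moe_predict` and summing `g`-weighted expert outputs would reduce the model to a weighted ensemble.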
This loss function expands on the original nonlinear neighborhood components analysis objective, which minimizes the distances among class-similar data points in the latent code of an autoencoder network [22]. The SNNL function instead minimizes the distances among class-similar data points in each hidden layer of a neural network [7]. For a batch of 𝑏 samples (𝑥, 𝑦), the SNNL is defined as follows,

$$\ell_{snn}(x, y, T) = -\frac{1}{b} \sum_{i \in 1..b} \log \left( \frac{\sum_{\substack{j \in 1..b,\, j \neq i \\ y_i = y_j}} \exp\left(-\frac{d_{ij}}{T}\right)}{\sum_{\substack{k \in 1..b,\, k \neq i}} \exp\left(-\frac{d_{ik}}{T}\right)} \right) \quad \text{(6)}$$

where 𝑑 is a distance metric on either the raw input features or the learned hidden layer representations 𝑥 of a neural network, and 𝑇 is a temperature parameter that influences the value of the loss function: at high temperatures, the distances among widely separated data points can influence the loss value.

3.2 Our Approach

We propose to use a feature extractor network 𝑓 for the MoE model instead of feeding the raw input features to the expert and gating networks (see Figure 3). Using our feature extractor network optimized with SNNL, we prime the input features via disentanglement with respect to their class labels. Our proposal thus leads to the optimization of a composite loss (see Eq. 7), with the cross entropy as the classification loss and the SNNL as the regularizer,

$$\mathcal{L}(f, x, y) = \mathcal{L}_{moe}(x, y) + \alpha \cdot \min_{i} \ell_{snn}(f_i(x), y, T) \quad \text{(7)}$$

We found the optimization of the SNNL to be more stable and effective when we take its minimum value across the hidden layers 𝑓ᵢ of the feature extractor network.

3.3 Theoretical Analysis

To formalize the contribution of the soft nearest neighbor loss (SNNL) to expert specialization, we analyze the interaction between the feature extractor 𝑓 and the Mixture-of-Experts (MoE) gating mechanism.

3.3.1 Latent Space Partitioning via SNNL.
The SNNL objective ℓ_snn serves as a topological regularizer on the feature extractor's hidden layers. By minimizing the distance between class-similar data points relative to class-different ones, the feature extractor 𝑓(𝑥⃗) maps the high-dimensional input 𝑥⃗ into a lower-dimensional latent manifold where class clusters are highly localized. Mathematically, the SNNL gradient with respect to the weights of 𝑓 minimizes the pairwise distance 𝑑ᵢⱼ for all 𝑦ᵢ = 𝑦ⱼ. This results in a latent representation 𝑧 = 𝑓(𝑥⃗) where the intra-class variance is minimized.

Figure 3: We optimize the soft nearest neighbor loss over the hidden layers of the feature extractor network before the MoE model. In doing so, the input features to the expert and gating networks are transformed into a set of representations with the classification information ingrained in them, thereby helping improve the overall classification performance of the MoE model.

3.3.2 Gating Dynamics and Expert Selection. In a traditional MoE model, the gating network 𝐺(𝑥⃗) often suffers from “expert collapse,” where a single expert dominates the gradients due to a disorganized input space. In our approach, the gating network operates on the transformed representation 𝐺(𝑓(𝑥⃗)). Since 𝑓(𝑥⃗) essentially clusters inputs by class, the gating function 𝐺(𝑧) encounters a “pre-partitioned” environment. Let 𝑧_𝑐 be the centroid of class 𝑐 in the latent space. The gating network learns a mapping 𝜙 : 𝑧_𝑐 → {1, …, 𝑛}, effectively assigning a specific expert 𝐸ᵢ to a specific class cluster. As the inference function utilizes arg maxᵢ 𝐺(𝑧)ᵢ, the model selects the “best” expert for a given sub-region of the data space.

3.3.3 Formalizing the Specialization.
Specialization is achieved when the variance of the input distribution seen by expert 𝐸ᵢ is significantly lower than the variance of the global distribution:

$$\mathrm{Var}\left(f(\vec{x}) \mid G(f(\vec{x}))_i \approx 1\right) \ll \mathrm{Var}\left(f(\vec{x})\right)$$

By ensuring that the input to the gating network is already disentangled by class, our approach forces each expert 𝐸ᵢ to learn a localized function specifically tuned to the nuances of its assigned class cluster. This reduces the interference between experts and facilitates higher classification accuracy, as evidenced by the empirical results on the classification datasets used in this study.

4 EXPERIMENTS

We use four benchmark image classification datasets to evaluate our proposed model: MNIST, FashionMNIST, CIFAR10, and CIFAR100. We ran each model five times and computed the average performance across those runs. We report both the average and the best classification performance for each of our models. For reproducibility, we used the following seeds for the random number generator: 1, 2, 3, 4, and 5. No hyperparameter tuning was done, as achieving state-of-the-art performance is beyond the scope of this study; we only intend to show that a feature extractor optimized with SNNL helps resolve the issue of expert collapse in a mixture-of-experts model. In addition, no other regularizers were used, in order to better demonstrate the benefits of SNNL for our feature extractor.

4.1 Evaluation Metrics

To thoroughly evaluate the impact of SNNL and address the limitations of purely accuracy-based metrics, we employed the following analytical framework:

(1) Expert Specialization Entropy (ENT). We calculate the Shannon entropy of the average routing distribution per class. Lower entropy indicates “hard routing,” while higher entropy indicates a more flexible routing distribution across the experts.

(2) Pairwise Embedding Similarity (SIM).
We extract the first-layer weight matrices of each expert network and compute their pairwise cosine similarity. Lower similarity indicates that the experts have learned orthogonal and divergent internal representations, thus escaping expert collapse.

(3) Statistical Analysis. To verify whether the improvements are robust, all models were trained across multiple random seeds. The differences in metrics were evaluated using the non-parametric Wilcoxon signed-rank test.

(4) Manifold Visualization. We project the high-dimensional feature representations in the feature extractor into 2D space using UMAP [18] to qualitatively assess cluster homogeneity and confirm the prevention of structural collapse.

4.2 Experimental Setup

To ensure our findings are applicable in resource-constrained research environments, all models were trained locally using PyTorch Lightning on an RTX 3060 GPU. Since the SNNL computation is only required during the training phase to shape the feature extractor's weights, our proposed architectural enhancement adds zero computational overhead during inference. We trained all our models for 15,000 steps with a mini-batch size of 100 using SGD with momentum (0.9) and weight decay (1e-4) [20] at a learning rate of 1e-1, and we used OneCycleLR to anneal the learning rate [23].

4.3 Classification Performance

Table 1 summarizes the performance across all three core metrics. The experimental models demonstrate significant behavioral shifts compared to the baselines, i.e. trading rigid routing for structural diversity, which particularly benefited the more complex datasets.
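For reference, the SNNL regularizer used during training (Eq. 6) can be sketched as follows. This is a minimal sketch assuming squared Euclidean distance for 𝑑, which Eq. 6 leaves generic:

```python
import torch

def snnl(x, y, T=1.0):
    """Soft nearest neighbor loss (Eq. 6) for a batch of representations x
    with integer labels y, using squared Euclidean distance as the metric d."""
    b = x.shape[0]
    d = torch.cdist(x, x).pow(2)                    # pairwise d_ij, (b, b)
    not_self = ~torch.eye(b, dtype=torch.bool)      # excludes j == i and k == i
    same_class = (y.unsqueeze(0) == y.unsqueeze(1)) & not_self
    e = torch.exp(-d / T)
    num = (e * same_class).sum(dim=1)               # numerator: j != i, y_j = y_i
    den = (e * not_self).sum(dim=1)                 # denominator: k != i
    return -torch.log(num / den + 1e-8).mean()      # small epsilon for stability

# Tight, well-separated class clusters give a near-zero loss; scattering the
# labels across those clusters drives the loss up.
z = torch.tensor([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
assert snnl(z, torch.tensor([0, 0, 1, 1])) < snnl(z, torch.tensor([0, 1, 0, 1]))
```

The composite objective of Eq. 7 would then add the α-weighted minimum of `snnl` over the feature extractor's hidden-layer activations to the MoE loss.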
Table 1: Classification Accuracy (ACC), Embedding Similarity (SIM), and Routing Entropy (ENT) across Baseline and Experimental MoE models. Values represent MEAN (SD). Asterisks (*) denote statistical significance at 𝑝 < 0.05 via the Wilcoxon signed-rank test; (ns) denotes not significant.

| Dataset | ACC Baseline (%) | ACC Experimental (%) | SIM Baseline (×100) | SIM Experimental (×100) | ENT Baseline (×100) | ENT Experimental (×100) |
|---|---|---|---|---|---|---|
| MNIST | 99.36 (0.03)* | 99.25 (0.08) | 0.03 (0.03) | 0.04 (0.06) | 0.58 (0.06) | 1.94 (0.14) |
| FMNIST | 91.33 (0.13) | 91.61 (0.23)* | 0.03 (0.07) | -0.01 (0.05) | 16.82 (10.88) | 16.00 (11.03) |
| CIFAR10 | 70.91 (1.09) | 71.23 (0.91) (ns) | 0.05 (0.11) | 0.03 (0.06) | 14.00 (15.28) | 15.01 (15.52) |
| CIFAR100 | 35.75 (1.64) | 36.74 (1.59)* | 0.20 (0.24) | 0.10 (0.15) | 43.92 (26.88) | 45.74 (26.60) |

The complex datasets (FashionMNIST and CIFAR100). The experimental model shows a statistically significant improvement over the baseline in classification accuracy. This confirms that for more difficult and highly entangled feature spaces, priming the latent space with SNNL genuinely aids downstream classification.

The simple dataset (MNIST). The baseline model significantly outperforms the experimental model. Since MNIST is an extremely simple and highly separable dataset (baseline accuracy approaches 99.4%), we suspect that forcing SNNL clustering introduces over-regularization or unnecessary constraints that marginally degrade performance.

4.4 Ablation and Specialization Analysis

Our specialization metrics reveal why the performance on complex datasets improved with our experimental approach. The quantitative data contradicts the assumption that SNNL merely enforces harder routing; instead, SNNL fundamentally prevents structural expert collapse.

Orthogonal Embeddings (Resolving Expert Collapse).
On FashionMNIST, CIFAR10, and CIFAR100, the SNNL model achieved noticeably lower embedding similarities. This is most pronounced on CIFAR100, where the baseline experts suffered from high similarity (0.20), indicating structural redundancy and collapse; SNNL halved this redundancy (0.10). On FashionMNIST, it pushed the similarity into the negative (-0.01).

As visualized in Figure 4, the baseline condition exhibits redundancy among experts. By feeding disentangled representations to the MoE, the experts are forced to learn highly distinct and non-overlapping weights, cutting the highest off-diagonal similarity by more than half.

Figure 4: Pairwise cosine similarity between expert weight matrices on CIFAR10. (Left) The baseline shows distinct patches of redundancy (e.g. between Expert 0 and 3). (Right) The SNNL condition visibly suppresses these redundancies, forcing experts to learn highly distinct and orthogonal features.

Routing Entropy (Flexible Collaboration). Our results show that, aside from FashionMNIST, the SNNL model actually increased routing entropy. The SNNL feature extractor makes the latent clusters so well-defined that the gating network does not need to hard-route input features. Instead, it employs a more distributed routing strategy (higher entropy) while relying on the structurally diverse, orthogonal experts to handle classification nuances.

Figure 5 visualizes our “flexible collaboration” narrative. On the highly complex CIFAR100 dataset, the SNNL condition elevates the entire interquartile range for accuracy, showing a robust performance boost while simultaneously raising the median entropy. On the intermediate CIFAR10 dataset, the SNNL boxplot reaches a higher maximum accuracy, but the heavily overlapping boxes contextualize why the overall gain was not statistically significant. Nevertheless, the underlying behavioral shift towards higher entropy remains intact across both complexity levels.
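The two specialization metrics driving this analysis (Sec. 4.1) can be sketched as follows. The names `routing` (an array of per-sample gate probabilities) and `weights` (the experts' first-layer weight matrices) are hypothetical inputs for illustration:

```python
import numpy as np

def routing_entropy(routing: np.ndarray) -> float:
    """Expert Specialization Entropy (ENT): Shannon entropy of the average
    routing distribution (rows: samples of one class; columns: experts)."""
    p = routing.mean(axis=0)
    p = p / p.sum()
    p = p[p > 0]                      # 0 * log(0) contributes nothing
    return float(-(p * np.log(p)).sum())

def pairwise_embedding_similarity(weights) -> float:
    """Pairwise Embedding Similarity (SIM): mean cosine similarity between
    the experts' flattened first-layer weight matrices (off-diagonal only)."""
    flat = np.stack([np.ravel(w) for w in weights]).astype(float)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T
    off = sim[~np.eye(len(flat), dtype=bool)]
    return float(off.mean())

# Hard one-hot routing gives zero entropy; uniform routing gives log(n).
hard = np.tile([1.0, 0.0, 0.0, 0.0], (10, 1))
uniform = np.full((10, 4), 0.25)
assert routing_entropy(hard) == 0.0
assert abs(routing_entropy(uniform) - np.log(4)) < 1e-12

# Mutually orthogonal expert weights give zero mean similarity.
ortho = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
assert abs(pairwise_embedding_similarity(ortho)) < 1e-12
```

Under this convention, lower ENT corresponds to the rigid “hard routing” of collapsed baselines, while SIM near or below zero indicates the orthogonal expert weights reported above.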
Manifold Visualization (Validating Disentanglement). The UMAP projections in Figure 6 qualitatively confirm the mechanism driving our quantitative metrics. The baseline feature space (left) exhibits highly entangled class boundaries, which explains why baseline MoE models suffer from higher embedding similarity: the gating network is forced to send overlapping distributions of data to multiple experts. Conversely, the SNNL feature space (right) shows that the intra-class variance has effectively collapsed, resulting in dense and distinct clusters. This demonstrates that SNNL effectively disentangles the feature representations, trivializing the gating network's routing task and enforcing strict structural specialization.

Figure 5: Distribution of test accuracy and routing entropy across random seeds. (Top/CIFAR100) SNNL provides a robust, highly stable boost to both accuracy and entropy, elevating the entire interquartile range. (Bottom/CIFAR10) While the median entropy slightly increases, the wider variance and overlapping accuracy boxes explain the non-significant accuracy gain on this intermediate dataset.

Figure 6: UMAP projection of the latent feature space immediately prior to the MoE gating network on CIFAR10. (Left) The baseline feature extractor produces highly entangled class boundaries. (Right) The SNNL-regularized feature extractor minimizes intra-class distance, producing dense and homogeneous class clusters.

4.5 SNNL Weight Ablation: Tuning Expert Diversity

To further understand the continuous effect of the SNNL weighting factor (𝛼) on expert orthogonality, we conducted an ablation study using a 25k-sample subset of the CIFAR10 training set. We evaluated 30 models with 𝛼 values linearly spaced from -100 to 100. As illustrated in Figure 7, the trend line confirms a negative correlation between the SNNL weight and the pairwise embedding similarity among the experts.
While MoE routing exhibits inherent variance across runs, progressively increasing 𝛼 actively pulls the expected similarity closer to 0. This demonstrates that the SNNL weight acts as a tunable control parameter, i.e. a stronger topological penalty directly forces the downstream experts to learn more distinct, non-overlapping representations.

Figure 7: The effect of the SNNL weight (𝛼) on the pairwise embedding similarity of the experts. The dashed regression line shows a negative correlation, indicating that stronger SNNL penalties actively suppress expert redundancy.

5 RECOMMENDATIONS

While our findings demonstrate the significant advantages of SNNL in resolving expert collapse, several avenues for future research remain to fully realize the potential of topology-aware MoE architectures:

(1) Dynamic Parameter Tuning. Our ablation study highlighted the sensitivity of expert orthogonality to the SNNL weighting factor (𝛼). Future work should explore dynamically adjusting both 𝛼 and the temperature parameter (𝑇) during training. Schedulers that gradually decay or warm up these values could better manage the transition from global representation clustering to fine-grained cross-entropy classification.

(2) Architectural Scaling. The current experiments were constrained to lightweight CNN variants to accommodate local computing limits. Future investigations should scale this methodology to Vision Transformers (ViTs) and higher-resolution, large-scale datasets (e.g. ImageNet) to verify whether the disentanglement benefits scale with model size and extreme class diversity.

(3) Comparative Representation Learning. Direct benchmarking against other modern contrastive learning techniques such as Supervised Contrastive Learning (SupCon) [14] or SimCLR [4] is necessary.
This will isolate whether SNNL's specific layer-wise distance minimization is uniquely suited for MoE gating, or whether any clustering objective that preserves class neighborhood structure yields similar structural diversity and routing flexibility.

6 CONCLUSION

We proposed and evaluated an architectural modification to the Mixture-of-Experts (MoE) model by integrating a feature extractor optimized via the Soft Nearest Neighbor Loss (SNNL). By expanding our evaluation to include specialization metrics like Pairwise Embedding Similarity and Expert Specialization Entropy, along with UMAP manifold visualization, we uncovered a nuanced mechanism of action that addresses the limitations of standard MoE architectures.

Our empirical findings demonstrate that baseline MoE architectures often suffer from structural expert collapse, where experts learn redundant representations, which results in rigid routing by the gating network. Disentangling the feature representations with SNNL resolves this structural collapse by forcing the experts to learn highly orthogonal and diverse features. Consequently, this liberates the gating network to utilize a more collaborative routing strategy. While disentanglement may tend to over-regularize models on simple datasets like MNIST, it yields statistically significant accuracy improvements on complex or highly entangled datasets such as FashionMNIST, CIFAR10, and CIFAR100, thus effectively bridging the gap between representation learning and ensemble specialization.

REFERENCES

[1] Agarap, Abien Fred, and Arnulfo P. Azcarraga. “Improving k-means clustering performance with disentangled internal representations.” 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020.
[2] Breiman, Leo. “Stacked regressions.” Machine Learning 24.1 (1996): 49-64.
[3] Chapelle, Olivier, Bernhard Schölkopf, and Alexander Zien, eds. Semi-Supervised Learning. MIT Press, 2006.
[4] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” International Conference on Machine Learning. PMLR, 2020.
[5] Chechik, Gal, et al. “Large scale online learning of image similarity through ranking.” Journal of Machine Learning Research 11.3 (2010).
[6] Cho, Kyunghyun, et al. “Learning phrase representations using RNN encoder-decoder for statistical machine translation.” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
[7] Frosst, Nicholas, Nicolas Papernot, and Geoffrey Hinton. “Analyzing and improving representations with the soft nearest neighbor loss.” International Conference on Machine Learning. PMLR, 2019.
[8] Fukushima, Kunihiko. “Neural network model for a mechanism of pattern recognition unaffected by shift in position-Neocognitron.” IEICE Technical Report A 62.10 (1979): 658-665.
[9] Glorot, Xavier, and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks.” Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2010.
[10] Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. “Speech recognition with deep recurrent neural networks.” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013.
[11] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[12] He, Kaiming, et al. “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.” Proceedings of the IEEE International Conference on Computer Vision. 2015.
[13] Jacobs, Robert A., et al. “Adaptive mixtures of local experts.” Neural Computation 3.1 (1991): 79-87.
[14] Khosla, Prannay, et al. “Supervised contrastive learning.” Advances in Neural Information Processing Systems 33 (2020): 18661-18673.
[15] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems 25 (2012).
[16] LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.
[17] Van der Maaten, Laurens, and Geoffrey Hinton. “Visualizing data using t-SNE.” Journal of Machine Learning Research 9.11 (2008).
[18] McInnes, Leland, et al. “UMAP: Uniform Manifold Approximation and Projection.” Journal of Open Source Software 3.29 (2018).
[19] Nair, Vinod, and Geoffrey E. Hinton. “Rectified linear units improve restricted Boltzmann machines.” ICML. 2010.
[20] Polyak, Boris T. “Some methods of speeding up the convergence of iteration methods.” USSR Computational Mathematics and Mathematical Physics 4.5 (1964): 1-17.
[21] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning Internal Representations by Error Propagation. California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[22] Salakhutdinov, Ruslan, and Geoffrey Hinton. “Learning a nonlinear embedding by preserving class neighbourhood structure.” Artificial Intelligence and Statistics. PMLR, 2007.
[23] Smith, Leslie N., and Nicholay Topin. “Super-convergence: Very fast training of residual networks using large learning rates.” arXiv preprint arXiv:1708.07120 (2017).
