📝 Original Info

  • ArXiv ID: 2512.18452

📝 Abstract

Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper seeks to understand the MLP layers in dense LLMs by hypothesizing that these layers secretly perform an approximately sparse computation: namely, that they can be well approximated by sparsely-activating Mixture of Experts (MoE) layers. Our hypothesis is based on a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space. We empirically validate the hypothesis on pretrained LLMs, and demonstrate that the activation distribution matters: these results do not hold for Gaussian data, but rather rely crucially on structure in the distribution of neural network activations. Our results shed light on a general principle at play in MLP layers inside LLMs, and give an explanation for the effectiveness of modern MoE-based transformers. Additionally, our experimental explorations suggest new directions for more efficient MoE architecture design based on low-rank routers.

📄 Full Content

Despite being one of the earliest neural network modules, the Multilayer Perceptron (MLP) arguably remains one of the least understood parts of the transformer architecture. Unlike attention mechanisms, whose attention patterns can be visualized [VSP + 17], MLP layers resist straightforward inspection. How can we understand what MLP layers are actually doing inside a trained transformer?

A popular approach to study MLPs is to provide a mathematical analysis proceeding from simplifying assumptions. These assumptions could be stylized hyperparameters (e.g. [JGH18, LLL20, JGŞ + 21, AZLS19]), toy data distributions (e.g., [ABM23, BSA + 23, NDL24]), or simplified architectures (e.g., [SMG13, ACGH18, BHL18, RBPB24, ZDDF25]). While such theoretical analyses can yield valuable insights, their simplifications raise a critical question: how much do they reflect models trained in practice? And how much can they directly tell us about the function of the MLP layer inside a real-world, trained transformer? More broadly: Is it possible to understand what MLP layers do while avoiding starting from simplifying assumptions, such as on (1) the hyperparameter regime, (2) the data distribution, or (3) the architecture?

Our contributions In this paper, rather than starting from a toy simplification and analyzing it, we seek to understand MLP layers by starting from a scientific hypothesis on the type of function that they secretly compute inside of trained transformers. Then, we experimentally check the extent to which this hypothesis is valid.

  1. Hypothesis: MLPs in trained transformers secretly implement sparse mixtures of experts.

We hypothesize that MLP layers in trained transformers have a certain hidden structure: they can be well approximated by sparsely-activating Mixture of Experts (MoE) layers [JJNH91, ERS13], with a much smaller number of active parameters than the original MLP; see Figure 1.

This hypothesis is spurred by a novel mathematical insight showing a connection between sparse autoencoder structure in activation space and secret Mixture of Experts structure.

Figure 1: In this paper we hypothesize, and then experimentally validate, that an MLP layer in the middle of a pretrained transformer can be effectively described by a sparsely-activating mixture-of-experts layer (all parameters of the MLP are active on every input, whereas only a few experts of the MoE are active on any given input).

Figure 2: We distill the middle MLP layer of Pythia-410M to either a smaller MLP student model, or an MoE student model with fewer active parameters. On the left, we see that under the input distribution induced by the previous layers, MoE students can achieve the same distillation performance with fewer active parameters than MLPs. On the right, under a Gaussian input distribution with the same mean and covariance, MoE students yield no significant gain, showing that the data distribution is crucial for the secret MoE structure. See Section 4 and Appendix A for details and further experiments.

Namely, it has been observed that activations in language models are typically sparse in some dictionary [BTB + 23]. Starting from this observation, we prove theorems that suggest that the MLPs in trained networks should also be well approximated by sparsely-activating MoE layers.

We also prove that the dictionary-sparsity in the input distribution (together with the model architecture) is critical: in contrast, for Gaussian inputs, which do not have dictionary-sparse structure, we prove that MLPs should generally not be approximable by sparsely-activating MoEs.

  2. Empirical validation of hypothesis: MLP layers in trained models have secret MoE structure. Next, we empirically validate the hypothesis. By distilling from a dense MLP in a pretrained model to a sparsely-activating MoE, we show that layers in pretrained LLMs can be approximated well by sparse MoEs. We compare this to the “control” condition, where the input distribution is a Gaussian distribution with matched mean and covariance, in which case we demonstrate that MoEs are not a good approximation.

As a bonus, our distillation experiments yield a promising direction for efficient MoE architecture design: we find that using a low-rank router improves overall distillation performance while reducing computational cost.

The organization of the rest of this paper is as follows. Section 2 presents preliminaries, including the definitions of MLP and MoE architectures. Section 3 presents the mathematical analysis that spurs the secret mixture of experts hypothesis. Finally, Section 4 presents experimental validation of the hypothesis on several pretrained language models, by distilling MLP layers to more structured sparse MLP layers. We discuss the broader implications of our approach as a framework for conducting research in architecture design and studying pretrained models in Section 5. We discuss further related work in Section 6.

We consider inputs x ∈ R^d, drawn from some distribution D. The MLP module (also called the feedforward module) is the earliest designed neural network [Ros58, Ros62], and has the form

f_MLP(x) = W_2 σ(W_1 x), (2.1)

where W_1 ∈ R^{d_mlp×d} and W_2 ∈ R^{d×d_mlp} are weight matrices and σ is an elementwise nonlinearity. A Mixture of Experts (MoE) layer consists of m expert MLPs f_1, . . . , f_m : R^d → R^d, each of inner width d_exp, together with a gating function g : R^d → R^m, and computes

f_MoE(x) = Σ_{i=1}^m g_i(x) f_i(x). (2.2)

When the gating function has sparsity ∥g(x)∥_0 ≤ k for all inputs, the layer is an (m, k)-MoE, meaning that only k out of the m experts are active. Since only the active experts have to be evaluated, this results in significant computational efficiency in training and in inference.
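As a rough illustration of this efficiency, the sketch below counts active versus total parameters for an (m, k)-MoE whose experts are width-d_exp MLPs mapping R^d to R^d with a full-rank linear router; the exact accounting (whether biases are counted, and the fact that the router scores every expert and is therefore always "active") is an assumption for illustration only.

```python
def moe_param_counts(d, m, k, d_exp):
    """Rough parameter accounting for an (m, k)-MoE with width-d_exp expert MLPs.

    Each expert is assumed to compute W2 @ sigma(W1 x) with W1: (d_exp, d), W2: (d, d_exp).
    The linear router R: (m, d) must score every expert, so it is counted as always active.
    """
    per_expert = 2 * d * d_exp
    router = m * d
    total = m * per_expert + router
    active = k * per_expert + router
    return total, active

# Example: d = 1024, m = 4096 experts, k = 8 active, single-neuron experts.
total, active = moe_param_counts(d=1024, m=4096, k=8, d_exp=1)
print(f"total={total:,} active={active:,}")   # only a small fraction of expert parameters is active
```

Note that with many small experts the m × d router itself can dominate the active cost, which is part of the motivation for the low-rank routers considered in Section 4.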

The most popular gating function is “linear routing” with top-k activation, which is what we consider in this paper. For completeness, we define this gating function below. Linear routing has parameters β ≥ 0 and a matrix R ∈ R^{m×d}, and is given by¹

g(x; β, R) := softmax(β · topk(Rx)),

where topk keeps the k largest entries of its argument and sets the rest to -∞.

Definition 2.2. We say that g is a “hard” gating function if it places uniform weight on the active experts: namely, g(x) ∈ {0, 1/∥g(x)∥_0}^m for any input x. Notice that, when β = 0, linear routing is a hard gating function.
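For concreteness, the following is a minimal PyTorch sketch of linear routing with top-k activation (an illustration under simplifying assumptions: a single input token and no biases; it is not taken from the repository released with this paper).

```python
import torch

def linear_routing(x, R, beta, k):
    """Sketch of g(x; beta, R) = softmax(beta * topk(Rx)).

    Entries outside the top-k are set to -inf before the softmax, so they receive
    exactly zero weight; with beta = 0 the gate is "hard", placing weight 1/k on
    each of the k selected experts.
    """
    logits = x @ R.T                           # router scores, shape (m,)
    top_vals, top_idx = torch.topk(logits, k)
    masked = torch.full_like(logits, float("-inf"))
    masked[top_idx] = beta * top_vals          # beta * topk(Rx) on the selected entries
    return torch.softmax(masked, dim=-1)       # sparse gate over the m experts

# Example: d = 8, m = 4 experts, k = 2 active; beta = 0 gives hard gating.
x, R = torch.randn(8), torch.randn(4, 8)
print(linear_routing(x, R, beta=0.0, k=2))     # two entries equal to 0.5, the rest 0
```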

3 Hypothesized structure: secret mixtures of experts in your model

The main hypothesis of this paper is:

Hypothesis 3.1 (Secret MoE Hypothesis). In a pretrained transformer model, the dense MLP layers can be well-approximated by sparsely-activating MoE layers.

We emphasize that this hypothesis is not that a transformer architecture with MoE layers instead of MLP layers will have good performance. Indeed, it is known that transformers with MoE layers perform well [SMM + 17], and these are becoming an increasingly common architectural choice in frontier language models [LFX + 24, Met25, JSR + 24, Mos24, Qwe24, Sno24, AAA + 25].

Instead, our hypothesis is that, in dense transformer models with MLP layers, the MLP layers can be well-approximated by sparsely-activating MoE layers. If the hypothesis is true, then it helps explain why replacing these layers with MoE layers is a good choice to preserve performance with a smaller number of active parameters.

We formally define below what is meant by approximating a model under an input distribution D.

In the context of our hypothesis, the input distribution D is the distribution induced by pushing forward the input distribution to the network (such as a text distribution) through the layers preceding the MLP layer. In other words, if we are considering approximating the MLP at layer ℓ, then D is the distribution of internal activations that are inputted to the MLP at layer ℓ. We can restrict to considering regions U_S := {x : support(g(x)) = S} on which a given subset S of experts is active and whose probability is lower-bounded as in (3.1), since by a union bound argument the regions with smaller probability provide negligible contributions and can be ignored in our analysis.

Let us now argue that the mixture of experts model f_MoE cannot approximate the identity function f*(x) = x on any region U_S satisfying the probability lower bound (3.1). On inputs restricted to region U_S, the mixture of experts f_MoE is a sum of k MLPs of width d_exp. In other words, on this region the mixture of experts is given by an MLP of width kd_exp. Since an MLP depends on a subspace of the inputs of dimensionality at most equal to its width, there is a linear projection Π_S ∈ R^{kd_exp×d} and a function f_S : R^{kd_exp} → R^d such that f_MoE(x) = f_S(Π_S x) for all x ∈ U_S.

Intuitively, since kd_exp < d/2, the projection Π_S to a lower-dimensional space “loses information” about x. Therefore the mixture-of-experts should not be able to compute the identity function f*(x) on this region. The only way in which the identity function can be computed is if the region is degenerate, i.e., U_S mostly lies in a low-dimensional subspace of the input space. However, this would imply that the region has small probability mass, contradicting the probability mass lower bound condition (3.1).

We make this intuition precise by leveraging a technical lemma of [BR25], which was developed for a different purpose (studying the expressive power of granularity in mixture of experts models). The full proof is in Appendix B.

The result of the previous section indicates that for our Hypothesis 3.1 to be true, there must be extra structure in the input distribution, which allows the MLP's computations to be well approximated by a mixture of experts. Indeed, our hypothesis is motivated by a better understanding of the structure of the input distribution, beyond the crude assumption of Gaussianity. We proceed from the observation of [BTB + 23] that neural network activations are approximately sparse in some dictionary of vectors. Let us formalize this observation.

Definition 3.4 (Dictionary-sparse structure). A distribution D over vectors is (m, k)-dictionary-sparse if there is a dictionary of vectors v_1, . . . , v_m such that any vector x in the support of D can be written as a linear combination of at most k of the dictionary vectors.

For the purposes of our analysis, we additionally posit the property that the dictionary's vectors are approximately orthogonal to each other. This is a natural property to expect in high dimensions, since the normalized inner product of two random vectors in dimension d is of magnitude roughly 1/√d.

Additionally, approximate orthogonality has been argued to be key to how networks represent concepts in superposition [EHO + 22].

Now we show that Hypothesis 3.1 is reasonable when the input distribution satisfies the above properties. As long as the input distribution is dictionary-sparse with an approximately-orthogonal dictionary, we show that sparse MoEs can represent any linear function. Thus, they overcome the obstacle from the Gaussian input distribution, where they could not even represent the special case of the identity function f*(x) = x.

Theorem 3.6 (Linear functions are approximable by sparse MoEs under sparse-dictionary data). Suppose that the data distribution D is supported in the unit ball, and is an (m, k)-dictionary-sparse distribution for a (γ, k)-approximately-orthogonal dictionary. Then, for any linear function f*(x) = Ax with A ∈ R^{d×d}, there is a hard-gated (m, k)-MoE with single-neuron experts (d_exp = 1) that γ∥A∥_op-approximates f* over D.

Proof. Let the MoE have m experts, each corresponding to one of the elements of the dictionary. Let the ith expert compute f_i(x) = k·A v_i v_i^⊤ x, which is implementable with a single neuron and a linear activation function. Next, let the hard gating function be such that, on input x, the gating function activates exactly the (at most k) experts corresponding to the dictionary elements that appear in the sparse representation of x.

A few remarks to help interpret this theorem are in order.

Remark 3.7 (Making sense of the number of active parameters). The number of active parameters of the MoE should be contrasted to the number of active parameters that would be needed to perform the same approximation with an MLP. For simplicity, let us consider the identity function f * (x) = x as our linear function, and let us consider a dictionary e 1 , . . . , e d of the standard basis vectors and a distribution D which is uniform on {e 1 , . . . , e d }. In this setting, the above theorem guarantees that there is a mixture of d single-neuron experts, of which exactly 1 is active on any input, which computes f * perfectly. On the other hand, in order to obtain this with an MLP, the output of the MLP has to be able to span the full d-dimensional space, which means that it must have at least d neurons. Therefore, the theorem shows a factor of d decrease in the number of active expert parameters with a sparse MoE over a dense MLP.
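The construction in the proof, and the parameter savings discussed in Remark 3.7, can be checked numerically in the toy setting of an orthonormal dictionary. The NumPy sketch below (an illustration only; the standard-basis dictionary and k-sparse inputs follow the example of Remark 3.7) builds the single-neuron experts f_i(x) = k·A v_i v_i^⊤ x with a hard gate and verifies that the MoE reproduces f*(x) = Ax exactly on dictionary-sparse inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 16, 16, 3                      # dictionary = standard basis, k-sparse inputs
V = np.eye(d)                            # orthonormal dictionary v_1, ..., v_m (rows)
A = rng.standard_normal((d, d))          # target linear map f*(x) = A x

def moe(x):
    """Hard-gated (m, k)-MoE from the proof: expert i computes k * A v_i v_i^T x."""
    active = np.argsort(-np.abs(V @ x))[:k]              # indices of the k active features
    expert_outputs = [k * A @ V[i] * (V[i] @ x) for i in active]
    return sum(expert_outputs) / k                       # hard gate: weight 1/k per active expert

# A k-sparse input in the dictionary
x = np.zeros(d)
x[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
print(np.allclose(moe(x), A @ x))        # True: exact when the dictionary is orthonormal
```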

Remark 3.8 (On the approximation guarantee). In d dimensions, we expect the approximate orthogonality of our dictionary to be on the order of γ = O(1/√d), since this is the approximate magnitude of the inner product of two random vectors on the sphere. So, as the dimension increases, the error bound of the theorem should tend to 0.

Remark 3.9 (On the implementation of the gating scheme). Although we are not describing how to compute the gating function in the above theorem, in practice [BTB + 23, BLN24, GlTT + 24] have shown that the active dictionary indices i_1, . . . , i_k in a sparse autoencoder can be found with a linear projection followed by a topk operation. In these cases, the MoE gating scheme can be implemented with linear routing and β = 0.

In Appendix C we extend Theorem 3.6 to nonlinear functions. We state our result informally below.

Informal Theorem 3.10 (Nonlinear functions are approximable by sparse MoEs under dictionary data). Suppose that the distribution D is supported in the unit ball, and is an (m, k)-dictionary-sparse distribution for a (γ, k)-approximately-orthogonal dictionary. If f* : R^d → R^d is a homogeneous polynomial given by a tensor with “rank-r interactions between features of the dictionary”, then there is a hard-gated (m, k)-MoE with d_exp ≤ O_p(r) that γ∥A∥_op-approximates f*.

See Appendix C for the relevant definitions and a formal statement and proof of this extension.

We empirically test the Secret MoE Hypothesis (Hypothesis 3.1) on pretrained transformer LLMs. Our experiments report the loss of distilling a dense MLP at some layer ℓ inside of a pretrained language model to a sparsely-activating mixture of experts. We distill the middle MLP layers of three dense transformer architectures of increasing scale: Pythia-70M (layer ℓ = 3), Gemma-270M (layer ℓ = 9), and Pythia-410M (layer ℓ = 12).

Datasets over which we distill The datasets that we distill over are generated as follows. We start with a dataset D_text of tokenized texts {(z_1^(i), . . . , z_{T_i}^(i))}_i of varying lengths that are inputted to the LLM. We push each text (z_1^(i), . . . , z_{T_i}^(i)) forward through the first ℓ - 1 layers of the network to get a sequence of activations (x_1^(i), . . . , x_{T_i}^(i)) that are the inputs to the MLP at layer ℓ (see Figure 3). We form the dataset D_act by concatenating all activations generated in this way.

We run this procedure to generate a training dataset D_act,train and a testing dataset D_act,test from corresponding train and test splits of the text datasets. We use Wikitext-103 [MXBS16] in our experiments, filtering out texts with fewer than 20 tokens to select higher-quality text data. This yields a total of around 4M training samples and 200K test samples, a quantity that is manageable for distillation given our computational budget.
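A sketch of how such an activation dataset can be collected with forward pre-hooks is shown below. The Hugging Face module path (gpt_neox.layers[ℓ].mlp) is an assumption valid for GPT-NeoX-style models such as Pythia and may differ for other architectures or library versions, and the dataset handling is heavily simplified relative to the Wikitext-103 pipeline described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name, layer = "EleutherAI/pythia-410m", 12
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

activations = []
def grab_mlp_input(module, inputs):
    # inputs[0] is the hidden state fed to the MLP, shape (batch, seq, d)
    activations.append(inputs[0].detach().reshape(-1, inputs[0].shape[-1]))

mlp = model.gpt_neox.layers[layer].mlp          # module path: an assumption for Pythia
handle = mlp.register_forward_pre_hook(grab_mlp_input)

texts = ["The quick brown fox jumps over the lazy dog."]   # stand-in for Wikitext-103
with torch.no_grad():
    for t in texts:
        model(**tok(t, return_tensors="pt"))
handle.remove()

D_act = torch.cat(activations)   # one row per token position: inputs to the MLP at layer l
```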

We additionally generate datasets D_gauss,train and D_gauss,test of Gaussian data with the same mean and covariance as D_act,train and the same corresponding numbers of train samples and test samples.

Figure 4: The unexplained fraction of the variance in the outputs from distilling the middle MLP layer of Pythia-70M (first row) and Gemma-270M (second row). Results for Pythia-410M are in Figure 2. In the left column, we observe that over the activation dataset D_act, sparse MoE students are able to capture a significantly higher amount of the variance than corresponding MLP students with the same number of active neurons. In particular, for Pythia-410M and Gemma-3-270M there are cases in which the sparse MoE captures the same variance as the MLP using 8 times fewer active neurons. On the other hand, the distillation results in the right column demonstrate that MoE students have little advantage when the data distribution is instead Gaussian (with matched mean and covariance).
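Returning to the Gaussian control datasets D_gauss: given the activation matrix D_act collected above, matched-moments Gaussian data can be generated roughly as follows. This is a sketch only; the exact sampling procedure and any numerical stabilization used in our pipeline are not spelled out above, so the diagonal jitter term below is an assumption.

```python
import torch

def gaussian_control(D_act, n_samples):
    """Sample Gaussian data with the same mean and covariance as the activations."""
    mean = D_act.mean(dim=0)
    centered = D_act - mean
    cov = centered.T @ centered / (D_act.shape[0] - 1)
    cov = cov + 1e-6 * torch.eye(cov.shape[0], dtype=cov.dtype)   # jitter for positive definiteness
    dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
    return dist.sample((n_samples,))

D_gauss = gaussian_control(D_act.double(), n_samples=D_act.shape[0])
```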

We fit a student MoE model in a student-teacher setup, where the teacher is the pretrained model's MLP layer ℓ. We distill over both input datasets D_act,train and D_gauss,train and compare performance. For each student-teacher-dataset tuple we train with Adam for 100 epochs with mean-squared error loss and batch size 1024. We sweep the learning rate hyperparameter over the values 1e-3, 3e-4, and 1e-4, and choose the one with the best final test loss.
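The distillation itself is a standard student-teacher regression. The sketch below shows the kind of training loop described above (Adam, mean-squared error, batch size 1024, 100 epochs, cosine learning rate decay as noted in Appendix A); the scheduler granularity and other unstated details are assumptions, and the learning-rate sweep is left implicit.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def distill(student, teacher_mlp, D_train, epochs=100, lr=3e-4, batch_size=1024):
    """Fit the student to reproduce the teacher MLP's outputs under MSE loss."""
    teacher_mlp.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loader = DataLoader(TensorDataset(D_train), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for (x,) in loader:
            with torch.no_grad():
                target = teacher_mlp(x)          # teacher outputs are the regression targets
            loss = torch.nn.functional.mse_loss(student(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return student
```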

We achieve the best distillation performance by distilling to a student MoE with a shared expert. This is a less expressive variant of the MoE defined in (2.2), but it is used in practice in leading open-source LLMs since it is easier to optimize [LFX + 24, T + 24]. The MoE with shared expert has the form

f_MoE+shared(x) = f_shared(x) + Σ_{i=1}^m g_i(x) f_i(x), (4.1)

where f_shared is a dense MLP of width d_mlp that is active on every input.

In our experiments, our MoE has single-neuron experts (d_exp = 1), and we pick the inner dimension of the shared MLP to equal the total number of active experts k. Therefore, this architecture has d_mlp + k·d_exp = 2k active neurons, and it is strictly less expressive than a pure MoE architecture as in (2.2) with 2k active experts. Nevertheless, in our setting the shared expert yields improved performance, likely due to a better optimization landscape (see Appendix A.1 for an ablation experiment). Additionally, in order to reduce computational costs we reparametrize the linear routing matrix in the gating function as

R = R_1 R_2,

with trainable matrices R_1 ∈ R^{m×d_proj} and R_2 ∈ R^{d_proj×d}. Here, d_proj is a smaller inner projection dimension than the outer dimensions of the routing matrix. Surprisingly, we find that although this reparametrization makes the linear router strictly less expressive, it actually significantly improves the performance of our distillation procedure (see Appendix A.2 for an ablation experiment). Understanding why the Burer-Monteiro reparametrization R = R_1 R_2 makes linear routers easier to train is an interesting question for future study. We take d_proj = 128 for Pythia-70M and Gemma-270M, and d_proj = 256 for Pythia-410M.
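Putting these pieces together, the student architecture can be sketched as follows. This is an illustration consistent with the description above (single-neuron experts, a shared expert of width k, and a low-rank router R = R_1 R_2); the expert nonlinearity, initialization scale, learnable β, and the example sizes are assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class MoEStudent(nn.Module):
    """Sparse MoE student with a shared expert and a low-rank linear router."""
    def __init__(self, d, m, k, d_proj):
        super().__init__()
        self.k = k
        # Shared expert: a dense MLP of width k that is always active.
        self.shared = nn.Sequential(nn.Linear(d, k), nn.ReLU(), nn.Linear(k, d))
        # m single-neuron experts (d_exp = 1), stored as two weight banks.
        self.W_in = nn.Parameter(torch.randn(m, d) / d ** 0.5)
        self.W_out = nn.Parameter(torch.randn(m, d) / d ** 0.5)
        # Low-rank router R = R1 @ R2 with inner dimension d_proj.
        self.R1 = nn.Parameter(torch.randn(m, d_proj) / d_proj ** 0.5)
        self.R2 = nn.Parameter(torch.randn(d_proj, d) / d ** 0.5)
        self.beta = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):                                # x: (batch, d)
        logits = (x @ self.R2.T) @ self.R1.T             # (batch, m) router scores via R1 R2
        top_vals, top_idx = torch.topk(logits, self.k, dim=-1)
        gates = torch.softmax(self.beta * top_vals, dim=-1)           # (batch, k), zero elsewhere
        # Evaluate only the k selected single-neuron experts per input.
        pre = torch.einsum("bd,bkd->bk", x, self.W_in[top_idx])       # neuron pre-activations
        act = torch.relu(pre) * gates                                  # gate the active experts
        expert_out = torch.einsum("bk,bkd->bd", act, self.W_out[top_idx])
        return self.shared(x) + expert_out

student = MoEStudent(d=1024, m=100_000, k=256, d_proj=256)   # example sizes only
```

With this parametrization the per-token router cost is roughly (m + d)·d_proj multiply-accumulates instead of m·d, which is where the computational savings mentioned above come from.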

Distillation results validate the secret sparse computation hypothesis In Figure 4, we report the fraction of variance explained by distilling the middle MLP layer of Pythia-70M and Gemma-270M to sparse MoE and dense MLP student models of varying sizes. Pythia-410M results are shown in Figure 2. The experiment shows that over the activation dataset D act , the large language model’s MLP layer can be significantly better approximated by sparse students than by dense students at a given number of active parameters. On the other hand, for the Gaussian dataset D gauss with the same mean and covariance, sparse students yield no significant gain.
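For reference, the “unexplained fraction of variance” reported in these figures can be computed as the distillation mean-squared error normalized by the variance of the teacher's outputs; the precise normalization in the sketch below (summed over coordinates, centered per coordinate) is an assumption.

```python
import torch

def fraction_variance_unexplained(student, teacher_mlp, D_test):
    """FVU = sum ||f_teacher(x) - f_student(x)||^2 / sum ||f_teacher(x) - mean||^2."""
    with torch.no_grad():
        y = teacher_mlp(D_test)       # for large test sets, evaluate in batches
        y_hat = student(D_test)
    resid = ((y - y_hat) ** 2).sum()
    total = ((y - y.mean(dim=0)) ** 2).sum()
    return (resid / total).item()
```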

We discuss two high-level takeaways of our paper for deep learning research. First, distillation is a general lens for studying trained networks; see Section 5.1. Second, distillation can be used as a test-bed for fast and cheap experimentation with architecture design; see Section 5.2.

A classical approach to deep learning theory proceeds by analyzing highly simplified settings, imposing strong assumptions on the data, architecture, or optimization dynamics, and deriving formal consequences. While this strategy has yielded valuable insights, it often abstracts away many of the properties that make modern neural networks empirically successful. In this paper, we explore an alternative path to theory development that more closely mirrors how the natural sciences build explanatory frameworks.

In fields such as physics, theories are evaluated by how well they account for the behavior of real systems, rather than by how tractable they are under idealized assumptions. Analogously, we advocate a “model your model” approach to deep learning theory: instead of starting from simplified constructions, we treat trained neural networks themselves as the object of study. The goal is to formulate and test hypotheses about the structure these models have actually acquired through training.

Concretely, the paradigm consists of two steps. First, one hypothesizes that a pretrained model exhibits a particular form of structure, in this case secret sparse MoE structure. Second, the model is distilled into a restricted class of models that explicitly encode this hypothesized structure, allowing one to test whether the structure is sufficient to reproduce the original model's behavior. In this way, distillation becomes a tool not only for compression, but also for probing and validating theoretical claims about learned representations. A restricted version of this paradigm was previously advocated in [Boi24], which focused on a synthetic data setting where models secretly encoded small decision trees, and formulated extracting those trees as a distillation problem.

More broadly, this perspective allows for analyzing real, implemented neural networks whose behavior reflects both algorithmic and hardware considerations. As such, they may resist clean analysis in idealized mathematical settings. By grounding theory in the empirical properties of trained models themselves, the distillation-based paradigm offers a complementary route toward understanding deep learning systems as they are, rather than as simplified abstractions.

Several major labs have replaced MLP layers with mixture-of-experts (MoE) layers in frontier transformer architectures to improve scalability [LFX + 24, Met25, JSR + 24, Mos24, Qwe24, Sno24, AAA + 25]. This trend raises a natural question: why do MoE layers work so well as replacements for standard feedforward layers?

Our results suggest a simple explanation: standard MLPs already exhibit a latent MoE-like structure, making explicit MoEs a natural architectural refinement rather than a completely new component. Crucially, we show that this hypothesis can be tested in a resource-constrained academic setting. Rather than pretraining full models from scratch, our approach distills a single layer of a pretrained network, enabling targeted architectural comparisons at low computational cost. Using this distillation framework, we also find that adding a shared expert to the MoE improves performance, consistent with emerging best practices [DDZ + 24]. Taken together, these results demonstrate that distillation provides a practical and efficient tool for rapid architectural experimentation, enabling systematic comparison of candidate designs with minimal compute.

Our experimentation in the distillation setting also allows us to prescribe a new architecture to try: MoEs with low-rank routers. When there are many experts, we find that MoEs with low-rank routers are easier to train and yield improved performance over MoEs with full-rank routers (see Section 4 and Appendix A.2). The many-expert regime is increasingly important, as open-source frontier models scale the number of experts because of the performance benefits of granularity [KLA + 24, BR25] and the computational savings of sparsity [SMM + 17]. Further testing this proposed architecture is a promising direction.

This paper also highlights a structural connection between SAEs and Mixture-of-Experts (MoE) models, suggesting a bidirectional opportunity: advances in SAE design may inform better MoE router architectures, and conversely, insights from MoE routing may guide the development of more expressive or efficient SAEs. Beyond standard SAEs, multi-level variants that learn nested dictionaries [BNKN25] may be an interesting avenue to consider, as they naturally suggest MoE architectures with hierarchically organized experts operating at different levels of precision.

Mechanistic interpretability and network subcircuits A growing body of work in mechanistic interpretability seeks to identify functional subcircuits within neural networks: collections of parameters or activations that are selectively engaged during computation [MRM + 24, CMPL + 23], such that the rest of the parameters can be ablated with minimal loss in performance on a task. In this work, we effectively find subcircuits at the level of individual MLP layers, showing that these can be well approximated by sparse models. Thus, our results show sparse structure at a finer level than mechanistic interpretability methods that identify subcircuits at the level of heads or between layers.

Another related work is the lottery ticket hypothesis [FC18], which posits that dense neural networks contain much smaller subnetworks capable of achieving comparable performance. Their method to find those subnetworks yields an equivalent-performance model with MLP layers that have sparse weight matrices. The models from the lottery ticket hypothesis are in a different regime from the MoE models considered in this paper, since their sparsity pattern does not depend on the input, and therefore the active and total parameter counts are the same. Additionally, their sparsity pattern is not structured as in MoEs.

Another related approach to finding subcircuits, Automatic Parameter Decomposition [BBH + 25], explicitly finds dictionaries of parameters such that a sparse subset is active in a forward pass. In our work, we do not require the student MoE to be based on dictionary decomposition of the parameters of the initial model, so we can scale beyond toy scales to much larger student models by simply training the student MoE model.

Code for the experiments can be found in this repository: https://github.com/eboix/secret_moe.

We now report results from several ablations.

We follow the practice of the frontier open-weight architectures [DDZ + 24, T + 24], which have a shared expert that is always active in their MoE architectures; see the architecture in (4.1). Here, we validate through an ablation study that this improves performance compared to a pure mixture of experts without a shared expert (2.2).

In Figure 6, we compare student MoEs with full-rank routing, where the routing matrix R is directly trained, versus low-rank routing, where we parametrize it as R = R_1 R_2. We find that for Gaussian data there is no significant difference between the two procedures, as expected from our theory. On the other hand, for the true activation data distribution the low-rank routing generally yields much better results, even though it is strictly less expressive. Since we did not tune the rank of the routing, this indicates that the fraction of variance explained by our MoE distillation may possibly be improved further.

In Figures 7 through 12, we plot the test loss curves during training for several distillations of models to student MLP and MoE models. These plots show that generally the distillation to MLP models has a loss that stabilizes quickly and is fairly independent of the learning rate. On the other hand, the distillation to MoE models has higher variability with the learning rate, but still converges. Part of the apparent convergence is due to cosine learning rate decay, but the loss curves generally seem to stabilize early on, especially for the MLP student distillations.

[Figure: Unexplained fraction of variance when distilling the middle MLP layer of Pythia-410M over Gaussian data D_gauss with matching mean and covariance, comparing an MoE with full-rank routing (100000 experts) to an MoE with low-rank routing (100000 experts).]

Figure 6: The unexplained fraction of the variance in the outputs from distilling the middle MLP layer of Pythia-70M (first row), Gemma-270M (second row), and Pythia-410M (third row) to either an MoE with full-rank routing matrix R ∈ R m×d , or an MoE with low-rank routing matrix R = R 1 R 2 , where R 1 and R 2 are trained. As in the main text, for low-rank routing we choose the inner dimension 128 for Pythia-70M and Gemma-3-270M, and 256 for Pythia-410M. Notice that distilling to a low-rank routing MoE is generally an improvement over distilling to a full-rank routing MoE, even though the former is less expressive.

In this section, we prove Theorem 3.3, which is restated below for convenience.

The definitions and technical claim that we will use from [BR25] are as follows.

Definition B.2. For a distribution µ and a measurable set U, let µ|_U be the probability measure obtained by restricting µ to U, i.e., µ|_U(A) = µ(A ∩ U)/µ(U) for all measurable sets A. Given a distribution µ and a measurable set U of nonzero measure, additionally define Σ_U = cov(X, X) for X ∼ µ|_U to be the conditional covariance.

The technical claim that we use is an error bound for approximating linear functions by a nonlinear function that depends only on a proper subspace of the input.

Proposition B.3 (from [BR25]). There are universal constants C, c > 0 such that the following is true for µ = N(0, I_d/d). Let U ⊆ R^d be a measurable set, let Π ∈ R^{d×d} be a projection matrix onto a subspace of dimension p ≤ 99d/100, let h : R^d → R^d be a measurable function, and let A ∈ R^{d×d} be a linear transformation. Then,

where Π^⊥ ∈ R^{d×d} is the projection onto the orthogonal complement of Π.

We may now proceed to prove the theorem. Let f_MoE be the (m, k)-MoE that approximates the identity function on the distribution µ = N(0, I_d/d). For any S ⊆ [m], let U_S = {x ∈ R^d : support(g(x)) = S} be the region on which the subset of experts S is active. Define the set S of expert subsets whose regions have lower-bounded probability to be

For any S ⊆ [m] with |S| ≤ k, let f_S : R^{kd_exp} → R^d be a function and let Π_S ∈ R^{kd_exp×d} be a projection such that f_MoE(x) = f_S(Π_S x) for all x ∈ U_S.

By Proposition B.3, there are universal constants c, C > 0 such that, for any S ∈ S we have

We now strengthen the theoretical evidence for Hypothesis 3.1 by showing that the approximability result holds beyond linear functions, to a class of nonlinear functions that have the property that interactions between the dictionary features are of low complexity. Our result here naturally generalizes Theorem 3.6 to nonlinear functions.

For simplicity, we consider target functions that are homogeneous degree-p polynomials (this result can be generalized to non-homogeneous polynomials, but with more notational heaviness), i.e., functions of the form

f(x) = A[x, . . . , x] ∈ R^d (C.1)

for some tensor A ∈ (R^d)^⊗(p+1), where A is contracted with p copies of x.

When p = 1, notice that we recover linear functions f(x) = Ax for a matrix A ∈ R^{d×d}. We define an operator norm, a tensor-vector product, and a rank for these tensors, generalizing the matrix case. With these definitions in hand, we may define what it means for a tensor A to have “low-rank” interactions between the features in the dictionary.

Definition C.5 (Low-rank feature interactions). Given a dictionary v_1, . . . , v_m ∈ R^d and a homogeneous polynomial f(x) of the form (C.1) for a tensor A ∈ (R^d)^⊗(p+1), we say that the interactions between dictionary features are rank ≤ r if rank(Av_i) ≤ r for all i ∈ [m].

Notice that our notions of operator norm, tensor-vector product and rank agree with the corresponding standard definitions of operator norm, matrix-vector product and rank for matrices when p = 1. Thus, our results in this section will naturally generalize the case of linear functions developed in Theorem 3.6. Indeed, one can observe that linear functions are a special case of functions with rank-1 feature interactions.

Our generalization of Theorem 3.6 to nonlinear functions is as follows.

Theorem C.6 (Low-rank feature interaction functions are approximable by sparse MoEs). Suppose that the distribution D is supported in the unit ball, and is an (m, k)-dictionary-sparse distribution for a (γ, k)-approximately-orthogonal dictionary. Suppose also that f* : R^d → R^d is a homogeneous polynomial given by a tensor A ∈ (R^d)^⊗(p+1) according to (C.1) and has rank-r interactions between dictionary features.

Then, there is a hard-gated (m, k)-MoE with d_exp ≤ O_p(r) that γ∥A∥_op-approximates f*.

Proof outline. The proof proceeds by having the m experts of the MoE correspond to the features in the dictionary. When a certain feature is active, the corresponding expert is loaded and captures the interaction of that feature with the rest of the input. The interaction rank condition ensures that this interaction can be represented with a width-O_p(r) expert that has activation function σ(t) = t^p. The full proof is in Appendix C.3.


Figure 12: Test variance unexplained by iteration when training MoE students on Pythia-410M with Adam for 100 epochs. The loss curves converge (although this is due in part to cosine learning rate decay).

¹ Out of convenience, we take the convention that 0 × (-∞) = -∞ so that the gating is continuous in β.

In practice, the activation distribution is not perfectly sparse in the dictionary, but only approximately. Our results below can be readily adapted with an extra additive error term to account for this approximation, but we omit this to keep the notation and discussion simple.
