Dynamic Mixture-of-Experts for Visual Autoregressive Model
Visual Autoregressive Models (VAR) offer efficient and high-quality image generation but suffer from computational redundancy due to repeated Transformer calls at increasing resolutions. We introduce a dynamic Mixture-of-Experts router integrated into VAR. The new architecture allows trading compute for quality through scale-aware thresholding, a strategy that balances expert selection against token complexity and resolution without requiring additional training. As a result, we achieve 20% fewer FLOPs and 11% faster inference while matching the image quality of the dense baseline.
💡 Research Summary
The paper tackles the computational inefficiency inherent in Visual Autoregressive Models (VAR), which generate images by progressively predicting token maps from coarse to fine resolutions. While VAR already reduces the number of sequential generation steps compared to pixel‑wise autoregressive models, each resolution still requires a full Transformer pass, causing FLOPs and latency to grow quadratically with image size. The authors propose a dynamic Mixture‑of‑Experts (MoE) augmentation that replaces the dense feed‑forward network (FFN) in each Transformer block with a set of lightweight experts and a regression‑based router that selects a subset of experts per token.
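To make the replacement concrete, here is a minimal sketch of an MoE feed-forward layer with relative-threshold routing. The dimensions, module names, and the linear router are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Sketch of a dense-FFN replacement: many small experts plus a router.

    A token is processed only by experts whose router score exceeds a
    relative threshold tau times the token's maximum score.
    """
    def __init__(self, dim=512, num_experts=32, expert_dim=128):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, expert_dim), nn.ReLU(),
                          nn.Linear(expert_dim, dim))
            for _ in range(num_experts))
        # Hypothetical router head; the paper trains it to regress
        # each expert's output norm (see below).
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x, tau=0.7):
        scores = self.router(x)                                   # (tokens, experts)
        keep = scores >= tau * scores.max(-1, keepdim=True).values
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = keep[:, i]
            if sel.any():
                out[sel] += expert(x[sel])                        # run only selected tokens
        return out
```

Raising `tau` at inference time activates fewer experts per token, which is what lets the method trade compute for quality without retraining.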
First, the dense FFN weights are sparsified using a Hoyer regularizer (ℓ₁‑ℓ₂ ratio penalty) together with ReLU activation, encouraging many weights to become near‑zero. The resulting weight matrix is clustered via k‑means into 32 balanced experts of dimension 128. This conversion yields a MoE layer where each expert specializes on a subset of token patterns.
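The Hoyer regularizer is just the ratio of the L1 norm to the L2 norm of a weight tensor; minimizing it drives many entries toward zero while preserving overall magnitude. A minimal sketch (the `alpha = 0.1` weight matches the paper; the layer-name filter is an assumption):

```python
import torch

def hoyer_penalty(weight: torch.Tensor) -> torch.Tensor:
    """Hoyer regularizer: ||W||_1 / ||W||_2.

    Lower values mean sparser weights; a one-hot vector scores 1,
    a uniform vector of n entries scores sqrt(n).
    """
    l1 = weight.abs().sum()
    l2 = weight.norm(p=2)
    return l1 / (l2 + 1e-8)

def sparsity_loss(model: torch.nn.Module, task_loss: torch.Tensor,
                  alpha: float = 0.1) -> torch.Tensor:
    # Hypothetical fine-tuning objective: task loss plus the weighted
    # Hoyer penalty over FFN linear layers ("ffn" naming is assumed).
    reg = sum(hoyer_penalty(m.weight)
              for name, m in model.named_modules()
              if isinstance(m, torch.nn.Linear) and "ffn" in name)
    return task_loss + alpha * reg
```

After this sparsity-aware fine-tuning, the near-zero weights make the k-means split into 32 balanced experts well-posed, since each expert inherits a mostly disjoint subset of active weights.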
Second, the router R is trained to predict the ℓ₂‑norm of each expert’s output for a given token, minimizing a mean‑squared error loss between the predicted norm and the true norm. At inference time, a relative threshold τ is applied: an expert i is activated for token x only if R_i(x) ≥ τ·max_j R_j(x). Crucially, τ is not fixed globally; instead, a scale‑aware schedule τ₁ < τ₂ < … < τ_S is used, where lower thresholds at coarse scales allow many experts to fire, while higher thresholds at fine scales enforce strong sparsity. This design reflects the observation that fine‑scale tokens are often already well‑determined and require less computation.
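The inference-time selection rule can be sketched in a few lines. The τ values per scale are those reported later in the paper; the score tensor is a hypothetical example:

```python
import torch

def select_experts(router_scores: torch.Tensor, tau: float) -> torch.Tensor:
    """Relative-threshold routing.

    router_scores: (num_tokens, num_experts), the router's predicted
    output norms R_i(x). Expert i fires for token x iff
    R_i(x) >= tau * max_j R_j(x). Returns a boolean activation mask.
    """
    max_scores = router_scores.max(dim=-1, keepdim=True).values
    return router_scores >= tau * max_scores

# Scale-aware schedule: coarse scales use a low tau (many experts fire),
# fine scales a high tau (strong sparsity). Values for scales 8-10
# follow the paper's reported settings.
tau_schedule = {8: 0.6, 9: 0.7, 10: 0.8}

scores = torch.tensor([[1.0, 0.55, 0.9, 0.2]])   # one token, four experts
mask = select_experts(scores, tau_schedule[10])  # tau = 0.8
# experts 0 and 2 fire (0.9 >= 0.8 * 1.0); experts 1 and 3 are skipped
```

Because the threshold is relative to each token's maximum score, at least one expert always fires, so no token is left unprocessed.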
The method is evaluated on ImageNet using the publicly released VAR‑d16 checkpoint. After fine‑tuning the dense model for two epochs with the Hoyer loss (α = 0.1), MoE layers are inserted only in the last three scales (scales 8‑10). With τ values of 0.6, 0.7, and 0.8 respectively, the model achieves a 19% reduction in measured FLOPs and an 11% decrease in wall‑clock inference time, while the Fréchet Inception Distance (FID) degrades only marginally (e.g., from 6.2 to 6.3). Qualitative samples are visually indistinguishable from those of the dense baseline.
A series of ablations clarifies the design choices. (1) Hoyer regularization mainly induces sparsity in early layers but contributes little to overall speed‑up; ReLU‑induced sparsity is more impactful. (2) Applying MoE to all scales inflates routing overhead and negates the speed gains; restricting MoE to the finest stages yields the best trade‑off. (3) A uniform τ across scales harms quality because early‑stage errors propagate; the scale‑aware τ schedule is essential. (4) Varying τ produces a smooth FID‑vs‑FLOP curve, enabling users to trade compute for quality at inference without retraining. (5) Increasing model depth to 20 layers amplifies the FLOP savings, indicating that deeper VAR models contain more redundant computation for dynamic routing to exploit.
The authors conclude that dynamic, scale‑aware expert selection effectively reduces VAR’s computational burden while preserving image quality. The approach requires no additional training beyond a brief sparsity‑aware fine‑tuning, and the inference‑time τ can be tuned on the fly to meet different latency budgets. Future work includes fine‑tuning only the last few scales for targeted sparsity and developing routers that adapt τ based on semantic content or class‑specific complexity. This work demonstrates a practical path toward making high‑fidelity autoregressive image synthesis more scalable and resource‑efficient.