Reducing Vision Transformer MLP Capacity Improves Performance
📝 Abstract
Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7% of the baseline parameters. Our GroupedMLP variant shares MLP weights between adjacent transformer blocks and achieves 81.47% top-1 accuracy while maintaining the baseline computational cost. Our ShallowMLP variant halves the MLP hidden dimension and reaches 81.25% top-1 accuracy with a 38% increase in inference throughput. Both models outperform the 86.6M-parameter baseline (81.05%) and exhibit substantially improved training stability, reducing peak-to-final accuracy degradation from 0.47% to 0.03–0.06%. These results suggest that, for ViT-B/16 on ImageNet-1K with a standard training recipe, the model operates in an overparameterized regime in which MLP capacity can be reduced without harming performance and can even slightly improve it. More broadly, our findings suggest that architectural constraints such as parameter sharing and reduced width may act as useful inductive biases, and highlight the importance of how parameters are allocated when designing Vision Transformers. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/parameter-efficient-vit-mlps.
📄 Content
Vision Transformers (ViTs) have benefited from scaling: from the 86M parameters of ViT-Base to the billions in contemporary foundation models, larger models often yield better performance on standard benchmarks (Dosovitskiy, 2020; Dehghani et al., 2023). Scaling laws such as the Chinchilla optimality curves further formalize this trend, relating parameter count, data, and compute to downstream accuracy (Hoffmann et al., 2022). These observations have encouraged the view that increasing model capacity is a primary route to improved performance.
Yet recent work challenges this view. Deep double descent shows non-monotonic test error with model size (Nakkiran et al., 2021), sparse subnetworks can match dense parents (Frankle & Carbin, 2018), and ALBERT achieved state-of-the-art NLP results with 70% fewer parameters through parameter sharing (Lan et al., 2019). These results suggest that raw parameter count is an imperfect proxy for effective capacity.
In this paper, we present empirical evidence that a related phenomenon arises in ViT-B/16 trained on ImageNet-1K. We study two simple parameter-reduction strategies applied to the MLP blocks. Our GroupedMLP variant shares MLP parameters between adjacent transformer blocks, while our ShallowMLP variant halves the MLP hidden dimension and preserves the initialization statistics of a wider MLP. Both models remove 32.7% of the baseline parameters (using 67.3% of the original count), yet slightly outperform the 86.6M-parameter ViT-B/16 baseline: GroupedMLP reaches 81.47% top-1 accuracy and ShallowMLP 81.25%, versus 81.05% for the baseline. Moreover, both reduced-parameter models exhibit substantially improved training stability, with final accuracies within 0.06% of their peak values, compared to a 0.47% peak-to-final degradation for the baseline.
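The 32.7% figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes standard ViT-B/16 dimensions (embedding dimension 768, MLP ratio 4, 12 blocks, 86.6M total parameters, biases included); it is our own consistency check, not code from the paper.

```python
# Assumed standard ViT-B/16 dimensions: d_model=768, mlp_ratio=4, depth=12.
D, H, L = 768, 4 * 768, 12

def mlp_params(d, h):
    """Parameters of one two-layer MLP: fc1 (d->h) + fc2 (h->d), with biases."""
    return (d * h + h) + (h * d + d)

total = 86.6e6                                # reported baseline parameter count
baseline_mlp = L * mlp_params(D, H)           # MLP parameters across all 12 blocks

# GroupedMLP: adjacent blocks share one MLP -> 6 unique MLPs instead of 12.
grouped_saved = (L // 2) * mlp_params(D, H)

# ShallowMLP: hidden width halved (3072 -> 1536) in every block.
shallow_saved = baseline_mlp - L * mlp_params(D, H // 2)

for name, saved in [("GroupedMLP", grouped_saved), ("ShallowMLP", shallow_saved)]:
    print(f"{name}: saves {saved / 1e6:.2f}M params "
          f"({100 * saved / total:.1f}% of the baseline)")
```

Both variants save roughly 28.3M parameters, i.e. 32.7% of the 86.6M baseline, matching the figure quoted above.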
The two architectures offer complementary practical trade-offs. GroupedMLP maintains the baseline computational cost (16.9 GFLOPs) while reducing memory footprint by sharing MLP weights across layers, making it attractive in memory-constrained settings. ShallowMLP reduces both parameters and compute (11.3 GFLOPs), yielding a 38% increase in inference throughput and making it preferable when latency or cost is critical. That these two mechanistically distinct strategies both improve accuracy and stability at matched parameter counts suggests that, under this training setup, ViT-B/16 operates in an overparameterized regime in which additional MLP capacity may hinder optimization rather than help it.
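The compute gap between the variants can also be sanity-checked. As an assumption on our part, we treat the reported "GFLOPs" as multiply-accumulates and use the standard ViT-B/16 token count at 224×224 input (14×14 patches plus one CLS token, 197 tokens):

```python
# Assumptions: the paper's GFLOPs count multiply-accumulates, and the
# sequence length is 197 tokens (14x14 patches + CLS at 224x224 input).
D, H, L, T = 768, 3072, 12, 197

mlp_macs = L * T * (D * H + H * D)     # MACs spent in all MLP blocks
baseline = 16.9e9                      # reported baseline cost
shallow = baseline - mlp_macs / 2      # halving hidden width halves MLP MACs

print(f"MLP share: {mlp_macs / 1e9:.1f}G; ShallowMLP total ~ {shallow / 1e9:.1f}G")
```

Under these assumptions the MLPs account for about 11.2G of the 16.9G baseline, and halving them lands near the reported 11.3 GFLOPs for ShallowMLP.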
Our contributions are: (1) demonstrating that, for ViT-B/16 on ImageNet-1K, two parameter-reduction schemes in the MLPs improve accuracy and training stability despite removing 32.7% of parameters, (2) comparing parameter sharing versus width reduction at matched parameter counts to clarify their trade-offs, and (3) connecting these findings to overparameterization in transformers and outlining open questions about architectural constraints in ViTs.
Parameter Sharing and Pruning. ALBERT (Lan et al., 2019) demonstrated that cross-layer parameter sharing can achieve state-of-the-art NLP results with 70% fewer parameters than BERT. Universal Transformers (Dehghani et al., 2018) share a single block recurrently, with recent analysis showing benefits primarily from gradient aggregation rather than depth (Lin et al., 2023). Unlike these global schemes, our GroupedMLP shares only MLP submodules between adjacent blocks, preserving attention independence. The lottery ticket hypothesis (Frankle & Carbin, 2018) showed that sparse subnetworks can match dense counterparts, and has been extended to ViTs in subsequent work (Chen et al., 2021; Yu et al., 2022). Structured pruning methods like S-ViTE (Chen et al., 2021) and VTP (Zhu et al., 2021) remove entire heads or channels using importance scores. Our ShallowMLP represents a simpler approach (uniform width reduction with initialization from a wider MLP) that requires no iterative pruning yet improves both accuracy and stability.
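The two mechanisms described above could be wired as in the following PyTorch sketch. This is our own hypothetical reading of the design (module names and the init-statistics transfer are assumptions; the paper's actual implementation is in the linked repository):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Standard transformer MLP: fc1 -> GELU -> fc2."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

def grouped_mlps(depth=12, dim=768, hidden=3072):
    """GroupedMLP-style sharing: one MLP instance per pair of adjacent
    blocks (6 unique MLPs for depth 12); attention stays per-block."""
    shared = [MLP(dim, hidden) for _ in range(depth // 2)]
    # Blocks 0 and 1 reuse shared[0], blocks 2 and 3 reuse shared[1], etc.
    return [shared[i // 2] for i in range(depth)]

def shallow_mlp(dim=768, hidden=3072):
    """ShallowMLP-style half-width MLP, re-initialized so its weight
    standard deviations match those of the full-width MLP it replaces
    (our guess at 'preserving the initialization statistics')."""
    wide, narrow = MLP(dim, hidden), MLP(dim, hidden // 2)
    with torch.no_grad():
        for w_wide, w_narrow in [(wide.fc1, narrow.fc1), (wide.fc2, narrow.fc2)]:
            w_narrow.weight.normal_(0.0, w_wide.weight.std().item())
    return narrow
```

Note that only fc2's statistics actually change under this transfer: fc1's fan-in (dim) is the same at either width, whereas fc2's fan-in halves, so default initialization would otherwise make its weights larger.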
ViT Efficiency and Redundancy. Architectural innovations include Swin’s hierarchical design (Liu et al., 2021), DeiT’s distillation (Rangwani et al., 2024), and dynamic token pruning (Rao et al., 2021). Studies have documented substantial redundancy in ViTs: Bhojanapalli et al. (2021) found high attention correlation between layers, while Liang et al. (2022) observed similar patch representations in deeper layers. We provide complementary evidence that, for ViT-B/16 on ImageNet-1K, removing 32.7% of MLP parameters can improve performance, consistent with the view that this standard configuration is overparameterized in its MLPs.
Overparameterization Dynamics. Deep double descent (Belkin et al., 2019; Nakkiran et al., 2021) revealed non-monotonic test error with capacity, with pruned networks sometimes outperforming dense parents. Chinchilla scaling laws (Hoffmann et al., 2022) suggest many models are compute-suboptimal. The fact that our 58.2M-parameter variants outperform the dense 86.6M-parameter baseline is consistent with this picture.