Area-Efficient In-Memory Computing for Mixture-of-Experts via Multiplexing and Caching

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Mixture-of-Experts (MoE) layers activate only a subset of model weights, dubbed experts, to improve model performance. MoE is particularly promising for deployment on processing-in-memory (PIM) architectures, because PIM can naturally host experts in separate memory arrays and offers great energy-efficiency benefits. However, PIM chips often suffer from large area overhead, especially in the peripheral circuits. In this paper, we propose an area-efficient in-memory computing architecture for MoE transformers. First, to reduce area, we propose a crossbar-level multiplexing strategy that exploits MoE sparsity: experts are deployed on crossbars, and multiple crossbars share the same peripheral circuits. Second, we propose expert grouping and group-wise scheduling methods to alleviate the load imbalance and contention overhead caused by sharing. In addition, to address the problem that the expert-choice router requires access to all hidden states during generation, we propose a gate-output (GO) cache to store the necessary results and bypass expensive additional computation. Experiments show that our approaches improve the area efficiency of the MoE part by up to 2.2× compared to a SOTA architecture. During generation, the cache improves performance and energy efficiency by 4.2× and 10.1×, respectively, compared to the baseline when generating 8 tokens. The total performance density reaches 15.6 GOPS/W/mm². The code is open source at https://github.com/superstarghy/MoEwithPIM.


💡 Research Summary

Mixture‑of‑Experts (MoE) layers consist of many independent “experts” and a gating network that routes each input token to a subset of those experts. Because only a few experts are activated per token, MoE can increase model capacity while keeping the overall compute cost modest. This sparsity makes MoE a natural fit for Process‑in‑Memory (PIM) accelerators, which perform matrix‑vector multiplications directly inside memory crossbars, thereby eliminating costly weight transfers.

The main obstacle for PIM‑based MoE is the large area occupied by peripheral circuits (ADCs, DACs, buffers). Prior PIM designs allocate a full set of peripherals to every crossbar, and these peripherals can consume more than 60 % of the chip area. The authors address this problem with three complementary techniques.

  1. Crossbar‑level multiplexing – Experts are placed on crossbars, but several crossbars share the same peripheral block. Since MoE is sparse, only a limited number of crossbars are active at any moment, so contention on the shared peripherals is rare. This sharing dramatically reduces the peripheral footprint.

  2. Load‑aware expert grouping – To avoid structural contention when multiple crossbars compete for the shared peripherals, experts are grouped before deployment. The authors profile a small data sample, rank experts by their average token load, and then pair high‑load and low‑load experts into the same group (load‑sorted grouping). This statistical balancing makes the total workload of each group similar, mitigating hotspots without imposing hard capacity limits that could hurt model quality.

  3. Dynamic scheduling for the pre‑fill stage – During the pre‑fill phase of a large language model, token‑expert choices are not known ahead of time, so static grouping alone cannot guarantee balanced execution. The authors propose a two‑step scheduler: (a) dispatch multiple tokens to different groups simultaneously to create a compact schedule, and (b) insert idle slots where data reuse opportunities exist, thereby reducing redundant data transfers. The algorithm runs in linear time with respect to token length and can be pipelined in hardware, so its latency is effectively hidden.
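The load-sorted grouping in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name, greedy heaviest-with-lightest pairing, and the example loads are all assumptions for demonstration.

```python
def group_experts(loads, group_size=2):
    """Pair high-load and low-load experts so each group's total load is similar.

    loads: dict mapping expert id -> average token load from profiling.
    Returns a list of groups (tuples of expert ids) sharing one peripheral block.
    """
    # Rank experts by profiled load, heaviest first.
    ranked = sorted(loads, key=loads.get, reverse=True)
    groups = []
    while ranked:
        group = [ranked.pop(0)]              # heaviest remaining expert
        for _ in range(group_size - 1):
            if ranked:
                group.append(ranked.pop())   # lightest remaining expert
        groups.append(tuple(group))
    return groups

# Hypothetical profiled loads for six experts.
loads = {0: 9.0, 1: 1.0, 2: 7.0, 3: 3.0, 4: 6.0, 5: 4.0}
print(group_experts(loads, group_size=2))  # → [(0, 1), (2, 3), (4, 5)]
```

Each resulting pair sums to roughly the same load (10, 10, and 10 here), which is the statistical balancing the summary describes: hot experts are never grouped together, so contention on the shared peripherals stays rare.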

While the above techniques improve area and throughput, autoregressive generation with expert‑choice routing still suffers because the gate must see all previously generated hidden states to decide which experts will process the next token. Existing work sidesteps this by switching to token‑choice routing at inference time, but that creates a train‑inference mismatch.

To preserve expert‑choice routing and still achieve fast generation, the paper introduces a Gate‑Output (GO) cache. For each expert, the cache stores the gate scores and the expert’s output for the top‑k tokens it selected during the pre‑fill stage. The cache resides in off‑chip DRAM alongside the conventional KV cache. During generation, only the newly generated token is fed to the gate; the cached scores are used to update the top‑k list, and the cached expert outputs are fetched directly, eliminating the need to recompute the gate for all previous tokens. The cache size is fixed at k × (number of experts) × model dimension and does not grow with sequence length, because only the top‑k results per expert are stored. If a token’s score enters the top‑k, the cache entry is replaced; otherwise the cache remains unchanged.
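The fixed-size, top-k update behavior of the GO cache can be sketched as follows. This is a simplified model under stated assumptions, not the paper's data structure: the class and method names are invented, and the eviction policy (replace the weakest cached score) is inferred from the description above.

```python
import numpy as np

class GOCache:
    """Per-expert top-k cache of gate scores and expert outputs.

    Storage is fixed at num_experts * k * dim entries and does not grow
    with sequence length, matching the size bound described in the text.
    """
    def __init__(self, num_experts, k, dim):
        self.scores = np.full((num_experts, k), -np.inf)  # cached gate scores
        self.outputs = np.zeros((num_experts, k, dim))    # cached expert outputs

    def update(self, expert, score, output):
        """Insert the new token's result if it enters this expert's top-k."""
        slot = int(np.argmin(self.scores[expert]))        # weakest cached entry
        if score > self.scores[expert, slot]:
            self.scores[expert, slot] = score             # replace it
            self.outputs[expert, slot] = output
        # Otherwise the cache is unchanged; no recomputation over past
        # hidden states is ever needed during generation.
```

During generation, only the new token's gate score is computed and offered to `update`; cached outputs for tokens that remain in the top-k are fetched directly from DRAM alongside the KV cache.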

Experimental methodology – The authors evaluate a MoE variant of Llama‑2‑7B (Llama‑MoE‑4/16) with 16 experts, top‑k = 4, across 32 transformer blocks. They use an operator‑accurate simulator built on the 3DCIM PIM architecture, modeling a HERMES‑style crossbar (256 × 256, 8‑bit I/O, 130 ns latency, 0.096 nW power, 0.635 mm² area). One MoE layer requires 1536 crossbars; the baseline (3DCIM without sharing) assigns a dedicated peripheral set to each crossbar and processes tokens strictly sequentially. The authors test group sizes of 2 and 4, with uniform (U) and load‑sorted (S) grouping, and with compact (C) and optimized (O) scheduling.

Results

  • Area efficiency: Sharing peripherals and load‑aware grouping reduce the MoE linear core area by up to 2.2× compared with the baseline.
  • Generation latency and energy: The GO cache, combined with the KV cache, cuts linear‑layer latency by 4.2× and energy consumption by 10.1× when generating 8 tokens. The KV cache alone helps attention latency but does not save energy because DRAM transfers dominate.
  • Overall performance density: The full system achieves 15.6 GOPS/W/mm², a notable figure for a PIM‑accelerated MoE transformer.

Key insights

  • MoE sparsity can be exploited at the hardware level by multiplexing peripheral resources across crossbars, turning a traditional area bottleneck into a negligible overhead.
  • Load‑aware grouping, derived from a lightweight profiling step, balances the demand on shared peripherals without sacrificing model quality, unlike hard capacity constraints used in prior work.
  • Dynamic scheduling that inserts idle cycles to exploit data reuse further reduces off‑chip traffic during the pre‑fill stage, where token‑expert decisions are still being computed.
  • The GO cache demonstrates that caching gate outputs, rather than only activations, is sufficient to retain the benefits of expert‑choice routing while enabling fast autoregressive inference. The cache’s fixed size and simple update rule make it practical for hardware implementation.

Conclusion – The paper presents a cohesive hardware‑software co‑design for MoE transformers on PIM platforms. By sharing peripheral circuits, intelligently grouping experts, dynamically scheduling token streams, and introducing a gate‑output cache, the authors achieve substantial gains in area, latency, and energy while preserving the original expert‑choice routing semantics. The open‑source implementation (GitHub link provided) offers a valuable baseline for future research on sparse neural networks and memory‑centric accelerators.

