JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation


LLMs have traditionally scaled along dense dimensions, where performance gains are coupled with near-linear increases in computational cost. While MoE decouples capacity from compute, it introduces large memory overhead and hardware-efficiency challenges. To overcome these limitations, we propose token-indexed parameters as a novel, orthogonal scaling axis that decouples model capacity from FLOPs. Specifically, we introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on both dense and MoE backbones, spanning from 650M (190M + 460M embedding) to 61B (17B + 44B embedding) total parameters, demonstrate that our approach consistently reduces validation loss and significantly improves downstream task performance (e.g., +4.1 on MMLU, +8.3 on ARC, +8.9 on C-Eval). Rigorous isoFLOPs analysis further confirms that JTok-M fundamentally shifts the quality-compute Pareto frontier, achieving comparable model quality with 35% less compute relative to vanilla MoE architectures, and we validate that token-indexed parameters exhibit predictable power-law scaling behavior. Moreover, our efficient implementation ensures that the overhead introduced by JTok and JTok-M remains marginal.


💡 Research Summary

The paper introduces a novel scaling dimension for large language models (LLMs) that expands model capacity without increasing FLOPs, by attaching token‑indexed parameters to each Transformer layer. The authors propose two mechanisms: Joint‑Token (JTok) and its sparse extension, Mixture of Joint‑Token (JTok‑M). In JTok, each layer ℓ maintains a learnable embedding table Eℓ of size V × d (vocabulary size by hidden dimension). For a token x, the corresponding row Eℓ[x] is retrieved and used to modulate the layer's hidden states through a lightweight, element‑wise operation, so the added capacity incurs negligible FLOPs overhead.
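
To make the mechanism concrete, here is a minimal PyTorch sketch of one plausible reading of the per‑layer JTok lookup described above. The class name `JTokModulation`, the zero initialization, and the multiplicative gating form `hidden * (1 + mod)` are assumptions for illustration; the paper's exact element‑wise modulation operator is not specified in this summary.

```python
import torch
import torch.nn as nn

class JTokModulation(nn.Module):
    """Hypothetical per-layer token-indexed modulation (sketch, not the paper's exact op).

    Each Transformer layer owns a V x d table; for every position, the row of the
    current token id is looked up and applied element-wise to the hidden states,
    so capacity grows with V * d while the FLOPs cost is just a gather + multiply.
    """

    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, hidden_dim)  # E^l, token-indexed parameters
        nn.init.zeros_(self.table.weight)  # start as identity modulation (assumption)

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d); token_ids: (batch, seq) long tensor of input ids
        mod = self.table(token_ids)   # (batch, seq, d): pure lookup, no matmul
        return hidden * (1.0 + mod)   # element-wise self-modulation of the backbone
```

Inserted after each Transformer layer and fed the input token ids, a module like this adds V × d parameters per layer while the forward pass costs only a lookup plus an element‑wise multiply, consistent with the negligible-FLOPs claim in the abstract.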

