A.X K1 Technical Report
We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.
💡 Research Summary
The paper presents A.X K1, a 519‑billion‑parameter Mixture‑of‑Experts (MoE) language model designed to balance reasoning capability with inference efficiency, especially for Korean‑language applications. Leveraging recent scaling laws for MoE architectures (Tian et al., 2025) and vocabulary size (Tao et al., 2024), the authors allocate 519B total parameters but activate only 33B per token during inference, achieving high knowledge density within a fixed compute budget of approximately 2.55 × 10²⁴ FLOPs. Training ran on 1,024 NVIDIA H200 GPUs over 75 days, processing roughly 10 trillion tokens drawn from a meticulously curated multilingual corpus of web text, PDFs, code, and synthetic data.
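The compute figures above can be sanity-checked with the common C ≈ 6·N·D training-FLOPs approximation, using the *active* parameter count for an MoE model. This is a back-of-the-envelope sketch, not the authors' accounting, which may include attention and MLA overhead that pushes the total toward the reported 2.55 × 10²⁴:

```python
# Rough lower-bound estimate of training compute via C ≈ 6 * N_active * D.
# This standard approximation is an assumption, not the report's method.
N_active = 33e9   # active parameters per token (33B)
D = 10e12         # training tokens (~10T)

flops_estimate = 6 * N_active * D
print(f"{flops_estimate:.2e}")  # prints 1.98e+24, same order as the reported 2.55e24
```

The gap between this estimate and the reported budget is plausible once attention FLOPs and any extra training passes are counted.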
Key architectural choices include a 61‑layer model with Multi‑head Latent Attention (MLA), 64 attention heads, and a shared dense expert. Expert capacity is set to a granularity of G = 7 (d_model = 7,168, d_expert = 2,048), slightly below the theoretically optimal range, to improve training stability under imperfect load balancing. The model employs a dual‑normalization scheme, applying RMSNorm before and after each MoE block, which eliminates the auxiliary load‑balancing loss and mitigates early‑stage loss spikes. A 160K Byte‑Level BPE tokenizer, sized via a derivative‑based estimate from vocabulary scaling laws, balances coverage across English, Korean, Chinese, Japanese, and Spanish while maintaining hardware‑friendly alignment.
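The granularity figure can be illustrated numerically. One common convention defines MoE granularity as G = 2·d_model / d_expert; that convention is an assumption here, since the report may define G differently, but it is consistent with the stated dimensions:

```python
# Illustrative check of the granularity figure, assuming the convention
# G = 2 * d_model / d_expert (one common definition; the report may use another).
d_model = 7168
d_expert = 2048

G = 2 * d_model / d_expert
print(G)  # 7.0, matching the reported granularity
```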
The data pipeline is a four‑stage system: raw data collection, document parsing (including a custom vision‑language model for Korean PDF layout extraction), synthetic data generation (seed‑corpus restructuring and topic‑driven generation), and a three‑step curation process (quality filtering, domain classification, difficulty scoring). This pipeline yields a high‑quality 10T‑token dataset that emphasizes reasoning‑intensive domains.
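The three-step curation stage can be sketched as a filter-then-annotate pass over documents. Everything below is a hypothetical skeleton: the function names, heuristics, and threshold are illustrative stand-ins, since the report's actual scorers are learned classifiers:

```python
# Hypothetical skeleton of the curation stage described above:
# quality filtering -> domain classification -> difficulty scoring.
# All heuristics and thresholds here are illustrative, not from the report.

def quality_score(doc: str) -> float:
    # Stand-in heuristic: real pipelines use learned quality classifiers.
    return min(1.0, len(doc.split()) / 100)

def classify_domain(doc: str) -> str:
    # Stand-in: the report trains a domain classifier.
    return "code" if "def " in doc or "{" in doc else "general"

def difficulty(doc: str) -> float:
    # Stand-in for a learned difficulty scorer.
    return min(1.0, sum(c.isdigit() for c in doc) / 20)

def curate(docs: list[str], q_min: float = 0.3) -> list[dict]:
    kept = []
    for doc in docs:
        if quality_score(doc) < q_min:
            continue  # step 1: quality filtering drops low-quality docs
        kept.append({
            "text": doc,
            "domain": classify_domain(doc),   # step 2: domain classification
            "difficulty": difficulty(doc),    # step 3: difficulty scoring
        })
    return kept
```

The key design point is that filtering happens first, so the more expensive domain and difficulty annotations run only on surviving documents.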
A novel “Think‑Fusion” training recipe unifies two inference modes within a single parameter space: a “thinking” mode that produces deep, multi‑step reasoning chains, and a “non‑thinking” mode that delivers concise, low‑latency answers. By mixing these modes during model merging and instruction‑tuning, users can dynamically select the desired computational budget at inference time, effectively trading off latency for reasoning depth.
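From the user's side, mode switching typically surfaces as a prompt-construction flag. The sketch below models that pattern; the flag name `enable_thinking`, the `/think` and `/no_think` directives, and the template tokens are all assumptions for illustration, not confirmed details of A.X K1's chat template:

```python
# Hypothetical sketch of user-controlled thinking/non-thinking switching.
# The flag name, directives, and special tokens are illustrative assumptions,
# not the report's actual chat-template format.
def build_prompt(user_msg: str, enable_thinking: bool) -> str:
    mode = "/think" if enable_thinking else "/no_think"
    return f"<|system|>{mode}<|user|>{user_msg}<|assistant|>"

# Simple queries can request the low-latency mode; hard ones the reasoning mode.
fast = build_prompt("What is 2+2?", enable_thinking=False)
deep = build_prompt("Prove the statement step by step.", enable_thinking=True)
```

Because both modes live in one parameter space, switching requires no model swap, only a different prompt.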
Extensive evaluations across English and Korean benchmarks—including knowledge (MMLU), instruction following, mathematics (GSM‑8K), and coding (HumanEval)—show that A.X K1 matches or exceeds the performance of contemporary open‑source models with comparable active parameter counts. Notably, on Korean‑specific benchmarks (KoBench, KOR‑MMLU), A.X K1 achieves a 2‑4 percentage‑point lead, demonstrating the benefit of a language‑focused pre‑training corpus. Inference experiments confirm that the Think‑Fusion switch can reduce latency by up to threefold for simple queries while preserving high accuracy on complex tasks.
Beyond technical contributions, the authors release A.X K1 as an open‑source model (https://huggingface.co/skt/A.X-K1), positioning it as a cornerstone for a sovereign Korean AI ecosystem. The work showcases a reproducible pathway for building frontier‑scale MoE models under realistic compute constraints, offering valuable insights for future endeavors aiming at even larger parameter regimes.