Lightweight Agent Core Xmodel-2.5: µP-Based Hyper-Parameter Transfer and FP8 Mixed-Precision Training

Reading time: 5 minutes

📝 Original Info

  • Title: Lightweight Agent Core Xmodel-2.5: µP-Based Hyper-Parameter Transfer and FP8 Mixed-Precision Training
  • ArXiv ID: 2511.19496
  • Date: 2025-11-26
  • Authors: Yang Liu, Xiaolong Zhong, Ling Jiang (Xiaoduo AI Lab)

📝 Abstract

Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present Xmodel-2.5, a 1.3-billion-parameter small language model designed as a drop-in agent core. Training with maximal-update parameterization (µP) allows hyper-parameters tuned on a 20M-parameter proxy to transfer directly to the full model, even under the parameter-tied tie-word-embedding architecture. A 1.4T-token Warmup–Stable–Decay curriculum is used, and we further show that switching from AdamW to Muon during the decay phase improves the 13-task reasoning average by 4.58% while keeping every other hyper-parameter fixed, verifying that early AdamW stability can be paired with late Muon sharpening for better downstream performance. FP8 mixed-precision training balances accuracy and throughput. All checkpoints, recipes, and evaluation code are released under the Apache-2.0 license (https://huggingface.co/XiaoduoAILab/Xmodel-2.5 and https://huggingface.co/XiaoduoAILab/Xmodel-2.5-history for training checkpoints). Training code and evaluation harness: https://github.com/XiaoduoAILab/Xmodel-2.5.
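
The Warmup–Stable–Decay curriculum mentioned above can be pictured as a three-phase learning-rate schedule: a short linear warmup, a long constant plateau, and a final anneal. The sketch below is illustrative only; the phase fractions, peak learning rate, and decay shape are assumptions, not values reported by the paper.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 4e-3,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr: float = 4e-5) -> float:
    """Warmup-Stable-Decay (WSD) schedule sketch: linear warmup to peak_lr,
    a long constant plateau, then a linear anneal to min_lr over the final
    decay_frac of training. All numbers are illustrative placeholders."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                        # phase 1: warmup
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:                         # phase 2: stable plateau
        return peak_lr
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress  # phase 3: decay
```

The decay phase is also where the paper swaps AdamW for Muon; a hedged sketch of that switch appears later in the post.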

💡 Deep Analysis

📄 Full Content

Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM

Yang Liu, Xiaolong Zhong, Ling Jiang (Xiaoduo AI Lab). Contact: foamliu@yeah.net

1 Introduction

Large language models (LLMs) have demonstrated remarkable reasoning, planning, and tool-use capabilities, yet their deployment as autonomous agents remains prohibitive for resource-constrained environments. State-of-the-art agent backbones typically exceed 7–13B parameters, demanding high-end accelerators and large memory footprints incompatible with edge or cost-sensitive scenarios. Recent small language models (SLMs, < 2B) match LLMs on single-turn benchmarks such as GSM8K or MMLU, but still fall short in complex multi-step reasoning—the core skill required for tool invocation, long-horizon planning, and robust error recovery. Boosting this capability within the 1–2B regime is the central goal of our work.

Xmodel-2.5. We present Xmodel-2.5, a 1.3B-parameter decoder-only model that retains Xmodel-2's two-stage pre-training recipe and maximal-update parameterization (µP), while introducing four targeted upgrades:

1. µP support in Megatron-LM. We extended Megatron-LM with complete µP support, modifying its parameterization, attention scaling, and residual connections. The implementation was validated to preserve µP dynamics, enabling reliable hyper-parameter transfer.
2. Tokenizer. Adopted the 129k-token DeepSeek-v3 tokenizer (vs. Xmodel-2's 65k-token Unigram), improving compression and decoding speed.
3. Numeric precision. Switched from BF16 to FP8 mixed precision, raising training throughput by ≈30% with no observable degradation in pilot experiments.
4. Optimizer schedule. Switched from AdamW to Muon during the decay phase, improving the 13-task reasoning average by 4.58% while keeping all other hyper-parameters fixed (see the optimizer-switch sketch after this list).

We hope Xmodel-2.5 serves as a minimal yet strong baseline for lightweight agents with enhanced complex-reasoning capabilities.
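
Upgrade 4 pairs AdamW in the warmup and stable phases with Muon-style orthogonalized momentum updates during decay. The sketch below is a minimal, hedged illustration of that idea in PyTorch; the `MuonSketch` class, its hyper-parameters, and the `build_optimizer` helper are illustrative stand-ins, not the paper's implementation (which runs inside Megatron-LM).

```python
import torch


def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5,
                                eps: float = 1e-7) -> torch.Tensor:
    """Approximately map a 2-D update matrix to a (semi-)orthogonal matrix with
    a quintic Newton-Schulz iteration, as used by Muon-style optimizers."""
    a, b, c = 3.4445, -4.7750, 2.0315            # commonly used quintic coefficients
    x = g / (g.norm() + eps)                     # bring the spectral norm near <= 1
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x


class MuonSketch(torch.optim.Optimizer):
    """Minimal Muon-style optimizer sketch: heavy-ball momentum on each 2-D
    weight matrix, followed by Newton-Schulz orthogonalization of the update."""

    def __init__(self, params, lr: float = 0.02, momentum: float = 0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None or p.ndim != 2:
                    continue                     # Muon updates matrix-shaped params only
                buf = self.state[p].setdefault("momentum", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(newton_schulz_orthogonalize(buf), alpha=-group["lr"])


def build_optimizer(model: torch.nn.Module, step: int, decay_start: int):
    """AdamW for the warmup+stable phases, a Muon-style optimizer for the decay
    phase; learning rates and thresholds are placeholders, not the paper's."""
    if step < decay_start:
        return torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
    matrices = [p for p in model.parameters() if p.ndim == 2]
    return MuonSketch(matrices)
```

In practice a Muon phase still needs an element-wise optimizer (e.g. AdamW) for embeddings, norms, and other non-matrix parameters; the sketch omits that bookkeeping for brevity.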
2 Background & Related Work

2.1 Small Language Models for Reasoning

Parameter-efficient SLMs (< 2B) have recently closed the gap with larger counterparts on mathematical and commonsense reasoning. MiniCPM (Team et al., 2025b) and DCLM-1B (Li et al., 2025) adopt code-enriched corpora and cosine or WSD schedules to surpass 35% on GSM8K. Abdin et al. (2024) further emphasise textbook-style synthetic data. These works, however, primarily target single-turn question answering; systematic evaluation of multi-step agentic behaviours remains under-explored.

2.2 Hyper-Parameter Transfer with Maximal-Update Parameterisation

µP (Yang et al., 2022, 2023) preserves training dynamics across widths, enabling hyper-parameter transfer from "toy" to full-scale models. Original derivations assume SGD; recent work integrates Adam (Kingma and Ba, 2017) but reports instability below 2B parameters.

2.3 Efficient Training of Lightweight Models

FP8 mixed precision (Micikevicius et al., 2022) and fused attention kernels reduce memory and energy, yet have not been jointly studied with µP optimisers. WSD learning-rate schedules (Hu et al., 2024) improve late-stage performance by decoupling the annealing phase from token count; we extend WSD with domain-weighted data mixing during decay, an ablation absent in prior literature.

3 Methodology

We scale Xmodel-2 to 1.3B parameters while retaining its deployment-friendly, deep-and-thin decoder-only skeleton. The section below details three design pillars: (i) architecture-level µP compatibility (§3.1), (ii) a three-phase Warmup–Stable–Decay (WSD) curriculum (§3.2), and (iii) FP8 mixed-precision acceleration (§3.3).

3.1 Model Architecture

Xmodel-2.5 keeps the deep-and-thin decoder-only design of Xmodel-2, with the configuration in Table 1. To preserve maximal-update dynamics across ...
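
As a rough illustration of the hyper-parameter transfer described in §2.2 and §3.1, the sketch below shows two ingredients µP changes relative to standard parameterization: per-group learning-rate scaling by the width multiplier, and 1/d_head attention scaling in place of 1/sqrt(d_head). The grouping heuristic, function names, and the treatment of the tied embedding are simplifying assumptions, not the authors' Megatron-LM implementation.

```python
import torch


def mup_param_groups(model: torch.nn.Module, base_lr: float,
                     base_width: int, width: int):
    """Hedged sketch of µP-style learning-rate scaling when reusing
    hyper-parameters tuned on a narrow proxy (base_width) at a larger width.
    The grouping rule is deliberately crude; a real setup tracks fan-in per
    tensor and also adjusts initialization scales and output multipliers."""
    m = width / base_width                       # width multiplier
    matrix_like, vector_like = [], []
    for name, p in model.named_parameters():
        # Hidden weight matrices get lr / m under the Adam variant of µP;
        # biases, norms, and the (tied) embedding table keep the base rate.
        if p.ndim == 2 and "embed" not in name:
            matrix_like.append(p)
        else:
            vector_like.append(p)
    return [
        {"params": matrix_like, "lr": base_lr / m},
        {"params": vector_like, "lr": base_lr},
    ]


def mup_attention_scores(q: torch.Tensor, k: torch.Tensor, d_head: int) -> torch.Tensor:
    """µP replaces the usual 1/sqrt(d_head) attention scaling with 1/d_head so
    that attention-logit scale stays width-independent."""
    return (q @ k.transpose(-2, -1)) / d_head
```

Because hyper-parameters become approximately width-invariant under this parameterization, settings tuned on the 20M-parameter proxy can be reused directly at 1.3B, which is the transfer the paper relies on.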

Reference

This content is AI-processed based on open access ArXiv data.
