📝 Original Info
- Title: Lightweight Agent Core Xmodel-2.5: µP-Based Hyper-Parameter Transfer and FP8 Mixed-Precision Training
- ArXiv ID: 2511.19496
- Date: 2025-11-26
- Authors: Yang Liu, Xiaolong Zhong, Ling Jiang (Xiaoduo AI Lab)
📝 Abstract
Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present Xmodel-2.5, a 1.3-billion-parameter small language model designed as a drop-in agent core. Training with maximal-update parameterization (µP) allows hyper-parameters tuned on a 20M-parameter proxy to transfer directly to the full model, even under the parameter-tied tie-word-embedding architecture. A 1.4T-token Warmup–Stable–Decay curriculum is used, and we further show that switching from AdamW to Muon during the decay phase improves the 13-task reasoning average by 4.58% while keeping every other hyper-parameter fixed, verifying that early AdamW stability can be paired with late Muon sharpening for better downstream performance. FP8 mixed-precision training balances accuracy and throughput. All checkpoints, recipes, and evaluation code are released under the Apache-2.0 license (model: https://huggingface.co/XiaoduoAILab/Xmodel-2.5; training checkpoints: https://huggingface.co/XiaoduoAILab/Xmodel-2.5-history). Training code and evaluation harness: https://github.com/XiaoduoAILab/Xmodel-2.5.
📄 Full Content
Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM
Yang Liu
Xiaolong Zhong
Ling Jiang
Xiaoduo AI Lab
foamliu@yeah.net
Abstract
Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present Xmodel-2.5, a 1.3-billion-parameter small language model designed as a drop-in agent core. Training with maximal-update parameterization (µP) allows hyper-parameters tuned on a 20M-parameter proxy to transfer directly to the full model, even under the parameter-tied tie-word-embedding architecture. A 1.4T-token Warmup–Stable–Decay curriculum is used, and we further show that switching from AdamW to Muon during the decay phase improves the 13-task reasoning average by 4.58% while keeping every other hyper-parameter fixed, verifying that early AdamW stability can be paired with late Muon sharpening for better downstream performance. FP8 mixed-precision training balances accuracy and throughput. All checkpoints, recipes, and evaluation code are released under the Apache-2.0 license (model: https://huggingface.co/XiaoduoAILab/Xmodel-2.5; training checkpoints: https://huggingface.co/XiaoduoAILab/Xmodel-2.5-history). Training code and evaluation harness: https://github.com/XiaoduoAILab/Xmodel-2.5.
1 Introduction
Large language models (LLMs) have demonstrated remarkable reasoning, planning, and tool-use capabilities, yet their deployment as autonomous agents remains prohibitive for resource-constrained environments. State-of-the-art agent backbones typically exceed 7–13 B parameters, demanding high-end accelerators and large memory footprints incompatible with edge or cost-sensitive scenarios. Recent small language models (SLMs, < 2 B) match LLMs on single-turn benchmarks such as GSM8K or MMLU, but still fall short in complex multi-step reasoning, the core skill required for tool invocation, long-horizon planning, and robust error recovery. Boosting this capability within the 1–2 B regime is the central goal of our work.
Xmodel-2.5. We present Xmodel-2.5, a 1.3 B-parameter decoder-only model that retains Xmodel-2's two-stage pre-training recipe and maximal-update parameterization (µP), while introducing four targeted upgrades:
1. µP infrastructure. We extended Megatron-LM with complete µP support, modifying its parameterization, attention scaling, and residual connections. The implementation was validated to preserve µP dynamics, enabling reliable hyper-parameter transfer.
2. Tokenizer. Adopted the 129 k-token DeepSeek-v3 tokenizer (vs. Xmodel-2's 65 k-token Unigram), improving compression and decoding speed.
3. Numeric precision. Switched from BF16 to FP8 mixed precision, raising training throughput by ≈30% with no observable degradation in pilot experiments.
4. Optimizer schedule. Switched from AdamW to Muon during the decay phase, improving the 13-task reasoning average by 4.58% while keeping all other hyper-parameters fixed (see the sketch after this list).
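Item 4 replaces the optimizer only for the decay phase. Below is a minimal sketch of how such a mid-run switch can be wired up, assuming the Newton-Schulz orthogonalization step used by common open-source Muon implementations and placeholder learning rates; it is not the authors' Megatron-LM code, and in practice embeddings, output projections, norms, and biases typically remain on AdamW.

```python
import torch


def newton_schulz5(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update (quintic Newton-Schulz iteration,
    with coefficients commonly used in open-source Muon implementations)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x


class MuonSketch(torch.optim.Optimizer):
    """Illustrative Muon-style optimizer: heavy-ball momentum followed by an
    orthogonalized update, applied only to 2-D weight matrices."""

    def __init__(self, params, lr: float = 0.02, momentum: float = 0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None or p.ndim != 2:
                    continue
                buf = self.state[p].setdefault("momentum_buffer", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)
                p.add_(newton_schulz5(buf), alpha=-group["lr"])


def build_optimizers(model: torch.nn.Module, phase: str):
    """AdamW everywhere during warmup/stable; from the start of the decay phase,
    2-D matrices switch to the Muon-style update while other tensors stay on AdamW.
    Learning rates are placeholders, not the paper's values."""
    if phase in ("warmup", "stable"):
        return [torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)]
    matrices = [p for p in model.parameters() if p.ndim == 2]
    others = [p for p in model.parameters() if p.ndim != 2]
    return [MuonSketch(matrices, lr=0.02),
            torch.optim.AdamW(others, lr=1e-3, weight_decay=0.1)]
```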
We hope Xmodel-2.5 serves as a minimal yet strong baseline for lightweight agents with enhanced complex-reasoning capabilities.
2 Background & Related Work
2.1 Small Language Models for Reasoning
Parameter-efficient SLMs (< 2 B) have recently closed the gap with larger counterparts on mathematical and commonsense reasoning. MiniCPM (Team et al., 2025b) and DCLM-1B (Li et al., 2025) adopt code-enriched corpora and cosine or WSD schedules to surpass 35% on GSM8K. Abdin et al. (2024) further emphasise textbook-style synthetic data. These works, however, primarily target single-turn question answering; systematic evaluation of multi-step agentic behaviours remains under-explored.
2.2 Hyper-Parameter Transfer with Maximal-Update Parameterisation
µP (Yang et al., 2022, 2023) preserves training dynamics across widths, enabling hyper-parameter transfer from "toy" to full-scale models. Original derivations assume SGD; recent work integrates Adam (Kingma and Ba, 2017) but reports instability below 2 B parameters.
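As a concrete illustration of this transfer, the sketch below applies one common reading of the µP scaling rules for Adam-family optimizers: hyper-parameters are tuned on a narrow proxy and re-scaled by the width ratio when the full model is built. The widths, learning rate, and head dimension are placeholders rather than Xmodel-2.5's actual settings.

```python
# One common reading of µP scaling for Adam-family optimizers (illustrative only):
# tune on a narrow proxy, then re-scale per-tensor hyper-parameters by the width ratio.
base_width = 256        # hidden size of the small proxy model (placeholder)
target_width = 1536     # hidden size of the full model (placeholder)
head_dim = 128          # attention head dimension (placeholder)
m = target_width / base_width   # width multiplier

proxy = dict(lr=1e-2, init_std=0.02)   # tuned once on the cheap proxy

full = dict(
    # Hidden (matrix-like) weights: Adam learning rate shrinks with width,
    # and initialization follows 1/sqrt(fan_in).
    hidden_lr=proxy["lr"] / m,
    hidden_init_std=proxy["init_std"] / m ** 0.5,
    # Embedding-like weights keep the proxy learning rate under µP.
    embedding_lr=proxy["lr"],
    # µP replaces the usual 1/sqrt(d_head) attention scaling with 1/d_head.
    attention_scale=1.0 / head_dim,
)
print(full)
```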
2.3 Efficient Training of Lightweight Models
FP8 mixed-precision (Micikevicius et al., 2022) and fused attention kernels reduce memory and energy, yet have not been jointly studied with µP optimisers. WSD learning-rate schedules (Hu et al., 2024) improve late-stage performance by decoupling the annealing phase from token count; we extend WSD with domain-weighted data mixing during decay, an ablation absent in prior literature.
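To make the schedule-plus-mixing idea concrete, the sketch below couples a Warmup-Stable-Decay learning-rate function with a data-mixture switch at the start of the decay phase. The phase fractions, linear decay shape, domain names, and proportions are illustrative assumptions, not the recipe reported in the paper.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay schedule (illustrative fractions and linear decay)."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / warmup_steps
    if step < stable_end:                         # long constant plateau
        return peak_lr
    progress = (step - stable_end) / decay_steps  # anneal during the final phase
    return peak_lr * max(1.0 - progress, 0.0)


def data_mixture(step: int, total_steps: int, decay_frac: float = 0.1) -> dict:
    """Domain-weighted mixing: up-weight reasoning-heavy data once decay begins
    (domain names and proportions are placeholders, not the paper's mixture)."""
    in_decay = step >= total_steps * (1.0 - decay_frac)
    if not in_decay:
        return {"web": 0.7, "code": 0.2, "math": 0.1}
    return {"web": 0.4, "code": 0.3, "math": 0.3}
```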
3 Methodology
We scale Xmodel-2 to 1.3 B parameters while retaining its deployment-friendly, deep-and-thin decoder-only skeleton. The section below details three design pillars: (i) architecture-level µP compatibility (§3.1), (ii) a three-phase Warmup–Stable–Decay (WSD) curriculum (§3.2), and (iii) FP8 mixed-precision acceleration (§3.3).
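For the FP8 pillar, the sketch below shows the general technique with NVIDIA Transformer Engine, which is commonly used for FP8 training in the Megatron ecosystem; the recipe settings and layer sizes are illustrative and are not claimed to match the authors' configuration.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Delayed-scaling recipe: HYBRID keeps E4M3 for forward tensors and E5M2 for gradients.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(2048, 2048, bias=False).cuda()
x = torch.randn(16, 2048, device="cuda")

# Only the region under fp8_autocast runs its GEMMs in FP8; master weights,
# optimizer states, and the loss stay in higher precision.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.sum().backward()
```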
3.1 Model Architecture
Xmodel-2.5 keeps the deep-and-thin decoder-only design of Xmodel-2, with the configuration in Table 1. To preserve maximal-update dynamics across …