KAT-Coder-V2 Technical Report
Authors: Fengxiang Li, Han Zhang, Haoyang Huang
KwaiKAT Team

Abstract

We present KAT-Coder-V2, an agentic coding model developed by the KwaiKAT team at Kuaishou. KAT-Coder-V2 adopts a Specialize-then-Unify paradigm that decomposes agentic coding into five expert domains (SWE, WebCoding, Terminal, WebSearch, and General), each undergoing independent supervised fine-tuning and reinforcement learning before being consolidated into a single model via on-policy distillation. We develop KwaiEnv, a modular infrastructure sustaining tens of thousands of concurrent sandbox instances, and scale RL training along task complexity, intent alignment, and scaffold generalization. We further propose MCLA for stabilizing MoE RL training and Tree Training for eliminating redundant computation over tree-structured trajectories, with up to 6.2× speedup. KAT-Coder-V2 achieves 79.6% on SWE-bench Verified (vs. Claude Opus 4.6 at 80.8%), 88.7 on PinchBench (surpassing GLM-5 and MiniMax M2.7), ranks first across all three frontend aesthetics scenarios, and maintains strong generalist scores on Terminal-Bench Hard (46.8) and τ²-Bench (93.9). Our model is publicly available at https://streamlake.com/product/kat-coder.

Figure 1 | Results of KAT-Coder-V2 and Claude Opus 4.6 across different scaffolds on various software engineering benchmarks.

1. Introduction

Large Language Models (LLMs) are rapidly evolving from single-turn code generation toward Agentic Coding: the ability to autonomously plan, execute, and verify multi-step software engineering tasks within real-world development environments. Recent frontier models [1-6] have demonstrated impressive progress in this direction, steadily advancing the state of the art on benchmarks including SWE-bench [7], Terminal-Bench [8], and τ²-Bench [9].
Unlike traditional code question-answering or mathematical reasoning, agentic coding requires models to interact with authentic code repositories, manage intricate dependency graphs, orchestrate multi-turn tool invocations, and ground their decisions in execution feedback. This interactive, long-horizon workflow demands that models' multi-step behaviors be aligned with end-to-end engineering outcomes, rather than merely optimizing for single-turn code correctness.

Realizing this vision presents three fundamental challenges. The first is capability fragmentation. SWE tasks require long-chain code editing grounded in test verification, WebCoding demands aesthetic judgment under sparse colloquial inputs, and Terminal tasks call for persistent environment state tracking. The training signals across these domains are not merely different but often conflicting, making it impractical for a single monolithic training pipeline to reach the optimum in every domain simultaneously. The second challenge is infrastructure coupling. Agentic RL training demands high-throughput sandbox orchestration, heterogeneous benchmark support, and seamless compatibility with a rapidly growing ecosystem of agent scaffolds such as Claude Code, OpenClaw, and OpenCode. Existing systems, however, tightly couple these concerns, making every new scaffold or dataset integration a costly engineering endeavor. The third is scaling agentic RL. Effectively training coding agents requires scaling along multiple dimensions simultaneously (task complexity, prompt diversity, and scaffold generalization) while coping with the MoE instability and computational redundancy introduced by tree-structured, multi-turn trajectories.

We introduce KAT-Coder-V2, a comprehensive agentic coding model developed by the KwaiKAT team at Kuaishou.
Built upon KAT-Coder-V1 [10] through continued post-training, the model follows a Specialize-then-Unify paradigm that systematically addresses all three challenges above. We decompose the full capability spectrum into five orthogonal expert domains (SWE, WebCoding, Terminal, WebSearch, and General), each undergoing independent data construction, supervised fine-tuning, and environment-feedback reinforcement learning. The resulting domain experts are then consolidated into a single deployable model through On-Policy Distillation (OPD), which combines the direct mistake-avoidance of on-policy exploration with dense, step-by-step supervision from the specialized experts, achieving lossless fusion without the exposure bias of offline imitation.

To tackle infrastructure coupling, we develop KwaiEnv, a modular infrastructure that decouples datasets, sandboxes, scaffolds, and verifiers, sustaining tens of thousands of concurrent sandbox instances. Built on this foundation, we propose an Agentic Scaling paradigm that systematically scales RL training along task complexity, intent alignment, and scaffold generalization, yielding over 100K diverse, high-difficulty training samples across multiple agent frameworks. To stabilize MoE RL training, we propose MCLA (Monte-Carlo Log-probability Averaging) for reducing log-probability variance. We further introduce Tree Training for eliminating redundant computation over tree-structured trajectories, achieving up to 6.2× training speedup.

Extensive evaluation shows that KAT-Coder-V2 closely matches Claude Opus 4.6 across scaffolds and benchmarks: 79.6% on SWE-bench Verified (vs. 80.8%), 88.7 on PinchBench (surpassing GLM-5 at 86.4 and MiniMax M2.7 at 87.1), leading scores across all three frontend aesthetics scenarios (Landing Page 59.8, Slides 57.6, Data Visualization 67.6), and strong generalist performance (Terminal-Bench Hard 46.8, τ²-Bench 93.9).
These results confirm that domain-specialized training, large-scale agentic RL with systematic scaling, and unified on-policy distillation form an effective path to powerful coding agents.

2. KwaiEnv: Infrastructure for Agentic Code Intelligence

2.1. Background and Design Motivation

As the capabilities of Large Language Models continue to evolve, Agentic Coding has emerged as a critical domain for model evaluation and Reinforcement Learning (RL) training. Unlike traditional Question-Answering (QA) or mathematical reasoning tasks, Agentic Coding, and particularly Software Engineering (SWE) tasks, requires models to execute multi-step, long-chain operations within a sandbox environment equipped with authentic code repositories, dependencies, and test suites. The rollout process for these tasks involves several complex stages, including environment initialization, tool calling, state management, and result verification, far exceeding the complexity of single-turn inference scenarios.

In engineering practice, this complexity introduces the following challenges:

• Dataset Heterogeneity: Diverse benchmarks (e.g., SWE-bench, SWE-bench Pro [11]) impose varying requirements on sandbox images and evaluation logic.
• Scaffold Proliferation: New scaffolds for Coding Agents are constantly emerging, with significant differences in integration protocols; without a unified abstraction, onboarding each new agent requires redundant engineering effort.
• High-Throughput Demands: During the RL training phase, a massive number of rollouts must be executed concurrently, placing stringent performance requirements on sandbox scheduling and trajectory collection.

To address these challenges, we developed KwaiEnv. The core design objective is to decouple datasets, sandboxes, scaffolds, and verification logic through a modular and configurable architecture.
This allows for the flexible combination of components at minimal cost, supporting the entire workflow from model evaluation to RL training.

2.2. System Overview

KwaiEnv provides a unified interface that supports the configurable combination of models, scaffolds, and datasets. This enables a complete closed-loop workflow encompassing model trajectory collection, rollout evaluation, and the delivery of trajectories to the RL engine for training. The system consists of five core modules, each with distinct responsibilities and a high degree of decoupling, allowing for flexible extension as needed.

In a typical workflow, the user specifies the dataset, target model, and scaffold via a configuration file. KwaiEnv then orchestrates the necessary remote sandboxes, deploys the scaffold onto the corresponding dataset images, and forwards model requests to the target LLM through a unified network proxy layer, recording the entire interaction trajectory. Upon completion, the Verifier scores the results, and the Trajectory Manager formats the trajectories for the RL engine. This entire pipeline operates autonomously without human intervention, significantly reducing the engineering overhead of data collection and model training, as shown in Figure 2.

Figure 2 | KwaiEnv Workflow for SWE Tasks, Supporting Key Processes Including Data Synthesis, RL, and Evaluation

2.3. Core Modules
2.3.1. Dataset

KwaiEnv integrates mainstream LLM benchmarks covering data analysis, code generation, SWE, web search, and general reasoning, including widely adopted evaluation sets such as SWE-bench [7] and LiveCodeBench [12]. Furthermore, the system incorporates internal proprietary training and test sets to support multi-dimensional evaluation and full-scenario RL. The Dataset module utilizes a unified abstract interface to mask the discrepancies in task formats, image dependencies, and scoring logic across different benchmarks. New datasets can be seamlessly integrated by implementing the standard methods defined by the interface.

2.3.2. Verifier

KwaiEnv employs differentiated verification strategies tailored to various task types, encapsulated within the Verifier module. The system supports three primary categories of verification:

• Deterministic Scoring: For tasks with definitive answers (e.g., mathematical proofs, code generation), a specialized module performs precise scoring based on golden patches, execution of test cases, or standard output comparison.
• LLM-as-Judge: For open-ended tasks (e.g., instruction following, long-document comprehension), the system supports LLM-based evaluation and rubric-based scoring, with configurable dimensions and weights.
• SWE Evaluation: For software engineering tasks, the system invokes official scoring modules to execute test suites within the sandbox and return key metrics such as pass rates.

2.3.3. Scaffold

KwaiEnv supports the "black-box" integration of leading Coding Agent scaffolds, including Claude Code, Kilo Code, Cline, OpenClaw, and OpenCode, while maintaining compatibility across versions. The integration cost is minimal: since KwaiEnv proxies model requests at the network layer, any Coding Agent that calls an LLM via API can be integrated without code modifications, requiring only the configuration of API endpoints and authentication.
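As an illustrative sketch of the Verifier's category dispatch described above, under stated assumptions: the class names, rubric fields, and the stubbed judge call are ours for illustration, not the actual KwaiEnv API.

```python
from abc import ABC, abstractmethod

class Verifier(ABC):
    @abstractmethod
    def score(self, task: dict, output: str) -> float: ...

class DeterministicVerifier(Verifier):
    """Exact scoring for tasks with definitive answers (e.g., stdout comparison)."""
    def score(self, task: dict, output: str) -> float:
        return 1.0 if output.strip() == task["expected"].strip() else 0.0

def judge_dimension(dim: str, task: dict, output: str) -> float:
    """Stub for an LLM-as-Judge call; a real system would prompt a judge model."""
    return 0.8

class JudgeVerifier(Verifier):
    """Rubric-based judging with configurable dimensions and weights."""
    def __init__(self, rubric: dict):
        self.rubric = rubric  # dimension name -> weight
    def score(self, task: dict, output: str) -> float:
        total_w = sum(self.rubric.values())
        return sum(w * judge_dimension(d, task, output)
                   for d, w in self.rubric.items()) / total_w
```

A third subclass for SWE evaluation would analogously wrap the official test-suite runners inside the sandbox.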
2.3.4. Sandbox

The Sandbox module is the foundational infrastructure for large-scale RL training. The system can spin up a massive number of remote sandbox instances within seconds. Each sandbox runs in an isolated container environment, mounted with dataset-specific images. KwaiEnv manages the entire lifecycle (creation, task assignment, monitoring, and reclamation), making the process transparent to upper-layer modules. The system can support tens of thousands of concurrent sandboxes, providing the high throughput required for rapid RL rollout acquisition.

2.3.5. Trajectory Manager

Acting as the bridge between KwaiEnv and the RL engine, the Trajectory Manager handles trajectory collection, formatting, and output. It intercepts all LLM requests via the network proxy, recording comprehensive metadata including I/O content, tool-call sequences, token usage, and timestamps. For RL training, the module can assemble, reorder, and truncate raw trajectories to meet the input specifications of various algorithms.

2.4. Decoupling and Scalability

KwaiEnv adheres to the principle of Separation of Concerns. The five core modules communicate through standardized interfaces, allowing independent iteration of any module. This design yields several key benefits:

• Data Scalability: Scaling training data requires only the implementation of a unified data interface, without impacting sandboxes or scaffolds.
• Scaffold Scalability: New Coding Agents can be onboarded by simply configuring container commands and API endpoints.
• Evaluation Agility: The evaluation and training pipelines share the same infrastructure, ensuring high consistency and short iteration cycles.
• Algorithmic Adaptability: The formatting logic is decoupled from RL algorithms; new algorithms can be supported by simply registering new trajectory formatting rules.
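Putting the modules above together, a minimal sketch of one KwaiEnv rollout loop. Every name here is an illustrative assumption rather than the actual KwaiEnv API, and the function bodies are placeholders for the real Sandbox, Scaffold, Verifier, and Trajectory Manager implementations.

```python
from dataclasses import dataclass

@dataclass
class TaskConfig:
    dataset: str    # e.g. "swe-bench-verified"
    model: str      # target LLM reached through the network proxy
    scaffold: str   # e.g. "claude-code"

def provision_sandbox(dataset: str) -> dict:
    # Sandbox module: isolated container mounted with a dataset-specific image.
    return {"dataset": dataset, "status": "ready"}

def run_scaffold(sandbox: dict, scaffold: str, model: str) -> list:
    # Scaffold runs inside the sandbox; the LLM proxy records each turn.
    return [{"turn": 0, "scaffold": scaffold, "model": model, "tool_calls": []}]

def verify(sandbox: dict, trajectory: list) -> float:
    # Verifier module: deterministic / LLM-as-Judge / SWE test-suite scoring.
    return 1.0 if trajectory else 0.0

def run_rollout(cfg: TaskConfig) -> dict:
    sandbox = provision_sandbox(cfg.dataset)
    trajectory = run_scaffold(sandbox, cfg.scaffold, cfg.model)
    score = verify(sandbox, trajectory)
    # Trajectory Manager: package the scored trajectory for the RL engine.
    return {"trajectory": trajectory, "score": score}
```

Because the four calls touch only each other's return values, any module can be swapped out (a new dataset image, a new scaffold, a new verifier) without changing the loop, which is the decoupling property described above.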
3. Post-Training Methodology

3.1. Training Pipeline Overview

KAT-Coder-V2 is built upon KAT-Coder-V1 [10] through continued post-training, following a specialize-then-unify paradigm. We decompose the capability spectrum of agentic coding into five orthogonal expert domains: SWE (software engineering repair and development), WebCoding (frontend generation and aesthetics), Terminal (command-line reasoning), WebSearch (online search and information synthesis), and General (general-purpose code intelligence), each of which undergoes independent data construction and specialized training.

Scaffold repositories: https://github.com/anthropics/claude-code, https://github.com/kilo-org/kilocode, https://github.com/cline/cline, https://github.com/openclaw/openclaw, https://github.com/anomalyco/opencode

Table 1 | Overview of the five expert domains in the SFT stage.

Expert    | Scenario                | Key Methodology
SWE       | Issue resolution        | Issue-PR pairing with merge-status supervision; AutoBuilder for verifiable task synthesis (F2P+P2P); Code Comprehension trajectory synthesis
WebCoding | UI generation           | Tri-Perspective label system; prompt rewriting (designer → ordinary user); designer-panel evaluation
Terminal  | CLI reasoning           | Cross-format SWE → Terminal conversion; multi-agent synthesis; Docker-based verification
WebSearch | Agentic search          | KG construction from search trajectories; Pass@8 filtering; rejection sampling fine-tuning
General   | Instr. / QA / Code-Math | Compositional constraint training; long-conversation samples; online-judge verification

The overall pipeline consists of three stages:

• Supervised Fine-Tuning: For each expert domain, we leverage KwaiEnv's trajectory collection capabilities and domain-specific data synthesis pipelines to construct large-scale, high-quality training data, producing a dedicated expert model per domain.
• Reinforcement Learning: Using the sandbox environments and verifier infrastructure provided by KwaiEnv, we apply environment-feedback-based reinforcement learning to further improve decision quality in multi-turn interactions and long-horizon tasks.
• On-Policy Distillation: The capabilities of multiple domain experts are consolidated into a unified KAT-Coder-V2 through on-policy distillation, achieving single-model deployment while retaining expert-level performance across all domains.

The following subsections detail the data construction and training methodology for each expert domain.

3.2. Supervised Fine-Tuning

We train five domain experts via supervised fine-tuning, each targeting a distinct capability required for agentic coding. Table 1 summarizes the data sources, scale, and key methodological innovations of each expert. The remainder of this section details the unique technical contributions within each domain.

3.2.1. SWE Expert: Autonomous Issue Resolution

The SWE Expert targets real-world software engineering scenarios, training the model to autonomously perform end-to-end tasks (codebase comprehension, fault localization, and code repair) starting from an issue description.
Data construction revolves around three complementary pipelines: Issue-PR, which supplies large-scale real-world engineering repair corpora; AutoBuilder, which generates verifiable interactive training tasks; and Code Comprehension, which produces interactive code understanding trajectories grounded in real-world repositories.

Figure 3 | Overview of the Issue-PR data construction pipeline. Merged PRs serve as anchor points for bidirectional Issue-PR mapping and code diff extraction, with merge status providing natural correctness supervision. The reconstructed engineering chains are decomposed into retrieval and editing tasks, then filtered through multi-stage quality control to yield over 2M training samples.

Issue-PR Pipeline. We extract paired data of merged Pull Requests and their associated Issues from hundreds of thousands of GitHub open-source repositories, covering 11 mainstream programming languages (illustrated in Figure 3).
Using merged PRs as anchor points, we establish bidirectional Issue-PR mappings through semantic association analysis. Specifically, for each merged PR p, we compute a relevance score

s(i, p) = cos(e_i, e_p)    (1)

between the Issue embedding e_i and the PR embedding e_p, retaining pairs with s(i, p) > τ to establish bidirectional mappings. We then extract pre- and post-merge code state differences (diffs), and reconstruct the complete problem discovery → fault localization → code repair chain.

Building upon this chain, we construct two complementary training paradigms. Retrieval tasks guide the model to perform precise mapping from the Issue semantic space to the code space: given an Issue description, the model must locate relevant files and functions within a large-scale codebase. Editing tasks require the model to produce complete repairs that integrate fault attribution with change proposals, forming an end-to-end capability loop. Along the long-context dimension, we exploit the inherent long-range dependency characteristics of Issue-PR data (cross-file changes, multi-round reviews, and linked PR iterations) by aggregating highly correlated engineering fragments into long-sequence samples, strengthening the model's ability to associate information across large-scale codebases.

Regarding data quality, the PR merge status serves as a natural correctness supervision signal. On top of this, we filter out auto-generated artifacts and non-essential dependency changes, perform semantic-level deduplication of repetitive repair patterns, and ultimately curate over 2M high-quality samples.

AutoBuilder Pipeline. Static code data lacks environment interaction information and is insufficient for training the long-horizon reasoning capabilities required in agentic scenarios.
To address this, we design an automated task synthesis pipeline (illustrated in Figure 4) that automatically constructs verifiable software engineering tasks from real-world repositories, comprising three stages: environment setup, instruction construction, and instance verification.

Figure 4 | Overview of the AutoBuilder pipeline for automated SWE task synthesis from open-source repositories.
Environment Setup. We select active repositories with well-configured CI from GitHub and extract commit/PR instances that contain unit test changes. For each instance, we employ multi-agent collaboration to automatically construct an isolated sandbox: a Dependency Resolution Agent, an Environment Configuration Agent, and a Build Verification Agent are respectively responsible for dependency installation, compilation configuration, and test execution. These agents iteratively set up and repair the environment based on the repository's own Dockerfile, dependency manifests, and CI scripts, until the code compiles and tests are executable.

Instruction Construction. Taking the commit diff, associated Issue, and surrounding code context as input, we use an LLM to automatically generate user instructions. The key constraint is that instructions must describe only the requirement intent without leaking implementation details. Multi-round filtering ensures clarity and open-endedness, closely approximating the way real users pose questions.

Instance Verification. Let T_fail and T_pass denote the sets of originally failing and passing tests, respectively. An instance with repaired code ĉ is retained if and only if it satisfies both criteria:

∀ t ∈ T_fail : t(ĉ) = Pass  (Fail-to-Pass, F2P)    ∧    ∀ t ∈ T_pass : t(ĉ) = Pass  (Pass-to-Pass, P2P)    (2)

F2P confirms the repair is effective by requiring all previously failing tests to pass, while P2P rules out regression defects by ensuring all previously passing tests remain unaffected. Only instances satisfying both conditions are retained.

Through this pipeline, we produce 30k verified training samples from over 8,000 open-source repositories spanning mainstream languages including Python, Java, TypeScript, Go, Rust, and C/C++, covering typical task types such as bug fixing, feature development, and code refactoring.
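The retention criterion of Eq. (2) can be sketched directly; `run_test`, which stands in for executing one test against the repaired code, is an illustrative assumption:

```python
def keep_instance(t_fail: set, t_pass: set, run_test) -> bool:
    """Retain an instance iff every originally failing test now passes (F2P)
    and every originally passing test still passes (P2P), per Eq. (2)."""
    f2p = all(run_test(t) == "Pass" for t in t_fail)
    p2p = all(run_test(t) == "Pass" for t in t_pass)
    return f2p and p2p
```

Note that F2P alone would admit fixes that break unrelated behavior; the conjunction with P2P is what rules out regressions.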
Each sample is defined by a complete quadruple: a reproducible environment (Docker image + build scripts), buggy-state code, a leak-free task instruction, and a dual verification mechanism combining a rule-based verifier with multi-dimensional GRM scoring.

Code Comprehension Pipeline. While the Issue-PR and AutoBuilder pipelines focus on code editing capabilities, agentic SWE equally demands deep code comprehension: the ability to navigate, understand, and reason about large-scale codebases. To train this complementary skill, we design a seven-stage trajectory synthesis pipeline that produces interactive code understanding data grounded in real-world repositories.

The pipeline begins with large-scale repository discovery: we crawl high-star GitHub repositories via segmented search (partitioning by star ranges to bypass the API's 1,000-result limit) and apply a six-dimensional quality filter covering naming patterns, description keywords, language composition (≥50% primary-language code), contributor count (≥10), PR/Issue activity (≥50 each), and the presence of source code or build configuration files, retaining only repositories genuinely suitable for code comprehension tasks. For each qualifying repository, we retrieve structured project documentation via DeepWiki and pin the corresponding commit hash to ensure version consistency with the subsequent sandbox environment.

Next, we construct isolated Docker environments per repository (based on the pinned commit) and synthesize code comprehension queries using an LLM. The query synthesis is guided by a controlled design covering six question types (overview, code locating, implementation walkthrough, call-chain tracing, enhancement planning, and code review) across four difficulty levels, with balanced Chinese-English bilingual generation, yielding approximately eight queries per repository.
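The six-dimensional quality filter described above can be sketched as a simple predicate over repository metadata; the field names are illustrative assumptions, not the crawler's actual schema:

```python
def passes_quality_filter(repo: dict) -> bool:
    """Apply the six screening dimensions to one repository's metadata."""
    checks = [
        not repo.get("name_flagged", False),                 # naming-pattern screen
        not repo.get("description_flagged", False),          # description-keyword screen
        repo.get("primary_language_fraction", 0.0) >= 0.5,   # >=50% primary-language code
        repo.get("contributors", 0) >= 10,
        repo.get("pr_count", 0) >= 50 and repo.get("issue_count", 0) >= 50,
        repo.get("has_source_or_build_files", False),
    ]
    return all(checks)
```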
Trajectory synthesis is then performed by deploying a Claude Code Agent inside each Docker container, where the agent autonomously explores the codebase using its full toolset (file reading, grep search, bash execution) to answer the generated queries, with a maximum of 150 interaction turns per task. The resulting raw trajectories are converted from Anthropic format to OpenAI-compatible training format.

3.2.2. WebCoding Expert: Aesthetic-Aware UI Generation

The WebCoding Expert targets automatic generation of frontend pages (HTML/CSS/JS) with commercial-grade visual quality from natural language input, focusing on Landing Pages, Presentations, and Data Visualizations. The core challenge is that real users predominantly provide colloquial, short-form inputs (e.g., "make it cool and street-style"), while general-purpose models suffer from aesthetic collapse under such low-information-density inputs, regressing to conservative defaults (blue-white palettes, single-column grids).

Tri-Perspective Label System. We propose a label system that maintains three aligned views for each design specification: user perception → design rationale → technical implementation. Formally, the system defines a structured mapping L : V_user → V_design → V_impl, decomposed into seven hierarchical levels {L_k} for k = 1, ..., 7 (L1 style guidance → L2-L4 global visual/animation/typography norms → L5-L7 module-level specifications/technical implementation/asset manifests). Colloquial user inputs typically cover only L1; the model autoregressively infers the remaining levels:

L̂_k = f_θ(L_1, L̂_2, ..., L̂_{k-1}),  k = 2, ..., 7    (3)

transforming the generation from a black box into a traceable, structured derivation.
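The autoregressive derivation of Eq. (3) amounts to a simple loop in which each level conditions on all previously inferred levels; `predict_level`, standing in for the model call f_θ, is an illustrative stub:

```python
def derive_levels(l1: str, predict_level) -> list:
    """Infer levels L2..L7 from the user-provided L1, per Eq. (3)."""
    levels = [l1]                # L1: colloquial style guidance from the user
    for k in range(2, 8):        # L2 ... L7
        # Each L_k conditions on L1 and every previously inferred level.
        levels.append(predict_level(k, list(levels)))
    return levels
```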
Data Synthesis and Prompt Rewriting. Data construction proceeds in four stages: high-quality design screenshot collection, reverse-engineered structured prompts, seed HTML generation with designer screening, and large-scale training data derivation via a Teacher Model. To bridge the distributional gap between verbose structured prompts and real user inputs, we adopt a Prompt Rewriting strategy: for each HTML, we construct three semantically equivalent prompt variants: a designer-annotated version (>1000 words), a professional-user version (200-300 words), and an ordinary-user version (~50 words). This spectrum enables consistent visual quality across varying input granularities.

Aesthetic Evaluation. A critical distinction in frontend generation is between code fidelity (whether the generated HTML/CSS renders correctly without errors) and aesthetic fidelity (whether the rendered page achieves professional-grade visual quality as judged by trained designers). Code fidelity is a necessary but insufficient condition for aesthetic fidelity: a page that renders without errors can still score anywhere from poor to excellent on aesthetic quality. Existing benchmarks (e.g., WebArena [13], Design2Code [14]) predominantly measure code fidelity or pixel-level similarity against a reference design, leaving a systematic gap in evaluating aesthetic quality, particularly in the reference-free Text-to-UI setting where no ground-truth design exists.

To address this gap, we establish the first systematic reference-free aesthetic evaluation benchmark for Text-to-UI generation. All test prompts are drawn exclusively from colloquial, ordinary-user inputs (e.g., "make it cool and street-style"), directly testing the model's ability to infer complete aesthetic decisions from low-information-density descriptions.
For Landing Pages, we decompose aesthetic fidelity into 10 independent dimensions spanning four layers:

• Structural layer: Layout (spatial rhythm, alignment consistency, visual hierarchy) and Typography (font-size hierarchy, weight contrast, line spacing);
• Visual layer: Color (primary/secondary/accent color system coherence), Font (typeface selection and scene appropriateness), Image (style consistency and thematic relevance), Background (section differentiation and gradient quality), and Elements (icon consistency, SVG quality, logo fidelity);
• Component layer: Components (button/card/form style consistency and primary-secondary visual distinction);
• Dynamic layer: Interaction (hover/active/focus feedback richness) and Animation (entry sequence design, timing, and visual appeal).

For Presentations (Slides), we adopt a streamlined 5-dimension evaluation comprising Layout, Typography, Color, Image, and Elements. The remaining five dimensions are removed because: (i) font selection in slides conventionally defaults to system fonts for cross-platform compatibility; (ii) uniform backgrounds across slides are standard practice; (iii) slide components are structurally simpler than Landing Page counterparts; and (iv) slide transitions and interactions are baseline expectations rather than aesthetic differentiators.

Each dimension is scored on a 0-5 scale with precisely anchored rubrics: 0 denotes complete absence, 1 indicates rendering failure, and 5 requires flawless execution with notable design excellence. All evaluations are conducted by a calibrated professional UI/UX designer panel under standardized conditions (Chrome, 1920×1080 viewport, full interactive review including scroll, hover, and click), ensuring that dynamic dimensions such as interaction and animation are properly assessed rather than judged from static screenshots alone.
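As a hedged sketch of how such per-dimension panel scores might be aggregated: the report does not specify an aggregation formula, so the equal weighting below is an assumption made for illustration.

```python
LANDING_PAGE_DIMENSIONS = [
    "layout", "typography",                               # structural layer
    "color", "font", "image", "background", "elements",   # visual layer
    "components",                                         # component layer
    "interaction", "animation",                           # dynamic layer
]

def aggregate_panel_score(scores: dict) -> float:
    """Average 0-5 rubric scores over the 10 Landing Page dimensions."""
    for dim, s in scores.items():
        if not 0 <= s <= 5:
            raise ValueError(f"{dim} score {s} outside the 0-5 rubric scale")
    missing = set(LANDING_PAGE_DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(scores[d] for d in LANDING_PAGE_DIMENSIONS) / len(LANDING_PAGE_DIMENSIONS)
```

The Slides variant would simply swap in the 5-dimension list (Layout, Typography, Color, Image, Elements).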
3.2.3. Terminal Expert: Interactive Command-Line Reasoning

The Terminal Expert targets complex tasks in real terminal environments (system configuration, experiment reproduction, DevOps operations, and general software engineering) requiring broad domain knowledge and autonomous decision-making. Each training sample contains a task instruction, a reference solution, an automated test script, and a reproducible Docker environment. We construct data from four complementary sources:

• Expert-annotated data: domain experts manually author tasks across 12 technical domains (DevOps, data science, security, computational biology, etc.), screened for sufficient difficulty via mainstream model evaluation.
• Multi-agent synthetic data: over 20 parallel agent instances automatically produce verifiable tasks including descriptions, Docker environments, and test scripts.
• Cross-format adaptation: SWE-format tasks are converted into Terminal format via AutoBuilder, yielding 100K+ verifiable tasks across 10 programming languages.
• Open-source integration: established datasets including CLI-Gym [15] and TermiGen [16], covering 420+ unique CLI tools across 11 task categories.

3.2.4. WebSearch Expert: Agentic Search

The WebSearch Expert trains the model to answer complex questions by actively invoking search tools and performing multi-hop inference. We construct over 100K training samples through a pipeline centered on search-trajectory-based knowledge graph construction and multi-stage filtering.

Knowledge Graph Construction from Search Trajectories. Within a single search trajectory, the web pages visited sequentially form a naturally coherent document set. We exploit this coherence to construct knowledge graphs: named entities are extracted and linked via co-occurrence relations across pages, forming bridging nodes.
Multi-hop subgraphs are sampled along paths of controllable depth, and an LLM generates QA pairs by masking key entities—ensuring questions cannot be answered by parametric memory alone.

Filtering and Rejection Sampling. Raw data undergoes two rounds of filtering. First, samples answerable without tools are removed. Second, each sample is independently sampled $K = 8$ times, yielding an empirical success rate:

$$\hat{r} = \frac{1}{K} \sum_{j=1}^{K} \mathbb{1}[a_j = a^*] \tag{4}$$

where $a_j$ is the $j$-th sampled answer and $a^*$ the ground truth. Trivially easy ($\hat{r} = 1$) and intractable ($\hat{r} = 0$) samples are discarded, retaining only the intermediate band that maximizes the effective gradient signal for policy optimization. Rejection sampling fine-tuning then selects positive trajectories satisfying three criteria: correct final answer, no failed tool calls, and no duplicate queries.

3.2.5. General Expert: Instruction Following and Code-Math Reasoning

The General Expert maintains the model's core competitiveness across general-purpose scenarios, covering three directions: instruction following (format, content, and compositional constraints with fine-grained violation penalties), general QA (open-domain knowledge, multi-turn dialogue with long-conversation samples involving topic shifts and cross-turn dependencies), and code-math reasoning (competition-level programming and mathematical problem sets from elementary to advanced levels, verified through online judge systems). This expert ensures fundamental capabilities are retained while domain-specific experts are strengthened.

3.3. Reinforcement Learning

3.3.1. Agentic RL

Agentic Scaling. While SFT establishes a model's foundational instruction-following and code-generation capabilities, Reinforcement Learning is crucial for advancing its exploration and reasoning in complex, long-horizon tasks.
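The difficulty-band filtering used for the WebSearch Expert (Eq. 4) can be sketched as follows. This is a minimal sketch under stated assumptions: `solve` is a hypothetical callable that attempts the question once, and `Sample` is an illustrative container, neither taken from the authors' code.

```python
# Hypothetical sketch of the success-rate filter of Eq. (4): each candidate
# sample is attempted K times and kept only if its empirical success rate
# lies strictly between 0 and 1. `solve` and `Sample` are illustrative names.
from dataclasses import dataclass
from typing import Callable, List

K = 8  # number of independent attempts per sample

@dataclass
class Sample:
    question: str
    answer: str  # ground-truth a*

def empirical_success_rate(sample: Sample, solve: Callable[[str], str]) -> float:
    """Eq. (4): r_hat = (1/K) * sum_j 1[a_j == a*]."""
    hits = sum(1 for _ in range(K) if solve(sample.question) == sample.answer)
    return hits / K

def filter_intermediate_band(samples: List[Sample],
                             solve: Callable[[str], str]) -> List[Sample]:
    # Discard trivially easy (r_hat == 1) and intractable (r_hat == 0) samples.
    return [s for s in samples if 0.0 < empirical_success_rate(s, solve) < 1.0]
```

Keeping only the intermediate band concentrates training on samples where the policy's success is uncertain, which is where the gradient signal is largest.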
However, conventional RL datasets—typically derived from simple question-answering pairs or static environments—fail to capture the inherent variability and complexity of real-world agentic scenarios. To bridge this gap, we propose an RL data synthesis paradigm termed Agentic Scaling. Leveraging a foundational task pool curated by our internal AutoBuilder system, we systematically scale the training data across three critical dimensions: Task Complexity, Intent Alignment, and Scaffold Generalization. This pipeline yields a large-scale, high-quality RL dataset comprising over 100,000 diverse samples.

Effective policy optimization requires training on tasks poised near the model's capability frontier. Using the AutoBuilder pool, we employ a state-of-the-art closed-source model acting as both Teacher and Judge to generate and robustly verify trajectories within a secure sandbox. We explicitly filter out easily solvable tasks, retaining only challenging instances that necessitate extensive reflection or iterative refinement, even for the frontier teacher model. These verified, high-difficulty trajectories provide the critical learning signals necessary to unlock deeper reasoning during RL.

A primary challenge in real-world deployment is a distinct Sim-to-Real Gap: whereas training data typically features well-structured, expert-crafted prompts, end-users frequently provide incomplete or ambiguous instructions. To enhance robustness against this discrepancy, we apply semantic augmentation to task descriptions, ensuring that each real-world code commit maps to a diverse set of prompts. Using LLMs, we rewrite standardized task specifications into a spectrum of variants—ranging from detailed expert instructions to colloquial, underspecified queries. This "one-commit-to-multiple-prompts" strategy compels the model to accurately infer the underlying engineering intent from noisy, realistic user inputs.
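The one-commit-to-multiple-prompts augmentation can be sketched as follows. This is an illustrative reconstruction, not the authors' pipeline: `rewrite_llm` is a hypothetical LLM call, stubbed here with fixed templates, and the register names are assumptions.

```python
# Illustrative sketch of "one-commit-to-multiple-prompts": one standardized
# task specification is rewritten into several register variants, from expert
# phrasing down to a deliberately ambiguous request. `rewrite_llm` stands in
# for a real LLM rewrite call (assumption).
from typing import Dict

REGISTERS = [
    "expert",          # detailed, well-structured specification
    "concise",         # terse summary of the change
    "colloquial",      # informal user phrasing
    "underspecified",  # ambiguous request missing details
]

def rewrite_llm(spec: str, register: str) -> str:
    # Stand-in for an LLM rewrite; a real system would prompt a model here.
    templates = {
        "expert": f"Implement the following change precisely: {spec}",
        "concise": f"Fix: {spec}",
        "colloquial": f"hey, can you make it so that {spec.lower()}",
        "underspecified": "Something is off with this feature, please fix it.",
    }
    return templates[register]

def augment_commit(commit_spec: str) -> Dict[str, str]:
    """Map one commit specification to multiple prompt variants."""
    return {r: rewrite_llm(commit_spec, r) for r in REGISTERS}
```

Training on all variants of the same commit forces the model to recover the same underlying intent regardless of how carefully the request is worded.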
Furthermore, to prevent overfitting to any single agent framework, we treat the scaffold itself as an independent variable during data synthesis. We generate trajectories using black-box scaffolds (Claude Code, OpenCode, Kilo Code, etc.), which operate in highly abstracted environments and emphasize final task outcomes, alongside white-box variants (SWE-agent⁶). Training the model across multiple scaffolds for the identical task fosters scaffold-agnostic, highly transferable problem-solving behaviors.

Finally, we generalize the RL format into a unified 5-tuple representation:

$$\mathcal{D}_{\text{RL}} = \langle \mathcal{E}, \mathcal{T}_{\text{tools}}, \mathcal{S}_{\text{agent}}, \mathcal{I}_{\text{task}}, \mathcal{V}_{\text{verifier}} \rangle \tag{5}$$

where $\mathcal{E}$ denotes the execution environment, $\mathcal{T}_{\text{tools}}$ the available toolset, $\mathcal{S}_{\text{agent}}$ the specific scaffold and system prompt, $\mathcal{I}_{\text{task}}$ the task instruction, and $\mathcal{V}_{\text{verifier}}$ the verification and reward signals.

⁶ https://github.com/swe-agent/swe-agent

This comprehensive formulation captures the rich supervision required for interaction-heavy coding tasks, establishing a robust data foundation for scaling agentic RL.

Modified Turn-level Policy Optimization. While Group Relative Policy Optimization (GRPO [17]) efficiently eliminates the value model, its token-level importance sampling can introduce high variance in long-horizon agent scenarios like Software Engineering (SWE). Conversely, Group Sequence-level Policy Optimization (GSPO) [18] improves stability by aggregating probabilities across the entire trajectory. However, applying a single sequence-level ratio and advantage makes temporal credit assignment challenging in multi-turn environments, as it obscures which specific turn (e.g., Turn 1 vs. Turn 5) led to the ultimate outcome.

To balance training stability and precise credit assignment, we introduce a turn-level adaptation of GSPO. We operate at the granularity of an interaction turn by partitioning the full generated sequence $y$ into $N$ discrete turns.
For each turn $n$ containing a subset of tokens $\mathcal{T}_n$, we compute an independent importance ratio:

$$r^{(n)}_{\text{turn}}(\theta) = \prod_{i \in \mathcal{T}_n} \frac{\pi_\theta(y_i \mid x, y_{<i})}{\pi_{\theta_{\text{old}}}(y_i \mid x, y_{<i})} \tag{6}$$

The clipped surrogate objective over a group of trajectories $\tau$ is then formulated as:

$$\mathcal{L}_{\text{Turn}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}} \left[ \frac{1}{N} \sum_{n=1}^{N} \min\left( r^{(n)}_{\text{turn}}(\theta)\, A_n,\; \text{clip}\left(r^{(n)}_{\text{turn}}(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_n \right) \right] \tag{7}$$

where $A_n$ is the group-level advantage.

By evaluating the probability shift of entire action blocks, this formulation aligns closely with the Markov Decision Process (MDP) view of LLM agents. It preserves the variance-reduction benefits of sequence-level optimization while ensuring fine-grained credit assignment. Furthermore, dynamically defining turn boundaries based on scaffold-specific markers seamlessly accommodates our multi-scaffold data, significantly accelerating convergence and enhancing the model's self-correction capabilities in long-step debugging.

Monte-Carlo Logprob Averaging. RL training of Mixture-of-Experts (MoE) models is widely known to be unstable, often attributed to policy mismatch between the rollout and training phases. In addition, we identify another key factor: the high variance of trajectory log-probability estimates, which leads to unstable gradient directions during optimization. Specifically, policy gradient methods rely on Monte Carlo estimation of the form:

$$\nabla J(\theta) = \mathbb{E}_{a \sim \pi_{\text{rollout}}} \left[ R(a)\, \frac{\pi_{\text{train}}(a)}{\pi_{\text{rollout}}(a)}\, \nabla \log \pi_{\text{train}}(a) \right] \tag{8}$$

where the importance weight depends on estimated log-probabilities.
In practice, due to the inherent stochasticity of MoE architectures (e.g., stochastic expert routing, capacity dropping, or numerical variance), the estimated policy log-probability is noisy:

$$\log \pi(a) = \log \pi^*(a) + \epsilon \tag{9}$$

This noise induces high variance in the importance weights:

$$w(a) = \exp\left( \log \pi_{\text{train}}(a) - \log \pi_{\text{rollout}}(a) \right) \tag{10}$$

[Figure 5 | Overview of Tree Training in agentic RL. The figure depicts a prefix tree (trie) over agent trajectories: one prefix is shared by all trajectories, and deeper prefixes are shared by subsets of trajectories (e.g., trajectories 1–3 and trajectories 4–5), so each root-to-leaf path corresponds to one trajectory.]

As a result, the variance of the estimator, $\text{Var}\left[ R(a)\, w(a)\, \nabla \log \pi_{\text{train}}(a) \right]$, can become excessively large, leading to unstable gradient directions. To address this, we adopt a simple yet effective variance-reduction strategy termed MCLA (Monte-Carlo Log-probability Averaging). During training, the forward pass for each trajectory is prefilled $K$ times, and the corresponding log-probabilities are averaged:

$$\overline{\log \pi}(a) = \frac{1}{K} \sum_{k=1}^{K} \log \pi^{(k)}(a), \quad K = 8 \tag{11}$$

thereby significantly reducing the variance of the trajectory-level estimator. In addition, we combine MCLA with IcePop (which suppresses training–inference misalignment in RL training by clipping excessive-discrepancy tokens), aligning routing decisions between rollout and training and further reducing system-level mismatch. These two components are strictly complementary: log-probability averaging reduces estimator variance, while IcePop mitigates distributional inconsistency.
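The MCLA estimator of Eq. (11) can be sketched in a few lines. This is a minimal sketch under stated assumptions: `logprob_fn` is a hypothetical callable returning one noisy estimate of the same trajectory log-probability per call (e.g., one prefill pass of a stochastic MoE model).

```python
# A minimal sketch of MCLA: average K independent, noisy estimates of a
# trajectory's log-probability before forming the importance weight.
# `logprob_fn` is a hypothetical stand-in for one prefill forward pass.
import math
from typing import Callable

K = 8  # number of prefill passes per trajectory

def mcla_logprob(logprob_fn: Callable[[], float], k: int = K) -> float:
    """Eq. (11): average k independent log-probability estimates."""
    return sum(logprob_fn() for _ in range(k)) / k

def importance_weight(logp_train: float, logp_rollout: float) -> float:
    """Importance weight pi_train(a) / pi_rollout(a), computed in log space.
    Averaging the log-probabilities first shrinks their noise variance by a
    factor of roughly k, stabilizing this exponentiated quantity."""
    return math.exp(logp_train - logp_rollout)
```

Because the weight exponentiates the log-probability gap, even modest noise in the estimates is amplified multiplicatively, which is why averaging before exponentiation helps.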
Empirically, this synergy results in highly stable training, faster convergence, and superior final performance.

3.3.2. Agentic Engineering

RL Framework. To robustly support our policy optimization algorithms and the massive scale of agentic data, we developed KRL (Kwai RL), a highly optimized reinforcement learning framework built around two core system innovations. First, Tree Training addresses the severe computational bottleneck of group sampling by eliminating the redundant calculation of shared prefixes across trajectory branches, achieving an approximately 6× acceleration during training. Second, KRL is engineered for high-efficiency, large-scale sandbox environment training. To handle the complex and asynchronous interactions between the policy model and diverse execution scaffolds, we integrated Cache-Aware intelligent scheduling to maximize KV Cache hit rates and Dynamic Streaming for fine-grained pipeline orchestration. By seamlessly interleaving the generation (Rollout) and weight update (Training) phases across massive sandbox instances, this architecture reduces the overall unit sample cost by 2.8×, providing an essential, cost-effective infrastructural foundation for agentic scaling.

Tree Training. Modern agent scaffolds—particularly those employing sub-agents, concurrent tool invocations, and context engineering—produce training trajectories that are fundamentally tree-structured rather than linear.

[Figure 6 | Overview of the RL and sandbox framework. The figure shows RL training tasks flowing through KwaiEnv to agent scaffolds (SWE-agent, OpenCode, Kilo Code, Claude Code) backed by a per-task sandbox cluster, with rollouts served by SGLang servers and parameter updates performed by a Megatron trainer inside KRL.]
When a scaffold spawns parallel sub-agents, manages multi-turn context windows with selective retention, or discards intermediate reasoning tokens between turns, the tokens generated by a single task cannot be represented as a flat sequence. The context fed into each subsequent turn is not the direct concatenation of prior turns, yielding branches with deeply shared prefixes. Consequently, naively linearizing these trajectories into independent sequences causes shared prefixes to be recomputed redundantly in every forward and backward pass, imposing a training cost that grows proportionally with the degree of branching. As agent scaffolds grow more sophisticated, this overhead compounds and becomes a fundamental bottleneck.

To eliminate this redundancy, as shown in Figure 5, we employ Tree Training [19], which serializes the entire trajectory tree into a single Depth-First Search (DFS) flattened sequence and applies a per-token loss weight. This simple reweighting is provably sufficient: by the linearity of differentiation, the resulting gradients are exactly equivalent to those of the baseline that trains on all root-to-leaf paths independently, requiring only the negligible computational overhead of an element-wise scalar multiplication on the per-token loss tensor. Correct computation further relies on three lightweight components: a tree-structured attention mask (built on FlashAttention V3) that restricts each token's attention to its own root-to-leaf path, per-token position IDs that restore each token's original sequence position rather than its offset in the flattened tree, and the gradient scaling weights described above. The implementation is orthogonal to standard distributed parallelism strategies (TP, EP, DP, PP) and integrates seamlessly with the rest of the training infrastructure.
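The DFS flattening can be sketched as follows. This is an illustrative reconstruction with string tokens, not the authors' FlashAttention-based implementation; it assumes the baseline sums (rather than averages) per-sequence losses, so a token's loss weight equals the number of root-to-leaf paths passing through it.

```python
# Illustrative sketch of Tree Training's serialization: trajectories sharing
# prefixes are stored in a trie, then flattened depth-first into one sequence
# with (a) per-token position IDs equal to the token's depth on its own path
# and (b) a loss weight equal to the number of root-to-leaf paths through the
# token. Summing weighted token losses then matches training every path
# independently (under the sum-of-losses baseline assumed above).
from typing import Dict, List, Tuple

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}

def build_trie(trajectories: List[List[str]]) -> TrieNode:
    root = TrieNode()
    for traj in trajectories:
        node = root
        for tok in traj:
            node = node.children.setdefault(tok, TrieNode())
    return root

def leaf_count(node: TrieNode) -> int:
    """Number of root-to-leaf paths passing through this node."""
    if not node.children:
        return 1
    return sum(leaf_count(c) for c in node.children.values())

def dfs_flatten(root: TrieNode) -> Tuple[List[str], List[int], List[int]]:
    """Return (tokens, position_ids, loss_weights) for the flattened tree."""
    tokens: List[str] = []
    pos_ids: List[int] = []
    weights: List[int] = []
    def dfs(node: TrieNode, depth: int) -> None:
        for tok, child in node.children.items():
            tokens.append(tok)
            pos_ids.append(depth)              # original position, not flat offset
            weights.append(leaf_count(child))  # paths through this token
            dfs(child, depth + 1)
    dfs(root, 0)
    return tokens, pos_ids, weights
```

Each shared token now appears exactly once in the flattened sequence, while its weight restores the multiplicity it would have had across the independent linearized paths.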
On real-world agentic RL rollouts collected from the diverse scaffolds described above, Tree Training achieves up to a 6.2× end-to-end training speedup.

High-Concurrency Sandbox Training. To efficiently orchestrate the complex interactions between the policy model and diverse execution scaffolds, our framework implements a highly concurrent, asynchronous pipeline. As illustrated in Figure 6, the end-to-end training loop proceeds through the following stages:
1. Instance Sampling: We sample a training batch of executable tasks from the dataset.
2. Agent Allocation: Utilizing KwaiEnv, the system assigns a specific agent scaffold type to each sample and dispatches the execution request to the remote sandbox cluster.
3. Sandbox Initialization: Powered by Wanqing (Kuaishou's proprietary large-scale container cloud platform), the remote sandbox cluster dynamically provisions the appropriate Docker-isolated environments. Each sandbox securely encapsulates a single execution trajectory and initiates interaction requests to the SGLang inference engine.
4. Request Routing: The KRL router dynamically orchestrates incoming requests across multiple SGLang inference servers to guarantee strict load balancing.
5. Rollout Generation: The SGLang inference service iteratively generates trajectory data, totaling a volume of batch_size × group_size. For agentic multi-turn scenarios, the model actively interacts with the sandbox environment until task completion, maximum turn limits, or time constraints are reached.
6. Reward Computation: Rollouts are evaluated against task-specific rules or verifier models to acquire environmental rewards. The system calculates advantages and subsequently executes trajectory packing across the rollouts.
7. Engine Switching: We perform a critical live context switch, transitioning the active GPU resources from hosting the SGLang inference service to the Megatron training service.
8.
Parameter Optimization: The model undergoes policy training and parameter updates using the packed trajectories. Following the update, the newly refined model weights are seamlessly synchronized back to the SGLang servers.
9. Iteration: The unified cycle instantly advances to train the next batch.

3.4. Expert Fusion via On-Policy Distillation

After developing highly specialized experts across diverse domains (e.g., coding, reasoning), we face the challenge of amalgamating them into a single omni-capable model. Direct weight averaging causes catastrophic forgetting, while standard RL provides feedback too sparse to pinpoint intermediate reasoning errors. Conversely, off-policy SFT suffers from exposure bias. To bridge this gap, we adopt On-Policy Distillation (OPD)⁷, which combines the active exploration of RL with the dense supervision of knowledge distillation.

During training, the unified Student model actively generates complete trajectories across mixed-domain prompts. Crucially, we jointly optimize the Student using both a standard RL loss and an expert-guided OPD loss. For the RL component, the environment (e.g., execution sandboxes) provides sparse grounding rewards to ensure final task success. Concurrently, for the OPD component, we dynamically select the best-performing expert for each specific task to act as the Teacher. This designated expert evaluates the Student's on-policy rollouts and provides dense, step-level supervision via its log-probabilities. This approach avoids computationally expensive full-logit distillation while providing unbiased optimization targets.

Condensing diverse, specialized capabilities into a single set of model weights inevitably incurs a slight performance degradation compared to the isolated experts, primarily due to capacity constraints and cross-domain interference. However, our joint optimization effectively mitigates catastrophic forgetting.
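The joint objective can be sketched conceptually as follows. This is a sketch under stated assumptions, not the authors' code: the RL term is a generic REINFORCE-style surrogate, the distillation term is a per-token reverse-KL-style penalty on student-sampled tokens, and the weighting `beta` is hypothetical.

```python
# Conceptual sketch of a joint RL + on-policy-distillation loss: a sparse
# reward term on the student's own rollout plus a dense per-token term built
# from teacher log-probabilities on the same rollout. Plain lists stand in
# for tensors; `beta` is an illustrative mixing weight (assumption).
from typing import List

def opd_loss(student_logps: List[float],
             teacher_logps: List[float]) -> float:
    """Dense step-level distillation: mean of (student - teacher)
    log-probabilities on student-sampled tokens (a reverse-KL sample)."""
    n = len(student_logps)
    return sum(s - t for s, t in zip(student_logps, teacher_logps)) / n

def joint_loss(student_logps: List[float],
               teacher_logps: List[float],
               reward: float,
               beta: float = 0.5) -> float:
    """Sparse REINFORCE-style term plus the dense OPD term. The RL term
    scales the student's negative mean log-likelihood of its own rollout
    by the trajectory reward."""
    n = len(student_logps)
    rl_term = -reward * sum(student_logps) / n
    return rl_term + beta * opd_loss(student_logps, teacher_logps)
```

When the student already matches the teacher, the distillation term vanishes and only the environmental reward shapes the update.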
By aligning the Student's active reasoning with the best experts' log-probabilities while grounding it in environmental success, OPD successfully minimizes this performance drop, ultimately yielding a robust and highly capable unified model.

⁷ https://thinkingmachines.ai/blog/on-policy-distillation

4. Evaluation

To systematically evaluate the capabilities of KAT-Coder-V2, we conducted a comprehensive analysis of its performance across multiple representative benchmarks based on the KwaiEnv evaluation platform. This evaluation covers four core dimensions: multi-scaffold coding capability, agent task execution capability, frontend aesthetics generation capability, and general task processing capability. Overall, the results indicate that KAT-Coder-V2 demonstrates outstanding performance across these dimensions, placing it among the top tier of coding models.

4.1. Multi-Scaffold Coding

In real-world AI coding scenarios, developers often choose different development scaffolds based on personal habits, team norms, or specific business requirements. These scaffolds often exhibit significant differences in prompt construction, Tool Use protocols, and context organization and management mechanisms, thereby posing high demands on the cross-scaffold generalization capabilities of foundation models.

We evaluated KAT-Coder-V2 across major mainstream scaffolds, using their native interaction protocols and system prompt configurations, on SWE-bench Verified [7], SWE-bench Multilingual [20], and a subset of SWE-rebench-V2 [21]. Evaluation results show that the model maintains stable performance across different scaffold environments. Its core metrics on mainstream frameworks such as Claude Code, OpenClaw, and OpenCode are comparable to the most advanced coding model.
Thanks to its outstanding framework generalization capability, KAT-Coder-V2 is compatible with over 10 mainstream AI coding scaffolds, providing developers with flexible choices.

Table 2 | Evaluation Results on Software Engineering Tasks under Multiple Scaffolds. Data points marked with * are taken from https://www.anthropic.com/news/claude-opus-4-6. Other data points are evaluated on KwaiEnv.

Benchmark | Scaffold | KAT-Coder-V2 | Claude Opus 4.6
SWE-bench Verified | Claude Code | 79.6 | 80.8*
SWE-bench Verified | OpenCode | 74.8 | 75.0
SWE-bench Verified | OpenClaw | 72.8 | 75.7
SWE-bench Multilingual | Claude Code | 75.4 | 77.8*
SWE-bench Multilingual | OpenCode | 71.2 | 70.2
SWE-rebench-V2 (subset) | Claude Code | 43.3 | 43.7
SWE-rebench-V2 (subset) | OpenCode | 38.7 | 37.3

4.2. Agent Task Execution

With the rapid rise of new-generation AI Agent frameworks represented by OpenClaw, AI coding tools are further evolving toward the autonomous execution of real-world complex tasks. This trend introduces more complex skill invocation, task orchestration mechanisms, and dynamic scheduling strategies, placing higher-order demands on the model's interaction and tool use capabilities.

To evaluate KAT-Coder-V2 under these conditions, we performed systematic testing on the PinchBench and Claw-Eval benchmarks based on the OpenClaw framework. The results indicate that under stress scenarios such as scheduled triggering, high-concurrency request processing, and long-chain task execution, KAT-Coder-V2 demonstrates strong execution efficiency and response stability.

Table 3 | Evaluation Results on Real-World Agent Benchmarks under OpenClaw. Data points of other models are taken from https://pinchbench.com/ and https://claw-eval.github.io/ (retrieved on March 25, 2026).

Benchmark | Metric | KAT-Coder-V2 | GLM-5 | MiniMax M2.7 | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro
PinchBench | Best Score | 88.7 | 86.4 | 87.1 | 87.4 | 90.5 | 86.7
PinchBench | Average Score | 81.9 | 80.3 | 81.8 | 82.3 | 81.6 | 75.9
Claw-Eval | Pass^3 | 55.6 | 57.7 | 51.9 | 66.3 | 66.3 | 50.0
Claw-Eval | Average Score | 73.4 | 73.0 | 70.7 | 79.3 | 80.6 | 74.2

4.3.
Frontend Aesthetics Generation

Interface aesthetics are an important component of frontend generation quality, directly affecting users' visual perception and interactive experience. Targeting this dimension, we constructed a systematic aesthetic evaluation benchmark specifically oriented toward reference-free design scenarios. This benchmark covers three typical application scenarios: Landing Pages, Slides, and Data Visualization. Among them, Landing Pages are further divided into 10 evaluation dimensions, while Slides and Data Visualization are each divided into 5 dimensions. Standardized anchor-based scoring scales were designed for each dimension. All queries in the evaluation test set are based on the colloquial expressions of ordinary users, and the assessments are conducted through blind evaluations by a professional UI/UX designer team under standardized experimental conditions. The results show that KAT-Coder-V2 achieved leading aesthetic scores across all three scenarios, demonstrating strong user intent understanding and frontend visual generation capabilities.

Table 4 | Evaluation Results on Frontend Generation Tasks.

Benchmark | KAT-Coder-V2 | GLM-5 | Kimi K2.5
Landing Page | 59.8 | 57.6 | 54.6
Slides | 57.6 | 42.8 | 34.8
Data Visualization | 67.6 | 42.4 | 46.0

4.4. General Task Processing

In real-world coding scenarios, a strong model must not only complete basic code generation and optimization but also excel in end-to-end complex tasks, multi-turn interactive reasoning, long-context understanding, and high-precision instruction following.

[Figure 7 | Frontend Generation Results of KAT-Coder-V2: (a) Landing Page 1, (b) Landing Page 2, (c) Slides, (d) Data Visualization.]

Focusing on these key capabilities, we systematically evaluated KAT-Coder-V2 on multiple mainstream benchmarks, including Terminal-Bench Hard [8], τ²-Bench Telecom [9, 22], AA-LCR [23], and IFBench [24].
The results show that KAT-Coder-V2 achieved competitive scores across various general scenarios. Its strong foundational capabilities provide solid support for its adaptability, robustness, and stability in complex programming environments, further enhancing its comprehensive performance in real-world development tasks.

Table 5 | Evaluation Results on General Tasks. Data points of KAT-Coder-V2 are evaluated on KwaiEnv. Data points of other models are from https://artificialanalysis.ai/evaluations (retrieved on March 25, 2026).

Benchmark | KAT-Coder-V2 | GLM-5 | MiniMax M2.7 | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro
Terminal-Bench Hard | 46.8 | 43.2 | 39.4 | 46.2 | 57.6 | 53.8
τ²-Bench Telecom | 93.9 | 98.2 | 84.8 | 92.1 | 91.5 | 95.6
AA-LCR | 68.0 | 63.3 | 68.7 | 70.7 | 74.0 | 72.7
IFBench | 67.0 | 72.3 | 75.7 | 53.1 | 73.9 | 77.1

5. Conclusion

In this report, we have introduced KAT-Coder-V2, a comprehensive agentic coding model that demonstrates domain-specialized training, large-scale agentic RL, and unified distillation as a principled path toward building powerful coding agents. By decomposing capabilities into orthogonal expert domains and fusing them through on-policy distillation, KAT-Coder-V2 retains expert-level performance across SWE, frontend generation, terminal reasoning, and general tasks within a single model. The accompanying infrastructure (KwaiEnv) and algorithmic innovations (MCLA, Tree Training), together with systematic agentic scaling across task complexity, prompt diversity, and scaffold generalization, collectively enable stable, efficient training at scale. With these strengths, KAT-Coder-V2 closely rivals the strongest proprietary coding models across multiple scaffolds and benchmarks. However, gaps remain on certain agent execution benchmarks such as Claw-Eval, which we aim to narrow through further scaling of agentic RL and richer environment interaction.
Future work will also focus on extending the Specialize-then-Unify paradigm to broader agentic domains beyond coding and exploring more efficient expert fusion strategies to fully unlock the potential of domain-specialized training.

6. Contribution

Contributors' names are listed in alphabetical order by first name.

Core Contributors
Fengxiang Li, Han Zhang, Haoyang Huang, Jinghui Wang, Jinhua Hao, Kun Yuan, Mengtong Li, Minglei Zhang, Pengcheng Xu, Wenhao Zhuang, Yizhen Shao, Zongxian Feng

Contributors
Can Tang, Chao Wang, Chengxiao Tong, Fan Yang, Gang Xiong, Haixuan Gao, Han Gao, Hao Wang, Haochen Liu, Hongliang Sun, Jiabao Li, Jingwen Chang, Jun Du, Junyi Peng, Leizhen Cui, Meimei Jing, Mingqi Wu, Shangpeng Yan, Shaotong Qi, Suzhe Xu, Wenxuan Zhao, Xianda Sun, Xuan Xie, Yanbo Wang, Yao Xia, Yinghan Cui, Yingpeng Chen, Yong Wang, Yuze Shi, Zhiwei Shen, Ziyu Wang

Tech Leads
Ming Sun, Lin Ye, Bin Chen

References

[1] Anthropic. Claude Opus 4.6 system card, 2026.
[2] Google DeepMind. Gemini 3 Pro model card, 2025.
[3] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
[4] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
[5] Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong, Bojun Wang, Boyu Chen, Brian Li, Buyun Ma, et al. Step 3.5 Flash: Open frontier-level intelligence with 11B active parameters. arXiv preprint arXiv:2602.10604, 2026.
[6] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
[7] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024.
[8] Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026.
[9] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-Bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.
[10] Zizheng Zhan, Ken Deng, Jinghui Wang, Xiaojiang Zhang, Huaixi Tang, Minglei Zhang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, et al. KAT-Coder technical report. arXiv preprint arXiv:2510.18779, 2025.
[11] Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. SWE-bench Pro: Can AI agents solve long-horizon software engineering tasks? arXiv preprint arXiv:2509.16941, 2025.
[12] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination-free evaluation of large language models for code. arXiv preprint, 2024.
[13] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.
[14] Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: Benchmarking multimodal code generation for automated front-end engineering.
In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3956–3974, 2025.
[15] Yusong Lin, Haiyang Wang, Shuzhe Wu, Lue Fan, Feiyang Pan, Sanyuan Zhao, and Dandan Tu. CLI-Gym: Scalable CLI task generation via agentic environment inversion. arXiv preprint arXiv:2602.10999, 2026.
[16] Kaijie Zhu, Yuzhou Nie, Yijiang Li, Yiming Huang, Jialian Wu, Jiang Liu, Ximeng Sun, Zhenfei Yin, Lun Wang, Zicheng Liu, Emad Barsoum, William Yang Wang, and Wenbo Guo. TermiGen: High-fidelity environment and robust trajectory synthesis for terminal agents. arXiv preprint arXiv:2602.07274, 2026.
[17] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[18] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
[19] Shaojie Wang, Jinghui Wang, Yinghan Cui, Xuxing Chen, Chao Wang, Liang Huang, Xiaojiang Zhang, Junyi Peng, Li Wan, Haotian Zhang, et al. Tree Training: Accelerating agentic LLMs training via shared prefix reuse. arXiv preprint arXiv:2511.00413, 2025.
[20] John Yang, Kilian Lieret, Carlos E Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. SWE-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798, 2025.
[21] Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, and Alexander Golubev. SWE-rebench V2: Language-agnostic SWE task collection at scale. arXiv preprint, 2026.
[22] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan.
τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint, 2024.
[23] Artificial Analysis Team. Artificial Analysis Long Context Reasoning benchmark (LCR), 2025.
[24] Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.