InCoder-32B: Code Foundation Model for Industrial Scenarios

Jian Yang 1, Wei Zhang 1, Jiajun Wu 1, Junhang Cheng 1, Shawn Guo 2, Haowen Wang 2, Weicheng Gu 1, Yaxin Du 3, Joseph Li 4, Fanglin Xu 2, Yizhi Li 5, Lin Jing 2, Yuanbo Wang 1, Yuhan Gao 1, Ruihao Gong 1, Chuan Hao 2, Ran Tao 2, Aishan Liu 1, Tuney Zheng 2, Ganqu Cui 6, Zhoujun Li 1, Mingjie Tang 7, Chenghua Lin 5, Wayne Xin Zhao 8, Xianglong Liu 1, Ming Zhou 9, Bryan Dai 2, Weifeng Lv 1

1 Beihang University  2 IQuest Research  3 Shanghai Jiao Tong University  4 ELLIS  5 University of Manchester  6 Shanghai Artificial Intelligence Laboratory  7 Sichuan University  8 Gaoling School of Artificial Intelligence, Renmin University of China  9 Langboat

† Corresponding Authors. Email: {jiayang}@buaa.edu.cn
HuggingFace: https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder
GitHub: https://github.com/CSJianYang/Industrial-Coder

Abstract

Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains.
Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.

Figure 1. Scope of industrial code intelligence. InCoder-32B aims to serve as a unified foundation model for (left) general software development and (right) industrial programming domains. The model supports a spectrum of capabilities, ranging from general coding tasks such as code generation, agentic development, and tool use to industrial workloads including chip design, GPU kernel optimization, CAD/CAM modeling, and compiler-level optimization.

Figure 2. Comparison of CUDA grid configuration strategies for RMS Normalization on large spatial dimensions (512 × 512). (Left) Claude assigns spatial_size (262,144) to gridDim.y, exceeding the CUDA hardware limit of 65,535, resulting in an invalid configuration argument runtime error. (Right) InCoder-32B flattens all dimensions into a 1D grid, safely utilizing the gridDim.x limit and avoiding the hardware constraint violation.
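The grid-sizing contrast in Figure 2 can be sketched numerically. This is an illustrative stand-in (plain Python arithmetic rather than an actual kernel launch), using the documented CUDA limits of 65,535 for gridDim.y and 2^31 − 1 for gridDim.x:

```python
# Sketch of the two grid strategies from Figure 2. CUDA caps gridDim.y/z at
# 65,535 blocks, while gridDim.x allows 2^31 - 1, so flattening all work into
# a 1D grid avoids the "invalid configuration argument" failure.
CUDA_MAX_GRID_Y = 65_535        # hardware limit on gridDim.y
CUDA_MAX_GRID_X = 2**31 - 1     # gridDim.x limit on modern GPUs

def naive_2d_grid(batch, spatial_size):
    """Assign spatial_size to gridDim.y -- fails for 512*512 inputs."""
    grid = (batch, spatial_size)
    return grid, grid[1] <= CUDA_MAX_GRID_Y

def flattened_1d_grid(batch, spatial_size, block_threads=256):
    """Flatten (batch, spatial) into one dimension and tile by block size."""
    total = batch * spatial_size
    num_blocks = (total + block_threads - 1) // block_threads  # ceil division
    return (num_blocks,), num_blocks <= CUDA_MAX_GRID_X

# 512 x 512 spatial dims -> 262,144 elements per batch entry
print(naive_2d_grid(1, 512 * 512))      # ((1, 262144), False): over the limit
print(flattened_1d_grid(1, 512 * 512))  # ((1024,), True): well within gridDim.x
```

The flattened kernel then recovers each element's logical coordinates from `blockIdx.x * blockDim.x + threadIdx.x` by integer division and modulo.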
1. Introduction

Code intelligence has witnessed substantial progress with the emergence of increasingly capable LLMs [126, 127]. Recent model releases such as Qwen3.5 [99], DeepSeek-V3.2 [68], and Claude-4.6 [10] have demonstrated strong performance across a wide range of programming tasks, with frontier models achieving gold-medal-level results in competitive programming [52], software tasks [105-107], and tool use tasks [16, 102]. These advances mark a turning point where LLMs have become genuinely capable assistants for everyday software engineering. Much of this progress is driven by the abundance and diversity of publicly available code data. Repositories on GitHub, StackOverflow discussions, and open-source documentation provide rich supervision for training, covering mainstream programming languages (PLs), frameworks, and development patterns.

Yet a critical gap persists between general code intelligence and the demands of industrial software development. Scenarios such as CUDA kernel optimization [87], Verilog hardware description [71], embedded firmware programming [35], and compiler optimization [23] impose requirements that fundamentally differ from conventional software engineering: specialized language semantics, strict timing and resource constraints, reasoning about hardware behavior, and rigorous verification methodologies. Related benchmarks show that even the strongest code LLMs struggle on industrial tasks: the best models achieve only a 28.80% call success rate on the G track and 41.57% on the T track of Triton operator generation [63], and 33.3% of generated Verilog code that passes simulation still fails formal equivalence checking [56]. To bridge this gap, we propose InCoder-32B, the first large language model purpose-built for industrial code intelligence.
With 32B parameters, InCoder-32B is explicitly designed to tackle the unique challenges of industrial software development, including reasoning about hardware constraints, timing behavior, synthesis requirements, and low-level performance optimization, which existing code LLMs treat as out-of-distribution tasks. A single InCoder-32B model serves chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling, unifying these previously fragmented industrial domains for the first time.

Figure 3. The three-stage training pipeline of InCoder-32B. Pre-train performs data curation and enhancement, Mid-train constructs an industrial-aware foundation with progressive context scaling from 8K to 128K, and Post-train produces simulation-grounded SFT data across industrial domains.

To achieve this, we adopt an efficient recurrent architecture and train InCoder-32B through a
three-stage Code-Flow pipeline: (1) Pre-training & Annealing with curated industrial code data and automated verification; (2) Mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data and agentic trajectories; and (3) Post-training with execution-grounded verification, yielding both an instruction-tuned variant and a thinking variant.

We conduct extensive evaluations on general and industrial code benchmarks and demonstrate that InCoder-32B combines broad coding competence with specialized industrial capabilities. InCoder-32B achieves 74.8% on SWE-bench Verified, 49.14% on LiveCodeBench, and 60.99% on BFCL, competitive with leading models of comparable or larger scale. On industrial benchmarks, InCoder-32B establishes the strongest open-source results across all evaluated domains, including chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling.

Our contributions are:
• To the best of our knowledge, InCoder-32B is the first code LLM purpose-built for industrial code intelligence, bridging the long-standing gap between academic code benchmarks and real-world industrial engineering domains such as chip design, GPU kernel optimization, embedded systems, and compiler engineering.
• We assemble the most comprehensive industrial code evaluation to date, covering 14 general benchmarks and 9 industrial benchmarks across 4 specialized domains.
• Through extensive ablations, we find that repository transition data outperforms static snapshots for planning, mid-training reasoning trajectories improve robustness under distribution shift, and thinking paths unlock emergent capabilities absent in standard instruction tuning.

2. Scaling Industrial Data under Simulation Environments

Industrial code differs from general software in that its correctness can only be established by running it in the same environment where it will ultimately be deployed.
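The execution-grounded verification idea can be sketched generically: run a candidate program in a sandbox, capture its exit status and streams, and compare its output to a reference. This is a toy stand-in (the helper `run_and_check` and its Python-subprocess "toolchain" are our illustration, not the paper's actual harness, which drives simulators and compilers):

```python
# Minimal execution-grounded checker: a candidate "passes" only if it runs
# cleanly and its stdout matches the reference output. Real industrial
# environments substitute iverilog/nvcc/arm-none-eabi-gcc for the interpreter.
import os
import subprocess
import sys
import tempfile

def run_and_check(candidate_src: str, stdin_data: str, expected_stdout: str,
                  timeout_s: float = 10.0) -> dict:
    """Write candidate code to a temp file, execute it, and grade the result."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_src)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], input=stdin_data,
                              capture_output=True, text=True, timeout=timeout_s)
        passed = (proc.returncode == 0
                  and proc.stdout.strip() == expected_stdout.strip())
        # stderr is kept so failures can seed feedback-driven repair later
        return {"passed": passed, "returncode": proc.returncode,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"passed": False, "returncode": None, "stdout": "", "stderr": "timeout"}
    finally:
        os.unlink(path)

result = run_and_check("print(int(input()) * 2)", "21", "42")
print(result["passed"])  # True
```

The captured stderr/stdout of failing candidates is exactly the kind of feedback context the later repair stage consumes.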
A Verilog module is validated through RTL simulation before it reaches silicon; a GPU kernel must execute on real hardware and produce numerically correct results; embedded firmware must boot on a microcontroller and interact correctly with its peripherals; and a CAD script must produce geometry that can be manufactured. To generate reliable post-training data for InCoder-32B, we reconstruct these four classes of industrial environments in software, matching the toolchains and correctness criteria that engineers encounter in production.

2.1. Chip Design

In the semiconductor industry, a digital design progresses through an established flow: RTL authoring, behavioral simulation against testbenches, logic synthesis, and physical implementation. We reconstruct the first three stages using publicly available EDA tools. Icarus Verilog serves as the front end for behavioral simulation of Verilog designs. For IP cores written in SystemVerilog, we employ Verilator, which translates RTL into optimized C++ models and is the same simulator adopted by projects such as CHIPS Alliance and lowRISC. At the synthesis stage, Yosys maps RTL to a gate library, allowing us to verify synthesizability and extract area and timing estimates. These three tools are composed into a single containerized image that mirrors the environment an RTL engineer works in: source files and testbenches go in, and compilation status, simulation results, and synthesis reports come out. By replicating this industrial flow rather than inventing a proxy, every training signal we extract is grounded in the same criteria that determine whether a design succeeds on real silicon.

2.2. GPU Optimization

GPU kernel development follows a distinct workflow: an engineer writes a kernel in CUDA or Triton, compiles it via the NVIDIA toolchain, launches it on a GPU, and validates both numerical correctness and performance. We replicate this workflow on NVIDIA A100 nodes.
For CUDA, we integrate the nvcc compiler through PyTorch's runtime compilation interface, matching the workflow used in libraries such as FlashAttention and xFormers, where custom kernels are compiled and loaded at import time. For Triton, we rely on the official compiler stack: a Python function decorated with @triton.jit is compiled to GPU code at first invocation and cached for subsequent calls, the same path used in serving frameworks such as vLLM and SGLang. The execution environment preserves the key characteristics of real deployment. Kernels launch on the same A100 hardware that production workloads target, memory is allocated through the standard CUDA allocator, and timing is measured via CUDA events. By building on the identical hardware and software stack that kernel engineers use, we ensure that signals obtained during data synthesis transfer directly to real deployment.

2.3. 3D Modeling

In mechanical engineering, parametric CAD models are authored in scripting languages that drive a solid modeling kernel. The most widely adopted such kernel is OpenCascade, which supports Boolean operations, filleting, chamfering, extrusion, revolution, and lofting. CadQuery provides a Python interface to OpenCascade and has become the standard for programmatic CAD in the open hardware community. We construct a modeling environment around CadQuery that reproduces the workflow a CAD engineer follows: a Python script defines geometric primitives, applies transformations, and exports the resulting solid to interchange formats such as STEP and STL. Generated scripts run against the same OpenCascade version used by production tools such as FreeCAD and KiCad, so code that passes our environment will also execute correctly in real CAD applications.
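One simple way to compare a generated solid against a reference is volumetric: a closed, tessellated export (e.g. STL) has a volume computable by the divergence theorem. The sketch below is our illustration of that idea (assuming a consistently outward-oriented triangle mesh), not the paper's actual evaluator:

```python
# Volume of a closed, outward-oriented triangle mesh via the divergence
# theorem: sum the signed volumes of tetrahedra (origin, v0, v1, v2).
# Comparing this value against a reference solid gives a cheap
# volumetric-fidelity signal for a tessellated CAD export.
def mesh_volume(vertices, faces):
    """vertices: list of (x, y, z); faces: list of index triples."""
    def cross(a, b):
        return (a[1]*b[2] - a[2]*b[1],
                a[2]*b[0] - a[0]*b[2],
                a[0]*b[1] - a[1]*b[0])
    def dot(a, b):
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
    vol = 0.0
    for i, j, k in faces:
        v0, v1, v2 = vertices[i], vertices[j], vertices[k]
        vol += dot(v0, cross(v1, v2)) / 6.0  # signed tetra volume
    return vol

# Unit cube tessellated into 12 outward-oriented triangles.
V = [(0,0,0),(1,0,0),(1,1,0),(0,1,0),(0,0,1),(1,0,1),(1,1,1),(0,1,1)]
F = [(0,2,1),(0,3,2),(4,5,6),(4,6,7),(0,1,5),(0,5,4),
     (3,6,2),(3,7,6),(0,4,7),(0,7,3),(1,2,6),(1,6,5)]
print(round(mesh_volume(V, F), 9))  # 1.0
```

Because the tetrahedron volumes are signed, the sum is independent of where the origin sits relative to the solid, as long as the mesh is closed and consistently oriented.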
Geometric fidelity is evaluated by tessellating the output solid and comparing it volumetrically against a reference, ensuring that the generated model is not merely syntactically valid but geometrically faithful to the specification.

Figure 4. The performance of models on industrial code benchmarks.

2.4. Code Optimization

Code optimization in industry takes two forms: embedded systems programming, where code must run correctly on microcontrollers with specific peripheral hardware, and performance optimization, where the goal is to produce faster machine code. We construct a dedicated environment for each. For embedded systems, we target the STM32F407, one of the most widely deployed ARM Cortex-M4 microcontrollers. The environment replicates the complete firmware toolchain: the arm-none-eabi-gcc cross compiler builds generated C code against CMSIS device headers and a linker script that maps the chip's memory layout. The compiled firmware is then loaded into the Renode simulator, which provides a virtual replica of the entire STM32F407 including GPIO ports, UART controllers, SPI and I2C buses, timers, ADC with DMA, and the interrupt controller. Each peripheral model reproduces the register layout and interrupt behavior specified in the reference manual, so that code running correctly in our environment will also run on physical hardware.
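The register-level reasoning such an environment exercises can be illustrated with the STM32 GPIO mode register, which packs a 2-bit mode per pin. The sketch below is in Python for readability (real firmware in this environment is C), and the helper name is ours:

```python
# Illustrative sketch of register-field configuration on the STM32F407:
# GPIOx_MODER holds a 2-bit mode field per pin (00 input, 01 output,
# 10 alternate function, 11 analog), so configuring pin N must
# read-modify-write only bits [2N+1:2N] and leave other pins untouched.
MODE_INPUT, MODE_OUTPUT, MODE_ALTFN, MODE_ANALOG = 0b00, 0b01, 0b10, 0b11

def set_pin_mode(moder: int, pin: int, mode: int) -> int:
    """Return MODER with the pin's 2-bit field replaced."""
    shift = 2 * pin
    cleared = moder & ~(0b11 << shift)  # clear the target field only
    return cleared | (mode << shift)    # then install the new mode bits

reg = 0x0000_0000                        # reset value: all pins input
reg = set_pin_mode(reg, 5, MODE_OUTPUT)  # e.g. PA5 driving an LED
reg = set_pin_mode(reg, 2, MODE_ALTFN)   # e.g. PA2 as USART2 TX
print(hex(reg))  # 0x420: 0b01 at bits 11:10, 0b10 at bits 5:4
```

An emulator that models the register layout can flag a candidate that clobbers neighboring fields, a bug class that compiles cleanly yet breaks unrelated peripherals on hardware.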
This fidelity is critical because embedded bugs are often caused not by algorithmic errors but by incorrect register configuration or interrupt priority conflicts that only surface on real or faithfully emulated hardware. For x86-64 assembly optimization, we replicate the standard compiler benchmarking workflow. Generated assembly is linked against a test harness and executed natively under controlled conditions: fixed CPU frequency, pinned core affinity, and repeated measurements. This mirrors the methodology used in LLVM and GCC regression suites, where the goal is to verify that an optimization is both correct and measurably faster.

The shared principle across all four environments is to replicate the toolchains and execution semantics that industrial engineers use rather than constructing simplified proxies. By building on the same simulators, compilers, and hardware that real deployments depend on, we ensure that training signals transfer directly to practice.

3. Training Strategy

Industrial hardware and system engineering spans diverse domains: digital circuit design (RTL/Verilog), GPU computing (Triton operators, CUDA kernels), systems programming (C/C++/Rust kernels), FPGA synthesis (HLS), CAD tool integration, and embedded systems, each with domain-specific challenges, timing constraints, resource budgets, and verification methodologies. While these domains cover a broad range of industrial coding tasks, corresponding training corpora are scarce at every training stage. Detailed training procedures for pre-training, mid-training, and post-training are provided in Appendix A, C, and D, respectively.

3.1. Stage 1: Pre-Training

Data Collection. We collect industrial code from public repositories, technical literature, and domain-specific web data. Notably, we design a three-step recall strategy to increase the coverage of industrial code collected from public repositories.
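The paper defers the details of its three-step recall strategy to an appendix, so the following is a hypothetical illustration of what a layered recall over repository files could look like; the extension lists, keyword sets, and score threshold are invented for the sketch:

```python
# Hypothetical three-step recall over repository files:
#   step 1: cheap extension-based recall (RTL, CUDA, linker scripts, SDC),
#   step 2: content-keyword recall for files with generic extensions,
#   step 3: a learned classifier score as a fallback.
INDUSTRIAL_EXTS = {".v", ".sv", ".cu", ".ld", ".sdc"}
INDUSTRIAL_KEYWORDS = {"always_ff", "posedge", "__global__", "tl.load", "cmsis"}

def recall(files, classifier_score, threshold=0.5):
    """files: {path: text}; classifier_score: path -> float in [0, 1]."""
    hits = set()
    for path, text in files.items():
        ext = "." + path.rsplit(".", 1)[-1] if "." in path else ""
        if ext in INDUSTRIAL_EXTS:
            hits.add(path)                                 # step 1
        elif any(kw in text for kw in INDUSTRIAL_KEYWORDS):
            hits.add(path)                                 # step 2
        elif classifier_score(path) >= threshold:
            hits.add(path)                                 # step 3
    return hits

files = {"rtl/alu.v": "module alu;",      # caught by extension
         "app/main.py": "print('hi')",    # not industrial
         "kern.py": "tl.load(ptr)"}       # caught by Triton keyword
print(sorted(recall(files, classifier_score=lambda p: 0.0)))
# ['kern.py', 'rtl/alu.v']
```

Layering recall this way trades precision for coverage: cheap rules catch the obvious cases, and the classifier sweeps up industrial code hiding behind generic file types.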
Additionally, we adopt OCR to collect high-quality code snippets and structured content from technical literature. Following previous work [127], we use the same model configuration for InCoder-32B. See the appendix for further details.

Data Cleaning and Refinement. We perform license filtering, personally identifiable information (PII) removal, and file-level validation, followed by deduplication at the levels of exact hash matching, token-level near-duplicate detection [76], repository-level fork consolidation, and cross-source deduplication. We apply additional domain-specific checks before data refinement, where we normalize surface-level formatting and add structured annotations. All refined samples are verified through AST comparison and re-compilation to ensure correctness.

We train InCoder-32B on 4,096 GPUs with autoregressive language modeling and fill-in-the-middle (FIM) completion [40, 46] using a standard decoder-only Transformer architecture. See the appendix for more details.

3.2. Stage 2: Mid-Training

3.2.1. Context Extension

We extend the model context length with a two-sub-stage strategy, progressively extending from 8K tokens to 32K tokens, and then to 128K tokens. While the first sub-stage focuses on file-level tasks, e.g. completing RTL modules, the latter sub-stage unlocks the model's long-context capabilities, e.g. extended debugging sessions.

3.2.2. Industrial Data Synthesis and Curation

Our stage 2 training data consist of synthetically generated industrial reasoning QAs, agent trajectories, and code artifacts. Notably, our synthesized data extensively leverage real-world development scenarios that are normally underrepresented in public repositories.

Synthetic Industrial Code QA.
Our synthesis pipeline operates in three steps designed to produce industrially grounded, factually correct reasoning data: (i) industrial scenario specification through consultation with practising hardware and systems engineers; (ii) seed code generation that reflects realistic hardware design patterns and domain-specific conventions; (iii) QA pair synthesis with automated verification. The detailed synthesis pipeline and coverage analysis are provided in Appendix E.

Agent Trajectories. We include multi-step debugging and repair trajectories following the Thought-Action-Observation cycle [128], capturing closed-loop reasoning with tool feedback from hardware simulators, synthesis tools, C/C++ compilers, and formal verification engines. Curating such trajectories addresses the lack of operational context in standard code corpora.

Industrial Code Artifacts. We also include auxiliary artifacts that reflect the operational context of professional hardware development: hardware testbenches (SystemVerilog/UVM), timing constraints (SDC), synthesis scripts, GPU profiling traces, and memory sanitiser logs [7, 104]. These domain-specific artifacts expose the model to the full ecosystem of industrial hardware engineering, compensating for their scarcity in public data.

3.3. Stage 3: Post-Training Data Construction

General-purpose supervised fine-tuning (SFT) datasets [21, 89] carry little signal for industrial coding tasks, especially when execution-based verification can have a non-trivial impact. Therefore, we construct 2.5M samples directly from real-life industrial coding tasks grounded in execution. Our tasks span hardware design, GPU kernel development, systems programming, and embedded firmware.
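A Thought-Action-Observation debugging trajectory of the kind curated above might be serialized as follows. The schema, field names, and the example commands/messages are hypothetical illustrations, not the paper's actual format:

```python
# Hypothetical serialization of one closed-loop debugging trajectory:
# each step records the model's reasoning, the tool it invoked, and the
# tool's feedback, ending with the repaired artifact.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Step:
    thought: str       # reasoning before acting
    action: str        # tool invocation, e.g. a simulator command
    observation: str   # tool feedback fed into the next turn

@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)
    final_code: str = ""

traj = Trajectory(task="Fix FSM deadlock in uart_rx.v")
traj.steps.append(Step(
    thought="The receiver may stall in IDLE if the start bit glitches.",
    action="iverilog -o tb uart_rx.v tb_uart_rx.v && vvp tb",
    observation="ASSERTION FAILED at t=1040ns: rx_valid never asserted"))
traj.steps.append(Step(
    thought="Add a majority-vote filter on the sampled start bit.",
    action="apply_patch uart_rx.v",
    observation="All 12 testbench assertions passed"))
traj.final_code = "module uart_rx(...); /* repaired RTL */ endmodule"

record = json.dumps(asdict(traj))  # one training sample for the SFT corpus
print(len(traj.steps))  # 2
```

Keeping the failing step and its raw tool feedback in the record is what gives the trajectory its closed-loop training signal.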
Task Construction, Candidate Generation, and Verification. Each task is decomposed into a structured instruction with a natural language requirement description, interface constraints (port lists, function signatures, API contracts), the target platform and toolchain, dependency configurations, and associated verification scripts. This normalization step produces a consistent instruction format for SFT. Given an instruction, we generate a diverse set of candidate solutions through a group of complementary sampling strategies, such as template-based perturbation and cross-language migration, in order to boost the diversity of generated solutions. We validate generated solutions through execution. Notably, this verification is grounded in a real execution environment, i.e. the same environment a real engineer uses in production.

Feedback-Driven Repair. For a solution that fails execution, our pipeline captures the entire feedback context, including compiler error messages, runtime logs, counterexample inputs, waveform differences, and profiling bottlenecks. We then append this feedback to the failed solution to generate a repaired solution. The result is a closed-loop repair trajectory [54, 131] containing both the failed and the successful solution together with execution feedback, which we also include in the SFT corpus in order to mimic the bug-fixing workflow of an experienced engineer.

Quality Filtering and Final Composition. Finally, we filter SFT samples by executability, stability, and information density, and categorize them into three kinds: direct solutions, defect repairs, and performance and structural optimization samples. The last category refers to a correct solution improved with respect to efficiency, readability, or architectural quality.

4. Evaluation
4.1. Baselines

We compare InCoder-32B against a comprehensive set of large language models spanning both open-weight and proprietary systems, evaluating them across general-purpose code benchmarks and specialized industrial code domains. For general-purpose code evaluation, our baselines include DeepSeek-Coder-V2-Lite-Instruct [26] and DeepSeek-V3.2 [68], the Qwen2.5-Coder series (7B, 14B, and 32B) [51], Qwen3-235B-A22B-Instruct and Qwen3-235B-A22B-Thinking [122], the Qwen3-Coder series (30B-A3B and 480B-A35B) [97], Seed-Coder-8B-Instruct [101] from ByteDance, Kimi-Dev-72B [129], Kimi-K2-Instruct

Table 1. Performance comparison on code generation tasks.

Model | Size | HumanEval | HumanEval+ | MBPP | MBPP+ | BigCodeBench Full | BigCodeBench Hard | FullStackBench
6B+ Models
DeepSeek-Coder-V2-Lite-Instruct | 2.4/16B | 81.1 | 75.6 | 85.2 | 70.6 | 37.8 | 18.9 | 49.4
Qwen2.5-Coder-7B-Instruct | 7B | 87.2 | 81.7 | 84.7 | 72.2 | 37.8 | 13.5 | 42.2
Seed-Coder-8B-Instruct | 8B | 81.1 | 75.6 | 86.2 | 73.3 | 44.6 | 23.6 | 55.8
Qwen2.5-Coder-14B-Instruct | 14B | 62.8 | 59.8 | 88.6 | 77.2 | 47.0 | 6.1 | 53.1
30B+ Models
Qwen3-Coder-30B-A3B-Instruct | 3.3/30.5B | 93.9 | 87.2 | 90.7 | 77.2 | 46.9 | 27.7 | 60.9
DeepSeek-V3.2 | 37/671B | 93.9 | 88.4 | 93.4 | 77.2 | 48.1 | 27.0 | 64.9
Qwen2.5-Coder-32B-Instruct | 32B | 93.3 | 86.6 | 90.2 | 77.8 | 48.0 | 24.3 | 57.4
Qwen3-235B-A22B-Instruct-2507 | 22/235B | 96.3 | 91.5 | 92.3 | 77.8 | 47.4 | 25.7 | 62.7
Qwen3-235B-A22B-Thinking-2507 | 22/235B | 98.8 | 93.3 | 95.5 | 81.5 | 44.1 | 23.0 | -
Qwen3-Coder-480B-A35B-Instruct | 35/480B | 97.6 | 92.7 | 94.2 | 80.2 | 49.4 | 27.7 | 66.4
Kimi-Dev-72B | 72B | 93.3 | 86.0 | 79.6 | 68.8 | 45.4 | 31.8 | 38.6
Kimi-K2-Instruct-0905 | 32B/1T | 94.5 | 89.6 | 91.8 | 74.1 | 49.8 | 30.4 | 63.5
Kimi-K2-Thinking | 32B/1T | 98.2 | 92.7 | 97.4 | 82.3 | 46.8 | 28.4 | -
KAT-Dev | 32B | 90.9 | 86.6 | 89.4 | 76.2 | 46.2 | 25.7 | 58.8
KAT-Dev-72B-Exp | 72B | 88.4 | 81.7 | 85.2 | 69.3 | 48.3 | 26.4 | 52.9
GLM-4.7 | 32/355B | 87.2 | 79.9 | 90.5 | 75.7 | 45.7 | 26.4 | 70.2
InCoder-32B | 32B | 94.5 | 89.6 | 91.8 | 78.3 | 49.8 | 31.1 | 57.1

Table 2. Combined performance on code reasoning (CruxEval, LiveCodeBench), code efficiency (Mercury), and Text2SQL (Bird, Spider) benchmarks.
Model | Size | CruxEval Input-COT | CruxEval Output-COT | LiveCodeBench V5 | LiveCodeBench V6 | Mercury Beyond@1 | Mercury Pass@1 | Bird | Spider
6B+ Models
DeepSeek-Coder-V2-Lite-Instruct | 2.4/16B | 57.1 | 56.2 | 13.2 | 19.4 | 76.8 | 91.4 | 41.6 | 72.4
Qwen2.5-Coder-7B-Instruct | 7B | 66.9 | 66.0 | 14.4 | 18.9 | 69.9 | 84.8 | 53.1 | 79.8
Seed-Coder-8B-Instruct | 8B | 62.0 | 66.6 | 19.2 | 22.3 | 78.5 | 93.8 | 44.7 | 72.7
Qwen2.5-Coder-14B-Instruct | 14B | 75.6 | 79.2 | 22.8 | 24.6 | 76.7 | 88.3 | 59.1 | 81.3
30B+ Models
Qwen3-Coder-30B-A3B-Instruct | 3.3/30.5B | 76.9 | 80.5 | 43.1 | 36.0 | 81.1 | 95.3 | 59.0 | 80.9
DeepSeek-V3.2 | 37/671B | 82.1 | 94.2 | - | 83.3 | 81.6 | 96.9 | 52.6 | 77.9
Qwen2.5-Coder-32B-Instruct | 32B | 78.8 | 84.0 | 30.5 | 27.4 | 79.1 | 96.1 | 62.1 | 83.9
Qwen3-235B-A22B-Instruct-2507 | 22/235B | 62.0 | 89.5 | 53.9 | 51.8 | 80.4 | 96.9 | 62.8 | 81.1
Qwen3-235B-A22B-Thinking-2507 | 22/235B | 15.2 | 46.9 | 80.2 | 74.1 | 61.2 | 70.3 | 35.2 | 42.6
Qwen3-Coder-480B-A35B-Instruct | 35/480B | 87.1 | 90.4 | 48.6 | 53.9 | 80.2 | 96.1 | 61.3 | 81.2
Kimi-Dev-72B | 72B | 33.0 | 64.2 | 46.1 | 40.0 | 59.1 | 69.5 | - | -
Kimi-K2-Instruct-0905 | 32B/1T | 86.8 | 89.5 | 52.1 | 53.7 | 76.1 | 90.6 | 60.4 | 81.1
Kimi-K2-Thinking | 32B/1T | 92.2 | 86.2 | - | 83.1 | 73.0 | 85.2 | 40.6 | 49.6
KAT-Dev | 32B | 42.5 | 65.1 | 32.9 | 32.6 | 75.1 | 89.1 | 52.2 | 77.6
KAT-Dev-72B-Exp | 72B | 71.4 | 81.1 | 13.8 | 16.0 | 79.0 | 94.5 | 35.2 | 60.3
GLM-4.7 | 32/355B | 65.6 | 81.2 | - | 84.9 | 74.1 | 86.7 | 46.5 | 62.4
InCoder-32B | 32B | 62.4 | 73.9 | 53.3 | 49.1 | 71.4 | 85.6 | 55.4 | 79.7

and Kimi-K2-Thinking [111] from Moonshot AI, KAT-Dev and KAT-Dev-72B-Exp [136] from Kuaishou, and GLM-4.7 [6] from Zhipu AI. For specialized industrial code evaluation, we evaluate DeepSeek-V3.2 [68], the GLM series (GLM-5 [135] and GLM-4.7 [6]) from Zhipu AI, the Kimi family (Kimi-K2.5 [112] and Kimi-K2 in both Instruct and Thinking variants [111]) from Moonshot AI, MiniMax-M2.5 [80], the Qwen ecosystem comprising the general-purpose Qwen3.5 series ranging from 0.8B to 397B-A17B [99], Qwen3-Next [113], and the code-specialized Qwen3-Coder series [97, 98] from Alibaba, Seed-OSS-36B-Instruct [110] from ByteDance, and GPT-OSS (120B and 20B) [4] from OpenAI.
For proprietary models, we include Claude-Sonnet-4.6 [10] from Anthropic. These baselines collectively encompass dense and mixture-of-experts architectures across a wide parameter range, enabling a thorough investigation of current capability boundaries across both general code and industrial code domains.

Table 3. Combined performance on agentic coding tasks (Terminal-Bench v1.0 and v2.0, SWE-bench Verified) and general tool-use tasks (Mind2Web, BFCL V3, τ²-bench).

Model | Size | Terminal-Bench v1.0 | Terminal-Bench v2.0 | SWE-bench Verified | Mind2Web | BFCL V3 | τ²-bench Airline | τ²-bench Retail | τ²-bench Telecom
6B+ Models
DeepSeek-Coder-V2-Lite-Instruct | 2.4/16B | 5.0 | 0.0 | - | 26.7 | - | 3.5 | 12.0 | -
Qwen2.5-Coder-7B-Instruct | 7B | 6.3 | 0.0 | - | 38.4 | 54.2 | - | - | -
Seed-Coder-8B-Instruct | 8B | 7.5 | 2.5 | - | 38.2 | - | 4.3 | 32.0 | -
Qwen2.5-Coder-14B-Instruct | 14B | 8.8 | 0.0 | - | 42.7 | 59.9 | - | - | -
30B+ Models
Qwen3-Coder-30B-A3B-Instruct | 3.3/30.5B | 23.8 | 23.8 | 51.9 | 36.1 | 63.4 | 42.0 | 25.4 | 25.4
DeepSeek-V3.2 | 37/671B | 23.8 | 46.4 | 73.1 | 47.2 | 68.8 | 63.8 | 81.1 | 96.2
Qwen2.5-Coder-32B-Instruct | 32B | 5.0 | 4.5 | - | 32.5 | 62.3 | - | - | -
Qwen3-235B-A22B-Instruct-2507 | 22/235B | 15.0 | 13.5 | 45.2 | 49.0 | 71.2 | 50.0 | 74.6 | 32.5
Qwen3-235B-A22B-Thinking-2507 | 22/235B | 8.8 | 3.4 | 44.6 | 43.2 | 71.9 | 58.0 | 71.9 | 45.6
Qwen3-Coder-480B-A35B-Instruct | 35/480B | 37.5 | 23.6 | 67.0 | 54.0 | 68.7 | 60.0 | 77.5 | 65.8
Kimi-Dev-72B | 72B | - | 2.3 | 60.4 | - | 55.5 | 21.9 | 32.0 | 35.1
Kimi-K2-Instruct-0905 | 32B/1T | 44.5 | 27.8 | 69.2 | 53.4 | 70.3 | 56.5 | 70.6 | 65.8
Kimi-K2-Thinking | 32B/1T | 47.1 | 33.7 | 71.3 | 55.7 | - | - | - | -
KAT-Dev | 32B | 17.5 | 10.1 | 62.4 | 33.7 | 64.7 | 32.0 | 28.0 | 35.1
KAT-Dev-72B-Exp | 72B | 21.3 | 7.9 | 74.6 | - | - | - | - | -
GLM-4.7 | 32/355B | 36.3 | 41.0 | 73.8 | 53.7 | 64.8 | 60.0 | 70.2 | 75.4
InCoder-32B | 32B | 35.0 | 22.5 | 74.8 | 55.8 | 61.0 | 70.0 | 85.1 | 86.8

4.2. Benchmarks
4.2.1. General Code Evaluation

We evaluate model performance across multiple dimensions: code generation using EvalPlus [69] (HumanEval [22] and MBPP [12]), BigCodeBench [140] for library-intensive tasks, and FullStackBench [74] for full-stack scenarios; code reasoning with CRUXEval [43], testing bidirectional execution prediction (I2O and O2I), and LiveCodeBench [52] for competitive programming; code efficiency via Mercury [31], which jointly measures correctness and runtime performance; Text-to-SQL capabilities on Spider [132] for schema linking and BIRD [65] for value grounding; agentic coding tasks including Terminal-Bench [114] for terminal workflows, SWE-bench [55] for real-world patch generation, and SWE-bench Verified [108] with human-curated instances; and general agentic tasks such as Mind2Web [29] for web navigation, BFCL [90] for multi-turn function calling across heterogeneous APIs, and τ-bench [130] with τ²-bench [15] for policy-constrained conversational agents in a shared environment.

4.3. Industrial Code Benchmarks

As shown in Figure 4, we also evaluate our model on industrial code tasks, i.e. tasks related to chip design, GPU kernel optimization, code optimization, and 3D modeling. These tasks differ from conventional software engineering in important ways: they require reasoning about hardware constraints, low-level performance trade-offs, and domain-specific correctness criteria.

4.3.1. Chip Design

VeriScope. We propose this Verilog generation benchmark comprising 568 problems across five difficulty levels, with problems ranging from basic combinational logic, hierarchical module composition, and system-level designs to extreme challenges such as a dual-core out-of-order RISC-V SoC with cache coherence at L5. Each problem is evaluated through simulation: a solution is scored 0, 50, or 100 according to whether it fails to compile, compiles but fails the tests, or passes all unit tests.

RealBench.
RealBench [56] targets production-grade IP-level Verilog generation rather than small algorithmic exercises. Built on four real-world open-source IP cores (AES encryption, SD card controller, and Hummingbirdv2 E203 CPU), it includes 60 module-level subtasks where sub-modules can fall back to golden implementations, and 4 system-level subtasks that require implementing the entire module hierarchy from scratch given only top-level specifications. Evaluation employs a layered verification pipeline: Syn@k checks whether at least one of k independently generated candidates compiles successfully, while Func@k further verifies functional correctness via testbench simulation on compilable candidates, where k denotes the number of independent samples. We report both metrics at k ∈ {1, 5} for system-level and module-level tasks separately.

ArchXBench. ArchXBench [94] provides 51 complex digital system designs across six difficulty levels, covering domains including cryptography, signal processing, image processing, and machine learning. Designs range from arithmetic circuits to complete subsystems such as AES cores and streaming FFT/DCT pipelines. Each task includes a problem description, an interface specification, and a testbench. We generate five independent candidates per task and report results using an (n, t) format, where n denotes the number of syntactically correct candidates out of five and t measures the percentage of testbench assertions passed by the best candidate among them.

VeriRepair. VeriRepair is our proposed Verilog error diagnosis and repair dataset, constructed by systematically injecting realistic bugs into correct implementations from VeriCoder-RTLCoder. The error taxonomy spans 4 major categories and 20 error types, including syntax errors, type and structural errors, timing and FSM errors, and semantic/logic errors.
The dataset contains approximately 22,000 training and 300 test samples, each annotated with buggy code, error types, error locations, a corrected reference, and a testbench. It supports three evaluation tasks (error classification, error localization, and error repair), and we report the repair success rate (Fix), measured as the fraction of buggy programs that the model successfully corrects, as verified by the testbench.

4.3.2. GPU Optimization

KernelBench. KernelBench [88] evaluates LLMs on 250 PyTorch ML workloads across three levels: single operations at Level 1, operator sequences amenable to fusion at Level 2, and end-to-end architectures at Level 3. The model receives a PyTorch reference implementation and must produce an optimized version using any available tool. The benchmark introduces the fast_𝑝 metric, measuring the fraction of tasks where the generated kernel is both correct and faster than threshold 𝑝 over the PyTorch baseline. We report fast_1 separately for each level (L1, L2, L3).

TritonBench. TritonBench [63] is the first comprehensive benchmark for Triton operator generation, targeting the Python-like GPU programming language adopted in vLLM and Liger-Kernel. It features two evaluation tracks: TritonBench-G, with 184 curated real-world operators from GitHub across five difficulty levels, and TritonBench-T, with tasks aligned to PyTorch interfaces. For each track, we report two metrics: call accuracy (whether the generated code executes without error) and execution accuracy (whether the output further matches the reference implementation), yielding four metrics in total: G-call, G-exe, T-call, and T-exe.

Table 4. Performance on chip design benchmarks. InCoder-32B results are highlighted in gray.
| Model | Size | VeriScope Score | VeriRepair Fix (%) | RealBench Syn@1 (Sys) | Syn@5 (Sys) | Syn@1 (Mod) | Syn@5 (Mod) | Func@1 (Mod) | Func@5 (Mod) | ArchXBench 𝑛 | ArchXBench 𝑡 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *6B+ Models* | | | | | | | | | | | |
| Qwen3.5-9B | 9B | 32.0 | 0.0 | 0.0 | 0.0 | 6.3 | 15.8 | 4.3 | 8.5 | 1.9 | 44.3 |
| GPT-OSS-20B | 3.6/21B | 73.9 | 86.7 | 3.8 | 17.4 | 22.9 | 47.9 | 9.8 | 21.0 | 3.1 | 53.5 |
| Qwen3.5-27B | 27B | 55.7 | 60.0 | 6.2 | 20.1 | 17.1 | 33.8 | 10.6 | 17.8 | 2.6 | 48.3 |
| *30B+ Models* | | | | | | | | | | | |
| Qwen3-Coder-30B-A3B-Instruct | 3.3/30.5B | 66.2 | 76.7 | 0.0 | 0.0 | 23.0 | 35.2 | 5.2 | 8.2 | 2.4 | 37.3 |
| Seed-OSS-36B-Instruct | 36B | 67.2 | 66.7 | 5.0 | 21.3 | 14.3 | 23.0 | 11.5 | 20.3 | 2.5 | 43.7 |
| GPT-OSS-120B | 5.1/117B | 82.2 | 76.7 | 5.0 | 21.3 | 37.8 | 64.3 | 17.5 | 30.8 | 3.4 | 54.8 |
| MiniMax-M2.5 | 10/230B | 75.1 | 66.7 | 23.8 | 48.5 | 17.2 | 38.4 | 6.9 | 15.9 | 2.9 | 46.0 |
| GLM-4.7 | 32/355B | 81.2 | 63.3 | 12.5 | 24.6 | 25.4 | 46.9 | 11.6 | 21.2 | 3.2 | 51.4 |
| GLM-5 | 40/744B | 83.2 | 90.0 | 2.5 | 11.2 | 22.2 | 43.4 | 12.2 | 22.6 | 3.1 | 53.2 |
| Kimi-K2.5 | 32B/1T | 73.1 | 83.3 | 5.0 | 17.9 | 43.7 | 52.2 | 23.1 | 25.7 | 3.8 | 49.7 |
| Kimi-K2-Instruct | 32B/1T | 82.4 | 76.7 | 6.2 | 26.2 | 50.1 | 70.1 | 22.2 | 28.3 | 2.9 | 44.9 |
| Kimi-K2-Thinking | 32B/1T | 73.1 | 80.0 | 0.0 | 0.0 | 27.8 | 59.4 | 14.1 | 28.9 | 1.5 | 30.1 |
| DeepSeek-V3.2 | 37/671B | 76.1 | 77.0 | 18.8 | 55.1 | 39.3 | 52.7 | 17.2 | 21.4 | 3.6 | 53.9 |
| Qwen3.5-397B-A17B | 17/397B | 44.8 | 86.7 | 11.2 | 38.1 | 35.2 | 59.5 | 16.4 | 28.3 | 3.1 | 53.5 |
| Qwen3-Coder-480B-A35B-Instruct | 35/480B | 80.8 | 76.7 | 0.0 | 0.0 | 28.9 | 39.5 | 14.8 | 20.6 | 3.0 | 43.9 |
| InCoder-32B | 32B | 80.7 | 80.0 | 10.0 | 23.7 | 74.8 | 83.3 | 62.7 | 70.5 | 3.4 | 51.0 |
| *Closed-APIs Models* | | | | | | | | | | | |
| Claude-Sonnet-4.6 | – | 87.7 | 83.3 | 41.2 | 50.0 | 69.2 | 77.7 | 33.5 | 37.2 | 4.4 | 58.2 |

4.3.3. Code Optimization

EmbedCGen. EmbedCGen is our proposed benchmark for bare-metal embedded C code generation targeting resource-constrained microcontrollers. It comprises 500 problems across five difficulty levels, ranging from basic peripheral control and register-level operations at lower levels to complex multi-peripheral integration, DMA mechanisms, and state-machine coordination at higher levels. Generated code must satisfy HAL conventions and hard real-time constraints.
Evaluation follows a strict serial pipeline, including code generation, cross-compilation via the ARM GCC toolchain, and functional verification through the Renode system-level simulator. We report the average pass rate (Main) across all 500 problems.

SuperCoder. SuperCoder [119] frames assembly superoptimization as an LLM task: given a C program and its gcc -O3 output, the model must generate a semantically equivalent but faster assembly program. The benchmark contains 8,072 x86-64 assembly programs averaging 130 lines with loops, accompanied by test suites achieving 96.2% line and 87.3% branch coverage. Evaluation employs two metrics: accuracy assesses functional correctness through test-suite validation, and average speedup quantifies performance gain relative to the compiler-optimized baseline.

4.3.4. 3D Modeling

CAD-Coder. CAD-Coder [45] reformulates text-to-CAD generation as producing executable CadQuery scripts, a Python-based parametric CAD language built on the OpenCascade kernel. The benchmark is constructed by transforming the Text2CAD dataset [58] into 110K verified text-CadQuery-3D model triplets, stratified by quality into 8K high-quality, 70K medium-quality, and 32K hard cases, with an additional 1.5K chain-of-thought annotated samples. Evaluation employs two key metrics: compilation success rate measures the percentage of generated scripts that can be successfully executed to produce valid 3D geometry, while IoU quantifies geometric fidelity by computing the volumetric overlap between the generated voxels and the ground truth.

Table 5. Performance on GPU optimization, code optimization, and 3D modeling benchmarks. InCoder-32B results are highlighted in gray.

| Model | Size | CAD-Coder Comp. | CAD-Coder IoU | EmbedCGen Main (%) | SuperCoder Acc. (%) | Spd. | TritonBench G-call (%) | G-exe (%) | T-call (%) | T-exe (%) | KernelBench L1 | L2 | L3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *6B+ Models* | | | | | | | | | | | | | |
| Qwen3.5-9B | 9B | 2.0 | 0.0 | 10.0 | 36.0 | 1.0× | 2.7 | 100.0 | 3.6 | 100.0 | 0.0 | 0.0 | 2.0 |
| GPT-OSS-20B | 3.6/21B | 2.0 | 2.0 | 30.6 | 16.0 | 1.0× | 2.2 | 100.0 | 1.2 | 100.0 | 5.1 | 10.0 | 2.0 |
| Qwen3.5-27B | 27B | 4.0 | 0.3 | 9.6 | 6.0 | 1.9× | 5.4 | 100.0 | 25.9 | 97.7 | 5.1 | 6.0 | 4.0 |
| *30B+ Models* | | | | | | | | | | | | | |
| Qwen3-Coder-30B-A3B-Instruct | 3.3/30.5B | 0.0 | 0.0 | 15.4 | 50.0 | 1.0× | 8.7 | 100.0 | 24.1 | 67.5 | 2.0 | 0.0 | 0.0 |
| Seed-OSS-36B-Instruct | 36B | 2.0 | 2.0 | 10.2 | 8.0 | 1.1× | 1.6 | 100.0 | 1.8 | 100.0 | 1.0 | 2.0 | 2.0 |
| GPT-OSS-120B | 5.1/117B | 4.0 | 1.9 | 17.8 | 8.0 | 1.2× | 3.8 | 85.7 | 12.7 | 95.2 | 6.1 | 15.0 | 2.0 |
| MiniMax-M2.5 | 10/230B | 4.0 | 0.4 | 22.2 | 20.0 | 3.5× | 5.4 | 100.0 | 15.1 | 100.0 | 7.1 | 14.0 | 8.0 |
| GLM-4.7 | 32/355B | 12.0 | 6.0 | 89.6 | 20.0 | 8.6× | 3.3 | 100.0 | 6.0 | 100.0 | 8.1 | 19.0 | 0.0 |
| GLM-5 | 40/744B | 38.0 | 18.8 | 90.2 | 54.0 | 1.87× | 1.6 | 100.0 | 1.2 | 100.0 | 16.2 | 23.0 | 4.0 |
| Kimi-K2.5 | 32B/1T | 40.0 | 12.1 | 81.0 | 46.0 | 1.9× | 12.5 | 100.0 | 7.8 | 100.0 | 13.1 | 23.0 | 6.0 |
| Kimi-K2-Instruct | 32B/1T | 2.0 | 1.1 | 69.6 | 12.0 | 1.1× | 15.8 | 96.5 | 13.9 | 91.3 | 6.1 | 0.0 | 0.0 |
| Kimi-K2-Thinking | 32B/1T | 48.0 | 20.0 | - | 24.0 | 1.2× | 17.4 | 100.0 | 19.9 | 84.8 | 9.1 | 16.0 | 4.0 |
| DeepSeek-V3.2 | 37/671B | 14.0 | 4.6 | 84.4 | 30.0 | 1.8× | 19.6 | 100.0 | 18.1 | 13.3 | 3.0 | 0.0 | 0.0 |
| Qwen3.5-397B-A17B | 17/397B | 36.0 | 14.2 | 17.8 | 34.0 | 1.2× | 7.6 | 100.0 | 16.3 | 92.6 | 4.0 | 10.0 | 0.0 |
| Qwen3-Coder-480B-A35B-Instruct | 35/480B | 10.0 | 4.7 | 9.0 | 64.0 | 2.0× | 20.1 | 100.0 | 31.9 | 56.6 | 3.0 | 6.0 | 0.0 |
| InCoder-32B | 32B | 82.0 | 53.5 | 35.2 | 91.0 | 1.3× | 18.5 | 100.0 | 19.3 | 93.8 | 22.2 | 36.0 | 14.0 |
| *Closed-APIs Models* | | | | | | | | | | | | | |
| Claude-Sonnet-4.6 | – | 77.0 | 32.4 | 79.0 | 88.0 | 4.6× | 28.8 | 98.1 | 41.6 | 1.4 | 11.1 | 28.0 | 2.0 |

4.4. Main Results

Results on General Code Benchmarks. As shown in Table 1, Table 2, and Table 3, InCoder-32B, as a 32B dense model, achieves broadly competitive results with significantly larger open-weight models across code generation, code reasoning, and Text2SQL, while ranking first among all open-weight baselines on SWE-bench Verified, Mind2Web, and 𝜏²-bench.
Although a gap remains on certain reasoning and efficiency benchmarks compared to the largest MoE models, InCoder-32B delivers strong overall performance given its compact size.

Results on Industrial Code Benchmarks. As shown in Table 4 and Table 5, InCoder-32B demonstrates clear advantages on industrial code tasks. On chip design, it achieves the best open-weight results on RealBench module-level tasks by a wide margin, while remaining competitive on VeriScope and VeriRepair. On 3D modeling and GPU optimization, InCoder-32B leads all open-weight baselines on CAD-Coder and KernelBench across all three levels, and even surpasses the proprietary Claude-Sonnet-4.6 on CAD-Coder IoU and KernelBench L1/L2/L3. While tasks such as EmbedCGen and SuperCoder remain challenging, the results overall confirm that InCoder-32B provides strong industrial code capabilities.

5. Related Work

5.1. Large Language Models for Code

The code intelligence landscape has evolved rapidly, building on a rich history of code-oriented LLM research [1, 4, 5, 26, 49, 50, 61, 76, 82, 100, 101, 110, 137]. On the open-weight side, the DeepSeek series [27, 68], the GLM series [6, 135], the Kimi family [111, 112], the MiniMax series [78–80], and the Qwen ecosystem [97–99, 113] have demonstrated strong performance on both standard code generation benchmarks and agentic coding tasks.
On the closed-source side, the Claude series [8–11], the GPT series [84–86], and the Gemini series [41, 42] continue to push the frontier, excelling at complex code reasoning, multi-step planning, and tool-augmented agentic workflows. A rich set of benchmarks has been developed in parallel to track these advances, covering function-level correctness [22, 69], multilingual evaluation [19, 57], complex real-world tasks [32, 74, 140], contamination-free assessment [53], human preference alignment [125], and agentic coding capabilities [29, 55, 90, 114, 130]. Despite this rapid progress, existing LLMs for code are predominantly trained and evaluated on general software engineering tasks such as algorithm implementation, web development, and scripting. The specific demands of industrial programming domains, including hardware description languages, high-performance GPU kernel programming, compiler optimization, and computer-aided design, remain largely underexplored.

5.2. Industrial Code Intelligence

Industrial software development involves specialized domains where code LLMs must handle domain-specific language syntax, hardware-aware optimization, and strict functional correctness, posing unique challenges beyond general-purpose programming. Recent work has begun to address these challenges, though mostly in a domain-specific manner. Chip design has received the most attention. Early efforts such as VeriGen [115] and RTLCoder [72] fine-tuned LLMs for RTL code generation, with subsequent work expanding to Verilog debugging, multi-modal synthesis, and verification [20, 33, 116, 123, 134]. Recent methods increasingly adopt reinforcement learning with hardware-specific rewards, including CodeV-R1 [139], VeriReason [117], and ReasoningV [95].
Corresponding benchmarks such as VerilogEval [70, 91], RealBench [56], ArchXBench [93], CVDP [92], MetRex [2], and VeriBench [3] provide increasingly rigorous evaluation for hardware design. For embedded systems, EmbedGenius [124] and EmbedAgent [121] explore automated IoT development and benchmarking [13, 28, 34, 96].

GPU kernel optimization has also seen rapid progress [37, 77, 133]. Kevin [14], ConCuR [60], and CUDA Agent [25] apply reinforcement learning to CUDA kernel generation, while AscendKernelGen [18] and Dr.Kernel [75] further extend to NPU kernels and profiling-based reward design, with KernelBench [87] and TritonBench [64] serving as standard benchmarks. For compiler optimization, SuperCoder [118], LLM-Vectorizer [109], LLM Compiler [24], and LLM-VeriOpt [36] target tasks from assembly superoptimization to LLVM-IR peephole optimization. In 3D modeling, CAD-Coder [44] and ReCAD [62] generate parametric CAD scripts from text and images [83, 138], while STEP-LLM [103] and BrepCoder [59] target industry-standard STEP and B-rep formats.

While these efforts have significantly advanced individual domains, each model or benchmark addresses a single industrial sub-domain, leading to fragmented coverage. In contrast, InCoder-32B is a unified industrial code LLM that spans multiple industrial programming domains, bridging the gap between general-purpose code LLMs and industrial software development.

6. Analysis

6.1. Error Analysis

Despite strong overall performance, InCoder-32B still exhibits systematic failure patterns that reveal the remaining challenges of industrial code generation. We manually inspect all 1,882 failure cases across the 9 industrial benchmarks and categorize them into five recurring error themes. Figure 5 presents the error distribution for each benchmark.
Figure 5. Error distribution of InCoder-32B across 9 industrial benchmarks (total failures per benchmark in parentheses). Errors are color-coded by type: reds/oranges for compilation and syntax errors, blues for functional/logic errors, purples for format errors, greens for performance issues. Per-benchmark breakdowns: VeriScope (65): Format Error 46%, Wrong Classification 45%, Compile Error 6%, Test Failure 3%. VeriRepair (181): Logic/Functional 79%, Compilation 21%. RealBench (508): Syntax Error 42%, Output Mismatch 29%, Other Compile 19%, Port/Module Error 9%, Bit-width Error 2%. ArchXBench (226): Functional 49%, Other Compile 15%, Syntax Error 13%, Port Error 10%, Width/Concat 6%, Type Mismatch 4%, Encoding Error 4%. CAD-Coder (29): Shape (IoU 0.1-0.5) 45%, Shape (IoU < 0.1) 34%, Shape (IoU 0.5-0.7) 14%, Runtime Error 7%. EmbedCGen (397): Link Error 47%, Timeout 31%, Missing main 14%, Compile Error 5%, Syntax Error 3%. SuperCoder (6): No Optimization 83%, Compile Failure 17%. TritonBench (285): NameError 33%, TypeError 24%, Compile/Syntax 17%, AssertionError 9%, ValueError 6%, AttributeError 4%, CUDA Error 4%, Other 4%. KernelBench (185): Correctness 35%, CUDA/Memory 19%, Slow (< 0.5×) 17%, Slow (0.5-1×) 16%, Static Check 6%, Build Error 6%, Other 1%.

The most prevalent failure category is compilation and syntax errors, which dominate Verilog generation tasks: 71% of RealBench failures involve malformed literals, incorrect port declarations, or bit-width mismatches, and ArchXBench shows a similar pattern with 51% syntax errors, including misuse of named ports on primitive gates and signed literals of indefinite width.
Beyond syntax, knowledge of industrial APIs remains incomplete: on EmbedCGen, 47% of failures are linker errors from referencing undefined or incorrectly typed HAL/CMSIS functions, while TritonBench sees 33% NameErrors and 24% TypeErrors from incorrect Triton API usage. A further 46% of VeriScope failures are unparseable outputs where the model ignores the required structured format entirely. Together, these errors suggest that InCoder-32B has acquired broad domain vocabulary but has not fully internalized the precise syntactic rules and library constraints of industrial languages.

When code compiles, functional correctness becomes the dominant bottleneck. VeriRepair illustrates this most sharply: 79% of its failures produce code that compiles but fails test cases, pointing to subtle logic errors in repair reasoning rather than mistakes at the surface level. ArchXBench echoes this with 49% functional failures where generated Verilog simulates incorrectly, and CAD-Coder failures are almost entirely geometric (93%), most stemming from a systematic misinterpretation of Euler angle conventions that rotates extrusions around the wrong axis. These cases share a common structure: the model produces superficially plausible code that breaks under precise numerical or state-machine semantics, revealing the difficulty of reasoning about correctness in domains where small mistakes propagate into observable failures.

Benchmarks that measure optimization expose a third gap: generating correct code is necessary but not sufficient for industrial tasks. On KernelBench, 33% of failures produce functionally correct but insufficiently fast GPU kernels, and on SuperCoder, 83% of failures reduce to the model simply copying the input x86-64 assembly without modification.
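The fast_p criterion that these correct-but-slow KernelBench kernels fall short of can be sketched as follows; this is a minimal illustration assuming per-task correctness flags and measured wall-clock times, with hypothetical field names:

```python
def fast_p(tasks, p=1.0):
    """Fraction of tasks whose generated kernel is both correct AND achieves
    a speedup over the baseline exceeding the threshold p,
    where speedup = baseline_time / kernel_time."""
    hits = sum(
        1 for t in tasks
        if t["correct"] and t["baseline_time"] / t["kernel_time"] > p
    )
    return hits / len(tasks)

tasks = [
    {"correct": True,  "baseline_time": 2.0, "kernel_time": 1.0},  # 2.0x speedup: counts
    {"correct": True,  "baseline_time": 1.0, "kernel_time": 2.0},  # 0.5x: correct but slow
    {"correct": False, "baseline_time": 3.0, "kernel_time": 1.0},  # fast but wrong
]
```

Under this definition, a correct kernel that merely matches baseline speed still fails fast_1, which is exactly the failure mode described above.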
Collectively, these failure patterns indicate that industrial code intelligence demands more than general code training can provide: closing the gaps will require data curation targeting rare APIs and hardware semantics, training with verification signals in the loop, and explicit reasoning about low-level performance. There is still a long journey toward applying LLMs in realistic industrial scenarios.

Figure 6. Performance across nine industrial code benchmarks as SFT data scales from 83M to 250M tokens.

6.2. Effects of Different Stages

To understand how the volume of industrial SFT data affects downstream performance, we train three checkpoints on 83M, 167M, and 250M tokens respectively and evaluate each across all nine benchmarks. Figure 6 summarizes these trends. The majority of benchmarks show consistent improvements as the data scale grows, confirming that core industrial coding and architectural reasoning capabilities benefit from larger-scale fine-tuning. Only a few individual sub-metrics on RealBench and TritonBench exhibit minor regressions at the 250M stage, yet they still remain above or close to the 83M baseline, suggesting that verification-related understanding may saturate early with a smaller high-quality SFT dataset.
Overall, these findings confirm that scaling industrial SFT data is a reliable driver of performance, with only marginal fluctuations observed at the largest scale.

7. Conclusion

In this paper, we have presented InCoder-32B, a code foundation model that bridges the gap between general code intelligence and the stringent demands of industrial software development. Through a systematic three-stage Code-Flow training pipeline, InCoder-32B effectively acquires reasoning capabilities about hardware behavior and industrial constraints without sacrificing general programming performance, while offering both instruction tuning and analytical reasoning within a unified framework. Extensive evaluations across 14 general benchmarks and 9 industrial benchmarks covering chip design, embedded systems, GPU optimization, and compiler optimization demonstrate that InCoder-32B achieves competitive performance with leading models on general tasks while establishing strong baselines in industrial domains. Our ablation studies further reveal that repository transition data outperforms static snapshots for planning signal, mid-training reasoning trajectories serve as a critical scaffold for robustness under distribution shift, and thinking paths unlock emergent capabilities absent in standard SFT.

References

[1] 01.ai. Yi-Coder. https://github.com/01-ai/Yi-Coder, 2024. Accessed: 2025-09-20.

[2] Manar Abdelatty, Jingxiao Ma, and Sherief Reda. MetRex: A benchmark for Verilog code metric reasoning using LLMs. In Proceedings of the 30th Asia and South Pacific Design Automation Conference, pages 995-1001, 2025.

[3] Mihir Agarwal, Zaqi Momin, Kailash Prasad, and Joycee Mekie. VeriBench: Benchmarking large language models for Verilog code generation and design synthesis. In 2025 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1-5. IEEE, 2025.
[4] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K. Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint, 2025.

[5] Stability AI. StableCode: A 3B parameter code language model, 2023. URL https://huggingface.co/stabilityai/stable-code-3b. Accessed: 2025-09-20.

[6] Zhipu AI. GLM-4.7: Advancing the coding capability. https://z.ai/blog/glm-4.7, 2025.

[7] Nadia Alshahwan, Jubin Chheda, Anastasia Finegenova, Beliz Gokkaya, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, and Eddy Wang. Automated unit test improvement using large language models at Meta, 2024. URL https://arxiv.org/abs/2402.09171.

[8] Anthropic. Introducing Claude Sonnet 4.5, 2025. URL https://www.anthropic.com/news/claude-sonnet-4-5.

[9] Anthropic. Introducing Claude Opus 4.5, 2025. URL https://www.anthropic.com/news/claude-opus-4-5.

[10] Anthropic. Introducing Claude Sonnet 4.6, 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6.

[11] Anthropic. Introducing Claude Opus 4.6, 2026. URL https://www.anthropic.com/news/claude-opus-4-6.

[12] Jacob Austin et al. Program synthesis with large language models. In ICML, 2021.

[13] Marek Babiuch and Pavel Smutný. Benchmarking large language models for embedded systems programming in microcontroller-driven IoT applications. Future Internet, 18(2):94, 2026.

[14] Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, and Silas Alberti. Kevin: Multi-turn RL for generating CUDA kernels. arXiv preprint, 2025.

[15] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. tau2-bench: Evaluating conversational agents in a dual-control environment, 2025. URL https://arxiv.org/abs/2506.07982.

[16] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. tau2-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.
[17] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 41-48, 2009.

[18] Xinzi Cao, Jianyang Zhai, Pengfei Li, Zhiheng Hu, Cen Yan, Bingxu Mu, Guanghuan Fang, Bin She, Jiayu Li, Yihan Su, et al. AscendKernelGen: A systematic study of LLM-based kernel generation for neural processing units. arXiv preprint, 2026.

[19] Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation, 2022.

[20] Kaiyan Chang, Zhirong Chen, Yunhao Zhou, Wenlong Zhu, Kun Wang, Haobo Xu, Cangyuan Li, Mengdi Wang, Shengwen Liang, Huawei Li, et al. Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, pages 1-9, 2024.

[21] Sahil Chaudhary. Code Alpaca: An instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca, 2023.

[22] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N.
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.

[23] Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Kim M. Hazelwood, Gabriel Synnaeve, and Hugh Leather. Large language models for compiler optimization. CoRR, abs/2309.07062, 2023.

[24] Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather. LLM Compiler: Foundation language models for compiler optimization. In Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction, pages 141-153, 2025.

[25] Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, et al. CUDA Agent: Large-scale agentic RL for high-performance CUDA kernel generation. arXiv preprint, 2026.

[26] DeepSeek AI. DeepSeek-Coder-V2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint, 2024.

[27] DeepSeek-AI. DeepSeek-V3.1-Terminus, 2025. URL https://api-docs.deepseek.com/news/news250922.

[28] Matthew DeLorenzo, Vasudev Gohil, and Jeyavijayan Rajendran. CreativEval: Evaluating creativity of LLM-based hardware code generation. In 2024 IEEE LLM Aided Design Workshop (LAD), pages 1-5. IEEE, 2024.

[29] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091-28114, 2023.
[30] Colin Diggs, Michael Doyle, Amit Madan, Siggy Scott, Emily Escamilla, Jacob Zimmer, Naveed Nekoo, Paul Ursino, Michael Bartholf, Zachary Robin, Anand Patel, Chris Glasz, William Macke, Paul Kirk, Jasper Phillips, Arun Sridharan, Doug Wendt, Scott Rosen, Nitin Naik, Justin F. Brunelle, and Samruddhi Thaker. Leveraging LLMs for legacy code modernization: Challenges and opportunities for LLM-generated documentation, 2024. URL https://arxiv.org/abs/2411.14971.

[31] Mingzhe Du, Anh Tuan Luu, Bin Ji, Qian Liu, and See-Kiong Ng. Mercury: A code efficiency benchmark for code large language models, 2024.

[32] Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, and Siheng Chen. SWE-Dev: Evaluating and training autonomous feature-driven software development. arXiv preprint, 2025.

[33] Shailja Thakur et al. Benchmarking large language models for automated Verilog RTL code generation. DATE, IEEE, pages 1-6, 2023.

[34] Zachary Englhardt, Richard Li, Dilini Nissanka, Zhihan Zhang, Girish Narayanswamy, Joseph Breda, Xin Liu, Shwetak Patel, and Vikram Iyer. Exploring and characterizing large language models for embedded system development and debugging. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1-9, 2024.

[35] Zachary Englhardt, Richard Li, Dilini Nissanka, Zhihan Zhang, Girish Narayanswamy, Joseph Breda, Xin Liu, Shwetak N. Patel, and Vikram Iyer. Exploring and characterizing large language models for embedded system development and debugging. In Florian 'Floyd' Mueller, Penny Kyburz, Julie R. Williamson, and Corina Sas, editors, Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA 2024, Honolulu, HI, USA, May 11-16, 2024, pages 150:1-150:9. ACM, 2024. doi: 10.1145/3613905.3650764. URL https://doi.org/10.1145/3613905.3650764.

[36] Xiangxin Fang, Jiaqin Kang, Rodrigo Rocha, Sam Ainsworth, and Lev Mukhanov.
LLM-VeriOpt: Verification-guided reinforcement learning for LLM-based compiler optimization. In 2026 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 740-755. IEEE, 2026.

[37] Zacharias V. Fisches, Sahan Paliskara, Simon Guo, Alex Zhang, Joe Spisak, Chris Cummins, Hugh Leather, Gabriel Synnaeve, Joe Isaacson, Aram Markosyan, et al. KernelLLM: Making kernel development more accessible, June 2025.

[38] Cuiyun Gao, Xing Hu, Shan Gao, Xin Xia, and Zhi Jin. The current challenges of software engineering in the era of large language models. ACM Transactions on Software Engineering and Methodology, 34(5):1-30, May 2025. ISSN 1557-7392. doi: 10.1145/3712005. URL http://dx.doi.org/10.1145/3712005.

[39] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027.

[40] Linyuan Gong, Alvin Cheung, Mostafa Elhoushi, and Sida Wang. Structure-aware fill-in-the-middle pretraining for code, 2025.

[41] Google DeepMind. Gemini 3 Pro: The frontier of vision AI. https://blog.google/innovation-and-ai/technology/developers-tools/gemini-3-pro-vision/, 2025.

[42] Google DeepMind. Gemini 3.1 Pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, 2026.

[43] Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I. Wang. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024.

[44] Yandong Guan, Xilin Wang, Ximing Xing, Jing Zhang, Dong Xu, and Qian Yu. CAD-Coder: Text-to-CAD generation with chain-of-thought and geometric reward.
arXiv preprint arXiv:2505.19713, 2025.

[45] Yandong Guan, Xilin Wang, Ximing Xing, Jing Zhang, Dong Xu, and Qian Yu. CAD-Coder: Text-to-CAD generation with chain-of-thought and geometric reward, 2025. URL https://arxiv.org/abs/2505.19713.

[46] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. DeepSeek-Coder: When the large language model meets programming - the rise of code intelligence. CoRR, abs/2401.14196, 2024. doi: 10.48550/ARXIV.2401.14196. URL https://doi.org/10.48550/arXiv.2401.14196.

[47] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. DeepSeek-Coder: When the large language model meets programming - the rise of code intelligence, 2024.

[48] Shailja Gupta, Rajesh Ranjan, and Surya Narayan Singh. A comprehensive survey of retrieval-augmented generation (RAG): Evolution, current landscape and future directions. CoRR, abs/2410.12837, 2024. doi: 10.48550/ARXIV.2410.12837. URL https://doi.org/10.48550/arXiv.2410.12837.

[49] Siming Huang, Tianhao Cheng, Jason Klein Liu, Weidi Xu, Jiaran Hao, Liuyihan Song, Yang Xu, Jian Yang, Jiaheng Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Xianzhen Luo, Qiufeng Wang, YuanTao Fan, Qingfu Zhu, Zhaoxiang Zhang, Yang Gao, Jie Fu, Qian Liu, Houyi Li, Ge Zhang, Yuan Qi, Yinghui Xu, Wei Chu, and Zili Wang. OpenCoder: The open cookbook for top-tier code large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 33167-33193. Association for Computational Linguistics, 2025. URL https://aclanthology.org/2025.acl-long.1591/.
[50] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-Coder technical report. CoRR, abs/2409.12186, 2024. doi: 10.48550/ARXIV.2409.12186. URL https://doi.org/10.48550/arXiv.2409.12186.

[51] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-Coder technical report, 2024.

[52] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint, 2024.

[53] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code, 2024. URL https://arxiv.org/abs/2403.07974.

[54] Nan Jiang, Xiaopeng Li, Shiqi Wang, Qiang Zhou, Soneya B. Hossain, Baishakhi Ray, Varun Kumar, Xiaofei Ma, and Anoop Deoras. Ledex: Training LLMs to better self-debug and explain code. Advances in Neural Information Processing Systems, 37:35517-35543, 2024.

[55] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=VTF8yNQM66.
56 Pengwei Jin, Di Huang, Chongxiao Li, Shuyao Cheng, Yang Zhao, Xinyao Zheng, Jiaguo Zhu, Shuyi Xing, Bohan Dou, Rui Zhang, et al. Realbench: Benchmarking verilog generation models with real-world ip designs. arXiv preprint, 2025.
57 Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. xcodeeval: A large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval, 2023. URL https://arxiv.org/abs/2303.03004.
58 Mohammad S Khan, Sankalp Sinha, Talha U Sheikh, Didier Stricker, Sk A Ali, and Muhammad Z Afzal. Text2cad: Generating sequential cad designs from beginner-to-expert level text prompts. Advances in Neural Information Processing Systems, 37:7552–7579, 2024.
59 Mingi Kim, Yongjun Kim, Jungwoo Kang, and Hyungki Kim. Brepcoder: A unified multimodal large language model for multi-task b-rep reasoning. arXiv preprint, 2026.
60 Lingcheng Kong, Jiateng Wei, Hanzhang Shen, and Huan Wang. Concur: Conciseness makes state-of-the-art kernel generation. arXiv preprint, 2025.
61 Jia Li, Hao Zhu, Huanyu Liu, Xianjie Shi, He Zong, Yihong Dong, Kechi Zhang, Siyuan Jiang, Zhi Jin, and Ge Li. aixcoder-7b-v2: Training llms to fully utilize the long context in repository-level code completion. arXiv preprint, 2025.
62 Jiahao Li, Yusheng Luo, Yunzhong Lou, and Xiangdong Zhou. Recad: Reinforcement learning enhanced parametric cad model generation with vision-language models. arXiv preprint arXiv:2512.06328, 2025.
63 Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. Tritonbench: Benchmarking large language model capabilities for generating triton operators, 2025. URL https://arxiv.org/abs/2502.14752.
64 Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, and Maosong Sun.
Tritonbench: Benchmarking large language model capabilities for generating triton operators, 2025. URL https://arxiv.org/abs/2502.14752.
65 Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, et al. Can llm already serve as a database interface? A big bench for large-scale database grounded text-to-sqls. CoRR, abs/2305.03111, 2023.
66 Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Joel Lamy-Poirier, Joao Monteiro, Nicolas Gontier, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Ben Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason T Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Urvashi Bhattacharyya, Wenhao Yu, Sasha Luccioni, Paulo Villegas, Fedor Zhdanov, Tony Lee, Nadav Timor, Jennifer Ding, Claire S Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro Von Werra, and Harm de Vries. Starcoder: may the source be with you! Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=KoFOg41haE. Reproducibility Certification.
67 Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022.
68 Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models.
arXiv preprint, 2025.
69 Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.
70 Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. VerilogEval: evaluating large language models for verilog code generation. In 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2023.
71 Mingjie Liu, Nathaniel Ross Pinckney, Brucek Khailany, and Haoxing Ren. Invited paper: Verilogeval: Evaluating large language models for verilog code generation. In IEEE/ACM International Conference on Computer Aided Design, ICCAD 2023, San Francisco, CA, USA, October 28 - Nov. 2, 2023, pages 1–8. IEEE, 2023. doi: 10.1109/ICCAD57390.2023.10323812. URL https://doi.org/10.1109/ICCAD57390.2023.10323812.
72 Shang Liu, Wenji Fang, Yao Lu, Jing Wang, Qijun Zhang, Hongce Zhang, and Zhiyao Xie. Rtlcoder: Fully open-source and efficient llm-assisted rtl code generation technique. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 44(4):1448–1461, 2024.
73 Shangqing Liu, Yanzhou Li, Xiaofei Xie, Wei Ma, Guozhu Meng, and Yang Liu. Automated commit intelligence by pre-training. ACM Transactions on Software Engineering and Methodology, July 2024. ISSN 1557-7392. doi: 10.1145/3674731. URL http://dx.doi.org/10.1145/3674731.
74 Siyao Liu, Ge Zhang, Boyuan Chen, Jialiang Xue, and Zhendong Su. FullStack Bench: Evaluating llms as full stack coders. arXiv preprint, 2024. URL https://arxiv.org/abs/2412.00535.
75 Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, and Junxian He. Dr. kernel: Reinforcement learning done right for triton kernel generations. arXiv preprint arXiv:2602.05885, 2026.
76 Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian J. McAuley, Han Hu, Torsten Scholak, Sébastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, and et al. Starcoder 2 and the stack v2: The next generation. CoRR, abs/2402.19173, 2024. doi: 10.48550/ARXIV.2402.19173. URL https://doi.org/10.48550/arXiv.2402.19173.
77 Ryo Mikasa, Shun-ichiro Hayashi, Daichi Mukunoki, Tetsuya Hoshino, and Takahiro Katagiri. Improving hpc code generation capability of llms via online reinforcement learning with real-machine benchmark rewards. arXiv preprint, 2026.
78 MiniMax. Minimax m2 and agent: Ingenious in simplicity. https://www.minimax.io/news/minimax-m2, 2025.
79 MiniMax. Minimax m2.1: Significantly enhanced multi-language programming, built for real-world complex tasks. https://www.minimax.io/news/minimax-m21, 2025.
80 MiniMax. Minimax m2.5: Built for real-world productivity. https://www.minimax.io/news/minimax-m25, 2026.
81 Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, and Anxiang Zeng. Mid-training of large language models: A survey, 2025. URL https://arxiv.org/abs/2510.06826.
82 Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou.
Codegen2: Lessons for training llms on programming and natural languages. CoRR, abs/2305.02309, 2023. doi: 10.48550/ARXIV.2305.02309. URL https://doi.org/10.48550/arXiv.2305.02309.
83 Ke Niu, Haiyang Yu, Zhuofan Chen, Mengyang Zhao, Teng Fu, Bin Li, and Xiangyang Xue. From intent to execution: Multimodal chain-of-thought reinforcement learning for precise cad code generation. arXiv preprint, 2025.
84 OpenAI. Introducing gpt-5 for developers. https://openai.com/index/introducing-gpt-5-for-developers/, 2025. Accessed: 2025-09-30.
85 OpenAI. Introducing gpt-5.3-codex, 2026. URL https://openai.com/index/introducing-gpt-5-3-codex/.
86 OpenAI. Introducing gpt-5.4, 2026. URL https://openai.com/index/introducing-gpt-5-4/.
87 Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels?, 2025. URL https://arxiv.org/abs/2502.10517.
88 Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. Kernelbench: Can llms write efficient gpu kernels?, 2025. URL https://arxiv.org/abs/2502.10517.
89 Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
90 Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Advances in Neural Information Processing Systems, 2024.
91 Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, and Brucek Khailany. Revisiting verilogeval: Newer llms, in-context learning, and specification-to-rtl tasks, 2024.
92 Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, and Haoxing Ren. Comprehensive verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on rtl design and verification. arXiv preprint, 2025.
93 Suresh Purini, Siddhant Garg, Mudit Gaur, Sankalp Bhat, Sohan Mupparapu, and Arun Ravindran. Archxbench: A complex digital systems benchmark suite for llm driven rtl synthesis. In 2025 ACM/IEEE 7th Symposium on Machine Learning for CAD (MLCAD), pages 1–10. IEEE, 2025.
94 Suresh Purini, Siddhant Garg, Mudit Gaur, Sankalp Bhat, and Arun Ravindran. Archxbench: A complex digital systems benchmark suite for llm driven rtl synthesis. In Proceedings of the 7th ACM/IEEE International Symposium on Machine Learning for CAD (MLCAD), pages xx–yy, Santa Cruz, California, USA, September 2025. ACM and IEEE.
95 Haiyan Qin, Zhiwei Xie, Jingjing Li, Liangchen Li, Xiaotong Feng, Junzhan Liu, and Wang Kang. Reasoningv: Efficient verilog code generation with adaptive hybrid reasoning model. arXiv preprint arXiv:2504.14560, 2025.
96 Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang, Yang Xing, and Mani Srivastava. Sensorbench: Benchmarking llms in coding-based sensor processing. In Proceedings of the 26th International Workshop on Mobile Computing Systems and Applications, pages 25–30, 2025.
97 Qwen. Qwen3-coder: Agentic coding in the world, 2025. URL https://qwenlm.github.io/blog/qwen3-coder.
98 Qwen Team. Qwen3-coder-next technical report. URL https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf. Accessed: 2026-02-03.
99 Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5.
100 Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, et al. Code Llama: Open foundation models for code. arXiv preprint, 2024.
101 ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, et al. Seed-coder: Let the code model curate data for itself. arXiv preprint, 2025.
102 Mcp-universe: Benchmarking large language models with real-world model context protocol servers, 2025.
103 Xiangyu Shi, Junyang Ding, Xu Zhao, Sinong Zhan, Payal Mohapatra, Daniel Quispe, Kojo Welbeck, Jian Cao, Wei Chen, Ping Guo, et al. Step-llm: Generating cad step models from natural language with large language models. arXiv preprint, 2026.
104 Kalahasti Ganesh Srivatsa, Sabyasachi Mukhopadhyay, Ganesh Katrapati, and Manish Shrivastava. A survey of using large language models for generating infrastructure as code, 2024.
105 swe-bench live. swe-bench-live, 2025. URL https://swe-bench-live.github.io/.
106 swebench. swebench, 2025. URL https://www.swebench.com/original.html.
107 swebenchmultilingual. swebenchmultilingual, 2025. URL https://www.swebench.com/multilingual.html.
108 swebenchverified. swebenchverified, 2025. URL https://openai.com/index/introducing-swe-bench-verified/.
109 Jubi Taneja, Avery Laird, Cong Yan, Madan Musuvathi, and Shuvendu K Lahiri. Llm-vectorizer: Llm-based verified loop vectorizer. In Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization, pages 137–149, 2025.
110 ByteDance Seed Team. Seed-oss open-source models. https://github.com/ByteDance-Seed/seed-oss, 2025.
111 Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
112 Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
113 Qwen Team.
Qwen3-next, 2025. URL https://huggingface.co/collections/Qwen/qwen3-next. Accessed: 2025.
114 The Terminal-Bench Team. Terminal-bench: A benchmark for ai agents in terminal environments, Apr 2025. URL https://github.com/laude-institute/terminal-bench.
115 Shailja Thakur, Baleegh Ahmad, Hammond Pearce, et al. VeriGen: A large language model for verilog code generation. arXiv preprint, 2023.
116 Ning Wang, Bingkun Yao, Jie Zhou, Yuchen Hu, Xi Wang, Zhe Jiang, and Nan Guan. Veridebug: A unified llm for verilog debugging via contrastive embedding and guided correction. In 2025 IEEE International Conference on LLM-Aided Design (ICLAD), pages 61–67. IEEE, 2025.
117 Yiting Wang, Guoheng Sun, Wanghao Ye, Gang Qu, and Ang Li. Verireason: Reinforcement learning with testbench feedback for reasoning-enhanced verilog generation. arXiv preprint arXiv:2505.11849, 2025.
118 Anjiang Wei, Tarun Suresh, Huanmi Tan, Yinglun Xu, Gagandeep Singh, Ke Wang, and Alex Aiken. Supercoder: Assembly program superoptimization with large language models. arXiv preprint arXiv:2505.11480, 2025.
119 Anjiang Wei, Tarun Suresh, Huanmi Tan, Yinglun Xu, Gagandeep Singh, Ke Wang, and Alex Aiken. Supercoder: Assembly program superoptimization with large language models, 2026.
120 Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
121 Ruiyang Xu, Jialun Cao, Mingyuan Wu, Wenliang Zhong, Yaojie Lu, Ben He, Xianpei Han, Shing-Chi Cheung, and Le Sun. Embedagent: Benchmarking large language models in embedded system development. arXiv preprint, 2025.
122 An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
123 Guang Yang, Wei Zheng, Xiang Chen, Dong Liang, Peng Hu, Yukui Yang, Shaohang Peng, Zhenghan Li, Jiahui Feng, Xiao Wei, et al. Large language model for verilog code generation: Literature review and the road ahead. arXiv preprint, 2025.
124 Huanqi Yang, Mingzhe Li, Mingda Han, Zhenjiang Li, and Weitao Xu. Embedgenius: Towards automated software development for generic embedded iot systems. arXiv preprint arXiv:2412.09058, 2024.
125 Jian Yang, Jiaxi Yang, Ke Jin, Yibo Miao, Lei Zhang, Liqun Yang, Zeyu Cui, Yichang Zhang, Binyuan Hui, and Junyang Lin. Evaluating and aligning codellms on human preference. arXiv preprint arXiv:2412.05210, 2024.
126 Jian Yang, Wei Zhang, Shark Liu, Jiajun Wu, Shawn Guo, and Yizhi Li. From code foundation models to agents and applications: A practical guide to code intelligence. arXiv preprint arXiv:2511.18538, 2025.
127 Jian Yang, Wei Zhang, Shawn Guo, Zhengmao Ye, Lin Jing, Shark Liu, Yizhi Li, Jiajun Wu, Cening Liu, X Ma, et al. Iquest-coder-v1 technical report. arXiv preprint, 2026.
128 John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024.
129 Zonghan Yang, Shengjie Wang, Kelin Fu, Wenyang He, Weimin Xiong, Yibo Liu, Yibo Miao, Bofei Gao, Yejie Wang, Yingwei Ma, et al. Kimi-dev: Agentless training as skill prior for swe-agents. arXiv preprint, 2025.
130 Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. taubench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint, 2024.
131 He Ye, Matias Martinez, Xiapu Luo, Tao Zhang, and Martin Monperrus. Selfapr: Self-supervised program repair with test execution diagnostics. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–13, 2022.
132 Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887, 2018.
133 Yang Yu, Peiyu Zang, Chi Hsu Tsai, Haiming Wu, Yixin Shen, Jialing Zhang, Haoyu Wang, Zhiyou Xiao, Jingze Shi, Yuyu Luo, et al. Towards automated kernel generation in the era of llms. arXiv preprint, 2026.
134 Patrick Yubeaton, Andre Nakkab, Weihua Xiao, Luca Collini, Ramesh Karri, Chinmay Hegde, and Siddharth Garg. Verithoughts: Enabling automated verilog code generation using reasoning and formal verification. arXiv preprint, 2025.
135 Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
136 Zizheng Zhan, Ken Deng, Xiaojiang Zhang, Jinghui Wang, Huaixi Tang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, Wenhao Zhuang, et al. Kat-coder technical report. arXiv preprint arXiv:2510.18779, 2025.
137 Qinkai Zheng, Xiao Xia, Xu Zou, et al. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. KDD, 2023.
138 Zheyuan Zhou, Jiayi Han, Liang Du, Naiyu Fang, Lemiao Qiu, and Shuyou Zhang. Cad-judge: Toward efficient morphological grading and verification for text-to-cad generation. arXiv preprint arXiv:2508.04002, 2025.
139 Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, et al. Qimeng-codev-r1: Reasoning-enhanced verilog generation. arXiv preprint, 2025.
140 Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al.
Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877, 2024.

A. Pre-training Details

A.1. Model Architecture

InCoder-32B adopts a standard decoder-only Transformer architecture. The detailed model configurations are provided in Table 6.

A.2. Pre-training Data

Existing large-scale code corpora, such as The Stack v2 [76], StarCoderData [66], and The Pile [39], are predominantly composed of web-oriented languages (JavaScript, TypeScript, Python for web frameworks), while industrial code remains underrepresented. Verilog and VHDL occupy only a negligible fraction of total tokens. CUDA and Triton kernels are not separately categorized and are often mixed into general C/C++ code. Low-level systems code such as device drivers and firmware is difficult to distinguish from ordinary C programs. This imbalance directly limits the industrial capabilities of models trained on such data. To address this, we construct a data pipeline for collecting, cleaning, and refining industrial code.

A.2.1. Data Collection

We collect industrial code from three complementary sources.

Repository-level recall. We define a domain taxonomy covering digital circuit design, high-performance computing, embedded systems, and CAD automation, and retrieve relevant code from large-scale public repositories through a three-step recall strategy with increasing coverage. In the first step, we apply rule-based filtering using file extensions, directory naming conventions, and domain-specific keywords (e.g., endmodule / posedge for Verilog, __global__ / <<<...>>> for CUDA, #pragma HLS for FPGA synthesis) to collect files with obvious industrial characteristics. In the second step, we train a FastText classifier on a manually labeled seed set to recall additional industrial code that shares similar statistical patterns but does not match simple rules.
In the third step, we employ a domain-adapted semantic encoder to retrieve samples that lack both the lexical and statistical features captured by the previous two steps, such as hardware-related C/C++ libraries and low-level driver code that are syntactically indistinguishable from general-purpose programs.

Technical literature via OCR. We apply OCR to extract code snippets and structured content from technical books and hardware reference manuals, recovering domain knowledge that is largely absent from public code repositories.

Domain-specific web sources. We further collect data from technical forums, vendor documentation, and engineering reports, which provide practical code samples and usage patterns not covered by the above two sources.

A.2.2. Data Cleaning

The collected data undergoes multi-step cleaning. We first filter out files with restrictive licenses, remove personally identifiable information and embedded credentials, and discard trivially invalid files such as empty stubs, auto-generated boilerplate, and binary artifacts. Deduplication is then performed at four levels: exact hash matching, token-level near-duplicate detection via MinHash LSH [76], repository-level fork consolidation, and cross-source deduplication. We also apply lightweight domain-specific checks, including syntax parsing for Verilog and SystemVerilog, header consistency checks for C/C++, and format validation for GPU kernel configurations. Samples that fail these checks are routed to the refinement step for repair or removed.

A.2.3. Data Refinement

We refine the cleaned data in two steps. The first step normalizes surface-level formatting by unifying style conventions, standardizing module boundaries, and removing noise such as commented-out dead code and outdated compatibility blocks.
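A minimal version of this first normalization step, for Verilog-style sources, might look like the following. The specific heuristics (tab expansion, dropping single-line commented-out statements) are our own illustrative choices, not the pipeline's actual rules.

```python
import re

def normalize_surface(src: str) -> str:
    """Surface-level normalization: unify whitespace style and drop
    commented-out dead code (illustrative heuristics only)."""
    out = []
    for line in src.splitlines():
        line = line.rstrip()               # strip trailing whitespace
        line = line.replace("\t", "    ")  # unify indentation style
        # Drop lines that look like commented-out statements, e.g.
        # "// assign y = a & b;" -- a common form of dead code.
        if re.match(r"\s*//\s*\w+.*;\s*$", line):
            continue
        out.append(line)
    # Collapse runs of blank lines left behind by removed blocks.
    text = "\n".join(out)
    return re.sub(r"\n{3,}", "\n\n", text) + "\n"
```

A real pass would, of course, distinguish documentation comments from dead code more carefully, for example by parsing rather than pattern matching.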
The second step adds structured annotations, including cross-file dependency resolution, platform and constraint metadata, and natural-language descriptions at the function, module, and project levels for code-text alignment. All refined samples are verified through AST comparison and re-compilation to ensure correctness.

A.3. Training

We train InCoder-32B on 4,096 GPUs with autoregressive language modeling and fill-in-the-middle (FIM) completion [40, 46], using a constant learning rate of 3 × 10⁻⁴ and a global batch size of 2,048 for a total of 15T tokens. Training data follows a curriculum schedule [17] that progresses from function-level, single-file samples to multi-file, project-level data. A proportion of general-purpose code and text is retained in every batch to preserve general programming capabilities. We periodically evaluate on held-out industrial benchmarks and adjust data sampling weights accordingly.

B. Mid-Training Loss Curves

This appendix presents the training loss curves for the two stages of mid-training described in subsection 3.2. Figure 7 shows the training dynamics for both Stage 2.1 (8K to 32K extension) and Stage 2.2 (32K to 128K extension). In Stage 2.1, the loss curve demonstrates stable convergence with the cosine learning rate decay schedule, indicating successful adaptation to longer context windows while maintaining training stability. In Stage 2.2, the graduated warm-up strategy for long-sequence samples (starting at 10% and increasing to 50%) is reflected in the loss dynamics, with stable convergence achieved after the warm-up period.

Figure 7. Training loss curves for both stages of mid-training, plotting loss against tokens (B): (a) Stage 2.1, 8K to 32K context extension; (b) Stage 2.2, 32K to 128K context extension.
(a) Stage 2.1 shows consistent convergence with cosine learning rate decay. (b) Stage 2.2 reflects the graduated warm-up strategy for long-sequence samples.

C. Mid-Training Details

Industrial hardware and systems development demands capabilities beyond what general-purpose code models provide. Production hardware designs (RTL circuits, GPU kernels, systems code, FPGA synthesis) involve domain-specific languages (Verilog, C/C++, Triton), stringent timing and resource constraints, and verification methodologies rarely seen in general software repositories [30, 38]. Models trained on web application code lack the structured reasoning required for hardware scenarios, often producing outputs that violate timing constraints or introduce synthesis errors. Mid-training bridges this gap by continuing pretraining on hardware-aware industrial data with targeted objectives [81], transforming the base model into an industrial-grade, hardware-aware code foundation model capable of reasoning about timing, resource optimisation, and performance-critical system implementations.

C.1. Progressive Context Extension Strategy

Our mid-training adopts a two-stage progressive curriculum extending the model's context from the pretrained 8K tokens to 128K tokens.

Stage 2.1: 8K to 32K Extension. Stage 2.1 extends from the pretrained 8K checkpoint to 32K tokens, targeting file-level tasks: completing RTL modules, infilling kernel functions, and generating testbenches. We train directly on sequences up to 32K without positional encoding interpolation. The data mixture emphasises reasoning QA (40%), agent trajectories (20%), commits (15%), industrial artefacts (15%), and FIM (10%), with cosine learning rate decay to avoid catastrophic forgetting. The training loss curve demonstrates stable convergence throughout this stage (see Figure 7a in Appendix B).

Stage 2.2: 32K to 128K Extension.
Stage 2.2 extends to 128K tokens, unlocking long-context capabilities: extended debugging sessions, large hardware projects with cross-module dependencies, and multi-file refactoring campaigns. Long-sequence samples (>32K) are introduced via graduated warm-up, starting at 10% and increasing linearly to 50% over the first quarter of training steps. Data replay (5–10% from Stage 2.1) preserves file-scale performance. The learning rate is reset and follows a second cosine decay. Training dynamics reflect the graduated warm-up strategy, with stable convergence after the warm-up period (see Figure 7b in Appendix B).

C.2. Industrial Data Synthesis and Curation

Our training data consists of two complementary sources: curated industrial code data and synthetically generated industrial reasoning QA. While curated data provides authentic examples from real-world codebases, synthetic data enables systematic coverage of specialised industrial scenarios underrepresented in public repositories.

C.2.1. Synthetic Industrial Code QA

Motivation. Industrial hardware and systems engineering spans diverse domains, including digital circuit design (RTL/Verilog), GPU computing (Triton operators, CUDA kernels), systems programming (C/C++/Rust kernels), FPGA synthesis (HLS), CAD tool integration, and embedded systems, each with domain-specific challenges, timing constraints, resource budgets, and verification methodologies. Publicly available code repositories exhibit strong distributional bias toward web applications and high-level software, leaving substantial gaps in hardware and low-level systems coverage. Synthetic data generation addresses this gap by systematically constructing reasoning QA pairs that reflect the full spectrum of hardware-aware industrial contexts, ensuring the model's capabilities generalise beyond the limited distribution of organic training data.

Three-Stage Synthesis Pipeline.
Our synthesis pipeline operates in three stages designed to produce industrially grounded, factually correct reasoning data:

(i) Industrial scenario specification. We identify concrete industrial engineering contexts through consultation with practising hardware and systems engineers and systematic analysis of industrial design verification reports. Target scenarios include: debugging timing violations in digital circuits (setup/hold time failures); optimising Verilog RTL designs for area-power-performance trade-offs; implementing high-performance GPU operators in Triton with memory coalescing and shared memory tiling; verifying functional correctness of C/C++ kernels with formal methods or symbolic execution; resolving synthesis errors in HLS-generated FPGA designs; diagnosing memory safety issues (buffer overflows, use-after-free) in systems code; analysing waveform traces to localise hardware bugs; and implementing backward-compatible hardware-software interface changes. Each scenario is specified with concrete industrial characteristics: typical design scale, hardware description language, timing/resource constraints, verification requirements, and common failure modes.

(ii) Seed code generation. For each scenario, we generate or extract representative code fragments that exhibit target industrial characteristics in hardware and systems domains. These seed codes are constructed to reflect realistic hardware design patterns, timing-critical implementations, memory management idioms, and domain-specific conventions. We prioritise code patterns that appear frequently in industrial practice but are underrepresented in academic benchmarks: finite state machines in RTL designs, pipelined datapath implementations, GPU kernel memory access patterns (coalesced vs. strided), SIMD vectorisation in C kernels, interrupt-driven embedded firmware, and CAD tool scripting for design automation.
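Stages (i) and (ii) can be pictured as populating scenario records like the sketch below; the schema and the example values are hypothetical illustrations, not the internal format used by the pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class IndustrialScenario:
    """Hypothetical record for stage (i): one target synthesis scenario."""
    domain: str                # e.g. "RTL", "GPU", "embedded" (assumed labels)
    language: str              # e.g. "Verilog", "Triton", "C"
    task: str                  # what the QA pair should probe
    constraints: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)

# Example scenario drawn from the list above (values illustrative only).
timing_debug = IndustrialScenario(
    domain="RTL",
    language="Verilog",
    task="debug a setup-time violation in a pipelined datapath",
    constraints=["clock period 2.0 ns", "fixed area budget"],
    failure_modes=["setup violation on critical path",
                   "hold violation introduced by the fix"],
)
```

Stage (ii) would then attach a seed code fragment (e.g., a pipelined Verilog datapath) to each such record before QA synthesis.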
(iii) QA pair synthesis with automated verification. For each seed code, we synthesise question-answer pairs where questions probe industrial reasoning competencies (root-cause analysis, performance diagnosis, correctness verification, refactoring trade-offs, and impact analysis) and answers provide step-by-step reasoning traces following the chain-of-thought paradigm [120]. Crucially, all synthesised reasoning traces undergo automated verification to guarantee factual correctness:

• Code execution validation: computational steps are verified by executing code snippets and checking outputs against expected results;
• Static analysis: semantic properties (type correctness, control-flow reachability, data-flow dependencies) are confirmed via static analysis tools;
• Logical consistency checks: reasoning chains are validated for logical soundness, ensuring each step follows from previous steps without contradictions.

This verification layer distinguishes our approach from naive prompt-based synthesis: it ensures the model trains only on industrially grounded, factually correct reasoning patterns, eliminating a critical failure mode in which the model learns fluent but semantically incorrect reasoning that appears correct superficially yet causes failures in production systems.

Coverage and Scale.
Our synthetic QA corpus systematically covers industrial scenarios across: domains: RTL design (Verilog/SystemVerilog), GPU computing (Triton/CUDA), systems programming (C/C++/Rust kernels), FPGA synthesis (HLS/Vivado), CAD automation (Tcl/Python scripting), and embedded firmware; reasoning types: timing analysis (setup/hold), resource optimisation (area/power/latency), memory safety verification, concurrency correctness, and numerical precision analysis; and operational contexts: RTL verification (waveform debugging, assertion failures), kernel performance tuning, synthesis error resolution, hardware-software co-design, and design-for-test integration. The resulting synthetic data complements curated samples by filling coverage gaps and providing dense training signal for the structured reasoning patterns that hardware-aware industrial engineering demands.

C.2.2. Curated Industrial Code Data

To complement synthetic QA, we curate authentic industrial code data spanning four categories:

Agent Trajectories. Multi-step debugging and repair trajectories following Thought-Action-Observation cycles [128], capturing closed-loop reasoning with tool feedback from hardware simulators, synthesis tools, C/C++ compilers, and formal verification engines. These trajectories teach the model to diagnose errors from industrial signals (synthesis warnings, simulation mismatches, timing violations, memory sanitiser reports) and iterate toward correct solutions.

Code Commits. Version-control commits pairing developer intent (commit message) with before-after code states, curated from large-scale open-source repositories [73]. To enhance coverage of hardware and systems scenarios, we supplement general software commits with hardware-characteristic samples: RTL optimisation for timing closure, memory hierarchy refactoring in kernels, ABI compatibility preservation, and hardware bug fixes with testbench updates.

Industrial Code Artefacts.
Beyond standard source code, we include auxiliary artefacts that reflect the operational context of professional development: test suites, API specifications, configuration files, and log samples [7, 104]. For hardware and systems domains specifically, we incorporate hardware testbenches (SystemVerilog/UVM), timing constraints (SDC), synthesis scripts, GPU profiling traces, and memory sanitiser logs.

File-Level Fill-in-the-Middle (FIM). Structure-aware FIM samples targeting function bodies, class methods, and code blocks [40, 47]. AST-guided masking respects structural boundaries to ensure syntactic validity. For hardware and systems code, we apply domain-specific masking strategies: respecting always blocks and port declarations in Verilog, and targeting semantically coherent units (loop bodies, SIMD intrinsic blocks) in C/C++/Triton kernels.

Data Mixture. Stage 2.1 weights synthetic reasoning QA most heavily (40%), complemented by agent trajectories (20%), commits (15%), industrial artefacts (15%), and FIM (10%). Stage 2.2 shifts weight toward long-context data: agent trajectories increase to 30% and FIM to 25%, while reasoning QA is maintained at 25% and the remaining categories at 10% each. This progressive reweighting ensures the model first builds strong reasoning foundations before scaling to long-context industrial workflows.

D. Post-training Details

D.1. Data Construction

General-purpose SFT datasets [21, 89], such as StackOverflow threads, coding exercises, and synthetic instruction-response pairs, carry little signal for industrial code tasks. A prompt asking the model to "optimise this Verilog module for timing closure" requires not only a correct rewrite but also compilability under specific synthesis constraints, pass/fail against a testbench, and measurable improvement in area or latency.
We therefore construct 2.5M SFT samples directly from real industrial code tasks, using an execution-grounded pipeline that produces three complementary sample types: direct solution, defect repair, and performance optimisation. We train for approximately 4.9k steps with a global batch size of 512.

D.2. Task Extraction and Normalization.

We start from industrial coding tasks spanning hardware design, GPU kernel development, systems programming, and embedded firmware. Each task is decomposed into a structured instruction comprising a natural language requirement description, interface constraints (port lists, function signatures, API contracts), the target platform and toolchain, dependency configurations, and associated verification scripts. This normalization step is essential because raw industrial tasks vary wildly in how they specify intent (some are terse commit messages, while others are multi-page design documents), and the model needs a consistent instruction format to learn from.

Multi-Source Candidate Synthesis. For each structured instruction, we generate a diverse set of candidate solutions through five complementary channels: rewriting from reference implementations with controlled variation, template-based perturbation that systematically alters design parameters and coding patterns, cross-language migration (e.g., translating a C kernel to Triton or porting a Verilog module between coding styles), retrieval-augmented generation [48] that grounds the solution in relevant code fragments from the curated corpus, and direct generation from strong frontier models. The key motivation is diversity: a single generation method tends to produce solutions clustered in a narrow region of the solution space, while the combination covers a much broader range of implementation strategies, coding styles, and optimisation trade-offs.

Execution-Based Verification.
Every candidate solution is validated in a real execution environment rather than judged by heuristic or model-based scoring alone [22, 67]. Depending on the domain, verification includes compilation, simulation, test execution, unit testing, performance profiling, and formal checking. Solutions that pass all checks become high-confidence samples. This is the critical difference from typical SFT pipelines that rely on model self-evaluation or surface-level pattern matching: our verification is grounded in the same toolchains that engineers use in production, so correctness is defined by the domain rather than by a proxy.

D.3. Feedback-Driven Repair

Not all candidates pass verification, and the failures are too valuable to discard. When a candidate fails (e.g., a compilation error, a simulation mismatch, or a performance regression), the pipeline captures the full feedback context: compiler error messages, runtime logs, counterexample inputs, waveform differences, or profiling bottlenecks. This feedback, together with the failed solution, is fed back into the generation stage to produce a repaired version, which is then re-verified. The result is a closed-loop repair trajectory [54, 131]: the original failed attempt, the environment feedback, and the corrected solution. These trajectories are explicitly included in the SFT corpus because they teach the model a skill that direct-solution samples alone cannot: diagnosing failures from real tool output and iterating toward a fix. In practice, this mirrors the workflow of an experienced engineer who reads a synthesis report, identifies the root cause, and revises the design accordingly.

D.4.
Quality Filtering and Final Composition

The verified sample pool undergoes a final round of filtering along three axes: executability (solutions must compile and run cleanly across repeated trials), stability (results must be deterministic and reproducible), and information density (trivially simple samples that provide little learning signal are downweighted). The surviving samples are organized into three categories that together cover the core competencies we want the model to acquire: direct solution samples representing the requirement-to-implementation path, defect repair samples capturing the failure-feedback-fix loop, and performance and structural optimization samples where a functionally correct solution is further improved in efficiency, readability, or architectural quality. This three-way composition ensures the SFT stage does not merely teach the model to produce correct code from scratch, but also to debug, repair, and refine existing implementations, all of which are indispensable capabilities in real industrial workflows.

E. Detailed Synthetic Data Generation Pipeline

This appendix provides detailed information about the synthetic industrial code QA generation pipeline described in subsubsection C.2.1.

E.1. Industrial Scenario Specification

We identify concrete industrial engineering contexts through consultation with practising hardware and systems engineers and systematic analysis of industrial design verification reports.
Target scenarios include:

• Debugging timing violations in digital circuits (setup/hold time failures);
• Optimising Verilog RTL designs for area-power-performance trade-offs;
• Implementing high-performance GPU operators in Triton with memory coalescing and shared memory tiling;
• Verifying functional correctness of C/C++ kernels with formal methods or symbolic execution;
• Resolving synthesis errors in HLS-generated FPGA designs;
• Diagnosing memory safety issues (buffer overflows, use-after-free) in systems code;
• Analysing waveform traces to localise hardware bugs;
• Implementing backward-compatible hardware-software interface changes.

Each scenario is specified with concrete industrial characteristics: typical design scale, hardware description language, timing/resource constraints, verification requirements, and common failure modes.

E.2. Seed Code Generation

For each scenario, we generate or extract representative code fragments that exhibit target industrial characteristics in hardware and systems domains. These seed codes are constructed to reflect realistic hardware design patterns, timing-critical implementations, memory management idioms, and domain-specific conventions. We prioritise code patterns that appear frequently in industrial practice but are underrepresented in academic benchmarks:

• Finite state machines in RTL designs;
• Pipelined datapath implementations;
• GPU kernel memory access patterns (coalesced vs. strided);
• SIMD vectorisation in C kernels;
• Interrupt-driven embedded firmware;
• CAD tool scripting for design automation.

E.3. QA Pair Synthesis with Automated Verification

For each seed code, we synthesise question-answer pairs where questions probe industrial reasoning competencies: root-cause analysis, performance diagnosis, correctness verification, refactoring trade-offs, and impact analysis. Answers provide step-by-step reasoning traces following the chain-of-thought paradigm [120].
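To make the execution-grounded checking of such reasoning traces concrete, the following minimal sketch (our illustration, not the paper's actual pipeline code) runs one computational step claimed by a trace in a fresh interpreter and accepts the step only if the observed output matches the claimed result:

```python
import subprocess
import sys

def execution_validate(snippet: str, expected_stdout: str) -> bool:
    """Execute one computational step from a reasoning trace and compare
    its stdout against the value the trace claims. Illustrative sketch of
    'code execution validation'; the real pipeline also covers hardware
    simulation, static analysis, and logical consistency checks."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True, text=True, timeout=10,
    )
    return result.returncode == 0 and result.stdout.strip() == expected_stdout

# A trace step claiming "2**10 = 1024" is kept; one claiming 1000 is rejected.
ok = execution_validate("print(2**10)", "1024")
bad = execution_validate("print(2**10)", "1000")
```

The same pattern generalises by swapping the interpreter invocation for a compiler, simulator, or formal tool appropriate to the domain.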
Crucially, all synthesised reasoning traces undergo automated verification to guarantee factual correctness:

• Code execution validation: computational steps are verified by executing code snippets and checking outputs against expected results;
• Static analysis: semantic properties (type correctness, control-flow reachability, data-flow dependencies) are confirmed via static analysis tools;
• Logical consistency checks: reasoning chains are validated for logical soundness, ensuring each step follows from previous steps without contradictions.

This verification layer distinguishes our approach from naive prompt-based synthesis, ensuring the model trains only on industrially grounded, factually correct reasoning patterns.

E.4. Coverage and Scale

Our synthetic QA corpus systematically covers industrial scenarios across:

Domains: RTL design (Verilog/SystemVerilog), GPU computing (Triton/CUDA), systems programming (C/C++/Rust kernels), FPGA synthesis (HLS/Vivado), CAD automation (Tcl/Python scripting), and embedded firmware.

Reasoning Types: Timing analysis (setup/hold), resource optimisation (area/power/latency), memory safety verification, concurrency correctness, and numerical precision analysis.

Operational Contexts: RTL verification (waveform debugging, assertion failures), kernel performance tuning, synthesis error resolution, hardware-software co-design, and design-for-test integration.

The resulting synthetic data complements curated samples by filling coverage gaps and providing dense training signal for the structured reasoning patterns that hardware-aware industrial engineering demands.

F. Model Configuration

Table 6 summarizes the architectural details of InCoder-32B.

Table 6. Model configuration of InCoder-32B.
Hyperparameter          Value
---------------------   -----------------
Parameters              ~32B
Layers                  64
Hidden Size             5120
Intermediate Size       27648
Attention Heads         40
KV Heads (GQA)          8
Head Dimension          128
Vocabulary Size         76800
Max Context Length      131072
Activation              SiLU
Positional Encoding     RoPE (θ = 500000)
Precision               BFloat16
Tie Embeddings          No

G. Example Case

G.1. VeriScope Benchmark – Level 3 Example Case

VeriScope Problem 221: UART Transmitter (8N1)
Category: serial_comm
Level: 3 (Module Design)

--- PROBLEM DESCRIPTION ---
Design a UART Transmitter module implementing the standard 8N1 serial protocol (8 data bits, No parity, 1 stop bit). The module should:
- Accept an 8-bit parallel data input and serialize it onto a single TX line
- Transmit in LSB-first order
- Output a start bit (logic 0) before the data bits
- Output a stop bit (logic 1) after the data bits
- Assert a busy flag during transmission
- Idle the TX line high (logic 1) when not transmitting
- Be synthesizable and resource-efficient

--- MODULE INTERFACE ---
module uart_tx(
    input clk,            // System clock
    input rst,            // Synchronous reset
    input tx_start,       // Start transmission
    input [7:0] tx_data,  // Parallel data input
    output reg tx,        // Serial data output
    output reg tx_busy    // Busy flag
);

--- PORT SPECIFICATION (from config.yaml) ---
Signal     Dir     Width  Description
---------  ------  -----  -------------------------
clk        input   1      System clock
rst        input   1      Synchronous reset
tx_start   input   1      Start transmission pulse
tx_data    input   8      Parallel data to transmit
tx         output  1      Serial TX line (idle=1)
tx_busy    output  1      Busy flag

--- REFERENCE SOLUTION ---
module uart_tx(
    input clk, rst, tx_start,
    input [7:0] tx_data,
    output reg tx,
    output reg tx_busy
);
    reg [3:0] bit_cnt;
    reg [7:0] shift_reg;

    always @(posedge clk) begin
        if (rst) begin
            tx <= 1'b1;
            tx_busy <= 1'b0;
            bit_cnt <= 0;
        end else begin
            if (!tx_busy && tx_start) begin
                shift_reg <= tx_data;
                tx_busy <= 1'b1;
                bit_cnt <= 0;
                tx <= 1'b0;  // Start bit
            end else if (tx_busy) begin
                if (bit_cnt < 8) begin
                    tx <= shift_reg[0];
                    shift_reg <= {1'b0, shift_reg[7:1]};
                    bit_cnt <= bit_cnt + 1;
                end else begin
                    tx <= 1'b1;  // Stop bit
                    tx_busy <= 1'b0;
                end
            end
        end
    end
endmodule

--- TESTBENCH ---
module tb_uart_tx;
    reg clk, rst, tx_start;
    reg [7:0] tx_data;
    wire tx, tx_busy;

    uart_tx uut(
        .clk(clk), .rst(rst), .tx_start(tx_start),
        .tx_data(tx_data), .tx(tx), .tx_busy(tx_busy)
    );

    initial clk = 0;
    always #5 clk = ~clk;

    integer pass = 0, fail = 0;
    initial begin
        rst = 1; tx_start = 0; tx_data = 0;
        #20;
        rst = 0;
        tx_data = 8'hA5;
        tx_start = 1;
        #10;
        tx_start = 0;
        #200;
        pass = 1;
        if (fail == 0) $display("TEST PASSED");
        else $display("TEST FAILED");
        $finish;
    end
endmodule

--- EVALUATION CRITERIA ---
Metric                Weight
--------------------  ------
Functional correct.   70%
Synthesis pass        20%
Resource efficiency   10%

Compile timeout: 15s
Simulate timeout: 60s

--- TRANSMISSION PROTOCOL (8N1) ---
Idle --+     +-D0-+-D1-+-D2-+-D3-+-D4-+-D5-+-D6-+-D7-+-Stop-+-- Idle
       |     |                                         |
       +-----+      LSB-first data bits (8 bits)       |
       Start                                          Stop
       bit(0)                                         bit(1)

Figure 8. VeriScope Problem 221: UART Transmitter (8N1) – Complete Example Case.

G.2. Verilog Bug Fix – Example Case

Verilog Bug Fix: Karnaugh Map 5-Variable

--- INPUT PROMPT ---
Given the problem description and the buggy Verilog code below, determine the type of error in the code. Return the result as a list, and there can be 1 to 4 error types.

Problem Description:
Please act as a professional verilog designer.
You are given a Karnaugh map for a 5-variable function F(A,B,C,D,E):

ABCDE\F | 00 | 01 | 11 | 10 |
--------|----|----|----|----|
00000   | 0  | 1  | 1  | 0  |
00001   | 1  | 0  | 0  | 1  |
00011   | 0  | 1  | 1  | 0  |
00010   | 1  | 0  | 0  | 1  |
00110   | 0  | 1  | 1  | 0  |
00111   | 1  | 0  | 0  | 1  |
00101   | 0  | 1  | 1  | 0  |
00100   | 1  | 0  | 0  | 1  |
01100   | 0  | 1  | 1  | 0  |
01101   | 1  | 0  | 0  | 1  |
01111   | 0  | 1  | 1  | 0  |
01110   | 1  | 0  | 0  | 1  |
01010   | 0  | 1  | 1  | 0  |
01011   | 1  | 0  | 0  | 1  |
01001   | 0  | 1  | 1  | 0  |
01000   | 1  | 0  | 0  | 1  |

Create a module named "karnaugh_map_5" to implement this combinational function.
- Inputs: A, B, C, D, E (5-bit binary inputs)
- Output: F (1-bit binary output)

Buggy Verilog Code:
module karnaugh_map_5(
    input wire A, B, C, D, E,
    output reg F
);
always @* begin
    case ({A,B})
        2'b00: begin
            case ({C,D,E})
                3'b000: F = 1'b0;
                3'b001: F = 1'b1;
                3'b010: F = 1'b1;
                3'b011: F = 1'b0;
                3'b100: F = 1'b1;
                3'b101: F = 1'b0;
                3'b110: F = 1'b0;
                3'b111: F = 1'b1;
            endcase
        2'b01: begin
            case ({C,D,E})
                3'b000: F = 1'b1;
                3'b001: F = 1'b0;
                3'b010: F = 1'b0;
                3'b011: F = 1'b1;
                3'b100: F = 1'b0;
                3'b101: F = 1'b1;
                3'b110: F = 1'b1;
                3'b111: F = 1'b0;
            endcase
        end
        2'b10: begin
            case ({C,D,E})
                3'b000: F = 1'b0;
                3'b001: F = 1'b1;
                3'b010: F = 1'b1;
                3'b011: F = 1'b0;
                3'b100: F = 1'b1;
                3'b101: F = 1'b0;
                3'b110: F = 1'b0;
                3'b111: F = 1'b1;
            endcase
        end
        2'b11: begin
            case ({C,D,E})
                3'b000: F = 1'b1;
                3'b001: F = 1'b0;
                3'b010: F = 1'b0;
                3'b011: F = 1'b1;
                3'b100: F = 1'b0;
                3'b101: F = 1'b1;
                3'b110: F = 1'b1;
                3'b111: F = 1'b0;
            endcase
        end
    endcase
end
endmodule

Error type(s) ONLY from predefined categories: missing semicolon, missing end, unclosed brackets, illegal keyword usage, invalid literal format, bit-width mismatch, faulty indexing, type error, undefined module, module connection error, sensitivity list error, blocking/non-blocking misuse, reset logic error, state machine error, race condition, non-synthesizable construct, algorithm error, gate logic error, condition error, latch inference

---
MODEL OUTPUT ---
I'll analyze the buggy Verilog code carefully.

Looking at the code structure:

1. For the 2'b00 case block:
   2'b00: begin
       case ({C,D,E})
       ...
       endcase
   There is no "end" to close the "begin" that was opened after 2'b00:. The "endcase" closes the inner case, but the "begin" block is never closed with "end".

2. Looking at the values in the K-map vs the code:
   For AB=00, the K-map rows show:
   - ABCDE=00000 -> F=0 (CDE=000, F=0)
   - ABCDE=00001 -> F=1 (CDE=001, F=1)
   - ABCDE=00011 -> F=0
   The K-map columns are labeled 00|01|11|10 which represent DE values. The rows represent ABC values. For AB=00 and AB=10, the outputs are the same. For AB=01 and AB=11, outputs are the same. This means F doesn't depend on A, only on B, C, D, E.

The clear structural bug is the missing "end" for the "begin" block in the 2'b00 case.

Result: ["missing end"]

Figure 9. Verilog Bug Fix: Karnaugh Map 5-Variable – Complete Example Case.

G.3. CAD-Coder – Example Case

CAD-Coder: Rectangular Block Generation

--- INPUT PROMPT ---
Please based on the following description, create a CAD-Query Code to generate a model (the final model in the code is represented by r):

Start by creating a new coordinate system with Euler angles set to zero and a translation vector also set to zero. Next, draw a two-dimensional sketch on the first face. This sketch consists of a single loop made up of four lines. The first line starts at the origin (0.0, 0.0) and ends at (0.6, 0.0). The second line starts at (0.6, 0.0) and ends at (0.6, 0.375). The third line starts at (0.6, 0.375) and ends at (0.0, 0.375). Finally, the fourth line completes the loop by starting at (0.0, 0.375) and ending at the origin (0.0, 0.0). After drawing the sketch, apply a scale factor of 0.6 to the entire sketch. Ensure that the sketch remains aligned with the original coordinate system by rotating it using Euler angles set to zero and translating it using a vector set to zero.
To transform the scaled two-dimensional sketch into a three-dimensional model, extrude the sketch 0.075 units along the normal direction. Do not extrude in the opposite direction of the normal. This operation will create a new solid body. The final dimensions of the rectangular block are:
- Length: 0.6 units
- Width: 0.375 units
- Height: 0.075 units

--- MODEL OUTPUT ---
import cadquery as cq

sketch = (
    cq.Sketch()
    .rect(0.6, 0.375, centered=False)
    .scale(0.6)
)

r = (
    cq.Workplane("XY")
    .placeSketch(sketch)
    .extrude(0.075)
)

Figure 10. CAD-Coder: Rectangular Block Generation – Complete Example Case.

G.4. C Compile – Example Case

C Compile: CUDA Matrix Multiplication Optimization

--- INPUT PROMPT ---
You write custom CUDA operators to replace the pytorch operators in the given architecture to get speedups. You have complete freedom to choose the set of operators you want to replace. You may replace multiple operators with custom implementations, consider operator fusion opportunities (combining multiple operators into a single kernel, e.g., combining matmul+relu), or algorithmic changes (such as online softmax).
Here's an example to show you the syntax of inline embedding custom CUDA operators in PyTorch:

--- Example Input Architecture ---
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, a, b):
        return a + b

def get_inputs():
    a = torch.randn(1, 128).cuda()
    b = torch.randn(1, 128).cuda()
    return [a, b]

def get_init_inputs():
    return []

--- Example Optimized with CUDA ---
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.cpp_extension import load_inline

elementwise_add_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void elementwise_add_kernel(
        const float* a, const float* b,
        float* out, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        out[idx] = a[idx] + b[idx];
    }
}

torch::Tensor elementwise_add_cuda(
        torch::Tensor a, torch::Tensor b) {
    auto size = a.numel();
    auto out = torch::zeros_like(a);
    const int block_size = 256;
    const int num_blocks =
        (size + block_size - 1) / block_size;
    elementwise_add_kernel<<<num_blocks, block_size>>>(
        a.data_ptr<float>(), b.data_ptr<float>(),
        out.data_ptr<float>(), size);
    return out;
}
"""

elementwise_add_cpp_source = (
    "torch::Tensor elementwise_add_cuda("
    "torch::Tensor a, torch::Tensor b);"
)

elementwise_add = load_inline(
    name="elementwise_add",
    cpp_sources=elementwise_add_cpp_source,
    cuda_sources=elementwise_add_source,
    functions=["elementwise_add_cuda"],
    verbose=True,
    extra_cflags=[""],
    extra_ldflags=[""],
)

class ModelNew(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.elementwise_add = elementwise_add

    def forward(self, a, b):
        return self.elementwise_add \
            .elementwise_add_cuda(a, b)

--- Target Architecture ---
import torch
import torch.nn as nn

class Model(nn.Module):
    """
    Simple model that performs a single square
    matrix multiplication (C = A * B)
    """
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, A, B):
        return torch.matmul(A, B)

N = 2048 * 2

def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]

def get_init_inputs():
    return []

Note: The kernels should be optimized for FP32 (32-bit floating point) precision.

Optimize the architecture named Model with custom CUDA operators! Name your optimized output architecture ModelNew. Output the new code in codeblocks. Please generate real code, NOT pseudocode, make sure the code compiles and is fully functional.

--- MODEL OUTPUT ---
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

matmul_cuda_source = """
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define TILE_SIZE 32

__global__ void matmul_tiled_kernel(
        const float* A, const float* B,
        float* C, int N) {
    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;

    float sum = 0.0f;
    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; ++t) {
        if (row < N && (t * TILE_SIZE + tx) < N)
            As[ty][tx] = A[row*N + t*TILE_SIZE + tx];
        else
            As[ty][tx] = 0.0f;
        if ((t * TILE_SIZE + ty) < N && col < N)
            Bs[ty][tx] = B[(t*TILE_SIZE + ty) * N + col];
        else
            Bs[ty][tx] = 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row * N + col] = sum;
}

torch::Tensor matmul_cuda(
        torch::Tensor A, torch::Tensor B) {
    int N = A.size(0);
    auto C = torch::zeros({N, N}, A.options());
    dim3 blockDim(TILE_SIZE, TILE_SIZE);
    dim3 gridDim(
        (N + TILE_SIZE - 1) / TILE_SIZE,
        (N + TILE_SIZE - 1) / TILE_SIZE);
    matmul_tiled_kernel<<<gridDim, blockDim>>>(
        A.data_ptr<float>(), B.data_ptr<float>(),
        C.data_ptr<float>(), N);
    return C;
}
"""

matmul_cpp_source = (
    "torch::Tensor matmul_cuda("
    "torch::Tensor A, torch::Tensor B);"
)

matmul_cuda_op = load_inline(
    name="matmul_cuda",
    cpp_sources=matmul_cpp_source,
    cuda_sources=matmul_cuda_source,
    functions=["matmul_cuda"],
    verbose=False,
    extra_cflags=["-O3"],
    extra_ldflags=["-lcublas"],
)

class ModelNew(nn.Module):
""" Optimized model using custom CUDA kernel with tiling optimization for matrix multiplication (C = A * B) """ def __init__(self): super(ModelNew, self).__init__() self.matmul_cuda = matmul_cuda_op def forward(self, A, B): return self.matmul_cuda.matmul_cuda(A, B) Figure 11. C Compile: CUDA Matrix Multiplication Optimization – Complete Example Case. 46 G.5. T ritonBench – Example Case T ritonBench: Fused BMM-RMSNorm-GELU-Dropout-Sub – Input Pr ompt You are an expert in Triton programming, capable of writing corresponding Triton kernels and wrapper functions based on functional descriptions and function parameters. Ensure that the wrapper function fully corresponds to the provided function information. Functional Description: Performs a fused operation combining batch matrix multiplication, RMS normalization, GELU activation, dropout, and subtraction. The function takes three input tensors, performs batch matrix multiplication on the first two, applies RMS normalization, GELU activation, and dropout, and finally subtracts the third tensor from the result. Wrapper Entry Information: fused_bmm_rmsnorm_gelu_dropout_sub( input1, input2, other, normalized_shape, dropout_p=0.5, training=True, approximate=’none’, eps=1e-5, *, out=None ) -> Tensor Args: input1 (Tensor): First input tensor for batch matrix multiplication, of shape (B, N, M), where B is the batch size. input2 (Tensor): Second input tensor for batch matrix multiplication, of shape (B, M, P). other (Tensor): Tensor to subtract from the result after dropout, must be broadcastable to the shape of the output. normalized_shape (int or list or torch.Size): Shape over which RMS normalization is applied, typically the size of the last dimension P. dropout_p (float, optional): Probability of an element to be zeroed in the dropout layer. Default: 0.5. training (bool, optional): Apply dropout if True. Default: True. approximate (str, optional): Can be ’none’ or ’tanh’. The approximation to use for GELU. Default: ’none’. 
    eps (float, optional): A value added to the denominator for numerical stability in RMS normalization. Default: 1e-5.
    out (Tensor, optional): Output tensor. Ignored if None. Default: None.

Shape:
    Input1: (B, N, M)
    Input2: (B, M, P)
    Other:  broadcastable to (B, N, P)
    Output: (B, N, P)

After generation, verify if the Triton wrapper aligns with the provided func_inputs. If not, regenerate.

Figure 12. TritonBench: Fused BMM-RMSNorm-GELU-Dropout-Sub – Input Prompt.

TritonBench: Fused BMM-RMSNorm-GELU-Dropout-Sub – Model Analysis

The user wants a Triton kernel and a corresponding Python wrapper function for a specific fused operation: fused_bmm_rmsnorm_gelu_dropout_sub.

1. Analyze the Functional Description:

Operation Sequence:
1) Batch Matrix Multiplication (BMM): input1 (B,N,M) @ input2 (B,M,P) -> intermediate
2) RMS Normalization: Applied to intermediate over last dimension (P).
3) GELU Activation: Applied to the normalized result.
4) Dropout: Applied to the GELU result.
5) Subtraction: Subtract other from the dropout result.

2. Design the Triton Kernel:

Grid Strategy:
- Output shape is (B, N, P).
- B and N are the "outer" dimensions, P is the "inner" dimension for RMSNorm reduction.
- BMM requires loading tiles of input1 (M) and input2 (M, P_tile).
- Program ID: pid = tl.program_id(axis=0). Map pid to (b, n).
- One program handles one row (b,n) of output.

Key Design Decisions:
- To fuse BMM (reduction over M) with RMSNorm (reduction over P), the kernel must compute the full row statistics before normalizing.
- BLOCK_SIZE_P covers the whole P dimension.
- BLOCK_SIZE_M is a tuning parameter for the M-dimension loop.
RMSNorm: x = x * rsqrt(mean(x^2) + eps)
GELU:
    Exact: 0.5*x*(1 + erf(x/sqrt(2)))
    Tanh:  0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
Dropout:
    mask = (rand > dropout_p)
    x = x * mask / (1 - dropout_p)

Other Broadcasting:
    If other is (P):     stride_b=0, stride_n=0, stride_p=1
    If other is (N,P):   stride_b=0, stride_n=P, stride_p=1
    If other is (B,N,P): full strides

Figure 13. TritonBench: Fused BMM-RMSNorm-GELU-Dropout-Sub – Model Analysis.

TritonBench: Fused BMM-RMSNorm-GELU-Dropout-Sub – Output Code (Part 1: Kernel)

import torch
import triton
import triton.language as tl

@triton.jit
def fused_bmm_rmsnorm_gelu_dropout_sub_kernel(
    input1_ptr, input2_ptr, other_ptr, output_ptr,
    B, N, M, P,
    stride_input1_b, stride_input1_n, stride_input1_m,
    stride_input2_b, stride_input2_m, stride_input2_p,
    stride_other_b, stride_other_n, stride_other_p,
    stride_output_b, stride_output_n, stride_output_p,
    eps, dropout_p, seed, approximate_mode,
    BLOCK_SIZE_M: tl.constexpr,
    BLOCK_SIZE_P: tl.constexpr
):
    # Map program ID to (b, n) pair
    pid = tl.program_id(axis=0)
    n = pid % N
    b = pid // N

    # Adjust pointers for current batch and row
    input1_ptr += b * stride_input1_b \
        + n * stride_input1_n
    input2_ptr += b * stride_input2_b
    output_ptr += b * stride_output_b \
        + n * stride_output_n
    other_ptr += b * stride_other_b \
        + n * stride_other_n

    # Accumulator (float32 for precision)
    acc = tl.zeros([BLOCK_SIZE_P], dtype=tl.float32)

    # Iterate over M dimension in blocks
    for k in range(0, M, BLOCK_SIZE_M):
        offs_m = k + tl.arange(0, BLOCK_SIZE_M)
        mask_m = offs_m < M
        a = tl.load(
            input1_ptr + offs_m * stride_input1_m,
            mask=mask_m, other=0.0)
        offs_p = tl.arange(0, BLOCK_SIZE_P)
        mask_p = offs_p < P
        mask_b = mask_m[:, None] & mask_p[None, :]
        b_vals = tl.load(
            input2_ptr
            + offs_m[:, None] * stride_input2_m
            + offs_p[None, :] * stride_input2_p,
            mask=mask_b, other=0.0)
        acc += tl.dot(a, b_vals)

    # RMS Normalization
    acc_masked = acc * mask_p
    sq_sum = tl.sum(acc_masked * acc_masked, axis=0)
    mean_sq = sq_sum / P
    rms =
tl.sqrt(mean_sq + eps)
    acc = acc / rms

    # GELU Activation
    if approximate_mode == 1:
        cdf = 0.5 * (1.0 + tl.tanh(
            0.7978845608 * (
                acc + 0.044715 * acc * acc * acc)))
        acc = acc * cdf
    else:
        cdf = 0.5 * (1.0 + tl.erf(
            acc / 1.4142135623730951))
        acc = acc * cdf

    # Dropout
    if dropout_p > 0.0:
        random_offsets = (
            pid * BLOCK_SIZE_P
            + tl.arange(0, BLOCK_SIZE_P))
        rand_vals = tl.rand(seed, random_offsets)
        keep_mask = rand_vals > dropout_p
        scale = 1.0 / (1.0 - dropout_p)
        acc = acc * keep_mask * scale

    # Subtraction
    other_vals = tl.load(
        other_ptr + offs_p * stride_other_p,
        mask=mask_p, other=0.0)
    acc = acc - other_vals

    # Store result
    tl.store(
        output_ptr + offs_p * stride_output_p,
        acc, mask=mask_p)

Figure 14. TritonBench: Fused BMM-RMSNorm-GELU-Dropout-Sub – Output Code (Part 1: Kernel).

TritonBench: Fused BMM-RMSNorm-GELU-Dropout-Sub – Output Code (Part 2: Wrapper)

def fused_bmm_rmsnorm_gelu_dropout_sub(
    input1, input2, other, normalized_shape,
    dropout_p=0.5, training=True,
    approximate='none', eps=1e-5,
    *, out=None
):
    if input1.dim() != 3 or input2.dim() != 3:
        raise ValueError(
            "input1 and input2 must be 3D tensors")
    B, N, M = input1.shape
    _, _, P = input2.shape

    # Handle normalized_shape
    if isinstance(normalized_shape, int):
        norm_dim = normalized_shape
    elif isinstance(normalized_shape,
                    (list, tuple, torch.Size)):
        norm_dim = (normalized_shape[-1]
                    if len(normalized_shape) > 0 else P)
    else:
        raise TypeError(
            "normalized_shape must be int/list/Size")
    if norm_dim != P:
        raise ValueError(
            f"normalized_shape ({norm_dim}) must "
            f"match last dim of input2 ({P})")

    if not training:
        dropout_p = 0.0

    # Handle output tensor
    if out is not None:
        if out.shape != (B, N, P):
            raise ValueError(
                f"out must have shape ({B},{N},{P})")
        output = out
    else:
        output = torch.empty(
            (B, N, P), dtype=input1.dtype,
            device=input1.device)

    # Handle 'other' broadcasting strides
    def _get_broadcast_stride(dim, idx):
        if dim < 3:
            if idx < 3 - dim:
                return 0
            else:
                return other.stride(idx - (3 - dim))
        else:
            return
other.stride(idx) if other.dim() == 0: s_ob, s_on, s_op = 0, 0, 0 else: s_ob = _get_broadcast_stride(other.dim(), 0) s_on = _get_broadcast_stride(other.dim(), 1) s_op = _get_broadcast_stride(other.dim(), 2) # Grid: one kernel instance per (b, n) pair grid = lambda META: (B * N, ) BLOCK_SIZE_M = 32 BLOCK_SIZE_P = triton.next_power_of_2(P) approximate_mode = ( 1 if approximate == ’tanh’ else 0) seed = 12345 # Launch kernel fused_bmm_rmsnorm_gelu_dropout_sub_kernel[grid]( 51 input1, input2, other, output, B, N, M, P, input1.stride(0), input1.stride(1), input1.stride(2), input2.stride(0), input2.stride(1), input2.stride(2), s_ob, s_on, s_op, output.stride(0), output.stride(1), output.stride(2), eps, dropout_p, seed, approximate_mode, BLOCK_SIZE_M=BLOCK_SIZE_M, BLOCK_SIZE_P=BLOCK_SIZE_P ) return output Figure 15. T ritonBench: Fused BMM-RMSNorm-GELU-Dropout-Sub – Output Code (Part 2: W rapper). G.6. KernelBench – Example Case KernelBench: CUDA Matrix Multiplication – Input Prompt You write custom CUDA operators to replace the pytorch operators in the given architecture to get speedups. You have complete freedom to choose the set of operators you want to replace. You may replace multiple operators with custom implementations, consider operator fusion opportunities (combining multiple operators into a single kernel, e.g., combining matmul+relu), or algorithmic changes (such as online softmax). 
--- Example Input Architecture ---

import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, a, b):
        return a + b

def get_inputs():
    a = torch.randn(1, 128).cuda()
    b = torch.randn(1, 128).cuda()
    return [a, b]

def get_init_inputs():
    return []

--- Example Optimized with CUDA ---

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.cpp_extension import load_inline

elementwise_add_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

__global__ void elementwise_add_kernel(
    const float* a, const float* b, float* out, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        out[idx] = a[idx] + b[idx];
    }
}

torch::Tensor elementwise_add_cuda(
    torch::Tensor a, torch::Tensor b) {
    auto size = a.numel();
    auto out = torch::zeros_like(a);

    const int block_size = 256;
    const int num_blocks = (size + block_size - 1) / block_size;
    elementwise_add_kernel<<<num_blocks, block_size>>>(
        a.data_ptr<float>(), b.data_ptr<float>(),
        out.data_ptr<float>(), size);

    return out;
}
"""

elementwise_add_cpp_source = (
    "torch::Tensor elementwise_add_cuda("
    "torch::Tensor a, torch::Tensor b);"
)

elementwise_add = load_inline(
    name="elementwise_add",
    cpp_sources=elementwise_add_cpp_source,
    cuda_sources=elementwise_add_source,
    functions=["elementwise_add_cuda"],
    verbose=True,
    extra_cflags=[""],
    extra_ldflags=[""],
)

class ModelNew(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.elementwise_add = elementwise_add

    def forward(self, a, b):
        return self.elementwise_add.elementwise_add_cuda(a, b)

--- Target Architecture ---

import torch
import torch.nn as nn

class Model(nn.Module):
    """
    Simple model that performs a single square
    matrix multiplication (C = A * B)
    """
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, A, B):
        return torch.matmul(A, B)

N = 2048 * 2

def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]

def get_init_inputs():
    return []

Note: The kernels should be optimized for FP32 (32-bit floating point) precision.

Optimize the architecture named Model with custom CUDA operators! Name your optimized output architecture ModelNew. Output the new code in codeblocks. Please generate real code, NOT pseudocode, make sure the code compiles and is fully functional.

Figure 16. KernelBench: CUDA Matrix Multiplication – Input Prompt.

KernelBench: CUDA Matrix Multiplication – Model Output

import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

cuda_source = """
#include <torch/extension.h>
#include <cuda_runtime.h>

#define TILE_SIZE 32

__global__ void matmul_kernel(
    const float* __restrict__ A,
    const float* __restrict__ B,
    float* __restrict__ C, int N) {
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;

    __shared__ float As[TILE_SIZE][TILE_SIZE];
    __shared__ float Bs[TILE_SIZE][TILE_SIZE];

    float sum = 0.0f;
    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; ++t) {
        if (row < N && (t * TILE_SIZE + tx) < N)
            As[ty][tx] = A[row * N + t * TILE_SIZE + tx];
        else
            As[ty][tx] = 0.0f;
        if ((t * TILE_SIZE + ty) < N && col < N)
            Bs[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];
        else
            Bs[ty][tx] = 0.0f;
        __syncthreads();
        #pragma unroll
        for (int k = 0; k < TILE_SIZE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    if (row < N && col < N)
        C[row * N + col] = sum;
}

torch::Tensor matmul_cuda(
    torch::Tensor A, torch::Tensor B) {
    TORCH_CHECK(A.device().is_cuda(), "A must be on CUDA device");
    TORCH_CHECK(B.device().is_cuda(), "B must be on CUDA device");
    TORCH_CHECK(A.dim() == 2 && B.dim() == 2,
                "Inputs must be 2D matrices");
    TORCH_CHECK(A.size(1) == B.size(0),
                "Inner dimensions must match");

    int N = A.size(0);
    int K = A.size(1);
    int M = B.size(1);
    auto C = torch::zeros({N, M}, A.options());

    const int block_size = TILE_SIZE;
    dim3 dimBlock(block_size, block_size);
    dim3 dimGrid(
        (M + block_size - 1) / block_size,
        (N + block_size - 1) / block_size);
    matmul_kernel<<<dimGrid, dimBlock>>>(
        A.data_ptr<float>(), B.data_ptr<float>(),
        C.data_ptr<float>(), K);

    return C;
}
"""

cpp_source = (
    "torch::Tensor matmul_cuda("
    "torch::Tensor A, torch::Tensor B);"
)

matmul_extension = load_inline(
    name="matmul_extension",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["matmul_cuda"],
    verbose=True,
    extra_cflags=[""],
    extra_ldflags=[""],
)

class ModelNew(nn.Module):
    """
    Optimized model using custom CUDA operator for
    matrix multiplication with shared memory tiling
    optimization.
    """
    def __init__(self):
        super(ModelNew, self).__init__()
        self.matmul = matmul_extension.matmul_cuda

    def forward(self, A, B):
        return self.matmul(A, B)

N = 2048 * 2

def get_inputs():
    A = torch.rand(N, N)
    B = torch.rand(N, N)
    return [A, B]

def get_init_inputs():
    return []

Figure 17. KernelBench: CUDA Matrix Multiplication – Model Output.

G.7. EmbeddedC – Example Case

EmbeddedC: STM32F407 PWM with TIM12 – Problem Description

--- PROBLEM ---
STM32F407 PWM Interrupt & GPIO Alternate Function

Configure TIM12 channel 2 to output a 55% duty cycle PWM at 800 Hz on PH7, with the update interrupt enabled and serviced by TIM8_BRK_TIM12_IRQHandler.

Figure 18. EmbeddedC: STM32F407 PWM with TIM12 – Problem Description.
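As a sanity check on the numbers this problem requires, the timer register values follow directly from the STM32F407's 84 MHz APB1 timer clock. The sketch below is illustrative only (the helper name `pwm_registers` is not part of the benchmark), assuming a prescaler of zero as in the reference solution.

```python
# Illustrative sketch, not benchmark code: derive TIM12 ARR/CCR2 for an
# 800 Hz, 55% duty PWM, assuming an 84 MHz timer clock and PSC = 0.
TIMER_CLK_HZ = 84_000_000

def pwm_registers(freq_hz, duty, psc=0):
    ticks = TIMER_CLK_HZ // (psc + 1) // freq_hz  # timer counts per PWM period
    arr = ticks - 1                               # auto-reload: ARR = ticks - 1
    ccr = round(duty * ticks)                     # compare value for duty cycle
    return arr, ccr

arr, ccr = pwm_registers(800, 0.55)  # ARR = 104999, CCR2 = 57750
```

These match the values written to `TIM12->ARR` and `TIM12->CCR2` in the reference code.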
EmbeddedC: STM32F407 PWM with TIM12 – Reference Code (Part 1: Infrastructure)

#include <stdint.h>

// --- INFRASTRUCTURE (Always Available) ---
typedef struct {
    volatile uint32_t CR;
    volatile uint32_t PLLCFGR;
    volatile uint32_t CFGR;
    volatile uint32_t CIR;
    volatile uint32_t AHB1RSTR;
    volatile uint32_t AHB2RSTR;
    volatile uint32_t AHB3RSTR;
    uint32_t _reserved0;
    volatile uint32_t APB1RSTR;
    volatile uint32_t APB2RSTR;
    uint32_t _reserved1[2];
    volatile uint32_t AHB1ENR;
    volatile uint32_t AHB2ENR;
    volatile uint32_t AHB3ENR;
    uint32_t _reserved2;
    volatile uint32_t APB1ENR;
    volatile uint32_t APB2ENR;
} RCC_TypeDef;
#define RCC ((RCC_TypeDef *) 0x40023800)

typedef struct {
    volatile uint32_t MODER;
    volatile uint32_t OTYPER;
    volatile uint32_t OSPEEDR;
    volatile uint32_t PUPDR;
    volatile uint32_t IDR;
    volatile uint32_t ODR;
    volatile uint32_t BSRR;
    volatile uint32_t LCKR;
    volatile uint32_t AFR[2];
} GPIO_TypeDef;
#define GPIOA ((GPIO_TypeDef *) 0x40020000)
#define GPIOB ((GPIO_TypeDef *) 0x40020400)
#define GPIOC ((GPIO_TypeDef *) 0x40020800)
#define GPIOD ((GPIO_TypeDef *) 0x40020C00)
#define GPIOE ((GPIO_TypeDef *) 0x40021000)
#define GPIOH ((GPIO_TypeDef *) 0x40021C00)

typedef struct {
    volatile uint32_t ISER[8];
    uint32_t RESERVED0[24];
    volatile uint32_t ICER[8];
    uint32_t RESERVED1[24];
    volatile uint32_t ISPR[8];
    uint32_t RESERVED2[24];
    volatile uint32_t ICPR[8];
    uint32_t RESERVED3[24];
    volatile uint32_t IABR[8];
    uint32_t RESERVED4[56];
    volatile uint32_t IPR[60];
} NVIC_TypeDef;
#define NVIC ((NVIC_TypeDef *) 0xE000E100)

typedef struct {
    volatile uint32_t CR1;
    volatile uint32_t CR2;
    volatile uint32_t SMCR;
    volatile uint32_t DIER;
    volatile uint32_t SR;
    volatile uint32_t EGR;
    volatile uint32_t CCMR1;
    volatile uint32_t CCMR2;
    volatile uint32_t CCER;
    volatile uint32_t CNT;
    volatile uint32_t PSC;
    volatile uint32_t ARR;
    uint32_t _reserved1;
    volatile uint32_t CCR1;
    volatile uint32_t CCR2;
    volatile uint32_t CCR3;
    volatile uint32_t CCR4;
} TIM_TypeDef;
#define TIM12 ((TIM_TypeDef *) 0x40001800)

void TIM8_BRK_TIM12_IRQHandler(void);
volatile uint32_t tim12_update_flag = 0;

Figure 19. EmbeddedC: STM32F407 PWM with TIM12 – Reference Code (Part 1: Infrastructure).

EmbeddedC: STM32F407 PWM with TIM12 – Reference Code (Part 2: Main & IRQ)

int main(void) {
    // 1. Enable peripheral clocks
    RCC->AHB1ENR |= (1UL << 7);   // GPIOH
    RCC->APB1ENR |= (1UL << 6);   // TIM12

    // 2. Configure PH7 as AF (TIM12_CH2)
    // PH7 mode: AF (bits 15:14 = 0b10)
    GPIOH->MODER &= ~(3UL << 14);
    GPIOH->MODER |= (2UL << 14);
    // PH7 output type: Push-pull (bit 7 = 0)
    GPIOH->OTYPER &= ~(1UL << 7);
    // PH7 speed: High (bits 15:14 = 0b11)
    GPIOH->OSPEEDR |= (3UL << 14);
    // PH7 pull: None (bits 15:14 = 0b00)
    GPIOH->PUPDR &= ~(3UL << 14);
    // PH7 AF9 (TIM12): AFR[0] bits 31:28 = 0b1001
    GPIOH->AFR[0] &= ~(0xFUL << 28);
    GPIOH->AFR[0] |= (9UL << 28);

    // 3. Configure TIM12: 800 Hz PWM, 55% duty
    // APB1 timer clock = 84 MHz
    TIM12->PSC = 0;
    // ARR = (84e6 / 800) - 1 = 104999
    TIM12->ARR = 104999UL;
    // CCMR1: OC2M = PWM mode 1, OC2PE = 1
    TIM12->CCMR1 &= ~(0x7FUL << 8);
    TIM12->CCMR1 |= (0x6UL << 12) | (1UL << 11);
    // Enable CC2 output (CC2E = 1)
    TIM12->CCER |= (1UL << 4);
    // CCR2 = 0.55 * (ARR+1) = 57750
    TIM12->CCR2 = 57750UL;

    // 4. Enable update interrupt
    TIM12->DIER |= (1UL << 0);    // UIE = 1

    // 5. Configure NVIC for TIM12
    // TIM8_BRK_TIM12 IRQ = 43
    // ISER[1] bit 11 (43 - 32 = 11)
    NVIC->ISER[1] |= (1UL << 11);
    NVIC->IPR[43] = 0;

    // 6. Start TIM12
    TIM12->CR1 |= (1UL << 0);     // CEN = 1

    while (1) {
        for (volatile uint32_t i = 0; i < 10000; ++i) {
            __asm__("nop");
        }
        if (tim12_update_flag) {
            tim12_update_flag = 0;
        }
    }
}

// --- INTERRUPT HANDLER ---
void TIM8_BRK_TIM12_IRQHandler(void) {
    if (TIM12->SR & (1UL << 0)) {     // UIF
        TIM12->SR &= ~(1UL << 0);     // Clear UIF
        tim12_update_flag = 1;
    }
}

Figure 20. EmbeddedC: STM32F407 PWM with TIM12 – Reference Code (Part 2: Main & IRQ Handler).

G.8.
RealBench – Example Case

RealBench: AES Cipher Top – Design Specification

System Prompt: You are an expert Verilog/RTL hardware design engineer. Given a design specification, generate correct, synthesizable Verilog code that meets all the requirements described in the specification.

--- DESIGN SPECIFICATION ---
aes_cipher_top Design Specification

1. Introduction
The aes_cipher_top module is the core control module of the entire AES encryption system, responsible for coordinating and controlling the entire encryption process. This module implements all round transformations of the AES encryption algorithm, including SubBytes, ShiftRows, MixColumns, and AddRoundKey.

2. Interface
Signal    Dir    Width  Description
--------- ------ -----  ----------------------
clk       input  1      Clock signal
rst       input  1      Reset signal
ld        input  1      Load enable
done      output 1      Encryption complete
key       input  128    Input key
text_in   input  128    Input plaintext
text_out  output 128    Output ciphertext

3. Registers
Register      Width  Reset  Description
-----------   -----  -----  ----------------------
text_in_r     128    0      Input text buffer
sa[0:3][0:3]  8      0      4x4 state matrix
dcnt          4      0      Round counter (init=11)
ld_r          1      0      Load operation flag
text_out      128    0      Output ciphertext

4. Operation Principle
AES encryption performs 10 rounds of transformation on 128-bit input data. Each round includes:
1) SubBytes - Non-linear byte substitution
2) ShiftRows - Row shifting operation
3) MixColumns - Column mixing (except final round)
4) AddRoundKey - Round key addition

State Matrix Structure:
sa00 sa01 sa02 sa03
sa10 sa11 sa12 sa13
sa20 sa21 sa22 sa23
sa30 sa31 sa32 sa33
(4x4 byte matrix, column-major order)

ShiftRows:
Row 0: [a b c d] -> [a b c d]  // No shift
Row 1: [a b c d] -> [b c d a]  // Shift 1
Row 2: [a b c d] -> [c d a b]  // Shift 2
Row 3: [a b c d] -> [d a b c]  // Shift 3

MixColumns (in GF(2^8)):
[02 03 01 01]   [s0]   [out0]
[01 02 03 01] x [s1] = [out1]
[01 01 02 03]   [s2]   [out2]
[03 01 01 02]   [s3]   [out3]

5. State Machine
IDLE:        Reset state, wait for ld=1.
INIT_ROUND:  Load input, initial AddRoundKey, set dcnt=0xB (11).
ROUND_OP:    Standard rounds 1-9.
FINAL_ROUND: Round 10 (no MixColumns).
DONE:        Set done signal, hold result.

6. Submodules
aes_key_expand_128:
  Ports: clk, kld, key[127:0], wo_0[31:0], wo_1[31:0], wo_2[31:0], wo_3[31:0]
aes_sbox:
  Ports: a[7:0] (input), b[7:0] (output)

7. Constraints
- 16 S-box modules required
- 4x4 state matrix register array
- One round per clock cycle
- Register updates on rising edge

Figure 21. RealBench: AES Cipher Top – Design Specification.

RealBench: AES Cipher Top – Model Output (Part 1: Declarations & Control)

module aes_cipher_top(clk, rst, ld, done, key, text_in, text_out);
input clk, rst;
input ld;
output done;
input [127:0] key;
input [127:0] text_in;
output [127:0] text_out;

wire [31:0] w0, w1, w2, w3;
reg [127:0] text_in_r;
reg [127:0] text_out;
reg [7:0] sa00, sa01, sa02, sa03;
reg [7:0] sa10, sa11, sa12, sa13;
reg [7:0] sa20, sa21, sa22, sa23;
reg [7:0] sa30, sa31, sa32, sa33;
wire [7:0] sa00_next, sa01_next, sa02_next, sa03_next;
wire [7:0] sa10_next, sa11_next, sa12_next, sa13_next;
wire [7:0] sa20_next, sa21_next, sa22_next, sa23_next;
wire [7:0] sa30_next, sa31_next, sa32_next, sa33_next;
wire [7:0] sa00_sub, sa01_sub, sa02_sub, sa03_sub;
wire [7:0] sa10_sub, sa11_sub, sa12_sub, sa13_sub;
wire [7:0] sa20_sub, sa21_sub, sa22_sub, sa23_sub;
wire [7:0] sa30_sub, sa31_sub, sa32_sub, sa33_sub;
wire [7:0] sa00_sr, sa01_sr, sa02_sr, sa03_sr;
wire [7:0] sa10_sr, sa11_sr, sa12_sr, sa13_sr;
wire [7:0] sa20_sr, sa21_sr, sa22_sr, sa23_sr;
wire [7:0] sa30_sr, sa31_sr, sa32_sr, sa33_sr;
wire [7:0] sa00_mc, sa01_mc, sa02_mc, sa03_mc;
wire [7:0] sa10_mc, sa11_mc, sa12_mc, sa13_mc;
wire [7:0] sa20_mc, sa21_mc, sa22_mc, sa23_mc;
wire [7:0] sa30_mc, sa31_mc, sa32_mc, sa33_mc;
reg done, ld_r;
reg [3:0] dcnt;

// --- Round Counter ---
always @(posedge clk)
    if(!rst) dcnt <= 4'h0;
    else if(ld) dcnt <= 4'hb;
    else if(|dcnt) dcnt <= dcnt - 4'h1;

// --- Done & Load Control ---
always @(posedge clk)
    done <= !(|dcnt[3:1]) & dcnt[0] & !ld;
always @(posedge clk)
    if(ld) text_in_r <= text_in;
always @(posedge clk)
    ld_r <= ld;

Figure 22. RealBench: AES Cipher Top – Model Output (Part 1: Declarations & Control).

RealBench: AES Cipher Top – Model Output (Part 2: State Matrix & ShiftRows)

// --- State Matrix Update (Initial Round + Next) ---
always @(posedge clk)
    sa33 <= ld_r ? text_in_r[007:000]^w3[07:00] : sa33_next;
always @(posedge clk)
    sa23 <= ld_r ? text_in_r[015:008]^w3[15:08] : sa23_next;
always @(posedge clk)
    sa13 <= ld_r ? text_in_r[023:016]^w3[23:16] : sa13_next;
always @(posedge clk)
    sa03 <= ld_r ? text_in_r[031:024]^w3[31:24] : sa03_next;
always @(posedge clk)
    sa32 <= ld_r ? text_in_r[039:032]^w2[07:00] : sa32_next;
always @(posedge clk)
    sa22 <= ld_r ? text_in_r[047:040]^w2[15:08] : sa22_next;
always @(posedge clk)
    sa12 <= ld_r ? text_in_r[055:048]^w2[23:16] : sa12_next;
always @(posedge clk)
    sa02 <= ld_r ? text_in_r[063:056]^w2[31:24] : sa02_next;
always @(posedge clk)
    sa31 <= ld_r ? text_in_r[071:064]^w1[07:00] : sa31_next;
always @(posedge clk)
    sa21 <= ld_r ? text_in_r[079:072]^w1[15:08] : sa21_next;
always @(posedge clk)
    sa11 <= ld_r ? text_in_r[087:080]^w1[23:16] : sa11_next;
always @(posedge clk)
    sa01 <= ld_r ? text_in_r[095:088]^w1[31:24] : sa01_next;
always @(posedge clk)
    sa30 <= ld_r ? text_in_r[103:096]^w0[07:00] : sa30_next;
always @(posedge clk)
    sa20 <= ld_r ? text_in_r[111:104]^w0[15:08] : sa20_next;
always @(posedge clk)
    sa10 <= ld_r ? text_in_r[119:112]^w0[23:16] : sa10_next;
always @(posedge clk)
    sa00 <= ld_r ? text_in_r[127:120]^w0[31:24] : sa00_next;

// --- ShiftRows ---
// Row 0: no shift
assign sa00_sr = sa00_sub;
assign sa01_sr = sa01_sub;
assign sa02_sr = sa02_sub;
assign sa03_sr = sa03_sub;
// Row 1: shift left 1
assign sa10_sr = sa11_sub;
assign sa11_sr = sa12_sub;
assign sa12_sr = sa13_sub;
assign sa13_sr = sa10_sub;
// Row 2: shift left 2
assign sa20_sr = sa22_sub;
assign sa21_sr = sa23_sub;
assign sa22_sr = sa20_sub;
assign sa23_sr = sa21_sub;
// Row 3: shift left 3
assign sa30_sr = sa33_sub;
assign sa31_sr = sa30_sub;
assign sa32_sr = sa31_sub;
assign sa33_sr = sa32_sub;

Figure 23. RealBench: AES Cipher Top – Model Output (Part 2: State Matrix & ShiftRows).

RealBench: AES Cipher Top – Model Output (Part 3: MixColumns & AddRoundKey)

// --- MixColumns ---
wire [31:0] mix_col_out0, mix_col_out1, mix_col_out2, mix_col_out3;
assign mix_col_out0 = mix_col(sa00_sr, sa10_sr, sa20_sr, sa30_sr);
assign mix_col_out1 = mix_col(sa01_sr, sa11_sr, sa21_sr, sa31_sr);
assign mix_col_out2 = mix_col(sa02_sr, sa12_sr, sa22_sr, sa32_sr);
assign mix_col_out3 = mix_col(sa03_sr, sa13_sr, sa23_sr, sa33_sr);
assign {sa00_mc, sa10_mc, sa20_mc, sa30_mc} = mix_col_out0;
assign {sa01_mc, sa11_mc, sa21_mc, sa31_mc} = mix_col_out1;
assign {sa02_mc, sa12_mc, sa22_mc, sa32_mc} = mix_col_out2;
assign {sa03_mc, sa13_mc, sa23_mc, sa33_mc} = mix_col_out3;

// --- AddRoundKey (next state) ---
assign sa00_next = sa00_mc ^ w0[31:24];
assign sa01_next = sa01_mc ^ w1[31:24];
assign sa02_next = sa02_mc ^ w2[31:24];
assign sa03_next = sa03_mc ^ w3[31:24];
assign sa10_next = sa10_mc ^ w0[23:16];
assign sa11_next = sa11_mc ^ w1[23:16];
assign sa12_next = sa12_mc ^ w2[23:16];
assign sa13_next = sa13_mc ^ w3[23:16];
assign sa20_next = sa20_mc ^ w0[15:08];
assign sa21_next = sa21_mc ^ w1[15:08];
assign sa22_next = sa22_mc ^ w2[15:08];
assign sa23_next = sa23_mc ^ w3[15:08];
assign sa30_next = sa30_mc ^ w0[07:00];
assign sa31_next = sa31_mc ^ w1[07:00];
assign sa32_next = sa32_mc ^ w2[07:00];
assign sa33_next = sa33_mc ^ w3[07:00];

Figure 24. RealBench: AES Cipher Top – Model Output (Part 3: MixColumns & AddRoundKey).

RealBench: AES Cipher Top – Model Output (Part 4: Output & Submodules)

// --- Output Generation (Final Round) ---
always @(posedge clk) text_out[127:120] <= sa00_sr ^ w0[31:24];
always @(posedge clk) text_out[095:088] <= sa01_sr ^ w1[31:24];
always @(posedge clk) text_out[063:056] <= sa02_sr ^ w2[31:24];
always @(posedge clk) text_out[031:024] <= sa03_sr ^ w3[31:24];
always @(posedge clk) text_out[119:112] <= sa10_sr ^ w0[23:16];
always @(posedge clk) text_out[087:080] <= sa11_sr ^ w1[23:16];
always @(posedge clk) text_out[055:048] <= sa12_sr ^ w2[23:16];
always @(posedge clk) text_out[023:016] <= sa13_sr ^ w3[23:16];
always @(posedge clk) text_out[111:104] <= sa20_sr ^ w0[15:08];
always @(posedge clk) text_out[079:072] <= sa21_sr ^ w1[15:08];
always @(posedge clk) text_out[047:040] <= sa22_sr ^ w2[15:08];
always @(posedge clk) text_out[015:008] <= sa23_sr ^ w3[15:08];
always @(posedge clk) text_out[103:096] <= sa30_sr ^ w0[07:00];
always @(posedge clk) text_out[071:064] <= sa31_sr ^ w1[07:00];
always @(posedge clk) text_out[039:032] <= sa32_sr ^ w2[07:00];
always @(posedge clk) text_out[007:000] <= sa33_sr ^ w3[07:00];

// --- GF(2^8) Functions ---
function [7:0] xtime;
    input [7:0] b;
    xtime = {b[6:0], 1'b0} ^ (8'h1b & {8{b[7]}});
endfunction

function [31:0] mix_col;
    input [7:0] s0, s1, s2, s3;
    reg [7:0] s0_o, s1_o, s2_o, s3_o;
    begin
        mix_col[31:24] = xtime(s0)^xtime(s1)^s1^s2^s3;
        mix_col[23:16] = s0^xtime(s1)^xtime(s2)^s2^s3;
        mix_col[15:08] = s0^s1^xtime(s2)^xtime(s3)^s3;
        mix_col[07:00] = xtime(s0)^s0^s1^s2^xtime(s3);
    end
endfunction

// --- Key Expansion Submodule ---
aes_key_expand_128 u0(
    .clk(clk), .kld(ld), .key(key),
    .wo_0(w0), .wo_1(w1), .wo_2(w2), .wo_3(w3));

// --- 16 S-Box Instances ---
aes_sbox us00(.a(sa00), .b(sa00_sub));
aes_sbox us01(.a(sa01), .b(sa01_sub));
aes_sbox us02(.a(sa02), .b(sa02_sub));
aes_sbox us03(.a(sa03), .b(sa03_sub));
aes_sbox us10(.a(sa10), .b(sa10_sub));
aes_sbox us11(.a(sa11), .b(sa11_sub));
aes_sbox us12(.a(sa12), .b(sa12_sub));
aes_sbox us13(.a(sa13), .b(sa13_sub));
aes_sbox us20(.a(sa20), .b(sa20_sub));
aes_sbox us21(.a(sa21), .b(sa21_sub));
aes_sbox us22(.a(sa22), .b(sa22_sub));
aes_sbox us23(.a(sa23), .b(sa23_sub));
aes_sbox us30(.a(sa30), .b(sa30_sub));
aes_sbox us31(.a(sa31), .b(sa31_sub));
aes_sbox us32(.a(sa32), .b(sa32_sub));
aes_sbox us33(.a(sa33), .b(sa33_sub));

endmodule

Figure 25. RealBench: AES Cipher Top – Model Output (Part 4: Output Generation & Submodules).

G.9. ArchXBench – Example Case

ArchXBench: 8-to-3 Binary Encoder – Design Specification

System Prompt: You are an expert Verilog/RTL hardware design engineer. Given a design specification, generate correct, synthesizable Verilog code that meets all the requirements described in the specification.

--- DESIGN SPECIFICATION ---
Title: 8-to-3 Binary Encoder

Objective: Design an 8-to-3 binary encoder that outputs the binary representation of the index of the highest active input, assuming one-hot input.

Background: Binary encoders are combinational circuits that reduce multiple input lines to a binary representation. A binary encoder assumes that only one input is active (one-hot input), suitable for controlled environments where inputs are guaranteed to be mutually exclusive.

Design Constraints:
- Assumes only one input is high at a time.
- Purely combinational design.
- When no bits are set or more than one bit is set, drive outputs to 0.
- Any X or Z on any input bit must propagate through so that corresponding output bits become X.

Design Name: binary_encoder_8to3

Inputs:
  in[7:0]   // One-hot 8-bit input
Outputs:
  out[2:0]  // 3-bit binary output

Module Signature:
module binary_encoder_8to3 (
    input [7:0] in,
    output [2:0] out
);

Design Notes:
- Assumes exactly one input is high.
- Outputs the corresponding binary index.
- Use purely combinational logic.

Figure 26. ArchXBench: 8-to-3 Binary Encoder – Design Specification.
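The specification above can be restated as a small executable reference model, which makes the expected truth table explicit. This is an illustrative sketch (the helper name `binary_encoder_8to3_ref` is not from the benchmark), and it covers only two-valued behavior; the spec's X/Z propagation has no analogue with Python integers.

```python
# Hypothetical two-valued reference model for the 8-to-3 encoder spec above.
def binary_encoder_8to3_ref(in_bits):
    set_bits = [i for i in range(8) if (in_bits >> i) & 1]
    if len(set_bits) != 1:
        return 0              # zero or multiple inputs high -> outputs driven to 0
    return set_bits[0]        # binary index of the single active input
```

A testbench comparing a candidate design against this model over all 256 input patterns covers every valid and invalid (non-one-hot) case.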
ArchXBench: 8-to-3 Binary Encoder – Model Output

module binary_encoder_8to3 (
    input [7:0] in,
    output [2:0] out
);
    // Check for any X or Z in input
    wire has_xz;
    assign has_xz = (^in === 1'bx);

    // Count number of 1's in input
    wire [3:0] count_ones;
    assign count_ones = {1'b0, in[0]} + {1'b0, in[1]} +
                        {1'b0, in[2]} + {1'b0, in[3]} +
                        {1'b0, in[4]} + {1'b0, in[5]} +
                        {1'b0, in[6]} + {1'b0, in[7]};

    // Check if exactly one bit is high
    wire valid_one_hot;
    assign valid_one_hot = (count_ones == 4'd1);

    // Encoder for position (one-hot input)
    wire [2:0] encoded_out;
    assign encoded_out[2] = in[7] | in[6] | in[5] | in[4];
    assign encoded_out[1] = in[7] | in[6] | in[3] | in[2];
    assign encoded_out[0] = in[7] | in[5] | in[3] | in[1];

    // Final output with X propagation and validity check
    assign out = has_xz ? 3'bxxx :
                 (valid_one_hot ? encoded_out : 3'b000);
endmodule

--- ALTERNATIVE IMPLEMENTATION (casez) ---

module binary_encoder_8to3 (
    input [7:0] in,
    output [2:0] out
);
    wire has_xz;
    assign has_xz = (^in === 1'bx);

    reg [3:0] count;
    integer i;
    always @(*) begin
        count = 4'd0;
        for (i = 0; i < 8; i = i + 1) begin
            if (in[i] === 1'b1)
                count = count + 1;
        end
    end

    wire valid_one_hot;
    assign valid_one_hot = (count == 4'd1);

    reg [2:0] encoded_out;
    always @(*) begin
        encoded_out = 3'b000;
        casez (in)
            8'b00000001: encoded_out = 3'b000;
            8'b00000010: encoded_out = 3'b001;
            8'b00000100: encoded_out = 3'b010;
            8'b00001000: encoded_out = 3'b011;
            8'b00010000: encoded_out = 3'b100;
            8'b00100000: encoded_out = 3'b101;
            8'b01000000: encoded_out = 3'b110;
            8'b10000000: encoded_out = 3'b111;
            default:     encoded_out = 3'b000;
        endcase
    end

    assign out = has_xz ? 3'bxxx :
                 (valid_one_hot ? encoded_out : 3'b000);
endmodule

--- KEY DESIGN DECISIONS ---
1. X/Z propagation: Using === 1'bx checks for X or Z values. If any input bit is X/Z, the reduction XOR returns X.
2. Valid one-hot detection: Count the number of 1's in the input. If exactly one bit is high, it's valid.
3. Output selection:
   - X/Z detected -> output 3'bxxx
   - Valid one-hot -> output encoded position
   - Otherwise -> output 3'b000

Figure 27. ArchXBench: 8-to-3 Binary Encoder – Model Output.
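Looking back at the RealBench AES example, the GF(2^8) helpers there also admit a quick host-side check. The sketch below is a pure-Python mirror (not part of the benchmark) of the Verilog xtime and mix_col functions, verified against the MixColumns example column from FIPS-197.

```python
# Pure-Python mirror of the Verilog GF(2^8) helpers in the RealBench AES output:
# xtime multiplies by 2 modulo the AES polynomial 0x11B; mix_col applies the
# [02 03 01 01] circulant matrix from the design specification.
def xtime(b):
    return ((b << 1) & 0xFF) ^ (0x1B if b & 0x80 else 0x00)

def mix_col(s0, s1, s2, s3):
    return (
        xtime(s0) ^ xtime(s1) ^ s1 ^ s2 ^ s3,  # 02*s0 ^ 03*s1 ^ 01*s2 ^ 01*s3
        s0 ^ xtime(s1) ^ xtime(s2) ^ s2 ^ s3,  # 01*s0 ^ 02*s1 ^ 03*s2 ^ 01*s3
        s0 ^ s1 ^ xtime(s2) ^ xtime(s3) ^ s3,  # 01*s0 ^ 01*s1 ^ 02*s2 ^ 03*s3
        xtime(s0) ^ s0 ^ s1 ^ s2 ^ xtime(s3),  # 03*s0 ^ 01*s1 ^ 01*s2 ^ 02*s3
    )

# FIPS-197 MixColumns example column: db 13 53 45 -> 8e 4d a1 bc
assert mix_col(0xDB, 0x13, 0x53, 0x45) == (0x8E, 0x4D, 0xA1, 0xBC)
```

The same column-wise check, applied to all four columns of a known AES round, is a cheap way to validate the RTL's mix_col before full encryption testing.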
