RTLSeek: Boosting the LLM-Based RTL Generation with Multi-Stage Diversity-Oriented Reinforcement Learning



Xinyu Zhang∗ (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China)
Zhiteng Chao∗ (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China)
Yonghao Wang (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China)
Bin Sun (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China)
Tianyun Ma (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China)
Tianmeng Yang (Peking University, China)
Jianan Mu† (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China)
Jing Justin Ye (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; CASTEST Co., Ltd., China)
Huawei Li‡ (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; CASTEST Co., Ltd., China)

Abstract

Register Transfer Level (RTL) design translates high-level specifications into hardware using HDLs such as Verilog. Although LLM-based RTL generation is promising, the scarcity of functionally verifiable high-quality data limits both accuracy and diversity. Existing post-training typically produces a single HDL implementation per specification, lacking awareness of the RTL variations needed for different design goals. We propose RTLSeek, a post-training paradigm that applies rule-based Diversity-Oriented Reinforcement Learning to improve RTL correctness and diversity.
Our Diversity-Centric Multi-Objective Reward Scheduling integrates expert knowledge with EDA feedback, and a three-stage framework maximizes the utility of limited data. Experiments on the RTLLM benchmark show that RTLSeek surpasses prior methods, with ablation results confirming that encouraging broader design-space exploration improves RTL quality and achieves the principle of "the more generated, the better results." The implementation framework, including the dataset, source code, and model weights, is available at https://anonymous.4open.science/r/DAC2026ID71-ACB4/.

∗ Both authors contributed equally to this research.
† Corresponding author: mujianan@ict.ac.cn.
‡ Corresponding author: lihuawei@ict.ac.cn.

Conference acronym ’XX, June 03–05, 2018, Woodstock, NY

1 Introduction

Register Transfer Level (RTL) design translates high-level functional descriptions into circuit block architectures and remains the most complex stage of digital hardware customization. Unlike downstream logic and physical design, RTL coding is still poorly automated and heavily dependent on engineer expertise. Recent advances in Large Language Models (LLMs) [1, 2, 15] have enabled automatic code generation across many programming languages, motivating research on LLM-driven RTL generation. Early work shows that Supervised Fine-Tuning (SFT) can align human language with RTL design languages and functional specifications [3, 4, 10, 11, 14, 19]. However, progress remains limited: even on simple benchmarks, RTL generation accuracy stays below 60% [13]. A primary bottleneck is the scarcity of high-quality training data: only about one thousand examples include testbenches, which are essential for verifying correctness. From an academic standpoint, a substantial expansion of open-source, functionally verifiable chip design data appears unlikely in the near term. This raises a fundamental question for the community: How can we
Figure 1: Motivation of RTLSeek: diversity-oriented post-training alleviates data scarcity limitations in current SFT-based RTL generation.

leverage circuit design expertise to improve LLM training without relying on more testbench-equipped datasets?

Human learners often approach RTL design through "learning by seeking": exploring multiple solutions for a single problem, validating them, and extracting deeper insights from limited examples. In contrast, as illustrated in Figure 1, the widely used one-to-one post-training paradigm (e.g., SFT) encourages rote memorization rather than genuine reasoning about RTL design principles. Inspired by human learning, we propose encouraging LLMs to actively explore diverse RTL implementations, generating multiple structural variants for the same description and evaluating them, thereby maximizing the utility of scarce testbench-equipped data.
Achieving this requires overcoming two challenges: (1) Training efficiency: enabling diverse exploration with a limited dataset; (2) Diversity–correctness balance: ensuring rigorous functional correctness while broadening design-space exploration. Recent progress, such as DeepSeek R1 [6], demonstrates that Group Relative Policy Optimization (GRPO)-based Reinforcement Learning (RL) can unlock emergent reasoning abilities in LLMs, reshaping post-training methodologies.

Motivated by human learning and these methodological advances, we introduce RTLSeek, a diversity-oriented RL framework for LLM-based RTL code generation. To maximize the value of scarce training data, RTLSeek integrates SFT with a two-stage GRPO process. We design a specialized reward mechanism, generated automatically via circuit analysis tools, to balance correctness learning and diversity exploration. To further promote structurally diverse design attempts, we compute diversity rewards through Abstract Syntax Tree (AST)-based structural equivalence analysis.

Our contributions are summarized as follows:
• We present RTLSeek, a diversity-oriented reinforcement learning framework for LLM-based RTL generation. By integrating GRPO into training, RTLSeek systematically explores structural diversity and substantially improves RTL code generation quality.
• To balance diversity and correctness, RTLSeek employs a Multi-Objective Reward Scheduling strategy that combines expert IC design constraints, EDA tool feedback, and dynamic assessments of intermediate-generation quality.
• To address the scarcity of verifiable datasets, RTLSeek introduces a three-stage training pipeline: (1) SFT warm-up, (2)
Figure 2: Observation from GPT-4's RTL design: achieving Test Time Scale (TTS) [23] by designing a multi-sample user prompt enhances the correctness of generated RTL syntax and functionality.

diversity-oriented exploration, and (3) multi-objective exploration, achieving an effective trade-off between exploration depth and functional correctness.
• Experimental results show that our post-training paradigm improves Qwen 2.5's RTL generation performance by over 40%, surpassing other methods. Ablation studies confirm that removing diversity rewards or any training stage degrades performance, highlighting the effectiveness of our diversity-oriented RL approach in data-scarce RTL tasks.

2 Observation

Test Time Scale (TTS) technology is commonly used to enhance the generation ability of LLMs through strategic design of user prompts, as shown in Figure 2. For more challenging generation tasks, by adjusting input prompts during the test phase, the same LLM is able to generate multiple structurally diverse RTL code snippets, among which a functionally correct module is more likely to emerge. If the LLM is asked to produce only a single output, it becomes much harder to generate the correct result in one attempt.

This observation also highlights a key distinction: unlike software code, RTL exhibits inherent concurrency, where even minor structural changes can significantly impact functional correctness.
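The intuition behind this observation can be quantified with a simple independence assumption (ours, for illustration, not a claim from the paper): if a single sample is functionally correct with probability p, then at least one of n independent samples is correct with probability 1 - (1 - p)^n, which grows quickly with n.

```python
def p_at_least_one(p_single: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n

# With a 30% per-sample success rate, five samples already push the
# chance of seeing at least one correct module above 80%.
single = p_at_least_one(0.3, 1)
five = p_at_least_one(0.3, 5)
```

Under this (idealized) model, five-sample generation turns a 0.3 per-attempt success rate into roughly 0.83, which mirrors the benefit of multi-output prompting observed in Figure 2.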
This raises the question of whether enhancing the model's reasoning and RTL understanding during training, so as to generate structurally diverse RTL modules, could improve the success rate of LLMs in RTL generation tasks, ultimately advancing toward the goal of "the more generated, the better the results."

3 Design of RTLSeek

In this section, we present the design of RTLSeek, as shown in Figure 3. First, we formalize the problem formulation for LLM-based RTL generation and outline the overall design of RTLSeek. Next, we detail our hybrid training paradigm, which integrates SFT with GRPO-based reinforcement learning, followed by the multi-objective reward mechanism that dynamically balances correctness and diversity. Finally, we present our AST-based structural analysis approach for diversity quantification.
Figure 3: The three-stage hybrid training paradigm in RTLSeek.

3.1 Problem Definition

Our problem can be defined as follows. Given a natural language specification S describing desired hardware behavior (see the user prompt in Figure 4), the RTL generation task aims to produce a set of design implementations {D_1, D_2, ..., D_n} satisfying:
(1) Functional Equivalence: ∀ D_i, D_j ∈ {D}: D_i ≡ D_j, where ≡ denotes behavioral equivalence verified through the testbench.
(2) Structural Diversity: Struct(D_i) ≠ Struct(D_j) for i ≠ j, where Struct(·) represents the Abstract Syntax Tree (AST) representation.
(3) Non-trivial Variation: Diversity(Struct(D_i), Struct(D_j)) > δ, where δ is the minimum variation threshold excluding superficial changes (e.g., variable renaming).

3.2 Overall Design

This paper addresses a critical problem in RTL generation using LLMs: how to effectively leverage existing circuit design expertise to enhance model training, given the severe scarcity of high-quality, verifiable training data. The current landscape presents a fundamental limitation: only several thousand samples with accompanying testbenches (essential for validating model-generated RTL code) are publicly available. This data scarcity creates a pressing need for alternative training paradigms that can compensate for the lack of large-scale, high-quality datasets.

Existing solutions fall short in two key aspects. First, verification-based RTL generation methods [8, 16] achieve only superficial diversity improvements (e.g., simple variable renaming) at high computational cost. Second, while SFT ensures basic correctness, it fails to produce sufficiently diverse designs; conversely, pure diversity-driven approaches often generate non-functional code.
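The three conditions in Section 3.1 can be sketched as a candidate-pruning loop. The helper callables `passes_testbench`, `struct_repr`, and `diversity` below are hypothetical stand-ins for testbench simulation, AST extraction, and the variation metric; this is an illustration of the selection criteria, not the paper's implementation.

```python
def select_diverse_designs(candidates, passes_testbench, struct_repr, diversity, delta):
    """Keep only designs that (1) behave per the specification and
    (2)-(3) differ from every already-kept design by more than delta."""
    kept = []
    for d in candidates:
        if not passes_testbench(d):  # Functional Equivalence
            continue
        rep = struct_repr(d)         # AST representation
        # Non-trivial Variation against every design kept so far
        if all(diversity(rep, struct_repr(k)) > delta for k in kept):
            kept.append(d)
    return kept
```

With integers standing in for designs, `select_diverse_designs([1, 2, 3, 11], lambda d: d != 3, lambda d: d, lambda a, b: abs(a - b), 5)` keeps `[1, 11]`: 2 is too close to 1, and 3 fails the testbench stub.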
This reveals a fundamental tension in RTL generation: how to simultaneously ensure functional correctness while exploring meaningful design variations under tight data constraints.

To address these challenges, we propose RTLSeek, a GRPO-based training paradigm that combines three key innovations:
• A hybrid training paradigm integrating SFT with two-stage GRPO optimization (as shown in Figure 3);
• An automated reward system using circuit analysis tools to balance correctness and diversity (as shown in Figure 4);
• AST-based structural equivalence analysis to quantify and encourage design diversity (as shown in Figure 5).
Unlike previous methods that treat correctness and diversity as competing objectives, RTLSeek's integrated approach enables efficient exploration of the design space while maintaining functional validity, even with limited training data.

3.3 The Hybrid Training Paradigm

In this section, we introduce the hybrid training paradigm integrating SFT and the two-stage GRPO-based RL. We first present the preliminaries of GRPO-based RL, then give the details of the training paradigm.

3.3.1 GRPO-based Reinforcement Learning. GRPO extends policy gradient methods by introducing group-wise relative comparisons, addressing critical limitations of conventional RL approaches for generative tasks. Originally demonstrated in DeepSeek R1 [6], GRPO's characteristics make it particularly suitable for RTL generation:
• Stabilized Group Updates: By constraining policy updates within output groups through KL-divergence regularization, GRPO mitigates the mode collapse problem prevalent in PPO while maintaining exploration efficiency. The group-wise mechanism provides more reliable gradient estimates than single-sample methods.
• Relative Quality Evaluation: GRPO's optimization objective considers relative rankings within output groups, contrasting with PPO's absolute advantage estimation and DPO's pairwise comparisons. This enables joint evaluation of functionally equivalent but structurally diverse RTL designs.

Formally, given a question q, GRPO samples a group of outputs {o_1, o_2, ..., o_G} from the old policy π_θold and optimizes the new policy π_θ by maximizing the objective function:

\[
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\; \mathcal{C} \right) - \beta\, \mathbb{D}_{KL}\!\left(\pi_\theta \,\|\, \pi_{ref}\right) \right) \right],
\quad
\mathcal{C} = \mathrm{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1-\varepsilon,\, 1+\varepsilon \right) A_i,
\tag{1}
\]

where π_θ is the current policy, π_θold is the old policy, A_i is the advantage of the i-th output, and ε is the hyperparameter that controls the update amplitude of the policy; β is the KL-divergence penalty coefficient, and D_KL(π_θ ‖ π_ref) is the KL divergence between the policy π_θ and the reference policy π_ref. Figure 4 illustrates the group comparison mechanism with G = 2, showing how relative evaluations enable diverse solution generation.

3.3.2 Training Paradigm. As shown in Figure 3, to maximize the utility of limited training data, we develop a hybrid training paradigm combining SFT with a two-stage GRPO optimization process. The complete training scheme enables simultaneous optimization for both correctness and diversity, overcoming the limitations of conventional approaches that typically prioritize one aspect at the
expense of the other. The pipelines of the three stages are detailed as follows (shown in Figure 3).

Figure 4: Diversity-Centric Multi-Objective Reward reinforcement learning in RTLSeek.

Stage 1. SFT Warm-up: Learn what is HDL. We first train on curated RTL datasets containing both verified and unverified examples from reputable open-source projects. These natural-language-to-Verilog pairs establish fundamental syntax and functional understanding, stabilizing later RL training. The SFT phase requires only a single Verilog module output per query and focuses on learning basic code patterns and common design constructs, while filtering out low-quality examples through careful dataset curation. This initialization is crucial: it provides a strong starting point for subsequent reinforcement learning, preventing the model from exploring invalid design spaces during RL training.

Stage 2. GRPO: Learn to generate diverse HDLs. Using large-scale unverified datasets, this phase employs diversity rewards to (1) maximize data utility and generalization while (2) developing varied yet syntactically valid RTL generation capability. We specifically design the diversity reward to encourage structural variations in control logic, datapath organization, and module hierarchy. The group-based sampling in GRPO allows the model to compare multiple design alternatives simultaneously, learning to generate different implementations for the same specification. This phase significantly expands the model's design repertoire beyond what could be learned through SFT alone.

Stage 3. GRPO: Learn to generate right and diverse HDLs.
The nal stage applies combined diversity and functional r ewards on veried datasets. This renes the model’s functional-co de corre- spondence understanding, requiring careful parameter calibration from previous phases. W e implement a dynamic weighting scheme that automatically adjusts the balance between diversity and cor- rectness rewards based on validation performance. The testbench verication provides precise feedback for functional correctness, while the diversity component maintains the variation learned in previous stages. This two-tiered rewar d structure ensures the model generates both correct and innovative designs. 3.4 Multi-Objective Reward Scheduling The key challenge in improving the exploration quality of LLMs in RL lies in designing an appropriate r eward mechanism that both renes decisions and provides systematic guidance throughout the learning process. Traditional post-training paradigms based on SFT typically uti- lize syntax and functional correctness feedback from ED A tools [ 3 ] to rene reference responses. Howev er , these approaches often un- derutilize scarce training datasets and currently lack eective veri- cation methods to ensure functional correctness for most testbench- free datasets. As illustrated in Fig 4, we propose Multi-Obje ctive Reward Sched- uling. This approach not only incorporates syntax and functional correctness feedback but also integrates a diversity reward obtained through Abstract Syntax T ree ( AST) analysis to guide LLMs in gen- erating structurally diverse RTLs. Additionally , we introduce a context rewar d to regulate reasoning quality , thereby achieving an eective exploration-exploitation balance. The whole reward function 𝑅 𝑡 𝑜 𝑡 𝑎𝑙 is as follows: 𝑅 𝑡 𝑜 𝑡 𝑎𝑙 = 𝑅 𝑠 𝑦𝑛 + 𝑅 𝑓 𝑢 𝑛𝑐 + 𝑅 𝑑 𝑖 𝑣 + 𝑅 𝑐𝑜 𝑛𝑡 , (2) Here, 𝑅 epresents the rewar d functions we have designed sepa- rately . 
By decomposing and combining different rewards across the RL training stages, we are able to maximize the exploration capabilities of the LLM within the constraints of limited training data. These components are explained in detail below.

3.4.1 Syntax Correctness Reward. R_syn is obtained by verifying whether the generated Verilog code adheres to the syntactic rules of the Verilog language, using the syntax analysis tool Pyverilog [17]: if any RTL in the generated set passes the syntax check, then R_syn = 1; otherwise, R_syn = 0.

3.4.2 Function Correctness Reward. R_func aims to guide the LLM toward generating functionally correct RTL code. Only a small subset of the datasets has a testbench verification set, which is used with the commercial simulation tool VCS to simulate the generated RTL. If any RTL in the generated set passes the testbench simulation, R_func = 1; otherwise, R_func = 0.

3.4.3 Diversity Reward. Since different RTL implementations can exist for the same functional specification under varying design objectives in IC design, we expect the LLM to explore as many diverse design solutions as possible, rather than repeatedly generating the same design. To achieve this goal, we propose R_div as follows:

\[ R_{div} = N_c + N_s, \tag{3} \]

where N_c and N_s represent the numbers of generated heterogeneous RTL codes that pass the syntax check and functional verification, respectively.

3.4.4 Context Reward. R_cont consists of two components: a format correctness reward and a reasoning length reward. Format correctness requires the model to enclose its reasoning process for the given problem within <think></think> tags, and to place the final multiple RTL solutions within <total_design></total_design> tags.
If the output matches the format, a reward of 0.5 is given; otherwise, a penalty of -0.5 is applied. The reasoning length reward derives from the significant correlation between the length of the reasoning response and the quality of the generated output [22]. Specifically, if the reasoning response is too short, the model may fail to fully explain the reasoning steps, resulting in reduced accuracy and incompleteness in the generated response. Conversely, when the reasoning response is too long, the model may produce redundant or irrelevant content or become entangled in unnecessarily lengthy reasoning. To address this, the context reward dynamically assigns rewards based on both the format of the answer and the length of the reasoning response. We define a satisfaction indicator I_s to reflect the quality of the current answer: if the sum of R_syn, R_func, and R_div is greater than a threshold Δ = 4, then I_s = 1; otherwise, I_s = 0. Overall, the reward function takes the form:

\[
R_{cont} =
\begin{cases}
0.5\,L_t + 0.5\,I_f & \text{if } I_s = 1 \\
-0.5\,L_t + 0.5\,I_f & \text{if } I_s = 0
\end{cases}
\tag{4}
\]

where I_f indicates whether the output meets the required format (1 if so, -1 otherwise), and L_t is the ratio of the average length of the chains of thought of the previous four responses to the length of the current chain of thought.

3.5 AST-Based Diversity Analysis

To determine whether Verilog code segments differ substantively or merely in superficial aspects such as variable naming or statement reordering, we implement equivalence verification based on ASTs: hierarchical representations that capture the syntactic structure of code while abstracting away lexical details such as whitespace and comments.
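The comparison underlying this analysis can be sketched as follows. We use a simplified `Node` class in place of Pyverilog's AST node objects (an assumption for illustration), keeping only a node type, name-independent attributes, and ordered children.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """Simplified stand-in for a Pyverilog AST node."""
    ntype: str           # e.g. "ModuleDef", "Assign", "Plus"
    attrs: tuple = ()    # non-identifier attributes (widths, constants, ...)
    children: tuple = ()

def structurally_equal(a: Node, b: Node) -> bool:
    """Depth-first comparison of node type, attributes, and children.
    Identifier names are deliberately excluded from `attrs`, so designs
    that differ only in variable naming compare as structurally equal."""
    if a.ntype != b.ntype or a.attrs != b.attrs:
        return False
    if len(a.children) != len(b.children):
        return False
    return all(structurally_equal(x, y)
               for x, y in zip(a.children, b.children))
```

Under this scheme, two renamed copies of the same ripple-carry adder compare equal, while a structurally different implementation diverges at the first mismatching node.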
For the Verilog hardware description language, these tree structures effectively preserve module definitions, port declarations, signal assignments, and control structures in a form conducive to programmatic analysis [18]. We parse source code into ASTs using Pyverilog, then apply a recursive comparison algorithm that performs a depth-first traversal of both trees, as shown in Figure 5. The algorithm methodically compares node types, attributes, and structural relationships to determine complete equivalence.

Figure 5: Verilog structural equivalence testing algorithm based on abstract syntax tree.

By operating at the structural level rather than the textual level, our approach overcomes the limitations of text-based comparison by accurately identifying structural equivalence despite surface differences. When code segments differ only in variable names (e.g., "p" versus "q"), our system recognizes their equivalence, while correctly distinguishing functionally different implementations even when they share certain syntactic patterns. This recursive verification system supports diversity exploration in RTL code generation for RTLSeek by filtering out superficially different but structurally identical variants.

4 Experiments

4.1 Experimental Settings

4.1.1 Dataset. Due to limited high-quality Verilog data, we built our dataset by combining open-source sources from GitHub and Hugging Face. After preliminary cleaning with Design Compiler, 5167 synthesis-passed code-description pairs were selected for Stage 1 (SFT).
For Stage 2, 3570 natural language descriptions were extracted from another subset. Stage 3 used 829 functionally verified samples from Verilog-eval [11] and similar sources.

4.1.2 Training and Testing Settings. We performed fine-tuning using GRPOTrainer, a reinforcement learning framework based on the trl library. To select a suitable base model, we ran pilot tests comparing different large language models on Verilog code generation. We ultimately chose Qwen 2.5-7B-Instruct-1M (abbreviated as Qwen 2.5) for its superior code-generation accuracy and output volume. All training was carried out on 8 NVIDIA A100 GPUs, each with 40 GB of memory. We adopt the LoRA method for fine-tuning [7], with the following hyperparameters: the rank of the LoRA layers is set to 4, the alpha parameter to 8, and the dropout rate to 0.1. During training, we use the Adam optimizer [9] with a learning rate of 5e-5. We used all designs from the publicly available RTLLM benchmark [14] as our test set. The temperature for LLM inference is set to 0.2.

Figure 6: The trend graph of the total reward obtained at every training step.

4.1.3 Evaluation Metrics. We denote syntax correctness as Syn., functional correctness as Fun., and generation performance as Gen., and evaluate model performance on the following aspects:
• OPMO_Pass@k: The traditional pass@k, referred to in this paper as One-Prompt-One-Output pass@k (OPOO_Pass@k), evaluates whether at least one of the Top-k outputs is correct when each prompt generates a single Verilog module. To better assess the effectiveness of our approach, we introduce One-Prompt-Multi-Output pass@k (OPMO_Pass@k), which evaluates whether any correct result exists among the multiple Verilog modules generated within the Top-k outputs for a single prompt. Specifically, we only consider the cases of k = 1 and k = 5.
• Success Rate: We use the proportion of correct RTL module code across all modules generated from five queries as the Success Rate. Table 1 and Table 2 show only the functional Success Rate; the detailed syntactic Success Rate can be found in the Technical Appendix.
• Quantitative Metrics: Gen.Num: the number of Verilog codes generated per prompt. Syn.Num: the number of syntactically valid codes per prompt. Fun.Num: the number of functionally correct codes per prompt.

4.2 Evaluation Results

4.2.1 Training Dynamics of Reinforcement Learning. Figure 6 shows the raw data and the 30-step moving average of the reward in Stage 3, where the overall trend exhibits a clear upward movement. This suggests that as training progresses, the model's rewards consistently increase, demonstrating the effectiveness of the training method.

4.2.2 Overall Evaluation on RTLLM. We evaluate RTLSeek on RTLLM v1.1 in Table 1 against both commercial foundation models (Qwen2.5-Instruct [21], GPT-4o [1], and DeepSeek-R1 [6]) and open-source baselines, including the supervised fine-tuned (SFT) DeepRTL [13], Thakur [20], and RTLCoder [12], as well as RL post-training academic works such as ChipSeek-R1 [5] and CodeV-R1 [24]. For foundation models we apply two prompt-engineering strategies: OPOO and OPMO. For the academic fine-tuned models (Thakur and RTLCoder) we report results with OPOO prompting and the corresponding OPOO_pass@k metric; OPMO prompting fails on these models because domain-specific fine-tuning degrades instruction-following ability, a manifestation of catastrophic forgetting that prevents generation of multiple correctly formatted RTL modules. All other entries in the table are taken directly from the respective papers; however, there is every reason to expect that the same catastrophic-forgetting effect would occur in those models as well. Restricting the comparison to OPOO prompting is therefore fully justified.
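The OPMO_Pass@k metric defined in Section 4.1.3 can be sketched for a single prompt as follows; `is_correct` is a hypothetical predicate standing in for the syntax check plus testbench simulation.

```python
def opmo_pass_at_k(outputs, k, is_correct):
    """OPMO_Pass@k for one prompt: each of the Top-k outputs is a list of
    Verilog modules; the prompt passes if any module in any of those k
    outputs is functionally correct."""
    return any(is_correct(m) for modules in outputs[:k] for m in modules)
```

For example, if only the second output of a prompt contains a correct module, OPMO_Pass@1 fails while OPMO_Pass@5 succeeds, which is exactly the gap that multi-output generation widens.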
As shown in Table 1, RTLSeek raises the average functional success rate by 29% over Qwen2.5, despite using only 1% of R1's parameters. Its enhanced diversity yields more correct outputs per prompt, substantially improving OPMO_pass@5 and the overall functional success rate.

4.3 Ablation Study

4.3.1 Study Settings and Model Descriptions. To verify our hypotheses about which methods can effectively enhance the Verilog-generation capabilities of LLMs, we conducted an ablation study. The ablated models we implemented include: (1) RTLSeek.w/o DR: RTLSeek without the Diversity Reward in Stage 2 and Stage 3 RL training. (2) RTLSeek.w/o S3: RTLSeek without Stage 3 RL training, leaving only Stages 1 and 2. (3) RTLSeek.w/o S2: RTLSeek without Stage 2 RL training; after Stage 1, the process moves directly to Stage 3. (4) RTLSeek.only S1: RTLSeek after only Stage 1 OPOO SFT training.

4.3.2 Ablation Study for the Diversity Reward. As indicated in Table 2, RTLSeek outperforms RTLSeek.w/o DR in terms of Success Rate. This comparison suggests that diversity-oriented RL training is important for improving LLM-based RTL generation. In addition, RTLSeek also shows a significant increase in average Fun.OPMO_pass@5 compared to RTLSeek.w/o DR. This improvement mainly stems from the diversity-oriented approach, which encourages the model to generate more varied responses per prompt, expanding the solution space and enhancing overall performance. We also conducted a more detailed study in Table 3. There, we divide the items of the RTLLM v1.1 benchmark into two categories based on whether RTLSeek.w/o DR can pass them: (1) w/o DR Pass: items that RTLSeek.w/o DR solves correctly at least once within five attempts. (2) w/o DR NO Pass: items that RTLSeek.w/o DR cannot solve (no correct solutions even after five attempts). In Table 3, we report the average Success Rate of RTLSeek and RTLSeek.w/o DR on these two sets of items.
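The partitioning procedure behind Table 3 can be sketched mechanically: split benchmark items by whether the baseline (RTLSeek.w/o DR) solved them at least once in five attempts, then average each model's per-item Success Rate over the two subsets. A minimal sketch; the item names and data below are illustrative, not the paper's actual results.

```python
# Hypothetical sketch of the Table 3 partition.

def partition_items(baseline_correct_counts):
    """Split benchmark items by whether the baseline produced at least
    one correct solution within five attempts.
    baseline_correct_counts maps item name -> number of correct attempts."""
    passed = {item for item, n in baseline_correct_counts.items() if n > 0}
    not_passed = set(baseline_correct_counts) - passed
    return passed, not_passed  # "w/o DR Pass", "w/o DR NO Pass"

def avg_success_rate(per_item_success_rate, items):
    """Average a model's per-item Success Rate over a subset of items."""
    return sum(per_item_success_rate[i] for i in items) / len(items)

# Illustrative usage with made-up items:
counts = {"adder": 3, "fsm": 0, "fifo": 1, "div": 0}
passed, not_passed = partition_items(counts)
rates = {"adder": 0.6, "fifo": 0.4, "fsm": 0.2, "div": 0.0}
```

With this split, a nonzero average on the "NO Pass" subset is exactly the signal reported in Table 3: the diversity reward unlocks items the baseline never solves.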
(1) For items in the w/o DR NO Pass set, RTLSeek.w/o DR's average Success Rate is 0%, while RTLSeek achieves 13.1%. (2) For items in the w/o DR Pass set, RTLSeek.w/o DR has a 47% average Success Rate, while RTLSeek achieves 56%, surpassing the baseline by 19% (relative). These results demonstrate that introducing a diversity reward not only boosts correctness on items that RTLSeek.w/o DR can already solve but also enables solving some items that RTLSeek.w/o DR cannot. Hence, diversity-oriented reinforcement-learning training has a notable and measurable impact on enhancing LLM-based RTL code generation.

4.3.3 Ablation Study for Multi-Stage RL Training. As shown in Table 2, RTLSeek improves the average Success Rate by 166.7% over RTLSeek.w/o S3, confirming the strong benefit of Stage 3, which uses testbench-equipped data for diversity-driven exploration and correctness learning. RTLSeek also exceeds RTLSeek.w/o S2 by 42.9%, with notably more generated codes, showing that Stage 2 enhances diversity and enables more effective Stage 3 exploration. Overall, the Diversity-Centric Multi-Objective Reward Scheduling and the three-stage framework allow RTLSeek to surpass GPT-4o in both RTL syntax and functional accuracy, with ablations validating the effectiveness of the diversity-oriented RL paradigm.

Table 1: Comparison of generation results across different metrics.

Type         Model              Params  Syn.OPOO_pass@1  Syn.OPOO_pass@5  Fun.OPOO_pass@1  Fun.OPOO_pass@5
Foundation   Qwen2.5-Instruct*  7B      0.48             0.71             0.27             0.48
             GPT-4o*            /       0.80             0.89             0.42             0.66
             DeepSeek-R1*       671B    0.77             0.86             0.55             0.73
SFT models   DeepRTL-2          220M    0.71             0.81             0.32             0.42
             DeepRTL-1          16B     0.74             0.77             0.38             0.35
             RTLCoder*          7B      0.73             0.89             0.32             0.49
             Thakur*            16B     0.83             0.86             0.17             0.24
RL models    ChipSeek-R1        7B      /                0.96             /                0.83
             CodeV-R1           7B      /                /                0.73             0.86

Type         Model              Params  Syn.OPMO_pass@1  Syn.OPMO_pass@5  Fun.OPMO_pass@1  Fun.OPMO_pass@5
Foundation   Qwen2.5-Instruct*  7B      0.50             0.74             0.32             0.50
             GPT-4o*            /       0.86             0.93             0.50             0.71
             DeepSeek-R1*       671B    0.90             0.96             0.65             0.83
Ours         RTLSeek*           7B      0.86             0.96             0.76             0.86

Note: Models marked with an asterisk (*) were evaluated by ourselves; all others are cited from the respective papers. Bold denotes the best result excluding commercial LLMs; underline denotes the overall best result.

Table 2: Ablation study on the diversity reward and multi-stage training.

Ablation         Gen.Num  Syn.Num  Fun.Num  Syn.OPMO_pass@1  Syn.OPMO_pass@5  Fun.OPMO_pass@1  Fun.OPMO_pass@5  Success Rate
RTLSeek          3.22     2.23     1.27     0.86             0.96             0.76             0.86             0.40
RTLSeek.w/o DR   1.70     0.85     0.54     0.64             0.73             0.55             0.59             0.35
RTLSeek.w/o S3   4.14     1.31     0.61     0.41             0.73             0.32             0.55             0.15
RTLSeek.w/o S2   2.42     1.33     0.67     0.55             0.77             0.45             0.59             0.28
RTLSeek.only S1  1.15     0.32     0.17     0.32             0.59             0.23             0.31             0.15

Table 3: Fine-grained ablation study of the diversity reward: Success Rate on the two subsets obtained by partitioning RTLLM v1.1.

Model            w/o DR Pass  w/o DR NO Pass
RTLSeek          0.56         0.131
RTLSeek.w/o DR   0.47         0

5 Conclusion

This paper presents RTLSeek, a novel post-training paradigm using Diversity-Oriented Reinforcement Learning to improve the accuracy and diversity of LLM-generated RTL. With a Diversity-Centric Multi-Objective Reward and a three-stage training framework, RTLSeek effectively addresses data scarcity.
Experiments show it outperforms GPT-4o and DeepSeek-R1 on the RTLLM benchmark, with ablation studies confirming its effectiveness.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. arXiv preprint (2023).
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[3] Kaiyan Chang, Kun Wang, Nan Yang, Ying Wang, Dantong Jin, Wenlong Zhu, Zhirong Chen, Cangyuan Li, Hao Yan, Yunhao Zhou, Zhuoliang Zhao, Yuan Cheng, Yudong Pan, Yiqi Liu, Mengdi Wang, Shengwen Liang, Yinhe Han, Huawei Li, and Xiaowei Li. 2024. Data is All You Need: Finetuning LLMs for Chip Design via an Automated Design-Data Augmentation Framework. In DAC '24. Association for Computing Machinery, Article 60, 6 pages. doi:10.1145/3649329.3657356
[4] Lei Chen, Yiqi Chen, Zhufei Chu, Wenji Fang, Tsung-Yi Ho, Ru Huang, Yu Huang, Sadaf Khan, Min Li, Xingquan Li, et al. 2024. The Dawn of AI-Native EDA: Opportunities and Challenges of Large Circuit Models. arXiv preprint (2024).
[5] Zhirong Chen, Kaiyan Chang, Zhuolin Li, Xinyang He, Chujie Chen, Cangyuan Li, Mengdi Wang, Haobo Xu, Yinhe Han, and Ying Wang. 2025. ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning. arXiv:2507.04736 [cs.AI]. https://arxiv.org/abs/2507.04736
[6] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948 (2025).
[7] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs]. doi:10.48550/arXiv.2106.09685
[8] Hanxian Huang, Zhenghan Lin, Zixuan Wang, Xin Chen, Ke Ding, and Jishen Zhao. 2024. Towards LLM-Powered Verilog RTL Assistant: Self-Verification and Self-Correction. arXiv preprint arXiv:2406.00115 (2024).
[9] Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG]
[10] Mingjie Liu, Teodor-Dumitru Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, Rongjian Liang, Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, et al. 2023. ChipNeMo: Domain-Adapted LLMs for Chip Design. arXiv preprint arXiv:2311.00176 (2023).
[11] Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. 2023. VerilogEval: Evaluating Large Language Models for Verilog Code Generation. arXiv:2309.07544 [cs]. doi:10.48550/arXiv.2309.07544
[12] Shang Liu, Wenji Fang, Yao Lu, Jing Wang, Qijun Zhang, Hongce Zhang, and Zhiyao Xie. 2024. RTLCoder: Fully Open-Source and Efficient LLM-Assisted RTL Code Generation Technique. arXiv:2312.08617 [cs.PL]
[13] Yi Liu, Changran Xu, Yunhao Zhou, Zeju Li, and Qiang Xu. 2025. DeepRTL: Bridging Verilog Understanding and Generation with a Unified Representation Model. arXiv:2502.15832 [cs.AR]
[14] Yao Lu, Shang Liu, Qijun Zhang, and Zhiyao Xie. 2023. RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model. arXiv:2308.05345 [cs]. doi:10.48550/arXiv.2308.05345
[15] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[16] Humza Sami, Pierre-Emmanuel Gaillardon, Valerio Tenace, et al. 2024. AIvril: AI-Driven RTL Generation with Verification In-the-Loop. arXiv preprint (2024).
[17] Shinya Takamaeda-Yamazaki. 2015. Pyverilog: A Python-Based Hardware Design Processing Toolkit for Verilog HDL. In Applied Reconfigurable Computing (Lecture Notes in Computer Science, Vol. 9040). Springer International Publishing, 451–460. doi:10.1007/978-3-319-16214-0_42
[18] Shinya Takamaeda-Yamazaki. 2015. Pyverilog: A Python-Based Hardware Design Processing Toolkit for Verilog HDL. In Applied Reconfigurable Computing: 11th International Symposium, ARC 2015, Bochum, Germany, April 13–17, 2015, Proceedings. Springer, 451–460.
[19] Shailja Thakur, Baleegh Ahmad, Zhenxing Fan, Hammond Pearce, Benjamin Tan, Ramesh Karri, Brendan Dolan-Gavitt, and Siddharth Garg. 2023. Benchmarking Large Language Models for Automated Verilog RTL Code Generation. In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). 1–6. doi:10.23919/DATE56975.2023.10137086
[20] Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth Garg. 2024. VeriGen: A Large Language Model for Verilog Code Generation. ACM Transactions on Design Automation of Electronic Systems 29, 3 (May 2024), 1–31. doi:10.1145/3643681
[21] An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. 2025. Qwen2.5-1M Technical Report. arXiv preprint arXiv:2501.15383 (2025).
[22] Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. 2025. Demystifying Long Chain-of-Thought Reasoning in LLMs. arXiv:2502.03373 [cs]. doi:10.48550/arXiv.2502.03373
[23] Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma. 2025. What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models. CoRR (2025).
[24] Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, and Yunji Chen. 2025. CodeV-R1: Reasoning-Enhanced Verilog Generation. arXiv:2505.24183 [cs.LG]
