Execution-Grounded Credit Assignment for GRPO in Code Generation


Authors: Abhijit Kumar, Natalya Kumar, Shikhar Gupta

Abhijit Kumar† (abhijitkumar4293@gmail.com), Natalya Kumar (natalya2kumar@gmail.com), Shikhar Gupta (shik1470@gmail.com)

Abstract

Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: a single outcome signal is spread uniformly across long programs even when failure stems from a localized semantic error. We propose Execution-Grounded Credit Assignment (EGCA), which localizes GRPO updates using execution traces. For programs that satisfy algorithmic constraints but fail tests, EGCA executes the candidate and a canonical reference solution (curated once offline; used for analysis, not supervision) under identical instrumentation, identifies the earliest semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, yielding 82.1% pass@1 on HumanEval (+3.1 over GRPO) and 68.9% on MBPP (+1.5) with 18% wall-clock overhead.

1 Introduction

Reinforcement learning with verifiable rewards (RLVR), where generated programs are evaluated by unit tests, has become a standard post-training approach for improving code generation models. Critic-free objectives such as GRPO (Shao et al., 2024) are appealing: they avoid a value function and directly optimize functional correctness. As base models improve, however, the nature of failures shifts. Modern models increasingly produce code that is syntactically valid, structurally plausible, and fully executable, yet still fails unit tests due to subtle semantic mistakes: an incorrect condition, a misplaced update, or a misinterpreted invariant.
Unit tests provide a reliable correctness signal, but it is temporally coarse: it applies to the entire program rather than to the specific decisions that caused failure. Group-based policy gradients distribute this signal uniformly, so near-correct solutions receive gradients too diffuse to correct localized reasoning errors.

This paper addresses the problem of semantic credit assignment in critic-free RLVR for code generation and targets the near-correct regime, where further gains depend on precise attribution rather than coarse feedback. The following are our key contributions:

1. We show that credit assignment, not reward sparsity, is the main bottleneck in critic-free RL for code generation once models already produce syntactically valid and structurally reasonable programs.

2. We introduce EGCA, which routes each sample through deterministic failure-mode gates (syntax/constraint/logic) and, for near-correct candidates, localizes the earliest execution divergence against a reference trace to concentrate the GRPO advantage on the causal token span.

3. We show the approach is agnostic to the debugger's code-generation ability: the trained student surpasses a 1.5B-parameter debugger by +8.2 points, ruling out knowledge distillation as the mechanism.

* Accepted to the ICLR 2026 Workshop on Scaling Post-Training for LLMs (SPOT).
† Lead author and corresponding author; primary contribution to this work.

Figure 1: Motivation: credit smear vs. localized updates. Sequence-level RLVR objectives apply unit-test outcomes uniformly across long programs, penalizing large spans of correct code for a localized semantic bug. EGCA concentrates gradient mass on the earliest semantically divergent span (identified via execution) while masking downstream tokens, improving credit assignment in the near-correct regime.

2 Related Work

We situate EGCA against five lines of work that densify credit in RLVR for code.
The shared limitation is that none reliably localizes failure to semantically causal regions of fully executing programs.

Richer outcome signals. Methods such as RLTF (Liu et al., 2023) enrich outcome feedback from execution beyond binary pass/fail, but the signal remains outcome-anchored and does not identify where within a program failure originated.

Execution-aware masking. StepCoder (Dou et al., 2024) masks unexecuted tokens during updates, reducing spurious blame. However, when programs execute to completion, all tokens are executed, and masking provides no disambiguation among them.

Group-structure credit. Prefix-tree methods such as TEMPO/P2T (Tran et al., 2025) derive token-level updates from textual branching points within sample groups. This provides cleaner credit than sequence-level baselines, but textual divergence does not necessarily coincide with the causal location of semantic failure in code.

Learned evaluators. Process reward models (Li et al., 2025; Lightman et al., 2023) train step-level scorers for RL shaping. These meaningfully densify supervision but inherit challenges around label noise and distribution shift from the learned evaluator.

Execution semantics and actor–critic. CodeRL+ (Jiang et al., 2025) adds auxiliary execution-alignment objectives, and actor–critic methods (Le et al., 2022; Shojaee et al., 2023) provide dense shaping via learned value functions. Both depart from the critic-free regime.

Figure 2: EGCA pipeline. We extract constraints from a canonical reference, sample and execute a group of programs, route each into SYNTAX/CONSTRAINT/LOGIC/CORRECT via deterministic gates, and apply token-level GRPO by localizing advantage (compiler span for SYNTAX, earliest reference-trace divergence for LOGIC) while masking downstream tokens.

Concurrent directions.
Concurrent work explores complementary directions: RLEF (Gehring et al., 2024) grounds code LLMs in execution feedback via RL; multi-turn rewards (Jain et al., 2025) study single-step reward shaping across turns; and MURPHY (Ekbote et al., 2025) applies multi-turn GRPO for self-correction. EGCA differs in localizing credit to the earliest divergence within a single generation.

Summary and gap. Existing methods densify feedback or reduce noise, but do not reliably attribute failure to semantically causal regions of otherwise well-formed, constraint-following programs. EGCA targets this gap: for near-miss programs that execute but fail due to a localized reasoning error, it identifies the causal span via reference-trace comparison and concentrates the GRPO advantage there.

3 Method

3.1 Problem Setting

Let x denote a programming problem and let \pi_\theta be a code generation policy that samples programs y ~ \pi_\theta(\cdot | x). Programs are evaluated using unit tests, producing a base verifiable score \hat{R}(y) \in [0, 1] (fraction of tests passed). We incorporate algorithmic constraints by defining the reward used for optimization as

    R(y) = \hat{R}(y) \, I_C(y),    (1)

so that a program receives full reward only when it both passes all tests and satisfies the extracted constraints.

We adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024). For each problem x, we sample a group of G programs {y_i}_{i=1}^G and compute group-relative advantages

    A_i = R(y_i) - \frac{1}{G} \sum_{j=1}^{G} R(y_j).    (2)

3.2 Canonical Solutions as a Structural Pivot

For each problem x, we assume access to a canonical reference solution y_ref, curated once offline. The reference is not used as a target for imitation. Instead, it serves as a non-parametric pivot for (i) extracting algorithmic constraints, (ii) defining a reference execution behavior, and (iii) anchoring semantic comparisons.
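The gated reward of Eq. (1) and the group-relative advantage of Eq. (2) reduce to a few lines of code. The following is a minimal sketch; the helper names and the example group are illustrative, not part of the paper's pipeline:

```python
# Sketch of Eq. (1) (reward gating) and Eq. (2) (group-relative
# advantage). `frac_tests_passed` plays the role of R_hat(y) and
# `satisfies_constraints` the role of I_C(y); both are hypothetical.

def reward(frac_tests_passed, satisfies_constraints):
    """R(y) = R_hat(y) * I_C(y): full reward requires passing all
    tests AND satisfying the extracted constraints."""
    return frac_tests_passed * (1.0 if satisfies_constraints else 0.0)

def group_advantages(rewards):
    """A_i = R(y_i) - (1/G) * sum_j R(y_j) over a group of G samples."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# A hypothetical group of G = 4 rollouts.
rs = [reward(1.0, True), reward(0.75, True),
      reward(1.0, False), reward(0.5, True)]
advs = group_advantages(rs)  # advantages sum to zero by construction
```

Note that the constraint gate zeroes the reward of the third rollout even though it passes all tests, so the group baseline, and hence every advantage, reflects constraint satisfaction as well as test outcomes.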
This requirement limits applicability to settings where at least one correct solution exists (e.g., competitive programming and function synthesis with tests); extending EGCA to open-ended generation without references remains future work.

3.3 Constraint-Guided Sampling

Constraint extraction. A debugging-oriented teacher model extracts from (x, y_ref) a set of algorithmic constraints

    C = {c_1, ..., c_M},    (3)

where each constraint specifies a permitted or forbidden structural property (e.g., control-flow form, permitted data structures, or complexity targets). Constraints are non-executable, non-token-level, and solution-agnostic.

Sampling with constraints. Constraints are injected as soft guidance via a prompt suffix:

    y_i ~ \pi_\theta(\cdot | x \| C).    (4)

This biases sampling toward programs that are structurally comparable to y_ref, increasing the density of samples for which semantic divergence is meaningful.

Constraint satisfaction. We define a deterministic indicator I_C(y) = 1 if y satisfies all constraints in C, and 0 otherwise.

In addition, semantic divergence is only well-defined when a candidate is comparable to the reference at a coarse structural level. We therefore compute a comparability indicator I_cmp(y) \in {0, 1} via normalized AST/CFG validation (Section 3.4). Candidates that fail this gate are treated as constraint violations in m(\cdot).

3.4 Structural Validation (Comparability Gate)

We parse y and y_ref into ASTs, construct normalized CFGs, compute structural similarity scores, and declare a candidate comparable if scores exceed fixed thresholds, yielding I_cmp(y) \in {0, 1}.
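To make the comparability gate concrete, here is a minimal sketch using Python's standard ast module. The set-of-node-types Jaccard similarity and the 0.8 threshold are illustrative stand-ins for the paper's normalized AST/CFG scoring, not its actual metric:

```python
import ast

def _node_types(src):
    """Collect the set of AST node-type names occurring in a program,
    discarding identifiers and literal values (a crude structural
    normalization)."""
    return {type(n).__name__ for n in ast.walk(ast.parse(src))}

def comparable(candidate, reference, threshold=0.8):
    """I_cmp(y): declare a candidate comparable to the reference when a
    structural similarity score clears a fixed threshold. Syntactic
    failures are routed to an earlier gate, so unparseable candidates
    simply return False here."""
    try:
        a, b = _node_types(candidate), _node_types(reference)
    except SyntaxError:
        return False
    return len(a & b) / len(a | b) >= threshold

# A near-miss differing only in one comparison operator stays
# comparable; a structurally different program does not.
ref = ("def f(xs):\n"
       "    t = 0\n"
       "    for x in xs:\n"
       "        if x > 0:\n"
       "            t += x\n"
       "    return t")
near_miss = ref.replace(" > ", " >= ")
ok = comparable(near_miss, ref)
```

The gate is deliberately coarse: it only asks whether divergence analysis is meaningful, not whether the candidate is correct.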
3.5 Failure Modes and Credit Operator

Let m(y) \in {CORRECT, CONSTRAINT, SYNTAX, LOGIC} denote a deterministic failure-mode classifier defined by the following priority order (cases evaluated top-down):

    m(y) = SYNTAX      if y raises a compile/runtime error,
           CONSTRAINT  if I_C(y) = 0 or I_cmp(y) = 0,
           CORRECT     if \hat{R}(y) = 1 and I_C(y) = 1,
           LOGIC       otherwise.    (5)

(Note that the priority ordering ensures I_cmp(y) = 1 for any sample reaching the CORRECT or LOGIC cases.)

We handle syntactic failures before CFG-based validation, since syntax errors can prevent reliable AST/CFG construction; compiler/interpreter diagnostics then provide a precise localization signal for token-level credit assignment.

For a sampled completion y_i of length T_i, EGCA defines token-level advantages a_{i,t} as a piecewise function of m(y_i) and a small set of diagnostic spans.

Syntax span. If m(y_i) = SYNTAX, the compiler/interpreter returns a location; we map it to a token span T_err \subset {1, ..., T_i} and localize credit to that span, bypassing semantic divergence analysis.

Divergence span. If m(y_i) = LOGIC, we localize the earliest semantic divergence against a reference execution to obtain a boundary index k* and an associated token span T_{k*} \subset {1, ..., T_i} (defined below).

Token-level advantage operator. We then set

    a_{i,t} = A_i / T_i                           if m(y_i) = CORRECT,
              A_i / T_i                           if m(y_i) = CONSTRAINT,
              (A_i / |T_err|) 1[t \in T_err]      if m(y_i) = SYNTAX,
              (A_i / |T_{k*}|) 1[t \in T_{k*}]    if m(y_i) = LOGIC.    (6)

The operator is normalized so that \sum_{t=1}^{T_i} a_{i,t} = A_i for every mode, while only localizing credit when blame can be attributed to a specific span.
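The operator of Eq. (6) is straightforward to implement. A minimal sketch follows; token positions are 1-indexed, the mode strings mirror the classifier m(y), and all names are illustrative:

```python
def token_advantages(A_i, T_i, mode, span=None):
    """Token-level advantages a_{i,t} of Eq. (6). `span` is the set of
    1-indexed token positions T_err (SYNTAX) or T_{k*} (LOGIC); modes
    without a diagnostic span spread A_i uniformly. In every mode the
    per-token advantages sum to A_i."""
    if mode in ("CORRECT", "CONSTRAINT"):
        return [A_i / T_i] * T_i
    if mode in ("SYNTAX", "LOGIC"):
        assert span, "localized modes require a nonempty diagnostic span"
        return [A_i / len(span) if t in span else 0.0
                for t in range(1, T_i + 1)]
    raise ValueError("unknown mode: " + mode)

# A LOGIC failure with advantage A_i = -0.5 on a 10-token completion,
# localized to the 2-token divergence span {4, 5}: the penalty lands
# entirely on those tokens, and all other tokens receive zero.
a = token_advantages(-0.5, 10, "LOGIC", span={4, 5})
```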
3.6 Execution-Grounded Divergence Localization

We apply divergence localization only when m(y_i) = LOGIC, i.e., for candidates that are both constraint-satisfying and comparable to the reference.

Semantic divergence. Let d be the first failing unit test input. We define execution boundaries

    B(y_i) = (b_1, ..., b_K),    (7)

each mapping to a token span T_k \subset {1, ..., T_i}. Executing both programs yields state traces

    \tau(y_i, d) = (S_1, ..., S_K),    \tau(y_ref, d) = (S^ref_1, ..., S^ref_K).    (8)

We define the earliest semantic divergence boundary

    k* = min { k : S_k \neq S^ref_k }.    (9)

Because static alignment alone cannot reliably map trace mismatches to fault regions, a debugging-oriented LLM localizes k* over the aligned structure and paired traces; it is not used as a correctness oracle.

3.7 Final GRPO Objective

The overall GRPO objective with token-level advantages is

    L(\theta) = - \sum_{i=1}^{G} \sum_{t=1}^{T_i} a_{i,t} \log \pi_\theta(y_{i,t} | x, y_{i,<t}).

Worked example: longest increasing subsequence.

Problem specification (doctest):

    >>> longest_increasing_subsequence([10, 9, 2, 5, 3, 7, 101, 18])
    4

GRPO output (fails unit tests):

    def longest_increasing_subsequence(nums):
        if not nums:
            return 0
        n = len(nums)
        dp = [1] * n
        for i in range(1, n):
            for j in range(i):
                if nums[i] >= nums[j]:  # BUG: should be strictly greater (>)
                    dp[i] = max(dp[i], dp[j] + 1)
        return max(dp)

Unit test result: FAILED on input [1, 3, 3, 5] (expected 3; got 4).

EGCA output (passes unit tests):

    def longest_increasing_subsequence(nums):
        if not nums:
            return 0
        n = len(nums)
        dp = [1] * n
        for i in range(1, n):
            for j in range(i):
                if nums[i] > nums[j]:  # CORRECT
                    dp[i] = max(dp[i], dp[j] + 1)
        return max(dp)

Execution trace divergence (first semantic mismatch).
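The divergence localization on this example can be reproduced with a small instrumented harness. The sketch below records a simplified per-boundary state (the comparison outcome and the current dp array) rather than the paper's full instrumentation:

```python
def traced_lis(nums, strict):
    """Instrumented LIS. At each inner-loop boundary (i, j) we record
    the comparison outcome and the current dp array, mirroring the
    per-boundary state traces tau(y, d) of Section 3.6."""
    dp, trace = [1] * len(nums), []
    for i in range(1, len(nums)):
        for j in range(i):
            taken = nums[i] > nums[j] if strict else nums[i] >= nums[j]
            if taken:
                dp[i] = max(dp[i], dp[j] + 1)
            trace.append(((i, j), taken, tuple(dp)))
    return trace

def first_divergence(cand_trace, ref_trace):
    """k* = min{k : S_k != S_k^ref}: the earliest boundary whose
    recorded state differs between candidate and reference."""
    for k, (c, r) in enumerate(zip(cand_trace, ref_trace)):
        if c != r:
            return k, c[0]  # step index and the (i, j) boundary
    return None

failing_input = [1, 3, 3, 5]                    # first failing test input d
cand = traced_lis(failing_input, strict=False)  # buggy candidate uses >=
ref = traced_lis(failing_input, strict=True)    # reference uses strict >
k_star, boundary = first_divergence(cand, ref)
```

As in Table 7, the traces agree at boundaries (1, 0) and (2, 0) and first diverge at (2, 1), where the candidate's >= counts the duplicate 3; every later boundary is downstream of the fault and would be masked by the credit operator.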
For input [1, 3, 3, 5], the first semantic divergence occurs at (i=2, j=1) when comparing equal elements (3 vs. 3).

Table 7: Trace comparison for [1, 3, 3, 5]. The first divergence occurs when the candidate uses >= while the reference uses >.

    Step  i  j  nums[i]  nums[j]  Divergence?
    1     1  0  3        1        No
    2     2  0  3        1        No
    3     2  1  3        3        Yes

Credit assignment contrast. Under uniform sequence-level GRPO, the negative advantage is spread across all tokens in the program, so the comparison operator receives the same penalty weight as many unrelated tokens. EGCA maps the divergence to the span containing the comparison operator and concentrates the full advantage on that span while masking downstream tokens.

A.8 Appendix H: Full Hyperparameter Table

Table 8: Hyperparameters used in our experiments.

    Hyperparameter                  Value
    SFT learning rate               2 × 10^-5
    SFT epochs                      3
    SFT warmup                      0.3 epochs
    SFT LR schedule                 Linear decay to zero
    GRPO policy learning rate       5 × 10^-7
    GRPO optimizer                  AdamW
    Rollouts per prompt (G)         16
    Train sampling temperature      0.8
    Train sampling top-p            0.9
    Max generation tokens           8192
    KL coefficient (beta)           0.05
    Clip epsilon (epsilon)          0.2
    Eval decoding temperature       0.2
    Eval decoding top-p             0.95
    Hardware                        8 × A100 80GB
    Global batch size               64
