Efficient Soft Actor-Critic with LLM-Based Action-Level Guidance for Continuous Control


Authors: Hao Ma (1,2), Zhiqiang Pu (1,2), Xiaolin Ai (1,2), Huimu Wang (3)

Abstract

We present GuidedSAC, a novel reinforcement learning (RL) algorithm that facilitates efficient exploration in vast state-action spaces. GuidedSAC leverages large language models (LLMs) as intelligent supervisors that provide action-level guidance for the Soft Actor-Critic (SAC) algorithm. The LLM-based supervisor analyzes the most recent trajectory using state information and visual replays, offering action-level interventions that enable targeted exploration. Furthermore, we provide a theoretical analysis of GuidedSAC, proving that it preserves the convergence guarantees of SAC while improving convergence speed. Through experiments in both discrete and continuous control environments, including toy text tasks and complex MuJoCo benchmarks, we demonstrate that GuidedSAC consistently outperforms standard SAC and state-of-the-art exploration-enhanced variants (e.g., RND, ICM, and E3B) in terms of sample efficiency and final performance.

(1) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; (2) Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; (3) JD.com, Beijing, China. Correspondence to: Zhiqiang Pu <zhiqiang.pu@ia.ac.cn>. Preprint. March 19, 2026.

1. Introduction

While reinforcement learning (RL) has demonstrated successes in robot control (Johannink et al., 2019; Galljamov et al., 2022; Zhang et al., 2024; Ma et al., 2023), a fundamental challenge hindering its broader application, particularly in complex robotic tasks, is its well-known high sample complexity, which makes exploration very inefficient. This issue is significantly amplified in vast state-action spaces, where acquiring a practically useful policy through RL entirely from scratch often proves infeasible.

Exploration based on intrinsic rewards, such as count-based (Bellemare et al., 2016; Tang et al., 2017), memory-based (Savinov et al., 2018; Jiang et al., 2025), and prediction-based (Burda et al., 2018; Pathak et al., 2017) methods, has been extensively studied to address the exploration problem by providing bonuses that encourage agents to discover novel states. However, a critical issue remains overlooked: a novel state does not necessarily equate to a valuable state, and it is the valuable states that are crucial for effective exploration. Novel states are relatively easy to encounter, but the likelihood of discovering high-value states or trajectories is exceedingly low. This renders exploration within vast state-action spaces inefficient, even with intrinsic reward methods.

To explore high-value states efficiently, another line of work takes an imitation-based approach, which initially leverages demonstrations to learn a policy and then refines it through online RL (Peng et al., 2018; Galljamov et al., 2022). This approach effectively learns the prior distribution of the policy and explores around this distribution. Recent studies have shown that such methods can enable humanoid robots, operating in vast state-action spaces, to learn to walk naturally. However, collecting demonstrations is costly, and dealing with heterogeneous demonstration data is challenging. These limitations impose a barrier to the wide application of the imitation-based approach.

To address the limitations of existing exploration methods, we propose leveraging the concept of a real-time supervisor that can guide the RL learning process based on its observations. This paradigm mirrors how humans learn under supervision, where guidance, even if imperfect, can significantly accelerate learning.
To realize such a supervisor capable of providing targeted guidance across diverse scenarios, we turn to large language models (LLMs). Benefiting from vast pretraining data, LLMs acquire a broad understanding of policies and principles applicable to a wide range of tasks, as demonstrated by recent works (Jin et al., 2024; Chen et al., 2024; Yan et al., 2024). With targeted guidance from LLMs, RL agents can efficiently explore high-value states without relying on costly manual data collection.

We propose a theoretical framework that proves the convergence and improved sample efficiency of the SAC algorithm under action-level guidance from a suboptimal policy. Upon this framework, we design an LLM-based supervisor that adaptively generates such guidance, where one LLM provides high-level analysis and another generates low-level implementations in the form of rule-based policies. We integrate this design into SAC to develop GuidedSAC. It is worth noting that our theoretical insights are broadly applicable to value-based algorithms. However, in this paper we focus on implementing our method on top of SAC, given its superior performance in continuous control tasks compared to value-based methods.

We evaluate GuidedSAC across both discrete toy text tasks and high-dimensional MuJoCo benchmarks. The results show that it consistently achieves better sample efficiency and final performance than standard SAC and state-of-the-art exploration methods (e.g., RND, ICM, and E3B).

In summary, our contributions are twofold. (1) We present a theoretical analysis that addresses the key question: how can LLM-generated policies improve the efficiency of RL? (2) We propose the GuidedSAC algorithm, which is empirically validated across both discrete and complex continuous control tasks.

2. Preliminary
2.1. Markov Decision Process

The Markov Decision Process (MDP) provides a fundamental framework for sequential decision-making under uncertainty, defined by the tuple $\langle S, A, P, R, \gamma \rangle$. Here, $S$ is the state space, $A$ is the action space, $P: S \times A \times S \to [0, 1]$ is the transition probability, $R: S \times A \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1]$ is the discount factor. The agent's goal is to learn a policy $\pi: S \to A$ that maximizes the expected return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big], \qquad (1)$$

where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory generated by $\pi$.

2.2. Maximum Entropy RL and SAC

Maximum Entropy RL (MaxEnt RL) extends the standard RL objective by maximizing the policy's entropy in addition to the expected reward. This encourages exploration and can lead to better policies. The objective function is:

$$J(\pi) = \mathbb{E}_{s_t \sim \rho_\pi,\, a_t \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot|s_t))\big)\right], \qquad (2)$$

where $\rho_\pi$ is the state distribution under $\pi$, $\alpha$ is the temperature parameter, and $\mathcal{H}(\pi(\cdot|s_t)) = -\mathbb{E}_{a \sim \pi(\cdot|s_t)}[\log \pi(a|s_t)]$ is the policy entropy.

SAC is a model-free, off-policy algorithm based on MaxEnt RL. It learns a stochastic policy and a soft value function by simultaneously maximizing reward and policy entropy. SAC is known for its sample efficiency and stability in continuous action spaces, utilizing actor-critic architectures and off-policy learning from a replay buffer.

3. Method

Although SAC performs well on continuous control problems, it remains inefficient for many complex tasks when learning from scratch, often failing to produce the desired policy without additional guidance. LLMs, leveraging their pretraining knowledge, can provide macro-level insights by identifying areas where the current policy needs improvement and offering targeted guidance. RL can carry out fine-grained policy optimization at the micro level.
From this perspective, LLM policies are complementary to RL policies.

In this section, we introduce a principled algorithm, GuidedSAC, for providing action-level guidance during RL training. We build GuidedSAC on SAC due to its strong performance in continuous control and its inherent advantage of using a Q-value-based Boltzmann policy for policy improvement. In contrast, action-level interventions in policy-gradient methods typically mask gradients at intervened steps (see Appendix B), making them less suitable for learning directly from action-level guidance.

In GuidedSAC, we integrate SAC with an LLM-based supervisor to provide guidance, as shown in Fig. 1. The LLM-based supervisor consists of two agents: an advisor and a coder. The advisor analyzes replays of the current policy and provides advice to the coder. The coder then implements this guidance as code (rule-based policies). Through their collaboration, the LLM-based supervisor effectively provides action-level guidance, achieving improved performance in complex environments.

In the following, we first derive the theoretical motivation for GuidedSAC and then provide its implementation. Finally, we elaborate on the design of the LLM-based supervisor.

3.1. Derivation of Guided Soft Actor-Critic

The algorithm of GuidedSAC is as follows: when an intervention occurs at time step $t$, the state $s_t$ is passed to both $\pi_\phi(\cdot|\cdot)$ and $\pi_{\mathrm{LLM}}(\cdot|\cdot)$. An action $a_t$ is sampled from $\pi_\phi(\cdot|s_t)$, and a residual action $\Delta a_t$ is sampled from $\pi_{\mathrm{LLM}}(\cdot|s_t)$. The final action is computed as $a_t + \Delta a_t$. For simplicity, we denote the policy with intervention as $\pi_{\mathrm{interv}}$, i.e., $(a_t + \Delta a_t) \sim \pi_{\mathrm{interv}}$. When $\pi_\phi(\cdot|s_t)$ is already performing well enough, LLM intervention may not be necessary.
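The residual-action mechanism above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rl_policy`, `llm_residual_policy`, the sign-based nudge, and the [-1, 1] action range are hypothetical stand-ins for the SAC actor and the coder-generated rule-based policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def rl_policy(state):
    """Placeholder for the learned SAC policy pi_phi: samples a bounded action."""
    return np.tanh(rng.normal(size=state.shape))

def llm_residual_policy(state):
    """Placeholder for the LLM-generated rule-based residual policy pi_LLM.

    Hypothetical rule: nudge each action dimension along the sign of the
    corresponding state component.
    """
    return 0.1 * np.sign(state)

def intervened_action(state, g):
    """Sample an action as in the intervention scheme above.

    g is the advisor's decision in {0, 1}. When g = 1, the residual Delta a
    is added to the RL action (i.e., the sample comes from pi_interv);
    otherwise the plain SAC action is used.
    """
    a = rl_policy(state)
    if g == 1:
        a = a + llm_residual_policy(state)
    return np.clip(a, -1.0, 1.0)  # keep the final action in a valid range
```

In GuidedSAC the intervened action is what gets stored in the replay buffer, so subsequent off-policy updates learn from the guided data.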
To determine whether intervention is needed, the advisor evaluates the performance over a recent time window and makes a decision for a period of time: $I(s_{(\lfloor t/M \rfloor - 1)M}, \ldots, s_{\lfloor t/M \rfloor M}) \in \{0, 1\}$, where $M$ denotes the trajectory length. When the advisor decides to intervene, $\pi_{\mathrm{interv}}$ will continue intervening for the remainder of the episode. For simplicity, we denote $I(s_{(\lfloor t/M \rfloor - 1)M}, \ldots, s_{\lfloor t/M \rfloor M})$ as $I(s_{\leq t})$.

Figure 1. The framework of GuidedSAC. GuidedSAC leverages an LLM-based supervisor to analyze the last trajectory and determine whether intervention is necessary. If intervention is triggered, a residual action $\Delta a$ is added to the original action $a$, resulting in the intervened action $\tilde{a}$. This intervened action is then stored in the replay buffer, facilitating the discovery of high-value trajectories.

To generalize this process, the intervention decision for a given state $s$ is represented as $g(s) \in \{0, 1\}$. The resulting action $\tilde{a}_t$ can then be interpreted as being sampled from a mixed behavior policy, expressed as:

$$\tilde{\pi}_\phi(\cdot|s) = g(s)\,\pi_{\mathrm{interv}}(\cdot|s) + \big(1 - g(s)\big)\,\pi_\phi(\cdot|s). \qquad (3)$$

The advisor LLM analyzes replays from the recent past (including both states and images) to decide whether intervention is necessary, and the coder LLM provides guidance on which aspects require adjustment. The update equations are given through the following lemmas. We refer readers to Appendix A for full proofs.

Lemma 1 (Guided Policy Evaluation). Consider the guided Bellman backup operator $\mathcal{T}^{\tilde{\pi}}$ and a mapping $Q: S \times A \to \mathbb{R}$, and define $Q_{k+1} = \mathcal{T}^{\tilde{\pi}} Q_k$.
Then the sequence $Q_k$ converges to the Q-value of $\tilde{\pi}$ as $k \to \infty$.

Lemma 2 (Guided Policy Improvement). Let $\tilde{\pi}_{\mathrm{old}} \in \Pi$ and update the policy following Equation (12) to obtain $\pi_{\mathrm{new}}$. Then $V^{\pi_{\mathrm{new}}}(s_t) \geq V^{\tilde{\pi}_{\mathrm{old}}}(s_t)$ for all $s_t \in S$.

$$\pi_{\mathrm{new}}(\cdot|s_t) = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\!\left(\pi'(\cdot|s_t) \,\Bigg\|\, \frac{\exp\big(Q^{\tilde{\pi}_{\mathrm{old}}}(s_t, \cdot)\big)}{Z^{\tilde{\pi}_{\mathrm{old}}}(s_t)}\right) = \arg\min_{\pi' \in \Pi} J_{\tilde{\pi}_{\mathrm{old}}}\big(\pi'(\cdot|s_t)\big), \quad \forall s_t \in S. \qquad (4)$$

From the above two lemmas, we can tell that the guidance in the behavior policy $\tilde{\pi}$ does not affect the convergence of $\pi$. This key finding directly informs our design: we can freely guide the behavior policy using any intervention policy $\pi_{\mathrm{interv}}$. By training on the data it collects, $\pi$ is guaranteed to converge, which can be expressed as the following theorem.

Theorem 1 (Convergence of GuidedSAC). By repeatedly applying the guided policy evaluation in Equation (10) and the guided policy improvement in Equation (12), the policy network $\pi_{\mathrm{new}}$ converges to the optimal policy $\pi^*$ if $V^{\tilde{\pi}}(s) \geq V^{\pi}(s)$ for all $s \in S$.

The assumption that $V^{\tilde{\pi}}(s) \geq V^{\pi}(s)$ for all $s \in S$ is reasonable, as we can always choose not to intervene when $\pi$ is sufficiently good, thereby ensuring the assumption remains satisfied. In the following proposition, we derive a theoretical analysis to show how the quality of the guidance policy $\pi_{\mathrm{interv}}$ and the intervention decision $g(\cdot)$ affect the efficiency of the GuidedSAC algorithm.

Proposition 1 (Single Step Improvement). Under the assumption that the guidance is superior to the current policy, that is, $V^{\pi_{\mathrm{interv}}}(s) \geq V^{\pi_{\mathrm{old}}}(s)$ if $g(s) > 0$, the lower bound of the improvement is higher than without guidance.

Proof. By definition, $\tilde{\pi}_{\mathrm{old}}(\cdot|s) = g(s)\,\pi_{\mathrm{interv}}(\cdot|s) + \big(1 - g(s)\big)\,\pi_{\mathrm{old}}(\cdot|s)$.
We can decompose $V^{\tilde{\pi}_{\mathrm{old}}}$ in Equation (15) such that

$$\begin{aligned}
Q^{\tilde{\pi}_{\mathrm{old}}}(s_t, a_t) &= r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\Big[V^{\tilde{\pi}_{\mathrm{old}}}(s_{t+1})\Big] \\
&= r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\Big[g(s_{t+1})\, V^{\pi_{\mathrm{interv}}}(s_{t+1}) + \big(1 - g(s_{t+1})\big)\, V^{\pi_{\mathrm{old}}}(s_{t+1})\Big] \\
&\leq Q^{\pi_{\mathrm{new}}}(s_t, a_t). \qquad (5)
\end{aligned}$$

When $g(s_{t+1}) = 0$, GuidedSAC is equivalent to SAC. When $g(s_{t+1}) = 1$, the improvement is guaranteed if $V^{\pi_{\mathrm{interv}}}(s_{t+1}) \geq V^{\pi_{\mathrm{old}}}(s_{t+1})$. That is, $\pi_{\mathrm{interv}}$ does not need to be optimal to guide $\pi_{\mathrm{old}}$; it only needs to be better than $\pi_{\mathrm{old}}$ when guidance occurs to achieve reliable improvement. This insight suggests that two elements are crucial for GuidedSAC: a good intervention policy $\pi_{\mathrm{interv}}$ and a good $g(\cdot)$ to identify when to guide.

To summarize, Lemmas 1 and 2, along with Theorem 1, demonstrate that SAC can still converge under action-level interventions, as long as the mixed policy $\tilde{\pi}$ outperforms the original policy $\pi$. The function $g(\cdot)$ ensures this condition is met, even in difficult situations. Proposition 1 reveals that intervention policies need not be optimal to be beneficial: as long as the intervention policy $\pi_{\mathrm{interv}}$ locally outperforms the current policy, it can significantly enhance sample efficiency. This highlights the strength of the proposed framework, which uses imperfect but helpful guidance to speed up learning and tackle challenges in RL for complex continuous control tasks.

3.2. Guided Soft Actor-Critic

Based on the previous derivations, we implement GuidedSAC's Equation (10) and Equation (12) with neural networks. For state value estimation, we use a TD loss to update the state value network and the Q network. Since the algorithm is based on SAC, the state value includes a term related to the entropy of the policy.
The update of the soft value estimation is as follows:

$$J_V(\psi) = \mathbb{E}_{s_t \sim \tilde{\mathcal{D}}}\left[\frac{1}{2}\Big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\big[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t|s_t)\big]\Big)^2\right], \qquad (6)$$

where $\tilde{\mathcal{D}}$ is the replay buffer mixed with guided trajectories. The state value network is not strictly necessary, but it can improve the stability of action value estimation. The update of the critic network follows the TD loss

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \tilde{\mathcal{D}}}\left[\frac{1}{2}\Big(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\Big)^2\right], \qquad (7)$$

$$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V_{\bar{\psi}}(s_{t+1})\big], \qquad (8)$$

where $\bar{\psi}$ parameterizes a target value network maintained by the soft update $\bar{\psi} \leftarrow \tau \psi + (1 - \tau)\bar{\psi}$. By expanding Equation (12), the loss function of the actor network is

$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \tilde{\mathcal{D}},\, \epsilon_t \sim \mathcal{N}}\big[\log \pi_\phi\big(f_\phi(\epsilon_t; s_t)\,|\,s_t\big) - Q_\theta\big(s_t, f_\phi(\epsilon_t; s_t)\big)\big]. \qquad (9)$$

Figure 2. Illustration of the LLM-based supervisor's cooperation details.

Compared to SAC, the loss functions of GuidedSAC differ in that the replay buffer $\tilde{\mathcal{D}}$ contains samples drawn from the mixed policy $\tilde{\pi}_\phi$. According to Theorem 1 and Proposition 1, sampling from $\tilde{\pi}_\phi$ does not affect convergence; moreover, if $\pi_{\mathrm{interv}}$ is superior to $\pi_\phi$, it can even accelerate convergence. The pseudocode of GuidedSAC is provided in Alg. 1.

3.3. LLM-based Supervisor

Recent research has shown that task decomposition, where multiple LLMs cooperate to complete a task, is more effective than relying on a single LLM to handle the entire task (Guo et al., 2024).
Inspired by this, the LLM-based supervisor is designed to include two LLMs: an advisor observes the policy replay and provides suggestions, and a coder generates a rule-based policy based on those suggestions, as illustrated in Fig. 2. When the advisor LLM finds that the RL policy is good enough, it decides not to intervene.

GuidedSAC relies on well-designed prompts to define $\pi_{\mathrm{interv}}$ and $g(\cdot)$ effectively. We use a five-component framework consisting of a task definition, background information, chain-of-thought reasoning, domain hints, and a code template. The task definition establishes the agent's role and objectives, while the background information provides environmental context and documentation. Chain-of-thought reasoning supports systematic analysis, and domain hints improve accuracy by addressing specific LLM limitations. Finally, the code template ensures the output follows a structured format. The advisor uses the first four elements, whereas the coder also includes the code template to maintain implementation consistency. Contemporary LLM agent architectures often incorporate these five elements into their prompt designs (Wang et al., 2025). Detailed prompts are provided in Appendix C.3.

4. Experiment

We design our experiments to systematically validate GuidedSAC through a progression of increasing complexity. First, we evaluate in toy text environments with discrete action spaces using direct policy substitution $\tilde{\pi} = \pi_{\mathrm{LLM}}$ rather than residual actions. This simplification allows us to isolate and validate the core idea that LLM-based guidance can accelerate learning. Among the baseline exploration methods, RND demonstrates the strongest overall performance on these discrete tasks.
Building on this foundation, we then evaluate GuidedSAC on the continuous control benchmarks MountainCar and Humanoid using the full intervention mechanism with residual actions, where RND serves as the primary exploration baseline due to its success in the first experiment. Finally, ablation studies analyze intervention timing and duration to understand optimal guidance policies.

The progression from discrete to continuous action spaces is designed to validate that LLM guidance provides both sample efficiency gains and the ability to generate interpretable, task-appropriate policies. By establishing strong performance in toy text environments, where guidance effects can be cleanly isolated, we build confidence that the approach scales to complex, real-world relevant control problems.

4.1. Discrete Control Tasks

Setup. To evaluate the effectiveness of GuidedSAC and compare it with state-of-the-art exploration methods, we first conduct experiments on four classic discrete control tasks from toy text environments.

The toy text environments include Blackjack, a card game where the agent must learn optimal decision-making under uncertainty. CliffWalking is a grid-world navigation task requiring the agent to find the shortest path while avoiding a cliff. FrozenLake provides a stochastic grid-world environment where the agent must navigate across a frozen lake with holes. Taxi presents a domain where the agent must pick up and deliver passengers while navigating a grid world. These environments are characterized by discrete state and action spaces and sparse rewards, making them ideal testbeds for comparing exploration strategies.

In these tasks, we employ direct policy substitution $\tilde{\pi} = \pi_{\mathrm{LLM}}$ instead of residual action intervention. This simplification allows us to isolate the effect of LLM-based guidance from the intervention mechanism.
By doing so, we can directly validate the core hypothesis that semantic, task-aware guidance accelerates learning more effectively than undirected, novelty-based exploration.

Baselines. We compare GuidedSAC against three prominent intrinsic-reward-based exploration methods. RND (Random Network Distillation) (Burda et al., 2018) encourages exploration by rewarding the agent for visiting states that are novel to a randomly initialized neural network. E3B (Exploration via Elliptical Episodic Bonuses) (Henaff et al., 2022) uses elliptical bonuses to guide exploration in episodic settings. ICM (Intrinsic Curiosity Module) (Pathak et al., 2017) leverages prediction errors as intrinsic rewards to drive exploration. These methods represent the current state of the art in exploration-driven reinforcement learning and provide a comprehensive baseline for evaluating the effectiveness of LLM-based guidance.

Figure 3. Performance comparison on toy text. Training curves comparing GuidedSAC with RND, E3B, and ICM across four toy text environments: (a) Blackjack, (b) CliffWalking, (c) FrozenLake, (d) Taxi. The shaded regions represent the predefined intervals where intervention can occur.

Efficiency and Convergence Analysis. In Fig. 3, the shaded regions represent the predefined intervals where intervention can occur. Within these intervals, the timing of intervention is autonomously determined by the advisor. As shown, GuidedSAC achieves the highest reward across all four tasks.
Notably, intrinsic reward methods do not necessarily accelerate convergence in all tasks. For instance, in the CliffWalking task, the E3B method actually slows down convergence. This may be attributed to the task's relatively small exploration space and single optimal trajectory, where random sampling before policy updates is sufficient to discover this trajectory, so intrinsic rewards introduce unnecessary exploration overhead. For CliffWalking and Taxi, GuidedSAC generates the optimal rule-based policy at the beginning, enabling near-immediate convergence to the optimal policy in these environments.

Among the baselines, RND emerges as the strongest competitor and demonstrates robust performance in structured environments like Blackjack. However, the fundamental advantage of GuidedSAC lies in its value-aligned exploration.

Figure 4. Training curves for MountainCar and Humanoid. Shaded regions indicate intervention periods for GuidedSAC. For MountainCar the intervention occurs between steps 50k and 53k, while for Humanoid it occurs between steps 700k and 800k. A MountainCar reward of 100 indicates successful goal achievement; Humanoid values above 5000 represent robust bipedal locomotion.

While intrinsic methods like RND and ICM drive the agent toward any novel state, GuidedSAC leverages the semantic understanding of the LLM to direct exploration toward trajectories that are both novel and task-relevant. In FrozenLake, for example, novelty-driven exploration might lead to discovering new ways to fall into holes, whereas GuidedSAC focuses on reaching the goal.
These results empirically support our theoretical analysis in Proposition 1, showing that even non-optimal LLM guidance significantly improves sample efficiency by outperforming the local random policy.

4.2. Continuous Control Tasks

Setup. We evaluate GuidedSAC on two distinct continuous control problems: MountainCar and Humanoid. The MountainCar environment requires an underpowered vehicle to ascend a steep hill by oscillating back and forth to generate momentum. Its state space is a two-dimensional continuous space representing position and velocity, while the agent exerts acceleration within the range of -1.0 to 1.0. The reward for MountainCar includes a hill reward for reaching the goal and a control cost. In contrast, Humanoid is a high-dimensional MuJoCo task in which a 17-joint robot must learn to walk forward. The 47-dimensional state space includes the robot's center-of-mass position, velocity, angular momentum, and joint angles. The reward function for Humanoid consists of a forward reward, a healthy reward, and a control cost. For both tasks, we employ residual action intervention to enable fine-grained control. This allows the RL policy to retain autonomy while receiving targeted guidance from the LLM, which is particularly crucial in Humanoid to overcome the challenge of coordinating 17 continuous control variables.

Baselines. Based on the discrete control experiments, where RND demonstrated the strongest performance among intrinsic reward methods, we employ both SAC and SAC+RND as baselines to evaluate the effectiveness of LLM-based guidance in continuous control settings.

Results. Training curves for continuous control are shown in Fig. 4a and Fig. 4b. In MountainCar, SAC fails to climb the slope and remains stationary to minimize control costs. While SAC+RND eventually discovers a successful trajectory after 80k steps, GuidedSAC achieves high rewards immediately after the first intervention.
This demonstrates that targeted guidance allows the agent to find high-value trajectories without the need for extensive random exploration. The Humanoid task presents a greater challenge due to its high-dimensional state and action spaces. During early training, the agent struggles to balance even with guidance, which results in a slower reward increase compared to SAC. However, once the policy achieves basic stability midway through training, the action-level intervention becomes highly effective. GuidedSAC shows a sharp increase in reward after $7 \times 10^5$ steps and eventually surpasses all baselines. These results support our single-step improvement proposition and show that LLM guidance is most effective when the baseline policy reaches a sufficient level of basic competence.

Guidance Accelerates Discovery of High-Value Trajectories. The success across both tasks highlights how LLM guidance offers targeted intervention compared to the systematic but undirected exploration of intrinsic rewards. In MountainCar, the policy landscape visualization in Fig. 5 shows that the policy undergoes immediate reconfiguration during the intervention window. The intervention policy $\pi_{\mathrm{interv}}$ sets a simple oscillation strategy that is quickly retained by $\pi_\phi$ after the intervention ends. This validates Lemma 3, as data collected under the mixed policy $\tilde{\pi}$ effectively transfers knowledge to the agent. In Humanoid, this improvement is more delayed but equally dramatic. This phenomenon reflects the interplay between task complexity and algorithmic design. During early training, the robot struggles with basic balance, meaning that even with action-level intervention, the combined policy may not immediately find high-value trajectories. However, once the base policy $\pi_\phi$ achieves basic stability, the residual guidance (such as sinusoidal joint control) enhances the existing policy rather than overriding it, leading to rapid convergence.
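To make the idea of sinusoidal residual guidance concrete, the following is a hypothetical sketch of what a coder-generated rule-based residual policy might look like; the joint indices, amplitude, and period are illustrative assumptions, not the values produced by the LLM in our experiments.

```python
import numpy as np

class SinusoidalGaitResidual:
    """Hypothetical rule-based residual policy in the spirit of the
    'sinusoidal joint control' guidance discussed above.

    Emits small antiphase sinusoidal offsets on two hip-joint action
    dimensions to bias the agent toward alternating leg motion.
    """

    def __init__(self, left_hip=5, right_hip=11, amplitude=0.3,
                 period=40, action_dim=17):
        self.left_hip = left_hip      # illustrative joint indices
        self.right_hip = right_hip
        self.amplitude = amplitude
        self.period = period
        self.action_dim = action_dim
        self.t = 0                    # internal counter driving the gait phase

    def __call__(self, obs):
        # Antiphase sinusoids on the two hips encourage alternating legs;
        # the observation is unused in this simple open-loop rule.
        phase = 2.0 * np.pi * self.t / self.period
        self.t += 1
        delta = np.zeros(self.action_dim)
        delta[self.left_hip] = self.amplitude * np.sin(phase)
        delta[self.right_hip] = self.amplitude * np.sin(phase + np.pi)
        return delta
```

Because the residual is added to the SAC action rather than replacing it, the base policy remains responsible for balance while the rule merely biases the leg timing.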
Qualitative Analysis of Policy Reconfiguration. The effectiveness of GuidedSAC is further evidenced by the quality of the learned behaviors. In MountainCar, the reward curve of SAC suggests it converges to a stationary policy to minimize control costs, while GuidedSAC discovers the non-intuitive oscillation strategy. In Humanoid, as shown in Fig. 6, the qualitative difference is even more pronounced. Standard SAC often converges to an unnatural gait where the robot supports itself on one leg while performing rapid small-step hops. Although this maximizes reward, it is biomechanically inefficient. In contrast, GuidedSAC generates more structured and interpretable locomotion.

Figure 5. Policy landscape evolution in MountainCar. Columns show snapshots at 0k, 50k (intervention start), 53k (intervention end), 60k, 80k, and 100k training steps. The horizontal axis shows car position $x \in [-1.3, 0.7]$; the vertical axis shows velocity $v \in [-0.07, 0.07]$. Color indicates the most probable action $\arg\max_a \pi(a|s)$. Dotted lines intersect at the initial state ($x = -0.5$, $v = 0$). The intervention window causes immediate policy reconfiguration, demonstrating efficient knowledge transfer from $\pi_{\mathrm{interv}}$ to $\pi_\phi$.

Figure 6. Visualization of final policies in the Humanoid task. Left: GuidedSAC produces coordinated bipedal running with sinusoidal leg motion patterns. Right: SAC converges to an unnatural gait, supporting itself on one leg while performing rapid small-step hops. Both achieve reward, but GuidedSAC's locomotion aligns more closely with natural human movement.

By leveraging simple rule-based guidance for sinusoidal leg motion, the agent
This demonstrates that LLM- based guidance can steer the learning process toward seman- tically meaningful and task-appropriate policies rather than just rew ard-maximizing ones. 4.3. Ablation Study According to the single-step improvement proposition, action-lev el intervention in GuidedSA C should not persist throughout the entire training phase. T o systematically in- vestigate the influence of intervention timing and duration on agent performance, we conduct an ablation study using the MountainCar en vironment. In these experiments, we disable the autonomous judgment of the Advisor LLM and instead manually configure the intervention start point s and the intervention duration l . These two factors collectiv ely determine the degree of external guidance provided to the agent. For example, a configuration where s = 5000 and 0 25 50 75 100 125 150 175 200 Steps (10 3 ) −40 −20 0 20 40 60 80 100 Episode Reward GuidedSAC (s=5k, l = 3000) GuidedSAC (s=50k, l = 3000) GuidedSAC (s=100k, l = 3000) (a) Ablation on intervention start point 0 25 50 75 100 125 150 175 200 Steps (10 3 ) −40 −20 0 20 40 60 80 100 Episode Reward GuidedSAC (s=50k, l = 1000) GuidedSAC (s=50k, l = 3000) GuidedSAC (s=50k, l = 10000) GuidedSAC (s=50k, l = 20000) 100 120 140 160 180 200 94.0 94.2 94.4 94.6 94.8 95.0 (b) Ablation on interv ention du- ration F igure 7. Impact of intervention timing and duration. Left side shows intervention start point s tested at 5k, 10k, 20k, and 40k steps. Right side sho ws intervention duration l tested at 1k, 3k, 5k, and 10k steps. Optimal performance requires early intervention with moderate duration. If duration is too short it dilutes impact, while too long hinders autonomous exploration. l = 3000 indicates that the intervention be gins precisely at step 5000 and continues for a fix ed window of 3000 steps. This manual override allo ws us to isolate how different tem- poral interventions affect the con vergence and stability of the baseline policy . 
Effect of Intervention Timing. The results of the ablation study on s (Fig. 7a) indicate that, if the intervention policy is sufficiently effective, earlier intervention yields better results. This aligns with the single-step improvement proposition, which suggests that timely intervention can accelerate policy improvement.

Effect of Intervention Duration. The ablation study on l (Fig. 7b) reveals that the intervention duration should be neither too short nor too long. If the duration is too short, the intervened data in D̃ becomes diluted by non-intervened data, thereby diminishing its impact on policy improvement. On the other hand, if the duration is too long, the intervention will hinder the RL policy from exploring better trajectories once that policy is already good enough.

5. Related Work

RL for Continuous Control. Exploration in continuous control problems poses significant challenges due to the vast state-action space. Take humanoid bipedal locomotion in MuJoCo as an example: the policies produced by RL often remain suboptimal, even with advanced exploration techniques, leading to unnatural forward movement (Lillicrap, 2015; Fujimoto et al., 2018; Schulman et al., 2017; Haarnoja et al., 2018).

To enhance policy quality, researchers have explored reference-based and reference-free paradigms that leverage external data or prior knowledge. Reference-based methods, such as DeepMimic (Peng et al., 2018), use motion capture to enable realistic locomotion, while Galljamov et al. (2022) demonstrate that action-space representation, symmetry priors, and cliprange scheduling accelerate training and improve human-like walking. Reference-free methods, relying on carefully designed reward functions, capture effective bipedal movement characteristics but often require extensive tuning or human demonstrations (Siekmann et al.
, 2021; van Marum et al., 2024).

Combining exploration techniques with prior knowledge may offer an efficient alternative. GuidedSAC introduces an LLM-based supervisor that provides action-level guidance, leveraging LLMs' prior knowledge without extensive reward tuning or human demonstrations. Its adaptability through prompt and template modifications enables the integration of domain-specific knowledge, making it a promising approach for complex environments.

Advice-taking Agents. Advice-taking RL agents enhance learning through external guidance, such as demonstrations or language instructions. For demonstration-based advice-taking, Maclin and Shavlik (1994) introduced a framework that utilizes pre-defined code to guide the learning process. Wu et al. (2023) proposed a human-in-the-loop framework for autonomous navigation, allowing human operators to intervene during training. Similarly, Peng et al. (2024) developed a method to learn a proxy value function from human interventions, which is subsequently used to guide Q-network updates. Cederborg et al. (2015) extended Q-learning by transforming human demonstrations into weights within a Boltzmann policy. Wang et al. (2018) incorporated an imitation loss into PPO to enhance policy updates. For language-based advice-taking, recent advancements have demonstrated the potential of LLMs in shaping RL exploration. Du et al. (2023) introduced ELLM, a method that rewards agents for achieving sub-goals suggested by an LLM, leading to improved performance in various tasks. Ma et al. (2024) proposed ExploRLLM, which integrates LLMs for candidate selection in manipulation tasks, enhancing sample efficiency and policy learning in robotic manipulation. Chen et al.
(2024) integrate an LLM-generated rule-based controller with RL, leveraging the collected data to create an imitation-learning loss that guides policy updates.

However, most existing methods focus on discrete action spaces, while continuous-control approaches often require human intervention, which is impractical in parallelized or high-acceleration environments and infeasible for complex tasks like bipedal locomotion. Moreover, current LLM-aided RL algorithms often lack rigorous theoretical analysis. GuidedSAC bridges this gap by theoretically and empirically validating RL with suboptimal action-level guidance, offering a scalable and efficient framework for continuous-control advice-taking agents.

6. Conclusion

This paper introduces GuidedSAC, which employs an LLM-based supervisor for action-level guidance to facilitate targeted exploration in complex continuous control tasks. We demonstrate, theoretically and empirically, that integrating this LLM-driven guidance into the SAC algorithm preserves its convergence properties while improving convergence speed and overall performance. Our theoretical analysis identifies conditions under which external guidance is most beneficial to RL. Experimental results on discrete toy text tasks and continuous control benchmarks show that GuidedSAC enables efficient learning and promotes more reasonable policies. GuidedSAC presents a promising direction for developing exploration strategies in continuous control that intelligently explore high-value states or trajectories.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation.
Advances in Neural Information Processing Systems, 29, 2016.

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.

Thomas Cederborg, Ishaan Grover, Charles L Isbell Jr, and Andrea Lockerd Thomaz. Policy shaping with human teachers. In IJCAI, pages 3366–3372, 2015.

Liangliang Chen, Yutian Lei, Shiyu Jin, Ying Zhang, and Liangjun Zhang. Rlingua: Improving reinforcement learning sample efficiency in robotic manipulations with large language models. IEEE Robotics and Automation Letters, 9(7):6075–6082, 2024. doi: 10.1109/LRA.2024.3400189.

Yuqing Du, Olivia Watkins, Zihan Wang, Cédric Colas, Trevor Darrell, Pieter Abbeel, Abhishek Gupta, and Jacob Andreas. Guiding pretraining in reinforcement learning with large language models. In International Conference on Machine Learning, pages 8657–8677. PMLR, 2023.

Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.

Rustam Galljamov, Guoping Zhao, Boris Belousov, André Seyfarth, and Jan Peters. Improving sample efficiency of example-guided deep reinforcement learning for bipedal walking. In 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids), pages 587–593. IEEE, 2022.

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: a survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 8048–8057, 2024.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor.
In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.

Mikael Henaff, Roberta Raileanu, Minqi Jiang, and Tim Rocktäschel. Exploration via elliptical episodic bonuses. Advances in Neural Information Processing Systems, 35:37631–37646, 2022.

Yuhua Jiang, Qihan Liu, Yiqin Yang, Xiaoteng Ma, Dianyu Zhong, Hao Hu, Jun Yang, Bin Liang, Bo Xu, Chongjie Zhang, et al. Episodic novelty through temporal distance. In The Thirteenth International Conference on Learning Representations, 2025.

Yixiang Jin, Dingzhe Li, Jun Shi, Peng Hao, Fuchun Sun, Jianwei Zhang, Bin Fang, et al. Robotgpt: Robot manipulation learning from chatgpt. IEEE Robotics and Automation Letters, 9(3):2543–2550, 2024.

Tobias Johannink, Shikhar Bahl, Ashvin Nair, Jianlan Luo, Avinash Kumar, Matthias Loskyll, Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Residual reinforcement learning for robot control. In 2019 International Conference on Robotics and Automation (ICRA), pages 6023–6029. IEEE, 2019.

T. P. Lillicrap. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Runyu Ma, Jelle Luijkx, Zlatan Ajanovic, and Jens Kober. Explorllm: Guiding exploration in reinforcement learning with large language models. arXiv e-prints, 2024.

Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.

Richard Maclin and Jude W Shavlik. Incorporating advice into agents that learn from reinforcements. In Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence, pages 694–699, 1994.

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pages 2778–2787.
PMLR, 2017.

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.

Zhenghao Mark Peng, Wenjie Mo, Chenda Duan, Quanyi Li, and Bolei Zhou. Learning from active human involvement through proxy value propagation. Advances in Neural Information Processing Systems, 36, 2024.

Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274, 2018.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Jonah Siekmann, Yesh Godse, Alan Fern, and Jonathan Hurst. Sim-to-real learning of all common bipedal gaits via periodic reward composition. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 7309–7315. IEEE, 2021.

Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. # Exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.

Bart van Marum, Aayam Shrestha, Helei Duan, Pranay Dugar, Jeremy Dao, and Alan Fern. Revisiting reward design and evaluation for robust humanoid standing and walking. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11256–11263. IEEE, 2024.

Fan Wang, Bo Zhou, Ke Chen, Tingxiang Fan, Xi Zhang, Jiangyong Li, Hao Tian, and Jia Pan. Intervention aided reinforcement learning for safe and practical policy optimization in navigation. In Conference on Robot Learning, pages 410–421. PMLR, 2018.
Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2025.

Jingda Wu, Yanxin Zhou, Haohan Yang, Zhiyu Huang, and Chen Lv. Human-guided reinforcement learning with sim-to-real transfer for autonomous navigation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, and Jun Wang. Efficient reinforcement learning with large language model priors. In The Thirteenth International Conference on Learning Representations, 2024.

Chong Zhang, Wenli Xiao, Tairan He, and Guanya Shi. Wococo: Learning whole-body humanoid control with sequential contacts. arXiv preprint, 2024.

A. Proof of Convergence

Lemma 3 (Guided Policy Evaluation). Consider the guided Bellman backup operator $\mathcal{T}^{\tilde{\pi}}$ and a mapping $Q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and define $Q_{k+1} = \mathcal{T}^{\tilde{\pi}} Q_k$. Then the sequence $Q_k$ converges to the Q-value of $\tilde{\pi}$ as $k \to \infty$.

Proof. We expand the guided Bellman backup operator $\mathcal{T}^{\tilde{\pi}}$ as
$$Q(s, a) \leftarrow r(s, a) + \gamma\, \mathbb{E}_{s' \sim p,\, a' \sim \tilde{\pi}}\left[ Q(s', a') \right]. \tag{10}$$
Then we can prove that the guided Bellman backup operator $\mathcal{T}^{\tilde{\pi}}$ is a contraction mapping:
$$\begin{aligned}
\|\mathcal{T}^{\tilde{\pi}} Q_k - \mathcal{T}^{\tilde{\pi}} Q_{k+1}\|_\infty &= \max_{s,a} \left|\mathcal{T}^{\tilde{\pi}} Q_k(s, a) - \mathcal{T}^{\tilde{\pi}} Q_{k+1}(s, a)\right| \\
&\leq \gamma \max_{s,a} \left|\mathbb{E}_{s', a'}\left[ Q_k(s', a') - Q_{k+1}(s', a') \right]\right| \\
&\leq \gamma \max_{s',a'} \left| Q_k(s', a') - Q_{k+1}(s', a') \right| \\
&= \gamma \|Q_k - Q_{k+1}\|_\infty. 
\end{aligned} \tag{11}$$

Lemma 4 (Guided Policy Improvement). Let $\tilde{\pi}_{\mathrm{old}} \in \Pi$ and update the policy following Equation (4) to obtain $\pi_{\mathrm{new}}$. Then $V^{\pi_{\mathrm{new}}}(s_t) \geq V^{\tilde{\pi}_{\mathrm{old}}}(s_t)$ for all $s_t \in \mathcal{S}$.
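Lemma 3's γ-contraction can be checked numerically on a random finite MDP. The sketch below uses the plain guided backup of Equation (10), omitting SAC's entropy term; all quantities (sizes, rewards, transition kernel, guided policy) are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
R = rng.normal(size=(nS, nA))                 # reward r(s, a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA)) # transition kernel P[s, a, s']
pi = rng.dirichlet(np.ones(nA), size=nS)      # guided policy pi~(a | s)

def backup(Q):
    """Guided Bellman backup T^{pi~}: Q <- r + gamma * E_{s', a'}[Q(s', a')]."""
    V = (pi * Q).sum(axis=1)   # V(s') = E_{a' ~ pi~}[Q(s', a')]
    return R + gamma * P @ V   # contract P's last axis against V

# Sup-norm distance shrinks by at least a factor of gamma.
Q1, Q2 = rng.normal(size=(nS, nA)), rng.normal(size=(nS, nA))
lhs = np.abs(backup(Q1) - backup(Q2)).max()
rhs = gamma * np.abs(Q1 - Q2).max()
assert lhs <= rhs + 1e-12
```

Iterating `backup` from any starting table therefore converges to the unique fixed point $Q^{\tilde{\pi}}$, which is the content of the lemma.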
$$\pi_{\mathrm{new}}(\cdot \mid s_t) = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\!\left(Q^{\tilde{\pi}_{\mathrm{old}}}(s_t, \cdot)\right)}{Z^{\tilde{\pi}_{\mathrm{old}}}(s_t)} \right) = \arg\min_{\pi' \in \Pi} J_{\tilde{\pi}_{\mathrm{old}}}(\pi'(\cdot \mid s_t)), \quad \forall s_t \in \mathcal{S}. \tag{12}$$

Proof. Let $\tilde{\pi}_{\mathrm{old}} \in \Pi$ and let $Q^{\tilde{\pi}_{\mathrm{old}}}$ and $V^{\tilde{\pi}_{\mathrm{old}}}$ be the corresponding soft state-action value and soft state value. Since $J_{\tilde{\pi}_{\mathrm{old}}}(\pi_{\mathrm{new}}(\cdot \mid s_t)) \leq J_{\tilde{\pi}_{\mathrm{old}}}(\tilde{\pi}_{\mathrm{old}}(\cdot \mid s_t))$ is guaranteed (we can always choose $\pi_{\mathrm{new}} = \tilde{\pi}_{\mathrm{old}} \in \Pi$), we have
$$\mathbb{E}_{a_t \sim \pi_{\mathrm{new}}}\!\left[ \log \pi_{\mathrm{new}}(a_t \mid s_t) - Q^{\tilde{\pi}_{\mathrm{old}}}(s_t, a_t) \right] \leq \mathbb{E}_{a_t \sim \tilde{\pi}_{\mathrm{old}}}\!\left[ \log \tilde{\pi}_{\mathrm{old}}(a_t \mid s_t) - Q^{\tilde{\pi}_{\mathrm{old}}}(s_t, a_t) \right]. \tag{13}$$
Since the partition function $Z^{\tilde{\pi}_{\mathrm{old}}}$ depends only on the state, the inequality reduces to
$$\mathbb{E}_{a_t \sim \pi_{\mathrm{new}}}\!\left[ Q^{\tilde{\pi}_{\mathrm{old}}}(s_t, a_t) - \log \pi_{\mathrm{new}}(a_t \mid s_t) \right] \geq V^{\tilde{\pi}_{\mathrm{old}}}(s_t). \tag{14}$$
Next, consider the Bellman equation for $Q^{\tilde{\pi}_{\mathrm{old}}}$:
$$\begin{aligned}
Q^{\tilde{\pi}_{\mathrm{old}}}(s_t, a_t) &= r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\!\left[ V^{\tilde{\pi}_{\mathrm{old}}}(s_{t+1}) \right] \\
&\leq r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\!\left[ \mathbb{E}_{a_{t+1} \sim \pi_{\mathrm{new}}}\!\left[ Q^{\tilde{\pi}_{\mathrm{old}}}(s_{t+1}, a_{t+1}) - \log \pi_{\mathrm{new}}(a_{t+1} \mid s_{t+1}) \right] \right] \\
&\ \ \vdots \\
&\leq Q^{\pi_{\mathrm{new}}}(s_t, a_t), \quad \forall (s_t, a_t) \in \mathcal{S} \times \mathcal{A}. 
\end{aligned} \tag{15}$$
Given $Q^{\pi_{\mathrm{new}}}(s_t, a_t) \geq Q^{\tilde{\pi}_{\mathrm{old}}}(s_t, a_t)$ for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$, by definition we have $V^{\pi_{\mathrm{new}}}(s_t) \geq V^{\tilde{\pi}_{\mathrm{old}}}(s_t)$ for all $s_t \in \mathcal{S}$.

Theorem 2 (Convergence of GuidedSAC). By repeatedly applying the guided policy evaluation in Equation (10) and the guided policy improvement in Equation (4), the policy network $\pi_{\mathrm{new}}$ converges to the optimal policy $\pi^*$ if $V^{\tilde{\pi}}(s) \geq V^{\pi}(s)$ for all $s \in \mathcal{S}$.

Proof. Applying $\mathcal{T}^{\tilde{\pi}}$ repeatedly induces $Q_k \to Q^{\tilde{\pi}}$ and hence $V_k \to V^{\tilde{\pi}}$. Given $V^{\tilde{\pi}}(s) \geq V^{\pi}(s)$ for all $s \in \mathcal{S}$, updating $\pi$ according to Equation (4) results in $V^{\pi_{\mathrm{new}}}(s) \geq V^{\tilde{\pi}_{\mathrm{old}}}(s) \geq V^{\pi_{\mathrm{old}}}(s)$, so $V^{\pi}(s)$ increases monotonically. Since $V^{\pi}(s)$ is bounded, $\pi$ eventually converges to $\pi^*$.
B. Theoretical Analysis of Action-Level Intervention in Policy Gradient Methods

In this section, we analyze how an action-level intervention mechanism changes the policy-gradient signal. Consider a standard policy-gradient objective
$$\mathcal{L}_{\mathrm{PG}}(\phi) \propto -\mathbb{E}_{\tau \sim \pi_\phi}\!\left[ \sum_t A_t \log \pi_\phi(a_t \mid s_t) \right], \tag{16}$$
where $A_t$ denotes an advantage estimator.

Action-level intervention policy. Let $g(s) \in [0, 1]$ be a state-dependent intervention gate and $\pi_{\mathrm{interv}}(\cdot \mid s)$ be a fixed intervention policy. We define the intervened behavior policy as the mixture
$$\tilde{\pi}_\phi(a \mid s) = g(s)\, \pi_{\mathrm{interv}}(a \mid s) + (1 - g(s))\, \pi_\phi(a \mid s). \tag{17}$$
When $g(s) \in \{0, 1\}$, Equation (17) reduces to a hard switch: the agent fully follows either $\pi_{\mathrm{interv}}$ or $\pi_\phi$ at any given state.

Policy gradient under intervention. When trajectories $\tau$ are sampled according to the intervened policy $\tilde{\pi}_\phi$, the surrogate objective is
$$\mathcal{L}_{\mathrm{PG}}(\phi) \propto -\mathbb{E}_{\tau \sim \tilde{\pi}_\phi}\!\left[ \sum_t A_t \log \tilde{\pi}_\phi(a_t \mid s_t) \right]. \tag{18}$$
Taking the gradient with respect to the parameters $\phi$ yields
$$\begin{aligned}
\nabla_\phi \mathcal{L}_{\mathrm{PG}}(\phi) &\propto -\mathbb{E}_{\tau \sim \tilde{\pi}_\phi}\!\left[ \sum_t A_t \frac{\nabla_\phi \tilde{\pi}_\phi(a_t \mid s_t)}{\tilde{\pi}_\phi(a_t \mid s_t)} \right] \\
&= -\mathbb{E}_{\tau \sim \tilde{\pi}_\phi}\!\left[ \sum_t A_t \frac{g(s_t)\, \nabla_\phi \pi_{\mathrm{interv}}(a_t \mid s_t) + (1 - g(s_t))\, \nabla_\phi \pi_\phi(a_t \mid s_t)}{\tilde{\pi}_\phi(a_t \mid s_t)} \right].
\end{aligned}$$
Noting that $\nabla_\phi \pi_{\mathrm{interv}}(a_t \mid s_t) = 0$, the gradient simplifies to
$$\nabla_\phi \mathcal{L}_{\mathrm{PG}}(\phi) \propto -\mathbb{E}_{\tau \sim \tilde{\pi}_\phi}\!\left[ \sum_t A_t\, \frac{(1 - g(s_t))\, \pi_\phi(a_t \mid s_t)}{\tilde{\pi}_\phi(a_t \mid s_t)}\, \nabla_\phi \log \pi_\phi(a_t \mid s_t) \right].$$
In the case of a hard intervention where $g(s) \in \{0, 1\}$, the gradient contribution vanishes whenever $g(s_t) = 1$; conversely, when $g(s_t) = 0$, the expression reduces to the standard policy gradient. Under these conditions, the gradient simplifies to
$$\nabla_\phi \mathcal{L}_{\mathrm{PG}}(\phi) \propto -\mathbb{E}_{s_t \sim \rho_{\tilde{\pi}_\phi},\, g(s_t) = 0,\, a_t \sim \pi_\phi}\!\left[ \sum_t A_t \nabla_\phi \log \pi_\phi(a_t \mid s_t) \right]. \tag{19}$$
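The vanishing-gradient claim behind Equation (19) can be checked on a toy two-action policy: parameterize $\pi_\phi$ by a single sigmoid and differentiate the log of the mixture in Equation (17) by hand. The function below is purely illustrative; its name and the test values are ours.

```python
import numpy as np

def grad_log_mixture(phi, a, g, p_interv):
    """d/dphi log pi~(a) for a toy two-action policy with
    pi_phi(a=1) = sigmoid(phi), mixed with a fixed intervention policy."""
    p1 = 1.0 / (1.0 + np.exp(-phi))
    pi_phi = np.array([1.0 - p1, p1])
    dpi_phi = np.array([-p1 * (1 - p1), p1 * (1 - p1)])  # d pi_phi / d phi
    mix = g * p_interv + (1 - g) * pi_phi
    dmix = (1 - g) * dpi_phi  # the fixed policy contributes zero gradient
    return dmix[a] / mix[a]

p_interv = np.array([0.2, 0.8])
# Hard gate on (g = 1): intervened steps carry no learning signal.
assert grad_log_mixture(0.3, 1, g=1.0, p_interv=p_interv) == 0.0
# Gate off (g = 0): the standard score function, nonzero in general.
assert grad_log_mixture(0.3, 1, g=0.0, p_interv=p_interv) != 0.0
```

This mirrors the derivation: with a hard gate, the gradient of the mixture's log-likelihood is exactly zero on intervened steps and reduces to $\nabla_\phi \log \pi_\phi$ everywhere else.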
Thus, action-level intervention effectively acts as a mask on policy updates: only non-intervened steps provide a learning signal for $\phi$. Prior work on intervention-aided policy optimization shows that the actor receives no learning signal once an intervention occurs. As a result, the learned policy can remain unreliable in states that frequently trigger interventions, harming robustness. A common remedy is to add an auxiliary imitation loss that matches the learner to the intervention actions on intervened steps (Wang et al., 2018).

C. Implementation Details

We use stable-baselines3¹ to implement GuidedSAC. All networks are MLPs with two hidden layers. The feature extractor is FlattenExtractor, which flattens the input features into a vector.

Table 1. Hyperparameter settings for GuidedSAC across different environments.

Environment   | Guided Window | End Step | Batch Size | γ    | τ     | α    | Learning Starts
Blackjack     | 2000          | 5,000    | 256        | 0.5  | 0.005 | auto | 1000
CliffWalking  | 2000          | 10,000   | 1024       | 0.99 | 0.005 | 0.01 | 0
FrozenLake    | 100           | 5,000    | 64         | 0.99 | 0.005 | 0.01 | 0
Taxi          | 2000          | 5,000    | 256        | 0.99 | 0.005 | 0.01 | 0
MountainCar   | 1000          | 3,000    | 256        | 0.99 | 0.005 | auto | 0
Humanoid      | 1000          | 900,000  | 256        | 0.99 | 0.005 | auto | 100

Coder prompt (task definition, background information, CoT, domain hint, code template):

# Task Definition: You will serve as an assistant that provides rule-based code for a reinforcement learning agent that is learning to walk in the MuJoCo Humanoid environment. You are expected to write rule-based code to perform human-like walking in the environment.
# Information about the environment: {document of humanoid}
# Knowledge Base
1. The walking cycle control for both legs should follow a sinusoidal function, and a recommended cycle length is 50.
2. The torso should be kept slightly forward; adjust abdomen_y_angle (a minus value means forward) to lean forward.
3.
Arms should be adjusted based on the angle of the legs to maintain arm-leg coordination.
# Task
Receiving an observation of shape [1, 376], I need you to provide the code to solve this problem in the following format: {code template}
You can adjust desired_forward_lean and balance the arms according to advice. Summarize the following advice, think step by step, and then write code to accomplish the task. {Output of Advisor Agent}

Advisor prompt:

You're an expert helpful AI assistant which follows instructions and performs as an advisor as instructed. You have expert knowledge of the Humanoid environment in a simulator named MuJoCo. For each step, you will get one or more observation images. Your advanced capabilities enable you to process and interpret these observation images and other relevant information in detail. You will receive a sequence of observation images of the humanoid. The control algorithm for the humanoid robot has some flaws, which may result in unnatural leg movement, poor coordination between hands and legs, and an excessive leaning of the torso. Please carefully analyze the issues present in the images and suggest what improvement is most needed at the moment. Here is some helpful information to help you give advice.
Overall task description: The most important task is to make this humanoid move forward as fast as possible, and maintain a human-like gait and posture.
Image introduction: The image shows a sequence of snapshots of the Humanoid in motion. The view is from the side, and the Humanoid is moving towards the side of the image, which means the humanoid faces to the right.
Based on the above information, think about how to identify whether the humanoid should adjust its posture according to what the images demonstrate, and reason about the questions below:
1. How to determine the incorrect pose (forward, backward) of the humanoid body based on the pictures
2. How to determine whether the hands and feet are coordinated according to the picture
3.
What should be intervened and what shouldn't (both need to be explicitly mentioned)?
Hint: Before giving any other advice, you should ensure sinusoidal motion for walking in both legs. Answer the 3 questions, then analyze and propose the ONE MOST needed improvement at this moment.

Figure 8. Prompts for the advisor and coder LLMs.

C.1. Hyperparameters

Querying the Advisor LLM at every time step is expensive in practice. To address this, we use a window mechanism in which each intervention lasts for a fixed number of steps. This guided window defines how long the agent follows the LLM guidance before the Advisor is consulted again to evaluate performance. This approach reduces API latency and ensures the agent collects experience under a consistent policy without violating our theoretical modeling.

Several hyperparameters govern the coordination between the LLM and the agent. The guided window allows for periodic performance checks, while the end guidance timestep sets the final point at which all external help is disabled. Standard learning parameters include the batch size for buffer sampling and the discount factor gamma (γ), which prioritizes long-term rewards. Training stability is maintained by the soft update coefficient tau (τ) and the entropy coefficient alpha (α), which balances exploration and exploitation. Finally, the learning starts parameter defines the initial phase of random exploration used to fill the replay buffer before gradient updates begin. The intrinsic reward coefficient is set to 10⁻⁴ for most experiments; for CliffWalking and Taxi, it is increased to 1.

C.2. LLM Configuration

The proposed framework incorporates two specialized LLMs to address distinct functional requirements. We use the qwen3-vl-plus model as the Advisor LLM because of its advanced visual perception and spatial reasoning capabilities.
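The guided-window mechanism described above amounts to querying the Advisor at most once per window and cutting off all guidance after the end step. A minimal sketch, with function and parameter names of our own choosing (the defaults echo MountainCar's settings in Table 1):

```python
def should_query_advisor(step, last_query_step, guided_window=1000, end_step=3000):
    """Window mechanism sketch: consult the Advisor LLM only once per
    guided window, and disable all external guidance after end_step."""
    if step >= end_step:
        return False  # end guidance timestep reached: no more queries
    return step - last_query_step >= guided_window
```

Between queries the agent follows a fixed policy for the whole window, which is what keeps the collected experience consistent with the theoretical modeling.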
For the Coder LLM, we employ the qwen3-max-preview model to leverage its superior performance in automated code synthesis and algorithmic generation. Both models are integrated into our experimental pipeline through remote API calls.

¹ https://github.com/DLR-RM/stable-baselines3

C.3. Prompt Details

Fig. 8 shows the prompts for the advisor and coder LLMs in the Humanoid task. The prompt for the advisor LLM is designed to evaluate whether intervention is necessary based on recent trajectory performance. The prompt for the coder LLM is designed to generate rule-based policies with sufficient context information when intervention is triggered.

D. Pseudo-Code of GuidedSAC

Algorithm 1 Guided Soft Actor-Critic
1: Initialize ψ, ψ̄, θ, ϕ, intervene = False
2: for each iteration do
3:   if intervene then
4:     for each environment step do
5:       ã_t ∼ π̃_ϕ(a_t | s_t)
6:       s_{t+1} ∼ p(s_{t+1} | s_t, ã_t)
7:       D̃ ← D̃ ∪ {(s_t, ã_t, r(s_t, ã_t), s_{t+1})}
8:     end for
9:   else
10:    for each environment step do
11:      a_t ∼ π_ϕ(a_t | s_t)
12:      s_{t+1} ∼ p(s_{t+1} | s_t, a_t)
13:      D̃ ← D̃ ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
14:    end for
15:  end if
16:  intervene ← I(s ≤ T)
17:  for each gradient step do
18:    ψ ← ψ − λ_V ∇_ψ J_V(ψ)
19:    θ ← θ − λ_Q ∇_θ J_Q(θ)
20:    ϕ ← ϕ − λ_π ∇_ϕ J_π(ϕ)
21:    ψ̄ ← τψ + (1 − τ)ψ̄
22:  end for
23: end for

E. Limitation & Future Work

The performance gains from an LLM-based supervisor fundamentally rely on the LLM's ability to propose effective solutions that can be distilled into a rule-based policy.
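Algorithm 1's control flow can be sketched as a plain Python loop. This is only a structural skeleton: the environment, policies, and `update` step below are stand-ins we define for illustration, not the paper's SAC implementation (the real gradient steps of lines 17–22 are abstracted into `update`, and `gate(t)` plays the role of the `intervene` flag).

```python
import random

class ToyEnv:
    """Stand-in environment: integer state, reward is -|state|."""
    def reset(self):
        self.s = 5
        return self.s
    def step(self, a):
        self.s += a
        return self.s, -abs(self.s)

def guided_sac_loop(env, pi_phi, pi_interv, gate, n_steps, update):
    """Skeleton mirroring Algorithm 1's data-collection/update structure."""
    buffer, s = [], env.reset()
    for t in range(n_steps):
        # Lines 3-15: act with the intervention policy inside the window,
        # otherwise with the learned policy, and store the transition in D~.
        a = (pi_interv if gate(t) else pi_phi)(s)
        s_next, r = env.step(a)
        buffer.append((s, a, r, s_next))
        update(buffer)  # lines 17-22: critic/actor/target updates (stubbed)
        s = s_next
    return buffer

random.seed(0)
buf = guided_sac_loop(
    ToyEnv(),
    pi_phi=lambda s: random.choice([-1, 1]),
    pi_interv=lambda s: -1 if s > 0 else 1,  # rule: drive the state toward 0
    gate=lambda t: t < 10,                   # intervene for the first 10 steps
    n_steps=30,
    update=lambda b: None,                   # stub: no actual gradient step
)
```

The key structural point survives the simplification: whether a transition came from π̃_ϕ or π_ϕ, it lands in the same buffer, and gradient steps run every iteration regardless of the gate.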
Such a rule-based policy represents a trade-off between cost and effectiveness: it avoids querying the LLM at every environment step, but its representational capacity is limited, and it can struggle to process rich observations (e.g., raw images) or to coordinate high-dimensional behaviors. For example, in vision-centric environments like Minecraft, the lack of sophisticated visual processing within the rule-based controller makes it difficult to extract effective policies from image-based states. Furthermore, in high-dimensional continuous control tasks such as the Unitree G1² (29 dimensions), even if the LLM can understand the environment and provide reasonable advice for a subset of joints, the resulting rule-based intervention may have only a marginal impact on overall performance, making exploration of high-value trajectories less likely. In the future, as LLM inference costs and latency continue to decline, it may become practical to deploy powerful LLMs for direct real-time interventions, thereby circumventing these limitations. Importantly, the theoretical framework developed in this work remains fully applicable to such settings.

² https://github.com/google-deepmind/mujoco_menagerie/tree/main/unitree_g1
