Sat-EnQ: Satisficing Ensembles of Weak Q-Learners for Reliable and Compute-Efficient Reinforcement Learning
📝 Original Info
- Title: Sat-EnQ: Satisficing Ensembles of Weak Q-Learners for Reliable and Compute-Efficient Reinforcement Learning
- ArXiv ID: 2512.22910
- Date: 2025-12-28
- Authors: Ünver Çiftçi
📝 Abstract
Deep Q-learning algorithms remain notoriously unstable, especially during early training when the maximization operator amplifies estimation errors. Inspired by bounded rationality theory and developmental learning, we introduce Sat-EnQ, a two-phase framework that first learns to be "good enough" before optimizing aggressively. In Phase 1, we train an ensemble of lightweight Q-networks under a satisficing objective that limits early value growth using a dynamic baseline, producing diverse, low-variance estimates while avoiding catastrophic overestimation. In Phase 2, the ensemble is distilled into a larger network and fine-tuned with standard Double DQN. We prove theoretically that satisficing induces bounded updates and cannot increase target variance, with a corollary quantifying conditions for substantial reduction. Empirically, Sat-EnQ achieves 3.8× variance reduction, eliminates catastrophic failures (0% vs 50% for DQN), maintains 79% performance under environmental noise, and requires 2.5× less compute than bootstrapped ensembles. Our results highlight a principled path toward robust reinforcement learning by embracing satisficing before optimization.
📄 Full Content
We develop an alternative perspective grounded in two fundamental insights from outside traditional reinforcement learning. First, Herbert Simon’s theory of bounded rationality [12] argues that intelligent agents, whether human or artificial, often aim for satisfactory rather than optimal solutions when computational resources are limited. Second, developmental psychology suggests that humans learn complex skills through a process of coarse-to-fine refinement, first acquiring basic competencies before optimizing for peak performance. These observations motivate our central hypothesis: reinforcement learning agents may benefit from learning stable, conservative value estimates before attempting full optimization.
We introduce Sat-EnQ (Satisficing Ensemble Q-learning), a two-phase framework that operationalizes this idea through three key innovations:
- A satisficing Bellman operator that clips TD targets to a dynamic baseline, limiting early value growth while maintaining eventual optimality.
- An ensemble of lightweight “weak learners” trained with this operator, providing diverse, low-variance value estimates at minimal computational cost.
- A distillation and polishing procedure that transfers the ensemble’s stable knowledge into a single strong network, which is then fine-tuned with standard methods.
Our theoretical analysis proves that satisficing induces bounded value updates and cannot increase target variance, with explicit conditions for when substantial variance reduction occurs. Empirically, we demonstrate that Sat-EnQ eliminates catastrophic training failures, reduces variance by factors of 2-4×, maintains robustness to environmental noise, and achieves these benefits with significantly lower computational requirements than existing ensemble methods.
The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 introduces the Sat-EnQ framework, Section 4 presents theoretical analysis, Section 5 describes experimental results, and Section 6 discusses implications and future directions.
2 Related Work
Ensemble methods have been widely explored to improve RL stability and exploration. Bootstrapped DQN [7] trains multiple Q-networks on different data subsets to drive exploration through uncertainty. Averaged-DQN [1] maintains multiple target networks and averages their predictions to reduce variance. Maxmin Q-learning [4] uses an ensemble to provide pessimistic value estimates, mitigating overestimation bias. These approaches typically train full-sized networks in parallel, incurring substantial computational cost. In contrast, Sat-EnQ employs lightweight weak learners and adds explicit value constraints through satisficing.
Conservative Q-learning (CQL) [3] addresses overestimation by penalizing large Q-values, effectively learning a lower bound on the true value function. While CQL provides strong theoretical guarantees, it can be overly conservative and may limit asymptotic performance. Sat-EnQ applies conservatism selectively during early learning, then transitions to standard optimization, combining stability with eventual optimality.
Knowledge distillation has been used to transfer policies between networks [9] and to compress ensembles into single networks [4]. Sat-EnQ employs distillation as a stability mechanism, transferring knowledge from an ensemble of conservative learners into a single strong policy.
The concept of satisficing, introduced by Simon [12], has influenced decision theory, economics, and cognitive science [10]. In RL, satisficing has been explored in bandit settings [8] and for exploration [5], but has not been systematically applied to value function learning with deep networks. Our work bridges this gap by embedding satisficing principles directly into the Bellman update.
3 The Sat-EnQ Framework
We consider a Markov Decision Process (S, A, P, R, γ) with state space S, action space A, transition dynamics P(s′ | s, a), reward function R(s, a), and discount factor γ ∈ [0, 1). The standard Q-learning update uses the Bellman operator:

$$(\mathcal{T}Q)(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\max_{a'} Q(s',a')\Big].$$
This operator amplifies estimation errors through the max operation, which is particularly problematic when Q estimates are inaccurate during early learning.
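To make this amplification concrete, the following minimal sketch (our illustration, with arbitrary numbers, not code from the paper) shows that taking a max over noisy but unbiased Q estimates produces a positively biased target:

```python
import numpy as np

rng = np.random.default_rng(0)

true_q = np.array([1.0, 1.0, 1.0, 1.0])   # true action values (all equal)
n_trials, noise_std = 10_000, 0.5

# Unbiased but noisy estimates of Q(s', a') for each action.
noisy_q = true_q + rng.normal(0.0, noise_std, size=(n_trials, true_q.size))

# The max over noisy estimates is biased upward, even though each estimate is unbiased.
print("true max value:      ", true_q.max())                 # 1.0
print("mean of noisy maxima:", noisy_q.max(axis=1).mean())   # ~1.5 with this noise level
```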
We introduce a satisficing Bellman operator that incorporates a dynamic baseline B : S → R representing a “good enough” value:

$$(\mathcal{T}_{\mathrm{sat}}Q)(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\min\Big\{\max_{a'} Q(s',a'),\; B(s') + m\Big\}\Big], \tag{1}$$

where m ≥ 0 is a margin parameter. The operator clips the next-state value estimate at B(s′) + m, preventing runaway growth while allowing improvement up to the satisficing threshold.
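As a minimal illustration of the clipped target, the following sketch computes the satisficing TD target for a batch of transitions; the function and argument names (e.g., `baseline_fn`) are our own assumptions rather than code from the paper:

```python
import torch

def satisficing_targets(rewards, next_states, dones, q_target_net, baseline_fn,
                        margin=0.5, gamma=0.99):
    """Compute r + gamma * min{ max_a' Q_target(s', a'), B(s') + m } for a batch.

    baseline_fn(next_states) returns the baseline B(s') for each next state.
    """
    with torch.no_grad():
        next_q_max = q_target_net(next_states).max(dim=1).values        # max_a' Q(s', a')
        capped = torch.minimum(next_q_max, baseline_fn(next_states) + margin)
        return rewards + gamma * (1.0 - dones) * capped
```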
The baseline function B(s) evolves during training to track achievable returns. We consider two implementations:
- Episodic moving average: For states visited in an episode with return G, update B(s) ← αB(s) + (1 − α)G, where α ∈ [0, 1) controls update speed.
- Learned baseline: Train a separate network B_φ(s) to predict Monte Carlo returns via regression:

$$\mathcal{L}_{B}(\phi) = \mathbb{E}_{(s,G)}\big[(B_\phi(s) - G)^2\big].$$
Both approaches provide a conservative signal that adapts as the agent improves, implementing Simon’s notion of an “aspiration level” that rises with achievement.
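A minimal sketch of the episodic moving-average variant for discrete (hashable) states might look as follows; the class name and interface are our assumptions:

```python
from collections import defaultdict

class MovingAverageBaseline:
    """Episodic moving-average baseline: B(s) <- alpha * B(s) + (1 - alpha) * G.

    Suitable for discrete (hashable) states; a learned network B_phi(s) trained by
    regression on Monte Carlo returns plays the same role for continuous states.
    """

    def __init__(self, alpha=0.99, initial_value=0.0):
        self.alpha = alpha
        self.values = defaultdict(lambda: initial_value)

    def update_from_episode(self, states, episode_return):
        # Every state visited in the episode is nudged toward the observed return G.
        for s in states:
            self.values[s] = self.alpha * self.values[s] + (1.0 - self.alpha) * episode_return

    def __call__(self, state):
        return self.values[state]
```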
Phase 1 trains an ensemble of K weak learners $\{Q_{\theta_i}\}_{i=1}^{K}$ with small architectures (e.g., 2-layer MLPs with 32-64 hidden units). Each learner maintains a private replay buffer D_i and optimizes a combined satisficing loss:

$$\mathcal{L}_{\mathrm{sat}}(\theta_i) = \underbrace{\mathbb{E}_{(s,a,r,s') \sim \mathcal{D}_i}\Big[\big(Q_{\theta_i}(s,a) - y_{\mathrm{sat}}\big)^2\Big]}_{\text{TD error with satisficing target}} \;+\; \lambda\, \mathbb{E}_{(s,a) \sim \mathcal{D}_i}\Big[\max\big(0,\, Q_{\theta_i}(s,a) - (B(s) + m)\big)\Big], \tag{2}$$

with satisficing target $y_{\mathrm{sat}} = r + \gamma \min\{\max_{a'} Q_{\theta_i^-}(s',a'),\, B(s') + m\}$, where $Q_{\theta_i^-}$ denotes a target network (updated periodically), and λ ≥ 0 controls regularization strength. The hinge term encourages Q values not to exceed the satisficing threshold, providing additional stability.
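A sketch of the combined loss is given below, assuming targets are produced as in the earlier `satisficing_targets` snippet and that the regression matches only the replayed action; both choices are our reading of Eq. 2, not the paper's code:

```python
import torch
import torch.nn.functional as F

def satisficing_loss(q_net, baseline_fn, states, actions, targets, margin=0.5, lam=0.1):
    """TD loss toward the clipped (satisficing) targets plus a hinge penalty
    on Q-values that exceed B(s) + m (cf. Eq. 2)."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_sa, targets)

    # Hinge term: penalize Q-values above the satisficing threshold B(s) + m.
    hinge = torch.clamp(q_sa - (baseline_fn(states) + margin), min=0.0).mean()
    return td_loss + lam * hinge
```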
After Phase 1, we compute the ensemble average:

$$\bar{Q}(s,a) = \frac{1}{K}\sum_{i=1}^{K} Q_{\theta_i}(s,a).$$
We then initialize a larger “student” network $Q_{\theta_S}$ (e.g., 3-layer MLP with 64-128 hidden units) and distill the ensemble knowledge via regression:

$$\mathcal{L}_{\mathrm{distill}}(\theta_S) = \mathbb{E}_{(s,a) \sim \mathcal{D}_{\mathrm{pool}}}\Big[\big(Q_{\theta_S}(s,a) - \bar{Q}(s,a)\big)^2\Big],$$

where $\mathcal{D}_{\mathrm{pool}} = \bigcup_i \mathcal{D}_i$ combines all replay buffers. Finally, we fine-tune the student using standard Double DQN [13] for additional performance improvement. The complete algorithm is summarized in Algorithm 1.
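A minimal distillation step might look as follows; matching the full action-value vector of the ensemble average on pooled states is our assumption about the exact regression target:

```python
import torch
import torch.nn.functional as F

def distill_step(student, weak_learners, states, optimizer):
    """One regression step fitting the student to the ensemble-average Q-values
    on a minibatch of states drawn from the pooled replay buffer D_pool."""
    with torch.no_grad():
        # Ensemble average over the K weak learners.
        q_bar = torch.stack([q(states) for q in weak_learners], dim=0).mean(dim=0)

    loss = F.mse_loss(student(states), q_bar)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```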
4 Theoretical Analysis
Proposition 1 (Boundedness of Satisficing Operator). Let B : S → R be bounded with ∥B∥∞ ≤ B_max. Then for any Q : S × A → R and all (s, a):

$$(\mathcal{T}_{\mathrm{sat}}Q)(s,a) \le R(s,a) + \gamma\,(B_{\max} + m),$$

and $\mathcal{T}_{\mathrm{sat}}$ remains a γ-contraction in the sup norm, $\|\mathcal{T}_{\mathrm{sat}}Q_1 - \mathcal{T}_{\mathrm{sat}}Q_2\|_\infty \le \gamma\,\|Q_1 - Q_2\|_\infty$.
Proof. The boundedness follows directly from the clipping in Equation (1). For contraction, define f(x) = min{x, B + m}, which is 1-Lipschitz. Then for any Q_1, Q_2:

$$\big|(\mathcal{T}_{\mathrm{sat}}Q_1)(s,a) - (\mathcal{T}_{\mathrm{sat}}Q_2)(s,a)\big| \le \gamma\, \mathbb{E}_{s'}\Big[\big|f\big(\max_{a'} Q_1(s',a')\big) - f\big(\max_{a'} Q_2(s',a')\big)\big|\Big] \le \gamma\, \mathbb{E}_{s'}\Big[\big|\max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a')\big|\Big] \le \gamma\, \|Q_1 - Q_2\|_\infty,$$

where the last inequality uses that max is 1-Lipschitz in the ℓ∞ norm.

Proposition 2 (Clipping Does Not Increase Variance). Let X be a real-valued random variable with finite variance, let c ∈ R, and let Y = min{X, c}. Then Var(Y) ≤ Var(X). Equality holds if and only if Pr(X ≤ c) = 1.
Proof. Write Y = f(X) with f(x) = min{x, c}, which is 1-Lipschitz. Since the mean minimizes mean squared error:

$$\operatorname{Var}(Y) = \mathbb{E}\big[(Y - \mathbb{E}[Y])^2\big] \le \mathbb{E}\big[\big(f(X) - f(\mathbb{E}[X])\big)^2\big].$$

By the Lipschitz property:

$$\mathbb{E}\big[\big(f(X) - f(\mathbb{E}[X])\big)^2\big] \le \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \operatorname{Var}(X).$$
For the equality condition: if Pr(X ≤ c) = 1, then Y = X almost surely, so Var(Y ) = Var(X). Conversely, if Pr(X > c) > 0, the inequality is strict because f is strictly contracting on {x > c}.
Theorem 3 (Quantified Variance Reduction). In the setting of Proposition 2, let p = Pr(X > c), μ≤ = E[X | X ≤ c], μ> = E[X | X > c], σ²≤ = Var(X | X ≤ c), and σ²> = Var(X | X > c). Then

$$\operatorname{Var}(Y) = (1-p)\,\sigma_\le^2 + p(1-p)\,(c - \mu_\le)^2, \tag{3}$$

$$\operatorname{Var}(X) = (1-p)\,\sigma_\le^2 + p\,\sigma_>^2 + p(1-p)\,(\mu_> - \mu_\le)^2. \tag{4}$$

The variance reduction is:

$$\operatorname{Var}(X) - \operatorname{Var}(Y) = p\,\sigma_>^2 + p(1-p)\Big[(\mu_> - \mu_\le)^2 - (c - \mu_\le)^2\Big].$$
Proof. Equation (3) follows from the law of total variance applied to Y, noting that Y = X on {X ≤ c} and Y = c (constant) on {X > c}. Equation (4) is the standard variance decomposition for X. Subtracting gives the reduction formula.
Theorem 3 provides quantitative insight: variance reduction scales with (1) the probability mass p above the threshold, (2) the conditional variance σ²> above the threshold, and (3) the gap between conditional means relative to the clipping value. In practice, early training often exhibits large p and σ²>, as value estimates are noisy and optimistic, leading to substantial variance reduction.
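The decomposition can be checked numerically; the following short script (our addition) compares the empirical variances of X and Y = min{X, c} against Equations (3)-(4) and the reduction formula for an arbitrary Gaussian example:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)   # noisy, optimistic value estimates
c = 1.5                                              # clipping threshold, playing the role of B(s') + m
y = np.minimum(x, c)

p = (x > c).mean()
mu_le, mu_gt = x[x <= c].mean(), x[x > c].mean()
var_le, var_gt = x[x <= c].var(), x[x > c].var()

# Empirical variances versus the decomposition in Equations (3)-(4) and the reduction formula.
print("Var(X), Var(Y)   :", x.var(), y.var())
print("Eq. (3) for Var(Y):", (1 - p) * var_le + p * (1 - p) * (c - mu_le) ** 2)
print("Predicted reduction:", p * var_gt + p * (1 - p) * ((mu_gt - mu_le) ** 2 - (c - mu_le) ** 2))
print("Direct reduction   :", x.var() - y.var())
```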
Proposition 4 (Sources of Diversity). Weak learners in Sat-EnQ maintain diversity through:
- Architectural constraints: Small capacity induces different approximation errors. While a formal proof of non-collapse requires strong assumptions about optimization and data distributions, empirical measurements (see Section 5) confirm that ensemble members maintain distinct value functions throughout training; one such disagreement measure is sketched below.
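One simple way to quantify this diversity is the standard deviation of greedy values across ensemble members; the sketch below is our illustration, not the paper's measurement code:

```python
import torch

def ensemble_disagreement(weak_learners, states):
    """Std-dev across ensemble members of the greedy value max_a Q_i(s, a),
    averaged over a batch of states; zero would indicate a collapsed ensemble."""
    with torch.no_grad():
        greedy_values = torch.stack(
            [q(states).max(dim=1).values for q in weak_learners], dim=0
        )  # shape: (K, batch)
    return greedy_values.std(dim=0).mean().item()
```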
5 Experimental Results
We evaluate Sat-EnQ on three environments representing different challenges:
• Stochastic GridWorld: 8 × 8 grid with slippery transitions (slip probabilities 10%, 20%, 30%). Goal: reach target cell. Tests tabular RL and robustness to environmental stochasticity.
• CartPole-v1: Classic control task with 4D state space. We test both standard environment and a noisy variant with 10% action noise. Tests function approximation stability.
• Acrobot-v1: Sparse-reward swing-up task with 6D state space. Tests limitations in challenging exploration settings.
Baselines: DQN [6], Double DQN [13], Bootstrapped DQN [7] (K = 10), and Maxmin Q-learning [4]. Metrics: mean return (across seeds), standard deviation (measure of variance), catastrophic failure rate (percentage of runs with return below 50% of optimal), training time, and parameter counts.
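For concreteness, these metrics can be aggregated from per-seed final returns roughly as follows; the 50%-of-optimal failure threshold comes from the definition above, while the helper itself is our own sketch:

```python
import numpy as np

def summarize_runs(final_returns, optimal_return):
    """Aggregate per-seed final returns into mean, spread, and failure rate."""
    final_returns = np.asarray(final_returns, dtype=float)
    return {
        "mean_return": final_returns.mean(),
        "std_return": final_returns.std(),
        # Catastrophic failure: a run finishing below 50% of the optimal return.
        "failure_rate": (final_returns < 0.5 * optimal_return).mean(),
    }
```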
Table 1 shows results on Stochastic GridWorld. Sat-EnQ achieves significantly higher returns (0.61 vs -0.73 for DQN) with 85% success rate versus 5%. The variance is reduced by 1.7× (0.047 vs 0.079), and catastrophic failures drop from 80% to 5%. Table 2 presents neural network results. Sat-EnQ achieves the highest mean return (354) with the lowest variance (14,959), representing a 3.8× reduction compared to DQN. Crucially, Sat-EnQ completely eliminates catastrophic failures (0% vs 50% for DQN) while requiring only 27 seconds of training versus 105 seconds for Bootstrapped DQN.
Our experiments show that our method reduces the variance throughout training. Sat-EnQ maintains consistently lower standard deviation across seeds, with the gap widening during early training when standard methods are most unstable. This aligns with our theoretical analysis: satisficing is most beneficial when value estimates are noisy and prone to overestimation.
Statistical tests confirm the significance: Levene’s test for equality of variances gives p < 0.0002 for Sat-EnQ vs DQN on CartPole, rejecting the null hypothesis of equal variances.
Table 3 shows performance degradation when evaluated with action noise. Sat-EnQ maintains 79% of its clean performance, significantly higher than DQN’s 63%. This suggests that satisficing produces more robust value estimates less sensitive to perturbation.
Table 5 reveals a limitation: on Acrobot’s sparse-reward task, Sat-EnQ fails to learn, consistently achieving the minimum return of -500. This occurs because satisficing’s conservative updates prevent the exploratory behavior needed to discover rare rewards. This suggests an important boundary condition: satisficing excels in dense-reward settings but may require adaptation (e.g., adaptive margins or intrinsic motivation) for sparse rewards.
We conduct several ablations to understand Sat-EnQ’s components:
• No satisficing: Using standard Q-learning for weak learners increases failure rate from 0% to 50%.
• Single learner: A single satisficing network achieves similar mean return but with 4× higher variance.
• No polishing: Removing the fine-tuning phase reduces final performance by 15%.
• Margin sensitivity: m = 0.5 provides optimal trade-off; too small hinders learning, too large reduces variance benefits.
• Ensemble size: K = 4 provides good balance; diminishing returns beyond K = 6.
6 Discussion and Future Directions
We introduced Sat-EnQ, a framework that applies bounded rationality principles to stabilize deep Q-learning. By first learning conservative “good enough” value estimates through satisficing weak learners, then distilling and polishing them into a strong policy, Sat-EnQ achieves unprecedented stability without sacrificing final performance or efficiency. Theoretical analysis proved that satisficing cannot increase target variance and quantified conditions for substantial reduction. Empirically, Sat-EnQ demonstrated 3.8× variance reduction, elimination of catastrophic failures, superior robustness to noise, and 2.5× better compute efficiency than bootstrapped ensembles.
Several promising directions for future work emerge:
• Adaptive margins: Dynamic adjustment of m based on learning progress or uncertainty estimates.
• Sparse reward adaptation: Combining satisficing with intrinsic motivation or curiosity for exploration.
• Extension to other algorithms: Applying satisficing principles to actor-critic methods, offline RL, and model-based approaches.
• Theoretical extensions: Finite-sample convergence rates, generalization bounds, and connections to regularization theory.
• Real-world applications: Testing in robotics, autonomous systems, and other safety-critical domains where stability is paramount.
Sat-EnQ represents a paradigm shift: rather than fighting Q-learning’s instabilities with increasingly complex heuristics, we embrace the natural strategy of first learning to be adequate before optimizing aggressively. This principle, drawn from human cognition and formalized through bounded rationality, offers a principled path toward truly robust reinforcement learning.
Algorithm 1 Sat-EnQ Framework
Require: Environment E, weak learner count K, margin m, baseline decay α, regularization λ
Ensure: Final policy π_{θ_S}(s) = argmax_a Q_{θ_S}(s, a)

Phase 1: Train satisficing weak learners
  Initialize K small networks {Q_{θ_i}}_{i=1}^K, target networks {Q_{θ_i^-}}, and replay buffers {D_i}
  Initialize baseline B(s) (zero or pre-trained)
  for episode = 1 to M_1 do
    for each learner i = 1 to K do
      Collect a trajectory using the ϵ-greedy policy from Q_{θ_i} and store it in D_i
      Update θ_i using ∇_{θ_i} L_sat(θ_i) (Eq. 2)
      Periodically update the target network θ_i^- ← θ_i
    end for
    Update the baseline B from the observed episode returns (Section 3)
  end for

Phase 2: Distill and polish
  Initialize the student network Q_{θ_S} and pool the replay buffers into D_pool
  for each distillation step do
    Update θ_S using ∇_{θ_S} L_distill(θ_S)
  end for
  for step = 1 to N_polish do
    Collect data using ϵ-greedy from Q_{θ_S}, store in D_S
    Update θ_S using the Double DQN loss on D_S
  end for
  return π_{θ_S}
Implementation Details: Weak learners: 2-layer MLP with 32 hidden units each. Student network: 3-layer MLP with 64 hidden units. K = 4, m = 0.5, α = 0.99, λ = 0.1, γ = 0.99. Training: 10,000 steps for GridWorld, 20,000 for CartPole/Acrobot. All results averaged over 10 random seeds unless specified.
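For reference, the stated architectures and hyperparameters translate into roughly the following modules and configuration; we read “2-layer” and “3-layer” as the number of hidden layers, and the activation choice and helper names are our assumptions:

```python
import torch.nn as nn

# Hyperparameters reported in the implementation details above.
CONFIG = dict(K=4, margin=0.5, alpha=0.99, lam=0.1, gamma=0.99)

def weak_learner(state_dim, n_actions, hidden=32):
    """Weak learner Q_theta_i: two hidden layers of 32 units."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )

def student_network(state_dim, n_actions, hidden=64):
    """Student Q_theta_S: three hidden layers of 64 units."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )
```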