Interaction-Grounded Learning for Contextual Markov Decision Processes with Personalized Feedback
In this paper, we study Interaction-Grounded Learning (IGL) [Xie et al., 2021], a paradigm designed for realistic scenarios where the learner receives indirect feedback generated by an unknown mechanism, rather than explicit numerical rewards. While prior work on IGL provides efficient algorithms with provable guarantees, those results are confined to single-step settings, restricting their applicability to modern sequential decision-making systems such as multi-turn Large Language Model (LLM) deployments. To bridge this gap, we propose a computationally efficient algorithm that achieves a sublinear regret guarantee for contextual episodic Markov Decision Processes (MDPs) with personalized feedback. Technically, we extend the reward-estimator construction of Zhang et al. [2024a] from the single-step to the multi-step setting, addressing the unique challenges of decoding latent rewards under MDPs. Building on this estimator, we design an Inverse-Gap-Weighting (IGW) algorithm for policy optimization. Finally, we demonstrate the effectiveness of our method in learning personalized objectives from multi-turn interactions through experiments on both a synthetic episodic MDP and a real-world user booking dataset.
💡 Research Summary
Interaction‑Grounded Learning (IGL) was originally introduced to handle scenarios where a learner receives only indirect, noisy feedback rather than explicit numerical rewards. Prior work on IGL, however, has been limited to single‑step contextual bandit settings, making it unsuitable for modern sequential decision‑making systems such as multi‑turn large language model (LLM) deployments, conversational agents, or robotic controllers that operate over multiple steps before receiving any feedback. This paper bridges that gap by extending IGL to contextual episodic Markov Decision Processes (MDPs) with personalized feedback, and by providing the first provably efficient algorithm with sublinear regret in this setting.
The authors decompose the problem into two main components: (1) constructing a reward estimator from indirect feedback, and (2) using that estimator to learn a near-optimal policy. The reward-estimator construction proceeds in three stages. First, Reachable State Identification learns, for each possible terminal state $s \in S_H$, a "homing" policy $b_{\pi_s}$ that maximizes the probability of reaching $s$. By applying the EULER algorithm (a PAC-RL method) to an auxiliary MDP with a dummy reward of 1 for reaching $s$, the algorithm guarantees, after $O(1/\epsilon^2)$ episodes, that $P_{b_{\pi_s}}(s) \ge p_\star^s - \epsilon$ with high probability. Second, Inverse Kinematic Learning estimates the posterior distribution of actions conditioned on reaching a reachable state, i.e., $\Pr(a \mid x, s)$, under a uniform policy. This step leverages the conditional independence assumption that the observed feedback $y$ depends only on the context $x$, the final state $s_H$, and the latent binary reward $r$, and is independent of the entire trajectory prefix and the final action given those variables. Third, the authors build a Lipschitz-continuous reward estimator $\hat f$ that combines the inverse-kinematic model with a smoothness prior. They distinguish between heterogeneous states (where the expected reward varies sufficiently across actions) and homogeneous states (where all actions yield the same expected reward). The heterogeneous condition ensures enough signal for identification, while the homogeneous case is handled by assuming a known constant reward value $c$, which is realistic for degenerate states (e.g., "I don't know" responses).
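The inverse-kinematic stage can be illustrated with a minimal sketch: under a uniform data-collection policy, the posterior of the final action given the terminal state is estimated from empirical counts. The function name and data format below are illustrative assumptions, not the paper's actual implementation, and the context $x$ is suppressed for brevity.

```python
from collections import defaultdict

def estimate_inverse_kinematics(episodes):
    """Empirically estimate Pr(a_H | s_H), the posterior of the final
    action given the terminal state, from episodes collected under a
    uniform policy (a sketch of stage two of the reward-decoder pipeline).

    `episodes` is a list of (final_action, terminal_state) pairs; this
    data format is a simplifying assumption for illustration.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for action, terminal_state in episodes:
        counts[terminal_state][action] += 1

    posterior = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        # Normalize counts into a conditional distribution over actions.
        posterior[state] = {a: c / total for a, c in action_counts.items()}
    return posterior
```

With enough uniform-policy episodes, these empirical frequencies concentrate around the true posteriors, which is what the Lipschitz reward estimator in the third stage consumes.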
With the reward estimator in hand, the policy-learning phase introduces the Inverse-Gap-Weighting (IGW) algorithm. For each state, IGW computes the estimated value gap between each action and the empirically best one, and assigns each suboptimal action a selection probability inversely proportional to its gap, so exploration concentrates on actions that are plausibly near-optimal while clearly poor actions are rarely tried. This weighting scheme mitigates the high variance inherent in indirect feedback and focuses data collection where it matters most for reducing regret. The authors prove that the estimation and policy errors decay as $O(T^{-1/4})$ and $O(T^{-1/2})$, respectively, leading to an overall regret bound of $\tilde O(T^{3/4})$ over $T$ episodes, which is sublinear, so the average per-episode regret vanishes as $T$ grows.
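A minimal sketch of the standard inverse-gap-weighting action distribution conveys the idea; the paper's exact parameterization over the episodic MDP may differ, and the function below is a generic single-state illustration.

```python
def igw_action_probs(est_rewards, gamma):
    """Inverse-Gap-Weighting action distribution (standard IGW scheme;
    a schematic sketch, not necessarily the paper's exact rule).

    Given estimated rewards for each of K actions and an exploration
    parameter gamma > 0, each suboptimal action a receives probability
        p(a) = 1 / (K + gamma * (max_reward - est_rewards[a])),
    and the empirically best action receives the remaining mass.
    Larger gaps thus mean less exploration of that action.
    """
    K = len(est_rewards)
    best = max(range(K), key=lambda a: est_rewards[a])
    probs = [0.0] * K
    for a in range(K):
        if a != best:
            gap = est_rewards[best] - est_rewards[a]
            probs[a] = 1.0 / (K + gamma * gap)
    # Remaining mass goes to the greedy action; since each other action
    # gets at most 1/K, this remainder is always at least 1/K.
    probs[best] = 1.0 - sum(probs)
    return probs
```

Increasing `gamma` over time shifts the distribution from exploration toward exploitation, which is how IGW-style schemes trade off the two as the reward estimates sharpen.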
Empirically, the method is evaluated on two fronts. A synthetic episodic MDP benchmark allows systematic variation of state-space size, action cardinality, and horizon length $H$. In these controlled experiments, IGW consistently outperforms baselines such as ε-greedy, Thompson Sampling, and standard UCB-type algorithms, achieving faster convergence to the optimal policy and lower cumulative regret. The second evaluation uses a real-world user booking dataset in which users provide binary thumbs-up/down feedback after a multi-turn interaction with a recommendation system. By treating this feedback as an indirect, personalized signal, the proposed algorithm learns a user-specific reward function and improves the final booking success rate by over 12% compared to a system trained with naive supervised learning on the same data.
The paper’s contributions are threefold: (1) It establishes the first theoretical foundation for IGL in multi‑step contextual MDPs, delivering a sublinear regret guarantee. (2) It introduces a novel three‑stage reward‑decoder pipeline that reliably extracts latent rewards from indirect, personalized feedback even in the presence of homogeneous states. (3) It proposes the IGW policy optimizer, which efficiently balances exploration and exploitation under indirect supervision. Together, these advances make IGL applicable to modern AI systems that interact with humans over multiple turns—such as conversational LLMs, adaptive tutoring platforms, and brain‑computer interfaces—where explicit reward signals are unavailable but personalized implicit feedback is abundant.