Reading time: 24 minutes

📝 Original Info

  • Title:
  • ArXiv ID: 2512.20974
  • Date:
  • Authors: Unknown

📝 Abstract

Bayesian Reinforcement Learning (BRL) provides a framework for generalisation in Reinforcement Learning (RL) through its use of Bayesian task parameters in the transition and reward models. However, classical BRL methods assume known forms of transition and reward models, reducing their applicability to real-world problems. As a result, recent deep BRL methods have started to incorporate model learning, though the use of neural networks directly on the joint data and task parameters requires optimising the Evidence Lower Bound (ELBO). ELBOs are difficult to optimise and may result in indistinctive task parameters, hence compromised BRL policies. To this end, we introduce a novel deep BRL method, Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL), that enables efficient and accurate learning of transition and reward models, with a fully tractable marginal likelihood and Bayesian inference on task parameters and model noises. On the challenging MetaWorld ML10/45 benchmarks, GLiBRL improves the success rate of one of the state-of-the-art deep BRL methods, VariBAD, by up to 2.7×. Compared against representative or recent deep BRL / Meta-RL methods, such as MAML, RL², SDVT, TrMRL and ECET, GLiBRL also consistently demonstrates low-variance and decent performance.

📄 Full Content

Reinforcement Learning (RL) algorithms have great potential to enable robots to act intelligently without supervision from humans. A well-known issue with RL algorithms is that they generalise poorly to unseen tasks. Most standard RL algorithms, by design, do not consider possible variations in the transition and reward models, and hence fail to adapt to new tasks whose models might differ from those of the training environments.

Bayesian Reinforcement Learning (BRL) is an effective framework that can be used to improve the generalisation of RL. Instead of ignoring possible variations in transition and reward models, BRL methods explicitly take them into consideration by assuming parametric distributions over the models and performing Bayesian inference on the parameters (Ghavamzadeh et al., 2015). Different parameters indicate different transition and reward models, and hence implicitly encode various tasks. To solve BRL problems, many previous works use planners (Poupart et al., 2006; Guez et al., 2013) that search for Bayes-optimal policies. These methods are often limited in their scalability. Moreover, they require full information about the forms of transition and reward models, which restricts generalisation across different tasks.

Hence, recent deep BRL methods (Rakelly et al., 2019; Zintgraf et al., 2021) enable model learning by optimising the marginal likelihood of the data. However, most deep BRL methods do not support tractable Bayesian inference on the task parameters, because of the direct use of neural networks on the joint data and parameters. As a result, the exact marginal likelihood of the data is also not tractable and cannot be optimised directly. To this end, deep BRL methods adopt variational inference and optimise the evidence lower bound (ELBO) instead. However, optimising the ELBO is not an easy task, as it may face issues such as high-variance Monte Carlo estimates, amortisation gaps (Cremer et al., 2018) and posterior collapse (Bowman et al., 2016; Dai et al., 2020). Such issues can preclude BRL methods from obtaining meaningful and distinctive distributions of task parameters, which are crucial to smooth Bayesian learning.

Bayes-Adaptive MDPs (BAMDPs) (Duff, 2002; Ghavamzadeh et al., 2015) are a Bayesian framework for solving RL. Compared to standard MDPs, BAMDPs assume known functional forms of $T$ and/or $R$, parameterised by unknown parameters $\theta_T \in \Theta_T$ and/or $\theta_R \in \Theta_R$.

In BAMDPs, distributions (or beliefs) $b_t = p(\theta_{T,t}, \theta_{R,t}) \in B_T \times B_R$ are placed on the unknown parameters, and updated to posteriors $b_{t+1} = p(\theta_{T,t+1}, \theta_{R,t+1})$ with Bayesian inference.

To efficiently reuse existing MDP frameworks, the beliefs can be absorbed into the original state space to form hyper-states $S^+ = S \times B_T \times B_R$. Hence, BAMDPs can be defined as 5-tuple $(S^+, A, R^+, T^+, \gamma)$ MDPs, where

$$R^+(s^+_t, a_t, s^+_{t+1}, r_{t+1}) = p(r_{t+1} \mid s_t, b_t, a_t, s_{t+1}, b_{t+1}) = \mathbb{E}_{\theta_{R,t+1} \sim b_{t+1}}\, p(r_{t+1} \mid s_t, a_t, s_{t+1}, \theta_{R,t+1})$$

The hyper-transition function (Equation 1) consists of the $\theta_T$-parameterised expected regular MDP transition and a deterministic posterior update specified by the Dirac delta function $\delta\!\left(b_{t+1} = p(\theta_{T,t+1}, \theta_{R,t+1})\right)$. The hyper-reward function consists of the $\theta_R$-parameterised expected regular MDP reward function. Accordingly, the expected return to maximise becomes $\mathbb{E}_{\pi^+}\!\left[\sum_{t=0}^{H^+-1} \gamma^t r_{t+1}\right]$, where $H^+ > 0$ is the BAMDP horizon and $\pi^+ : S^+ \to A$ is the BAMDP policy. Traditionally, problems that require solving BAMDPs are named Bayesian Reinforcement Learning (BRL). Aside from their generalisability, BRL methods are also recognised for offering principled approaches to the exploration-exploitation problem in RL (Ghavamzadeh et al., 2015).
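To make the belief component of the hyper-state concrete, here is a toy, self-contained example (not taken from the paper; the Bernoulli reward model and candidate values are assumptions) of the Bayesian update $b_t \to b_{t+1}$ over an unknown reward parameter:

```python
# Toy illustration of the BAMDP belief update b_t -> b_{t+1}: the unknown reward
# parameter theta_R takes one of a few candidate values, and the belief over it
# is refined by Bayes' rule after every observed reward.
import numpy as np

thetas = np.array([0.2, 0.5, 0.8])   # candidate Bernoulli reward probabilities
belief = np.ones(3) / 3              # uniform prior b_0 over theta_R

def update_belief(belief, reward):
    lik = thetas if reward == 1 else 1.0 - thetas   # p(r | theta_R)
    post = belief * lik                              # Bayes' rule, unnormalised
    return post / post.sum()

for r in [1, 1, 0, 1]:               # rewards observed along a trajectory
    belief = update_belief(belief, r)
print(belief)                        # most mass ends up on theta_R = 0.8
```

A BAMDP policy conditions on both the state and this belief, which is how exploration and exploitation are traded off in a principled way.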

However, classical BRL methods (Poupart et al., 2006; Guez et al., 2013; Tziortziotis et al., 2013) assume that the forms of the transition $T^+$ and reward $R^+$ models are fully known, despite being parameterised by unknown parameters. These methods are not sufficiently flexible in scenarios where the forms of $T^+$ and $R^+$ are not known a priori. A rough guess of the forms, however, may lead to significant underfitting (e.g., assuming linear transitions while the ground truth is quadratic).

To generalise classical BRL methods, Hidden-Parameter MDPs (HiP-MDPs) (Doshi-Velez & Konidaris, 2016; Killian et al., 2017; Yao et al., 2018) have started to learn the forms of the models by performing Bayesian inference on the weights. Doshi-Velez & Konidaris (2016) proposed HiP-MDPs with Gaussian Processes (GPs) to learn the basis functions for approximating transition models. Afterwards, Killian et al. (2017) identified the poor scalability of GPs and applied Bayesian Neural Networks (BNNs) in HiP-MDPs for larger-scale problems. Yao et al. (2018) proposed to fix the weights of BNNs during evaluation for improved efficiency, at the cost of losing most of the test-time Bayesian features. These works focus on performing Bayesian inference on the weights of BNNs, which does not scale well with the size of the BNN, as empirically demonstrated by Yang et al. (2019). Reward functions in the HiP-MDP setting are also assumed to be known, which is generally infeasible in real-world applications.

On the other hand, recent deep BRL methods (Harrison et al., 2018a; Rakelly et al., 2019; Zintgraf et al., 2021) adopt scalable, regular deep neural networks, while Bayesian features are retained by performing (approximate) Bayesian inference on the task parameters $\theta_T, \theta_R$ directly. GLiBRL follows this line of work for its scalability and for the more general assumption of unknown reward functions. We briefly introduce how the forms of the transition and reward models are learnt in the deep BRL setting. The deep BRL agent is provided with MDPs with unknown transitions $T$ and/or rewards $R$, and simulators from which the agent can obtain samples of tuples, known as contexts (Rakelly et al., 2019; Zintgraf et al., 2021). The context at step $t$ is defined as $c_t = \{s_t, a_t, s'_t, r_{t+1}\}$. The objective of these methods (Rakelly et al., 2019; Perez et al., 2020; Zintgraf et al., 2021) is to maximise the marginal log-likelihood of the joint context¹

$$\log p_{\zeta, \phi_T, \phi_R}(C) = \log \int p_{\phi_T, \phi_R}(C \mid \theta_T, \theta_R)\, p_\zeta(\theta_T, \theta_R)\, \mathrm{d}\theta_T\, \mathrm{d}\theta_R \quad (3)$$

where $\zeta$ is a neural network that learns the parameters of the prior distribution $p_\zeta(\theta_T, \theta_R)$, and $\phi_T, \phi_R$ are neural networks that learn the forms of the transition and reward likelihoods $p_{\phi_T}$ and $p_{\phi_R}$.

For ease of learning, $p_{\phi_T}$ and $p_{\phi_R}$ are generally assumed to be Gaussian, with mean and diagonal covariance determined by the outputs of the neural networks $\phi_T, \phi_R$, and the prior $p_\zeta(\theta_T, \theta_R)$ is also assumed to be Gaussian. However, even with these simplifications, Equation 3 is still not tractable, as $\theta_T, \theta_R$ are not linear with respect to the contexts because of the use of neural networks. Fortunately, variational inference provides a lower bound on Equation 3, named the Evidence Lower Bound (ELBO), that can be used as an approximate objective (proof in Appendix A.1):

$$\log p_{\zeta, \phi_T, \phi_R}(C) \geq \mathbb{E}_{q(\theta_T, \theta_R)}\!\left[\log p_{\phi_T, \phi_R}(C \mid \theta_T, \theta_R)\right] - D_{\mathrm{KL}}\!\left(q(\theta_T, \theta_R)\,\|\,p_\zeta(\theta_T, \theta_R)\right)$$

where $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the KL-divergence and $q(\cdot)$ is an approximate Gaussian posterior over $\theta_T, \theta_R$. An optimised $\log p_{\zeta, \phi_T, \phi_R}(C)$ yields models of transitions and rewards, with which both model-free (Rakelly et al., 2019; Zintgraf et al., 2021) and model-based (Guez et al., 2013; Harrison et al., 2018a) methods can be applied to learn the BRL policy.
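For concreteness, the sketch below shows the kind of single-sample, reparameterised ELBO estimate such methods optimise. The diagonal-Gaussian assumption and the function signature are illustrative, not any specific method's implementation.

```python
# Minimal sketch of a single-sample reparameterised ELBO with diagonal Gaussians,
# the kind of objective baseline deep BRL methods optimise in place of Equation 3.
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) )."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def elbo(log_lik_fn, mu_q, logvar_q, mu_p, logvar_p, rng):
    """One Monte Carlo estimate of E_q[log p(C|theta)] - KL(q || prior)."""
    eps = rng.standard_normal(mu_q.shape)
    theta = mu_q + np.exp(0.5 * logvar_q) * eps   # reparameterised sample of theta
    return log_lik_fn(theta) - kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

The single-sample Monte Carlo term is exactly where the high-variance and posterior-collapse issues discussed next enter.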

ELBO-like objectives enable the learning of transition and reward models. However, ELBOs are challenging to optimise, owing to known issues such as high-variance Monte Carlo estimates, amortisation gaps (Cremer et al., 2018) and posterior collapse (Bowman et al., 2016; Dai et al., 2020). Poorly optimised ELBOs may result in learnt latent representations (e.g., $\theta_T, \theta_R$ in BRL) that are neither meaningful nor distinctive. Unlike other tasks, where meaningful latent representations matter less, a BRL policy chooses its next action in heavy dependence on the continually updated distributions of the latent representations. Indistinctive latent representations, and hence meaningless posterior updates, substantially harm the performance of BRL policies.

Aside from the issues with the ELBO, it is also concerning how previous methods compute the posterior $q(\theta_T, \theta_R \mid C^{M_i})$. As $C^{M_i}$ contains a variable and potentially large number of contexts, it is inefficient to use it directly as a conditioning variable. Instead, Rakelly et al. (2019) applied a factored approximation so that $q(\theta_T, \theta_R \mid C^{M_i}) \propto \prod_{t=1}^{N} \mathcal{N}\!\left(\theta_T, \theta_R \mid g([C^{M_i}]_t)\right)$, where $g(\cdot)$ is a neural network that takes the $t$-th context in $C^{M_i}$ as input and returns the mean and covariance of a Gaussian as output. From Bayes' rule, $q(\theta_T, \theta_R \mid C^{M_i}) \propto p(\theta_T, \theta_R) \prod_{t=1}^{N} p([C^{M_i}]_t \mid \theta_T, \theta_R)$. That is to say, for the approximation to be accurate, each factor needs to satisfy $\mathcal{N}\!\left(\theta_T, \theta_R \mid g([C^{M_i}]_t)\right) \approx p([C^{M_i}]_t \mid \theta_T, \theta_R)\, p(\theta_T, \theta_R)^{1/N}$. On the left-hand side, $g(\cdot)$ tries to predict the mean and the covariance regardless of the prior, while the right-hand side involves the prior, meaning that the same $\mathcal{N}(\theta_T, \theta_R \mid g([C^{M_i}]_t))$ gets implicitly assigned different targets as $N$ increases. This may result in inaccuracies of the approximation and unstable training. On the other hand, Zintgraf et al. (2021) summarise $C^{M_i}$ with RNNs to get hidden variables $h$, and compute $q(\theta_T, \theta_R \mid h)$. Despite its simplicity, it has been shown in Rakelly et al. (2019) that permutation-variant structures like RNNs may lead to worse performance.
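The factorisation mismatch above can be checked numerically with a 1-D toy (all numbers are illustrative assumptions): the Gaussian factor that $g(\cdot)$ would have to output, so that the product of $N$ copies equals the exact Bayes-rule posterior, changes with $N$.

```python
# 1-D toy check of the factorisation issue: a PEARL-style posterior is a product
# of per-context Gaussian factors, while the Bayes-rule posterior is the prior
# times the product of the likelihood factors. For the two to coincide, each
# factor has to absorb a 1/N share of the prior, so its target depends on N.
import numpy as np

def gaussian_product(mus, taus):
    """Product of 1-D Gaussians given means and precisions (natural form)."""
    mus, taus = np.asarray(mus), np.asarray(taus)
    tau = taus.sum()
    return (taus * mus).sum() / tau, tau

prior_mu, prior_tau = 0.0, 1.0   # prior N(0, 1)
lik_mu, lik_tau = 2.0, 4.0       # each per-context likelihood factor N(2, 0.25)

for N in (2, 8, 32):
    # exact posterior: prior times N likelihood factors
    post_mu, post_tau = gaussian_product(
        [prior_mu] + [lik_mu] * N, [prior_tau] + [lik_tau] * N
    )
    # identical factor whose N-fold product equals the exact posterior
    factor_mu, factor_tau = post_mu, post_tau / N
    print(N, round(factor_mu, 4), round(factor_tau, 4))
# Both the required mean and precision of the factor drift as N grows.
```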

Previous deep BRL methods perform inaccurately approximated posterior updates and optimise challenging ELBOs. Both issues may lead to incorrect distributions of task parameters, compromising the performance of BRL policies. To this end, we introduce our method, GLiBRL. GLiBRL features generalised linear models that enable fully tractable and permutation-invariant posterior updates, and hence a closed-form marginal log-likelihood, without the need to evaluate and optimise the ELBO. The linear assumption may seem strong, but basis functions still enable linear models to learn non-linear transitions and rewards. The basis functions that map from the raw data $C^{M_i}$ to the feature space are made learnable from the marginal log-likelihood, instead of being chosen arbitrarily, allowing for efficient learning in a low-dimensional feature space. We elaborate on learning the forms of the transition and reward functions in Section 3.1 and discuss efficient online policy learning in Section 3.2. The full GLiBRL algorithm is shown in Algorithm 1.

We rewrite the contexts collected in MDP $M_i$ as matrices $S_i \in \mathbb{R}^{N \times D_S}$, $A_i \in \mathbb{R}^{N \times D_A}$, $S'_i \in \mathbb{R}^{N \times D_S}$, $R_i \in \mathbb{R}^{N \times 1}$ for compactness, where $N$ is the number of contexts and $D_S, D_A$ are the dimensions of the state and action space. We further let $\theta_T = (T_\mu, T_\sigma)$ and $\theta_R = (R_\mu, R_\sigma)$ with $T_\mu \in \mathbb{R}^{D_T \times D_S}$ and $R_\mu \in \mathbb{R}^{D_R \times 1}$, where $D_T, D_R$ are the task dimensions. Note that we explicitly perform Bayesian inference on the model noises $T_\sigma, R_\sigma$, instead of assuming known model noises. We make the following approximation:

where $C_T = \phi_T(S_i, A_i)$ and $C_R = \phi_R(S_i, A_i, S'_i)$² are features of the contexts $S_i, A_i, S'_i$ computed through neural networks, which act as learnable basis functions, and $\mathcal{MN}(W \mid X, Y, Z)$ defines a matrix normal distribution over a random matrix $W$ with mean $X$, row covariance $Y$ and column covariance $Z$. Different from other deep BRL methods such as PEARL (Rakelly et al., 2019) and VariBAD (Zintgraf et al., 2021), GLiBRL does not place neural networks on the joint contexts and task parameters (e.g., $\phi_T(S_i, A_i, \theta_T)$), so that inference remains tractable.

² Dependence on $i$ of $C_T, C_R$ is omitted for clarity. This also applies to

Algorithm 1: GLiBRL (jointly learns the BRL policy and the transition and reward models)

Assuming the independence of $\theta_T$ and $\theta_R$, and dropping the neural network $\zeta$ of the prior, Equation 3 can be written as

Because of the linear relationship between $\theta_T, \theta_R$ and the features of the contexts, we can place Normal-Wishart priors, conjugate to the matrix normals, on $\theta_T, \theta_R$ for tractable inference:

where $\mathcal{W}(W \mid X, \nu)$ defines a Wishart distribution on a positive definite random matrix $W$ with scale $X$ and degrees of freedom $\nu$. It is shown in Appendix A.2 that the posteriors are also Normal-Wishart distributions,

where the posterior parameters (e.g., $M_T, \Xi'_T, \Omega'_T, \nu'_T$ for the transition model, and their counterparts for the reward model) are given by the conjugate updates in Equation 13.

Thus, we can find a closed-form marginal log-likelihood, Equation 14 (proof in Appendix A.3).

Equation 14 is to be maximised with respect to $C_T$ and $C_R$, hence $\phi_T$ and $\phi_R$. We add the squared Frobenius norms $\|C_T\|_F^2$ and $\|C_R\|_F^2$, weighted by hyperparameters $\lambda_T > 0$ and $\lambda_R > 0$, to Equation 14 as regularisation, the effect of which is discussed in Appendix A.8; the regularised loss function $L_{\mathrm{model}}$ is defined accordingly in Equation 15. We note that $L_{\mathrm{model}}$ can be directly minimised with gradient descent, without the need to evaluate and optimise the ELBO. To roll out $\pi^+_\psi$ online, the prior $b_t$ needs to be continually updated to the posterior $b_{t+1}$ from each new context $c = \{s_t, a_t, s'_t, r_{t+1}\}$, using the learnt transition and reward models. Therefore, fast posterior updates are crucial for efficient context collection. One of the most time-consuming parts of Equation 13 is the inversion of $\Xi'_T$ and $\Xi'_R$, which is of time complexity $O(D_T^3)$ and $O(D_R^3)$, respectively. Fortunately, with the matrix inversion lemma, the inverses can be maintained incrementally: when updating the belief online with a single new context, only a rank-one (Sherman-Morrison) update of $\Xi'^{-1}_T$ and $\Xi'^{-1}_R$ is required, reducing the per-step cost to $O(D_T^2)$ and $O(D_R^2)$.
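A minimal NumPy sketch of both pieces is given below: a batch conjugate update of the kind Equation 13 performs, and the Sherman-Morrison rank-one refresh of the inverse. The variable names and the exact Normal-Wishart parameterisation here are assumptions and may differ from the paper's conventions.

```python
# Sketch of a conjugate Bayesian linear regression update (matrix-normal weights
# with a Wishart-type noise model) and of the O(D^2) online refresh of Xi^{-1}
# using the Sherman-Morrison identity. Parameter conventions are assumed.
import numpy as np

def batch_posterior(M, Xi, Omega_inv, nu, C, Y):
    """Batch update given features C (N x D) and targets Y (N x P)."""
    Xi_post = Xi + C.T @ C                               # weight precision
    M_post = np.linalg.solve(Xi_post, Xi @ M + C.T @ Y)  # posterior mean
    # residual scatter folded into the (inverse) noise-scale parameter
    Omega_inv_post = (Omega_inv + Y.T @ Y
                      + M.T @ Xi @ M - M_post.T @ Xi_post @ M_post)
    nu_post = nu + C.shape[0]                            # degrees of freedom
    return M_post, Xi_post, Omega_inv_post, nu_post

def sherman_morrison_update(Xi_inv, c):
    """Rank-one update of Xi^{-1} after adding one feature vector c (shape (D,)).
    Costs O(D^2) instead of the O(D^3) of re-inverting Xi + c c^T."""
    v = Xi_inv @ c
    return Xi_inv - np.outer(v, v) / (1.0 + c @ v)

# quick consistency check of the rank-one update
rng = np.random.default_rng(0)
D = 5
A = np.eye(D) + 0.1 * rng.standard_normal((D, D))
Xi = A @ A.T                                   # a positive definite precision
c = rng.standard_normal(D)
assert np.allclose(sherman_morrison_update(np.linalg.inv(Xi), c),
                   np.linalg.inv(Xi + np.outer(c, c)))
```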

ALPaCA. First, we discuss the most relevant work, ALPaCA (Harrison et al., 2018b). ALPaCA is an efficient and flexible online Bayesian linear regression framework, which also involves Bayesian linear models with learnable basis functions. ALPaCA was not initially proposed as a BRL method, though follow-up work such as CAMeLiD (Harrison et al., 2018a) uses controllers to compute the policy assuming known reward functions. Our method, GLiBRL, generalises ALPaCA and CAMeLiD in two respects: (1) ALPaCA and CAMeLiD assume a known noise in the likelihood function, instead of performing Bayesian inference on it; (2) ALPaCA and CAMeLiD are not evaluated in online BRL settings; they only investigated scenarios where offline contexts are available, with unknown transitions and relatively simple, known rewards. In Section 5, we argue empirically that the assumption of known noises incurs error in the predictions of both transitions and rewards.

Reinforcement Learning. RL methods can be categorised as model-free and model-based. We use the former in this paper to learn BRL policies, as a large proportion of RL work is model-free, including Trust-Region Policy Optimisation (TRPO) (Schulman et al., 2015), Proximal Policy Optimisation (PPO) (Schulman et al., 2017) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018). As GLiBRL learns the models, model-based methods can also be used for improved sample efficiency.

Hidden-Parameter MDPs. The Hidden-Parameter MDP (HiP-MDP), proposed by Doshi-Velez & Konidaris (2016), is a framework for parametric Bayesian Reinforcement Learning. HiP-MDPs were initially modelled using Gaussian Processes (GPs). Killian et al. (2017) improved the scalability by replacing GPs with Bayesian Neural Networks (BNNs). The weights of BNNs are updated with new data during evaluation, which has been empirically shown to be inefficient (Yang et al., 2019). Yao et al. (2018) mitigated the inefficiency by fixing the test-time weights of BNNs and optimising the task parameters. Despite the improved speed, we have observed that this diverts the agent from following Bayes-optimal policies. The shared objectives in Killian et al. (2017) and Yao et al. (2018) correspond to approximate Bayesian inference on BNN weights, but not on task parameters.

Optimising the objective on task parameters with fixed BNN weights is equivalent to performing Maximum Likelihood Estimation (MLE) on the task parameters, immediately removing the Bayesian features (as also noted in Zintgraf et al. (2021)). Most recent parametric deep BRL methods (Rakelly et al., 2019; Zintgraf et al., 2021; Lee et al., 2023), including GLiBRL, are orthogonal to this line of work, as they perform (approximate) Bayesian inference on the task parameters directly, rather than on the weights of the neural networks. Furthermore, HiP-MDPs assume known reward functions. Perez et al. (2020) generalised HiP-MDPs to also consider unknown reward functions, just as GLiBRL and recent deep BRL methods (Rakelly et al., 2019; Zintgraf et al., 2021; Lee et al., 2023) do.

Classical Bayesian Reinforcement Learning. As mentioned in Section 2, classical BRL methods assume known forms of transitions and rewards. Poupart et al. (2006) presented a Partially Observable MDP (POMDP) formulation of BRL and a sampling-based offline solver. Guez et al. (2013) proposed an online tree-based solver, applying posterior sampling (Strens, 2000; Osband et al., 2013) for efficiency. Both methods compute (approximately) optimal policies using planners, which is orthogonal to GLiBRL. GLiBRL shares the idea of using generalised linear models with Tziortziotis et al. (2013), but differs in that Tziortziotis et al. (2013) choose the basis functions instead of learning them. Even a simple non-linear basis function, such as a quadratic one, may result in an $O(d^2)$-dimensional feature space, where $d$ is the dimension of the raw input³. As demonstrated in Section 3, performing online Bayesian inference requires at least quadratic complexity with respect to the feature dimension, meaning a prohibitive $O(d^4)$ complexity for just a single inference step. In contrast, with the learnt basis functions in GLiBRL, a low-dimensional feature space is usually sufficient to capture the non-linearity, providing sufficient scalability.
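As a quick, illustrative check of this scaling (the numbers are arbitrary, not taken from the paper):

```python
# With a hand-picked quadratic basis the feature dimension grows as O(d^2), and
# since each online inference step is at least quadratic in the feature
# dimension, the per-step cost grows as O(d^4).
def quadratic_feature_dim(d):
    # 1 (bias) + d (linear terms) + d*(d+1)//2 (all pairwise products)
    return 1 + d + d * (d + 1) // 2

for d in (10, 40, 100):
    D = quadratic_feature_dim(d)
    print(d, D, D ** 2)   # raw dim, feature dim, ~relative cost per inference step
```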

Meta-Reinforcement Learning. Meta-Reinforcement Learning (Meta-RL) aims to learn policies from seen tasks that are capable of adapting to unseen tasks drawn from similar task distributions (Beck et al., 2023b).

In this section, we investigate the performance of GLiBRL using MetaWorld (Yu et al., 2021; McLean et al., 2025). We follow the majority of the experiment settings in McLean et al. (2025). ML10 and ML45 experiments are run for 2e7 and 9e7 steps, respectively. We do not allow any adaptive steps during test time, hence evaluating the zero-shot performance. Zero-shot performance is critical to Bayesian RL and Meta-RL, as the goal of Meta-RL is to adapt with as little data as possible (Beck et al., 2023b). We compare the Inter-Quartile Mean (IQM) of the results, as suggested by Agarwal et al. (2021). Each experiment is run 10 times, each with a single A100 GPU⁴. We first compare GLiBRL against MAML, RL² and VariBAD; as in Beck et al. (2023a), MAML learns its policy with TRPO, and VariBAD, RL² and GLiBRL use PPO.

Our PPO implementation uses the standard linear feature baseline (Duan et al., 2016), as suggested by McLean et al. (2025). Afterwards, we compare GLiBRL to recent deep BRL and Meta-RL baselines, including the deep BRL method SDVT (Lee et al., 2023) and two Transformer-based (Vaswani et al., 2017) black-box Meta-RL methods, TrMRL (Melo, 2022) and ECET (Shala et al., 2025). Finally, we perform ablation studies on whether it is useful to place a Wishart distribution on the model noises (i.e., comparing with ALPaCA). It is also worth mentioning why we do not compare with PEARL: it has been demonstrated empirically in Yu et al. (2021) that PEARL performs much worse than other methods.⁵

We report the details of our implementations of GLiBRL and other methods. For GLiBRL, we set task dimensions $D_T = 16$ and $D_R = 256$ in both ML10 and ML45. We present an analysis of the sensitivity of GLiBRL with respect to the task dimensions $D_T$ and $D_R$ in Appendix A.9. The model networks $\phi_T, \phi_R$ are Multi-Layer Perceptrons (MLPs) consisting of feature and mixture networks. Feature networks convert raw states and actions to features and are shared between $\phi_T$ and $\phi_R$. Mixture networks mix the state and action features, further improving the representativeness. Training $\pi^+_\psi(a \mid s, b)$ requires representing the belief $b$ as parameters. Empirically, we find that representing $b$ using the flattened mean matrices $M_T, M_R$ has the best performance. As flattening $M_T \in \mathbb{R}^{D_T \times D_S}$ directly results in a large number of parameters, we instead flatten the lower triangle of $M_T M_T^\top \in \mathbb{R}^{D_T \times D_T}$. The policy network then takes the flattened and normalised parameters as the representation of the belief $b$. For MAML and RL², we use the implementations provided by Beck et al. (2023a). For VariBAD, we implement our own, following the official implementations in Zintgraf et al. (2021) and Beck et al. (2023a). The reason we re-implement VariBAD is to use the standardised framework provided by Beck et al. (2023a), written in JAX (Bradbury et al., 2018). Our VariBAD implementation has experiment results matching those of Beck et al. (2023a). We use the tuned hyperparameters from the official implementations for MAML, RL² and VariBAD. The table of all related hyperparameters of GLiBRL is shown in Appendix A.10.
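The belief representation described above can be sketched as follows; the shapes and the normalisation are assumptions for illustration (the text only states that flattened, normalised means are used, with the lower triangle of $M_T M_T^\top$ replacing the full $M_T$):

```python
# Sketch of the belief features fed to the policy network: the lower triangle of
# the Gram matrix M_T M_T^T (D_T x D_T) replaces the much larger flattened
# M_T (D_T x D_S); M_R is flattened directly. Normalisation here is a simple
# standardisation, used as a stand-in for the paper's (unspecified) scheme.
import numpy as np

def belief_features(M_T, M_R):
    gram = M_T @ M_T.T                                   # D_T x D_T
    tril = gram[np.tril_indices(gram.shape[0])]          # lower triangle, flattened
    feats = np.concatenate([tril, M_R.ravel()])
    return (feats - feats.mean()) / (feats.std() + 1e-8)

M_T = np.random.randn(16, 39)    # D_T = 16; D_S = 39 assumed (MetaWorld-like)
M_R = np.random.randn(256, 1)    # D_R = 256, scalar rewards
print(belief_features(M_T, M_R).shape)   # (16*17//2 + 256,) = (392,)
```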

We present the results in Figure 1, which shows the testing success rates against training steps. The success rates are averaged across the 5 testing tasks (with 10 seeds per experiment), and per-task success rates are shown in Appendix A.5 and Appendix A.6. Overall, GLiBRL substantially outperforms the baselines.

The main comparison is between GLiBRL and VariBAD, as they are both BRL methods. In both benchmarks, we see substantial improvement, by up to 2.7× in ML10, using GLiBRL. We also noticed an interesting decreasing trend of VariBAD in the ML45 benchmark. We suspect that the larger number of training tasks leads to increased difficulty in learning meaningful and distinctive latent representations, partly because of the ELBO objective in VariBAD. By design, GLiBRL avoids the use of the ELBO, hence achieving better and more stable performance. In Appendix A.7, we show that VariBAD suffers from posterior collapse even in the simpler benchmark ML10, while GLiBRL learns meaningful task representations.

GLiBRL, MAML and RL²: GLiBRL achieves consistently higher success rates than MAML and RL² in the more complex ML45 benchmark. Notably, GLiBRL also admits low variance, which can be inferred from its having the tightest confidence intervals (CIs). In Appendix A.9, we show that GLiBRL reaches even higher success rates (29%) in ML10 when setting $D_T = 8$, at the cost of slightly higher predictive error.

Aside from comparisons against standard baselines in deep BRL and Meta-RL, we compare the success rate of GLiBRL in both ML10 / ML45 to more recent work, namely SDVT (Lee et al., 2023), TrMRL (Melo, 2022) and ECET (Shala et al., 2025).

Table 1 shows the detailed results. We can observe that GLiBRL outperforms all of these recent deep BRL / Meta-RL methods. Compared against the computationally heavy Transformer-based models TrMRL and ECET, GLiBRL is more performant and lightweight (∼200K parameters), and hence applicable to, for example, mobile robots with lower-end computing resources.

Even having outperformed most of the recent or state-of-the-art methods already, GLiBRL learns its policies with PPO, which does not fully reveal its potential. Model-based policy learners, which cannot be applied to black-box / PPG-based Meta-RL methods, are expected to further improve the sample efficiency and performance of GLiBRL.

GLiBRL can be viewed as a generalised deep BRL version of ALPaCA, as GLiBRL performs Bayesian inference on the model noises $T_\sigma, R_\sigma$, while ALPaCA simply assumes $T_\sigma = \Sigma_T, R_\sigma = \Sigma_R$ are fully known a priori. Under the assumption of ALPaCA, Equation 14 reduces to Equation 18 (see Appendix A.4).

We study whether inferring the model noises is necessary for learning accurate transition and reward models. GLiBRL and its variant without noise inference (GLiBRL wo NI) are tested with identical hyperparameters on both the ML10 and ML45 benchmarks⁶. The metrics being evaluated are the $L_1$ norms of the prediction errors in both transitions (defined as $|S' - C_T T_\mu|_1$) and rewards (defined as $|R - C_R R_\mu|_1$).⁷ The results are shown in Figure 2.
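For reference, a minimal sketch of these metrics (shapes are assumed; $|\cdot|_1$ is taken here as the sum of absolute entries):

```python
# L1 prediction errors of the learnt mean models for transitions and rewards.
import numpy as np

def l1_prediction_errors(S_next, C_T, T_mu, R, C_R, R_mu):
    transition_err = np.abs(S_next - C_T @ T_mu).sum()   # |S' - C_T T_mu|_1
    reward_err = np.abs(R - C_R @ R_mu).sum()            # |R - C_R R_mu|_1
    return transition_err, reward_err
```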

Overall, with noise inference, GLiBRL admits lower prediction errors in both transitions and rewards, compared to GLiBRL wo NI, which assumes known noises. Both methods become increasingly erroneous in reward predictions as training proceeds. This is expected, as more steps result in higher success rates, hence an increased magnitude of rewards and of errors. However, the increasing trend of the transition errors of GLiBRL wo NI is abnormal, as the magnitudes of states are rather bounded and less relevant to the success rate, compared to those of rewards. In Equation 18, the term governing the fit of the transition model has a fixed learning rate coming from the fixed $\Sigma_T$, leading to continual unstable or underfit behaviour if the learning rate is too large or too small. On the contrary, in GLiBRL, the model fit term $\nu'_T \log\left|\tfrac{1}{2}\Omega'_T\right|$ has dynamic learning rates from the dynamic $\nu'_T$. This enables self-adaptive and effective model learning, hence the expected decreasing trend in Figure 2. The lower prediction error of GLiBRL allows better integration with model-based methods using imaginary samples, whose quality depends highly on the accuracy of the predictions.

We propose GLiBRL, a novel deep BRL method that enables fully tractable inference on the task parameters and efficient learning of basis functions with ELBO-free optimisation. Instead of assuming known model noises, GLiBRL performs Bayesian inference on them, which has been shown empirically to reduce prediction error in both the transition and reward models. The results on the challenging MetaWorld ML10 and ML45 benchmarks demonstrate a substantial improvement compared to one of the state-of-the-art deep BRL methods, VariBAD. The low-variance and decent performance of GLiBRL can also be inferred from its comparisons against representative or recent deep BRL / Meta-RL methods, including MAML, RL², SDVT, TrMRL and ECET.

Multiple directions of future work arise naturally from the formulation of GLiBRL, the most interesting one being model-based methods. As GLiBRL is capable of learning accurate transition and reward models, model-based methods can easily be applied for improved sample efficiency and performance. However, model-based methods usually require frequent sampling from the learnt models, which reveals a limitation of GLiBRL, as sampling from Wishart distributions can be slow. Another exciting direction, if we prefer model-free methods, is to seek a better way of utilising the task parameters in the policy network. In this paper, we simply feed the policy network with the normalised means of the task parameters. A naive normalisation of the parameters may confuse the policy network, and using only the means discards the uncertainty information in the covariances.

We prove Equation 4 and Equation 5.

We prove Equation 11, i.e., that the posteriors are also Normal-Wishart distributions, with parameters as given in Equation 13.

PROOF:

The density of the prior distribution $p(\theta_T)$ is

where from Equation 19 to Equation 20 we treat multiplicative terms irrelevant to $\theta_T$ as constants. The joint density of the prior and the likelihood is

Matching first the second-order and then the first-order terms with respect to $T_\mu$, we can rewrite the joint density as

We find that Equation 20 and Equation 23 match exactly, indicating the Normal-Wishart-Normal conjugacy. Note that we use the posterior update of $p(\theta_T)$ as an example; the exact same proof applies to the posterior update of $p(\theta_R)$. Such conjugacy allows exact posterior updates and an exact marginal likelihood, enabling efficient learning.

We prove Equation 18 has the following form

PROOF:

The distributions without inference on the noise are as follows: the likelihood, the prior and the posterior of the model weights are all (matrix) normal distributions with the known noise covariances $\Sigma_T$ and $\Sigma_R$, and the posterior parameters follow the standard Bayesian linear regression updates. Similar to Appendix A.3, the marginal log-likelihood is then available in closed form, and simplifying it yields Equation 18.

Hence Equation 18 follows.

Intuitively, if the expected divergence is close to 0, the majority of posterior updates have collapsed to the priors, meaning barely any meaningful task representation has been learnt. As the figure shows, VariBAD fails to learn meaningful representations, while GLiBRL demonstrates a clear divergence between posteriors and priors.

We list all hyperparameters of GLiBRL in the following table. We use the same hyperparameters for both ML10 and ML45. GLiBRL is rather efficient in both time and memory. Although all of our experiments are run on A100 GPUs, we have verified that GLiBRL runs quickly on much lower-end GPUs with 8 GB of memory, such as an RTX 3070, with each run costing less than 2 hours. The runtime does not vary much with changes in $D_T$ and $D_R$, due to the quadratic online inference complexity.

¹ Henceforth, we drop the dependence on the time step $t$ of $\theta_T, \theta_R$ for brevity.

³ In GLiBRL, $d = D_S + D_A$ for transition models and $d = 2 D_S + D_A$ for reward models.

⁴ An A100 is not mandatory; GLiBRL is runnable with ≤ 8 GB of GPU memory, see Appendix A.10.

⁵ For example, PEARL barely succeeded (< 3%) in ML10 with 1e8 steps; see Figure 17 in Yu et al. (2021).

⁶ They also share the same initial noises: $(\nu_T \Omega_T)^{-1} = \Sigma_T = 0.025 \cdot I$ and $(\nu_R \Omega_R)^{-1} = \Sigma_R = 0.5$.

⁷ Comparisons of success rates are not included, as there is no obvious difference in IQMs or CIs.
