On the Computational Efficiency of Bayesian Additive Regression Trees: An Asymptotic Analysis
Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by well-developed estimation theory, comprising guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. However, the computational properties of the widely-used BART sampler proposed by Chipman et al. (2010) are yet to be well-understood. In this paper, we perform an asymptotic analysis of a slightly modified version of the default BART sampler when fitted to data-generating processes with discrete covariates. We show that the sampler’s time to convergence, evaluated in terms of the hitting time of a high posterior density set, increases with the number of training samples, due to the multi-modal nature of the target posterior. On the other hand, we show that this trend can be dampened by simple changes, such as increasing the number of trees in the ensemble or raising the temperature of the sampler. These results provide a nuanced picture of the computational efficiency of the BART sampler in the presence of large amounts of training data while suggesting strategies to improve the sampler. We complement our theoretical analysis with a simulation study focusing on the default BART sampler. We observe that the increasing trend of convergence time against the number of training samples holds for the default BART sampler and is robust to changes in sampler initialization, number of burn-in iterations, feature selection prior, and discretization strategy. On the other hand, increasing the number of trees or raising the temperature sharply dampens this trend, as indicated by our theory.
💡 Research Summary
The paper investigates the computational efficiency of the widely used Bayesian Additive Regression Trees (BART) sampler, focusing on how its convergence behavior scales with the number of training observations n. Because BART’s posterior cannot be evaluated analytically, inference relies on a Metropolis‑within‑Gibbs MCMC algorithm originally proposed by Chipman et al. (2010). The authors first modify this sampler by marginalizing out the leaf‑node parameters µ, so that the Metropolis–Hastings step operates solely on the discrete space of tree‑structure ensembles. This marginalization yields a finite, n‑independent state space, making it amenable to classical Markov‑chain analysis.
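The normal–normal conjugacy that makes this marginalization tractable can be sketched as follows. This is an illustrative implementation, not the paper's exact parameterization: the function name, the fixed noise variance σ², and the N(0, σ_µ²) prior on the leaf mean are assumptions for the sketch.

```python
import numpy as np

def leaf_log_marginal(r, sigma2, sigma_mu2):
    """Log marginal likelihood of the residuals r falling in one leaf,
    after integrating out the leaf mean mu ~ N(0, sigma_mu2).

    Marginally r ~ N(0, sigma2*I + sigma_mu2*J) with J the all-ones
    matrix, which has a closed-form density (via Sherman-Morrison)."""
    r = np.asarray(r, dtype=float)
    m = r.size
    s, ssq = r.sum(), (r ** 2).sum()
    # log-determinant of sigma2*I + sigma_mu2*J
    log_det = (m - 1) * np.log(sigma2) + np.log(sigma2 + m * sigma_mu2)
    # quadratic form r' (sigma2*I + sigma_mu2*J)^{-1} r
    quad = ssq / sigma2 - sigma_mu2 * s ** 2 / (sigma2 * (sigma2 + m * sigma_mu2))
    return -0.5 * (m * np.log(2 * np.pi) + log_det + quad)
```

Because µ is integrated out exactly, the Metropolis–Hastings step only needs this per-leaf quantity, so the chain lives on the discrete space of tree structures alone, as the summary describes.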
The theoretical framework assumes a d‑dimensional discrete covariate space X={1,…,B}^d and a regression model y_i = f⁎(x_i)+ε_i with sub‑Gaussian noise. The true regression function f⁎ is allowed to be additive, i.e., a sum of component functions each depending on a single covariate. Under these conditions the posterior concentrates on a set OPT of “optimal” tree ensembles that achieve the smallest bias‑complexity trade‑off; OPT also forms a highest‑posterior‑density region (HPDR).
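A minimal simulation of this setup might look like the following; the per-covariate lookup tables standing in for the additive components g_j, and the Gaussian noise (one concrete sub-Gaussian choice), are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d, n = 5, 3, 1000                     # levels per covariate, dimension, sample size
X = rng.integers(1, B + 1, size=(n, d))  # covariates in {1, ..., B}^d

# additive truth: f*(x) = sum_j g_j(x_j), each g_j a lookup table over B levels
tables = [rng.normal(size=B) for _ in range(d)]
f_star = sum(tables[j][X[:, j] - 1] for j in range(d))

y = f_star + rng.normal(scale=0.1, size=n)  # Gaussian (hence sub-Gaussian) noise
```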
The first major result is a lower bound on the hitting time τ_OPT, defined as the number of MCMC iterations required for the chain (starting from the usual trivial‑tree initialization) to enter OPT. When the number of additive components in f⁎ is at least as large as the number of trees in the fitted BART model, the authors prove τ_OPT = Ω(n^{1/2}). In other words, as the sample size grows, the chain must traverse increasingly difficult multimodal regions of the posterior, and convergence slows down at least at the square‑root rate. This lower bound is modest compared with exponential bounds found for single‑tree Bayesian CART, but it is the first quantitative guidance linking n to the required MCMC budget for BART.
To counteract this slowdown, the paper proposes three simple modifications and derives corresponding mixing‑time upper bounds:
- Increasing the number of trees – If the ensemble contains sufficiently many trees (relative to the true additive components), the posterior becomes effectively unimodal and the mixing time becomes O(1) with respect to n.
- More global proposal moves – By allowing larger structural changes (e.g., swapping whole sub‑trees, simultaneous split/merge operations) before the Metropolis–Hastings acceptance step, the chain can jump between modes more readily. The authors show that such enriched proposals also yield an O(1) mixing time.
- Tempering – Raising the posterior density to the power 1/T (T ≥ 1) flattens the landscape. The resulting tempered chain mixes in time that grows slower than any polynomial in n (e.g., polylogarithmic).

Importantly, all three modifications preserve the HPDR, so the statistical properties of the posterior remain unchanged.
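The effect of tempering can be illustrated with a toy random-walk Metropolis chain on a one-dimensional bimodal target, a stand-in for the multimodal tree-ensemble posterior; the state space, barrier depth, and temperatures are all illustrative choices, not quantities from the paper.

```python
import numpy as np

def run_chain(log_pi, n_states, T, steps, seed=0):
    """Random-walk Metropolis on {0, ..., n_states-1}, targeting pi^(1/T)."""
    rng = np.random.default_rng(seed)
    x = 0
    path = [x]
    for _ in range(steps):
        prop = x + rng.choice([-1, 1])
        if 0 <= prop < n_states:
            # tempered acceptance: divide the log-posterior ratio by T
            if np.log(rng.uniform()) < (log_pi[prop] - log_pi[x]) / T:
                x = prop
        path.append(x)
    return np.array(path)

# bimodal target: modes at both ends, a deep valley centred at state 10
states = np.arange(21)
log_pi = -3.0 * np.minimum(states, states[::-1])

path_cold = run_chain(log_pi, 21, T=1.0, steps=50000)       # stays in one mode
path_hot = run_chain(log_pi, 21, T=8.0, steps=50000, seed=1)  # crosses the barrier
```

The untempered chain started at state 0 essentially never crosses the log-depth-30 valley, while the tempered chain (which only has to climb a depth of 30/8) visits both modes, mirroring the summary's point that flattening the landscape restores mode-hopping.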
The theoretical findings are complemented by an extensive simulation study using the original (unmodified) BART sampler on six real‑world and synthetic datasets. Convergence is assessed with the Gelman–Rubin statistic R̂. Across all datasets, R̂ exhibits a clear upward trend as n increases, confirming the predicted degradation in mixing. When the number of trees is increased or the temperature is raised, R̂ drops dramatically, and the coverage of credible intervals as well as held‑out RMSE improve. These empirical patterns are robust to changes in initialization, burn‑in length, feature‑selection priors, and discretization schemes, indicating that the observed n‑dependence is intrinsic to the default sampler rather than an artifact of tuning.
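The diagnostic itself is straightforward to compute from parallel chains. Below is a minimal version of the classic (non-split) R̂; the paper's exact variant is not specified in the summary, so treat this as an illustrative sketch.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic Gelman-Rubin potential scale reduction factor R-hat.

    chains: array of shape (m, n) holding m parallel chains of n draws."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    var_plus = (n - 1) / n * W + B / n     # pooled variance estimate
    return np.sqrt(var_plus / W)
```

Values near 1 indicate the chains agree; a chain stuck in a different posterior mode (the failure mode the paper documents for large n) inflates the between-chain term and pushes R̂ well above 1.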
The authors acknowledge several limitations. Their analysis relies on discrete covariates and additive true functions; extending the results to continuous predictors, interactions, or non‑additive structures remains an open problem. Moreover, the lower bound of √n, while non‑trivial, may still be conservative for many practical settings. The modifications suggested (more trees, global moves, tempering) are straightforward to implement, but their impact on computational cost per iteration and on scalability to very high‑dimensional data warrants further study.
In summary, this work provides the first asymptotic characterization of BART’s MCMC convergence as a function of sample size, demonstrates that naïve use of the default sampler can lead to poor mixing in large‑n regimes, and offers theoretically justified, easily applicable remedies. The results give practitioners concrete guidance on how to set the number of trees, design proposal kernels, or employ tempering to ensure reliable posterior inference when applying BART to large‑scale problems.