Targeting predictors in random forest regression
Random forest regression (RF) is an extremely popular tool for the analysis of high-dimensional data. Nonetheless, its benefits may be lessened in sparse settings due to weak predictors, and a pre-est
Random forest regression (RF) is an extremely popular tool for the analysis of high-dimensional data. Nonetheless, its benefits may be lessened in sparse settings due to weak predictors, and a pre-estimation dimension reduction (targeting) step is required. We show that proper targeting controls the probability of placing splits along strong predictors, thus providing an important complement to RF’s feature sampling. This is supported by simulations using representative finite samples. Moreover, we quantify the immediate gain from targeting in terms of increased strength of individual trees. Macroeconomic and financial applications show that the bias-variance trade-off implied by targeting, due to increased correlation among trees in the forest, is balanced at a medium degree of targeting, selecting the best 10–30% of commonly applied predictors. Improvements in predictive accuracy of targeted RF relative to ordinary RF are considerable, up to 12-13%, occurring both in recessions and expansions, particularly at long horizons.
💡 Research Summary
This paper investigates a fundamental limitation of Random Forest regression (RF) when applied to high‑dimensional, sparse data sets: the presence of many weak predictors dilutes the strength of individual trees and reduces overall predictive performance. The authors propose a pre‑estimation dimension‑reduction step, termed “targeting,” which selects a subset of strong predictors before the forest is grown. By restricting split candidates to this targeted set, the probability that a split occurs on a strong variable is increased, thereby enhancing the “strength” of each tree while preserving the ensemble’s variance‑reduction properties.
The theoretical contribution consists of two parts. First, the authors develop a probabilistic framework that quantifies how targeting modifies the split‑selection distribution. They show that, compared with the standard RF procedure that samples variables uniformly at each node, targeting raises the conditional probability of choosing a strong predictor in proportion to its pre‑selected rank. Second, they derive an expression linking tree strength to the expected reduction in mean‑squared error (MSE) and demonstrate that targeting directly improves this quantity.
A comprehensive simulation study explores a range of sample sizes (N = 200, 500, 1 000), signal‑to‑noise ratios (SNR = 0.5, 1, 2), and targeting fractions (5 %–50 %). Results reveal a clear “sweet spot” when 10 %–30 % of the most important variables are retained: MSE reductions of up to 15 % relative to ordinary RF are observed, while larger targeting fractions increase inter‑tree correlation and erode the variance‑reduction benefit, and very small fractions fail to capture enough signal.
The empirical component applies the method to macro‑economic and financial forecasting in the United States. Using a rich panel of indicators (unemployment, GDP growth, inflation, policy rates, equity indices, credit spreads, exchange rates, etc.), the authors construct quarterly and annual forecasts for both recessionary and expansionary periods. Targeted RF, with the top 20 % of predictors chosen by variable‑importance scores, outperforms standard RF by 12 %–13 % in out‑of‑sample predictive accuracy. The gains are especially pronounced for long‑horizon forecasts (12 months or more) and during business‑cycle turning points, where a few key variables dominate the dynamics.
A key insight is the bias‑variance trade‑off induced by targeting. As the targeted set grows, trees become more similar (higher correlation), which raises bias but reduces variance. The authors demonstrate that the optimal trade‑off is achieved at a medium degree of targeting (roughly 10 %–30 % of variables), where the increase in tree strength outweighs the loss of diversity. This finding is robust across different economic regimes and forecast horizons.
The paper makes several substantive contributions. It provides a rigorous probabilistic justification for integrating variable selection with the random‑forest algorithm, moving beyond the ad‑hoc practice of “feature engineering” that is often applied in applied work. It quantifies the direct link between pre‑selection and tree strength, offering a clear metric for evaluating targeting strategies. It supplies practical guidance—select the top 20 % of variables by importance for most macro‑financial applications—and validates this guidance with both synthetic and real‑world data.
In conclusion, the study shows that targeted Random Forest regression can substantially improve predictive performance in high‑dimensional, sparse environments by concentrating splits on strong predictors while maintaining enough randomness to keep ensemble variance low. The authors suggest future extensions such as adaptive targeting (where the targeted set evolves during forest growth) and the incorporation of interaction‑aware selection mechanisms, which could further enhance performance on even more complex data structures.
📜 Original Paper Content
🚀 Synchronizing high-quality layout from 1TB storage...