SAMP-HDRL: Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning
📝 Original Info
- Title: SAMP-HDRL: Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning
- ArXiv ID: 2512.22895
- Date: 2025-12-28
- Authors: Xiaotian Ren, Nuerxiati Abudurexiti, Zhengyong Jiang, Angelos Stefanidis, Hongbin Liu, Jionglong Su
📝 Abstract
Highlights
- SAMP-HDRL integrates dynamic asset grouping, hierarchical agent coordination, and utility-based capital allocation to enhance portfolio robustness under non-stationary market conditions.
- Extensive backtests across three market regimes show that SAMP-HDRL outperforms nine traditional and nine DRL baselines, delivering at least 5% improvements in Return and major risk-adjusted metrics.
- SHAP-based analysis uncovers a complementary "diversified + concentrated" decision pattern across agent layers, providing transparent and interpretable insights into hierarchical DRL portfolio behavior.
📄 Full Content
In the domain of portfolio optimization, achieving effective allocation under dynamically evolving market conditions remains a fundamental challenge in intelligent systems research. Early studies primarily focus on holistic modeling of a single asset universe, emphasizing more powerful and generalized neural architectures or advanced mathematical formulations to optimize portfolios Jiang et al. (2017); Gu et al. (2025); Sun et al. (2025, 2024); Qin et al. (2022); Ren et al. (2021); Gu et al. (2021); Sun et al. (2021); Ren et al. (2025). However, such approaches are highly prone to the curse of dimensionality in high-dimensional spaces, resulting in slow convergence, unstable training, or even divergence Bengio et al. (2013). To address this issue, researchers introduce clustering mechanisms to categorize assets and subsequently optimize portfolio weights through mathematical models, which improves performance Bulani et al. (2025). Nevertheless, these approaches generally suffer from a fundamental limitation: the clustering and optimization processes are decoupled. Since the clustering outcomes and the optimizer cannot be jointly trained, information cannot be effectively propagated, leading to constrained policy learning or suboptimal solutions Poli et al. (2007). Moreover, most clustering procedures are static or rely on heuristic updates, which fail to capture the temporal dynamics of financial time series Moody and Saffell (2001). Consequently, when regime shifts occur in the market, such models exhibit delayed or ineffective responses, resulting in severe drawdowns, uncontrolled risk exposure, and missed opportunities for potential profit Akgül et al. (2025).

Building on this observation, some studies attempt to integrate clustering with deep reinforcement learning, applying DRL algorithms within clustered subsets to achieve a preliminary coupling between clustering and DRL Jiang et al. (2023); Wang and Aste (2023). However, clustering remains static and weakly integrated with the DRL pipeline, limiting the capacity to model dynamically changing market conditions Yan et al. (2024). Furthermore, certain models introduce dynamic subset selection mechanisms, restricting investment to a subset of promising assets and reducing computational complexity by maintaining a representative subset of assets through online updates Peng et al. (2024a); Ma et al. (2024). Although online updating provides promising benefits, existing mechanisms predominantly rely on heuristic rules and do not incorporate differentiable end-to-end training frameworks, thereby limiting integration with deep learning or reinforcement learning and restricting scalability as well as generalization capability Wang and Aste (2023). To address these issues, researchers design a two-level reinforcement learning architecture in which the upper-level agent dynamically selects stocks while the lower-level agent optimizes transaction execution, thereby alleviating to some extent the challenges posed by high-dimensional action spaces Zhao and Welsch (2024a). However, the stock selection process remains implicit Zhao and Welsch (2024a), relying entirely on the policy network's mapping from market states to actions and lacking explicit modeling of structural market shifts. Moreover, the inter-layer information flow does not constitute a genuine end-to-end feedback loop, which imposes inherent limitations on interpretability and adaptability to dynamic environments Barto and Mahadevan (2003a).
From a computational perspective, these approaches exhibit three major limitations.
- Static or heuristic clustering lacks the ability to model the dynamic properties of time series. Many methods treat clustering as a one-off or low-frequency "static annotation" Jiang et al. (2023); Wang and Aste (2023), or perform heuristic updates with fixed thresholds and simple rules, implicitly assuming that asset relationships and risk structures remain stationary within a given window Peng et al. (2024a); Ma et al. (2024). In reality, markets are highly dynamic, and such static clusters quickly become obsolete Jiang et al. (2023). The absence of dynamism leads to two consequences: (i) signal lag and attenuation: when regime shifts occur, outdated clusters or features lose discriminative power and strategies respond with delay Shu et al. (2024); and (ii) risk mismatch: risk models constrained by outdated partitions underestimate true exposures, resulting in elevated risk Trucíos (2025). Moreover, this increases transaction costs, as misaligned rebalancing and frequent ineffective transactions cause excessive turnover. Methodologically, it is essential to introduce risk-sensitive dynamic clustering mechanisms and to couple such dynamic representations deeply with portfolio optimization in order to achieve rapid and adaptive strategy updates in evolving environments.
- Clustering or subset selection is decoupled from the optimizer, preventing the formation of an end-to-end feedback loop. Approaches that rely on clustering or subset selection generally treat this procedure as a pre-processing stage, where the derived asset sets are subsequently passed to the optimizer or DRL module for portfolio weight learning Bulani et al. (2025); Peng et al. (2024a); Ma et al. (2024). The two stages neither share parameters nor provide differentiable linkages; thus, the optimizer's objective function or DRL reward signals cannot propagate backward to influence upstream clustering or filtering strategies. This disjunction creates both objective mismatch Trosten et al. (2021) and distribution shift Dulac-Arnold et al. (2021), leading to under-utilization of information Dulac-Arnold et al. (2021), persistent errors Trosten et al. (2021), low sample efficiency Franke et al. (2020), slow convergence Shi et al. (2019), and high sensitivity to hyperparameters Franke et al. (2020), with the risk of becoming trapped in local optima under complex nonconvex objectives Hong et al. (2018). Small perturbations upstream can also amplify downstream decisions, degrading stability and reproducibility Zheng et al. (2016). Without a closed loop, models cannot perform end-to-end uncertainty quantification and calibration, thereby weakening risk awareness and control Kuleshov et al. (2018). It is therefore necessary to design differentiable and feedback-aware coupling mechanisms that align asset clustering with portfolio objectives under a unified optimization paradigm.
- Subset updating suffers from limited scalability. Rule-based dynamic subset selection cannot adaptively optimize with respect to rewards or losses Peng et al. (2024a); Ma et al. (2024), which easily introduces selection bias Chou and Pham (2025). The Hierarchical Reinforced Trader Zhao and Welsch (2024a), designed with a two-level structure of "upper-level stock selection and lower-level execution," illustrates how hierarchical designs can partially alleviate instability and exploration difficulties in high-dimensional action spaces. Such models are generally aligned with the paradigm of hierarchical reinforcement learning (HRL) Barto and Mahadevan (2003b), where complex decision-making tasks are decomposed into layered subproblems to improve learning efficiency and robustness. However, the stock selection logic in these designs remains implicit, relying entirely on black-box policy networks mapping states to actions, and thus lacking explicit modeling of structural market changes Bieganowski and Ślepaczuk (2025). Moreover, commonly adopted alternating or staged training strategies hinder end-to-end information flow across layers, which not only exacerbates stability issues Levy et al. (2017) but also restricts scalability when applied to larger universes or higher-frequency trading environments. To address these challenges, it is essential to replace heuristic and black-box selection with explicit market-structure modeling, differentiable selection and allocation modules, and hierarchies that preserve informational consistency, while embedding capital allocation and risk control into a unified, utility-driven, learnable framework to enhance interpretability and scalability.

In summary, these three deficiencies collectively undermine generalization capability in real-world environments, delay responses to abrupt market transitions, and destabilize risk control. Consequently, dynamic modeling of non-stationary environments, systematic optimization around differentiable feedback loops, and scalable hierarchical architectures are not only critical for performance improvement but also a prerequisite for ensuring the practicality and reliability of knowledge-based systems in financial applications.
To address three major limitations in prevailing portfolio optimization techniques (static clustering, the lack of end-to-end integration between clustering and optimization, and reliance on heuristic subset mechanisms), we propose Segmented Allocation with momentum-adjusted utility for Multi-agent Portfolio management via Hierarchical Deep Reinforcement Learning (SAMP-HDRL), a knowledge-driven hierarchical DRL framework that integrates dynamic clustering with multi-agent decision making. In this framework, assets are dynamically partitioned into two groups through clustering. The upper-level agent operates on the complete asset universe, extracting global representations that capture inter-asset correlations and market-wide dynamics, and provides holistic guidance signals to inform subsequent decisions. The lower-level agents then focus on their respective dynamically assigned groups, allocating portfolio weights under masking constraints to ensure coherent intra-group optimization. An exponential utility function subsequently integrates the outputs of the lower-level agents with historical returns to determine segmented allocations across the two risky asset groups and the risk-free asset. On this basis, momentum adjustment and rebound detection are incorporated into the utility formulation to enhance robustness against persistent market trends and abrupt regime shifts. By design, SAMP-HDRL not only adapts to dynamic changes in market structure but also achieves a principled integration of global feature modeling, localized optimization, and risk-sensitive capital allocation. In doing so, it provides a knowledge-driven and practically scalable solution to portfolio management, directly addressing the fragmentation and rigidity observed in conventional methodologies.
This study does not introduce new neural network architectures; rather, it concentrates on structural innovations within the reinforcement learning framework for portfolio management. The proposed design integrates dynamic asset grouping, coordinated multi-agent allocation, and risk-sensitive utility formulation. By emphasizing structural modeling over architectural modification, the framework ensures that performance improvements originate from principled design mechanisms, thereby enhancing interpretability, robustness, and scalability in practical financial applications.
This paper has four key contributions:

1. Joint modeling of global market signals and localized asset-group decisions. Unlike conventional portfolio optimization methods that either rely on a single global model or isolate group-wise optimization, our framework first performs dynamic asset grouping, then applies an upper-level agent to extract global market representations, and finally deploys lower-level agents to allocate weights within their dynamically assigned clusters under mask constraints. This sequential design (grouping, global modeling, and local allocation) enhances representational richness and enables more precise allocation strategies, which is empirically reflected in improved adaptability and performance stability.
2. Interpretable hierarchical decision mechanism built on dynamic classification. Leveraging dynamic asset classification, we design a hierarchical decision mechanism that eliminates the reliance on rule-based subset updating and implicit black-box selection. By explicitly modeling the grouping and allocation process within a unified learning framework, and by ensuring consistent information flow across layers, the proposed design mitigates instability caused by staged training. This approach enhances interpretability and provides stronger adaptability to structural market changes.
3. An innovative momentum-adjusted utility function for segmented allocation. We advance the utility-based allocation principle by incorporating momentum adjustment Moskowitz et al. (2012) and rebound detection Jegadeesh (1990) into the capital allocation process, which jointly accounts for historical returns, regime dynamics, and risk-free assets. This enriched utility design enhances resilience against abrupt market transitions, offering a novel mechanism for risk-sensitive and knowledge-driven portfolio optimization.
4. Consistent outperformance in diverse markets. SAMP-HDRL demonstrates consistent empirical performance under volatile and non-stationary market regimes. Compared with the strongest baselines, it achieves improvements of at least 5%, 5%, 5%, and 2% in Return, Sharpe Ratio, Sortino Ratio, and Omega Ratio, respectively. These results indicate that SAMP-HDRL not only enhances profitability but also delivers robust risk-adjusted performance, thereby underscoring both its theoretical contribution to reinforcement learning research and its practical significance for portfolio management.
The structure of this paper is as follows. Section 2 reviews the relevant literature and highlights recent developments in the domain of portfolio optimization and intelligent systems. Section 3 outlines the proposed methodology, covering the mathematical foundations, the design of the learning environment, and the architecture of the policy network. Section 4 presents the experimental backtests, where the performance of our framework is evaluated using portfolio value, Sharpe ratio, Sortino ratio, and Omega ratio against traditional strategies and learning-based approaches. Section 5 concludes the study and discusses possible directions for future research.
In the domain of portfolio optimization, prior research primarily focuses on traditional online portfolio strategies. The Constant Rebalanced Portfolio (CRP) Li et al. (2023); Kelly (1956) and Universal Portfolios (UP) Cover (1991) establish the theoretical foundations of online portfolio selection by emphasizing continuous rebalancing and nonparametric universal learning, while methods such as the uniform buy-and-hold strategy (UBAH) Borodin et al. (2000) and the Markov of order zero (M0) model Li and Hoi (2014) further extend these paradigms. To capture inter-asset correlations and dynamic market characteristics, algorithms such as the Exponential Gradient (EG) Helmbold et al. (1998) are developed, enhancing adaptability and resilience. Alternatively, a family of mean-reversion approaches, including Passive Aggressive Mean Reversion (PAMR) Li et al. (2012) and Confidence Weighted Mean Reversion (CWMR) Li et al. (2011b), exploits the mean-reverting tendencies of asset prices to stabilize predictive performance. Hybrid techniques that integrate nonparametric statistics with log-optimal portfolio theory, such as the Correlation-driven Nonparametric Learning Strategy (CORN) Li et al. (2011a), provide flexible mechanisms for leveraging structural dependencies among assets.
Beyond these algorithmic frameworks, the classical Capital Asset Pricing Model (CAPM) Sharpe (1964) remains a cornerstone of quantitative finance, offering a rigorous theoretical basis for asset pricing and investment allocation. Despite their foundational role, traditional strategies exhibit inherent limitations, as they are largely rule-based and struggle to adapt to complex, high-dimensional, and evolving financial environments Xiao et al. (2020). These constraints motivate the development of machine learning and deep reinforcement learning methods, which aim to overcome such rigidity and provide more adaptive and data-driven solutions for portfolio optimization.
In addition to these classical approaches, several alternative methodologies extend portfolio optimization beyond rule-based strategies. For example, the study on stock portfolio selection using Dempster-Shafer evidence theory Mitra Thakur et al. (2018) applies evidence-theoretic reasoning to quantify uncertainty and integrate heterogeneous signals, while fuzzy cross-entropy, mean, variance, and skewness models Bhattacharyya et al. (2014) utilize fuzzy mathematics and higher-order moments to capture non-linear risk characteristics. A hybrid two-stage robustness approach Atta Mills and Anyomi (2022) further formulates portfolio construction as a robust optimization problem under uncertainty, explicitly emphasizing resilience against model misspecification and market shocks. Beyond portfolio allocation, hybrid predictive models such as the efficient hybrid approach for forecasting real-time stock market indices Kalra et al. (2024) and the MMGAN-HPA algorithm for stock price prediction Polamuri et al. (2022) leverage deep generative models and multi-model integration to improve the forecasting accuracy of asset prices. While these methods enrich the methodological landscape of financial modeling, they remain primarily optimization- or prediction-driven, with limited adaptability to high-dimensional and non-stationary portfolio management environments Chou and Pham (2025).
Building on these developments, reinforcement learning (RL) becomes increasingly prominent in portfolio management, with research primarily emphasizing uncertainty modeling, transfer learning, expert priors, neural architecture design, and multi-objective learning, rather than asset structuring or hierarchical decision-making. Kang et al. propose Neural Process Continuous Reinforcement Learning (NPCRL) Kang et al. (2024), which integrates Neural Processes with continuous-action RL to capture market uncertainty and enable dynamic rebalancing in highly volatile environments, achieving a favorable trade-off between return and risk. Representation Transfer Reinforcement Learning (RTRL) Jiang et al. (2024a) addresses limited samples and non-stationarity by migrating features across domains, although its benefits mainly derive from transferability rather than explicit modeling of asset structures. Choi and Kim Choi and Kim (2024) design an expert-infused DRL framework that incorporates tutor policies as priors to guide dynamic allocation, demonstrating that RL agents can outperform expert strategies but focusing on knowledge infusion rather than structural modeling. An early contribution by Yang Yang (2023) demonstrates the feasibility of end-to-end deep RL for finance by mapping price sequences directly to portfolio weights with Convolutional Neural Networks (CNNs) LeCun et al. (1998) and Long Short-Term Memory networks (LSTMs) Hochreiter and Schmidhuber (1997), yet it does not incorporate classification, clustering, or hierarchical structures. Recent advances explore more expressive architectures: DRL-UTrans Yang et al. (2023) combines Transformers and U-Net to capture long-range dependencies and multi-scale patterns in financial time series, while Deep Long Short-Term Memory Q-Learning (DLQL) and its attention-based variant DLAQL Oyewola et al. (2024) integrate LSTMs and attention within a Q-learning framework for applications in the oil and gas sector. Other research extends RL to multi-objective optimization, such as the framework in Xu et al. (2023), which balances return maximization and risk control through reward design, broadening optimization objectives without modeling hierarchical or structural dependencies. Collectively, these studies provide diverse perspectives on portfolio management, yet they share common limitations Yang (2023); Jiang et al. (2023); Fischer and Krauss (2018): insufficient treatment of structural relationships among assets, the absence of explicit classification or clustering mechanisms, and limited adaptability and interpretability in complex and volatile financial environments Ulm et al. (2020). These gaps highlight the need for structural innovations that move beyond network design alone and motivate the hierarchical and clustering-based approach developed in this study.
Beyond conventional portfolio optimization and generic reinforcement learning approaches, a number of studies are more directly aligned with this work. These contributions incorporate asset structuring into portfolio management through mechanisms such as dynamic subset selection, clustering, ranking, or hierarchical organization, thereby following the paradigm of structuring assets prior to portfolio decision-making. For clarity, these studies are organized into three categories, their respective limitations are examined, and the ways in which the proposed model addresses these shortcomings are demonstrated.
In the category of clustering- or subset-based reinforcement learning methods, Dynamic Coreset Construction Peng et al. (2024b) dynamically constructs representative subsets to approximate the overall market distribution, thereby improving the efficiency of online portfolio selection. This mechanism can be regarded as a special form of dynamic clustering. The framework combining K-means Jin and Han (2010), mean-variance optimization, and reinforcement learning Zouaghia et al. (2025) explicitly embodies the idea of "classification followed by decision." The CAD framework Jiang et al. (2023) further extracts asset correlations via clustering and integrates them with deep reinforcement learning for multi-period portfolio management. In a related direction, the clustering-based return prediction model with PSO-CNN+MVF Ashrafzadeh et al. (2023) employs clustering for stock pre-selection, followed by deep return prediction and mean-variance forecasting, thereby reducing noise and enhancing the reliability of portfolio construction. These studies collectively contribute by enhancing portfolio decisions with clustering or grouping mechanisms. However, their clustering processes are mostly static, lacking dynamic interaction with reinforcement learning, which limits adaptability under non-stationary markets. In contrast, our hierarchical multi-agent framework embeds dynamic clustering directly into the training process and assigns distinct roles to lower- and upper-level agents to capture temporal dependencies and cross-asset relationships, thereby achieving stronger adaptability and structural modeling capability.
In the category of ranking-or matching-based reinforcement learning methods, the Stock Ranking and Matching RL approach Alzaman (2025) implicitly forms asset groups through ranking and matching mechanisms, and then optimizes portfolio selection via reinforcement learning. The ASA framework Zhao and Welsch (2024b) employs graph and hypergraph ranking models for stock selection, combined with classification and regression models for weight allocation. This covers the entire process from selection to allocation. These methods share the contribution of introducing ranking or graph structures into portfolio management. Their limitations, however, lie in their reliance on external ranking signals or supervised learning paradigms, which reduces their dynamic adaptability and interactive learning capability, while risk constraints and transaction costs are simplified. In contrast, our method avoids reliance on external signals, directly models asset correlations via dynamic clustering, and achieves interactive and adaptive optimization within a reinforcement learning framework, making it better suited to non-stationary market conditions.
In the category of hierarchical structures, the Hierarchical Reinforced Trader (HRT) Kim et al. (2022) proposes an explicit two-level framework, where the upper level selects assets and the lower level executes transactions, thus introducing hierarchical mechanisms into portfolio management. Its contribution lies in expanding the decision structure of traditional RL. However, its hierarchy mainly targets task partitioning (selection vs. execution) rather than asset-structural partitioning, and therefore remains limited in capturing cross-asset relationships. By contrast, our hierarchical design is tailored for asset classification and cross-asset relationship modeling, with explicit division of roles between levels. This ensures that the hierarchical objectives are tightly coupled with market structure, enabling more effective operation under non-stationary environments.
To more intuitively demonstrate the current research results on extracting potential connections between assets, we summarize the relevant studies and present the results in Table 1.
This section includes definitions of financial markets and portfolios, and relevant mathematical definitions related to market transactions. It provides the theoretical basis for our work.
The financial market represents a complex ecosystem in which a wide range of instruments, such as equities, fixed-income assets, and certificates of deposit, are continuously traded with the aim of generating returns. This study concentrates on the U.S. equity market, which is distinguished by its comprehensive regulatory framework and substantial liquidity, rendering it one of the leading markets worldwide in terms of capitalization and the scale of investor participation Harris and Ravenscraft (1991).
Portfolio management constitutes a systematic investment process aimed at maximizing returns through the optimization of asset allocation. Consider an investor selecting m stocks, where m > 0, with the portfolio adjusted closing price vector at time t represented by the m × 1 column vector $v_t = (v_{1,t}, v_{2,t}, \ldots, v_{m,t})^{\top}$.

For analytical purposes, the investment horizon is partitioned into F discrete intervals, each associated with a period t ∈ N. Period t begins immediately after time t and concludes at time t + 1, corresponding to the open-closed interval (t, t + 1]. The relative price vector of the portfolio, denoted as $z_t$ and having dimension 1 × m, is defined as a function of $v'_t$ (the portfolio price vector at the beginning of period t) and $v_t$ (the price vector at the end of period t):

$$z_t = v_t \oslash v'_t,$$

where ⊘ indicates element-wise division. In view of arguments suggesting that upward price movements should not be regarded as a source of risk, this study utilizes the asset Sortino ratio $z^{\text{Sortino}}$ Rollinger and Hoffman (2013), a 1 × m vector, to quantify asset risk. The Sortino ratio incorporates downside standard deviation, reflecting the perspective that only adverse volatility represents true risk. Unlike the Sharpe ratio, which penalizes both upside and downside volatility, the Sortino ratio concentrates exclusively on downside deviation, thereby providing a risk measure that better aligns with investor preferences and the asymmetric nature of financial return distributions Rollinger and Hoffman (2013). This property makes it particularly suitable for portfolio optimization and for integration into reinforcement learning frameworks, where distinguishing harmful volatility from favorable gains is critical for designing effective risk-aware allocation strategies. Formally, the asset Sortino ratio $z^{\text{Sortino}}$ measures the excess return per unit of downside risk and is defined, per asset, as

$$z^{\text{Sortino}} = \frac{\mathbb{E}[r] - r_A}{\sqrt{\frac{1}{j}\sum_{\tau:\, r_\tau < r_A}\left(r_\tau - r_A\right)^2}},$$

where $r_A$ denotes the minimum acceptable base-2 logarithmic return, specified as the daily logarithmic risk-free rate in this study, $\mathbb{E}[\cdot]$ denotes the expectation, $r_\tau$ is the daily base-2 logarithmic return of the asset, and $j$ is the number of periods during which the daily logarithmic return falls below this threshold. Larger values of $z^{\text{Sortino}}$ indicate stronger potential for asset appreciation.
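For concreteness, the per-asset computation can be sketched in a few lines of NumPy. The function and variable names below are ours, and the small epsilon guarding against division by zero is an implementation convenience; the input is assumed to be a matrix of daily base-2 log returns with assets in columns.

```python
import numpy as np

def asset_sortino(log_returns: np.ndarray, r_a: float) -> np.ndarray:
    """Per-asset Sortino ratio from an (n_periods, m_assets) matrix of
    base-2 log returns; r_a is the daily log risk-free rate, used as the
    minimum acceptable return."""
    excess = log_returns.mean(axis=0) - r_a          # E[r] - r_A per asset
    downside = np.minimum(log_returns - r_a, 0.0)    # only sub-threshold periods count
    downside_dev = np.sqrt((downside ** 2).mean(axis=0))
    return excess / (downside_dev + 1e-12)           # guard against zero downside risk

# Example: 60 days of synthetic returns for 5 assets
rng = np.random.default_rng(0)
z_sortino = asset_sortino(rng.normal(0.0005, 0.01, size=(60, 5)), r_a=0.0001)
```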
For mitigating decision complexity, capturing structural heterogeneity, and adapting to evolving market conditions, this work applies K-means clustering Jin and Han (2010) for dynamic asset classification. K-means is an unsupervised learning algorithm that partitions the dataset into k disjoint clusters by minimizing intra-cluster variance:

$$\mathcal{C} = \arg\min_{C}\;\sum_{i=1}^{k}\sum_{a \in C_i}\left\lVert v_{a,t-1} - v_{a_i,t-1}\right\rVert_1,$$

where $v_{a_i,t-1}$ denotes the adjusted closing price vector of asset $a_i$, which serves as the representative (centroid) of cluster $C_i$, and $\lVert\cdot\rVert_1$ is the $L_1$ norm, here measuring the sum of absolute deviations from the cluster centroid. The parameter k is predetermined; in this study k = 2, corresponding to two asset groups. As noted by Fama and French Fama and French (1993), excessive stratification may fragment portfolios and lead to unstable estimates, which further justifies our parsimonious two-group design. The algorithm iteratively alternates between assigning data points to the nearest centroid and recalculating centroids as the mean of the assigned points, until assignments stabilize or the objective function converges. To balance sensitivity to regime shifts with computational tractability, clustering is re-executed every 75 trading days, which approximately aligns with a quarterly cycle Camanho et al. (2022); Kuhn and Luenberger (2010).
The adoption of K-means clustering offers three distinct benefits. First, it provides a straightforward and computationally efficient approach for categorizing large-scale asset sets Cai et al. (2025). Second, by incorporating Sortino ratio features, the classification emphasizes risk-adjusted returns rather than raw price movements Jin and Han (2010). Third, periodic re-execution enables dynamic restructuring of asset clusters in response to market evolution, thereby strengthening the robustness and adaptability of the overall portfolio optimization process Namitha and Santhosh Kumar (2020).
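A minimal sketch of this dynamic grouping step, assuming scikit-learn's KMeans and a per-asset feature matrix (e.g., Sortino ratios and recent price statistics); the helper names and the in-loop trigger are illustrative, not the paper's exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

REBALANCE_EVERY = 75  # trading days, roughly one quarter

def regroup_assets(features: np.ndarray, k: int = 2, seed: int = 0) -> np.ndarray:
    """Partition assets into k groups from an (m_assets, n_features) matrix."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)

def masks_from_labels(labels: np.ndarray, k: int = 2):
    """Binary, complementary 1 x m masks, one per cluster."""
    return [(labels == g).astype(float) for g in range(k)]

# Inside the backtest loop, re-cluster on a 75-day cadence:
# if day % REBALANCE_EVERY == 0:
#     labels = regroup_assets(feature_matrix)
#     m1, m2 = masks_from_labels(labels)
```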
After applying K-means clustering, the result is expressed as $C = \{C^{(1)}, C^{(2)}\}$, which partitions the asset universe into two mutually exclusive subsets. To embed this clustering outcome within the hierarchical reinforcement learning framework, binary masks $m^{(1)}_{i,t}$ and $m^{(2)}_{i,t}$ are introduced for each cluster, defined as

$$m^{(g)}_{i,t} = \begin{cases} 1, & \text{if asset } i \in C^{(g)}, \\ 0, & \text{otherwise}, \end{cases} \qquad g \in \{1, 2\},$$

where $m^{(1)}_t$ and $m^{(2)}_t$ denote complementary masks of dimension 1 × m, ensuring that only the assets associated with their respective clusters remain active. These masks are then employed to guide the lower-level agents, thereby constraining intra-group allocation to assets within each dynamically identified group. Using $m^{(1)}_{t-1}$, $m^{(2)}_{t-1}$, and the price vector $v_{t-1}$, the masked price vectors $v^{(1)}_{t-1}$ and $v^{(2)}_{t-1}$ corresponding to Group 1 and Group 2 are obtained as

$$v^{(g)}_{t-1} = m^{(g)}_{t-1} \odot v_{t-1}, \qquad g \in \{1, 2\},$$

where ⊙ denotes the Hadamard product Horn (1990). The portfolio price history over the most recent n periods is represented by the matrix $q_{t-1}$, an m × n matrix defined as

$$q_{t-1} = \left[\, v_{t-n} \;\; v_{t-n+1} \;\; \cdots \;\; v_{t-1} \,\right].$$

Analogously, the price matrices $q^{(1)}_{t-1}$ and $q^{(2)}_{t-1}$ corresponding to Groups 1 and 2 are given by applying the masks $m^{(1)}_{t-1}$ and $m^{(2)}_{t-1}$ column-wise to $q_{t-1}$. Based on the stock price matrices $q^{(1)}_{t-1}$, $q^{(2)}_{t-1}$ and the masks $m^{(1)}_{t-1}$, $m^{(2)}_{t-1}$, we obtain the logarithmic relative price matrices $lzq^{(1)}_{t-1}$ and $lzq^{(2)}_{t-1}$.
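The masking operations above reduce to element-wise products, as the sketch below illustrates. The exact formula for the log relative price matrix did not survive extraction, so the base-2 log ratio of consecutive price columns is our assumption, chosen to be consistent with the base-2 convention used throughout.

```python
import numpy as np

def masked_views(q: np.ndarray, masks):
    """Apply each group's 1 x m mask to the m x n price-history matrix q
    via the Hadamard product (mask broadcast across columns)."""
    return [q * m[:, None] for m in masks]  # one m x n matrix per group

def log_relative_prices(q: np.ndarray) -> np.ndarray:
    """Assumed form: base-2 log relative prices, log2(v_t / v_{t-1}),
    computed column-wise over the price history."""
    with np.errstate(divide="ignore", invalid="ignore"):
        lzq = np.log2(q[:, 1:] / q[:, :-1])
    return np.nan_to_num(lzq)  # masked-out assets (zero prices) map to 0
```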
Within the portfolio, the allocation of capital at time t is represented by the weight vector $\omega_{t-1}$, an m × 1 vector whose i-th entry is the fraction of wealth invested in asset i. The initial weight vector is specified as the zero vector of dimension m × 1, indicating that the initial endowment is entirely allocated to cash. At any time t, the portfolio weights satisfy the budget constraint

$$\sum_{i=1}^{m} \omega_{i,t-1} + \omega^{(0)}_{t-1} = 1,$$

where $\omega^{(0)}_{t-1}$ is the weight of the risk-free asset at time t. Correspondingly, the group-specific allocations are denoted by $\omega^{(1)}_{t-1}$ and $\omega^{(2)}_{t-1}$, defined through the masking mechanism:

$$\omega^{(g)}_{t-1} = m^{(g)}_{t-1} \odot \omega_{t-1}, \qquad g \in \{1, 2\}.$$

The portfolio value prior to trading at time t, denoted as $p_{t-1}$, is determined jointly with the share vector $sh_{t-1}$, which represents the number of shares held across assets; $sh^{(1)}_{t-1}$, $sh^{(2)}_{t-1}$ and $h^{(1)}_{t-1}$, $h^{(2)}_{t-1}$ denote the share holdings and residual cash for Groups 1 and 2, respectively, at time t. Portfolio values of Groups 1 and 2 and of the risk-free asset at time t are denoted as $p^{(1)}_{t-1}$, $p^{(2)}_{t-1}$, and $p^{(0)}_{t-1}$, with $\omega^{(1)}_{t-1}$, $\omega^{(2)}_{t-1}$, and $\omega^{(0)}_{t-1}$ the corresponding weight vectors at the beginning of period t − 1. The operator ⊘ specifies element-wise division, while the floor operator ⌊•⌋ enforces integer rounding of the number of shares by truncation. Refer to Figure 1 for a comprehensive illustration of the transaction mechanism. At the end of period t − 1 (i.e., at time t), the price vectors, portfolio values, and allocation weights of the two groups are denoted by $v^{(i)}_{t-1}$, $p^{(i)}_{t-1}$, and $\omega^{(i)}_{t-1}$, $i \in \{1, 2\}$. Once transactions at time t are executed, the updated states of the system are represented as $v^{(i)\prime}_{t}$, $p^{(i)\prime}_{t}$, and $\omega^{(i)\prime}_{t}$. The portfolio is then maintained without additional rebalancing until the conclusion of period t.
Each transaction occurs instantaneously at the decision point. Upon completion, the group-specific states at the beginning of period t, $v^{(1)\prime}_t$ and $v^{(2)\prime}_t$, are obtained from the pre-trade quantities scaled by the factors $u^{(1)}_t$ and $u^{(2)}_t$, which denote the proportional reductions in group values at time t due to transaction costs.

Transaction costs are deducted from group values and denoted by $\mathrm{Cost}^{(1)}_t$ and $\mathrm{Cost}^{(2)}_t$. The fee structure levies a proportional charge at rate $c_s$ on the value traded in each group; following Jiang et al. Jiang et al. (2017), the transaction cost rate $c_s$ is fixed at 0.001. After rebalancing, the residual cash holdings $h^{(1)}_t$ and $h^{(2)}_t$ that remain unallocated to equities are carried forward, where $p_0$ denotes the initial portfolio value. If any allocation leads to $h^{(1)}_t < 0$ or $h^{(2)}_t < 0$, the transaction is proportionally scaled down to guarantee non-negativity of the cash balance, i.e., $h^{(1)}_t, h^{(2)}_t \geq 0$. This mechanism prohibits debt accumulation and enforces that all asset positions are backed by available capital. The same feasibility constraint is consistently applied across all settings to ensure a fair and comparable evaluation.
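A simplified sketch of one group's rebalancing step under these rules: truncation to whole shares, a proportional cost at rate $c_s = 0.001$, and a scale-down whenever cash would turn negative. The iterative 0.99 shrink factor is our implementation choice rather than the paper's, and prices are assumed strictly positive.

```python
import numpy as np

CS = 0.001  # transaction cost rate c_s, following Jiang et al. (2017)

def rebalance_group(prices, target_weights, old_shares, cash):
    """Rebalance one asset group: whole shares via truncation, proportional
    cost on traded value, and trade shrinkage until residual cash >= 0.
    `prices` is assumed strictly positive for the group's active assets."""
    wealth = old_shares @ prices + cash
    shares = np.floor(target_weights * wealth / prices)        # floor -> integer shares
    cost = CS * (np.abs(shares - old_shares) * prices).sum()   # fee on traded value
    new_cash = wealth - shares @ prices - cost
    while new_cash < 0:                                        # enforce h_t >= 0
        shares = np.floor(shares * 0.99)                       # shrink proportionally
        cost = CS * (np.abs(shares - old_shares) * prices).sum()
        new_cash = wealth - shares @ prices - cost
    return shares, new_cash
```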
The value of the portfolio at the beginning of period t, denoted $p'_t$, is given by the sum of the two group values and the risk-free position, $p'_t = p^{(1)\prime}_t + p^{(2)\prime}_t + p^{(0)\prime}_t$. The transaction return in period t is quantified through the base-2 logarithmic rate of return $\varphi_t$, with $\varphi^{(1)}_t$ and $\varphi^{(2)}_t$ denoting the group-specific returns. This iterative process continues for all periods, and the final portfolio value $p_f$ is expressed as

$$p_f = p_0 \cdot 2^{\sum_{t=1}^{F} \varphi_t},$$

with F indicating the total number of investment periods and $p_0$ the initial capital. In this study, the initial capital is set to 1 million dollars. The weights $\omega^{(1)\prime}_t$ and $\omega^{(2)\prime}_t$ remain unchanged throughout period t. At the end of period t, the realized market state is expressed by the price vectors, group values, and weights $v^{(i)}_t$, $p^{(i)}_t$, and $\omega^{(i)}_t$. The relative change in asset prices during period t is captured by the relative price vector $z_t$.
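Because the returns are base-2 logarithmic, terminal wealth compounds as a power of two; the short check below illustrates the convention (function name is ours).

```python
import numpy as np

def final_value(p0: float, log_returns: np.ndarray) -> float:
    """Compound base-2 log returns phi_t over F periods: p_f = p0 * 2**sum(phi)."""
    return p0 * 2.0 ** log_returns.sum()

# +100% followed by -50% nets out to the starting capital
assert np.isclose(final_value(1e6, np.array([1.0, -1.0])), 1e6)
```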
The analysis is based on two foundational assumptions, denoted as A1 and A2.
A1: The market possesses sufficient liquidity to enable the prompt execution of each transaction.
A2: Transactions are assumed to have no impact on stock prices.

According to Assumption A1, if a stock exhibits sufficiently high liquidity such that its trading volume exceeds the required execution size, all proposed trades can be executed without delay. Assumption A1 is essential to the framework, as daily market volume cannot be forecast with precision. In practical markets, large transaction volumes can influence stock prices because both buying and selling activities convey investors' sentiments and are reflected in price fluctuations. Therefore, Assumption A2 is introduced to abstract from this price-impact effect and maintain model tractability. Taken together, these two assumptions hold reasonably well when the selected stocks exhibit sufficiently high liquidity Jiang et al. (2017).
Although a transaction cost of 0.1% per transaction is incorporated in the experimental design to simulate basic market frictions Jiang et al. (2017), several critical aspects of real-world trading, such as slippage, order book depth constraints, and market impact, remain insufficiently modeled Kyle (1985). Under such conditions, applying the proposed strategy in live trading environments may result in execution price deviations Kyle (1985) and partial fills Lo et al. (2002), which significantly deteriorate realized returns compared to backtest results. Furthermore, the multi-asset coordinated rebalancing paths learned by the agent during training may become infeasible due to real-world execution constraints, thereby impairing portfolio optimization effectiveness and overall strategy stability. Nevertheless, the proposed framework is equipped to jointly capture both long-term trends and short-term market dynamics, enabling it to adaptively modulate trading frequency based on prevailing market regimes. This multi-timescale perception mechanism offers greater flexibility and robustness compared to strategies relying solely on single-horizon signals, contributing to improved resilience against execution errors and performance degradation induced by market frictions Wei et al. (2021). It is worth noting that, although these idealized assumptions may lead to systematic overestimation of the absolute performance across all evaluated strategies, each approach, ranging from traditional rule-based baselines to alternative reinforcement learning frameworks, is assessed under the same experimental conditions. Therefore, the simplification applies uniformly across methods, ensuring the relative fairness of performance comparisons and preserving the validity of the observed inter-strategy rankings.
The proposed structure combines the continuous-control capability of the Deep Deterministic Policy Gradient (DDPG) Lillicrap et al. (2015a) algorithm with the task-decomposition advantage of HRL Barto and Mahadevan (2003b). The upper-level agent provides stable global representations and guidance, while the lower-level agents perform fine-grained allocation within dynamically defined asset groups. Through this design, the framework establishes coherent linkages between global and local perspectives, thereby enhancing feature utilization, allocation granularity, and interpretability. The overall workflow of the proposed architecture is illustrated in Figure 2.
Hierarchical reinforcement learning decomposes and coordinates complex tasks through the cooperation of upper-and lower-level agents Barto and Mahadevan (2003b). In the proposed design, the upper-level agent (Agent 0) extracts global market signals and is then kept fixed to provide a stable supervisory interface, while the lower-level agents (Agent 1 and Agent 2) operate under dynamic masking constraints to allocate weights across ordinary and high-quality asset groups. This structure enables simultaneous modeling of global and local decision information, thereby enhancing representational richness and refining allocation granularity. A fusion mechanism then integrates the outputs to produce the final portfolio weights, ensuring that alongside concentrated investment in high-quality assets, the inclusion of remaining assets provides diversification benefits that reduce overall portfolio risk Sharpe (1964).
This design improves interpretability of the decision process and ensures coherent coordination between global and local decision modules.
In reinforcement learning (RL), the interaction between the agent and the environment follows a closed-loop process, as illustrated in Figure 3. At each decision step t, the agent observes the environment through the state s t , generates an action a t based on its policy, and applies this action to the environment. The environment then responds by updating the system dynamics and returning both a new state s t+1 and a scalar reward r t+1 . This iterative process allows the agent to optimize its policy so as to maximize cumulative rewards over time De Asis et al. (2018). In the context of portfolio optimization, the environment corresponds to the financial market, the state encapsulates observable market information and portfolio status, the action represents allocation decisions across assets, and the reward reflects investment performance Jiang et al. (2017). The following subsections elaborate on the detailed design of state, action, and reward in this study.
In reinforcement learning, the state s t refers to the complete information observed by the agent from the environment at time t, which provides the basis for decisionmaking. As illustrated in the standard agent-environment interaction loop, the environment updates the portfolio value after receiving the action a t and returns the new state s t+1 together with the immediate reward r t+1 . The construction of the state thus determines the agent’s perception of the market environment Mnih et al. (2015).
Fig. 3: Closed-loop interaction between agent and environment in reinforcement learning, where the agent observes the state $s_t$, takes action $a_t$, and receives the reward $r_{t+1}$ together with the next state $s_{t+1}$.
In the portfolio optimization problem, the environment corresponds to the financial market, and the state should capture market dynamics, portfolio positions, and structural constraints. In this study, the state at time t is defined as

$$s_t = \left(q_{t-1},\; sh^{(1)}_{t-1},\; sh^{(2)}_{t-1},\; m^{(1)}_{t-1},\; m^{(2)}_{t-1}\right),$$

where:
- $q_{t-1} \in \mathbb{R}^{m \times n}$: the price history matrix of the past n days, reflecting the temporal evolution of the market;
- $sh^{(1)}_{t-1}, sh^{(2)}_{t-1}$: the portfolio positions of the two asset groups before the current transaction, ensuring that decisions are conditioned on existing holdings rather than reinitialized at each step;
- $m^{(1)}_{t-1}, m^{(2)}_{t-1} \in \{0, 1\}^m$: binary masks generated by dynamic clustering, indicating the tradable subsets at time t.

This state representation is shared across all three agents (one upper-level agent and two lower-level agents), thereby providing a unified market perception. Such a shared-state design ensures consistent information flow throughout the hierarchical framework, while allowing each agent to concentrate on its designated decision-making responsibility Lowe et al. (2017). By jointly encoding temporal information, path-dependent holdings, and dynamically evolving market structures, this formulation enhances adaptability and robustness in non-stationary financial environments.
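One way to assemble this shared state, assuming the price series is stored as an m × T array; the dataclass and field names are ours, for illustration only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    """Shared observation for the upper-level and both lower-level agents."""
    q: np.ndarray    # m x n price-history matrix q_{t-1}
    sh1: np.ndarray  # current share holdings, group 1
    sh2: np.ndarray  # current share holdings, group 2
    m1: np.ndarray   # binary mask for group 1 (from dynamic clustering)
    m2: np.ndarray   # binary mask for group 2 (complement of m1)

def make_state(prices: np.ndarray, t: int, n: int, sh1, sh2, m1, m2) -> State:
    """Slice the last n columns of the price series ending at time t-1."""
    return State(prices[:, t - n:t], sh1, sh2, m1, m2)
```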
In reinforcement learning, the action $a_t$ denotes the agent's policy output under state $s_t$, which directly influences portfolio evolution and determines future returns. In the portfolio optimization setting, the action corresponds to the allocation weights assigned to assets. At time t, the actions $a^{(1)}_t$ and $a^{(2)}_t$ of the two lower-level agents are defined as the intra-group weight vectors $\omega^{(1)\prime}_t$ and $\omega^{(2)\prime}_t$. Here, $\omega^{(1)\prime}_t$ is nonzero only for assets indicated by $m^{(1)}_t$, and $\omega^{(2)\prime}_t$ is nonzero only for assets indicated by $m^{(2)}_t$. Together, they satisfy the budget constraint

$$\mathbf{1}^{\top}\omega^{(1)\prime}_t + \mathbf{1}^{\top}\omega^{(2)\prime}_t + \omega^{(0)\prime}_t = 1,$$

where $\omega^{(0)\prime}_t$ is the weight of the risk-free asset. This formulation decomposes the global allocation problem into two structured subproblems, allowing the agent to optimize allocations within each group while maintaining overall capital conservation.
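A sketch of one way to produce masked, budget-feasible weights from raw agent outputs. The exponentiate-then-normalize scheme here is our assumption, shown only to make the constraint concrete; the paper's agents use masked softmax per group plus a utility-based capital split described later.

```python
import numpy as np

def normalize_actions(w1_raw, w2_raw, m1, m2, w_rf):
    """Mask each group's raw scores, exponentiate, and rescale so that the
    two risky groups plus the risk-free weight w_rf sum to one."""
    w1 = np.exp(w1_raw) * m1              # zero outside group 1
    w2 = np.exp(w2_raw) * m2              # zero outside group 2
    total = w1.sum() + w2.sum() + 1e-12   # guard against an all-masked edge case
    risky_budget = 1.0 - w_rf             # capital left for risky assets
    return w1 / total * risky_budget, w2 / total * risky_budget
```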
In reinforcement learning, the reward $R_t$ provides the immediate feedback signal that drives policy optimization De Asis et al. (2018). In the proposed hierarchical framework, the asset universe is dynamically partitioned into two subsets through K-means clustering Jin and Han (2010), where the clustering is periodically executed on rolling windows of asset returns to capture regime shifts and structural changes in the market. For each group $i \in \{1, 2\}$, the instantaneous reward $R^{(i)}_t$ is based on the group's base-2 logarithmic return $\varphi^{(i)}_t$. To stabilize the learning process, this base reward is scaled and penalized by realized volatility,

$$R^{(i)}_t = \kappa\left(\varphi^{(i)}_t - \beta^{\mathrm{adj}}_{t-1}\,\sigma^{(i)}_{t-1}\right),$$

where $\kappa$ is a positive scaling coefficient introduced to amplify the reward signal, ensuring sufficient gradient magnitude during training and stabilizing policy optimization Mnih et al. (2015). The term $\sigma^{(i)}_{t-1}$ denotes the standard deviation of historical log-returns for group i at time t, and $N_{t-1}$ represents the number of past observations available at time t. The adaptive risk-aversion coefficient $\beta^{\mathrm{adj}}_{t-1}$ is then defined as

$$\beta^{\mathrm{adj}}_{t-1} = \beta \cdot \frac{N_{t-1}}{\eta},$$

where $\beta$ is a fixed hyperparameter set to 0.2 in this study and $\eta$ is a normalization constant representing the baseline number of observations Moody and Saffell (2001). Here, $\beta$ serves as the baseline risk-aversion factor that regulates the strength of volatility penalization in the reward function: a smaller $\beta$ prioritizes return maximization with limited regard for risk, whereas a larger $\beta$ enforces more conservative behavior by amplifying the influence of volatility Bisi et al. (2019). Multiplication by $N_{t-1}/\eta$ allows the overall coefficient $\beta^{\mathrm{adj}}_{t-1}$ to increase as the number of observations grows, thereby reducing the weight of risk penalties when estimates are noisy in early stages and strengthening robustness as more reliable information becomes available Bisi et al. (2019).
The group-wise reward formulation establishes a one-to-one correspondence between each reward signal and the agent responsible for its designated asset group. Specifically, $R^{(1)}_t$ is assigned to agent 1 managing asset group 1, while $R^{(2)}_t$ is assigned to agent 2 managing asset group 2. This design ensures that the optimization objectives remain consistent with the structural decomposition of the market, while simultaneously enhancing interpretability, since the contribution of each asset group to policy learning can be explicitly identified.
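Putting the pieces together, a hedged sketch of the group reward: $\beta = 0.2$ follows the text, while the values of $\kappa$ and $\eta$ and the exact additive composition of return and volatility penalty are our assumptions, consistent with the definitions above.

```python
def group_reward(phi_i: float, sigma_i: float, n_obs: int,
                 kappa: float = 10.0, beta: float = 0.2, eta: float = 252.0) -> float:
    """Volatility-penalized group reward: the base log-return phi_i is scaled
    by kappa and penalized by sigma_i via the adaptive coefficient
    beta_adj = beta * n_obs / eta, which grows as estimates become reliable."""
    beta_adj = beta * n_obs / eta
    return kappa * (phi_i - beta_adj * sigma_i)
```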
To introduce structured constraints into the hierarchical reinforcement learning framework, this study applies the K-means algorithm Jin and Han (2010) to partition assets into two groups: Group 1 (high-quality assets) and Group 2 (ordinary assets).
For each group, a binary mask vector is constructed to restrict the input space of the lower-level agents. For the asset universe at time t, the clustering result is denoted as $C_t = \{C^{(1)}_t, C^{(2)}_t\}$. The corresponding masks are defined as

$$m^{(1)}_{j,t} = \begin{cases} 1, & \text{if asset } j \text{ belongs to Group 1 (high-quality)}, \\ 0, & \text{otherwise}, \end{cases} \qquad m^{(2)}_{j,t} = \begin{cases} 1, & \text{if asset } j \text{ belongs to Group 2 (ordinary)}, \\ 0, & \text{otherwise}. \end{cases}$$

At any time t, the masks are complementary, satisfying $m^{(1)}_{t-1} + m^{(2)}_{t-1} = \mathbf{1}$. Based on these masks, the subset input logarithmic return matrices $z^{(1)}_{t-1}$ and $z^{(2)}_{t-1}$ are obtained. Correspondingly, the lower-level agents output the group-specific weight vectors $\omega^{(1)}_{t-1}$ and $\omega^{(2)}_{t-1}$, which allocate capital within Group 1 and Group 2, respectively.
Through this clustering and mask mechanism, the model conducts group-wise learning, ensuring that each lower-level agent operates strictly within its designated subset of assets. This design reduces the complexity of the action space and establishes a clear structural constraint for subsequent weight fusion and global decision-making Walauskis and Khoshgoftaar (2025); Peng et al. (2025).
DDPG is a canonical actor-critic based deep reinforcement learning approach that is suitable for control tasks in continuous action spaces Lillicrap et al. (2015a). In this framework, the actor network outputs deterministic actions, such as portfolio weights, while the critic network evaluates state-action pairs and updates parameters through policy gradients. Because DDPG directly models continuous numerical outputs, it aligns with portfolio optimization, where weight adjustments are inherently continuous.
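For reference, a compact PyTorch sketch of a generic DDPG update of the kind the framework builds on (not the paper's exact networks, which are transformer- and SLaK-based): a deterministic actor mapping states to simplex-constrained weights, a critic on state-action pairs, and soft target updates. Network sizes and hyperparameters here are illustrative.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: flattened state -> portfolio weights."""
    def __init__(self, state_dim: int, n_assets: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_assets))
    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)  # weights sum to 1

class Critic(nn.Module):
    """State-action value Q(s, a)."""
    def __init__(self, state_dim: int, n_assets: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + n_assets, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def ddpg_step(actor, critic, target_actor, target_critic,
              opt_a, opt_c, batch, gamma=0.99, tau=0.005):
    """One DDPG update from a replay batch (s, a, r, s2), r shaped (B, 1)."""
    s, a, r, s2 = batch
    with torch.no_grad():
        y = r + gamma * target_critic(s2, target_actor(s2))   # TD target
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    actor_loss = -critic(s, actor(s)).mean()                  # deterministic policy gradient
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)          # soft target update
```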
The upper-level agent employs a transformer-based architecture validated in prior financial applications Ren et al. (2025). This choice underscores that our contribution lies in structural innovation, while relying on established networks for function approximation. The transformer-based model effectively captures temporal dependencies in financial time series and cross-sectional correlations among assets Vaswani et al. (2017); compared with conventional recurrent or convolutional networks, it exhibits stronger capability in extracting global market features. The spatial interaction and hierarchical relationship between the upper-level and lower-level agents are illustrated in Figure 2. As depicted, clustered and masked states are processed by the upper-level and lower-level agents in parallel, and a fusion module integrates their outputs into the final portfolio weights, preserving both focus and diversification Sharpe (1964). As illustrated in Fig. 4, the hierarchical reinforcement learning framework comprises an upper-level agent and two lower-level agents that interact through structured information exchange. It is particularly noteworthy that the upper-level agent directly receives the complete price information $q_{t-1}$, enabling it to capture inter-asset correlation structures and price dynamics. Its implicit output, denoted as $\omega^{(0)\prime}_t$, serves as a global guidance signal that encapsulates this information and is subsequently provided to the lower-level agents for coherent coordination. In contrast, the lower-level agents operate on the logarithmic return matrices under mask constraints, $z^{(1)}_{t-1}$ and $z^{(2)}_{t-1}$, together with the guidance signal $\omega^{(0)\prime}_t$ obtained from the upper-level agent. The focus of the lower-level agents is confined to intra-group weight allocation, which ensures coherence between global objectives and local optimization processes; as a result, they lack the capacity to represent global cross-asset relationships on their own. Through this hierarchical design, the upper-level agent supplements the information unavailable to the lower-level agents and provides guidance signals for more coherent decision making.
As shown in Figure 5, lower-level agents 1 and 2 receive as input the masked logarithmic return matrices $q^{(1)}_{t-1}$ and $q^{(2)}_{t-1}$, corresponding to the high-quality asset group and the ordinary asset group, respectively. Both agents share an identical network architecture, with differences arising solely from the masks $m^{(1)}_{t-1}$ and $m^{(2)}_{t-1}$, which define the effective input domain by filtering out irrelevant assets and thereby reducing the dimensionality of the allocation space Shen and Shafiq (2020). The network consists of three main components:

- Guidance-driven gating. The global signal $\omega^{(0)\prime}_t$ produced by the upper-level agent is first transformed through nonlinear mappings to generate gating coefficients. These coefficients perform element-wise multiplication with the input matrices, allowing the lower-level agents to incorporate global market guidance in their intra-group allocation and avoid relying solely on local patterns.

- Sparse Large Kernel Network (SLaK) backbone. The gated return matrix is processed by a SLaK convolutional backbone adopted from Liu et al. (2022). SLaK employs sparsity-enabled kernels with receptive fields extending beyond 51×51, allowing the network to jointly model temporal dependencies and intra-group correlations. This design substantially enhances representational capacity while preserving computational tractability. The resulting logits are linearly projected and combined with the group masks $m^{(1)}_{t-1}$ and $m^{(2)}_{t-1}$ to filter valid entries, followed by a softmax operation that produces the intra-group weight distributions $\omega^{(1)\prime}_t$ and $\omega^{(2)\prime}_t$.

- Fusion strategy within lower-level agents. The softmax-based intra-group weights are further combined with the upper-level guidance signal. Specifically, the masked and normalized output of the upper-level agent is computed first; it represents a group-wise prior distribution. This prior is integrated with the intra-group predictions $\omega^{(i)\prime}_t$, where $i \in \{1, 2\}$, through a convex combination, that is, a linear combination with non-negative coefficients summing to one. The combination ratio is controlled by a learnable coefficient $\lambda$, constrained to (0, 1) via a sigmoid function, thereby enabling adaptive adjustment between reliance on prior knowledge and data-driven predictions. The fused intra-group allocation is expressed as

$$\omega^{(i)\prime}_t = \lambda\,\tilde{\omega}^{(i)}_t + (1 - \lambda)\,\bar{\omega}^{(0,i)}_t,$$

where $\tilde{\omega}^{(i)}_t$ denotes the softmax-based intra-group prediction and $\bar{\omega}^{(0,i)}_t$ the group-wise prior derived from the upper-level agent.

It should be emphasized that the outputs $\omega^{(1)\prime}_t$ and $\omega^{(2)\prime}_t$ represent only relative intra-group weights. The final portfolio weights are determined in the subsequent capital allocation and cross-group fusion mechanism.
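A minimal PyTorch sketch of this convex fusion with a learnable λ; the module and variable names are ours.

```python
import torch
import torch.nn as nn

class GuidedFusion(nn.Module):
    """Convex combination of intra-group weights with the upper-level prior,
    with the mixing ratio lambda learned through a sigmoid."""
    def __init__(self):
        super().__init__()
        self.lam_raw = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, w_intra, w_prior, mask):
        lam = torch.sigmoid(self.lam_raw)            # lambda constrained to (0, 1)
        fused = lam * w_intra + (1 - lam) * w_prior  # convex combination
        fused = fused * mask                         # keep only in-group assets
        return fused / fused.sum(dim=-1, keepdim=True).clamp_min(1e-12)
```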
The lower-level critic network adopts a transformer-based architecture for value estimation as introduced in Ren et al. (2025). In this setting, the critic encodes state-action pairs and learns the corresponding value function, thereby providing optimization signals for actor updates. The principal modification lies in the incorporation of masking constraints at the input stage. Specifically, for each group $i \in \{1, 2\}$, the state $z^{(i)}_{t-1}$ and the action $\omega^{(i)\prime}_t$ are element-wise multiplied with the corresponding mask $m^{(i)}_{t-1}$ before being fed into the critic, ensuring that value estimation remains confined to the designated asset subset. This design prevents cross-group interference and preserves group-specific independence. Two advantages arise from this modification. First, it maintains coherence with the hierarchical framework, in which the upper-level agent extracts global signals while the lower-level critic performs localized value assessment Barto and Mahadevan (2003c). Second, it enforces consistency between value estimation and dynamic classification, thereby reinforcing the logical closure of the hierarchical decision process Yang et al. (2020a).
After obtaining the intra-agent weights $\omega^{a(1)\prime}_t$ and $\omega^{a(2)\prime}_t$, we integrate them with classical portfolio selection techniques to determine the allocation of investment capital among the risk-free asset, Group 1, and Group 2. Within the domain of portfolio theory, the mean-variance framework represents one of the most influential and widely applied methodologies. Subsequent developments extend this paradigm by formulating portfolio selection as the maximization of an expected utility function, a perspective that generalizes the trade-off between risk and return. When the utility function is quadratic, the problem of maximizing expected utility can be equivalently expressed as minimizing portfolio variance while maximizing expected return, as discussed in Bodnar et al. (2015a).
In this formulation, the investor seeks to optimize the expected utility of terminal wealth, with preferences encoded through a chosen utility function Çanakoğlu and Özekici (2009). Standard measures of risk aversion are well established in the literature Pratt (1978); Arrow (1996), providing formal foundations for modeling investor behavior. Moreover, Bertsekas et al. (2011) analyzes a class of utility functions and derives optimal multi-period policies under dynamic settings. Leveraging these theoretical foundations, section 4.6 formulates an exponential utility framework to represent investor preferences and derive optimal allocations among the risk-free asset, Group 1, and Group 2.
Prior to addressing the primary problem, it is essential to establish some notational conventions. At the commencement of period t, an investor possessing an endowment represented by $W_t$ is tasked with determining the optimal portfolio weights $\omega'_t$ to allocate towards risky assets, so as to maximize the expected utility of wealth in the subsequent period, while investing the remainder, represented by $1 - \omega'_t$, in a risk-free asset. The wealth function of these assets can be articulated as follows:

$$W_{t+1} = W_t\left(1 + r_A + \omega_t^{\prime\top}\left(R_t - \mathbf{1}\, r_A\right)\right).$$

This formulation encompasses both the risk-free rate and the adjustments for risky asset returns, thereby facilitating a comprehensive analysis of wealth dynamics in the investment strategy. Therefore, the portfolio selection problem is

$$\max_{\omega'_t}\; \mathbb{E}\left[U\left(W_{t+1}\right)\right].$$
In many instances, obtaining an analytical solution to the expected utility maximization problem is notably challenging, prompting numerous scholars to depend on numerical methods, as referenced in Brandt and Santa-Clara (2006) and Çanakoğlu and Özekici (2009). Nevertheless, a significant number of researchers have successfully derived analytical solutions for various scenarios of the expected utility maximization problem, particularly when the utility function is exponential, as indicated in Bodnar et al. (2015b) and Rásonyi and Sayit (2025).
In the subsequent section, we will articulate the primary contributions derived from our research. In this study, we do not investigate the relationship between optimal portfolio weights and wealth. Consequently, we set the wealth level at W t = 1. Our utility function is expressed as follows:
Assuming that the log-returns follow a Normal distribution, the expression $R_t - \mathbf{1}r_A$ is also Normally distributed. As a result, the wealth function can be redefined as follows:

$$W_{t+1} = 1 + r_A + \omega'^\top_t\left(\mu - \mathbf{1}r_A\right) + \sqrt{\omega'^\top_t \Sigma\, \omega'_t}\; N(0,1),$$

where $\mu$ represents the expected value of log-returns at period t, $\Sigma$ denotes the covariance of log-returns at period t, and $N(0,1)$ is interpreted as the standard normal distribution. Therefore, the expectation of the exponential utility function can be derived as follows:

$$\mathbb{E}\left[U\!\left(W_{t+1}\right)\right] = -\int_{-\infty}^{\infty} \exp\!\left\{-\alpha\left[1 + r_A + \omega'^\top_t\left(\mu - \mathbf{1}r_A\right) + \sqrt{\omega'^\top_t \Sigma\, \omega'_t}\; x\right]\right\}\varphi(x)\,dx,$$
where φ(x) denotes the density function of the standard normal distribution.
To reformulate the expected exponential utility function, we utilize the Fourier transform of the Gaussian density, which yields the closed-form expression

$$\mathbb{E}\left[U\!\left(W_{t+1}\right)\right] = -\exp\!\left\{-\alpha\left[1 + r_A + \omega'^\top_t\left(\mu - \mathbf{1}r_A\right)\right] + \frac{\alpha^2}{2}\,\omega'^\top_t \Sigma\, \omega'_t\right\}.$$
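The identity above follows from the moment-generating function of the Gaussian term; writing $\sigma_p = \sqrt{\omega'^\top_t \Sigma\, \omega'_t}$ and completing the square gives, for $Z \sim N(0,1)$,

$$\mathbb{E}\left[e^{-\alpha\sigma_p Z}\right] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\alpha\sigma_p x - x^2/2}\,dx = e^{\alpha^2\sigma_p^2/2}\cdot\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\left(x + \alpha\sigma_p\right)^2/2}\,dx = e^{\alpha^2\sigma_p^2/2},$$

which, combined with the deterministic part of $W_{t+1}$, yields the closed form above.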
Given that $\alpha > 0$, the leading constant in the objective is negative, so the objective is decreasing in the exponent and maximizing expected utility is equivalent to minimizing that exponent. Therefore, this optimization problem can be reformulated as a quadratic programming problem:

$$\min_{\omega'_t}\ \frac{\alpha}{2}\,\omega'^\top_t \Sigma\, \omega'_t - \omega'^\top_t\left(\mu - \mathbf{1}r_A\right).$$
It is important to note that the initial problem presents significant challenges for analytical resolution. Furthermore, when the dimensionality of the stock universe is exceedingly high, obtaining a numerical solution becomes exceptionally arduous. In this context, we define a hyperplane characterized by the equation $\omega'^\top_t\left(\mu - \mathbf{1}r_A\right) = -c$. Restricting the search to the optimal solution on this hyperplane yields enhanced efficiency. Additionally, it is feasible to modify the value of c in response to fluctuating market conditions; this adjustment repositions the hyperplane, facilitating the identification of the optimal solution in relation to it.
The solution of this problem can be derived using the Lagrangian method Jin et al. (2008). Accordingly, the Lagrangian function is defined as follows:

$$\mathcal{L}\left(\omega'_t, \lambda\right) = \frac{\alpha}{2}\,\omega'^\top_t \Sigma\, \omega'_t - \lambda\left[\omega'^\top_t\left(\mu - \mathbf{1}r_A\right) + c\right],$$

where $\lambda$ denotes the Lagrangian multiplier. By applying the first-order conditions (FOC), one arrives at a system of equations articulated as follows:

$$\alpha\,\Sigma\,\omega'_t - \lambda\left(\mu - \mathbf{1}r_A\right) = 0, \qquad \omega'^\top_t\left(\mu - \mathbf{1}r_A\right) + c = 0.$$

In analyzing the system of equations for a specified constant c, we denote the resultant solution as $\omega'^c_t$, which can be expressed in the following manner:

$$\omega'^c_t = -\frac{c}{B}\,\Sigma^{-1}\left(\mu - \mathbf{1}r_A\right) = -\frac{c}{B}\,A,$$
where $B = \left(\mu - \mathbf{1}r_A\right)^\top \Sigma^{-1} \left(\mu - \mathbf{1}r_A\right)$ and $A = \Sigma^{-1}\left(\mu - \mathbf{1}r_A\right)$. From the analysis presented above, it can be observed that an increase in the parameter c results in a decrease in the allocation to risky assets, $\omega'^c_t$. This indicates that c serves as a lever to manage investment exposure to risk. Consequently, in a bearish market environment, it is advisable to decrease the value of c. Conversely, during a bullish market, increasing the value of c would be advisable.
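As an illustration, this closed-form solution can be computed in a few lines; the following NumPy sketch assumes period-t estimates of $\mu$, $\Sigma$, and $r_A$ are available, and its inputs are toy values rather than the paper's data.

```python
import numpy as np

def hyperplane_solution(mu, sigma, r_a, c):
    """Closed-form weights under the constraint w'(mu - 1*r_A) = -c.

    mu:    (n,) expected log-returns for period t
    sigma: (n, n) covariance matrix of log-returns
    r_a:   risk-free log-rate (scalar)
    c:     exposure lever, adjusted with the market regime
    """
    excess = mu - r_a                    # mu - 1 * r_A
    A = np.linalg.solve(sigma, excess)   # A = Sigma^{-1} (mu - 1 r_A)
    B = excess @ A                       # B = (mu - 1 r_A)' Sigma^{-1} (mu - 1 r_A)
    return -(c / B) * A                  # omega_t'^c = -(c / B) A

# toy example with three assets
mu = np.array([0.0012, 0.0008, 0.0015])
sigma = np.diag([0.0004, 0.0003, 0.0005])
weights = hyperplane_solution(mu, sigma, r_a=0.0001, c=0.01)
```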
To capture the momentum of the recent market, we implement a momentum strength parameter Moskowitz et al. (2012). Initially, we utilize the log-returns from three distinct periods as the foundation for assessing momentum strength, which we define as the $L_2$ norm of the mean log-returns over these periods. Consequently, we introduce a new parameter $\beta$, represented mathematically as follows:

$$\beta = \frac{\left\lVert \bar{R} \right\rVert_2 - l}{h - l},$$

where $\bar{R}$ refers to the mean log-returns, l is the lowest value, and h is the highest value. The allocation of weights to risky assets is then expressed by the equation $\omega'^{*}_t = \beta\,\omega'^{c}_t\!\left(\bar{R}_{t-3:t}\right)$. This formulation allows for a structured approach to investment strategy based on identified momentum dynamics.
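Under the min-max reading of l and h adopted above, the momentum scaling can be sketched as follows; the clipping to [0, 1] and the helper name are assumptions for illustration.

```python
import numpy as np

def momentum_strength(mean_log_returns, l, h):
    """beta: L2 norm of the recent mean log-returns, min-max scaled
    between the lowest (l) and highest (h) observed values (assumed)."""
    strength = np.linalg.norm(mean_log_returns)
    return float(np.clip((strength - l) / (h - l), 0.0, 1.0))

# the baseline risky weights are then scaled, e.g. w_star = beta * w_c
```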
To complement this momentum-based adjustment, we further incorporate a rebound detection mechanism Jegadeesh (1990). The objective is to distinguish genuine rebounds following a significant decline from mere technical recoveries that may occur during an ongoing downtrend, thereby avoiding heavy allocations to the latter. Let $\bar{R}_{t-m:t}$ denote the recent mean log-return for a group of assets. A rebound is identified if two conditions are simultaneously satisfied:

- Downtrend condition: the mean return over the preceding m periods falls below a negative threshold $\theta_d$, representing a sustained decline: $\bar{R}_{t-m:t-1} < \theta_d$.
- Rebound condition: the most recent one or two periods exhibit returns above a positive threshold $\theta_u$: $\bar{R}_{t-1:t} > \theta_u$.
Binary indicators of rebounds are computed for each asset group, rebound_flags = $[\text{rebound}_0, \text{rebound}_1]$. These flags act as validation signals for whether the portfolio should consider adjustments beyond the baseline momentum allocation. To integrate multiple sources of portfolio information while maintaining stability, we implement a selective fusion mechanism.
The baseline weights $\omega'^{*}_t$ are integrated with the generated global masked information, denoted $\omega^{(0)i}_t$, specifically for asset groups that exhibit confirmed rebounds. Formally, for each group i, the updated weights are defined as follows:

$$\omega^{new(i)\prime}_t = \begin{cases} (1-\eta)\,\omega'^{*}_t + \eta\,\omega^{(0)i}_t, & \text{if } \text{rebound}_i = 1,\\ \omega'^{*}_t, & \text{otherwise,} \end{cases}$$

where $\eta \in [0, 1]$ dictates the intensity of the blending process. The agent weights are calculated by selectively masking the relevant assets within each group and aggregating their individual actions.
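A minimal sketch of the rebound gating and selective fusion just defined is given below; the threshold values, window lengths, and blending coefficient are placeholders, not the calibrated settings of the framework.

```python
import numpy as np

def detect_rebound(returns, m=5, theta_d=-0.01, theta_u=0.005):
    """Rebound flag for one asset group: a sustained decline over the
    preceding m periods followed by a fresh up-move (placeholder thresholds)."""
    downtrend = returns[-m - 1:-1].mean() < theta_d   # downtrend condition
    rebound = returns[-1] > theta_u                   # rebound condition
    return bool(downtrend and rebound)

def fuse_weights(w_star, w_prior, rebound_flag, eta=0.3):
    """Blend the baseline utility weights with the masked upper-level prior
    only when a rebound is confirmed; eta controls blending intensity."""
    if rebound_flag:
        return (1.0 - eta) * w_star + eta * w_prior
    return w_star
```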
This targeted fusion ensures that adjustments informed by adaptive models affect asset allocation only when the market exhibits verified rebound behavior. Asset groups without rebound signals remain consistent with the baseline expected-utility-maximizing portfolio, thereby maintaining robustness. To incorporate the risk-free asset into the allocation, the unnormalized logits are constructed by concatenating the fixed baseline element with the outputs of the two risky-asset groups as

$$\ell_t = \left[\,0,\ \omega^{new(1)\prime}_t,\ \omega^{new(2)\prime}_t\,\right],$$
where the first element corresponds to the risk-free asset and is fixed at zero. Setting its logit to zero provides a stable reference baseline in the softmax normalization, reflecting the financial role of the risk-free asset as a benchmark with negligible sensitivity to market conditions. This treatment eliminates redundant parameterization and leverages the shift-invariance property of the softmax function LeCun et al. (2015). Accordingly, the logits of the two risky asset groups, $\omega^{new(1)\prime}_t$ and $\omega^{new(2)\prime}_t$, quantify their relative attractiveness against the risk-free baseline. After applying the softmax transformation

$$\omega^{capital\prime}_t = \operatorname{softmax}\!\left(\ell_t / T_{ep}\right),$$

where $T_{ep} > 0$ denotes a temperature coefficient controlling the sharpness of the allocation distribution LeCun et al. (2015), we obtain normalized weights $\omega^{capital(0)\prime}_t$, $\omega^{capital(1)\prime}_t$, and $\omega^{capital(2)\prime}_t$, which can be interpreted as probabilities of allocating capital across the risk-free asset and the two risky groups.
The final allocation to each risky group is obtained by scaling the intra-group weights with the corresponding capital weight, $\omega^{final(i)\prime}_t = \omega^{capital(i)\prime}_t\,\omega^{new(i)\prime}_t$, $i \in \{1, 2\}$, thereby integrating intra-group optimization with group-level capital allocation in a financially coherent manner. This design treats the risk-free asset as a benchmark-relative anchor Sharpe (1964), ensuring that risky asset allocations are consistently interpreted with respect to a stable financial baseline.
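The capital-allocation step can be illustrated with a short sketch; scalar group logits are assumed here, and the temperature value (corresponding to $T_{ep}$) is arbitrary.

```python
import numpy as np

def capital_allocation(logit_g1, logit_g2, temperature=1.0):
    """Temperature-scaled softmax over [0, logit_g1, logit_g2]; the fixed
    zero logit anchors the risk-free asset as the benchmark reference."""
    logits = np.array([0.0, logit_g1, logit_g2]) / temperature
    logits -= logits.max()              # shift-invariance, numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()          # [w_rf, w_capital_1, w_capital_2]

w_rf, w_g1, w_g2 = capital_allocation(0.8, -0.2, temperature=0.5)
```

A lower temperature sharpens the allocation toward the most attractive group, while a higher temperature spreads capital more evenly across the three options.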
The optimization of the lower-level framework is based on the DDPG paradigm, augmented with auxiliary components and enhanced exploration mechanisms.
The critic network is optimized using a temporal-difference (TD) objective Silver et al. (2014), the canonical method in reinforcement learning for estimating value functions through bootstrapping from subsequent rewards. This formulation is derived from the Bellman equation Silver et al. (2014), which establishes the recursive structure of value estimation. For a transition $(s_{t-1}, a_{t-1}, r_t, s_t)$, the objective is expressed as

$$L(\theta) = \mathbb{E}\left[\left(r_t + \gamma\, Q_{\theta^-}\!\left(s_t, \pi_{\phi^-}(s_t)\right) - Q_{\theta}\!\left(s_{t-1}, a_{t-1}\right)\right)^2\right],$$
where $Q_\theta$ represents the critic network parameterized by $\theta$, which estimates the expected return of a state-action pair, and $\gamma$ is the discount factor. The policy $\pi_\phi$, parameterized by $\phi$, denotes the actor network that generates deterministic portfolio-weight actions. The pair $(\theta^-, \phi^-)$ corresponds to the delayed target parameters used to stabilize training by reducing estimation variance and preventing oscillations during learning.
The actor network is optimized by maximizing the critic's evaluation of its policy outputs, augmented with two auxiliary components: an entropy regularizer and a detached prior-alignment penalty. The overall objective is formulated as

$$L(\phi) = -\,\mathbb{E}\left[Q_{\theta}\!\left(s_{t-1}, \pi_{\phi}(s_{t-1})\right)\right] - \alpha\, H\!\left(\pi_{\phi}(s_{t-1})\right) + \beta\, L^{\text{det}}_{\text{imit}}, \qquad L^{\text{det}}_{\text{imit}} = \left\lVert \operatorname{sg}\!\left(\omega^{(i)\prime}_t\right) - \operatorname{sg}\!\left(\omega^{(0,i)\prime}_t\right) \right\rVert^2,$$

where $H(\cdot)$ denotes the entropy of the action distribution, $\alpha = 0.1$ and $\beta = 0.01$ are fixed coefficients, and $\operatorname{sg}(\cdot)$ indicates the stop-gradient operator. Because both terms in $L^{\text{det}}_{\text{imit}}$ are detached, this component introduces no gradient contributions to the network parameters and therefore acts as a constant offset in optimization. Its function is to provide a stable diagnostic of the alignment between intra-group predictions $\omega^{(i)\prime}_t$ and the masked upper-level prior $\omega^{(0,i)\prime}_t$, while strictly preventing cross-layer gradient propagation Silver et al. (2014).
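The two objectives can be sketched in PyTorch as follows; network internals, tensor shapes, and the treatment of the weight vector as a distribution for the entropy term are simplifying assumptions, while the coefficients match the values stated above.

```python
import torch
import torch.nn.functional as F

def critic_td_loss(critic, target_critic, target_actor, batch, gamma=0.99):
    # batch: transition (s_{t-1}, a_{t-1}, r_t, s_t); r has shape (batch, 1)
    s_prev, a_prev, r, s = batch
    with torch.no_grad():
        # Bellman target bootstrapped through the delayed target networks
        y = r + gamma * target_critic(s, target_actor(s))
    return F.mse_loss(critic(s_prev, a_prev), y)

def actor_loss(critic, actor, s_prev, prior, alpha=0.1, beta=0.01):
    a = actor(s_prev)                         # deterministic portfolio weights
    q = critic(s_prev, a).mean()              # critic's evaluation of the policy
    p = a.clamp_min(1e-8)
    entropy = -(p * p.log()).sum(-1).mean()   # entropy regularizer H(.)
    # fully detached alignment term: a diagnostic that contributes no gradients
    imit = F.mse_loss(a.detach(), prior.detach())
    return -q - alpha * entropy + beta * imit
```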
To ensure effective exploration in the continuous, high-dimensional action space, the framework employs a hybrid exploration scheme comprising:
• Ornstein-Uhlenbeck (OU) process noise. Correlated Gaussian perturbations generated by an OU process are added to the actions, producing temporally coherent trajectories that better capture trading dynamics Lillicrap et al. (2015b).
• ϵ-greedy perturbation. With probability ϵ, the action is perturbed by zero-mean Gaussian noise and subsequently clipped to admissible bounds, thereby injecting stochasticity and mitigating entrapment in local optima Sutton and Barto (1998).
Together, these components balance exploration and exploitation: OU-process noise promotes stability and temporal smoothness, whereas ϵ-greedy perturbations enhance resilience in volatile market conditions.
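A compact sketch of this hybrid exploration scheme follows; the OU parameters, the value of ϵ, and the [0, 1] clipping range are illustrative defaults.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        dx = (-self.theta * self.x * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(self.x.size))
        self.x = self.x + dx
        return self.x

def explore(action, ou, eps=0.1, noise_scale=0.05):
    """OU noise is always applied; with probability eps, an additional
    zero-mean Gaussian kick is added before clipping to admissible bounds."""
    a = action + ou.sample()
    if np.random.rand() < eps:
        a = a + noise_scale * np.random.randn(a.size)
    return np.clip(a, 0.0, 1.0)
```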
We construct three independent datasets from Yahoo Finance, with the Dow Jones Industrial Average (DJIA) serving as the empirical foundation of the analysis. The DJIA is selected as the benchmark for the SAMP-HDRL framework because it comprises 30 highly liquid equities from diverse industries, thereby providing a representative market environment consistent with the assumptions of this study. Each dataset spans a four-year horizon, in which the first three years function as the training set and the final year is reserved for backtesting, as summarized in Table 3. To evaluate the robustness of SAMP-HDRL under extreme market turbulence induced by exogenous shocks, particular emphasis is placed on the backtest set of Dataset 2 and the training set of Dataset 3, especially those reflecting the pronounced volatility observed during the COVID-19 pandemic in 2020 Apergis et al. (2023).
While reinforcement learning provides adaptive mechanisms for sequential decision-making, it cannot fully resolve the challenges posed by non-stationary market dynamics. Hence, before evaluating model performance, it is necessary to characterize the statistical properties of the datasets, particularly the 2020 segment where the COVID-19 pandemic triggered circuit breakers and induced structural breaks in U.S. equity markets. For this purpose, we apply two standard diagnostic tests to the DJIA. As reported in Table 2, the Augmented Dickey-Fuller (ADF) test Said and Dickey (1984) fails to reject the unit-root null hypothesis, confirming that the price series is non-stationary, whereas the ARCH-LM test Patilea and Raïssi (2014) strongly rejects the null hypothesis of homoscedasticity, indicating pronounced conditional heteroscedasticity and volatility clustering in the return sequence. Taken together, these findings confirm that the 2020 market exhibits both non-stationarity and time-varying volatility, consistent with the exceptional turbulence of that year. Alongside the upward-trending regime in 2019 Kyriazis et al. (2023); Wang et al. (2021) and the oscillatory dynamics of 2021 Ullah et al. (2023), these diagnostics establish a rigorous basis for evaluating SAMP-HDRL under heterogeneous environments, thereby enabling a comprehensive assessment of robustness and adaptability.

Because Dow Inc. was separated from DowDuPont in March 2019, consistent historical data are unavailable for this stock, which is therefore excluded from the analysis, leaving 29 constituents in the sample. The trading horizon is defined on a daily scale, and the adjusted closing price of each equity is employed to construct the price matrix. Compared with raw closing prices, adjusted closing prices correct for distortions arising from dividends and stock splits Gao and Chai (2018), thereby providing a more accurate, comparable, and methodologically coherent representation of asset values Huang (2024).
We benchmark our framework against nine traditional strategies and nine reinforcement learning (RL)-based approaches, three of which are obtained from the TradeMaster platform Sun et al. (2023). The compared DRL methods include EIIE, FinRL, EST, SARL, II, TradeMaster (PPO), PPN, TARN, DeepMPT, LSRE-CAAN, and Financial Transformer Reinforcement Learning (FTRL) Ren et al. (2025). This collection spans widely adopted DRL benchmarks as well as recent approaches that have gained academic visibility, ensuring both methodological breadth and relevance for comparison. EIIE introduces a DRL framework originally applied to cryptocurrency portfolios and demonstrates superior performance relative to heuristic strategies; this design is extended to the stock domain as a comparator against SAMP-HDRL. FinRL provides a standardized platform for automated trading, offering unified data preprocessing, environment construction, and training pipelines, thereby ensuring consistency and reproducibility. EST ensembles A2C, DDPG, and PPO to exploit their complementary strengths, representing a multi-algorithm baseline. SARL fuses heterogeneous data sources to capture high-dimensional features and improve robustness, while II encodes trading expertise within a DRL paradigm, enhancing performance across diverse metrics. TradeMaster (PPO) applies automated machine learning for adaptive hyperparameter selection within PPO. PPN emphasizes latent inter-asset dependencies and introduces a cost-sensitive reward to strengthen risk control, whereas TARN explicitly models inter-asset correlations to balance risk and return under dynamic conditions. Two recent approaches that have attracted significant academic attention since 2023 are also considered. DeepMPT integrates Modern Portfolio Theory's risk-return principles with DRL optimization, enabling adaptive allocation in non-stationary environments, while LSRE-CAAN designs efficient representations, action spaces, and reward functions to capture temporal dependencies and market non-stationarity in high-frequency settings. Finally, FTRL employs a Transformer-based architecture to model temporal dependencies and cross-asset interactions, thereby addressing the limitations of conventional time-series models in portfolio optimization and providing a strong benchmark under comparable conditions. Detailed specifications of all baselines are summarized in Table 4.
To ensure fairness and computational consistency across all baselines, a unified hyperparameter protocol is adopted. A common learning rate is applied across frameworks to ensure consistent training dynamics, taking into account the limited training horizon (755 daily steps), the architectural complexity of Transformer-based models, and the use of automatic mixed precision (AMP) Micikevicius et al. (2017), which stabilizes optimization through reduced gradient magnitudes. Batch size remains fixed for all methods, and a uniform early-stopping criterion is enforced. Architecture-specific hyperparameters, such as the number of layers or attention heads, are not standardized, since these components are absent in certain baselines (e.g., traditional heuristics or simpler DRL models). To prevent bias from extensive hyperparameter tuning and to guarantee reproducibility, exhaustive grid search is deliberately avoided Henderson et al. (2018). Instead, a restricted set of widely adopted values is manually validated and consistently applied throughout the experimental setting.
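For reference, the mixed-precision training mentioned above typically follows the standard torch.cuda.amp pattern; the sketch below is generic rather than the paper's actual training loop.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # forward pass in mixed precision
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()       # scale gradients to avoid underflow
    scaler.step(optimizer)              # unscale and apply the update
    scaler.update()
    return loss.item()
```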
The evaluation of the proposed framework is conducted along two primary dimensions: profitability and risk. Profitability is measured by the final cumulative return $R_f$, defined as

$$R_f = \prod_{t=1}^{T}\left(1 + \varphi_t\right) - 1,$$

where $\varphi_t$ denotes the portfolio return at time t. A larger $R_f$ reflects stronger overall profitability across the backtest horizon. Risk assessment employs three well-established metrics: the Sharpe ratio $r_{Sharpe}$ Sharpe (1998), the Sortino ratio $r_{Sortino}$ Rollinger and Hoffman (2013), and the Omega ratio $r_{Omega}$ Keating and Shadwick (2002). The Sharpe ratio quantifies excess return relative to total volatility:

$$r_{Sharpe} = \frac{\mathbb{E}\left[\varphi_t - r_A\right]}{\sqrt{\mathrm{Var}\left(\varphi_t - r_A\right)}},$$

where $\mathbb{E}$ and $\mathrm{Var}$ denote the expectation and variance operators, and $r_A$ is the daily log risk-free rate. Recognizing that upward price fluctuations should not be classified as risk Rollinger and Hoffman (2013), the Sortino ratio isolates downside volatility:

$$r_{Sortino} = \frac{\mathbb{E}\left[\varphi_t - r_A\right]}{\sqrt{\dfrac{1}{j}\sum_{\varphi_t < r_A}\left(\varphi_t - r_A\right)^2}},$$

where $r_A$ is the minimum acceptable return (set to the risk-free rate in this study), and j is the number of periods with $\varphi_t < r_A$. A higher $r_{Sortino}$ reflects superior performance relative to downside risk. The Omega ratio evaluates the entire distribution of returns, incorporating higher-order moments:

$$r_{Omega} = \frac{\displaystyle\int_{r_A}^{\infty}\left[1 - F(x)\right]dx}{\displaystyle\int_{-\infty}^{r_A} F(x)\,dx},$$

with $F(\cdot)$ denoting the cumulative distribution function of returns and $r_A$ again the minimum acceptable return. A larger $r_{Omega}$ indicates stronger dominance of gains over losses, thus providing a more comprehensive characterization of portfolio quality.
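These four metrics can be computed directly from the realized daily return series; the sketch below uses empirical per-period estimators (the Omega ratio is approximated by its discrete-sample form).

```python
import numpy as np

def evaluate(phi, r_a=0.0):
    """Profitability and risk metrics from daily portfolio returns phi.
    Follows the definitions above; annualization is omitted for brevity."""
    r_f = np.prod(1.0 + phi) - 1.0            # final cumulative return R_f
    excess = phi - r_a
    sharpe = excess.mean() / excess.std()
    downside = excess[excess < 0.0]           # periods with phi_t < r_A
    sortino = excess.mean() / np.sqrt((downside ** 2).mean())
    omega = excess[excess > 0.0].sum() / -downside.sum()
    return {"Return": r_f, "Sharpe": sharpe, "Sortino": sortino, "Omega": omega}
```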
The results of the backtests are reported in Tables 5 to 7. In the period corresponding to Backtest 1, the market exhibits a persistent upward trend Wang et al. (2021). Within this setting, SAMP-HDRL underperforms FTRL by approximately 16%, highlighting the sensitivity of the proposed framework to persistently bullish market regimes. This observation can be attributed to two key factors. First, FTRL does not incorporate explicit risk management, thereby allowing the strategy to fully exploit upward trends and maximize profitability in such conditions Ren et al. (2025). Second, the hierarchical architecture of SAMP-HDRL not only captures temporal patterns but also integrates structural mechanisms for risk balancing and robustness Barto and Mahadevan (2003b). This design creates a dynamic trade-off between return maximization and risk control. Under relatively calm and upward-trending conditions, this trade-off may suppress certain high-risk, high-reward opportunities, resulting in marginally lower returns compared to the risk-agnostic FTRL.

To further validate the findings, an additional set of experiments is conducted with a variant of FTRL, denoted as FTRL-risk. This extension augments the original FTRL by incorporating a lower-level module for dynamic asset classification and capital allocation, consistent with the design of SAMP-HDRL. The three backtest datasets are used, and the results are summarized in Table 8, which reports the performance of FTRL, its risk-augmented variant, and SAMP-HDRL. In the time period corresponding to Backtest 1, characterized by a relatively stable and upward-trending market, FTRL achieves the best outcomes across all metrics, surpassing SAMP-HDRL by 19%, 6%, 51%, and 3% in Return, Sharpe ratio, Sortino ratio, and Omega ratio, respectively. By contrast, FTRL-risk underperforms FTRL, with the same metrics being lower by 23%, 14%, 39%, and 7%, respectively, confirming the inference that explicit risk control may suppress high-risk, high-reward opportunities in bullish regimes Driessen et al. (2019); Xiong et al. (2022). In the period corresponding to Backtest 2, marked by heightened volatility Apergis et al. (2023), SAMP-HDRL consistently outperforms all strategies, exceeding FTRL by 33%, 35%, 37%, and 6% in the four metrics, respectively. FTRL-risk also improves upon FTRL with gains of 30%, 31%, 37%, and 5%, respectively, indicating that risk-sensitive modifications enhance profitability and risk-adjusted performance, though still falling short of the robustness achieved by SAMP-HDRL. In the time horizon of Backtest 3, reflecting sideways and mixed conditions Ullah et al. (2023), SAMP-HDRL again maintains the lead, outperforming FTRL by 5%, 5%, 6%, and 2% in Return, Sharpe ratio, Sortino ratio, and Omega ratio, respectively, while FTRL-risk shows modest improvements of 5%, 4%, 2%, and 2% over FTRL, respectively, yet remains inferior to SAMP-HDRL overall.

Overall, these results reinforce the prior analysis: while FTRL benefits most in stable upward markets due to the absence of explicit risk constraints, the integration of risk control substantially strengthens resilience in volatile and uncertain environments, enabling SAMP-HDRL to achieve a more effective dynamic balance between return maximization and risk management. To further validate the effectiveness and necessity of the proposed innovations in SAMP-HDRL, namely upper-lower coordination, dynamic clustering, and capital allocation, we design a set of ablation experiments.
Specifically, four variants of SAMP-HDRL are constructed: SAMP-HDRL w/o upper, SAMP-HDRL w/o lower, SAMP-HDRL w/o dc, and SAMP-HDRL w/o ca, which remove the upper-level framework, the lower-level agents, the dynamic clustering module, and the capital allocation module, respectively. To ensure operability, in SAMP-HDRL w/o upper the upper framework is replaced with the UBAH Cover (1991) strategy to provide global decision outputs, thereby testing its contribution to inter-asset correlation modeling. In SAMP-HDRL w/o lower, the entire lower-level agent structure is removed to examine its role in intra-asset temporal modeling and risk control. In SAMP-HDRL w/o dc, the dynamic clustering mechanism is replaced by a static clustering procedure performed before training, verifying its advantage in adapting to evolving market structures. Finally, in SAMP-HDRL w/o ca, the original capital allocation mechanism is replaced by equal allocation between asset groups, in order to assess its contribution to optimizing the risk-return trade-off. The empirical results are reported in Table 9.
Table 9 reports the performance of SAMP-HDRL and its ablated variants across the three backtest datasets. The results demonstrate that the proposed innovations, namely upper-lower coordination, dynamic clustering, and capital allocation, play indispensable roles in enhancing overall performance.
In the time period corresponding to Backtest 1, SAMP-HDRL outperforms the variant without the upper framework by 29%, 16%, 21%, and 7% in Return, Sharpe ratio, Sortino ratio, and Omega ratio, respectively. Relative to the variant without dynamic clustering, the improvements are 114%, 68%, 93%, and 22%, respectively, while relative to the variant without capital allocation, the gains are 24%, 10%, 15%, and 4%, respectively. However, the variant without the lower agents achieves a Return higher than SAMP-HDRL by 19%, confirming the earlier inference that risk-control mechanisms may restrict profit maximization under one-sided upward conditions Driessen et al. (2019); Xiong et al. (2022).
In the interval of Backtest 2, characterized by heightened volatility, SAMP-HDRL demonstrates consistent superiority. Compared with the variant without the upper framework, the improvements are 38%, 34%, 39%, and 5% in the four metrics, respectively. Relative to the variant without the lower framework, the gains are 33%, 35%, 37%, and 6%, respectively. Against the variant without dynamic clustering, the improvements are 12%, 14%, 10%, and 3%, respectively, while relative to the variant without capital allocation, the margins rise to 51%, 47%, 53%, and 7%, respectively. These results underscore that all three mechanisms contribute indispensably to robustness and the risk-return balance under turbulent conditions.
In the horizon of Backtest 3, reflecting sideways and mixed market conditions, SAMP-HDRL outperforms the variant without the upper framework by 21%, 13%, 12%, and 3% in Return, Sharpe ratio, Sortino ratio, and Omega ratio, respectively, and exceeds the variant without the lower framework by 5%, 1%, 2%, and 2%, respectively. Relative to the variant without capital allocation, the improvements are 13%, 0.1%, -1%, and -0.2% in the same metrics, respectively. When compared with the variant without dynamic clustering, SAMP-HDRL achieves an increase of 1% in Return but decreases of 2% and 0.3% in Sharpe and Sortino ratios, respectively, while its Omega ratio is slightly lower by 0.2%. This indicates that dynamic clustering is particularly effective in optimizing the risk-return trade-off in non-trending markets.
Overall, the ablation study validates the effectiveness of the three key innovations: the upper framework substantially strengthens robustness through inter-asset correlation modeling, the lower agents are indispensable for risk control in volatile markets, and dynamic clustering together with capital allocation effectively optimize the risk-return balance in structurally shifting and oscillating environments, thereby underpinning the overall superiority of SAMP-HDRL.
SHapley Additive exPlanations (SHAP) Lundberg and Lee (2017) offers a cooperative game-theoretic framework for interpretability by attributing the marginal contribution of each input feature to the model output in a fair manner. In this study, SHAP is integrated into the DRL-based portfolio allocation framework to quantify the influence of each asset’s historical price trajectory on the assigned portfolio weights. Specifically, for each backtest period, SHAP values are computed for the actions of both lower-level agents, capturing how past market movements and clustering masks contribute to allocation decisions.
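One way such attributions can be produced is with a model-agnostic explainer wrapped around the agent's policy; the sketch below uses shap.KernelExplainer, where the predict wrapper, the flattened feature layout, and the per-asset output selection are assumptions for illustration rather than the paper's exact pipeline.

```python
import shap

def make_policy_fn(actor, asset_index=0):
    """Wrap a lower-level actor so SHAP can probe one output weight.
    Inputs are flattened (lookback x n_assets) state features (assumed)."""
    def policy_fn(x):
        weights = actor.predict(x)        # (n_samples, n_assets) allocations
        return weights[:, asset_index]    # scalar output for attribution
    return policy_fn

def shap_for_agent(actor, background, x_explain, asset_index=0):
    """SHAP values of each input feature for the chosen asset's weight;
    background is a small reference sample of historical states."""
    explainer = shap.KernelExplainer(make_policy_fn(actor, asset_index), background)
    return explainer.shap_values(x_explain)   # (n_samples, n_features)
```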
The resulting visualizations (Figures 9 to 11) consist of three subplots corresponding to distinct market regimes. In each subplot, the vertical axis represents the temporal dimension (trading days), while the horizontal axis corresponds to individual assets. The panels labeled R1-R4 correspond to the four post-reclassification periods in which cluster memberships remain stable, and ALL provides the aggregated view across the full year. The color intensity encodes the SHAP value magnitude, indicating each asset's contribution to portfolio weight decisions at a given time. Warm colors denote positive contributions that increase allocation weights, whereas cool colors indicate negative contributions leading to reduced exposure.

In the turbulent regime of 2020, the market exhibits heightened uncertainty. Agent 1 continues to maintain broad coverage to buffer systemic risk, while Agent 2 sharply narrows its focus to a small set of high-confidence quality assets, capturing rebounds and structural opportunities. The thresholded SHAP results confirm that the model effectively filters noise in this turbulent environment, retaining only decisive features and thereby demonstrating robustness under non-stationarity.
In the oscillating recovery of 2021 Ullah et al. (2023), the market exhibits upward tendencies accompanied by heightened volatility and uncertainty. The SHAP mean results show relatively smaller marginal contributions across assets, indicating weaker single-stock drivers. Agent 1 maintains balanced coverage across most assets to mitigate risks, whereas Agent 2 selectively emphasizes a few quality assets during specific intervals, forming concentrated allocations. The thresholded SHAP analysis further validates a “broad coverage + selective focus” strategy, balancing risk and return in a sideways environment.
Taken together, the SHAP analyses across three regimes demonstrate that the two lower-level agents do not allocate weights arbitrarily but instead operate through a complementary mechanism. Agent 1 ensures diversification across remaining assets, while Agent 2 concentrates on high-quality assets. This coordination enables the policy to emphasize structural drivers in upward markets, reinforce risk buffering in non-stationary regimes, and integrate diversification with selective focus in oscillating recoveries. Such behavior not only substantiates the rationality of the agents’ decisions but also enhances the interpretability and credibility Guan and Liu (2022) of DRL-based portfolio optimization.
This study introduces SAMP-HDRL, a novel hierarchical deep reinforcement learning framework for portfolio management that integrates upper-lower agent coordination, dynamic clustering, and capital allocation. Extensive experiments on three backtest datasets demonstrate that SAMP-HDRL consistently and significantly outperforms both traditional baselines and recent DRL methods across return, Sharpe ratio, Sortino ratio, and Omega ratio. To the best of our knowledge, we are the first to provide a comprehensive integration of hierarchical coordination, clustering, and capital allocation within a unified end-to-end paradigm for portfolio optimization. Comparative and ablation analyses confirm the indispensable contributions of each module: the upper framework enhances inter-asset correlation modeling, the lower agents strengthen risk control, and the combination of clustering with capital allocation ensures an effective balance between diversification and focused allocation. Moreover, SHAP-based interpretability analysis reveals that the two lower-level agents operate in a complementary manner, with Agent 1 supporting diversification across remaining assets and Agent 2 concentrating on high-quality assets, thereby offering transparent and economically consistent explanations of portfolio decisions.
Nevertheless, several limitations warrant careful consideration. First, the hierarchical architecture is not strictly end-to-end, as the upper- and lower-level agents are optimized in a staged manner, which may constrain full cross-level information integration. Second, although dynamic clustering enhances intra-group feature representation, the framework does not explicitly capture inter-cluster dependencies, thereby risking the omission of valuable cross-asset correlations. Third, the interpretability analysis primarily relies on SHAP, which provides post-hoc rather than real-time or causality-aware explanations, leaving open questions regarding transparent monitoring of the decision process in live trading environments. Finally, the evaluation is restricted to price-based time series from the DJIA, without the inclusion of macroeconomic, sentiment, or alternative cross-market signals, which may limit generalizability to broader financial contexts. Collectively, these issues underscore both methodological and practical dimensions in which the framework can be further advanced.
Future research will seek to address these challenges along several directions. First, more integrated end-to-end optimization paradigms will be explored, enabling upper- and lower-level agents to co-adapt through shared objectives and thereby strengthen cross-level synergy. Second, graph neural networks and correlation-aware attention mechanisms will be incorporated to explicitly model inter-cluster dependencies and improve diversification. Third, interpretability will be extended toward real-time and causality-aware frameworks, facilitating continuous monitoring of portfolio decisions and validation under counterfactual scenarios. Fourth, the framework will be enriched with multi-modal and cross-market features, including macroeconomic indicators,