Parallel execution of portfolio optimization
Analysis of asset‑liability management (ALM) strategies, especially over long time horizons, is a crucial issue for banks, funds, and insurance companies. Modern economic models, investment strategies, and optimization criteria make ALM studies computationally very intensive, which draws attention to multiprocessor systems, and especially to the cheapest of them: multi‑core PCs and PC clusters. In this article we analyze the problem of parallelizing portfolio optimization, present results of using clusters for optimization, and identify the most efficient cluster architecture for this kind of task.
💡 Research Summary
The paper addresses the computational challenges inherent in long‑term asset‑liability management (ALM) studies, especially when portfolio optimization must be performed over thousands of economic and liability scenarios. Modern ALM models typically require a Dynamic Decision Rule (DDR) that combines a formal description of investment and contribution strategies (DES), a set of economic scenarios (SCEN), and objective functions. In practice, a realistic ALM analysis may involve 1 000–10 000 scenarios, a time horizon of 10–320 periods, and a parameter space that contains tens of variables each sampled at hundreds of points. Consequently, the total number of DES evaluations can easily exceed tens of millions, making the problem computationally intensive.
The authors propose two high‑level parallelization strategies. The first distributes the parameter domain across nodes while each node processes the full scenario set; the second distributes the scenario set across nodes and evaluates all parameters in parallel on each node. They argue that the second approach, despite requiring more frequent communication of parameter vectors, avoids the duplication of the large scenario data set and simplifies load balancing. Consequently, the paper focuses on the second strategy.
A key contribution is a simple algorithm for equitable scenario distribution:
Portion_Size = floor(Number_of_Scenarios / Cluster_Size)
Rest = Number_of_Scenarios - Portion_Size * Cluster_Size
Each of the first Rest nodes receives Portion_Size + 1 scenarios; the remaining nodes receive Portion_Size. This guarantees that the difference in workload between any two nodes is at most one scenario, eliminating the “slowest node” bottleneck.
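The distribution rule above can be sketched in a few lines of Python (a minimal illustration; the function and variable names are mine, not the paper's):

```python
def distribute_scenarios(number_of_scenarios: int, cluster_size: int) -> list[int]:
    """Split scenarios across nodes so that no node holds more than one extra."""
    portion_size = number_of_scenarios // cluster_size
    rest = number_of_scenarios - portion_size * cluster_size
    # The first `rest` nodes get one extra scenario; the others get the base portion.
    return [portion_size + 1] * rest + [portion_size] * (cluster_size - rest)

print(distribute_scenarios(10_000, 3))  # -> [3334, 3333, 3333]
```

With 10 000 scenarios on 3 nodes the workloads differ by at most one scenario, as the summary claims.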
The cluster architecture is examined in three variants:
- Master‑Slave (single‑level tree) – The master holds the complete task, sends parameter vectors to each slave, and collects the DES results. Communication is sequential over a single network bus, which becomes a bottleneck as the number of nodes grows.
- Two‑level tree – An intermediate layer of “second‑level masters” receives the task from the top master and further distributes it to their own slaves. Results are aggregated hierarchically, reducing contention on the network bus and allowing the system to scale to larger numbers of nodes.
- Ring (and ring‑of‑rings) – Each node is equipped with two NICs and forms a logical ring with its immediate neighbors. Data flows clockwise and counter‑clockwise, limiting each node’s degree to four and avoiding the need for multiple hubs. The authors note that for workloads with light communication (e.g., scenario processing) the simpler tree may be preferable, while the ring offers better fault tolerance and comparable performance for more communication‑intensive tasks.
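As a rough illustration of why the hierarchy helps, consider a toy cost model (my own simplification, not from the paper) in which each point‑to‑point send costs one round and sends from the same sender are serialized, while different senders operate in parallel:

```python
import math

def master_slave_rounds(n_slaves: int) -> int:
    # A single master sends to every slave over the shared bus: n sequential sends.
    return n_slaves

def two_level_tree_rounds(n_slaves: int, n_submasters: int) -> int:
    # The top master feeds the sub-masters sequentially; then each sub-master
    # feeds its own group of slaves in parallel with the other groups.
    per_group = math.ceil(n_slaves / n_submasters)
    return n_submasters + per_group

print(master_slave_rounds(64))       # -> 64
print(two_level_tree_rounds(64, 8))  # -> 16
```

Under this model the two‑level tree cuts the distribution phase from 64 rounds to 16 for 64 slaves, which is the intuition behind the reduced bus contention described above.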
For the objective‑function evaluation, the authors discuss parallel computation of both the function value and its gradient. The matrix A that maps independent variables to dependent variables is transmitted once to all nodes; each node computes its assigned rows (scalar products) locally. The master then aggregates the partial results. The speed‑up condition is expressed as
m · (1 – 1/ClusterSize) · Tc > Ti + Td,
where Tc is the time for a scalar product, Ti the time to broadcast the independent‑variable vector, and Td the time to gather results.
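The inequality can be checked numerically before committing to a parallel gradient evaluation. A small sketch (the timing values below are illustrative placeholders, not measurements from the paper):

```python
def parallel_gradient_pays_off(m: int, cluster_size: int,
                               t_scalar: float, t_broadcast: float,
                               t_gather: float) -> bool:
    """Speed-up condition m * (1 - 1/ClusterSize) * Tc > Ti + Td:
    the scalar-product work offloaded from the master must exceed the cost
    of broadcasting the independent-variable vector and gathering results."""
    work_saved = m * (1 - 1 / cluster_size) * t_scalar
    return work_saved > t_broadcast + t_gather

# With 1000 rows of A on 4 nodes and communication costing ten times one
# scalar product, parallelization wins comfortably.
print(parallel_gradient_pays_off(1000, 4, t_scalar=1e-6,
                                 t_broadcast=5e-6, t_gather=5e-6))  # -> True
```

Conversely, for a handful of rows the communication terms dominate and the condition fails, which is why the matrix A is transmitted once up front rather than per evaluation.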
Empirical validation is performed on a cluster of 2.4 GHz Intel PCs with 1 GB RAM, connected via a 1 Gbps Ethernet. The ALM software is written in C#; communication uses both TCP (TcpListener/TcpClient) and UDP (UdpClient). Table 11 (reproduced in the paper) shows execution times for scenario counts ranging from 500 to 10 000 and cluster sizes of 1–3 machines. The results demonstrate near‑linear reduction in total runtime with increasing cluster size, while the measured transaction times confirm the model
Tc = Ts / m + m · TransactionTime
(where Tc here denotes the total cluster runtime, Ts the single‑machine runtime, and m the cluster size) with a maximum deviation of 7 %. UDP consistently yields much lower transaction overhead than TCP, highlighting the importance of lightweight communication protocols for high‑performance financial simulations.
In summary, the paper provides a comprehensive framework for parallelizing portfolio‑optimization workloads in ALM contexts. It combines a practical load‑balancing algorithm, a comparative analysis of three network topologies, and a performance model validated by real‑world experiments. The findings suggest that even modest, low‑cost multi‑core PCs and standard Ethernet clusters can achieve substantial speed‑ups for large‑scale financial simulations, provided that the communication architecture is carefully designed to avoid bottlenecks. This work offers a valuable reference for practitioners seeking to accelerate ALM computations without resorting to expensive high‑performance computing platforms.