Fully Empirical Autotuned QR Factorization For Multicore Architectures
Tuning numerical libraries has become more difficult over time as systems grow more sophisticated. In particular, modern multicore machines make the behaviour of algorithms hard to forecast and model. In this paper, we tackle the problem of tuning a dense QR factorization on multicore architectures. We show that it is hard to rely on a model, which motivates us to design a fully empirical approach. We exhibit a few strong empirical properties that enable us to efficiently prune the search space. Our method is automatic, fast and reliable: the tuning process is performed entirely at install time, in between one and ten minutes on five out of seven platforms. We achieve an average performance ranging from 97% to 100% of the optimum, depending on the platform. This work forms a basis for autotuning the PLASMA library and enables easy performance portability across hardware systems.
💡 Research Summary
The paper addresses the challenge of tuning dense QR factorization on modern multicore architectures, focusing on the PLASMA library. The authors argue that model‑driven tuning is unreliable for CPU‑based multicore systems because of complex interactions among cache hierarchies, memory bandwidth, TLB behavior, and other hardware characteristics. Consequently, they propose a fully empirical autotuning methodology that runs automatically at install time.
Two key tunable parameters are identified: the tile size (NB) and the inner block size (IB). NB determines how the matrix is partitioned into tiles; a smaller NB yields more tasks and higher parallelism but reduces the efficiency of the serial kernels that operate on each tile. IB controls the amount of extra computation introduced to improve data locality; setting IB equal to NB can increase the total floating-point work by up to 25%. The goal is to find the NB–IB combination that maximizes performance, measured as (4/3)·N³ / t where t is the factorization time, without altering the algorithmic flop count.
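For concreteness, this performance metric can be expressed as a small helper (a sketch in Python; the function name is ours, not part of PLASMA):

```python
def qr_gflops(n, seconds):
    """Performance of an n-by-n QR factorization in Gflop/s.

    Uses the conventional flop count for QR, (4/3) * n**3, regardless
    of any extra work introduced by inner blocking, so that all
    (NB, IB) configurations are compared against the same baseline.
    """
    flops = (4.0 / 3.0) * n ** 3
    return flops / seconds / 1e9
```

Factoring an 8000 × 8000 matrix in 10 seconds, for instance, corresponds to roughly 68.3 Gflop/s.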
Through extensive experiments on a wide variety of hardware (including Intel Core Tigerton 16‑core, IBM Power6 32‑core, and several AMD systems), the authors derive three robust empirical heuristics:
- Small matrices → small NB: when the problem size is modest relative to the core count, a small tile size ensures that all cores receive work.
- Large matrices → larger NB: for sufficiently large problems, a tile size proportional to the number of cores yields better cache utilization and higher overall throughput.
- IB between NB/4 and NB/2: on most architectures the inner block size should lie between one quarter and one half of the tile size, balancing memory traffic against extra arithmetic.
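The heuristics above can be sketched as a candidate generator (a hypothetical illustration: the cutoff `100 * cores`, the NB range, and the divisibility constraint are our assumptions, not values taken from the paper):

```python
def preselect(n, cores, nb_range=range(40, 513, 40)):
    """Prune the (NB, IB) grid using the three empirical heuristics.

    n: matrix order; cores: number of cores available.
    Returns a list of (NB, IB) candidate pairs.
    All thresholds are illustrative, not the paper's values.
    """
    candidates = []
    small_problem = n < 100 * cores  # assumed cutoff for "small"
    for nb in nb_range:
        # Heuristics 1-2: small problems favour small tiles,
        # large problems favour larger tiles.
        if small_problem and nb > 200:
            continue
        if not small_problem and nb < 120:
            continue
        # Heuristic 3: IB between NB/4 and NB/2; keep IB dividing NB
        # (a common constraint for inner blocking).
        for ib in (nb // 4, nb // 3, nb // 2):
            if ib > 0 and nb % ib == 0:
                candidates.append((nb, ib))
    return candidates
```

A grid of hundreds of (NB, IB) pairs is thus reduced to a few dozen candidates before any factorization is actually timed.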
These heuristics are used in a “Pre‑Select” (PS) stage to prune the exhaustive NB‑IB search space to a manageable set of candidate configurations. The second stage performs actual QR factorizations for each candidate and applies a novel “prune‑as‑you‑go” (PAYG) strategy: during benchmarking, any configuration whose measured performance falls below a dynamic threshold (e.g., a fixed percentage of the best observed so far) is discarded immediately, preventing unnecessary runs.
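The prune-as-you-go idea can be sketched as a benchmarking loop (hypothetical: `run_qr` stands in for an actual timed factorization, and the 95% threshold is an assumed value, not the paper's):

```python
def prune_as_you_go(candidates, run_qr, sizes, threshold=0.95):
    """Benchmark (NB, IB) candidates over increasing matrix sizes,
    discarding poor performers before the expensive larger runs.

    run_qr(n, nb, ib) returns measured performance for an n-by-n QR.
    After each size, only candidates within `threshold` of the best
    performance at that size advance to the next, larger size.
    """
    alive = list(candidates)
    best_cfg, best_perf = None, 0.0
    for n in sizes:
        results = [(cfg, run_qr(n, *cfg)) for cfg in alive]
        best_cfg, best_perf = max(results, key=lambda r: r[1])
        alive = [cfg for cfg, perf in results
                 if perf >= threshold * best_perf]
    return best_cfg, best_perf
```

Because the survivor set shrinks at every size, most of the budget is spent on a handful of promising configurations rather than on the full candidate list.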
The combined PS + PAYG approach dramatically reduces tuning time. On five of the seven tested platforms the entire process completes within 1 to 10 minutes, achieving 97%–100% of the optimum performance that would be obtained by exhaustive search. The remaining two platforms required longer runs (up to about an hour) due to a larger search space, but still produced near-optimal results.
The authors also compare their empirical method with prior manual tuning efforts that required days of human intervention. Their automated framework eliminates the need for expert knowledge, makes the PLASMA library portable across diverse multicore systems, and can be executed at installation without burdening end users. Moreover, because the methodology relies solely on measured performance, it is robust to future architectural changes, such as new cache designs or memory hierarchies.
In the discussion, the paper highlights that the same empirical framework can be extended to PLASMA’s other one‑sided factorizations (LU and Cholesky) and potentially to other dense linear algebra kernels. Future work includes integrating more sophisticated performance models to guide the initial candidate selection, supporting non‑square or sparse matrices, and building a community repository where tuned parameter sets can be shared and reused.
Overall, the study demonstrates that a carefully designed empirical autotuning pipeline, grounded in a few strong observations about tile‑based algorithms, can deliver near‑optimal performance on a wide range of multicore platforms while keeping the tuning overhead low enough for practical deployment.