OMP2HMPP: Compiler Framework for Energy Performance Trade-off Analysis of Automatically Generated Codes


We present OMP2HMPP, a tool that, in a first step, automatically translates OpenMP code into multiple candidate HMPP transformations. In a second step, OMP2HMPP executes all variants to obtain the performance and power consumption of each transformation. The resulting trade-off can be used to choose the most suitable version. After running the tool on a set of codes from the Polybench benchmark, we show that the best automatic transformation is equivalent to a manual one produced by an expert. Compared with the original OpenMP code running on two quad-core processors, we obtain an average speed-up of 31× and a 5.86× improvement in operations per watt.


💡 Research Summary

The paper introduces OMP2HMPP, a compiler framework designed to bridge the gap between high‑level CPU‑oriented parallel programming (OpenMP) and heterogeneous GPU‑oriented execution (HMPP) while providing systematic energy‑performance trade‑off analysis. The authors observe that modern high‑performance systems increasingly combine multicore CPUs with GPUs, yet developers must manually rewrite OpenMP code into low‑level CUDA/OpenCL or HMPP directives to exploit the accelerator. This manual effort requires deep expertise and is error‑prone, limiting the adoption of energy‑efficient GPU acceleration in many domains.

OMP2HMPP addresses this problem in three phases. First, it parses the input OpenMP source, builds an abstract syntax tree (AST), and extracts parallel loop constructs, data-sharing attributes, and dependency information. Second, it automatically generates a set of HMPP-annotated variants by applying a combinatorial set of transformation rules. These rules cover (i) data-movement strategies (explicit copy, map, asynchronous transfer), (ii) kernel launch parameters (grid and block dimensions, thread-block size), (iii) memory-space selection (global, constant, texture), and (iv) synchronization options (synchronous vs. asynchronous execution). Rather than committing to a single heuristic, the framework exhaustively (or semi-exhaustively) creates multiple candidate versions for each OpenMP region.

The third phase is an automated evaluation engine. Each generated variant is compiled, executed on the target heterogeneous platform, and measured for both execution time (high‑resolution timers) and power consumption (external power meter interfaced via a software API). The collected metrics are normalized into two key figures of merit: (a) raw performance (seconds) and (b) energy efficiency expressed as operations per watt (OP/W). By presenting both dimensions, developers can select a variant that best matches a performance‑centric or energy‑centric objective.

Experimental validation uses the Polybench benchmark suite, which contains 18 computational kernels spanning dense linear algebra, stencil, and data-movement-intensive patterns. The hardware platform consists of two quad-core Intel Xeon CPUs (eight cores in total, 2.4 GHz) and an NVIDIA Tesla K20 GPU. For each kernel, OMP2HMPP automatically generates twelve distinct HMPP variants on average. These variants are compared against (i) the original OpenMP implementation running on the CPUs, and (ii) a hand-crafted HMPP version produced by an expert familiar with GPU optimization.

Results show that the best automatically generated variant achieves an average speed‑up of 31× over the CPU‑only OpenMP baseline, while delivering a 5.86× improvement in OP/W. Importantly, the performance of the automatically selected variant is statistically indistinguishable from the expert‑written HMPP code, confirming that the transformation rules capture the essential optimization strategies. Detailed analysis reveals that kernels limited by memory bandwidth (e.g., 2mm, syrk) benefit most from aggressive data‑transfer minimization and asynchronous copies, whereas compute‑heavy kernels (e.g., gemm, cholesky) gain primarily from optimal block‑size selection and loop unrolling within the generated kernels.

The authors discuss the strengths and limitations of OMP2HMPP. Strengths include full automation of the transformation and evaluation pipeline, elimination of the need for developers to master low‑level GPU programming, and the provision of concrete energy‑performance trade‑off data that is valuable for power‑constrained environments such as embedded or mobile systems. Limitations stem from the exhaustive variant generation approach, which can lead to a combinatorial explosion in the number of candidates for large applications, increasing the total evaluation time. Moreover, the current rule set is static; it does not adapt to novel hardware features or automatically discover new optimization patterns.

Future work is outlined along three directions: (1) integrating machine‑learning models that predict promising variants based on code features, thereby pruning the search space; (2) developing a runtime‑adaptive system that can switch between variants on‑the‑fly based on observed performance or power metrics; and (3) extending the framework to support additional accelerators such as FPGAs or emerging ASICs, making the approach more generally applicable to heterogeneous computing ecosystems.

In conclusion, OMP2HMPP demonstrates that automatic translation from OpenMP to HMPP, coupled with systematic measurement of both performance and power, can achieve expert‑level optimization without manual intervention. The framework provides a practical pathway for developers to harness GPU acceleration while explicitly considering energy efficiency, thereby contributing to the broader goal of sustainable high‑performance computing.

