Efficient Hybrid Execution of C++ Applications using Intel(R) Xeon Phi(TM) Coprocessor
The introduction of Intel(R) Xeon Phi(TM) coprocessors opened up new possibilities in the development of highly parallel applications. The familiarity and flexibility of the architecture, together with compiler support integrated into Intel C++ Composer XE, allow developers to use familiar programming paradigms and techniques that are usually not suitable for other accelerated systems. It is now easy to use complex C++ template-heavy codes on the coprocessor, including, for example, the Intel Threading Building Blocks (TBB) parallelization library. These techniques are not only possible but usually efficient as well, since the host and the coprocessor belong to the same architectural family, making optimization techniques designed for the Xeon CPU also beneficial on Xeon Phi. As a result, highly optimized Xeon codes (like the TBB library) work well on both. In this paper we present a new parallel library construct that makes it easy to apply a function to every member of an array in parallel, dynamically distributing the work between the host CPUs and one or more coprocessor cards. We describe the associated runtime support and use a physical simulation example to demonstrate that our library construct can be used to quickly create a C++ application that benefits significantly from hybrid execution, simultaneously exploiting CPU cores and coprocessor cores. Experimental results show that a single optimized source code is sufficient to run efficiently on both the host and the coprocessors.
💡 Research Summary
The paper addresses the challenge of exploiting Intel Xeon Phi coprocessors together with traditional Xeon CPUs in a single C++ application. The authors observe that Xeon Phi shares the x86‑64 instruction set and a very similar micro‑architectural design with Xeon CPUs, which means that many optimizations developed for CPUs (cache‑friendly data layouts, SIMD vectorization, prefetching, etc.) can be directly applied to the coprocessor. Intel C++ Composer XE supports both pragma‑based offloading of code regions and native cross‑compilation for the coprocessor via the -mmic flag, and it integrates tightly with the Intel Threading Building Blocks (TBB) library, allowing developers to keep a familiar high‑level parallel programming model.
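For readers unfamiliar with the offload model, the shape of Intel's Language Extensions for Offload (LEO) can be sketched as follows. The function name is illustrative; the pragma syntax is the standard Composer XE offload form. With a non-Intel compiler (or when no card is present) the pragma is ignored and the loop simply runs on the host, which mirrors the offload runtime's host-fallback behavior:

```cpp
// Sketch of a LEO offload region, as supported by Intel C++ Composer XE.
// in(...)/out(...) clauses describe the data to copy across PCIe; with a
// compiler that does not know the pragma, the loop executes on the host.
void scale_offloaded(const double* a, double* b, int n) {
    #pragma offload target(mic:0) in(a : length(n)) out(b : length(n))
    for (int i = 0; i < n; ++i)
        b[i] = a[i] * 2.0;
}
```

Inside the offloaded region, ordinary C++ (including TBB calls) is allowed, which is what lets template-heavy host code run unchanged on the coprocessor.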
Building on this foundation, the authors introduce a new library construct—conceptually similar to a parallel for_each—that automatically distributes work between the host and one or more Xeon Phi cards. The core idea is a shared work queue managed by a lightweight runtime. At launch time the runtime queries each device for its number of cores, current load, and memory bandwidth, then partitions the overall iteration space into variable‑size chunks. These chunks are assigned dynamically: the host and each coprocessor pull work from the queue as they become idle. The runtime employs a work‑stealing strategy, so if the host finishes its share early it can steal remaining chunks from the Phi side and vice‑versa, ensuring high utilization even when the workload exhibits unpredictable execution times.
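The shared-queue idea above can be sketched in a few lines. This is a deliberately simplified model, not the paper's runtime: plain `std::thread` workers stand in for the host CPUs and the coprocessor cards, the chunk size is fixed rather than device-aware, and cross-device stealing collapses into all workers draining one atomic counter. The names (`hybrid_for_each`) are hypothetical:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Every worker (standing in for the host or a coprocessor card) atomically
// claims the next chunk of the iteration space until the queue is drained,
// so fast workers naturally take on more chunks than slow ones.
template <typename Func>
void hybrid_for_each(std::size_t n, std::size_t chunk, Func f,
                     unsigned workers = std::thread::hardware_concurrency()) {
    if (workers == 0) workers = 2;  // hardware_concurrency() may report 0
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (;;) {
            std::size_t begin = next.fetch_add(chunk);  // claim a chunk
            if (begin >= n) break;                      // queue exhausted
            std::size_t end = std::min(begin + chunk, n);
            for (std::size_t i = begin; i < end; ++i) f(i);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < workers; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

Because idle workers simply claim the next chunk, load balancing falls out of the queue discipline itself; the paper's runtime refines this with per-device chunk sizing and stealing across the PCIe boundary.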
Data movement is a critical concern because the host and coprocessor communicate over PCI‑Express. The authors mitigate this overhead by combining asynchronous DMA transfers with zero‑copy memory mappings wherever possible. Small control structures (metadata for each chunk) reside in a shared memory region, while the bulk data stays in the device’s local memory until a chunk is processed. After computation, results are written back to host memory using non‑blocking transfers, allowing the next chunk to be fetched without waiting for the previous transfer to complete.
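The transfer/compute overlap described above amounts to double buffering, which can be sketched as follows. Here `std::async` stands in for the asynchronous DMA engine and `fetch_chunk` for a PCIe transfer into device-local memory; both names are illustrative, not the paper's API:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Stand-in for an asynchronous DMA transfer of one chunk into local memory.
std::vector<double> fetch_chunk(const std::vector<double>& src,
                                std::size_t begin, std::size_t len) {
    return std::vector<double>(src.begin() + begin, src.begin() + begin + len);
}

// Double buffering: while chunk k is being processed, the transfer of
// chunk k+1 is already in flight, hiding transfer latency behind compute.
double process_overlapped(const std::vector<double>& data, std::size_t chunk) {
    double total = 0.0;
    std::size_t pos = 0;
    auto inflight = std::async(std::launch::async, fetch_chunk, std::cref(data),
                               pos, std::min(chunk, data.size()));
    while (pos < data.size()) {
        std::vector<double> buf = inflight.get();       // wait for current chunk
        std::size_t next = pos + buf.size();
        if (next < data.size())                         // kick off next transfer early
            inflight = std::async(std::launch::async, fetch_chunk, std::cref(data),
                                  next, std::min(chunk, data.size() - next));
        total += std::accumulate(buf.begin(), buf.end(), 0.0);  // compute overlaps it
        pos = next;
    }
    return total;
}
```

The same pattern applies in the other direction: result write-back is issued as a non-blocking transfer so the next chunk's computation can start immediately.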
The implementation is deliberately thin on the surface: the new construct reuses TBB’s template‑based interface, so existing code that already uses tbb::parallel_for or tbb::parallel_for_each can be switched to the hybrid version by changing a single include and a compile‑time flag. No source‑level refactoring of the algorithm itself is required, which dramatically lowers the barrier to entry for developers.
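The drop-in claim can be illustrated with a sketch of what such an interface looks like at the call site. The `hybrid` namespace and its placeholder body are hypothetical; the point is only that the hybrid construct keeps the same iterator-plus-functor signature as `tbb::parallel_for_each`, so the algorithm's code is untouched:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical drop-in replacement: same signature as tbb::parallel_for_each,
// so only an include (and, per the paper, a compile-time flag) changes.
// The serial std::for_each body is a placeholder for the hybrid scheduler.
namespace hybrid {
template <typename It, typename Func>
void parallel_for_each(It first, It last, Func f) {
    std::for_each(first, last, f);
}
}
```

A call site such as `hybrid::parallel_for_each(v.begin(), v.end(), kernel);` is then textually identical to the TBB version apart from the namespace.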
To validate the approach, the authors evaluate two representative workloads: a three‑dimensional N‑body gravitational simulation (O(N²) pairwise force calculations) and a large‑scale image‑filtering kernel (Gaussian blur). Experiments are performed on a dual‑socket Xeon E5‑2670 host (2.6 GHz, 16 cores total) together with a Xeon Phi 5110P coprocessor (60 cores, ~320 GB/s memory bandwidth). The results show that hybrid execution achieves a speed‑up of 2.3× over a CPU‑only run and 1.7× over a Phi‑only run on average. The most pronounced gains appear when the chunk size is set to 64 KB or larger; at this granularity the PCIe transfer overhead drops below 5 % of total execution time. The work‑stealing scheduler automatically balances load, adapting to runtime variations without manual tuning.
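The loop body of the N‑body benchmark can be sketched serially; the struct layout and function name below are illustrative, and the paper runs essentially this kind of inner loop through the hybrid construct rather than a plain `for`. A softening term `eps` avoids the singularity as the pair distance approaches zero:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Body { double x, y, z, m; };  // position and mass

// One O(N^2) pairwise sweep computing the x-component of gravitational
// acceleration on every body (a serial sketch of the benchmark kernel).
std::vector<double> accel_x(const std::vector<Body>& b, double G = 1.0,
                            double eps = 1e-9) {
    std::vector<double> ax(b.size(), 0.0);
    for (std::size_t i = 0; i < b.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j) {
            if (i == j) continue;
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double dz = b[j].z - b[i].z;
            double r2 = dx * dx + dy * dy + dz * dz + eps;  // softened r^2
            double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            ax[i] += G * b[j].m * dx * inv_r3;              // F = G m dx / r^3
        }
    return ax;
}
```

Each outer iteration is independent, which is what makes the kernel a natural fit for chunk-wise distribution across the host and the coprocessors.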
The paper also discusses limitations. Because Xeon Phi cards have a relatively modest amount of on‑board memory (8 GB on the tested model), applications that require data sets larger than this must replicate data on the host, which can increase transfer costs. Moreover, the current implementation focuses on floating‑point kernels; workloads that heavily use complex numbers, transcendental functions, or irregular memory accesses may need additional library support or custom tuning.
In conclusion, the authors demonstrate that a single, template‑based C++ source file can efficiently harness both host CPUs and Xeon Phi coprocessors, achieving substantial performance gains while preserving code readability and maintainability. The work paves the way for broader adoption of Xeon Phi in scientific and engineering codes, and suggests future extensions such as automatic chunk‑size tuning, multi‑Phi coordination, and integration with other accelerators (GPUs, FPGAs) under a unified runtime.