A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs


Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton. These APIs are often combined with optimized vendor math libraries such as cuBLAS and oneMKL. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms, so combining them within a single application is error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome these limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API. Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime. Our results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.


💡 Research Summary

The paper addresses the growing complexity of programming heterogeneous nodes that combine multi‑core CPUs with a variety of accelerators (GPUs, FPGAs, AI‑specific chips). Modern HPC and AI workloads often need to invoke several low‑level accelerator APIs—CUDA, SYCL, Triton, OpenCL—together with vendor‑optimized libraries such as cuBLAS, cuSPARSE, and oneMKL. Each API brings its own abstractions, execution model, and synchronization primitives, which makes integrating multiple APIs within a single application error‑prone and labor‑intensive, and exposes it to performance variability due to runtime interference.

To solve this, the authors reuse a task‑based data‑flow methodology built on the OpenMP/OmpSs‑2 runtime. Applications are expressed as a directed acyclic graph (DAG) where nodes represent either host tasks or device kernels, and edges encode data dependencies. The runtime automatically builds the DAG at execution time, schedules ready tasks, and overlaps computation, communication, and I/O without explicit user‑managed synchronization.
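As a concrete illustration of this model, the following minimal sketch (our own, not code from the paper) expresses a three-node DAG with OpenMP task dependencies: the runtime derives the edges from the `depend` clauses, runs the two independent consumers concurrently, and requires no manual barriers between producer and consumers. Compiled without OpenMP, the pragmas are ignored and the code runs serially with the same result.

```cpp
// A producer task writes `a`; two consumer tasks read it independently.
// The runtime builds the DAG at execution time from the depend clauses.
int dag_demo() {
    int a = 0, b = 0, c = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)               // producer node
        { a = 2; }
        #pragma omp task depend(in: a) depend(out: b) // consumer node 1
        { b = a * 3; }
        #pragma omp task depend(in: a) depend(out: c) // consumer node 2, may run
        { c = a + 5; }                                // concurrently with node 1
        #pragma omp taskwait                          // join before reading results
    }
    return b + c; // 6 + 7 = 13 regardless of task interleaving
}
```

Because the consumers declare only `in` dependencies on `a`, the runtime is free to overlap them, which is the same mechanism the paper relies on to overlap host and device work.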

The key innovation is the introduction of Task‑Aware libraries (TA‑libs). The paper extends previous work on TACUDA (which wraps CUDA kernel launches as first‑class OpenMP tasks) with a new Task‑Aware SYCL (TASYCL) library. TASYCL lifts SYCL kernel submissions, queues, and memory operations into OpenMP tasks, allowing the same dependency‑driven execution model to manage SYCL workloads. Consequently, developers can keep existing CUDA or SYCL code largely unchanged while gaining the benefits of task‑based scheduling.
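To convey the task-aware idea without a SYCL toolchain, the hypothetical sketch below uses `std::async` as a stand-in for a device queue (this is not the TASYCL API, and `ta_submit_and_wait` is a name we invented for illustration): an asynchronous "kernel" is launched and its completion handle is what a tasking runtime would register as a first-class task, releasing successor tasks only when the device operation finishes.

```cpp
#include <future>
#include <numeric>
#include <vector>

// std::async stands in for a SYCL queue submission; the returned future
// stands in for the completion event a TA-lib would hand to the runtime.
long ta_submit_and_wait(const std::vector<int>& data) {
    std::future<long> kernel = std::async(std::launch::async, [&data] {
        // stands in for a device reduction kernel submitted to a queue
        return std::accumulate(data.begin(), data.end(), 0L);
    });
    // A real TA-lib would not block here: it would let the runtime defer
    // dependent tasks until the event fires. get() models that completion.
    return kernel.get();
}
```

The essential point the sketch captures is that the user never writes an explicit `queue.wait()`; completion is folded into the task's lifetime, so dependency release is handled by the scheduler.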

A second source of inefficiency arises when multiple native runtimes (CUDA driver, SYCL runtime, OpenCL/PoCL, etc.) coexist on the same CPU and each creates its own thread pool. This leads to oversubscription, contention, and unpredictable performance. The authors mitigate this by unifying thread management under nOS‑V, a lightweight tasking and threading library. nOS‑V provides a single shared thread pool that all runtimes use, eliminating oversubscription. The authors also contribute a new port of the PoCL (Portable OpenCL) runtime to nOS‑V, enabling OpenCL and SYCL workloads to benefit from the same unified execution layer.
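The arithmetic behind the oversubscription problem can be made explicit with a small model (ours, not nOS-V code): each native runtime typically sizes a private worker pool to the full core count, so k co-resident runtimes spawn k×N threads, while a shared substrate like nOS-V keeps the total at N.

```cpp
// Model of worker-thread counts on an N-core node hosting several runtimes.
unsigned threads_with_private_pools(unsigned runtimes, unsigned cores) {
    return runtimes * cores;   // each runtime spawns `cores` workers
}

unsigned threads_with_shared_pool(unsigned runtimes, unsigned cores) {
    (void)runtimes;            // pool size is independent of runtime count
    return cores;              // all runtimes draw from one shared pool
}
```

With three runtimes (e.g., the CUDA driver, a SYCL runtime, and PoCL) on an 8-core CPU, private pools yield 24 competing threads versus 8 under a shared pool, which is the contention the nOS-V port eliminates.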

The methodology is evaluated on two platforms: a multi‑core CPU server and a GPU‑accelerated node. Two representative workloads are used: (1) the pre‑training phase of GPT‑2, representing a modern AI pipeline with heavy tensor operations, and (2) the HPCCG conjugate‑gradient benchmark, representing a classic HPC linear‑solver workload. On the GPU node, the authors compare a traditional monolithic kernel execution (single large kernel per phase) with their task‑based data‑flow approach that integrates multiple APIs. The results show comparable execution time (within 2‑3 % difference) and similar memory footprints, demonstrating that the added flexibility does not incur a performance penalty. On the CPU server, unifying all runtimes through nOS‑V eliminates the performance degradation caused by thread oversubscription, achieving performance on par with using a single runtime in isolation.

Key insights from the study include:

  • Fine‑grained dependency management via the DAG enables better overlap of host and device work than coarse fork‑join models, especially for irregular or unbalanced workloads.
  • Task‑Aware libraries provide a thin wrapper that promotes existing accelerator code to first‑class tasks without extensive rewrites, preserving vendor‑specific optimizations.
  • Unified thread management through nOS‑V resolves contention between runtimes, a problem that becomes more acute as more APIs are combined on a single node.
  • The approach is portable and extensible: adding support for other vendor APIs (e.g., ROCm/HIP, Intel Level Zero) or future AI accelerators only requires implementing the corresponding TA‑lib wrapper, while nOS‑V already offers a generic threading substrate.

The authors conclude that the combination of a task‑based data‑flow model, task‑aware accelerator libraries, and a unified threading runtime constitutes a practical framework for programming heterogeneous systems with multiple accelerator APIs. It reduces programming effort, maintains high performance, and is ready for current heterogeneous nodes while being scalable to future architectures that may integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators. Future work will explore automated generation of TA‑libs, deeper integration with distributed runtimes, and broader evaluation on large‑scale supercomputers.

