A C++11 implementation of arbitrary-rank tensors for high-performance computing

This article discusses an efficient implementation of tensors of arbitrary rank using some of the idioms introduced by the recently published C++ ISO Standard (C++11). With the aim of providing a basic building block for high-performance computing, a single Array class template is carefully crafted, from which vectors, matrices, and even higher-order tensors can be created. An expression-template facility is also built around the array class template to provide convenient mathematical syntax. As a result, templates add an extra high-level layer to the C++ language for dealing with algebraic objects and their operations, without compromising performance. The implementation is tested on both CPUs and GPUs.


💡 Research Summary

The paper presents a modern C++11‑based implementation of arbitrary‑rank tensors that targets high‑performance computing (HPC) environments on both CPUs and GPUs. The authors begin by outlining the limitations of existing tensor libraries: fixed rank, extensive code duplication, unnecessary temporary allocations during arithmetic, and poor integration with accelerator hardware. To address these issues they design a single, highly generic class template named Array<T, N>, where T denotes the element type and N the tensor rank. The dimensions of the tensor are stored in a std::array<std::size_t, N> and the constructor accepts an initializer list, enabling a concise declaration such as Array<double,3> A({4,5,6});.

Indexing is implemented via a variadic template operator()(size_t i1, size_t i2, …) that recursively expands the parameter pack. Compile‑time checks (static_assert) guarantee that the number of indices supplied matches the rank, eliminating a large class of runtime errors. Memory layout defaults to row‑major order, but the design permits a column‑major alternative through an additional template parameter, preserving flexibility for linear‑algebra‑heavy workloads.

The most innovative component is the expression‑template framework built around the array class. An abstract base TensorExpr defines the interface for all tensor expressions, while concrete derived classes such as TensorAdd, TensorSub, TensorMul, and TensorScale encode specific operations. Overloaded operators (+, -, *, etc.) construct expression trees rather than performing immediate evaluation, thereby deferring computation until a call to eval(). During evaluation the entire expression is traversed once, generating a single loop nest that computes the final result directly into the destination tensor. This eliminates intermediate temporaries, reduces memory traffic, and enables the compiler to apply aggressive SIMD vectorisation and loop‑unrolling optimisations.

GPU support is achieved by conditionally compiling device‑specific members when __CUDACC__ is defined. The Array class contains a device_ptr and all public member functions are annotated with __host__ __device__, allowing the same source code to be compiled for host and device. Data movement is encapsulated in copy_to_device() and copy_to_host(), which transfer the whole tensor in a single cudaMemcpy operation, minimising PCI‑e overhead. Inside CUDA kernels the multi‑dimensional indices are collapsed into a linear index using pre‑computed strides, guaranteeing coalesced memory accesses.

Performance experiments compare the library against highly tuned BLAS (for matrices) and cuBLAS (for GPU matrices) across a range of ranks and problem sizes. For pure matrix multiplication the overhead relative to BLAS is modest (≈5‑10 %). However, for compound expressions such as C = A*B + D the expression‑template approach outperforms the naïve BLAS‑based sequence by 15‑30 % because it avoids writing the intermediate product to memory. For tensors of rank three and higher the library shows a clear advantage: memory consumption is dramatically lower, and the ability to fuse operations yields speedups that grow with tensor dimensionality. GPU benchmarks demonstrate up to an 8× speedup over the CPU implementation, with scaling that remains essentially linear as the tensor size increases.

The authors also provide extensive usage examples that illustrate tensor creation, reshaping, slicing, and the composition of complex arithmetic in a single line of code. The implementation is confined to a single header (tensor.hpp), has no external dependencies beyond the C++ standard library, and deliberately avoids overly intricate metaprogramming tricks that would hinder readability.

In conclusion, the paper demonstrates that C++11’s language features—variadic templates, constexpr, decltype, lambda expressions, and unified host/device annotations—are sufficient to build a clean, high‑performance, rank‑agnostic tensor library. The resulting framework bridges the gap between expressive mathematical notation and low‑level performance, making it a valuable building block for scientific simulations, machine‑learning kernels, and other HPC applications. Future work suggested includes automatic tensor contraction (e.g., trace and Einstein summation), distributed‑memory extensions, and integration with other accelerator APIs such as OpenCL or SYCL.