Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM

February 22, 2026

📝 Original Info

Title: Toward Portable GPU Performance: Julia Recursive Implementation of TRMM and TRSM
ArXiv ID: 2504.13821
Date: 2025-04-18
Authors: Not available (not provided)

📝 Abstract

This paper presents a performant and portable recursive implementation of triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) in Julia for GPUs, two kernels that underlie many linear-algebra algorithms. We restructure TRMM and TRSM so that most work is executed as general matrix-matrix multiplication (GEMM), improving use of the GPU memory hierarchy and reducing latency. Exploiting Julia's multiple dispatch and metaprogramming together with the GPUArrays and KernelAbstractions frameworks, we expose a single hardware-agnostic API that runs on NVIDIA, AMD, and Apple Silicon GPUs. For large matrices the recursive code reaches throughput comparable to vendor libraries such as cuBLAS and rocBLAS, while providing these routines on Apple Silicon for the first time. The entire implementation is only a few hundred lines of code, showing that unified Julia programs can deliver near-vendor performance across heterogeneous architectures.
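
To make the restructuring described in the abstract concrete, here is a minimal CPU-only sketch of how a recursive split pushes most of the work of TRSM and TRMM into GEMM calls. It is not the paper's implementation: the function names `rec_trsm!` and `rec_trmm!`, the `threshold` base-case size, and the restriction to the left-sided, lower-triangular case are assumptions made for illustration, and it uses only Julia's standard `LinearAlgebra` routines rather than GPUArrays or KernelAbstractions kernels.

```julia
using LinearAlgebra

# Hypothetical sketch: solve L * X = B for lower-triangular L,
# overwriting B with X. `threshold` is an assumed base-case size.
function rec_trsm!(L::AbstractMatrix, B::AbstractMatrix; threshold::Int = 64)
    n = size(L, 1)
    if n <= threshold
        # Base case: small triangular solve from the standard library.
        ldiv!(LowerTriangular(L), B)
        return B
    end
    h = n ÷ 2
    L11 = view(L, 1:h, 1:h)
    L21 = view(L, h+1:n, 1:h)
    L22 = view(L, h+1:n, h+1:n)
    B1 = view(B, 1:h, :)
    B2 = view(B, h+1:n, :)

    rec_trsm!(L11, B1; threshold = threshold)  # X1 = L11 \ B1
    mul!(B2, L21, B1, -1, 1)                   # B2 -= L21 * X1 (GEMM)
    rec_trsm!(L22, B2; threshold = threshold)  # X2 = L22 \ B2
    return B
end

# Hypothetical sketch: compute B := L * B for lower-triangular L.
function rec_trmm!(L::AbstractMatrix, B::AbstractMatrix; threshold::Int = 64)
    n = size(L, 1)
    if n <= threshold
        # Base case: small triangular multiply from the standard library.
        lmul!(LowerTriangular(L), B)
        return B
    end
    h = n ÷ 2
    L11 = view(L, 1:h, 1:h)
    L21 = view(L, h+1:n, 1:h)
    L22 = view(L, h+1:n, h+1:n)
    B1 = view(B, 1:h, :)
    B2 = view(B, h+1:n, :)

    # The bottom block goes first because it still needs the original B1.
    rec_trmm!(L22, B2; threshold = threshold)  # B2 = L22 * B2
    mul!(B2, L21, B1, 1, 1)                    # B2 += L21 * B1 (GEMM)
    rec_trmm!(L11, B1; threshold = threshold)  # B1 = L11 * B1
    return B
end
```

In this sketch each level of recursion issues one GEMM on the off-diagonal block `L21` and two half-size triangular calls, so as the matrix grows an increasing share of the floating-point work lands in `mul!`. A small check such as `rec_trsm!(L, copy(B); threshold = 2)` followed by comparing `LowerTriangular(L) * X` against `B` exercises the recursive path on a small matrix.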
