End-to-End Throughput Benchmarking of Portable Deterministic CNN-Based Signal Processing Pipelines

This paper presents a benchmarking methodology for evaluating end-to-end performance of deterministic signal-processing pipelines expressed using CNN-compatible primitives. The benchmark targets phased-array workloads such as ultrasound imaging and evaluates complete RF-to-image pipelines under realistic execution conditions. Performance is reported using sustained input throughput (MB/s), effective frame rate (FPS), and, where available, incremental energy per run and peak memory usage. Using this methodology, we benchmark a single deterministic, training-free CNN-based signal-processing pipeline executed unmodified across heterogeneous accelerator platforms, including an NVIDIA RTX 5090 GPU and a Google TPU v5e-1. The results demonstrate how different operator formulations (dynamic indexing, fully CNN-expressed, and sparse-matrix-based) impact performance and portability across architectures. This work is motivated by the need for portable, certifiable signal-processing implementations that avoid hardware-specific refactoring while retaining high performance on modern AI accelerators.


💡 Research Summary

The paper introduces a systematic benchmarking methodology for evaluating the end‑to‑end performance of deterministic signal‑processing pipelines that are expressed solely with CNN‑compatible primitives. Targeting phased‑array workloads such as medical ultrasound, the authors construct a single, training‑free pipeline that contains no learned weights and therefore preserves determinism, bounded error, and certifiability. The pipeline is implemented in three variants: (1) a dynamic‑indexing version that uses gather‑style operations, (2) a fully CNN‑expressed version that relies only on convolutions, point‑wise arithmetic, and reductions, and (3) a sparse‑matrix version that replaces dynamic indexing with structured sparse matrix multiplications.
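
The paper's implementation is not public, but the three formulations can be illustrated on a toy operation. Below is a minimal NumPy sketch of integer-delay channel alignment (the core data movement in delay-and-sum beamforming), assuming circular integer delays; all function names are hypothetical and the paper's actual operators are certainly richer than this:

```python
import numpy as np

def align_gather(rf, delays):
    """Variant 1: dynamic indexing -- gather each channel at its delay."""
    n_samp = rf.shape[1]
    idx = (np.arange(n_samp)[None, :] + delays[:, None]) % n_samp
    return np.take_along_axis(rf, idx, axis=1)

def align_conv(rf, delays):
    """Variant 2: CNN-expressed -- each integer delay is a correlation
    with a fixed delta kernel applied to a circularly padded signal."""
    out = np.empty_like(rf)
    for c, d in enumerate(delays):
        padded = np.concatenate([rf[c], rf[c][:d]])  # circular pad by d
        kernel = np.zeros(d + 1)
        kernel[d] = 1.0                              # delta at lag d
        out[c] = np.correlate(padded, kernel, mode="valid")
    return out

def align_spmm(rf, delays):
    """Variant 3: sparse matrix -- the gather becomes multiplication
    by a per-channel 0/1 selection (permutation) matrix."""
    out = np.empty_like(rf)
    n_samp = rf.shape[1]
    for c, d in enumerate(delays):
        sel = np.zeros((n_samp, n_samp))
        sel[np.arange(n_samp), (np.arange(n_samp) + d) % n_samp] = 1.0
        out[c] = sel @ rf[c]
    return out
```

All three produce identical outputs (summing the aligned channels would complete a delay-and-sum step), which is the property that lets the benchmark compare the formulations fairly: only the operator expression changes, never the numerical result.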

All three variants are executed unmodified on two heterogeneous accelerator platforms – an NVIDIA RTX 5090 GPU and a Google TPU v5e‑1 – using the same source code, same input tensors, and identical warm‑up and timing procedures. The benchmark measures sustained input throughput (MB/s), effective frame rate (FPS), incremental energy per run (GPU only), and peak device memory usage (GPU only). Three imaging modalities are evaluated: B‑mode (RF → IQ → beamformed IQ → dynamic‑range compressed image), Color Doppler (adds lag‑1 autocorrelation and spatial smoothing), and Power Doppler (adds power accumulation and log scaling).
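
The Doppler additions correspond to standard estimators: lag-1 autocorrelation is the classic Kasai velocity estimator, and Power Doppler is ensemble power with log scaling. As an illustration only (the paper's exact formulation, smoothing kernels, and parameters are not given; `prf` and `f0` here are hypothetical inputs), the core computations look like:

```python
import numpy as np

def color_doppler_velocity(iq, prf, f0, c=1540.0):
    """Lag-1 autocorrelation (Kasai) axial velocity estimate.

    iq : complex ensemble of beamformed IQ frames, shape (n_frames, H, W)
    prf: pulse repetition frequency (Hz); f0: center frequency (Hz)
    (Spatial smoothing, which the paper also applies, is omitted here.)
    """
    # lag-1 autocorrelation across the slow-time (frame) axis
    r1 = np.mean(iq[1:] * np.conj(iq[:-1]), axis=0)
    phase = np.angle(r1)                 # mean Doppler phase shift per pulse
    return phase * c * prf / (4.0 * np.pi * f0)

def power_doppler(iq):
    """Power Doppler: accumulate ensemble power, then log-compress."""
    power = np.mean(np.abs(iq) ** 2, axis=0)
    return 10.0 * np.log10(power + 1e-12)  # small epsilon avoids log(0)
```

Both estimators reduce to per-pixel complex arithmetic and reductions over the frame axis, which is why they fit the CNN-compatible operator set.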

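The summary does not detail the warm-up and timing procedure, but sustained MB/s and FPS are conventionally derived from a harness like the following sketch (names hypothetical; on a real accelerator the timed region must end with a device synchronization, e.g. a blocking copy or explicit sync, so that queued kernels are actually finished):

```python
import time
import numpy as np

def benchmark(pipeline, frames, warmup=10, runs=100):
    """Measure sustained input throughput (MB/s) and frame rate (FPS).

    pipeline : callable processing one RF frame end to end
    frames   : list of input arrays; bytes are counted from .nbytes
    """
    for f in frames[:warmup]:          # warm-up: JIT compilation, caches, clocks
        pipeline(f)
    total_bytes = 0
    start = time.perf_counter()
    for i in range(runs):
        f = frames[i % len(frames)]
        pipeline(f)                    # on GPU/TPU: synchronize before stopping
        total_bytes += f.nbytes
    elapsed = time.perf_counter() - start
    return {"MB/s": total_bytes / elapsed / 1e6, "FPS": runs / elapsed}
```

Measuring input bytes consumed per wall-clock second (rather than per-kernel time) is what makes the metric end-to-end: inter-stage transfers and synchronization all land inside the timed region.
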
Results on the RTX 5090 show that the dynamic‑indexing variant achieves the highest raw throughput (up to ~7 GB/s) and frame rates exceeding 1,200 FPS for Doppler workloads, thanks to the GPU’s strong support for irregular memory access. However, this variant also incurs higher memory footprints (up to ~1 GB for B‑mode) and higher incremental energy (~0.05 J per run). The fully CNN‑expressed variant trades some raw throughput (e.g., 283 MB/s and 52 FPS for B‑mode) for dramatically lower memory usage (≈0.36 GB) and better energy efficiency. The sparse‑matrix variant on GPU delivers throughput comparable to dynamic indexing for Doppler but suffers from increased peak memory (up to 6 GB for B‑mode).

On the TPU v5e‑1, the dynamic‑indexing variant performs poorly (≈30 MB/s, ~5 FPS) because the XLA backend cannot efficiently handle gather‑style operations. In stark contrast, the fully CNN‑expressed variant leverages the TPU’s matrix‑multiply‑centric architecture to achieve >500 MB/s and up to 104 FPS, representing a ten‑fold speedup over the dynamic‑indexing version. The sparse‑matrix variant could not be evaluated on TPU due to lack of backend support for structured sparse operators.

The authors discuss that end‑to‑end benchmarking captures effects invisible to kernel‑level microbenchmarks, such as inter‑stage memory traffic, tensor materialization, and synchronization overhead. They argue that the fully CNN formulation provides the best balance of portability, determinism, and performance across heterogeneous accelerators, while dynamic indexing remains advantageous only on GPUs that natively support irregular memory patterns.

Limitations include the absence of image‑quality or clinical‑performance evaluation, proprietary constraints preventing public release of the full implementation, and incomplete telemetry on the TPU (no board‑level power or precise memory usage).

In conclusion, the work demonstrates that deterministic, CNN‑compatible signal‑processing pipelines can be executed unchanged on both high‑end GPUs and TPUs, achieving state‑of‑the‑art throughput without hardware‑specific tuning. This establishes a practical path toward portable, certifiable real‑time imaging systems that can be deployed across diverse accelerator ecosystems with minimal software engineering effort.
