SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch
We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence-length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti-diagonal kernel execution that removes the sequence-length constraint, (2) a log-space backward pass that prevents floating-point overflow, and (3) a fused distance-computation mode that eliminates the O(BNM) intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and Soft-DTW barycenter computation. Code is available at https://github.com/BGU-CS-VIL/sdtw-cuda-torch.
💡 Research Summary
SoftDTW‑CUDA‑Torch is an open‑source PyTorch extension that brings a fully GPU‑accelerated, memory‑efficient, and numerically stable implementation of Soft Dynamic Time Warping (Soft‑DTW) to the deep‑learning community. The authors identify three practical shortcomings of the previously available GPU implementation (Maghoumi 2020): (1) a hard cap of 1024 on sequence length due to the 1024‑thread per‑block limit in CUDA, (2) numerical overflow in the backward pass when the smoothing parameter γ is small, and (3) the need to materialize a full pairwise distance tensor of shape (B, N, M), which quickly exhausts GPU memory for realistic batch sizes, sequence lengths, or feature dimensions.
To overcome these issues the paper introduces three key techniques. First, tiled anti‑diagonal execution replaces the monolithic kernel that processes the entire DP matrix with a series of lightweight kernels, each handling a single anti‑diagonal (i + j = p). By launching one kernel per diagonal and using the host‑side loop as an implicit synchronization barrier, the method removes the 1024‑thread restriction and allows arbitrary N and M while still falling back to the original single‑kernel fast path when max(N, M) ≤ 1024.
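The wavefront schedule can be sketched on the host side in pure PyTorch: each iteration of the Python loop below stands in for one per-diagonal kernel launch, with every cell on the diagonal updated in parallel (here, vectorized). The function names are illustrative, not the library's actual kernels.

```python
import torch

def softmin3(a, b, c, gamma):
    # Differentiable soft-minimum via log-sum-exp.
    return -gamma * torch.logsumexp(-torch.stack([a, b, c]) / gamma, dim=0)

def softdtw_antidiagonal(D, gamma=1.0):
    # Forward Soft-DTW over one (N, M) distance matrix, sweeping
    # anti-diagonals i + j = p: every cell on a diagonal depends only
    # on the two previous diagonals, so each sweep is parallelizable.
    N, M = D.shape
    R = torch.full((N + 1, M + 1), float('inf'), dtype=D.dtype)
    R[0, 0] = 0.0
    # The host-side loop doubles as the synchronization barrier
    # between diagonals, replacing intra-kernel __syncthreads().
    for p in range(2, N + M + 1):                      # p = i + j (1-based)
        i = torch.arange(max(1, p - M), min(N, p - 1) + 1)
        j = p - i
        R[i, j] = D[i - 1, j - 1] + softmin3(
            R[i - 1, j - 1], R[i - 1, j], R[i, j - 1], gamma)
    return R[N, M]
```

For small γ the soft-minimum approaches a hard minimum, so the result approaches the classical DTW cost; in the CUDA version, the single-kernel fast path simply fuses all diagonals into one launch when max(N, M) ≤ 1024.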
Second, a log‑space backward pass rewrites the gradient recurrence so that the unstable exponential weight computation becomes a log‑sum‑exp formulation. Intermediate values are kept as logarithms, and a single exponentiation is performed only after the whole backward DP finishes. This prevents overflow even for γ < 0.1, yielding correct gradients across the full range of smoothing parameters.
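The idea can be sketched in pure PyTorch as a scalar-loop reference (not the CUDA kernel; function names are illustrative). The recurrence is the standard Soft-DTW gradient DP, but the alignment weights E are propagated as log E via `torch.logsumexp` and exponentiated only once at the end:

```python
import torch

def softmin3(a, b, c, gamma):
    return -gamma * torch.logsumexp(-torch.stack([a, b, c]) / gamma, dim=0)

def softdtw_forward(D, gamma):
    # Reference forward pass; returns the padded DP matrix R.
    N, M = D.shape
    R = torch.full((N + 2, M + 2), float('inf'), dtype=D.dtype)
    R[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            R[i, j] = D[i - 1, j - 1] + softmin3(
                R[i - 1, j - 1], R[i - 1, j], R[i, j - 1], gamma)
    return R

def softdtw_grad_logspace(D, R, gamma):
    # Backward DP kept entirely in log-space: log E is propagated with
    # log-sum-exp and exponentiated only once at the very end, so the
    # weights exp((R[..] - R[i, j] - D[..]) / gamma) never overflow
    # for small gamma. Returns E = dSoftDTW / dD.
    N, M = D.shape
    Dp = torch.zeros(N + 2, M + 2, dtype=D.dtype)   # zero-padded distances
    Dp[1:N + 1, 1:M + 1] = D
    R = R.clone()
    R[N + 1, :] = float('-inf')
    R[:, M + 1] = float('-inf')
    R[N + 1, M + 1] = R[N, M]
    logE = torch.full((N + 2, M + 2), float('-inf'), dtype=D.dtype)
    logE[N + 1, M + 1] = 0.0                        # corner weight E = 1
    for i in range(N, 0, -1):
        for j in range(M, 0, -1):
            a = logE[i + 1, j] + (R[i + 1, j] - R[i, j] - Dp[i + 1, j]) / gamma
            b = logE[i, j + 1] + (R[i, j + 1] - R[i, j] - Dp[i, j + 1]) / gamma
            c = logE[i + 1, j + 1] + (R[i + 1, j + 1] - R[i, j] - Dp[i + 1, j + 1]) / gamma
            logE[i, j] = torch.logsumexp(torch.stack([a, b, c]), dim=0)
    return torch.exp(logE[1:N + 1, 1:M + 1])
```

The returned matrix E matches the gradient autograd computes through the forward recurrence, but each intermediate stays as a logarithm, which is what keeps small-γ runs finite.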
Third, the authors propose a fused distance computation mode that eliminates the O(B · N · M) distance tensor. Using the identity ‖xᵢ − yⱼ‖² = ‖xᵢ‖² − 2⟨xᵢ, yⱼ⟩ + ‖yⱼ‖², they pre‑compute the squared norms of each sequence (shapes B × N and B × M) and compute dot‑products on the fly with batched matrix multiplication. During DP kernel execution the required distance for each cell is recomputed directly from the input tensors, reducing memory consumption to O(B · (N + M)). The trade‑off is a 10–15× increase in runtime compared with the “unfused” mode that stores the full distance matrix.
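A minimal sketch of the identity and the memory trade-off (shapes and names are illustrative): the full (B, N, M) tensor is replaced by cached squared norms of shape (B, N) and (B, M), and each cell's distance is recomputed on demand from the inputs.

```python
import torch

def pairwise_sqdist_full(x, y):
    # Unfused mode: materializes the full (B, N, M) distance tensor.
    return ((x.unsqueeze(2) - y.unsqueeze(1)) ** 2).sum(-1)

def fused_sqdist_cell(x, y, x_sq, y_sq, b, i, j):
    # Fused mode: recompute a single cell's distance on demand from
    # cached squared norms (O(B*(N+M)) memory) and one dot product,
    # using ||x_i - y_j||^2 = ||x_i||^2 - 2<x_i, y_j> + ||y_j||^2.
    return x_sq[b, i] - 2.0 * torch.dot(x[b, i], y[b, j]) + y_sq[b, j]

x = torch.randn(4, 8, 16)    # (B, N, D) input sequences
y = torch.randn(4, 10, 16)   # (B, M, D)
x_sq = (x ** 2).sum(-1)      # (B, N), cached once up front
y_sq = (y ** 2).sum(-1)      # (B, M)
```

The per-cell recomputation is exactly why the fused mode trades runtime for memory: each DP cell redoes a D-length dot product instead of one tensor read.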
The library supports both unfused (fast, memory‑heavy) and fused (slow, memory‑light) modes, full autograd integration, batch processing, and Soft‑DTW barycenter optimization. The barycenter is obtained by minimizing the sum of Soft‑DTW distances to a set of series using the Adam optimizer, enabling differentiable averaging of time‑series.
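A barycenter sketch under these assumptions (using a minimal scalar-loop Soft-DTW, not the library's API; names and hyperparameters are illustrative): initialize the barycenter from one series and minimize the summed Soft-DTW cost with Adam, letting autograd supply the gradients.

```python
import torch

def softdtw(x, y, gamma=1.0):
    # Minimal differentiable Soft-DTW between two (len, dim) series;
    # a scalar-loop reference, not the library's CUDA kernels.
    D = torch.cdist(x, y) ** 2
    N, M = D.shape
    R = torch.full((N + 1, M + 1), float('inf'))
    R[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            prev = torch.stack([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            R[i, j] = D[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, dim=0)
    return R[N, M]

# Barycenter: start from one of the series and minimize the summed
# Soft-DTW cost to all series with Adam.
torch.manual_seed(0)
series = [torch.randn(8, 2) for _ in range(3)]
bary = series[0].clone().requires_grad_(True)
opt = torch.optim.Adam([bary], lr=0.1)
losses = []
for step in range(25):
    opt.zero_grad()
    loss = sum(softdtw(bary, s) for s in series)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Because Soft-DTW is differentiable in its first argument, the same loop works for any collection of series, which is what makes the averaging itself differentiable.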
Benchmarks on an NVIDIA GTX 1080 (batch sizes 16 and 32, sequence lengths 128–2048, feature dimension 64) show that the fused mode reduces peak GPU memory by up to 98% relative to the original implementation, while the unfused mode still achieves roughly 90% memory savings. The unfused mode remains the fastest, with wall‑clock times comparable to the baseline for short sequences, whereas the fused mode incurs a noticeable slowdown but stays operational where the baseline would run out of memory. Crucially, the new implementation is the only publicly available GPU Soft‑DTW that can handle sequences longer than 1024 without falling back to CPU.
The paper also discusses current limitations: the fused mode’s runtime overhead could be mitigated by shared‑memory tiling or persistent‑kernel techniques; the implementation currently operates only in FP32, leaving potential gains from mixed‑precision (FP16/BF16) untapped; the large number of kernel launches (N + M − 1) may become a bottleneck for very long series, suggesting the use of CUDA‑graphs or persistent kernels; and the normalized Soft‑DTW variant still requires equal‑length inputs.
In summary, SoftDTW‑CUDA‑Torch delivers a practical, scalable solution for differentiable time‑series alignment on modern GPUs. By removing the sequence‑length cap, ensuring numerical stability in the backward pass, and dramatically cutting memory usage through on‑the‑fly distance computation, it enables researchers and practitioners to incorporate Soft‑DTW into large‑scale training pipelines, clustering, and barycenter estimation without the memory constraints that previously limited its applicability. The library is released under the MIT license and is readily extensible for future enhancements such as mixed‑precision support and more efficient kernel orchestration.