Loss functions are fundamental to learning accurate 3D point cloud models, yet common choices trade geometric fidelity for computational cost. Chamfer Distance is efficient but permits many-to-one correspondences, while Earth Mover Distance better reflects one-to-one transport at high computational cost. APML approximates transport with differentiable Sinkhorn iterations and an analytically derived temperature, but its dense formulation scales quadratically in memory. We present CUDA-APML, a sparse GPU implementation that thresholds negligible assignments and runs adaptive softmax, bidirectional symmetrization, and Sinkhorn normalization directly in COO form. This yields near-linear memory scaling and preserves gradients on the stored support, while pairwise distance evaluation remains quadratic in the current implementation. On ShapeNet and MM-Fi, CUDA-APML matches dense APML within a small tolerance while reducing peak GPU memory by 99.9%. Code
Three-dimensional point clouds represent geometry from LiDAR, depth cameras, and structured-light scanners, and they also appear in cross-modal pipelines that infer 3D structure from non-visual measurements. Point cloud learning supports robotics, AR and VR, wireless human sensing, and digital twin pipelines [1]–[4]. In supervised settings, models output unordered point sets, so training requires permutation-invariant losses that remain stable at large point counts.
Matching quality and scalability are often in tension. Nearest-neighbor objectives are efficient but allow many-to-one assignments that can degrade geometric coverage, whereas optimal transport losses better reflect one-to-one matching but are costly at training scale. This cost constrains batch size and point count, reduces the number of feasible training runs under fixed compute budgets, and increases energy and monetary cost during GPU training and tuning. APML [5] narrows this gap by constructing a differentiable transport plan with temperature-scaled similarities and Sinkhorn normalization, using an analytically derived temperature. However, its dense formulation stores pairwise matrices with quadratic memory growth, which can dominate training resources. For example, [5] reported close to 300 GB of GPU memory when training FoldingNet [6] on ShapeNet-55 [7]. Such requirements restrict use in higher-resolution regimes that arise in large scenes and repeated updates, including mapping and digital twin pipelines [8]–[13].

This research was supported by the Research Council of Finland 6G Flagship Programme (Grant 346208), Horizon Europe CONVERGE (Grant 101094831), and Business Finland WISECOM (Grant 3630/31/2024).

- These authors contributed equally to this work.
We present CUDA-APML, a sparse GPU implementation of APML that exploits the empirical sparsity of the transport plan after adaptive softmax thresholding. CUDA-APML constructs bidirectional assignments directly in COO form, performs on-device symmetrization and Sinkhorn normalization on the sparse support, and computes the loss and gradients without materializing dense matrices. On ShapeNet and MM-Fi, CUDA-APML matches dense APML within a small tolerance while reducing peak GPU memory by up to 99.9%, improving the practical cost profile of transport-based supervision for point cloud and cross-modal 3D learning.
Point cloud learning requires permutation-invariant losses that compare unordered predicted point sets with reference point sets. Chamfer Distance (CD) is widely used because it reduces the comparison to nearest-neighbor distances computed in both directions [2], [14]. However, its many-to-one assignments can cause point clustering in dense regions and weak coverage in sparse regions [15]. Extensions such as density-aware CD (DCD) [16], HyperCD [15], and InfoCD [17] mitigate some of these effects but still rely on nearest-neighbor matching and remain sensitive to sampling density and local ambiguities. By contrast, Earth Mover Distance (EMD) enforces one-to-one transport and better reflects structural alignment [18], but its computational cost and common equal-cardinality constraints limit its use in training. Alternatives include projected distances such as sliced Wasserstein and approximate transport solvers [19], [20], which often introduce additional design choices and may not match full transport behavior on heterogeneous point distributions.
Entropy-regularized optimal transport provides a differentiable approximation of EMD that can be computed by Sinkhorn normalization [21], [22]. Sinkhorn-based losses have been used for soft correspondence problems such as registration and feature matching [23], [24], but typically require tuning the regularization strength to balance sharpness and stability. APML [5] builds a temperature-scaled similarity matrix from pairwise distances and applies Sinkhorn normalization to obtain soft correspondences with approximately one-to-one marginals. Its temperature is derived from local distance gaps to enforce a minimum assignment probability, reducing manual tuning, but its dense formulation stores and updates pairwise matrices with quadratic memory growth, which restricts batch size and point count in practice [5].
Scaling optimal transport has been studied through sparse or structured support restriction and low-rank approximations [25]–[27]. In parallel, kernel-level designs show that removing dense intermediates can shift practical limits for quadratic operators, as in FlashAttention [28] and KeOps [29]. In 3D perception, sparse tensor engines and point-based libraries provide GPU support for irregular sparsity patterns to improve memory behavior and throughput [30], [31]. These directions motivate a sparse and GPU-oriented implementation of APML. In this work, we exploit the empirical sparsity of the APML transport plan after adaptive softmax thresholding and run symmetrization and Sinkhorn normalization on the sparse support to reduce the dense matrix footprint while keeping the objective within a controlled approximation.
This section presents CUDA-APML, a sparse GPU implementation of APML [5]. The objective is to avoid dense $N \times M$ tensors during forward and backward passes while keeping the APML formulation and its differentiability on the sparse support. CUDA-APML constructs row-wise and column-wise adaptive-softmax assignments directly in sparse coordinate (COO) form, symmetrizes them on device, applies Sinkhorn normalization on the sparse support, and evaluates the loss using only stored nonzero entries.
A. Background: dense APML

Let $X = \{x_i\}_{i=1}^{N}$ be the predicted points and $Y = \{y_j\}_{j=1}^{M}$ be the reference points with $x_i, y_j \in \mathbb{R}^d$. APML starts from the pairwise cost matrix $C_{ij} = \|x_i - y_j\|_2$.
For a generic cost vector $c \in \mathbb{R}^K$, APML shifts it as $\tilde{c}_k = c_k - \min_{\ell} c_{\ell}$ so that $\min_k \tilde{c}_k = 0$. Let $\tilde{c}_{(2)}$ denote the second smallest value in $\tilde{c}$. The local gap is defined as $g = \tilde{c}_{(2)} + \delta$ with $\delta > 0$. APML computes a temperature $T$ that enforces a minimum probability mass $p_{\min}$ on the smallest entry. Under the approximation in [5], which retains only the smallest and second-smallest entries of the softmax, this yields
$$T = \frac{g}{\log\big(p_{\min}/(1 - p_{\min})\big)}. \quad (1)$$
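The shift-and-gap computation can be sketched as follows. This is a minimal NumPy illustration under our own naming, assuming the two-entry softmax form of Eq. (1) above; the function name and defaults are ours, not from the reference implementation of [5].

```python
import numpy as np

def adaptive_temperature(c, p_min=0.9, delta=1e-6, eps_g=1e-12):
    """Temperature so a two-entry softmax puts mass p_min on the smallest cost.

    Shift the cost vector so its minimum is zero, take the gap g to the
    second-smallest entry, and solve p_min = 1 / (1 + exp(-g / T)) for T,
    giving T = g / log(p_min / (1 - p_min)).  (Assumed form of Eq. (1).)
    """
    c = np.asarray(c, dtype=np.float64)
    c_shift = c - c.min()                  # now min_k c_shift[k] == 0
    second = np.partition(c_shift, 1)[1]   # second-smallest shifted value
    if second < eps_g:                     # degenerate gap: caller falls back
        return None                        # to the uniform distribution
    g = second + delta                     # local gap
    return g / np.log(p_min / (1.0 - p_min))

T = adaptive_temperature([0.3, 1.1, 0.7, 2.0], p_min=0.9)
```

Plugging `T` back into the two-entry softmax recovers exactly `p_min` on the smallest entry, which is the self-consistency property the derivation relies on.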
When $\tilde{c}_{(2)} < \epsilon_g$, APML applies a uniform fallback distribution for numerical stability. APML applies the adaptive-temperature softmax row-wise and column-wise on $C$, producing $P^{\mathrm{row}}$ and $P^{\mathrm{col}}$, and initializes $P^{(0)} = \frac{1}{2}(P^{\mathrm{row}} + P^{\mathrm{col}})$. Sinkhorn normalization then alternates column and row scaling for $L_{\mathrm{iter}}$ iterations with stability constant $\epsilon_{\mathrm{stab}}$ [5]. The APML objective is the Frobenius inner product
$$\mathcal{L}_{\mathrm{APML}} = \langle P, C \rangle_F = \sum_{i,j} P_{ij}\, C_{ij}.$$
The dense formulation stores and updates $C$ and $P$, which yields $O(NM)$ memory.
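The dense pipeline above can be condensed into a short NumPy reference. This is a sketch under our own naming and the assumed two-entry temperature form, not the official implementation of [5]; the degenerate-gap fallback is omitted for brevity.

```python
import numpy as np

def adaptive_softmax(C, p_min=0.9, delta=1e-6, axis=1):
    """Adaptive-temperature softmax along one axis of the cost matrix C."""
    Cs = C - C.min(axis=axis, keepdims=True)           # shifted costs, min = 0
    second = np.take(np.partition(Cs, 1, axis=axis), 1, axis=axis)
    g = np.expand_dims(second, axis) + delta           # local gap per row/col
    T = g / np.log(p_min / (1.0 - p_min))              # assumed Eq. (1) form
    S = np.exp(-Cs / T)                                # unnormalized similarity
    return S / S.sum(axis=axis, keepdims=True)

def dense_apml(X, Y, p_min=0.9, L_iter=10, eps_stab=1e-8):
    """Dense APML loss: adaptive softmax both ways, Sinkhorn, then <P, C>_F."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # N x M costs
    P = 0.5 * (adaptive_softmax(C, p_min, axis=1)
               + adaptive_softmax(C, p_min, axis=0))            # P^(0)
    for _ in range(L_iter):
        P = P / (P.sum(axis=0, keepdims=True) + eps_stab)       # column scaling
        P = P / (P.sum(axis=1, keepdims=True) + eps_stab)       # row scaling
    return float((P * C).sum()), P
```

Note the two dense $N \times M$ arrays `C` and `P`: this is exactly the quadratic footprint that the sparse formulation below removes.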
CUDA-APML is based on an empirical property reported in [5] and confirmed in our measurements. After the adaptive softmax stage, most entries in the unnormalized similarity matrix are close to zero. For a fixed row $i$, the unnormalized similarity can be written as
$$s^{\mathrm{row}}_{ij} = \exp\!\Big(-\big(C_{ij} - \min_{j'} C_{ij'}\big)/T_i\Big),$$
where $T_i$ is computed from the row-wise gap through Eq. (1). The corresponding normalized probabilities are defined as
$$p^{\mathrm{row}}_{ij} = \frac{s^{\mathrm{row}}_{ij}}{\sum_{j'} s^{\mathrm{row}}_{ij'}}.$$
Analogous definitions hold for the column-wise direction, producing $p^{\mathrm{col}}_{ij}$. CUDA-APML introduces a pruning threshold $\tau$ applied to the unnormalized similarities and keeps only the index pairs in the support sets, defined as $\Omega^{\mathrm{row}}_{\tau} = \{(i,j) : s^{\mathrm{row}}_{ij} \ge \tau\}$ and $\Omega^{\mathrm{col}}_{\tau} = \{(i,j) : s^{\mathrm{col}}_{ij} \ge \tau\}$.
The sparse initialization keeps the union support $\Omega_{\tau} = \Omega^{\mathrm{row}}_{\tau} \cup \Omega^{\mathrm{col}}_{\tau}$ and stores only those entries in COO format. The objective is to reduce memory traffic and storage while preserving the probability mass that dominates the transport plan.
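The row-wise half of this support construction can be sketched in NumPy. The function name `row_support` and its defaults are illustrative assumptions, not the CUDA kernels; the sketch builds the dense similarity first for clarity, whereas the kernels below never materialize it.

```python
import numpy as np

def row_support(X, Y, tau, p_min=0.9, delta=1e-6):
    """Row-wise adaptive softmax pruned at tau, returned as COO (i, j, v)."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    Cs = C - C.min(axis=1, keepdims=True)              # min subtracted per row
    g = np.partition(Cs, 1, axis=1)[:, 1:2] + delta    # row-wise local gap
    T = g / np.log(p_min / (1.0 - p_min))              # assumed Eq. (1) form
    S = np.exp(-Cs / T)
    i, j = np.nonzero(S >= tau)                        # kept pairs: Omega_row
    v = S[i, j]
    # Normalize each row by the sum of its *kept* similarities only.
    v = v / np.bincount(i, weights=v, minlength=len(X))[i]
    return i, j, v
```

Every row retains at least its nearest neighbor (similarity exactly 1), so each row still carries unit mass on the pruned support, while distant pairs underflow below `tau` and are dropped.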
The number of kept entries per row and per column is data-dependent, so CUDA-APML constructs COO buffers in two GPU passes per direction. In the first pass, each thread block scans one row or column, computes costs $C_{ij}$ on the fly, tracks the minimum and second minimum to obtain the local temperature, and counts entries that satisfy the pruning rule in $\Omega^{\mathrm{row}}_{\tau}$ or $\Omega^{\mathrm{col}}_{\tau}$. An exclusive prefix sum converts these counts into write offsets and yields the total number of nonzeros. In the second pass, the kernel rescans, recomputes unnormalized similarities (with $\min_j C_{ij}$ subtracted before the exponential), writes the kept COO triples $(i, j, v)$, and normalizes within each row or column by the sum of kept similarities on that support. If the uniform fallback is triggered, the kernel writes a uniform distribution over the corresponding row or column. This produces $P^{\mathrm{row}}$ on $\Omega^{\mathrm{row}}_{\tau}$ and $P^{\mathrm{col}}$ on $\Omega^{\mathrm{col}}_{\tau}$.

To symmetrize without dense materialization, CUDA-APML concatenates the COO streams, maps each $(i, j)$ to a 64-bit key (for example $\mathrm{key} = i \cdot M + j$), sorts by key on device, and reduces duplicates by averaging values, yielding a single COO plan $P^{(0)}$ on $\Omega_{\tau}$. Sinkhorn normalization then operates on this sparse support. Let $P$ be stored in COO with entries $\{(i_t, j_t, v_t)\}_{t=1}^{\mathrm{nnz}}$. One iteration alternates column scaling,
$$v_t \leftarrow \frac{v_t}{\sum_{t' : j_{t'} = j_t} v_{t'} + \epsilon_{\mathrm{stab}}},$$
and row scaling,
$$v_t \leftarrow \frac{v_t}{\sum_{t' : i_{t'} = i_t} v_{t'} + \epsilon_{\mathrm{stab}}},$$
implemented by an accumulation kernel (atomic adds into dense length-$M$ or length-$N$ buffers) followed by a scaling kernel that updates all COO values. This keeps the plan sparse and avoids any $N \times M$ tensor. We use $L_{\mathrm{iter}}$ as in dense APML [5].
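The sort-and-reduce symmetrization and one Sinkhorn sweep can be mirrored in NumPy, with `np.add.at` standing in for the atomic-add accumulation kernel. Function names are ours; the halving of entries kept by only one direction follows the dense $P^{(0)} = \frac{1}{2}(P^{\mathrm{row}} + P^{\mathrm{col}})$ with pruned entries treated as zero.

```python
import numpy as np

def symmetrize_coo(i, j, v, M):
    """Merge two COO streams: sort by 64-bit key i*M + j, average duplicates."""
    key = i.astype(np.int64) * M + j                 # unique key per (i, j)
    order = np.argsort(key, kind="stable")           # device sort stand-in
    key, v = key[order], v[order]
    uniq, start = np.unique(key, return_index=True)  # segment starts
    v_half_sum = 0.5 * np.add.reduceat(v, start)     # (v_row + v_col) / 2
    return (uniq // M).astype(i.dtype), (uniq % M).astype(i.dtype), v_half_sum

def sinkhorn_sweep(i, j, v, N, M, eps_stab=1e-8):
    """One Sinkhorn iteration on COO values: column scaling then row scaling."""
    col = np.zeros(M)
    np.add.at(col, j, v)                 # dense length-M column-sum buffer
    v = v / (col[j] + eps_stab)          # column scaling over all COO values
    row = np.zeros(N)
    np.add.at(row, i, v)                 # dense length-N row-sum buffer
    return v / (row[i] + eps_stab)       # row scaling
```

Only the two 1D sum buffers are dense; the plan itself stays in COO throughout, matching the kernel design above.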
After Sinkhorn normalization, CUDA-APML evaluates the objective on the COO support as
$$\mathcal{L}_{\mathrm{CUDA\text{-}APML}} = \sum_{t=1}^{\mathrm{nnz}} v_t\, \|x_{i_t} - y_{j_t}\|_2.$$
No dense cost matrix $C$ is formed, and the forward pass computes distances only for the stored index pairs. The backward pass differentiates through the same sparse computation graph. In particular, the distance term contributes the standard pairwise gradient on the stored pairs:
$$\frac{\partial \mathcal{L}}{\partial x_{i_t}} \mathrel{+}= v_t\, \frac{x_{i_t} - y_{j_t}}{\|x_{i_t} - y_{j_t}\|_2 + \epsilon_{\mathrm{dist}}}, \qquad \frac{\partial \mathcal{L}}{\partial y_{j_t}} \mathrel{+}= -\,v_t\, \frac{x_{i_t} - y_{j_t}}{\|x_{i_t} - y_{j_t}\|_2 + \epsilon_{\mathrm{dist}}},$$
where $\epsilon_{\mathrm{dist}} > 0$ avoids division by zero when two points coincide. Gradients of the transport weights $v_t$ with respect to $X$ and $Y$ are computed through the fused kernels for the adaptive softmax and Sinkhorn updates on the sparse support. Pruned entries are treated as zero and do not contribute to forward or backward passes, which suppresses weak competing correspondences and sharpens the dominant gradient signal. CUDA-APML follows the stability conditions of APML [5]: stability is ensured by clamping the local gap as $\max(g, \epsilon_g)$ instead of the uniform fallback, Sinkhorn normalization uses $\epsilon_{\mathrm{stab}}$, and sparsity is controlled by the pruning threshold $\tau$.
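The pairwise distance gradient can be sanity-checked against finite differences on the sparse objective. This sketch (our names; transport weights $v_t$ held constant, so it covers only the distance term, not the weight gradients through the fused kernels):

```python
import numpy as np

def sparse_loss(X, Y, i, j, v):
    """L = sum_t v_t * ||x_{i_t} - y_{j_t}||_2 over the stored COO pairs."""
    return float(np.sum(v * np.linalg.norm(X[i] - Y[j], axis=1)))

def sparse_grad_X(X, Y, i, j, v, eps_dist=1e-12):
    """Distance-term gradient w.r.t. X, accumulated per stored pair."""
    d = X[i] - Y[j]
    n = np.linalg.norm(d, axis=1, keepdims=True) + eps_dist  # guard zeros
    g = np.zeros_like(X)
    np.add.at(g, i, v[:, None] * d / n)   # scatter-add, one term per pair
    return g
```

A forward-difference probe on a single coordinate should match the analytic entry to within finite-difference error.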
Algorithm 1 summarizes the CUDA-APML pipeline for one batch and highlights where dense tensors are avoided.
Algorithm 1 CUDA-APML for one batch
Require: Predicted $X \in \mathbb{R}^{B \times N \times d}$, reference $Y \in \mathbb{R}^{B \times M \times d}$, $p_{\min}$, $\tau$, $L_{\mathrm{iter}}$
1: Row direction: scan each row, compute minima and temperature, count pairs with similarity $\ge \tau$, exclusive scan to obtain offsets, rescan to write and normalize COO, obtaining $P^{\mathrm{row}}$
2: Column direction: repeat with swapped roles to obtain $P^{\mathrm{col}}$
3: Symmetrize: concatenate COO, sort by 64-bit key $(i, j)$, reduce duplicates by averaging to obtain $P^{(0)}$
4: for $\ell = 1$ to $L_{\mathrm{iter}}$ do
5:   Column scaling in COO using dense column-sum buffer and $\epsilon_{\mathrm{stab}}$
6:   Row scaling in COO using dense row-sum buffer and $\epsilon_{\mathrm{stab}}$
7: end for

CUDA-APML avoids materializing $C$ and $P$ as dense tensors and stores the plan in COO. With 32-bit indices $(i, j)$ and 32-bit values $v$, the plan storage is $12\,\mathrm{nnz}$ bytes, plus temporary buffers for symmetrization and Sinkhorn. Symmetrization adds a 64-bit key per entry and sorting workspace. Sinkhorn uses two dense 1D accumulation buffers of lengths $N$ and $M$ for row and column sums. Overall, memory scales as $O(B(\mathrm{nnz} + N + M))$ up to constant factors from sorting and scan workspaces.
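A back-of-envelope comparison makes the storage claim concrete (the per-point support size of 8 is an illustrative assumption, not a measured value):

```python
def coo_bytes(nnz):
    """Plan storage in COO: two 32-bit indices + one 32-bit value per entry."""
    return 12 * nnz

def dense_bytes(N, M):
    """Dense float32 N x M plan: 4 bytes per entry."""
    return 4 * N * M

# For N = M = 16384 with ~8 kept entries per point, the COO plan is about
# 1.5 MiB while the dense plan is 1 GiB, before workspaces on either side.
sparse = coo_bytes(8 * 16384)
dense = dense_bytes(16384, 16384)
```

This ratio grows with point count, since nnz scales near-linearly while the dense plan scales quadratically.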
In the current implementation, the adaptive softmax stage still scans all pairs to identify minima and apply thresholding, so distance evaluation remains $O(BNMd)$. The runtime reduction comes from lower memory traffic and from restricting Sinkhorn updates and loss evaluation to $\mathrm{nnz}$ entries. Sinkhorn iterations cost $O(B\,\mathrm{nnz}\,L_{\mathrm{iter}})$, and symmetrization is dominated by sorting, with $O(\mathrm{nnz} \log \mathrm{nnz})$ in the comparison model. Unless stated otherwise, we use $\tau = 10^{-8}$, $L_{\mathrm{iter}} = 10$, and $\epsilon_{\mathrm{stab}} = 10^{-8}$, and we set $p_{\min}$ as in [5].
We evaluate CUDA-APML on point cloud completion and cross-modal generation. For completion, we use ShapeNet-34 and ShapeNet-55 [7] with the standard partial-to-complete pairs and official splits, and PCN [32] (2,048-point partial input and 16,384-point reference). For cross-modal generation, we use MM-Fi [3], [33], which pairs WiFi CSI with LiDAR-based 3D human point clouds, and we follow the CSI2PC manual_split protocol that tests generalization to unseen subjects and environments [3]. We integrate losses into FoldingNet [6], PCN [32], and PoinTr [34] for completion, and into CSI2PC [3] for MM-Fi. We compare CUDA-APML against CD [14], DCD [16], HyperCD [15], InfoCD [17], and dense APML [5].
We report CD in $\ell_1$ and $\ell_2$ variants [14], EMD×100 [18], and F1-score at threshold 0.01 [32]. Efficiency is measured by wall-clock time per epoch and peak GPU memory. Each configuration is trained three times with different seeds and we report the mean. ShapeNet results use the standard test split [7] and the default training protocols of each backbone, while MM-Fi uses manual_split [3]. To support the sparse formulation in Sec. III, we also report (i) the number of nonzero COO entries after symmetrization as a function of point count on synthetic random point sets using 500 trials per point count, and (ii) peak memory per sample for dense APML and CUDA-APML in the same synthetic setup.
All experiments use PyTorch 2.5 with CUDA 12.8 and are run on NVIDIA V100 (32 GB) and RTX 4070 GPUs. Unless
Table I reports the main benchmarks (mean of three runs). On ShapeNet-34 with FoldingNet, dense APML and CUDA-APML obtain similar results; both improve EMD over CD, HyperCD, and InfoCD and achieve the highest F1-score. On MM-Fi with CSI2PC, CUDA-APML improves EMD over DCD, with slightly better results than dense APML.
Table II shows peak GPU memory for FoldingNet on ShapeNet-55 (batch size 32). CUDA-APML reduces memory across all tested point counts. In the qualitative comparison of Figure 2, the baseline losses yield more isolated points and gaps in the human silhouette, while APML and CUDA-APML produce more coherent coverage with fewer missing regions.
We presented CUDA-APML, a sparse CUDA implementation of APML that removes the quadratic memory bottleneck of dense transport plan construction for point cloud supervision. CUDA-APML constructs row-wise and column-wise adaptive-softmax assignments in COO form, performs on-device symmetrization, and applies Sinkhorn normalization on the sparse support without materializing dense $N \times M$ tensors. Across ShapeNet and MM-Fi, CUDA-APML preserves the accuracy trends of dense APML while reducing peak GPU memory by more than 99.9% and showing near-linear scaling of the stored support and loss-side memory in synthetic tests, which makes APML practical when dense transport losses are memory limited.
A limitation is that the adaptive-softmax stage still scans all pairwise distances, so distance evaluation remains quadratic in the number of points. Future work will restrict the candidate support, for example via neighborhood pruning or approximate nearest neighbors, and will evaluate the method on larger scene-level and digital twin update settings.
AND MM-FI (GENERATION). METRICS: F1-SCORE (↑) AND EMD×100 (↓).