Sliced Wasserstein Discrepancy in Disentangling Representation and Adaptation Networks for Unsupervised Domain Adaptation
This paper introduces DRANet-SWD, a complete pipeline for disentangling content and style representations of images for unsupervised domain adaptation (UDA). The approach builds on DRANet, replacing the traditional Gram-matrix style loss with the sliced Wasserstein discrepancy (SWD). The potential advantages of SWD over the Gram-matrix loss for capturing style variations in domain adaptation are investigated. Experiments on digit classification datasets and a driving-scene segmentation benchmark validate the method, demonstrating that DRANet-SWD improves performance. Results indicate that SWD provides a more robust statistical comparison of feature distributions, leading to better style adaptation. These findings highlight the effectiveness of SWD in refining feature alignment and improving domain adaptation across these benchmarks. Our code can be found here.
💡 Research Summary
The paper presents DRANet‑SWD, an extension of the Disentangling Representation and Adaptation Network (DRANet) that replaces the traditional Gram‑matrix style loss with a Sliced Wasserstein Discrepancy (SWD) loss. DRANet originally separates image content and style in a single encoder‑generator pipeline and uses a VGG‑19 perceptual network to compute content and style losses; the style loss is based on Gram matrices, which capture only second‑order statistics of feature activations. The authors argue that this limitation hampers effective style alignment, especially when the source and target domains exhibit complex texture, resolution, or illumination differences.
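Since the contrast with the Gram loss is central to the paper's argument, a minimal sketch of a Gram-matrix style comparison may help illustrate what "second-order statistics only" means; the function names and feature shapes here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a feature map of shape (C, H, W).

    Returns the C x C matrix of channel-wise inner products,
    normalised by the number of spatial positions. It encodes only
    pairwise channel correlations (second-order statistics), which
    is the limitation the paper's SWD loss addresses.
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (h * w)

def gram_style_loss(feat_a, feat_b):
    """Mean squared difference between the two Gram matrices."""
    return float(np.mean((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2))
```

Because the Gram matrix averages over spatial positions, any two feature maps with matching channel correlations yield zero loss even if their full distributions differ in higher moments.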
SWD, rooted in optimal transport theory, measures the distance between two high‑dimensional feature distributions by projecting them onto random one‑dimensional directions, sorting the resulting scalars, and computing an L2 distance between the sorted lists. This procedure captures the full set of moments of the distributions, offering a richer statistical comparison than the Gram loss. In DRANet‑SWD, the SWD loss is computed for each selected layer of the perceptual network (all layers except the final one) and combined with the content loss (MSE on the final VGG layer) to form the overall perceptual loss.
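The project-sort-compare procedure described above can be sketched as follows; this is a NumPy Monte-Carlo estimate under assumed shapes (N feature vectors of dimension D), and the parameter names (`num_projections`, `seed`) are illustrative, not taken from the paper's code:

```python
import numpy as np

def sliced_wasserstein(feat_a, feat_b, num_projections=64, seed=0):
    """Sliced Wasserstein discrepancy between two feature sets.

    feat_a, feat_b: arrays of shape (N, D).
    Projects both sets onto random 1-D directions, sorts the
    projected scalars, and averages the squared difference between
    the sorted lists -- a Monte-Carlo estimate of the sliced
    2-Wasserstein distance between the two empirical distributions.
    """
    rng = np.random.default_rng(seed)
    d = feat_a.shape[1]
    # Random unit directions on the sphere in R^D.
    dirs = rng.normal(size=(d, num_projections))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj_a = np.sort(feat_a @ dirs, axis=0)  # (N, num_projections)
    proj_b = np.sort(feat_b @ dirs, axis=0)
    return float(np.mean((proj_a - proj_b) ** 2))
```

Because sorting matches empirical quantiles along each direction, the comparison is sensitive to shifts in any moment of the distributions, unlike the Gram loss.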
The overall architecture retains DRANet’s components: an encoder E, a separator S that splits encoded features into content and style codes, a generator G that recombines content codes with swapped style codes to produce translated images, and two domain‑specific discriminators D_X and D_Y for adversarial alignment. A fixed VGG‑19 network P provides the feature maps used in the perceptual losses. The total training objective L_d for each domain d ∈ {X, Y} is a weighted sum of a reconstruction loss (L1), a consistency loss (L1 between the original and translated content/style codes), an adversarial (hinge) loss, and the perceptual loss (content + λ·style). The discriminators maximize this objective while the encoder, separator, and generator minimize it, forming a min‑max game.
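The weighted combination of loss terms can be sketched as below; the weight names (`w_recon`, `w_cons`, `w_adv`, `lam_style`) and their default values are illustrative assumptions, not the paper's reported hyperparameters:

```python
# A minimal sketch of how the per-domain objective L_d combines its
# terms, assuming each individual loss has already been computed.
def domain_loss(l_recon, l_consistency, l_adv, l_content, l_style,
                w_recon=1.0, w_cons=1.0, w_adv=1.0, lam_style=10.0):
    """Weighted sum of reconstruction, consistency, adversarial,
    and perceptual (content + lambda * style) losses."""
    perceptual = l_content + lam_style * l_style
    return (w_recon * l_recon
            + w_cons * l_consistency
            + w_adv * l_adv
            + perceptual)
```

In the min-max game, the discriminators ascend on the adversarial term while E, S, and G descend on the full sum; swapping the style-loss term from Gram to SWD leaves this outer structure unchanged.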
Experiments cover two benchmark families. For digit classification, four datasets (MNIST, MNIST‑M, USPS, SVHN) are used in all pairwise adaptation directions. DRANet‑SWD improves accuracy in most cases, notably a 0.3‑percentage‑point gain on MNIST→USPS (94.98 % → 95.28 %) and a dramatic increase on the challenging SVHN→MNIST direction (19.8 % → 47.3 %). The authors attribute these gains to SWD’s ability to align texture and resolution differences more faithfully than Gram matrices. However, the SVHN↔MNIST pair also reveals a limitation: SVHN images often contain multiple digits, which confuses the content–style separator and leaves performance in the opposite direction degraded (MNIST→SVHN remains lower than SVHN→MNIST).
For semantic segmentation, the synthetic GTA5 dataset serves as the source and the real‑world Cityscapes dataset as the target. Using mean Intersection‑over‑Union (mIoU) as the metric, DRANet‑SWD modestly outperforms the original DRANet on several classes (e.g., road: 33.63 % → 33.77 %; sidewalk: 76.8 % → 81.1 %). Visual results show sharper boundaries and better recovery of small objects. The authors also evaluate a variant with a pretrained DRN‑26 backbone; DRANet‑SWD with pretraining maintains comparable performance, confirming that the SWD loss integrates well with stronger feature extractors.
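For readers unfamiliar with the segmentation metric, here is a minimal sketch of per-class IoU averaged into mIoU; the function signature is an illustrative assumption, not the paper's evaluation code:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union for integer label maps.

    pred, target: integer label arrays of the same shape.
    Classes absent from both prediction and ground truth are
    skipped so they do not distort the average.
    """
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```

Per-class figures like those quoted above (road, sidewalk) are the individual `inter / union` terms before averaging.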
The paper discusses computational overhead: SWD requires multiple random projections and sorting operations, increasing GPU memory and runtime compared with the simple Gram computation. To mitigate this, the authors limit the number of projection directions per batch and carefully select layers for SWD computation. They also note that domains where style and content are heavily entangled (e.g., SVHN) challenge the separator, suggesting future work on more robust disentanglement mechanisms or adaptive projection strategies.
In summary, DRANet‑SWD demonstrates that replacing Gram‑matrix style loss with a sliced Wasserstein discrepancy yields more accurate style alignment, leading to measurable improvements in both classification and segmentation UDA tasks. The work validates SWD as a viable, statistically richer alternative for style‑based domain adaptation, while also highlighting areas—computational cost and highly entangled domains—where further research is needed.