HP-GAN: Harnessing pretrained networks for GAN improvement with FakeTwins and discriminator consistency

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Generative Adversarial Networks (GANs) have made significant progress in enhancing the quality of image synthesis. Recent methods frequently leverage pretrained networks to calculate perceptual losses or utilize pretrained feature spaces. In this paper, we extend the capabilities of pretrained networks by incorporating innovative self-supervised learning techniques and enforcing consistency between discriminators during GAN training. Our proposed method, named HP-GAN, effectively exploits neural network priors through two primary strategies: FakeTwins and discriminator consistency. FakeTwins leverages pretrained networks as encoders to compute a self-supervised loss and applies it to the generated images to train the generator, thereby enabling the generation of more diverse and higher-quality images. Additionally, we introduce a consistency mechanism between discriminators that evaluate feature maps extracted from Convolutional Neural Network (CNN) and Vision Transformer (ViT) feature networks. Discriminator consistency promotes coherent learning among discriminators and enhances training robustness by aligning their assessments of image quality. Our extensive evaluation across seventeen datasets — including scenarios with large, small, and limited data, and covering a variety of image domains — demonstrates that HP-GAN consistently outperforms current state-of-the-art methods in terms of Fréchet Inception Distance (FID), achieving significant improvements in image diversity and quality. Code is available at: https://github.com/higun2/HP-GAN.


💡 Research Summary

HP‑GAN introduces a novel framework that leverages pretrained visual models—both convolutional neural networks (CNNs) and vision transformers (ViTs)—to improve generative adversarial network (GAN) training. The authors identify two main shortcomings in prior work: (1) pretrained networks are typically used only for perceptual losses or feature projection, which offers limited benefit when data are scarce, and (2) GAN training remains unstable due to mode collapse and discriminator over‑fitting, especially in low‑data regimes. To address these issues, HP‑GAN proposes two complementary mechanisms: FakeTwins and Discriminator Consistency.

FakeTwins treats a frozen pretrained CNN or ViT as an encoder and applies the Barlow Twins self‑supervised loss directly to generated images. Barlow Twins minimizes redundancy between two embeddings obtained from different augmentations of the same image by forcing their cross‑correlation matrix toward the identity. Because the loss is lower when the batch contains diverse samples, the generator is encouraged to produce a wide variety of high‑information images. This approach sidesteps the need for negative samples, large batch sizes, or asymmetric networks that are required by contrastive SSL methods such as SimCLR or MoCo, making it memory‑efficient and well‑suited for GAN training.
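To make the mechanism concrete, here is a minimal NumPy sketch of the Barlow Twins redundancy-reduction loss described above, applied to embeddings of two augmented views of the same generated images. The function name, the `lam` weight, and the standardization epsilon are illustrative choices, not taken from the paper's released code.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins loss on two batches of embeddings.

    z1, z2: (N, D) arrays -- embeddings of two augmented views of the same
    generated images, produced by a frozen pretrained encoder (CNN or ViT).
    The loss drives the (D, D) cross-correlation matrix toward the identity:
    diagonal terms toward 1 (views agree), off-diagonal toward 0
    (embedding dimensions carry non-redundant information).
    """
    n, _ = z1.shape
    # Standardize each embedding dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = z1.T @ z2 / n  # (D, D) cross-correlation matrix
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()            # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy term
    return on_diag + lam * off_diag
```

Because the loss operates on batch statistics rather than pairwise comparisons, it needs no negative samples or momentum encoders, which is what makes it cheap enough to bolt onto generator training.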

Discriminator Consistency introduces two discriminators: one operating on CNN feature maps and another on ViT token sequences. Owing to architectural differences, their raw scores would normally diverge, potentially destabilizing training. HP‑GAN adds a consistency loss that penalizes the discrepancy between the two discriminator outputs for the same image (using L2 or cosine similarity). By forcing the discriminators to agree on image quality, the generator receives coherent feedback, reducing mode collapse and improving training stability. Moreover, the complementary perspectives of CNNs (local texture) and ViTs (global context) are fused, yielding a richer assessment of realism.
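A hedged sketch of how such a consistency-regularized discriminator objective might look, assuming the non-saturating (softplus) adversarial loss and a squared-L2 agreement penalty; the function names, the `gamma` weight, and the exact combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def discriminator_objective(d_cnn_real, d_cnn_fake,
                            d_vit_real, d_vit_fake, gamma=1.0):
    """Combined objective for two heterogeneous discriminator heads.

    d_cnn_* : logits from the discriminator on CNN feature maps
    d_vit_* : logits from the discriminator on ViT token features
    gamma   : weight of the consistency penalty (illustrative)
    """
    # Non-saturating adversarial loss for each head:
    # push real logits up, fake logits down.
    adv = (softplus(-d_cnn_real) + softplus(d_cnn_fake)
           + softplus(-d_vit_real) + softplus(d_vit_fake)).mean()
    # Consistency: both heads should score the same images similarly.
    consistency = (((d_cnn_real - d_vit_real) ** 2).mean()
                   + ((d_cnn_fake - d_vit_fake) ** 2).mean())
    return adv + gamma * consistency
```

When the two heads already agree, the penalty vanishes and the objective reduces to the plain two-discriminator adversarial loss; disagreement is what gets taxed.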

The authors evaluate HP‑GAN on 17 datasets spanning large‑scale (FFHQ, LSUN‑Bedroom), medium‑scale (CelebA‑HQ), and very small (Pokémon, medical imaging) domains. Metrics include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), precision‑recall, and Perceptual Path Length (PPL). An ablation study shows a progressive improvement: FastGAN → Projected GAN (CNN) → addition of ViT → Discriminator Consistency → FakeTwins. On FFHQ, FID drops from 12.69 (FastGAN) to 1.69 (HP‑GAN) and PPL improves from 150.3 to 132.5, indicating both higher fidelity and smoother latent interpolation. On the tiny Pokémon dataset (833 images), HP‑GAN reduces FID by roughly 20–30% compared with the strongest baseline, demonstrating robustness in data‑limited settings.

Key contributions are: (1) the FakeTwins module, which brings information‑maximizing self‑supervision to the generator via frozen pretrained encoders; (2) the Discriminator Consistency loss, a novel regularizer that aligns the outputs of heterogeneous discriminators, enhancing stability; (3) extensive empirical validation across diverse domains, consistently outperforming state‑of‑the‑art methods; and (4) open‑source release of code and pretrained models, facilitating reproducibility and adoption. HP‑GAN thus offers a practical recipe for researchers and practitioners seeking high‑quality image synthesis, especially when computational resources or training data are limited.

