Pixel Seal: Adversarial-only training for invisible image and video watermarking
📝 Abstract
Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.
📄 Content
The rapid advancement of generative models such as DALL·E (Ramesh et al., 2022), Stable Diffusion (Rombach et al., 2022), Sora (Brooks et al., 2024), Veo (Google, 2025), and MovieGen (Polyak et al., 2024) has enabled the creation of high-fidelity synthetic content at scale. In response, invisible watermarking, the process of imperceptibly embedding a message into digital content, has emerged as critical infrastructure for ensuring authenticity. The technique serves not only to distinguish synthetic media from real images, but also to establish broader provenance, such as verifying original uploaders and identifying source tools (Castro, 2025). These applications create a pressing need for watermarks that are both robust and imperceptible.
Multi-bit image watermarking, as established by the seminal work of Zhu et al. (2018), uses an embedder neural network to embed a binary message into an image as an imperceptible perturbation, and an extractor neural network that retrieves the embedded binary message from the perturbed image. Typically, the embedder and extractor are trained end-to-end by minimizing a compound loss with two opposing objectives: a message reconstruction loss, which ensures the message can be recovered from a watermarked image, and perceptual losses (such as MSE and LPIPS) or an adversarial discriminator loss, which keep the embedded watermark imperceptible to humans. To make the watermark robust against common user manipulations, such as Instagram filters or cropping, the watermarked image is augmented during training before the extractor retrieves the hidden message. Backpropagating through these augmentations teaches the model to embed messages that survive image edits.

Figure: The Pixel Seal family of models sets new state-of-the-art results for multi-bit image watermarking, in both watermark imperceptibility and robustness. The figure shows average values across 1000 test images generated by Meta AI. Robustness is measured under a combined attack of brightness change (0.5), crop (50% area), and JPEG compression (quality 40). The imperceptibility of Video Seal 0.0 is heavily skewed by its small but highly visible artifacts. Each Pixel Seal model is trained with a different watermark boosting factor β (see Section 4.1 for more details).
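The end-to-end recipe above can be sketched with a toy spread-spectrum embedder and a correlation extractor standing in for the two neural networks. Everything here is illustrative, not the paper's architecture: the carriers, the `strength` knob, and the noise augmentation are hypothetical stand-ins for the learned perturbation and the train-time augmentation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(image, message, carriers, strength=0.05):
    """Toy embedder: each bit adds its carrier pattern with sign +/-1.
    A neural embedder would predict this perturbation instead."""
    signs = np.where(message == 1, 1.0, -1.0)[:, None, None]   # (k, 1, 1)
    watermark = strength * (signs * carriers).sum(axis=0)
    return np.clip(image + watermark, 0.0, 1.0)

def extract(image, carriers):
    """Toy extractor: correlate the mean-removed image with each carrier
    and threshold the score to recover each bit."""
    scores = (carriers * (image - 0.5)).sum(axis=(1, 2))
    return (scores > 0).astype(int)

def augment(image, noise_std=0.02):
    """Train-time robustness idea: distort the watermarked image before
    the message is extracted, and backprop through the distortion."""
    return np.clip(image + noise_std * rng.standard_normal(image.shape), 0.0, 1.0)

k, h, w = 16, 64, 64
carriers = rng.standard_normal((k, h, w))      # shared by embedder and extractor
message = rng.integers(0, 2, size=k)
cover = rng.uniform(0.0, 1.0, (h, w))

watermarked = embed(cover, message, carriers)
recovered = extract(augment(watermarked), carriers)
```

In the learned setting, the hand-crafted carriers are replaced by the embedder network, and the extractor is trained jointly so that `recovered` matches `message` even after much harsher augmentations.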
Despite the conceptual simplicity of this framework, training a model that is simultaneously robust, fast, and truly imperceptible remains notoriously difficult: increasing robustness tends to make watermarks more perceptible, and increasing imperceptibility tends to make them less robust. We identify three fundamental bottlenecks in existing methods. First, standard training pipelines rely on a complex mix of perceptual losses to achieve watermark imperceptibility. These losses are typically pixel-wise metrics, such as mean squared error (MSE), or deep perceptual metrics, such as LPIPS, but both remain imperfect proxies for human perception. For example, MSE does not discriminate between smooth and highly textured areas of an image, whereas humans are far more likely to notice artifacts in smooth regions than in textured ones. Second, jointly optimizing for robustness and imperceptibility creates a contradictory loss landscape: the global optimum of the perceptual loss is a zero watermark (from which no message can be recovered), whereas the global optimum for robustness yields a highly visible watermark. Finding the optimal tradeoff requires precise tuning of the learning dynamics; without it, training frequently collapses, with the model either failing to hide the information or hiding it so well that the extractor cannot retrieve it. Third, applying existing models to high-resolution media at inference time requires watermark upscaling, because the models are trained on low-resolution crops; this produces distracting artifacts and inconsistencies that are easily visible to the human eye. In this work, we systematically address these challenges to advance the state of the art in image watermarking.
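The MSE failure mode is easy to demonstrate: adding the same perturbation to a flat patch and to a textured patch yields essentially the same MSE, even though the perturbation is far more visible on the flat one. A hypothetical texture-masked error (illustrative only, loosely in the spirit of JND masking; not the paper's metric) discounts residuals where local gradient activity is high and so separates the two cases:

```python
import numpy as np

rng = np.random.default_rng(1)

flat = np.full((32, 32), 0.5)                 # smooth gray patch
textured = rng.uniform(0.0, 1.0, (32, 32))    # busy, high-variance patch
perturb = 0.02 * rng.standard_normal((32, 32))  # identical perturbation for both

def mse(clean, noisy):
    return float(np.mean((noisy - clean) ** 2))

def masked_error(clean, noisy):
    """Hypothetical texture-masking metric: residuals count less where
    the clean image has strong local gradients (texture)."""
    gy, gx = np.gradient(clean)
    activity = np.hypot(gx, gy)
    return float(np.mean((noisy - clean) ** 2 / (1.0 + 10.0 * activity)))

mse_flat = mse(flat, flat + perturb)
mse_tex = mse(textured, textured + perturb)
err_flat = masked_error(flat, flat + perturb)
err_tex = masked_error(textured, textured + perturb)
```

MSE assigns the two cases the same penalty, while the masked metric penalizes the flat patch more, matching where humans actually see the artifact.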
Our main contributions are as follows:
• First, we introduce an adversarial-only training paradigm that removes standard perceptual losses such as MSE or LPIPS. By relying solely on a discriminator, we avoid the failure modes and manual tuning associated with traditional loss functions.
• Second, we propose a three-stage training schedule that decouples the competing objectives of robustness and imperceptibility. By first achieving robust (but visible) watermarks and gradually enforcing invisibility, we ensure stable convergence across random initializations.
• Third, we propose to simulate the inference pipeline during training and to apply Just-Noticeable-Difference (JND) attenuation at the original input resolution. This eliminates the artifacts commonly found in upscaled watermarks.
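The high-resolution idea in the third contribution can be sketched as follows, under two loud assumptions: a nearest-neighbor upscale of a low-resolution watermark, and a gradient-based JND proxy. Both are illustrative stand-ins; the paper's actual JND model and inference pipeline are not specified here.

```python
import numpy as np

def jnd_map(image, base=0.01, texture_gain=0.1):
    """Hypothetical JND proxy: tolerate larger perturbations where local
    gradient activity (texture) is high, smaller ones in smooth regions."""
    gy, gx = np.gradient(image)
    return base + texture_gain * np.hypot(gx, gy)

def watermark_highres(image, wm_lowres):
    """Simulated inference: upscale the low-res watermark to the input
    resolution, then attenuate it with a JND map computed at the ORIGINAL
    resolution, so flat regions receive less watermark energy."""
    h, w = image.shape
    fy, fx = h // wm_lowres.shape[0], w // wm_lowres.shape[1]
    wm = np.kron(np.sign(wm_lowres), np.ones((fy, fx)))  # nearest-neighbor upscale
    return np.clip(image + jnd_map(image) * wm, 0.0, 1.0)

rng = np.random.default_rng(2)
img = np.empty((64, 64))
img[:, :32] = 0.5                               # smooth left half
img[:, 32:] = rng.uniform(0.0, 1.0, (64, 32))   # textured right half

out = watermark_highres(img, rng.standard_normal((16, 16)))
delta = np.abs(out - img)
smooth_energy = float(delta[:, 2:30].mean())      # interior of smooth half
textured_energy = float(delta[:, 34:62].mean())   # interior of textured half
```

Because the attenuation map is computed on the full-resolution image rather than on the low-resolution crop, the upscaled watermark is suppressed exactly where block-upscaling artifacts would be most visible.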