Stable Signature meets BZH

9 min readDec 13, 2023
beautiful landscape scenery glass bottle with a galaxy inside cute fennec fox snow HDR sunset
What SDXL-turbo thinks is a “beautiful landscape scenery glass bottle with a galaxy inside cute fennec fox snow HDR sunset”, generated in less than a second

Last week, we released a public demo of AI-generated content watermarking and communicated on it. We wanted to give a bit more details on what we did from a scientific point of view and share some benchmarks.

Stable Signature

Let’s start first with the method we followed, namely Stable Signature by Meta and INRIA, described in depth here. The general idea of this method is to fine-tune the VAE decoder of Stable Diffusion to make it directly produce the specific watermark signal expected by a differentiable detector for a fixed key. The principle is very similar to a targeted adversarial attack, except we want to boost the response of the watermark detector for a fixed key rather than boosting a classifier’s response on a target class. Also, instead of directly modifying the pixels of the image as in SSL watermarking, the gradient information is back-propagated one stage further into the weights of the VAE decoder (Fig. 2 (b)).

Let’s stop there for a second and think about the implications. First of all, the watermark is merged in the weights of the VAE decoder, making it hard to remove unless the original weights were made public (which unfortunately is the case for Stable Diffusion). It also means the watermark comes at no additional computational cost compared to a non-watermarked generation.

However, the VAE decoder structure adds some constraints on the watermark. One is that it tiles the output image, meaning it is unlikely low-frequency features of the watermark larger than the patch size are reproduced. Also, adjusting the output size of the generation will change the number of patches the detector can rely on, but not their scale, which is fixed. This means overall the watermark is naturally robust to crop but not to scale: the original paper shows that even only 10% of the original pixels are enough to find the watermark with strong confidence, but “The resize and JPEG 50 transformations seems to be the most challenging ones, and sometimes get bellow 0.9 [bit accuracy]” as it relies only on the training of the detector.

Another point is that the perceptibility of the watermark depends on how much the initial model was fine-tuned, and may be hard to control. Instead of a fixed PSNR/SSIM budget, the strength of the watermark is controlled by the learning rate and lambda of the loss function used during fine-tuning.

Finally, even if the detector is kept the same, the fine-tuning procedure needs to be done from stratch for each secret key to detect.


Presenting BZH (Blind Zero-bit Hidding) would need a full blog post of its own, but let’s just say its derived from HiDDeN too, like the detector used in Stable Signature. However, instead of decoding a binary message, we rely on zero-bit watermarking and extract a high-dimensional vector that we correlate with the vector generated from the key. For a random key, this expected vector is uniformly drawn on the surface of the unit hypersphere in dimension d. Detecting the watermark on a query content then amounts to computing the probability that by chance, when running the detector on a non-watermarked image, we obtain a higher correlation (C) than the one (c) observed on the query content. This is called the p-value of rejecting the null hypothesis (H0) and leads to the probability of claiming wrongly that a content is watermarked while it’s not (false positive), for a given threshold. It turns out it can be computed analytically from the area of a hyperspherical cap by evaluating the regularized incomplete beta function I:

However, since the key is fixed in this use case, we need to be careful on how the output of the detector behaves when we run it on a dataset of random unwatermarked images. Indeed, as noted with the binary output of Stable Signature, in this case the output of the detector may be correlated and uncentered, preventing computation of the p-value with the formula above as it is not distributed uniformly on the hypersphere. However by learning a whitening transform of the output on a dataset of images, one can restore this assumption to a high level of confidence. Here we used Flickr100k to train this linear transform and perform ZCA whitening. Under H0 the p-value should be uniformly distributed. We tested this hypothesis on 5000 AI-generated images by visualizing the histogram of the p-value, which should be flat, and performing a Kolmogorov–Smirnov test to check if we could reject this null hypothesis:

Histogram before whitening, the distribution is not flat
before whitening, KS test p-value = 4e-34 -> definitely not uniform!
distribution after whitening, the histogram is flat
after whitening, KS test p-value = 0.2586 .. much better!

Other improvements compared to HiDDeN include retraining for crop/resize/JPEG but also recapture attacks, better aggregation procedure, strict control on watermark MSE, support for masked (think alpha) input in the detector, etc…


We changed a few things in the Stable Signature procedure to adapt it to our needs. First of all, since our detector is zero-bit, the loss function to optimize was changed from binary cross-entropy between the expected message and predicted message to negative cosine similarity between the expected vector v and the extracted vector v’, which are both normalized. So, using the same notations as the paper,

Then, rather than using a perceptual loss (LPIPS) to constrain the decoded patches to be similar to the non-watermarked patches, we reverted to the initial training loss of the KL-autoencoder (Stable Diffusion paper, appendix G, equation (25)) which is only interested in reconstructing the input image in a plausible way:

We believe this is a better objective as the non-watermarked model is not supposed to be released, giving more freedom to how we can generate patches. Also, reintroducing the discriminator term to ensure the distribution of decoded patches is hard to distinguish from the distribution of the original patches feels important, especially at high distorsion rates. Overall we want to use the degrees of freedom of the decoder to reconstruct plausible patches while including the watermark, rather than fitting to a preexisting model that is not supposed to be available anyway.

Unfortunately, the discriminator learned during the KL-autoencoder training of Stable Diffusion is part of the loss and was not released (to our knowledge), so it has to be relearned from scratch during the fine-tuning procedure.

We finetuned multiple models for various compromises between perceptibility and robustness by simply varying the lambda parameter of the final loss:

Finally, we preprocess the image to detect with a fixed aspect-ratio-preserving rescale to a rectangle of 256-pixel side. This corresponds to the settings on which BZH was trained, and allows the watermark to be naturally robust to downscale. However it means the VAE decoder is trained for a specific generation resolution and has to be retrained if output size changes. We think this is ok as most generative models are working at fixed size (512px, 768px, 1024px) and generating at a different resolution may produce artifacts.


We compared this watermarking solution to a few others in terms of robustness and perceptibility. We generated ~5000 images with SDXL-turbo using COCO 2017 validation captions as prompts. As a baseline we used DCTDWT from the invisible-watermark package, since it is the default watermark used in the StableDiffusion original code and pipeline of HuggingFace’s diffusers library. We compared also to watermarking after generation with our internal BZH watermarker (the one corresponding to the detector we finetune for), and IMATAG’s production watermark (lamark).

The solutions are evaluated by computing the decoder p-values and plotting the corresponding ROC curves. The false-positive rate axis uses logarithmic scale to better assess the performance at very low rates, which is generally the regime we are interested in. P-value thresholds of 1e-12 (almost no errors per test) or 1e-3 (generally used in the literature) are represented with light gray vertical lines. StableSignature variants are represented with solid lines while post-watermarking solutions are presented with dashed lines. Here’s what we get with no attacks:

Performance with no attacks

Among all the watermarks, DCTDWT (operating at average PSNR 42.6dB) and the weak model are the worst. They are still better than “none” which corresponds to random detection. The p-value for DCTDWT is computed by counting the number of matching bits (m) in its 48-bit message and assuming the bits are random and equiprobable under H0. Then the p-value is given by

and there is always a chance among 2⁴⁸ that the code is correct. Therefore the performance at false positive rate below 1/2⁴⁸ is undefined, which is why the dashed gray line stops at this value. However note that although the bits of the DCTDWT key of Stable Diffusion were drawn randomly and equiprobably, the output of DCTDWT was not whitenened, contrary to our models and the original StableSignature paper. Therefore the assumption above does not actually hold in practice and the p-values for this system are indicative only. Using DCTDWT to claim whether a content is watermarked or not should be done with extreme care.

BZH1, BZH2 and BZH3 correspond to three different levels of watermark power, with an average PSNR of 46.8dB (almost invisible), 43.8dB, and 41.9dB (slightly visible) respectively. They all compare well to their StableSignature counterpart, with the “extreme” model being very visible but on par with BZH1 in terms of performance. Note that without attacks, IMATAG’s production watermark (lamark) operating at average PSNR 45.0dB is detected perfectly, so we don’t show it in this graph. This would suggest post-watermarking wins.. but what if we start altering the image before detection?

Performance under soft attack

The graph above shows performance after the “combined” attack used in the StableSignature paper: brighten of 1.5, central crop 50% followed by JPEG compression at 80% quality. First of all, DCTDWT is not robust to this attack, showing performance that is worse than a random guess. The StableSignature-based approaches stay competitive with respect to post-watermarking, with BZH1 now on par with the weak model. This attack is still very easy for lamark which beats all other methods by a large margin. If we attack even stronger, with 1.5 brightening, 2x downscale, 50% crop and 50% JPEG compression we get this:

Performance under hard attack

Now most watermarks are struggling a lot to resist this attack, with BZH2 and lamark being on par in terms of performance, and only BZH3 maintaining a decent level of performance, at the cost of being quite visible. Here’s what the images look like:

No attack, no watermark
No attack, medium watermark
Soft “Combined” attack, medium watermark
Hard “combined” attack, medium watermark

To conclude, DCTDWT (invisible-watermark) is not good enough to resist even unintentional attacks. Methods based on StableSignature are competitive if one wants to benefit from its easy integration in diffusion model generation at zero additional computation cost, and the difficulty of removing it (one can’t comment a line of code to do so…). Post watermarking with the corresponding watermarker allows better control over distorsion and is less perceptible. Finally, IMATAG’s production watermark is still the best if one wants to maximize the perceptibility/robustness compromise, and maybe some day we’ll explain a bit why :-)