Following up on our last post on combining Stable Signature with BZH, let’s discuss the security aspects of the solution. We don’t want another untamed artificially intelligent beast to wreak havoc on our digital ecosystem, do we?
For ease of use and open access to watermarking technology, one would ideally want a self-contained solution where both the watermarker and the detector are public. However, unlike public-key cryptography, current state-of-the-art watermarking systems are all symmetric: the detector needs access to the same secret key that was used by the watermarker. Asymmetric and zero-knowledge watermarking systems do exist, but they lag behind their symmetric counterparts in performance and robustness to attacks.
Releasing the watermarker carries the security risk of allowing anyone to watermark content. In the context of generative AI, this means anyone can take a real image, add a watermark to it, and pretend it is generated. In the particular case of Stable Signature, this is done by encoding and decoding the image with the watermarked VAE. One can argue that the image has then been processed and is no longer authentic; in any case, detecting the watermark only means that the image has gone through this specific watermarking VAE decoder, to which anyone has access. Moreover, by watermarking many images with the same watermarker, one can learn a detector from them.
Releasing the full detector enables anyone to extract any watermark from any content. Even if the watermarker is kept private, this lets an attacker simply read the expected signal and retrain a new watermarker that reproduces it. Even giving access to the binary decision only still allows one to train a proxy detector and perform adversarial attacks. Incidentally, these were called oracle attacks by the watermarking community long before the adversarial term became popular. Although some techniques exist to mitigate this issue, access to the detector’s decisions (e.g. via an API) should always be controlled and limited.
Another aspect we have not discussed yet is the secret key. The mere fact that it is fixed makes the solution insecure. A secure solution would require multiple keys, rotating them often enough, and running the detection on all the keys used beforehand. Ideally the key would change after each use, like a one-time password, but that makes detection intractable and less robust. The keyed watermarker and detector should also be kept secret.
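To make the multi-key scenario concrete, here is a minimal sketch of what running detection over all keys used so far could look like. Everything here is illustrative: `detect` stands for a hypothetical per-key detector returning a p-value, and the Bonferroni correction is our own addition to account for the fact that testing K keys multiplies the chance of a false alarm.

```python
def detect_any_key(image, keys, detect, alpha=1e-6):
    """Hypothetical multi-key detection: test every key used so far.

    `detect(image, key)` is an assumed per-key detector returning a p-value.
    Testing K keys inflates the false-positive rate, so we apply a
    Bonferroni correction before comparing to the target rate `alpha`.
    """
    p_values = [detect(image, key) for key in keys]
    # Bonferroni correction: multiply the best p-value by the number of tests.
    corrected = min(min(p_values) * len(p_values), 1.0)
    return corrected < alpha, corrected
```

The downside is visible in the correction itself: the more keys are in circulation, the weaker each individual detection becomes, which is one reason per-use keys are impractical.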
Finally, in the particular case of Stable Signature, Stable Diffusion’s original VAE was released publicly, making it easy to attack the watermark strongly (as was noted in the paper), or to extract the watermark and transplant it onto another image.
So, where do we go from there?
Well, maybe the most important message is that our demo is just that: a demo. We do not pretend it is secure, and we know it is not robust to advanced attacks (such as diffusion purification with the original VAE) or even to some simple ones such as a flip (although one could simply run the detector twice in that case). For a secure solution, stronger robustness, advanced features such as payload support, etc., one should still contact IMATAG.
However, the good news is that since security is broken anyway for the demo watermarking system with its fixed key, maybe we can release more than just the watermarker. Indeed, as long as we detect the demo watermark and this watermark only, we do not weaken the security of the system for any other secret key. To do so, we trained a ResNet-18 proxy classifier on watermarked images, just as an adversarial attacker would do against a black-box classification system. But we actually have white-box access to our full detector, so we could use knowledge distillation instead to obtain a better detector. Since we do not want to leak too much information about the teacher, we do not want to mimic it when the image is not watermarked. We therefore ended up training the ResNet-18 student with a mix of a classification loss (on non-watermarked images) and a knowledge-distillation loss (on watermarked images).
This resulted in a binary classifier able to detect the demo watermark. Unlike with the full detector, we only have a binary decision, not a p-value estimate. To recalibrate the detector, we computed and stored the logits on 1M images from the Flickr1M dataset. An approximate p-value is then obtained by simply counting the proportion of these stored logits that are below the logit observed on the image under test. Although a crude estimate, it provides a fast way to detect most watermarked images with moderate confidence and robustness.
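A minimal sketch of this empirical calibration is shown below. Here `reference_logits` stands in for the logits precomputed on the 1M non-watermarked Flickr1M images; the sign convention (a larger logit meaning “more watermark-like”, so extremeness is counted from above) and the +1 smoothing that avoids reporting a p-value of exactly zero are our assumptions, not details from our pipeline.

```python
import numpy as np

def empirical_p_value(logit, reference_logits):
    """Approximate p-value from stored logits of non-watermarked images.

    Counts the fraction of reference logits at least as extreme as the
    observed one (assumed convention: larger logit = more watermark-like;
    flip the comparison if the classifier's convention is reversed).
    The +1 in numerator and denominator keeps the estimate strictly above 0.
    """
    n = len(reference_logits)
    return (np.sum(reference_logits >= logit) + 1) / (n + 1)
```

Note that the resolution is limited by the reference set size: with 1M stored logits, no p-value below about 1e-6 can be reported, which is exactly why this is a moderate-confidence detector.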
We are quite happy that the demo watermarking system is now self-contained for basic use cases and, while just as insecure, is far more robust than the previous solution to unintentional attacks.
We released the Detector, a unique beast from the deep, learning forbidden knowledge from distilled watermarks; all others shall remain secret!