How SynthID Was Broken: Three Attacks That Defeated Google's AI Watermark
Google built SynthID to be the invisible proof that an image, audio clip, or text passage came from an AI. Shipped inside Gemini, Imagen, and Veo, it was supposed to solve the deepfake problem at the source. Tag every piece of AI content at the moment of creation, and the provenance question answers itself.
Within months, independent researchers proved it can be defeated three different ways. One extracts the watermark’s frequencies with signal processing. Another overwrites it using a second round of AI generation. The third just swaps a few words.
Key Takeaways
- SynthID uses a fixed phase template per model with 99.5%+ cross-image consistency, which acts as an extractable key
- Spectral analysis can surgically remove the watermark with a 91.4% phase coherence drop while keeping 43+ dB PSNR (visually imperceptible)
- A low-denoise diffusion pass with edge guidance overwrites the watermark at the pixel level without altering composition
- Text watermarks break under simple synonym substitution because they depend on exact word sequences
- Invisible watermarks are useful speed bumps, not security boundaries. Cryptographic approaches like C2PA fill the gap.
How Does SynthID Work?
SynthID operates as a dual-model system: an embedding model injects the watermark, and a detection model reads it back. For images, the embedding model introduces a sub-visible pixel color shift distributed across the entire image at specific frequency bins. Your eyes cannot see it. A Fourier transform can.
I made a video breaking down the mechanics of how SynthID works, particularly for text watermarking:
For images, the watermark lives in the frequency domain. The embedding model picks specific carrier frequencies and shifts pixel values at those positions. The detection model runs a Fourier transform on a suspect image and checks whether those frequencies carry the expected pattern.
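To make the frequency-domain idea concrete, here is a minimal NumPy sketch that embeds a single carrier and measures it with a Fourier transform. The bin position (9, 9) is borrowed from the resolution discussion later in this article; the single-bin design and amplitude are illustrative, not SynthID's actual embedding:

```python
import numpy as np

# Hypothetical carrier position, for illustration only.
CARRIER_BINS = [(9, 9)]

def carrier_energy(channel: np.ndarray) -> float:
    """Sum the spectral magnitude at the assumed carrier bins."""
    spectrum = np.fft.fft2(channel.astype(np.float64))
    return float(sum(abs(spectrum[u, v]) for u, v in CARRIER_BINS))

def embed_carrier(channel: np.ndarray, amplitude: float = 0.5) -> np.ndarray:
    """Add a sub-visible sinusoid at the carrier bin via an inverse FFT."""
    h, w = channel.shape
    template = np.zeros((h, w), dtype=complex)
    for u, v in CARRIER_BINS:
        template[u, v] = amplitude * h * w / 2
    shift = np.fft.ifft2(template).real  # spatial pattern, |max| = 0.25
    return channel + shift

# A smooth gradient image has no energy at the carrier bin on its own.
plain = np.add.outer(np.arange(256.0), np.arange(256.0))
marked = embed_carrier(plain)

print(carrier_energy(marked) > carrier_energy(plain))  # True
print(np.max(np.abs(marked - plain)) < 1.0)            # True: sub-visible shift
```

The pixel shift never exceeds a quarter of one brightness level, yet the carrier bin lights up unambiguously in the spectrum. That is the whole trick: invisible in pixel space, obvious in frequency space.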
For text, SynthID takes a different approach. It uses a sliding window of previous words combined with a private key to alter the probability distribution of the next token. Instead of always picking the most likely word, the model nudges its choices in a pattern that the detector can recognize statistically. The text reads naturally, but the word choices carry an invisible signature.
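A toy version of sliding-window token biasing fits in a few lines. The hash construction, window size, and 50/50 "green list" split below are illustrative choices in the spirit of published LLM watermarking schemes, not Google's actual algorithm:

```python
import hashlib
import random

# Illustrative key and window size; not SynthID's real parameters.
SECRET_KEY = b"hypothetical-private-key"
WINDOW = 3  # how many previous tokens seed the pattern

def is_green(context: list, token: str) -> bool:
    """Pseudo-randomly classify a token as 'green' for this context."""
    data = SECRET_KEY + " ".join(context[-WINDOW:]).encode() + b"#" + token.encode()
    return hashlib.sha256(data).digest()[0] % 2 == 0

VOCAB = [f"w{i}" for i in range(50)]

def generate(n: int, watermark: bool) -> list:
    rng = random.Random(42)
    out = []
    for _ in range(n):
        candidates = rng.sample(VOCAB, 5)  # stand-in for the model's top-k
        if watermark:
            greens = [t for t in candidates if is_green(out, t)]
            out.append(greens[0] if greens else candidates[0])
        else:
            out.append(candidates[0])
    return out

def green_fraction(tokens: list) -> float:
    """The detector's statistic: how often tokens land in their green list."""
    return sum(is_green(tokens[:i], t) for i, t in enumerate(tokens)) / len(tokens)

marked = generate(400, watermark=True)
plain = generate(400, watermark=False)
print(green_fraction(marked) > 0.9)             # True: biased choices score high
print(abs(green_fraction(plain) - 0.5) < 0.15)  # True: unbiased text sits near 0.5
```

The watermarked sequence lands in the green list far more often than chance; unmarked text hovers around 50%. Notice that the detector needs the exact token sequence to recompute each window, which is exactly the weakness Attack 3 exploits.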
The design is clever. The watermark survives JPEG compression, cropping, and moderate color adjustments. Google positioned it as resistant to “common modifications.” If you’ve read about penetration testing and security thinking in software, you’ll recognize the pattern: the question is never whether a system can be broken, but how much effort it takes. Researchers did not limit themselves to common modifications.
Why Is a 99.5% Consistent Phase Template a Problem?
The fundamental vulnerability comes down to one design choice: every image generated by the same model carries the same watermark phase template. Cross-image phase coherence exceeds 99.5% at the carrier frequencies (zDarkHaze/reverse-synthid). The green channel carries the strongest signal.
Think about what that means. If you can isolate the watermark from a single image, you have effectively extracted the key for every image that model will ever produce. It is not a per-image secret. It is a per-model constant.
Here is what the watermark looks like when amplified 500x. Invisible to the naked eye, but a clear structured pattern under analysis:
The carrier frequencies also shift position depending on image resolution. A 1024x1024 image places its primary carrier at frequency bin (9, 9). A 1536x2816 image moves it to bin (768, 704). Different resolutions, different locations. But the phase template at those locations? Identical.
This is a classic trade-off in watermarking systems. A fixed template makes detection reliable and fast. You know exactly what to look for. But it also means an attacker knows exactly what to look for. Once extracted, the watermark becomes a known signal that can be subtracted.
Attack 1: How Does Spectral Analysis Strip the Watermark?
The reverse-synthid project (forked from aloshdenny/reverse-SynthID) takes the most surgical approach. Instead of degrading the image or regenerating it, it identifies the exact frequencies making up the watermark and directly subtracts them.
Extracting the Fingerprint
The extraction is almost embarrassingly simple. Generate a pure black image and a pure white image through the watermarked model. In these outputs, the watermark represents nearly the entire pixel content, since the “image” itself is a flat color. Run a Fourier transform, and the carrier frequencies light up.
For higher resolutions where black/white images are not available, the researchers averaged the frequency content of 88 diverse watermarked images. The image content averages out to noise. The watermark, being identical across all 88, remains as a clear fixed pattern.
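The averaging trick is easy to demonstrate. In this simulation, 88 random "images" each carry the same faint fixed pattern (under 1% of the 0-255 range); the pattern and bin position are made up for the sketch, but the cancellation effect is the real mechanism:

```python
import numpy as np

H = W = 64
y, x = np.mgrid[0:H, 0:W]
# Hypothetical fixed watermark: one faint diagonal carrier.
watermark = 2.0 * np.cos(2 * np.pi * (9 * x + 9 * y) / W)

def extract_template(images):
    """Average many spectra: image content cancels out, the fixed mark remains."""
    return sum(np.fft.fft2(im) for im in images) / len(images)

rng = np.random.default_rng(1)
images = [rng.random((H, W)) * 255 + watermark for _ in range(88)]

avg = np.abs(extract_template(images))
avg[0, 0] = 0.0  # ignore the DC term (mean brightness)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(avg), avg.shape))
print(peak)  # one of the carrier bins: (9, 9) or its mirror (55, 55)
```

Each individual spectrum is dominated by image content, but content has a random phase at any given bin and averages toward zero. The watermark is phase-identical in all 88 spectra, so it is the only thing left standing.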
The Spectral Codebook
Because the watermark shifts frequency positions at different resolutions, a single filter cannot work universally. The solution is a multi-resolution Spectral Codebook: a file storing exact carrier positions, magnitudes, and phases per resolution. When processing an image, the system looks up the matching resolution profile and applies the correct subtraction pattern.
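A codebook like that can be as simple as a resolution-keyed dictionary. The two resolutions and bin positions below come from this article; the magnitude and phase values are placeholders, not extracted data:

```python
import numpy as np

# Minimal codebook sketch: carrier entries per (height, width) resolution.
CODEBOOK = {
    (1024, 1024): [{"bin": (9, 9), "magnitude": 1.0, "phase": 0.0}],
    (1536, 2816): [{"bin": (768, 704), "magnitude": 1.0, "phase": 0.0}],
}

def lookup_profile(image: np.ndarray):
    """Return the carrier profile for this image's resolution, if known."""
    return CODEBOOK.get(image.shape[:2])

print(lookup_profile(np.zeros((1024, 1024, 3)))[0]["bin"])  # (9, 9)
print(lookup_profile(np.zeros((512, 512, 3))))              # None: unknown resolution
```

An unknown resolution simply has no profile, which mirrors the attack's real limitation: it can only subtract what it has previously fingerprinted.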
The Subtraction Process
The V3 bypass uses direct known-signal subtraction. No Wiener filter, no approximation.
The channel weights match SynthID’s embedding strength exactly. The attack knows how much energy was put into each color channel and calibrates its subtraction accordingly:
- Green: weight 1.0 (strongest signal)
- Red: weight 0.85
- Blue: weight 0.70
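Here is what known-signal subtraction looks like in miniature. The template spectrum is a made-up single carrier pair; the channel weights are the ones listed above. If the attacker knows exactly what was added, a weighted subtraction in the frequency domain removes it exactly:

```python
import numpy as np

CHANNEL_WEIGHTS = {0: 0.85, 1: 1.00, 2: 0.70}  # R, G, B weights from the article

def subtract_template(image: np.ndarray, template_spec: np.ndarray) -> np.ndarray:
    """Remove the known watermark spectrum from each channel, scaled by the
    per-channel embedding weight, then transform back to pixel space."""
    out = image.astype(np.float64).copy()
    for c, w in CHANNEL_WEIGHTS.items():
        spec = np.fft.fft2(out[:, :, c])
        spec -= w * template_spec  # direct subtraction: no Wiener filter
        out[:, :, c] = np.fft.ifft2(spec).real
    return np.clip(out, 0, 255)

# Demo: embed a fake template with the same weights, then subtract it.
H = W = 64
template_spec = np.zeros((H, W), dtype=complex)
template_spec[9, 9] = template_spec[-9, -9] = 1024.0  # Hermitian pair -> real pattern
pattern = np.fft.ifft2(template_spec).real            # spatial mark, |max| = 0.5

rng = np.random.default_rng(2)
clean = rng.uniform(10, 245, (H, W, 3))
marked = clean.copy()
for c, w in CHANNEL_WEIGHTS.items():
    marked[:, :, c] += w * pattern

restored = subtract_template(marked, template_spec)
print(np.max(np.abs(restored - clean)) < 1e-6)  # True: exact removal
```

This is why the fixed template matters so much: subtraction of a known signal is lossless in principle, unlike blurring or compression attacks that degrade the image to degrade the mark.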
The Result
The attack drops phase coherence by 91.4%, effectively making the watermark unreadable to the detector. Meanwhile, the image quality stays at 43+ dB PSNR. For reference, anything above 40 dB is considered visually imperceptible. The watermark is gone. The image looks the same. The project references the official SynthID paper (arXiv:2510.09263).
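For intuition on what 43+ dB means, here is the standard PSNR formula for 8-bit images. Even a uniform one-brightness-level shift across every pixel scores about 48 dB:

```python
import numpy as np

def psnr(original: np.ndarray, processed: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit images (peak value 255)."""
    mse = np.mean((np.asarray(original, float) - np.asarray(processed, float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Shifting every pixel by one brightness level is already ~48 dB,
# comfortably above the ~40 dB "visually imperceptible" threshold.
a = np.full((64, 64), 128.0)
print(round(psnr(a, a + 1.0), 1))  # 48.1
```

At 43 dB the per-pixel error is larger than this one-level shift but still far below anything the eye can resolve, which is why the stripped images look untouched.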
Attack 2: Can Diffusion Models Overwrite the Watermark?
The Synthid-Bypass project, built by contributors 00quebec (Klaus) and MegalithicBTC, takes a completely different philosophy. Rather than understanding the watermark’s structure, it just overwrites it.
The Core Insight
SynthID lives in the same place as pixel-level noise. It has to: that is how it stays invisible. But diffusion models generate images by progressively removing noise. If you feed a watermarked image back through a diffusion model at low strength, the model regenerates the noise layer while keeping the composition intact. The new noise replaces the old noise. The watermark was part of the old noise.
The researchers discovered the watermark by asking the AI to generate a pure black image, then cranking the exposure and contrast to reveal the hidden pattern:
How the ComfyUI Workflow Operates
The bypass uses a single ComfyUI workflow combining several tools:
- ControlNet with Canny edge detection locks the image’s structural composition. Edges, shapes, and layout stay fixed.
- A low-denoise diffusion pass (strength 0.2 to 0.4) regenerates only the fine details: texture, grain, and the noise floor where the watermark lives.
- Qwen Image models and Z-Image Turbo handle the generation.
- Ultralytics and MediaPipe preserve face integrity when portraits are involved.
Here is the actual V2 ComfyUI node graph used to execute the bypass:
The process is conceptually simple: tell the AI to “repaint” the image at very low intensity. The major features survive because ControlNet holds them in place. The watermark does not survive because it exists at the same scale as the noise being regenerated.
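The real attack runs a diffusion model, but the keep-the-structure, regenerate-the-noise idea can be simulated with plain NumPy. This sketch is an analogy, not the ComfyUI workflow: a blur stands in for "composition extraction" and fresh Gaussian noise stands in for the regenerated noise floor:

```python
import numpy as np

def circular_blur(img: np.ndarray, k: int = 5) -> np.ndarray:
    """Wrap-around box blur: a stand-in for 'keep only the composition'."""
    shifts = range(-(k // 2), k // 2 + 1)
    out = sum(np.roll(img, s, axis=0) for s in shifts) / k
    return sum(np.roll(out, s, axis=1) for s in shifts) / k

def renoise(img: np.ndarray, rng, noise_scale: float = 2.0) -> np.ndarray:
    """Keep low-frequency structure, replace the pixel-level noise floor,
    mimicking a low-strength img2img pass."""
    return circular_blur(img) + rng.normal(0.0, noise_scale, img.shape)

H = W = 64
y, x = np.mgrid[0:H, 0:W]
composition = 100 + 50 * np.sin(2 * np.pi * x / W)         # coarse content
watermark = 0.5 * np.cos(2 * np.pi * (9 * x + 9 * y) / W)  # noise-level mark
marked = composition + watermark

out = renoise(marked, np.random.default_rng(3))

# Structure survives; the watermark's carrier frequency does not.
corr = np.corrcoef(out.ravel(), composition.ravel())[0, 1]
before = np.abs(np.fft.fft2(marked))[9, 9]
after = np.abs(np.fft.fft2(out))[9, 9]
print(corr > 0.95, after < 0.5 * before)  # True True
```

The low-frequency composition comes through nearly untouched while the carrier bin collapses, because the blur-plus-noise step only rewrites the frequency band the watermark lives in. The diffusion version does the same thing with a far better "structure extractor": the model's own learned prior, held in place by ControlNet.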
The project references the academic paper “Stable signature is unstable: Removing image watermark from diffusion models” by Hu, Y., et al. (2024), which established the theoretical basis for this class of attack.
Why This Is Hard to Defend Against
This attack does not require any knowledge of SynthID’s internals. It works against any watermark that embeds information in pixel-space noise. A stronger watermark would need to modify the image at a level visible to humans, which defeats the purpose of being invisible. The attacker’s tool is the same class of model that created the image in the first place. You cannot easily strip the ability to generate images from systems designed to generate images.
Attack 3: Why Do Synonym Swaps Break Text Watermarks?
The text watermark vulnerability is the simplest to understand and the hardest to fix.
SynthID for text works by altering token probabilities during generation. It uses a sliding window of previous words combined with a private key to create a deterministic pattern. Instead of choosing word A with 80% probability and word B with 20%, the watermarked model might shift to 70% and 30%. Across hundreds of tokens, these small biases create a detectable statistical signature.
The dependency on exact word sequences is the weakness. Replace “significant” with “major.” Swap “utilize” for “use.” Restructure a sentence from passive to active voice. Each substitution breaks the sliding window’s expected sequence. The detector can no longer trace the pattern back through the chain of conditional probabilities.
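The cascade effect is easy to show with a toy green-list detector (illustrative hashing, not Google's scheme). Swapping 20% of tokens damages far more than 20% of the signal, because each edit also corrupts the context window for the tokens that follow it:

```python
import hashlib
import random

KEY = b"hypothetical-key"
WINDOW = 3

def is_green(context: list, token: str) -> bool:
    data = KEY + " ".join(context[-WINDOW:]).encode() + b"#" + token.encode()
    return hashlib.sha256(data).digest()[0] % 2 == 0

def score(tokens: list) -> float:
    """Detector statistic: fraction of tokens in their context's green list."""
    return sum(is_green(tokens[:i], t) for i, t in enumerate(tokens)) / len(tokens)

VOCAB = [f"word{i}" for i in range(200)]
rng = random.Random(7)

# Build a fully watermarked sequence: every token is green for its context.
marked = []
for _ in range(300):
    marked.append(next(t for t in rng.sample(VOCAB, 30) if is_green(marked, t)))

# "Synonym swap": replace 20% of tokens with alternatives. Each swap breaks
# the green test at its own position AND shifts the window for the next
# WINDOW tokens, so the damage spreads beyond the edited words.
swapped = list(marked)
for i in rng.sample(range(300), 60):
    swapped[i] = rng.choice(VOCAB)

print(score(marked))         # 1.0 by construction
print(score(swapped) < 0.9)  # True: the signature is badly degraded
```

A detection threshold has to sit well above 0.5 to avoid flagging human text, so even moderate paraphrasing pushes genuinely watermarked text into the undetectable zone.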
You don’t need sophisticated tools for this. Any paraphrasing model, even a simple grammar checker that suggests alternatives, can inadvertently strip the watermark. The attack does not require intent. It happens as a side effect of normal text editing.
This is a design constraint, not a bug. The watermark must be invisible (no effect on text quality), must survive across long passages (statistical detection), and must not alter the meaning of the output. These requirements force the watermark to operate at the shallowest level possible: token probability nudging. And that shallow level cannot survive content transformation.
What Does This Mean for AI Content Trust?
Three attacks, three different approaches, one conclusion: invisible watermarks are speed bumps, not security boundaries.
That does not make them useless. Speed bumps slow down casual misuse. They provide a quick check for platforms that want to flag likely AI content. They give well-meaning users a way to verify origin. For the average screenshot shared on social media, SynthID works fine.
But against a motivated actor with freely available tools and a few minutes? The watermark is gone. Spectral analysis strips it surgically. A diffusion pass overwrites it blindly. Synonym swaps defeat the text version as a side effect of basic editing.
The Arms Race Problem
The developers behind these bypass projects frame their work as educational and safety research. They want to expose limitations so that more robust systems get built. But the fundamental asymmetry is structural: the watermark must be invisible and survive common modifications, while the attacker can use any tool that preserves visual quality.
Watermarking researchers can adjust carrier frequencies, add redundancy, vary the phase template per image. Each defense makes the watermark more complex, potentially more visible, and definitely more computationally expensive. The attacker just runs a diffusion pass.
C2PA: A Complementary Approach
The Coalition for Content Provenance and Authenticity (C2PA) takes a fundamentally different strategy. Instead of embedding information in pixel values, C2PA attaches cryptographic metadata to files. Think of it as a signed certificate that travels with the image, recording who created it, when, and with what tools.
C2PA does not try to be invisible. It is an explicit declaration of provenance, similar in spirit to how structured data with JSON-LD makes page metadata machine-readable rather than hidden. C2PA can be stripped by re-saving the image, but that is a detectable act. An image without C2PA metadata in a world where major platforms require it is suspicious by its absence.
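The shape of the scheme can be sketched with standard-library crypto. This is a toy: real C2PA manifests use COSE signatures with X.509 certificate chains, and the HMAC shared key here is only a stand-in for a proper signing key. What the sketch preserves is the core property: the claim is bound to the exact content bytes, so any edit is detectable:

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"hypothetical-issuer-key"  # stand-in for a real signing credential

def make_manifest(content: bytes, creator: str, tool: str) -> dict:
    """Build a signed claim recording who made the content and with what."""
    claim = {
        "creator": creator,
        "tool": tool,
        "content_hash": hashlib.sha256(content).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": sig}

def verify(content: bytes, manifest: dict) -> bool:
    """Check both the claim's signature and its binding to the content bytes."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    sig_ok = hmac.compare_digest(manifest["signature"], expected)
    hash_ok = manifest["claim"]["content_hash"] == hashlib.sha256(content).hexdigest()
    return sig_ok and hash_ok

image = b"\x89PNG fake image bytes"
manifest = make_manifest(image, creator="example-model", tool="example-tool")
print(verify(image, manifest))            # True
print(verify(image + b"edit", manifest))  # False: any byte change is detectable
```

Contrast this with the watermark attacks above: there is no "subtract the signature" move here, because forging a valid manifest requires the signing key, not signal processing.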
The strongest stance combines both: SynthID (or similar) as a quick statistical check, C2PA as a cryptographic proof of chain of custody. Neither alone solves the problem. Together, they raise the cost of convincing forgery.
Where This Leaves Developers
If you are building systems that consume AI-generated content, do not treat watermark detection as a binary trust signal. Treat it as one input among several. Combine it with metadata checks, provenance chains, and contextual analysis. The watermark tells you “this probably came from an AI model.” It does not tell you “this definitely came from an AI model,” and it definitely cannot tell you “this did not come from an AI model.”
The deeper lesson from these attacks is that any security system operating purely in pixel space is competing against the same generative models it is trying to control. As 1-bit LLMs make local inference cheaper and diffusion models keep improving, the cost of regeneration drops while the cost of robust watermarking rises. That is the structural reality, and building systems that acknowledge it is better than building systems that pretend otherwise.
Frequently Asked Questions
Can SynthID detect if a watermark has been removed?
No. SynthID’s detection model checks for the presence of the watermark pattern. If the pattern is absent, the detector returns “not detected,” which is the same result as for any non-AI image. There is no separate signal indicating removal. The spectral attack achieves a 91.4% phase coherence drop, pushing the watermark below detection thresholds.
Does removing the watermark affect image quality?
The spectral analysis approach maintains 43+ dB PSNR, which is visually imperceptible. The diffusion re-noising method can introduce subtle texture changes depending on the denoise strength (0.2-0.4), but at low settings the output is indistinguishable from the original to the human eye.
Are other AI watermarks vulnerable to the same attacks?
The diffusion re-noising attack works against any watermark embedded in pixel-space noise, not just SynthID. The paper “Stable signature is unstable” (Hu et al., 2024) demonstrated this against Stable Diffusion’s watermarking. The spectral attack is specific to SynthID’s fixed phase template design, but similar frequency-analysis approaches could target other spectral watermarks.
What is C2PA and how does it differ from watermarking?
C2PA (Coalition for Content Provenance and Authenticity) attaches cryptographic metadata to files rather than embedding information in pixels. It records creation details in a signed manifest. Unlike watermarks, C2PA metadata is explicit and verifiable. It can be stripped by re-saving the file, but the absence of metadata is itself a signal in ecosystems that expect it.