Results (DiffusionGemma vs Gemma-4):
- CER (character error rate): 0.066 → 0.036 vs 0.066 → 0.042
- OCR errors fixed: 85% vs 62%
- median time per passage: 1.7s vs 14.7s
- ~10 denoising steps vs ~200 sequential tokens
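For anyone who wants to reproduce the CER numbers, here is a minimal sketch of how character error rate is usually computed: Levenshtein edit distance between the output and the ground truth, divided by the ground-truth length. The thread doesn't say which implementation or text normalisation was used, so the function names and the lack of normalisation here are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong character in a seven-character reference.
example_cer = cer("the cat", "tbe cat")
```

Read this way, 0.066 → 0.036 is roughly a 45% relative drop in CER, and 0.066 → 0.042 is roughly 36%.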
Digging through the transformers source, I found that generate() accepts an undocumented starting canvas (decoder_input_ids). So I seeded it with the OCR text itself, i.e. "here's a nearly-right draft, denoise it."
One expectation turned out wrong: I thought the diffusion model would over-correct less ("it just denoises"). It did the opposite and edits more aggressively in both directions. Greedy AR decoding is the conservative editor: it touched 0.4% of already-correct text, vs 1.4% for the diffusion model.
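One way to measure "already-correct text touched" is to align the OCR text against both the ground truth and the model output, then count OCR words that were right but got changed anyway. The sketch below does this at word level with difflib. The thread doesn't say how the 0.4% and 1.4% figures were computed, so the word-level granularity and the `edit_rates` helper are assumptions, not the original method.

```python
from difflib import SequenceMatcher


def matched_positions(a: list[str], b: list[str]) -> set[int]:
    """Indices of items in `a` that align unchanged with an item in `b`."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    kept: set[int] = set()
    for block in matcher.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))
    return kept


def edit_rates(ocr: str, truth: str, output: str) -> tuple[float, float]:
    """Return (share of correct OCR words touched, share of wrong OCR words edited)."""
    ocr_words = ocr.split()
    correct = matched_positions(ocr_words, truth.split())  # OCR words that were already right
    kept = matched_positions(ocr_words, output.split())    # OCR words the model left alone
    wrong = set(range(len(ocr_words))) - correct
    touched_correct = len(correct - kept) / len(correct) if correct else 0.0
    edited_wrong = len(wrong - kept) / len(wrong) if wrong else 0.0
    return touched_correct, edited_wrong


# Hypothetical example: the model fixes "tbe" but also rewrites a correct word.
touched, edited = edit_rates(
    ocr="tbe quick brown fox",
    truth="the quick brown fox",
    output="the quick brown dog",
)
```

Note that the second number only says a wrong word was edited, not that the edit was right; checking the edit against the ground truth would need a further alignment step.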
So a day-one experimental model is beating its AR cousin on both quality and speed on this task. I'd guess fine-tuning could make this a lot better?
Can the new DiffusionGemma model help fix broken OCR?
In theory, denoising tokens in parallel could work better for OCR correction, since the model sees the whole passage as context upfront?
Pointed it at 19th-century newspaper OCR. It corrected better than the autoregressive baseline — at ~8x the speed.
Got a digitised collection that needs OCR? uv-scripts is a set of single-file Python scripts that OCR a whole image dataset to markdown in one command — 20+ open VLMs to pick from, nothing to install but uv.
github.com/davanstrien/...
The setup: 75 passages of BLN600 (19th-century British Library newspapers with human ground-truth transcriptions), DiffusionGemma-26B-A4B vs Gemma-4-E4B, identical prompt, bf16, one A100.
For me, seeding the canvas with the OCR text was a negative result: it finishes in 2–5 steps, but it barely edits, with 61/75 outputs identical to the noisy input. My guess is that real text is off-distribution as noise, so the sampler just accepts it? (You can try this in the demo)