Discrete Diffusion Language Models for Interactive Radiology Report Drafting
Abstract
Diffusion language models match or exceed autoregressive models in medical visual question answering while offering faster decoding and bidirectional text editing capabilities.
Diffusion language models, which generate text by denoising a token canvas bidirectionally instead of emitting tokens left to right, have become competitive with autoregressive (AR) generation. Medical foundation models, however, remain almost entirely autoregressive. We adapt a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmark it against its same-size AR sibling Gemma-4-26B under an identical LoRA recipe on medical visual question answering datasets, scored by a verbosity-robust LLM judge. Diffusion matches or exceeds AR on all of them, and the finetuned model (3.8B active) is competitive with frontier vision-language models; its decoding is also 3.5-4.4x faster. Beyond this parity, the diffusion model offers a drafting capability AR lacks: any-order infill. Because the canvas is denoised bidirectionally, a radiologist can fix report fragments in place and have the model fill in the text between them. This operation is native to diffusion, whereas autoregressive models handle it poorly. It suits real reports, which are often terse or inconsistent across clinicians and institutions.
Community
Discrete diffusion LMs can draft radiology reports interactively - and match autoregression while doing it.
We finetune an MoE diffusion VLM (DiffusionGemma-26B, 3.8B active) head-to-head against its autoregressive sibling (Gemma-4-26B) under an identical LoRA recipe - same backbone, vision tower, data, LoRA targets - so the generative paradigm is nearly the only variable. Then we lean on a property AR doesn't have off the shelf: any-order infill.
- Matches/edges out AR on medical VQA (VQA-RAD, SLAKE, VQA-Med), LLM-judge scored - same recipe, no diffusion-specific tricks.
- 3.5-4.4x faster decoding at matched output budgets.
- Training-free interactive infill: clamp fixed report fragments at each denoising step and the model fills the gaps using both-sided context. The radiologist pins what they know; the model drafts the rest.
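The clamping idea in the last bullet can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: `infill`, `clamp`, and `toy_denoiser` are hypothetical names, the toy denoiser merely stands in for a real model such as DiffusionGemma, and the confidence-based unmasking schedule is one common choice for masked diffusion samplers, not necessarily the one used in the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

MASK = "<mask>"
Denoiser = Callable[[List[str]], List[Tuple[str, float]]]


def clamp(canvas: List[str], pinned: Dict[int, str]) -> List[str]:
    """Overwrite pinned positions with the radiologist's fixed tokens."""
    out = list(canvas)
    for pos, tok in pinned.items():
        out[pos] = tok
    return out


def infill(length: int, pinned: Dict[int, str],
           denoiser: Denoiser, steps: int) -> List[str]:
    """Fill the free positions of a canvas while keeping pinned tokens fixed."""
    canvas = clamp([MASK] * length, pinned)
    free = [i for i in range(length) if i not in pinned]
    for step in range(1, steps + 1):
        # The denoiser proposes (token, confidence) for EVERY position,
        # including pinned ones, using context on both sides.
        preds = denoiser(canvas)
        proposal = [tok for tok, _ in preds]
        # Keep only the most confident free positions; re-mask the rest.
        keep = math.ceil(len(free) * step / steps)
        ranked = sorted(free, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[keep:]:
            proposal[i] = MASK
        # Clamp: whatever the model proposed, pinned fragments win.
        canvas = clamp(proposal, pinned)
    return canvas


# Hypothetical stand-in for a real model: it always proposes tokens from a
# fixed sentence and is more confident next to already-revealed tokens.
REFERENCE = ["no", "acute", "cardiopulmonary", "abnormality", "is", "seen"]


def toy_denoiser(canvas: List[str]) -> List[Tuple[str, float]]:
    known = [i for i, tok in enumerate(canvas) if tok != MASK]
    preds = []
    for i in range(len(canvas)):
        dist = min((abs(i - j) for j in known), default=len(canvas))
        preds.append((REFERENCE[i], 1.0 / (1.0 + dist)))
    return preds


# The radiologist pins the first and last tokens; the model drafts the middle.
draft = infill(length=6, pinned={0: "No", 5: "identified."},
               denoiser=toy_denoiser, steps=2)
```

In this toy run the denoiser would have written "no" and "seen" at the two pinned positions, so the per-step clamp is what preserves the radiologist's wording. No retraining is involved: the constraint is applied purely at sampling time.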
To our knowledge this is the first medical finetune of DiffusionGemma. It is currently done per-dataset on medical VQA (plus single-sentence infill on MIMIC-CXR), so the accuracy reported here is VQA accuracy. A full medical finetune targeting report generation is in the works.
paper: https://arxiv.org/abs/2607.01436
🤗 weights: https://huggingface.co/gevaertlab/diffusiongemma-radiology-vqa
💻 code: https://github.com/mxvp/discrete_diffusion_RRG
Feel free to reach out!