arxiv:2607.01436

Discrete Diffusion Language Models for Interactive Radiology Report Drafting

Published on Jul 1

· Submitted by

Max Van Puyvelde on Jul 3

Gevaert Lab

Abstract

Diffusion language models match or exceed autoregressive models in medical visual question answering while offering faster decoding and bidirectional text editing capabilities.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Diffusion language models, which generate text by denoising a token canvas bidirectionally instead of emitting tokens left to right, have become competitive with autoregressive (AR) generation. Medical foundation models, however, remain almost entirely autoregressive. We adapt a mixture-of-experts diffusion language model, DiffusionGemma-26B, and benchmark it against its same-size AR sibling Gemma-4-26B under an identical LoRA recipe on medical visual question answering datasets, scored by a verbosity-robust LLM judge. Diffusion matches or exceeds AR on all of them, and the finetuned model (3.8B active) is competitive with frontier vision-language models; its decoding is also 3.5-4.4x faster. Beyond this parity, the diffusion model offers a drafting capability AR lacks: any-order infill. Because the canvas is denoised bidirectionally, a radiologist can fix report fragments and have the model fill the text between them, an operation that is inherent to diffusion and that autoregression handles poorly. This suits real reports, which are often terse or inconsistent across clinicians and institutions.

Community

mxvp

Paper submitter about 10 hours ago

Discrete diffusion LMs can draft radiology reports interactively - and match autoregression while doing it.

We finetune an MoE diffusion VLM (DiffusionGemma-26B, 3.8B active) head-to-head against its autoregressive sibling (Gemma-4-26B) under an identical LoRA recipe - same backbone, vision tower, data, LoRA targets - so the generative paradigm is nearly the only variable. Then we lean on a property AR doesn't have off the shelf: any-order infill.

Matches or edges out AR on medical VQA (VQA-RAD, SLAKE, VQA-Med) as scored by an LLM judge, under the same recipe and with no diffusion-specific tricks.
3.5–4.4× faster decoding at matched output budgets.
Training-free interactive infill: clamp fixed report fragments at each denoising step and the model fills the gaps using both-sided context. The radiologist pins what they know; the model drafts the rest.
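
The clamping idea in the last point can be sketched in a few lines. This is a minimal toy illustration, not the released implementation: `toy_denoiser`, `clamped_infill`, the tiny vocabulary, and the confidence-based unmasking schedule are all hypothetical stand-ins, and a real sampler would call the finetuned model instead of drawing random tokens.

```python
import random

MASK = "<mask>"

def toy_denoiser(canvas, rng):
    """Hypothetical stand-in for the diffusion model.

    Returns {position: (token, confidence)} for every masked position.
    A real model would condition on the image and on both-sided context.
    """
    vocab = ["no", "acute", "findings", "lungs", "are", "clear"]
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }

def clamped_infill(length, pinned, steps=4, seed=0):
    """Fill a token canvas while keeping the pinned fragments fixed.

    pinned maps a position to the token the radiologist has fixed.
    """
    rng = random.Random(seed)
    canvas = [MASK] * length
    for step in range(steps):
        # Clamp: rewrite the pinned tokens at every denoising step.
        # Redundant for this toy denoiser, but it matters for samplers
        # that are allowed to revise already-placed tokens.
        for pos, tok in pinned.items():
            canvas[pos] = tok
        proposals = toy_denoiser(canvas, rng)
        if not proposals:
            break
        # Commit an even share of the remaining masked positions,
        # most confident first, so the last step fills whatever is left.
        k = -(-len(proposals) // (steps - step))  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            canvas[i] = proposals[i][0]
    # Final clamp so the pinned text is guaranteed to survive.
    for pos, tok in pinned.items():
        canvas[pos] = tok
    return canvas

# Pin the first and last tokens and let the loop fill the gap between them.
draft = clamped_infill(8, {0: "Heart", 7: "normal."})
print(" ".join(draft))
```

Because the pinned positions are overwritten before each model call, the denoiser always sees them as context on both sides of the gap, which is what makes the procedure training-free: nothing about the model changes, only the sampling loop.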

To our knowledge this is the first medical finetune of DiffusionGemma. It is currently done per-dataset on medical VQA (plus single-sentence infill on MIMIC-CXR), so the accuracy reported here is VQA accuracy, not report-generation quality. A full medical finetune targeting report generation is in the works.

📄 https://arxiv.org/abs/2607.01436
🤗 weights: https://huggingface.co/gevaertlab/diffusiongemma-radiology-vqa
💻 code: https://github.com/mxvp/discrete_diffusion_RRG

Feel free to reach out! 🤗
