library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- LightChen2333/M3CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any

# Omni-R1-Zero

[Paper](https://arxiv.org/abs/2601.09536)
[Code](https://github.com/ModalityDance/Omni-R1)
[Dataset](https://huggingface.co/datasets/ModalityDance/Omni-Bench)

## Overview

**Omni-R1-Zero** is trained **without multimodal annotations**. It bootstraps **step-wise visualizations** from **text-only CoT seeds** (e.g., M3CoT) and then follows the same PeSFT+PeRPO recipe as Omni-R1 to learn interleaved multimodal reasoning.

## Usage

```python
# 1) Import & load
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

model_id = "ModalityDance/Omni-R1-Zero"  # or a local checkpoint path
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input
prompt = "You are a helpful assistant.\nUser: Which of these would appear shinier when polished? A. Metal spoon B. Wooden spoon\nThink with images first, the image reasoning process and answer are enclosed within <reserved12856> <reserved12857> and <reserved12866> <reserved12867> XML tags, respectively.\nAssistant:"
inputs = processor(
    prompt,
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# 3) Call the model
outputs = model.generate(
    **inputs,
    max_length=4096,
    do_sample=True,
    temperature=1.0,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```
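
Because the decoded string keeps special tokens and also contains the prompt, it can help to separate the completion and pull out the two tagged spans. The following is a minimal sketch, not part of the official scripts: it assumes the model emits the tag pairs named in the prompt literally, and `extract_span` and `parse_output` are hypothetical helper names.

```python
import re

# Tag pairs quoted from the usage prompt; that the model emits them
# literally in the decoded string is an assumption.
REASONING_TAGS = ("<reserved12856>", "<reserved12857>")
ANSWER_TAGS = ("<reserved12866>", "<reserved12867>")


def extract_span(completion, tags):
    """Return the text between the first open/close tag pair, or None."""
    open_tag, close_tag = tags
    pattern = re.escape(open_tag) + r"(.*?)" + re.escape(close_tag)
    match = re.search(pattern, completion, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_output(decoded, marker="Assistant:"):
    """Drop the prompt part, then extract the reasoning and answer spans."""
    # The prompt itself mentions both tag pairs, so only the text after
    # the first "Assistant:" marker is searched.
    _, found, completion = decoded.partition(marker)
    if not found:
        completion = decoded
    return {
        "reasoning": extract_span(completion, REASONING_TAGS),
        "answer": extract_span(completion, ANSWER_TAGS),
    }


# Hand-written stand-in for a decoded string, not real model output.
sample = (
    "User: ... enclosed within <reserved12856> <reserved12857> and "
    "<reserved12866> <reserved12867> XML tags, respectively.\nAssistant:"
    "<reserved12856>metal reflects light<reserved12857>"
    "<reserved12866>A<reserved12867>"
)
print(parse_output(sample))
```

In real outputs the reasoning span is expected to contain image tokens rather than plain text, so the extracted string would still need the interleaved decoding provided by the repository scripts.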

For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1
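
As a rough illustration of the input side of batch JSONL inference, the sketch below turns one JSON record per line into a prompt using the same wording as the single-input example. It is not the repository's script: the `question` field name and the `load_prompts` helper are hypothetical, and the actual JSONL schema may differ.

```python
import io
import json

# Template copied from the single-input usage prompt.
PROMPT_TEMPLATE = (
    "You are a helpful assistant.\nUser: {question}\n"
    "Think with images first, the image reasoning process and answer are "
    "enclosed within <reserved12856> <reserved12857> and "
    "<reserved12866> <reserved12867> XML tags, respectively.\nAssistant:"
)


def load_prompts(lines, field="question"):
    """Yield one formatted prompt per non-empty JSONL line.

    `field` is a hypothetical key name; adjust it to the real schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        yield PROMPT_TEMPLATE.format(question=record[field])


# In-memory stand-in for a JSONL file; open("data.jsonl") works the same way.
demo = io.StringIO(
    '{"question": "Which is heavier? A. Feather B. Brick"}\n'
    "\n"
    '{"question": "Is ice a solid?"}\n'
)
prompts = list(load_prompts(demo))
print(len(prompts))
```

Each resulting string can be passed to the processor and `generate` call from the usage example, one prompt at a time.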

## License

This project is licensed under the **MIT License**.
It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.

## Citation

```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
  title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
  author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
  year={2026},
  eprint={2601.09536},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.09536},
}
```