Yunfei Li2, Siliang Tang1, Jun Xiao1, Fei Wu1, Hang Zhao2, Yueting Zhuang1
1Zhejiang University, 2Ant Group
*Equal Contribution, ‡Project Leader, †Corresponding Authors
- [July 23, 2025] We have released the checkpoint and training data of Janus-Pro-R1.
- [June 18, 2025] We have released the training and inference scripts of Janus-Pro-R1.
- [June 2, 2025] Our paper is now available on arXiv: Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation.
- [x] Release the paper
- [x] Release training scripts
- [x] Release inference scripts
- [x] Release training data
- [x] Release Janus-Pro-R1 checkpoint
We propose a two-stage training paradigm that enables introspective text-to-image generation through genuine chain-of-thought (CoT) reasoning, unlocking what we call Aha Moments in visual generation:
- Stage 1 – Supervised Fine-Tuning (SFT): the model learns structured visual reasoning through three subtasks:
  - Text-to-image generation
  - Image-text consistency self-evaluation
  - Image regeneration through reflection
- Stage 2 – Reinforcement Learning (RL): the model is trained under a token-level Markov decision process with bi-level QA-based rewards that encourage spontaneous reasoning and correction, optimized via GRPO (see the sketch below).
With self-reflective capabilities, this approach bridges the gap between text-to-image generation and image editing, enabling a unified and coherent visual reasoning process.
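As a concrete illustration of the RL stage, here is a minimal sketch of GRPO's group-relative advantage computation in PyTorch. The function name, group size, and reward values are illustrative assumptions rather than the repo's actual implementation; in training, the rewards would come from the bi-level QA-based checks described above.

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages for G images sampled from one prompt.

    GRPO replaces a learned critic with a group baseline: each sample's
    advantage is its reward standardized against its own group's statistics.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: QA-based rewards for 4 images generated from the same prompt.
rewards = torch.tensor([0.25, 0.75, 1.0, 0.0])
print(grpo_advantages(rewards))  # above-average images get positive advantage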
1. Prepare Environment
We recommend using Python>=3.10 and setting up a virtual environment:
# clone our repo
git clone https://github.com/wendell0218/Janus-Pro-R1.git
cd Janus-Pro-R1
# prepare python environment for sft
conda create -n janus-pro-r1-sft python=3.11
conda activate janus-pro-r1-sft
pip install -r requirements-sft.txt
# prepare python environment for rl
conda create -n janus-pro-r1-rl python=3.11
conda activate janus-pro-r1-rl
pip install -r requirements-rl.txt
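Optionally, you can sanity-check either environment before training. The one-liner below only assumes that the requirements files install PyTorch, which the training code depends on:

# verify torch is importable and a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"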
2. Prepare Pretrained Model
Janus-Pro-R1 uses Janus-Pro-7B as the pretrained model for subsequent supervised fine-tuning. You can download it with the following commands:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/Janus-Pro-7B
cd Janus-Pro-7B
git lfs pull
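Before moving on, it can help to confirm that git lfs actually fetched the weights rather than leaving small pointer stubs. The file patterns below are an assumption about the checkpoint layout:

# list weight files and their sizes (run inside Janus-Pro-7B)
python -c "from pathlib import Path; [print(p, f'{p.stat().st_size/1e9:.2f} GB') for p in Path('.').rglob('*') if p.suffix in ('.bin', '.safetensors')]"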
The SFT training data for introspective text-to-image generation is available at https://huggingface.co/datasets/midbee/Janus-Pro-R1-Data.
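If you want to inspect the released data before training, a minimal sketch using the Hugging Face datasets library follows; that the repo loads directly via load_dataset (rather than as raw files) and its exact schema are assumptions:

from datasets import load_dataset

# print the available splits and their sizes
ds = load_dataset("midbee/Janus-Pro-R1-Data")
print(ds)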
You can perform SFT for Text-to-Image Generation using the following command:
cd janus-sft
python launch.py --args_yml_fn configs/t2i_generation.yml
Additionally, you can run SFT for Image Editing using the following command:
cd janus-sft
python launch.py --args_yml_fn configs/image_editing.yml
For a more detailed introduction to the Supervised Fine-Tuning stage, please see here.
You can perform RL for Text-to-Image Generation using the following command:
cd janus-rl/src/open_r1
export ACCELERATE_CONFIG=../../recipes/accelerate_configs/zero2.yaml
export GRPO_CONFIG=../../recipes/t2i_generation/grpo.yml
export NUM_PROCESSES=8
accelerate launch \
--config_file $ACCELERATE_CONFIG \
--num_processes $NUM_PROCESSES \
grpo_t2i.py \
--config $GRPO_CONFIG
Additionally, you can use the following command for RL on Image Editing:
cd janus-rl/src/open_r1
export ACCELERATE_CONFIG=../../recipes/accelerate_configs/zero2.yaml
export GRPO_CONFIG=../../recipes/image_editing/grpo.yml
export NUM_PROCESSES=8
accelerate launch \
--config_file $ACCELERATE_CONFIG \
--num_processes $NUM_PROCESSES \
grpo_editing.py \
--config $GRPO_CONFIG
For a more detailed introduction to the Reinforcement Learning stage, please see here.
We illustrate the inference process of introspective text-to-image generation in the simplest scenario, where the model performs a single round of image self-evaluation and regeneration after the initial text-to-image generation.
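Conceptually, this single-round loop looks like the following sketch; generate_image, self_evaluate, and regenerate_image are hypothetical placeholders for the logic inside inference/inference.py, not real APIs.

from dataclasses import dataclass

@dataclass
class Critique:
    consistent: bool  # whether the image matches the caption
    reasoning: str    # the model's chain-of-thought self-evaluation

# Hypothetical placeholders for the real logic in inference/inference.py:
def generate_image(model, caption): ...
def self_evaluate(model, caption, image) -> Critique: ...
def regenerate_image(model, caption, image, critique): ...

def introspective_generation(model, caption):
    image = generate_image(model, caption)           # initial text-to-image
    critique = self_evaluate(model, caption, image)  # CoT consistency check
    if not critique.consistent:                      # Aha Moment: reflect and redo
        image = regenerate_image(model, caption, image, critique)
    return image, critique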
First, prepare the Janus-Pro-R1-7B model, which uses Janus-Pro-7B as its backbone. You can download it from 🤗https://huggingface.co/midbee/Janus-Pro-R1-7B.
You can run inference with the following command, where --model_path (passed below as $CKPT_PATH) is the local path to your downloaded Janus-Pro-R1-7B checkpoint:
python inference/inference.py \
--model_path $CKPT_PATH \
--caption "a brown giraffe and a white stop sign" \
--gen_path "results/samples" \
--reason_path "results/reason.jsonl" \
--regen_path "results/regen_samples" \
--cfg 5.0 \
--parallel_size 4
After inference completes, the results directory will have the following structure, with one initial image in samples/ and one regenerated image in regen_samples/ for each of the parallel_size generations:
results/
├── reason.jsonl
├── samples/
│ ├── 0000.png
│ ├── 0001.png
│ ├── 0002.png
│ └── 0003.png
└── regen_samples/
├── 0000.png
├── 0001.png
├── 0002.png
└── 0003.png
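The reasoning traces in reason.jsonl can be inspected with a few lines of Python; since the exact record schema is not documented here, the sketch simply prints each entry:

import json

# one self-evaluation record per generated image (schema is an assumption)
with open("results/reason.jsonl") as f:
    for line in f:
        print(json.loads(line))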
For a more detailed introduction to inference, please see here.
Our project is developed based on the following repositories:
- Janus-Series: Unified Multimodal Understanding and Generation Models
- Open-R1: Fully open reproduction of DeepSeek-R1
If you find this work useful for your research, please cite our paper and star our git repo:
@article{pan2025unlocking,
title={Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation},
author={Pan, Kaihang and Wu, Yang and Bu, Wendong and Shen, Kai and Li, Juncheng and Wang, Yingting and Li, Yunfei and Tang, Siliang and Xiao, Jun and Wu, Fei and others},
journal={arXiv preprint arXiv:2506.01480},
year={2025}
}
@article{pan2025focusdiff,
title={FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL},
author={Pan, Kaihang and Bu, Wendong and Wu, Yuruo and Wu, Yang and Shen, Kai and Li, Yunfei and Zhao, Hang and Li, Juncheng and Tang, Siliang and Zhuang, Yueting},
journal={arXiv preprint arXiv:2506.05501},
year={2025}
}