
Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation

Kaihang Pan1*, Yang Wu2*, Wendong Bu1*, Kai Shen1‡, Juncheng Li1†, Yingting Wang2,

Yunfei Li2, Siliang Tang1, Jun Xiao1, Fei Wu1, Hang Zhao2, Yueting Zhuang1

1Zhejiang University, 2Ant Group

*Equal Contribution, ‡Project Leader, †Corresponding Authors


🔥 News

📝 TODO

  • Release the paper
  • Release training scripts
  • Release inference scripts
  • Release training data
  • Release Janus-Pro-R1 checkpoint

🚀 Overview

We propose a two-stage training paradigm that enables introspective text-to-image generation through genuine chain-of-thought (CoT) reasoning, unlocking what we call Aha Moments in visual generation:

  • Stage 1 – Supervised Fine-Tuning (SFT):
    The model learns structured visual reasoning through three subtasks:

    • Text-to-image generation
    • Image-text consistency self-evaluation
    • Image regeneration through reflection
  • Stage 2 – Reinforcement Learning (RL):
    The model is trained with a token-level Markov decision process and bi-level QA-based rewards to encourage spontaneous reasoning and correction, and is optimized via GRPO.

With self-reflective capabilities, this approach bridges the gap between text-to-image generation and image editing, enabling a unified and coherent visual reasoning process.
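
Conceptually, the resulting inference loop alternates generation, self-evaluation, and reflective regeneration. The following minimal Python sketch illustrates this flow; the function names are hypothetical placeholders, not the repository's actual API:

# Conceptual sketch of the introspective generation loop (hypothetical names).
def introspective_t2i(prompt, generate, self_evaluate, regenerate, max_rounds=1):
    """generate: prompt -> image
    self_evaluate: (prompt, image) -> bool, True if the image matches the prompt
    regenerate: (prompt, image) -> image, reflects on the draft and redraws
    """
    image = generate(prompt)              # initial text-to-image generation
    for _ in range(max_rounds):
        if self_evaluate(prompt, image):  # image-text consistency self-evaluation
            break
        image = regenerate(prompt, image) # image regeneration through reflection
    return image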


⚙️ Installation

1. Prepare Environment

We recommend using Python>=3.10 and setting up a virtual environment:

# clone our repo
git clone https://github.com/wendell0218/Janus-Pro-R1.git
cd Janus-Pro-R1

# prepare python environment for sft
conda create -n janus-pro-r1-sft python=3.11
conda activate janus-pro-r1-sft
pip install -r requirements-sft.txt

# prepare python environment for rl
conda create -n janus-pro-r1-rl python=3.11
conda activate janus-pro-r1-rl
pip install -r requirements-rl.txt
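
If you want to verify either environment before training, and assuming the requirements files install PyTorch (which the training scripts depend on), a quick sanity check is:

# Optional environment sanity check (assumes PyTorch was installed above).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))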

2. Prepare Pretrained Model

Janus-Pro-R1 utilizes Janus-Pro-7B as the pretrained model for subsequent supervised fine-tuning. You can download the corresponding model using the following command:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/Janus-Pro-7B
cd Janus-Pro-7B
git lfs pull
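
Alternatively, the same checkpoint can be fetched with the Hugging Face Hub Python client; this is a rough equivalent of the git commands above:

# Download Janus-Pro-7B via huggingface_hub instead of git-lfs.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/Janus-Pro-7B",
    local_dir="Janus-Pro-7B",
)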

🧪 Supervised Fine-Tuning (Stage 1)

The SFT training data for introspective text-to-image generation is released at https://huggingface.co/datasets/midbee/Janus-Pro-R1-Data.
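
To take a quick look at the data before launching training, you can load it with the datasets library. This is only a sketch: it assumes the dataset loads directly with load_dataset and makes no assumption about its exact fields:

# Peek at the released SFT data (sketch; field names are not assumed here).
from datasets import load_dataset

data = load_dataset("midbee/Janus-Pro-R1-Data")
print(data)                    # available splits and their sizes
first_split = next(iter(data))
print(data[first_split][0])    # first example of the first split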

You can perform SFT for Text-to-Image Generation using the following command:

cd janus-sft
python launch.py --args_yml_fn configs/t2i_generation.yml

Additionally, you can use the following command to perform SFT for Image Editing:

cd janus-sft
python launch.py --args_yml_fn configs/image_editing.yml

For a more detailed introduction to the Supervised Fine-Tuning stage, please see here.

🎯 Reinforcement Learning (Stage 2)

You can perform RL for Text-to-Image Generation using the following command:

cd janus-rl/src/open_r1

export ACCELERATE_CONFIG=../../recipes/accelerate_configs/zero2.yaml
export GRPO_CONFIG=../../recipes/t2i_generation/grpo.yml
export NUM_PROCESSES=8

accelerate launch \
  --config_file $ACCELERATE_CONFIG \
  --num_processes $NUM_PROCESSES \
  grpo_t2i.py \
  --config $GRPO_CONFIG

Additionally, you can use the following command for RL on Image Editing:

cd janus-rl/src/open_r1

export ACCELERATE_CONFIG=../../recipes/accelerate_configs/zero2.yaml
export GRPO_CONFIG=../../recipes/image_editing/grpo.yml
export NUM_PROCESSES=8

accelerate launch \
  --config_file $ACCELERATE_CONFIG \
  --num_processes $NUM_PROCESSES \
  grpo_editing.py \
  --config $GRPO_CONFIG

For a more detailed introduction to the Reinforcement Learning stage, please see here.
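
The GRPO recipes above rely on the bi-level QA-based rewards mentioned in the overview. Purely as an illustration (not the repository's implementation), one level of such a reward can be viewed as the fraction of caption-derived questions that a VQA scorer answers as expected:

# Illustrative QA-based reward sketch; vqa_answer is a hypothetical callable
# that answers a question about an image (e.g., backed by a VQA model).
def qa_reward(image_path, qa_pairs, vqa_answer):
    """qa_pairs: list of (question, expected_answer) pairs derived from the caption.
    Returns the fraction of questions answered as expected, in [0, 1]."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        vqa_answer(image_path, question).strip().lower() == expected.strip().lower()
        for question, expected in qa_pairs
    )
    return correct / len(qa_pairs)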

🎨 Inference

We illustrate the inference process of introspective text-to-image generation in the simplest scenario, where the model performs a single round of image self-evaluation and image regeneration after the initial text-to-image generation.

First, prepare the Janus-Pro-R1-7B model, which uses Janus-Pro-7B as its backbone. You can download it from 🤗https://huggingface.co/midbee/Janus-Pro-R1-7B.

You can run inference with the following command, where --model_path (set to $CKPT_PATH below) is the local path to your downloaded Janus-Pro-R1-7B checkpoint.

  python inference/inference.py \
      --model_path $CKPT_PATH \
      --caption "a brown giraffe and a white stop sign" \
      --gen_path "results/samples" \
      --reason_path "results/reason.jsonl" \
      --regen_path "results/regen_samples" \
      --cfg 5.0 \
      --parallel_size 4

After completing the inference, the structure of the results directory will be as follows:

results/
├── reason.jsonl
├── samples/
│   ├── 0000.png
│   ├── 0001.png
│   ├── 0002.png
│   └── 0003.png
└── regen_samples/
    ├── 0000.png
    ├── 0001.png
    ├── 0002.png
    └── 0003.png

For a more detailed introduction to inference, please see here.
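
If you want to run the same pipeline over several captions, a small hypothetical wrapper around inference.py (reusing the flags shown above) could look like this:

# Hypothetical batch wrapper: run inference.py once per caption,
# writing each caption's outputs to its own results sub-directory.
import subprocess

captions = [
    "a brown giraffe and a white stop sign",
    "a red bicycle leaning against a green bench",
]

for i, caption in enumerate(captions):
    subprocess.run(
        [
            "python", "inference/inference.py",
            "--model_path", "path/to/Janus-Pro-R1-7B",
            "--caption", caption,
            "--gen_path", f"results/{i:04d}/samples",
            "--reason_path", f"results/{i:04d}/reason.jsonl",
            "--regen_path", f"results/{i:04d}/regen_samples",
            "--cfg", "5.0",
            "--parallel_size", "4",
        ],
        check=True,
    )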

📊 Main Results

  • Triggering Aha Moments
  • Text-to-Image Generation
  • Image Editing

🤝 Acknowledgment

Our project is developed based on the following repositories:

  • Janus-Series: Unified Multimodal Understanding and Generation Models
  • Open-R1: Fully open reproduction of DeepSeek-R1

📜 Citation

If you find this work useful for your research, please cite our paper and star our GitHub repo:

@article{pan2025unlocking,
  title={Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation},
  author={Pan, Kaihang and Wu, Yang and Bu, Wendong and Shen, Kai and Li, Juncheng and Wang, Yingting and Li, Yunfei and Tang, Siliang and Xiao, Jun and Wu, Fei and others},
  journal={arXiv preprint arXiv:2506.01480},
  year={2025}
}

@article{pan2025focusdiff,
  title={FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL},
  author={Pan, Kaihang and Bu, Wendong and Wu, Yuruo and Wu, Yang and Shen, Kai and Li, Yunfei and Zhao, Hang and Li, Juncheng and Tang, Siliang and Zhuang, Yueting},
  journal={arXiv preprint arXiv:2506.05501},
  year={2025}
}
