[Project Page] [arXiv] [HuggingFace]
Have you ever wondered what the prerequisite is for a visual model to achieve the impact of GPT in language? The prerequisite should be its ability to generalize zero-shot to various applications, prompts, etc. RandAR is one attempt toward this objective.
🎲 RandAR is a decoder-only AR model generating image tokens in arbitrary orders.
🚀 RandAR supports parallel decoding without additional fine-tuning, bringing a 2.5× inference speedup.
🛠️ RandAR unlocks new capabilities for causal GPT-style transformers: inpainting, outpainting, zero-shot resolution extrapolation, and bi-directional feature encoding.
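As a minimal sketch of the idea (not the repository's actual API; all names, shapes, and the toy model below are illustrative assumptions): random-order generation can be realized by feeding a position instruction token before each image token, so a causal transformer always knows which spatial location to predict next. Querying several position tokens in one forward pass then gives zero-shot parallel decoding.

```python
import random

def random_order_decode(model, num_tokens=256, tokens_per_step=1):
    """Hypothetical random-order AR decoding loop.

    Each step appends a (position instruction, image token) pair to the
    causal sequence. With tokens_per_step > 1, several positions are
    queried at once, sketching zero-shot parallel decoding.
    """
    order = random.sample(range(num_tokens), num_tokens)  # random generation order
    canvas = [-1] * num_tokens                            # -1 = not yet generated
    seq = []                                              # interleaved input sequence
    for start in range(0, num_tokens, tokens_per_step):
        positions = order[start:start + tokens_per_step]
        # one forward pass predicts logits for all queried positions
        logits = model(seq, positions)
        for pos, row in zip(positions, logits):
            tok = max(range(len(row)), key=row.__getitem__)  # greedy sampling
            canvas[pos] = tok
            seq += [("pos", pos), ("tok", tok)]
    return canvas

def toy_model(seq, positions):
    """Stand-in model: uniform logits over a 16-entry codebook."""
    return [[0.0] * 16 for _ in positions]

out = random_order_decode(toy_model, num_tokens=64, tokens_per_step=4)
```

With `tokens_per_step=4`, the 64-token toy canvas is filled in 16 forward passes instead of 64, which is where the parallel-decoding speedup comes from.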
- [12/29/2024] 🎉 We release example checkpoints for RandAR. We are continuing to train more models and will release the code supporting the diverse zero-shot tasks by 1/09/2025.
- [12/09/2024] 🎉 The initial code is released, including the tokenization/modeling/training pipeline. I found that augmentation & tokenization choices different from LLaMAGen's designs give better FID. At the current training speed, I expect to release model checkpoints and verified training/eval scripts before 12/18/2024.
- [12/02/2024] 📋 I am trying my best to re-implement the code and re-train the model as soon as I can. I plan to release the code before 12/09/2024 and the models afterwards. I am going to run my clusters so fiercely that they will warm up the whole of Illinois this winter. 🔥🔥🔥
- [12/02/2024] 🎉 The paper appears on arXiv.
Check out our documentation DOCUMENTATION.md for more details.
We have tried two slight modifications for training RandAR, hoping they would improve performance: (1) tokenizer: MaskGIT's vs. LLaMAGen's; (2) learning rate schedule: cosine vs. linear. Their results differ slightly from the paper's numbers, and we will release checkpoints matching the paper's numbers soon. All checkpoints are available on HuggingFace.
We would like to highlight two observations:
- Using MaskGIT's tokenizer improves the FID, because of its smaller vocabulary.
- Cosine and linear learning rate schedules show no significant performance difference at the 0.7B model size.
Model | Param | Tokenizer | LR Schedule | Optimal CFG | FID | IS | Precision | Recall | Training Finished |
---|---|---|---|---|---|---|---|---|---|
RandAR-L (Paper) | 0.3B | LLaMAGen | Linear | - | 2.55 | 288 | 0.82 | 0.58 | N |
RandAR-L | 0.3B | LLaMAGen | Cosine | 3.4 | 2.65 | 249 | 0.82 | 0.56 | Y |
RandAR-L | 0.3B | MaskGIT | Cosine | 4.0 | 2.47 | 271 | 0.84 | 0.54 | Y |
RandAR-XL (Paper) | 0.7B | LLaMAGen | Linear | - | 2.25 | 318 | 0.80 | 0.60 | N |
RandAR-XL | 0.7B | LLaMAGen | Cosine | 4.0 | 2.27 | 275 | 0.81 | 0.59 | Y |
RandAR-XL | 0.7B | MaskGIT | Cosine | - | - | - | - | - | N |
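The "Optimal CFG" column is, presumably, the classifier-free guidance scale that gave the best FID for each checkpoint. As a minimal sketch of how a CFG scale is applied at sampling time (the function name is an assumption, not this repository's API), the standard combination of conditional and unconditional logits is:

```python
def apply_cfg(cond_logits, uncond_logits, scale):
    """Classifier-free guidance: push logits away from the unconditional
    prediction by `scale`. scale = 1.0 recovers the conditional logits."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

# Example: with scale 4.0, logit 2.0 (cond) vs. 1.0 (uncond) becomes 5.0.
guided = apply_cfg([2.0, 0.0], [1.0, 0.0], 4.0)  # -> [5.0, 0.0]
```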
If you find this work useful in your research, please consider citing:
```bibtex
@article{pang2024randar,
  title={RandAR: Decoder-only Autoregressive Visual Generation in Random Orders},
  author={Pang, Ziqi and Zhang, Tianyuan and Luan, Fujun and Man, Yunze and Tan, Hao and Zhang, Kai and Freeman, William T. and Wang, Yu-Xiong},
  journal={arXiv preprint arXiv:2412.01827},
  year={2024}
}
```
Thank you to the open-source community for their explorations of autoregressive generation, especially LLaMAGen.