[Project Page] [arXiv] [HuggingFace]
Have you ever wondered what the prerequisite is for a visual model to achieve the impact of GPT in language? The prerequisite should be its ability to generalize zero-shot to various applications, prompts, etc. RandAR is one attempt toward this objective.
🎲 RandAR is a decoder-only AR model generating image tokens in arbitrary orders.
🚀 RandAR supports parallel decoding without additional fine-tuning, bringing a 2.5× inference speedup.
🛠️ RandAR unlocks new capabilities for causal GPT-style transformers: inpainting, outpainting, zero-shot resolution extrapolation, and bi-directional feature encoding.
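As a minimal sketch of the idea (not the repository's actual API; all names, shapes, and the toy model below are illustrative assumptions): random-order generation can be realized by feeding a position instruction token before each image token, so a causal transformer always knows which spatial location to predict next. Querying several position tokens in one forward pass then gives zero-shot parallel decoding.

```python
import random

def random_order_decode(model, num_tokens=256, tokens_per_step=1):
    """Hypothetical random-order AR decoding loop.

    Each step appends a (position instruction, image token) pair to the
    causal sequence. With tokens_per_step > 1, several positions are
    queried at once, sketching zero-shot parallel decoding.
    """
    order = random.sample(range(num_tokens), num_tokens)  # random generation order
    canvas = [-1] * num_tokens                            # -1 = not yet generated
    seq = []                                              # interleaved input sequence
    for start in range(0, num_tokens, tokens_per_step):
        positions = order[start:start + tokens_per_step]
        # one forward pass predicts logits for all queried positions
        logits = model(seq, positions)
        for pos, row in zip(positions, logits):
            tok = max(range(len(row)), key=row.__getitem__)  # greedy sampling
            canvas[pos] = tok
            seq += [("pos", pos), ("tok", tok)]
    return canvas

def toy_model(seq, positions):
    """Stand-in model: uniform logits over a 16-entry codebook."""
    return [[0.0] * 16 for _ in positions]

out = random_order_decode(toy_model, num_tokens=64, tokens_per_step=4)
```

With `tokens_per_step=4`, the 64-token toy canvas is filled in 16 forward passes instead of 64, which is where the parallel-decoding speedup comes from.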
- [12/29/2024] 🎉 We release example checkpoints for RandAR. We are continuing to train more models and will release the code supporting the diverse zero-shot tasks by 1/09/2025.
- [12/09/2024] 🎉 The initial code is released, including the tokenization/modeling/training pipeline. I found that augmentation & tokenization choices different from LLaMAGen's designs give better FID. At the current training speed, I expect to release model checkpoints and verified training/eval scripts before 12/18/2024.
- [12/02/2024] 📋 I am trying my best to re-implement the code and re-train the model as soon as I can. I plan to release the code before 12/09/2024 and the models afterwards. I am going to run my clusters so fiercely that they will warm up the whole of Illinois this winter. 🔥🔥🔥
- [12/02/2024] 🎉 The paper appears on arXiv.
Check out our documentation DOCUMENTATION.md for more details.
We have tried two slight modifications for training RandAR, hoping they would improve performance: (1) tokenizer: MaskGIT's vs. LLaMAGen's; (2) learning rate schedule: cosine vs. linear. Their results differ slightly from the paper's numbers, and we will release checkpoints matching the paper's numbers soon. All checkpoints are available on HuggingFace.
We would like to highlight two observations:
- Using MaskGIT's tokenizer improves the FID, because of its smaller vocabulary.
- Cosine and linear learning rate schedules show no significant performance difference at the 0.7B model size.
Model | Param | Tokenizer | LR Schedule | Optimal CFG | FID | IS | Precision | Recall | Training Finished |
---|---|---|---|---|---|---|---|---|---|
RandAR-L (Paper) | 0.3B | LLaMAGen | Linear | - | 2.55 | 288 | 0.82 | 0.58 | N |
RandAR-L | 0.3B | LLaMAGen | Cosine | 3.4 | 2.65 | 249 | 0.82 | 0.56 | Y |
RandAR-L | 0.3B | MaskGIT | Cosine | 4.0 | 2.47 | 271 | 0.84 | 0.54 | Y |
RandAR-XL (Paper) | 0.7B | LLaMAGen | Linear | - | 2.25 | 318 | 0.80 | 0.60 | N |
RandAR-XL | 0.7B | LLaMAGen | Cosine | 4.0 | 2.27 | 275 | 0.81 | 0.59 | Y |
RandAR-XL | 0.7B | MaskGIT | Cosine | - | - | - | - | - | N |
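The "Optimal CFG" column is, presumably, the classifier-free guidance scale that gave the best FID for each checkpoint. As a minimal sketch of how a CFG scale is applied at sampling time (the function name is an assumption, not this repository's API), the standard combination of conditional and unconditional logits is:

```python
def apply_cfg(cond_logits, uncond_logits, scale):
    """Classifier-free guidance: push logits away from the unconditional
    prediction by `scale`. scale = 1.0 recovers the conditional logits."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

# Example: with scale 4.0, logit 2.0 (cond) vs. 1.0 (uncond) becomes 5.0.
guided = apply_cfg([2.0, 0.0], [1.0, 0.0], 4.0)  # -> [5.0, 0.0]
```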
If you find this work useful in your research, please consider citing:
```bibtex
@article{pang2024randar,
  title={RandAR: Decoder-only Autoregressive Visual Generation in Random Orders},
  author={Pang, Ziqi and Zhang, Tianyuan and Luan, Fujun and Man, Yunze and Tan, Hao and Zhang, Kai and Freeman, William T. and Wang, Yu-Xiong},
  journal={arXiv preprint arXiv:2412.01827},
  year={2024}
}
```
Thank you to the open-source community for their explorations of autoregressive generation, especially LLaMAGen.