[Read the Paper] | [Project Page] | [Huggingface Tokenizer Demo] | [Huggingface Generation Demo]
Xin Wen1*,
Bingchen Zhao2*,
Ismail Elezi3,
Jiankang Deng4,
Xiaojuan Qi1
* Equal Contribution
1 University of Hong Kong |
2 University of Edinburgh |
3 Noah's Ark Lab |
4 Imperial College London
Deep generative models have revolutionized image synthesis, but how we tokenize visual data remains an open question. While classical methods like Principal Component Analysis (PCA) introduced compact, structured representations, modern visual tokenizers—from VQ-VAE to SD-VAE—often prioritize reconstruction fidelity at the cost of interpretability and efficiency.
- Lack of Structure: Tokens are arbitrarily learned, without an ordering that prioritizes important visual features first.
- Semantic-Spectrum Coupling: Tokens entangle high-level semantics with low-level spectral details, leading to inefficiencies in downstream applications.
Can we design a compact, structured tokenizer that retains the benefits of PCA while leveraging modern generative techniques?
- 📌 PCA-Guided Tokenization: Introduces a causal ordering where earlier tokens capture the most important visual details, reducing redundancy.
- ⚡ Semantic-Spectrum Decoupling: Resolves the issue of semantic-spectrum coupling to ensure tokens focus on high-level semantic information.
- 🌀 Diffusion-Based Decoding: Uses a diffusion decoder for the spectral auto-regressive property to naturally separate semantic and spectral content.
- 🚀 Compact & Interpretability-Friendly: Enables flexible token selection, where fewer tokens can still yield high-quality reconstructions.
For more details, please refer to our project page - semanticist.github.io.
First please makesure pytorch is installed (we used 2.5.1 but we expect any version >= 2.0 to work).
Then install the rest of the dependencies.
pip install -r requirements.txt
Please then download ImageNet and soft-link it to ./dataset/imagenet
. For evaluating FID, it is recommended to pre-process the validation set of ImageNet with this script. The target folder is ./dataset/imagenet/val256
in our case.
Our codebase supports DDP training with accelerate, torchrun, and submitit (for slurm users). To train a Semanticist tokenizer with DiT-L tokenizer on 8 GPUs, you can run
accelerate launch --config_file=configs/onenode_config.yaml train_net.py --cfg configs/tokenizer_l.yaml
or
torchrun --nproc-per-node=8 train_net.py --cfg configs/tokenizer_l.yaml
or
python submitit_train.py --ngpus=8 --nodes=1 --partition=xxx --config configs/tokenizer_l.yaml
We used a global batch size of 2048 and thus the effective batch size per GPU is 256 in this case. Your may modify the batch size and gradient accumulation steps in the config file accrrding to your training resources.
To train a ϵLlamaGen autoregressive model with a tokenizer trained as above, you can run the following command. Remember to change the path to the tokenizer in the config file. The EMA model is custom_checkpoint_1.pkl
under the output folder.
accelerate launch --config_file=configs/onenode_config.yaml train_net.py --cfg configs/autoregressive_l.yaml
Note that caching is enabled by default and it takes around 400GB memory (dumped to /dev/shm
) for ten_crop augmentation on ImageNet. If you want to disable it, you can set enable_cache_latents
to False in the config file and/or specify a different data augmentation method (e.g., centercrop_cached, centercrop, randcrop).
By default, when evaluating online we do not use the EMA model. Thus to obtain the final performance, you are suggested to perform a separate evaluation after training. Like above, our scripts are compatible with accelerate, torchrun, and submitit.
accelerate launch --config_file=configs/onenode_config.yaml eval_net.py --model ./output/tokenizer/models_l --step 250000 --cfg_value 3.0 --test_num_slots 256
or
torchrun --nproc-per-node=8 eval_net.py --model ./output/tokenizer/models_l --step 250000 --cfg_value 3.0 --test_num_slots 256
or
python submitit_eval.py --ngpus=8 --nodes=1 --partition=xxx --model ./output/tokenizer/models_l --step 250000 --cfg_value 3.0 --test_num_slots 256
And for the AR model:
torchrun --nproc-per-node=8 eval_net.py --model ./output/autoregressive/models_l --step 250000 --cfg_value 6.0 --ae_cfg 1.0 --test_num_slots 256
If enable_ema
is set to True, the EMA model will be loaded automatically. You can adjust the number of GPUs flexibly. You can also specify multiple arguments in the command line to perform a grid search.
Please refer to our demo pages on Huggingface for the tokenizer and the AR model.
It's possible that this code may not accurately replicate the results outlined in the paper due to potential human errors during the preparation and cleaning of the code for release. If you encounter any difficulties in reproducing our findings, please don't hesitate to inform us. Additionally, we'll make an effort to refine the README and code, and carry out sanity-check experiments in the near future.
Our codebase builds upon several existing publicly available codes. Specifically, we have modified or taken inspiration from the following repos: DiT, SiT, DiffAE, LlamaGen, RCG, MAR, REPA, etc. We thank the authors for their contributions to the community.
If you find this work useful in your research, please consider citing us!
@article{semanticist,
title={``Principal Components'' Enable A New Language of Images},
author={Wen, Xin and Zhao, Bingchen and Elezi, Ismail and Deng, Jiankang and Qi, Xiaojuan},
journal={arXiv preprint arXiv:2503.08685},
year={2025}
}