"Principal Components" Enable A New Language of Images

A New Paradigm for Compact and Interpretable Image Representations

[Read the Paper] | [Project Page] | [Huggingface Tokenizer Demo] | [Huggingface Generation Demo]

Xin Wen¹*, Bingchen Zhao²*, Ismail Elezi³, Jiankang Deng⁴, Xiaojuan Qi¹
* Equal Contribution
¹ University of Hong Kong | ² University of Edinburgh | ³ Noah's Ark Lab | ⁴ Imperial College London

Introduction & Motivation

Deep generative models have revolutionized image synthesis, but how we tokenize visual data remains an open question. While classical methods like Principal Component Analysis (PCA) introduced compact, structured representations, modern visual tokenizers—from VQ-VAE to SD-VAE—often prioritize reconstruction fidelity at the cost of interpretability and efficiency.

The Problem

Lack of Structure: Tokens are arbitrarily learned, without an ordering that prioritizes important visual features first.
Semantic-Spectrum Coupling: Tokens entangle high-level semantics with low-level spectral details, leading to inefficiencies in downstream applications.

Can we design a compact, structured tokenizer that retains the benefits of PCA while leveraging modern generative techniques?

Key Contributions (What's New?)

📌 PCA-Guided Tokenization: Introduces a causal ordering where earlier tokens capture the most important visual details, reducing redundancy.
⚡ Semantic-Spectrum Decoupling: Resolves the issue of semantic-spectrum coupling to ensure tokens focus on high-level semantic information.
🌀 Diffusion-Based Decoding: Uses a diffusion decoder, whose spectral auto-regressive property naturally separates semantic and spectral content.
🚀 Compact & Interpretability-Friendly: Enables flexible token selection, where fewer tokens can still yield high-quality reconstructions.
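The ordering idea behind these contributions can be illustrated with classical PCA. The sketch below is not the Semanticist tokenizer; it only assumes NumPy and a toy dataset, and shows that keeping the first k principal components gives a reconstruction error that never increases as k grows. The function name and the toy data are illustrative assumptions.

```python
import numpy as np

def pca_reconstruction_errors(data, ks):
    """Mean squared reconstruction error when keeping the first k principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    errors = []
    for k in ks:
        basis = vt[:k]                      # shape (k, d)
        recon = centered @ basis.T @ basis  # project onto the first k directions and back
        errors.append(float(np.mean((centered - recon) ** 2)))
    return errors

# Toy data (assumption): 200 samples in 16 dimensions with decaying variance.
rng = np.random.default_rng(0)
scales = 1.0 / np.arange(1, 17)
data = rng.normal(size=(200, 16)) * scales

errors = pca_reconstruction_errors(data, ks=[1, 2, 4, 8, 16])
for k, err in zip([1, 2, 4, 8, 16], errors):
    print(f"k={k:2d}  mse={err:.6f}")
```

In this toy setting, most of the error is removed by the first few components, which is the property the tokenizer aims to reproduce with learned tokens.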

For more details, please refer to our project page - semanticist.github.io.

Getting Started

Preparation

First, please make sure PyTorch is installed (we used 2.5.1, but we expect any version >= 2.0 to work).
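A quick way to check the version requirement is sketched below. The helper `meets_minimum` is a hypothetical name, not part of this repository; it only parses the major and minor numbers of a version string such as the one returned by `torch.__version__`.

```python
import re

def meets_minimum(version, minimum=(2, 0)):
    """Return True if a version string such as '2.5.1+cu121' is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum

# Usage (requires PyTorch to be installed):
#   import torch
#   assert meets_minimum(torch.__version__), "PyTorch >= 2.0 is expected"
```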

Then install the rest of the dependencies.

pip install -r requirements.txt

Please then download ImageNet and soft-link it to ./dataset/imagenet. For evaluating FID, it is recommended to pre-process the validation set of ImageNet with this script. The target folder is ./dataset/imagenet/val256 in our case.
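The soft link can be created with `ln -s` or, equivalently, from Python. The following is a minimal sketch; `link_dataset` is a hypothetical helper, and the source folder name `ILSVRC2012` is only a placeholder for wherever ImageNet was downloaded. The demo runs inside a temporary directory so that it does not touch a real dataset.

```python
import os
import tempfile

def link_dataset(source_dir, link_path="./dataset/imagenet"):
    """Create a symbolic link at link_path that points to source_dir."""
    source_dir = os.path.abspath(source_dir)
    if not os.path.isdir(source_dir):
        raise FileNotFoundError(source_dir)
    os.makedirs(os.path.dirname(os.path.abspath(link_path)), exist_ok=True)
    if os.path.islink(link_path):
        os.remove(link_path)  # replace a stale link
    os.symlink(source_dir, link_path, target_is_directory=True)
    return link_path

# Demo in a temporary directory (placeholder paths, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "ILSVRC2012")
    os.makedirs(src)
    link = link_dataset(src, os.path.join(tmp, "dataset", "imagenet"))
    linked_ok = os.path.realpath(link) == os.path.realpath(src)
    print(linked_ok)
```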

Training

Our codebase supports DDP training with accelerate, torchrun, and submitit (for Slurm users). To train a Semanticist tokenizer with a DiT-L decoder on 8 GPUs, you can run

accelerate launch --config_file=configs/onenode_config.yaml train_net.py --cfg configs/tokenizer_l.yaml

or

torchrun --nproc-per-node=8 train_net.py --cfg configs/tokenizer_l.yaml

or

python submitit_train.py --ngpus=8 --nodes=1 --partition=xxx --config configs/tokenizer_l.yaml

We used a global batch size of 2048, so the effective batch size per GPU is 256 in this case. You may modify the batch size and gradient accumulation steps in the config file according to your training resources.
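When adapting these settings to fewer GPUs, the arithmetic can be sketched as follows. This assumes the usual relation global batch = per-GPU batch × number of GPUs × gradient accumulation steps; the function and argument names are illustrative and are not the config keys of this repository.

```python
def per_gpu_batch_size(global_batch, num_gpus, grad_accum_steps=1):
    """Per-GPU, per-step batch size for a given global batch size."""
    denom = num_gpus * grad_accum_steps
    if global_batch % denom != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"{num_gpus} GPUs x {grad_accum_steps} accumulation steps"
        )
    return global_batch // denom

print(per_gpu_batch_size(2048, 8))     # 256, the 8-GPU setting described above
print(per_gpu_batch_size(2048, 4, 2))  # 256 per step on 4 GPUs with 2 accumulation steps
```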

To train a ϵLlamaGen autoregressive model with a tokenizer trained as above, you can run the following command. Remember to change the path to the tokenizer in the config file. The EMA model is custom_checkpoint_1.pkl under the output folder.

accelerate launch --config_file=configs/onenode_config.yaml train_net.py --cfg configs/autoregressive_l.yaml

Note that caching is enabled by default and it takes around 400GB memory (dumped to /dev/shm) for ten_crop augmentation on ImageNet. If you want to disable it, you can set enable_cache_latents to False in the config file and/or specify a different data augmentation method (e.g., centercrop_cached, centercrop, randcrop).
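Before enabling the cache, it may help to check that the target filesystem has enough room. The sketch below uses only the standard library; `has_free_space` is a hypothetical helper, and the 400 GB figure is taken from the note above.

```python
import shutil

def has_free_space(path, required_gb):
    """Return True if the filesystem containing `path` has at least required_gb GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9  # decimal gigabytes

# Usage on a Linux machine where /dev/shm exists:
#   if not has_free_space("/dev/shm", 400):
#       print("Not enough shared memory; consider setting enable_cache_latents to False.")
```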

Evaluation

By default, online evaluation during training does not use the EMA model. To obtain the final performance, we therefore suggest running a separate evaluation after training. As above, our scripts are compatible with accelerate, torchrun, and submitit.

accelerate launch --config_file=configs/onenode_config.yaml eval_net.py --model ./output/tokenizer/models_l --step 250000 --cfg_value 3.0 --test_num_slots 256

or

torchrun --nproc-per-node=8 eval_net.py --model ./output/tokenizer/models_l --step 250000 --cfg_value 3.0 --test_num_slots 256

or

python submitit_eval.py --ngpus=8 --nodes=1 --partition=xxx --model ./output/tokenizer/models_l --step 250000 --cfg_value 3.0 --test_num_slots 256

And for the AR model:

torchrun --nproc-per-node=8 eval_net.py --model ./output/autoregressive/models_l --step 250000 --cfg_value 6.0 --ae_cfg 1.0 --test_num_slots 256

If enable_ema is set to True, the EMA model will be loaded automatically. You can adjust the number of GPUs flexibly. You can also specify multiple arguments in the command line to perform a grid search.
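A grid search over several argument values amounts to evaluating every combination of them. How the evaluation script parses multiple values is not shown here, so the following is only a sketch of the Cartesian product it implies; `expand_grid` is a hypothetical helper, and the values 1.0 and 32 are illustrative.

```python
import itertools

def expand_grid(**options):
    """Return one dict per combination of the given option values."""
    keys = list(options)
    combos = itertools.product(*(options[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]

grid = expand_grid(cfg_value=[1.0, 3.0], test_num_slots=[32, 256])
for setting in grid:
    print(setting)
```

With two values for each of the two arguments, this yields four settings to evaluate.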

Demos

Please refer to our demo pages on Huggingface for the tokenizer and the AR model.

Note

It's possible that this code may not accurately replicate the results outlined in the paper due to potential human errors during the preparation and cleaning of the code for release. If you encounter any difficulties in reproducing our findings, please don't hesitate to inform us. Additionally, we'll make an effort to refine the README and code, and carry out sanity-check experiments in the near future.

Acknowledgements

Our codebase builds upon several publicly available codebases. Specifically, we have modified or taken inspiration from the following repos: DiT, SiT, DiffAE, LlamaGen, RCG, MAR, REPA, etc. We thank the authors for their contributions to the community.

Citation

If you find this work useful in your research, please consider citing us!

@article{semanticist,
    title={``Principal Components'' Enable A New Language of Images},
    author={Wen, Xin and Zhao, Bingchen and Elezi, Ismail and Deng, Jiankang and Qi, Xiaojuan},
    journal={arXiv preprint arXiv:2503.08685},
    year={2025}
}
