VisionTransformer

This repository attempted to reproduce the ViT from the COYO-Labeled-300M dataset.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
COYO-700M: Image-Text Pair Dataset
COYO-Labeled-300M: Image-labeled Dataset

The model was pre-trained on the labeled COYO-Labeled-300M dataset, which is the largest number of published classification ViT.

We provide the code for pretraining and finetuning in Tensorflow2.

We will also work with HuggingFace to provide the weights file and make it usable in pytorch and jax through the HuggingFace platform as well.

Training

We have trained and evaluated using tpu-v3 with bfloat16.
The pretraining weight we provide is last_checkpoint trained with COYO-Labeled-300M.
The finetuing weights we provide are best_checkpoint trained with Imagenet.
We used the hyperparameter search below to explore the best_weight files in finetuing.
```
learning_rate:  [0.06, 0.03, 0.01]
steps:          [20_000, 40_000]
```
We provide a weight file trained in bfloat16 and we have confirmed that there is a performance change when evaluating with float32. (But imagenet-real was evaluated in float32)

The code in this repository can be reproduced on gpu as well as tpu.

# configs/trainer.yaml
---
runtime:
  strategy: 'tpu' # one of ['cpu', 'tpu', 'gpu', 'gpu_multinode', 'gpu_multinode_async']
  use_mixed_precision: true
  tpu:
    version: 2.8.0
    name: ???
    zone: 'europe-west4-a'
    type: 'v3-32'
---
change to
---
runtime:
  strategy: 'gpu'
  use_mixed_precision: true
---

To train, you need to set the path to your dataset in here.

# configs/dataset/coyo300m.yaml
train:
  cache: false
  supervised_key: 'labels'
  builder:
    - tfds_name: null
      tfds_data_dir: {your dir}
      tfds_split: 'train'

validation:
  cache: false
  supervised_key: 'labels'
  builder:
    - tfds_name: null
      tfds_data_dir: {your dir}
      tfds_split: 'validation[:50000]' # We performed validation as part of the Imagenet21k dataset. Or you can use subset of COYO-Labeled-300M

Results

Model	Upstream Dataset	Resolution	ImageNet (downstream)	ImageNet-ReaL (downstream)	Public
ViT-L/16	JFT-300M	512	87.76	90.54	X
ViT-L/16	COYO-Labeled-300M	512	87.24 (-0.52)	90.03 (-0.51)	O
ViT-L/16	JFT-300M	384	87.12	89.99	X
ViT-L/16	COYO-Labeled-300M	384	86.72 (-0.40)	89.84 (-0.15)	O

Checkpoints

Model	Upstream Dataset	Downstream Dataset	Resolution	link
ViT-L/16	COYO-Labeled-300M	-	224	link
ViT-L/16	COYO-Labeled-300M	ImageNet	384	link
ViT-L/16	COYO-Labeled-300M	ImageNet	512	link

Requirements

We have tested our codes on the environment below
python==3.7.3 / tensorflow==2.8.0 / tensorflow-datasets==4.5.0
Please run the following command to install the necessary dependencies
```
pip install -r requirements.txt
```

Commands

We have used hydra to manage the configuration. For detailed usage, see here.

Pretraining

python3 -m trainer trainer=vit_l16_coyo300m \
runtime.tpu.name={your_tpu_name} \
runtime.tpu.type={your_tpu_type} \
experiment.debug=false experiment.save_dir={your_save_dir}

Finetuning

python3 -m trainer trainer=vit_l16_i1k_downstream \
  runtime.tpu.name={your_tpu_name} \
  runtime.tpu.type={your_tpu_type} \
  experiment.debug=false \
  experiment.save_dir={your_save_dir} \
  trainer.backbone.pretrained={your_pretrained_weight}

Also, you can experiment by changing the configuration as follows.

python3 -m trainer trainer=vit_l16_i1k_downstream \
  runtime.tpu.name={your_tpu_name} \
  runtime.tpu.type={your_tpu_type} \
  experiment.debug=false experiment.save_dir={your_save_dir} \
  trainer.backbone.pretrained={your_pretrained_weight} \
  trainer.epochs=16 \
  trainer.learning_rate.base_lr=3e-2

Evaluation

python3 -m trainer trainer=vit_l16_i1k_downstream \
  runtime.tpu.name={your_tpu_name} \
  runtime.tpu.type={your_tpu_type} \
  experiment.debug=false \
  experiment.save_dir={your_weight_path} \
  experiment.mode='eval'

Citation

@misc{kakaobrain2022coyo-vit,
  title         = {COYO-ViT},
  author        = {Lee, Sungjun and Park, Beomhee},
  year          = {2022},
  howpublished  = {\url{https://github.com/kakaobrain/coyo-vit}},
}

@misc{kakaobrain2022coyo-700m,
  title         = {COYO-700M: Image-Text Pair Dataset},
  author        = {Byeon, Minwoo and Park, Beomhee and Kim, Haecheon and Lee, Sungjun and Baek, Woonhyuk and Kim, Saehoon},
  year          = {2022},
  howpublished  = {\url{https://github.com/kakaobrain/coyo-dataset}},
}

@misc{dosovitskiy2020image,
    title   = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author  = {Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
    year    = {2020},
    eprint  = {2010.11929},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

People

Sungjun Lee (@justhungryman)
Beomhee Park (@beomheepark)

Contact

This is released as an open source in the hope that it will be helpful to many research institutes and startups for research purposes.

jun.untitled@kakaobrain.com

License

The source codes are licensed under Apache 2.0 License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

VisionTransformer

Training

Results

Checkpoints

Requirements

Commands

Pretraining

Finetuning

Evaluation

Citation

People

Contact

License

Files

README.md

Latest commit

History

README.md

File metadata and controls

VisionTransformer

Training

Results

Checkpoints

Requirements

Commands

Pretraining

Finetuning

Evaluation

Citation

People

Contact

License