Code for the paper Video Occupancy Models. The repo includes three versions of quantizing the input video frames: vae, which uses a VQ-VAE; dino, which uses quantized DINO; and musik, which uses a quantized multi-step inverse kinematics (MUSIK) objective.
This is a PyTorch/GPU implementation of the paper Video Occupancy Models:
```
@Article{VideoOccupancyModels2024,
  author  = {Manan Tomar and Philippe Hansen-Estruch and Philip Bachman and Alex Lamb and John Langford and Matthew E. Taylor and Sergey Levine},
  journal = {arXiv:2407.09533},
  title   = {Video Occupancy Models},
  year    = {2024},
}
```
The main packages are listed in the requirements.txt file. This code has been tested in a Python 3.8 virtual environment with the package versions listed in that file.
The following table provides the pre-trained model checkpoints and datasets used in the paper:
| | Cheetah | Walker |
| --- | --- | --- |
| VQ-VAE fine-tuned model checkpoint | download | download |
| DINO latent datasets | link | |
| VQ-VAE latent datasets | link | link |
You will need to download the contents of this folder and place them one directory above where this repo is located. The folder contains model descriptions for using a VQ-VAE model from the taming-transformers codebase.
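For reference, here is a minimal sketch of how a taming-transformers VQ-VAE can be instantiated from a config and checkpoint. The file paths are placeholders for the downloaded folder's contents, and the exact config layout may differ:

```python
import torch
from omegaconf import OmegaConf
from taming.models.vqgan import VQModel

# Placeholder paths -- point these at the downloaded folder one level
# above this repo; the actual file names may differ.
config = OmegaConf.load("../vqgan_configs/model.yaml")
model = VQModel(**config.model.params)

# taming-transformers checkpoints store weights under "state_dict"
state = torch.load("../vqgan_configs/model.ckpt", map_location="cpu")
model.load_state_dict(state["state_dict"], strict=False)
model.eval()
```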
Run train_vq_vae_voc.py to train a VOC model on stored VQ-VAE latents. If you want to train both the VQ-VAE and the VOC model on pixel data, run train_pixel_vq_vae_voc.py. If you want to create your own latents by training a VQ-VAE on a custom dataset, use the collect_latents() and train_vq_latents() methods in save_vq_codes.py, sketched below.
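Conceptually, collecting latents amounts to encoding every frame to discrete code indices and saving them. This is only a hedged sketch of what collect_latents() might look like, assuming a taming-transformers style encode() that returns the code indices in its info tuple; the real method in save_vq_codes.py may differ:

```python
import torch

@torch.no_grad()
def collect_latents(vq_vae, frame_loader, out_path, device="cuda"):
    """Sketch: encode frame batches to discrete VQ code indices and save them.

    Assumes a taming-transformers style VQModel whose encode() returns
    (quantized, loss, (perplexity, one_hots, indices)).
    """
    vq_vae.eval().to(device)
    all_codes = []
    for frames in frame_loader:                 # frames: (B, C, H, W)
        _, _, (_, _, indices) = vq_vae.encode(frames.to(device))
        all_codes.append(indices.reshape(frames.shape[0], -1).cpu())
    torch.save(torch.cat(all_codes), out_path)  # (N, tokens_per_frame)
```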
We use a quantized version of DINO from BEiT-v2. You will need to download this DINO model file and place it one directory above where this repo is located.
Run train_vq_dino_voc.py to train a VOC model on stored DINO latents. Again, if you want to create your own latents by running a quantized version of DINO on a custom dataset, use the collect_latents() method in save_dino_codes.py, sketched below.
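The DINO path follows the same pattern. A hedged sketch, assuming a BEiT-v2 style visual tokenizer that exposes a get_codebook_indices() method (the actual entry point is collect_latents() in save_dino_codes.py, and the tokenizer interface may differ):

```python
import torch

@torch.no_grad()
def collect_latents(tokenizer, frame_loader, out_path, device="cuda"):
    """Sketch: map frames to discrete codes with a quantized DINO tokenizer.

    get_codebook_indices() is the BEiT-v2 tokenizer interface assumed here;
    the method name on your checkpoint may differ.
    """
    tokenizer.eval().to(device)
    all_codes = []
    for frames in frame_loader:                          # (B, C, H, W)
        codes = tokenizer.get_codebook_indices(frames.to(device))
        all_codes.append(codes.cpu())
    torch.save(torch.cat(all_codes), out_path)
```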
In case action data is also available, we use a quantized multi-step inverse kinematics (MUSIK) objective to train the representation.
Run train_vq_musik_voc.py to train a VOC model along with the MUSIK objective on pixel data.
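For intuition, a multi-step inverse objective predicts the first action from embeddings of an observation pair k steps apart. Below is a minimal sketch under that assumption; the encoder phi, action_head, and loss choice are illustrative names, not the repo's exact implementation:

```python
import torch
import torch.nn.functional as F

def musik_loss(phi, action_head, obs_t, obs_tk, action_t):
    """Sketch: multi-step inverse kinematics (MUSIK) loss.

    Predicts the first action a_t from embeddings of o_t and o_{t+k};
    phi and action_head are hypothetical module names.
    """
    z = torch.cat([phi(obs_t), phi(obs_tk)], dim=-1)
    logits = action_head(z)                   # discrete action logits
    return F.cross_entropy(logits, action_t)  # use mse_loss for continuous actions
```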