
# Video Occupancy Models

Code for the paper *Video Occupancy Models*. The repo includes three ways of quantizing the input video frames: `vae`, which uses a VQ-VAE; `dino`, which uses quantized DINO; and `musik`, which uses quantized multi-step inverse dynamics.


This is a PyTorch/GPU implementation of the paper Video Occupancy Models:

```bibtex
@Article{VideoOccupancyModels2024,
  author  = {Manan Tomar and Philippe Hansen-Estruch and Philip Bachman and Alex Lamb and John Langford and Matthew E. Taylor and Sergey Levine},
  journal = {arXiv:2407.09533},
  title   = {Video Occupancy Models},
  year    = {2024},
}
```

## Installation

The main packages are listed in the requirements.txt file. This code has been tested in a virtual environment running Python 3.8 with the package versions listed in the requirements file.
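For example, a typical setup might look like the following (the environment name is arbitrary):

```bash
# Assumes Python 3.8 is installed; "voc-env" is just an illustrative name
python3.8 -m venv voc-env
source voc-env/bin/activate
pip install -r requirements.txt
```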

## Model Checkpoints and Datasets

The following table provides the pre-trained model checkpoints and datasets used in the paper:

|  | Cheetah | Walker |
| --- | --- | --- |
| VQ-VAE fine-tuned model checkpoint | download | download |
| DINO latent datasets | link | |
| VQ-VAE latent datasets | link | link |

## VQ-VAE VOC

You will need to download the contents of this folder and place them one directory above where this repo is located. The folder contains model descriptions for using a VQ-VAE model from the taming-transformers codebase.
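For concreteness, the expected layout is something like this (the folder and repo directory names below are illustrative, not the exact names used by the code):

```
parent_directory/
├── vq_vae_models/             # downloaded folder with taming-transformers model descriptions
└── video-occupancy-models/    # this repository
```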

Run train_vq_vae_voc.py to train a VOC model on stored VQ-VAE latents. To train both the VQ-VAE and the VOC model on pixel data, run train_pixel_vq_vae_voc.py instead (both invocations are sketched below). To create your own latents by training a VQ-VAE on a custom dataset, use the collect_latents() and train_vq_latents() methods in save_vq_codes.py.
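A minimal invocation, assuming each script's built-in defaults (check the scripts themselves for dataset-path and checkpoint arguments):

```bash
# Train a VOC model on pre-extracted VQ-VAE latents
python train_vq_vae_voc.py

# Train the VQ-VAE and the VOC model together, directly from pixel data
python train_pixel_vq_vae_voc.py
```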

## DINO VOC

We use a quantized version of DINO from BEiT-v2. You will need to download this dino model file and place it one directory above where this repo is located.

Run train_vq_dino_voc.py to train a VOC model on stored DINO latents (see the sketch below). Again, if you want to create your own latents by running a quantized version of DINO on a custom dataset, use the collect_latents() method in save_dino_codes.py.
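A minimal invocation, again assuming the script's default arguments:

```bash
# Train a VOC model on stored quantized-DINO latents
python train_vq_dino_voc.py
```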

## MUSIK VOC

When action data is also available, we use a quantized multi-step inverse kinematics (MUSIK) objective to train the representation.

Run train_vq_musik_voc.py to train a VOC model along with the MUSIK objective on pixel data.
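A minimal invocation, assuming the script's default arguments:

```bash
# Train a VOC model with the MUSIK objective on pixel (and action) data
python train_vq_musik_voc.py
```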