
# Video Occupancy Models

Code for the paper *Video Occupancy Models*. The repo includes three ways of quantizing the input video frames: `vae`, which uses a VQ-VAE; `dino`, which uses quantized DINO; and `musik`, which uses quantized multi-step inverse dynamics.


This is a PyTorch/GPU implementation of the paper Video Occupancy Models:

```bibtex
@Article{VideoOccupancyModels2024,
  author  = {Manan Tomar and Philippe Hansen-Estruch and Philip Bachman and Alex Lamb and John Langford and Matthew E. Taylor and Sergey Levine},
  journal = {arXiv:2407.09533},
  title   = {Video Occupancy Models},
  year    = {2024},
}
```

## Installation

The main packages are listed in the requirements.txt file. This code has been tested in a virtual environment running Python 3.8 with the package versions listed in the requirements file.
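For example, a typical setup might look like the following (the environment name is arbitrary):

```bash
# Assumes Python 3.8 is installed; "voc-env" is just an illustrative name
python3.8 -m venv voc-env
source voc-env/bin/activate
pip install -r requirements.txt
```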

## Model Checkpoints and Datasets

The following table provides the pre-trained model checkpoints and datasets used in the paper:

|  | Cheetah | Walker |
| --- | --- | --- |
| VQ-VAE fine-tuned model checkpoint | download | download |
| DINO latent datasets | link | |
| VQ-VAE latent datasets | link | link |

## VQ-VAE VOC

You will need to download the contents of this folder and place them one directory above where this repo is located. The folder contains model descriptions for using a VQ-VAE model from the taming-transformers codebase.
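For concreteness, the expected layout is something like this (the folder and repo directory names below are illustrative, not the exact names used by the code):

```
parent_directory/
├── vq_vae_models/             # downloaded folder with taming-transformers model descriptions
└── video-occupancy-models/    # this repository
```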

Run train_vq_vae_voc.py to train a VOC model on stored VQ-VAE latents. To train both the VQ-VAE and the VOC model on pixel data, run train_pixel_vq_vae_voc.py instead (both invocations are sketched below). To create your own latents by training a VQ-VAE on a custom dataset, use the collect_latents() and train_vq_latents() methods in save_vq_codes.py.
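A minimal invocation, assuming each script's built-in defaults (check the scripts themselves for dataset-path and checkpoint arguments):

```bash
# Train a VOC model on pre-extracted VQ-VAE latents
python train_vq_vae_voc.py

# Train the VQ-VAE and the VOC model together, directly from pixel data
python train_pixel_vq_vae_voc.py
```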

## DINO VOC

We use a quantized version of DINO from BEiT-v2. You will need to download this dino model file and place it one directory above where this repo is located.

Run train_vq_dino_voc.py to train a VOC model on stored DINO latents (see the sketch below). Again, if you want to create your own latents by running a quantized version of DINO on a custom dataset, use the collect_latents() method in save_dino_codes.py.
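A minimal invocation, again assuming the script's default arguments:

```bash
# Train a VOC model on stored quantized-DINO latents
python train_vq_dino_voc.py
```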

## MUSIK VOC

When action data is also available, we use a quantized multi-step inverse kinematics (MUSIK) objective to train the representation.

Run train_vq_musik_voc.py to train a VOC model along with the MUSIK objective on pixel data.
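A minimal invocation, assuming the script's default arguments:

```bash
# Train a VOC model with the MUSIK objective on pixel (and action) data
python train_vq_musik_voc.py
```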