This repository contains code for the VideoGemma multimodal language model.
VideoGemma combines the LanguageBind video encoder with the performant and flexible Gemma LLM in a LLaVA-style architecture.
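A minimal sketch of the LLaVA-style wiring is shown below: video-encoder features pass through a projector into the LLM embedding space and are concatenated with the text embeddings. The class name, layer sizes, and token counts are illustrative assumptions, not the repository's actual modules or configuration.

```python
import torch
import torch.nn as nn

class VideoProjector(nn.Module):
    """Two-layer MLP mapping video-encoder features into the LLM embedding space.
    Dimensions are illustrative placeholders, not the repository's actual config."""
    def __init__(self, video_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(video_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_features: torch.Tensor) -> torch.Tensor:
        # video_features: (batch, num_video_tokens, video_dim) from the video encoder
        return self.net(video_features)

# LLaVA-style fusion: projected video tokens are prepended to the text embeddings,
# and the combined sequence is fed to the LLM as ordinary input embeddings.
projector = VideoProjector()
video_features = torch.randn(1, 256, 1024)  # stand-in for LanguageBind output
text_embeds = torch.randn(1, 32, 2048)      # stand-in for Gemma token embeddings
llm_inputs = torch.cat([projector(video_features), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 288, 2048])
```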
We recommend using Dev Containers to create the environment.
- Install PyTorch.
- Install Python dependencies:

  ```bash
  pip3 install -r requirements.txt
  pip3 install git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
  ```

- For checkpoint loading and model configuration, see `run_finetune.ipynb`.
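The authoritative loading and configuration code lives in `run_finetune.ipynb`; the snippet below is only a generic sketch of restoring weights from a saved `state_dict`. The projector module and file path are hypothetical placeholders.

```python
import torch
from torch import nn

# Placeholder projector module, not the repository's actual class.
projector = nn.Sequential(nn.Linear(1024, 2048), nn.GELU(), nn.Linear(2048, 2048))

# Save and restore a state_dict; the path is a placeholder.
torch.save(projector.state_dict(), "projector_checkpoint.pt")
state = torch.load("projector_checkpoint.pt", map_location="cpu")
projector.load_state_dict(state)
```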
A pretrained checkpoint for the model can be found here: HF 🤗.
- The model's projector has been pretrained for 1 epoch on the Valley dataset.
- The LLM and the projector have been jointly fine-tuned on the Video-ChatGPT dataset.
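A rough sketch of this two-stage recipe (train only the projector first, then unfreeze the LLM for joint fine-tuning) is shown below. The modules, learning rates, and optimizer choice are illustrative assumptions, not the repository's actual training hyperparameters.

```python
import torch

# Placeholder modules standing in for the real components.
video_encoder = torch.nn.Identity()       # stands in for LanguageBind (kept frozen)
projector = torch.nn.Linear(1024, 2048)   # stands in for the VideoGemma projector
llm = torch.nn.Linear(2048, 2048)         # stands in for Gemma

# Stage 1: pretrain only the projector (e.g. on Valley); the LLM stays frozen.
for p in llm.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# Stage 2: jointly fine-tune the projector and the LLM (e.g. on Video-ChatGPT).
for p in llm.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.AdamW(
    list(projector.parameters()) + list(llm.parameters()), lr=2e-5
)
```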