[Note] To run the captioning code, please make sure you follow this guideline and correctly prepare the vicuna-7b-v0 weights. You need to first download the original weights and then apply the delta weights. Improperly prepared weights will lead to meaningless outputs.
We propose a video captioning model that generates a caption for a short video clip. The model includes a vision (green) branch and a textual (blue) branch, so captioning can benefit from both video and text inputs. We release the checkpoint trained on Panda-70M.
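For orientation only, below is a minimal sketch of how a two-branch design like this can condition an LLM-based captioner. Every name in it is illustrative rather than the actual implementation; the real architecture follows Video-LLaMA, which this code builds on.

```python
import torch
import torch.nn as nn

class TwoBranchCaptionerSketch(nn.Module):
    """Illustrative only: a vision branch and a textual branch jointly condition the caption."""
    def __init__(self, vision_encoder: nn.Module, text_embedder: nn.Module,
                 vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder     # video branch (e.g. a frozen video encoder)
        self.text_embedder = text_embedder       # textual branch (embeds the optional text prompt)
        self.proj = nn.Linear(vis_dim, llm_dim)  # project visual tokens into the LLM embedding space

    def build_llm_prefix(self, video: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.proj(self.vision_encoder(video))   # (B, T, llm_dim)
        txt_tokens = self.text_embedder(prompt_ids)          # (B, L, llm_dim)
        # tokens from both branches are concatenated and fed to the language model as its prefix
        return torch.cat([vis_tokens, txt_tokens], dim=1)
```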
git clone https://github.com/snap-research/Panda-70M.git
cd Panda-70M/captioning
# create a conda environment
conda create --name panda70m_captioning python=3.9 -y
conda activate panda70m_captioning
pip install -r requirements.txt
# install default JRE
apt update
apt install default-jre
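The default JRE is presumably needed for the Java-based caption metrics (e.g. METEOR) used during evaluation. An optional sanity check after setup, assuming `requirements.txt` pulls in PyTorch:

```bash
# optional sanity check: both commands should print a version without errors
java -version
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```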
You can manually download the file here (3.82 GB) and move it to the `checkpoint` folder, or run:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5" -O checkpoint/checkpoint_best.pth && rm -rf /tmp/cookies.txt
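If the wget one-liner fails (Google Drive occasionally changes its confirmation flow), the `gdown` package is a possible alternative; recent versions accept the file ID directly, and the ID below is the one from the link above:

```bash
# alternative download via gdown (pip install gdown); uses the same Google Drive file ID
gdown 1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5 -O checkpoint/checkpoint_best.pth
```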
- Please follow the instructions from FastChat to prepare the vicuna-7b-v0 weights.
- [Note] You need to apply the delta weights. After processing, the weights should be moved to the `vicuna_weights/vicuna-7b-v0` folder with a file list like this.
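For reference, applying the delta weights with FastChat typically looks like the command below; the base LLaMA path is a placeholder, and the FastChat instructions take precedence if they differ:

```bash
# apply the v0 delta on top of the original LLaMA-7B weights
# (the base-model path is a placeholder; follow the FastChat guide for details)
python3 -m fastchat.model.apply_delta \
    --base-model-path /path/to/llama-7b \
    --target-model-path vicuna_weights/vicuna-7b-v0 \
    --delta-path lmsys/vicuna-7b-delta-v0
```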
python inference.py --video-list inputs/video_list.txt --prompt-list inputs/prompt_list.txt
The code will caption the two test videos listed in `video_list.txt`, using the extra textual information from `prompt_list.txt` as input. Here are some output examples:
**We will remove the video samples from our dataset / GitHub / project webpage / technical presentation upon request. Please contact tsaishienchen at gmail dot com to make a request.**
- [Note] You might get different outputs due to the randomness of the LLM's generation.
| | BLEU-4 | ROUGE-L | METEOR | CIDEr | BertScore |
|---|---|---|---|---|---|
| MSRVTT | 25.4% | 50.1% | 27.7% | 31.5% | 87.9% |
| MSVD | 32.8% | 61.2% | 35.3% | 49.2% | 90.2% |
- [Note] The results might not be perfectly reproduced due to the randomness of the LLM's generation and could deviate by ±0.5%.
- You can download the video samples here [MSRVTT / MSVD] and move them to the `test_datasets/video_samples/MSRVTT` or `MSVD` folder.
- The caption annotations of the testing samples are already saved in the `test_datasets/anno_downstream` folder.
# MSRVTT
python inference.py --video-list test_datasets/video_list/msrvtt_test.txt --output-json msrvtt_caption.json
python compute_results.py --predict-json msrvtt_caption.json --target-json test_datasets/anno_downstream/msrvtt_caption_test.json
# MSVD
python inference.py --video-list test_datasets/video_list/msvd_test.txt --output-json msvd_caption.json
python compute_results.py --predict-json msvd_caption.json --target-json test_datasets/anno_downstream/msvd_caption_test.json
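`compute_results.py` is the authoritative evaluation script. Purely for illustration, the sketch below shows how such caption metrics are commonly computed with `pycocoevalcap` (whose METEOR scorer calls a Java jar, which is why a JRE is installed above). The JSON layout used here is an assumption, not necessarily the script's actual format:

```python
# illustrative only: caption metrics via pycocoevalcap
# assumed JSON layout: {"video_id": ["caption", ...], ...}
import json
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor  # runs a Java jar, hence the JRE requirement
from pycocoevalcap.cider.cider import Cider

preds = json.load(open("msrvtt_caption.json"))
refs = json.load(open("test_datasets/anno_downstream/msrvtt_caption_test.json"))

tok = PTBTokenizer()
res = tok.tokenize({k: [{"caption": c} for c in v] for k, v in preds.items()})
gts = tok.tokenize({k: [{"caption": c} for c in v] for k, v in refs.items()})

for name, scorer in [("BLEU-4", Bleu(4)), ("ROUGE-L", Rouge()),
                     ("METEOR", Meteor()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(score, list):  # Bleu(4) returns BLEU-1..4
        score = score[-1]
    print(f"{name}: {score:.3f}")
```

BertScore is not part of `pycocoevalcap`; it is typically computed with the separate `bert-score` package.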
The code for video captioning is built upon Video-LLaMA. Thanks for sharing the great work!