[Note] To run the captioning code, please make sure you follow this guideline and correctly prepare the vicuna-7b-v0 weights. You need to first download the original weights and then apply the delta weights. Improperly prepared weights will lead to meaningless outputs.
We propose a video captioning model that generates a caption for a short video clip. The model includes a vision (green) branch and a textual (blue) branch, so captioning can benefit from both video and text inputs. We release the checkpoint trained on Panda-70M.
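For orientation only, below is a minimal sketch of how a two-branch design like this can condition an LLM-based captioner. Every name in it is illustrative rather than the actual implementation; the real architecture follows Video-LLaMA, which this code builds on.

```python
import torch
import torch.nn as nn

class TwoBranchCaptionerSketch(nn.Module):
    """Illustrative only: a vision branch and a textual branch jointly condition the caption."""
    def __init__(self, vision_encoder: nn.Module, text_embedder: nn.Module,
                 vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder     # video branch (e.g. a frozen video encoder)
        self.text_embedder = text_embedder       # textual branch (embeds the optional text prompt)
        self.proj = nn.Linear(vis_dim, llm_dim)  # project visual tokens into the LLM embedding space

    def build_llm_prefix(self, video: torch.Tensor, prompt_ids: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.proj(self.vision_encoder(video))   # (B, T, llm_dim)
        txt_tokens = self.text_embedder(prompt_ids)          # (B, L, llm_dim)
        # tokens from both branches are concatenated and fed to the language model as its prefix
        return torch.cat([vis_tokens, txt_tokens], dim=1)
```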
git clone https://github.com/snap-research/Panda-70M.git
cd Panda-70M/captioning
# create a conda environment
conda create --name panda70m_captioning python=3.9 -y
conda activate panda70m_captioning
pip install -r requirements.txt
# install default JRE
apt update
apt install default-jre
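The default JRE is presumably needed for the Java-based caption metrics (e.g. METEOR) used during evaluation. An optional sanity check after setup, assuming `requirements.txt` pulls in PyTorch:

```bash
# optional sanity check: both commands should print a version without errors
java -version
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```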
You can manually download the file here (3.82 GB) and move it to the `checkpoint` folder, or run:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5" -O checkpoint/checkpoint_best.pth && rm -rf /tmp/cookies.txt
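If the wget one-liner fails (Google Drive occasionally changes its confirmation flow), the `gdown` package is a possible alternative; recent versions accept the file ID directly, and the ID below is the one from the link above:

```bash
# alternative download via gdown (pip install gdown); uses the same Google Drive file ID
gdown 1Gjp5LrgGJobcFi3AaXvLnzlY7IWXyaI5 -O checkpoint/checkpoint_best.pth
```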
- Please follow the instructions from FastChat to prepare the vicuna-7b-v0 weights.
- [Note] You need to apply the delta weights. After processing, the weights should be moved to the `vicuna_weights/vicuna-7b-v0` folder with a file list like this.
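For reference, applying the delta weights with FastChat typically looks like the command below; the base LLaMA path is a placeholder, and the FastChat instructions take precedence if they differ:

```bash
# apply the v0 delta on top of the original LLaMA-7B weights
# (the base-model path is a placeholder; follow the FastChat guide for details)
python3 -m fastchat.model.apply_delta \
    --base-model-path /path/to/llama-7b \
    --target-model-path vicuna_weights/vicuna-7b-v0 \
    --delta-path lmsys/vicuna-7b-delta-v0
```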
python inference.py --video-list inputs/video_list.txt --prompt-list inputs/prompt_list.txt
The code will caption the two test videos listed in `video_list.txt`, using the extra textual information from `prompt_list.txt` as input. Here are some output examples:
**We will remove the video samples from our dataset / GitHub / project webpage / technical presentation upon request. Please contact tsaishienchen at gmail dot com to make a request.**
- [Note] You might get different outputs due to the randomness of the LLM's generation.
| | BLEU-4 | ROUGE-L | METEOR | CIDEr | BertScore |
|---|---|---|---|---|---|
| MSRVTT | 25.4% | 50.1% | 27.7% | 31.5% | 87.9% |
| MSVD | 32.8% | 61.2% | 35.3% | 49.2% | 90.2% |
- [Note] The results might not be perfectly reproduced due to the randomness of the LLM's generation and could deviate by ±0.5%.
- You can download the video samples here [MSRVTT / MSVD] and move them to the `test_datasets/video_samples/MSRVTT` or `MSVD` folder.
- The caption annotations of the testing samples are already saved in the `test_datasets/anno_downstream` folder.
# MSRVTT
python inference.py --video-list test_datasets/video_list/msrvtt_test.txt --output-json msrvtt_caption.json
python compute_results.py --predict-json msrvtt_caption.json --target-json test_datasets/anno_downstream/msrvtt_caption_test.json
# MSVD
python inference.py --video-list test_datasets/video_list/msvd_test.txt --output-json msvd_caption.json
python compute_results.py --predict-json msvd_caption.json --target-json test_datasets/anno_downstream/msvd_caption_test.json
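`compute_results.py` is the authoritative evaluation script. Purely for illustration, the sketch below shows how such caption metrics are commonly computed with `pycocoevalcap` (whose METEOR scorer calls a Java jar, which is why a JRE is installed above). The JSON layout used here is an assumption, not necessarily the script's actual format:

```python
# illustrative only: caption metrics via pycocoevalcap
# assumed JSON layout: {"video_id": ["caption", ...], ...}
import json
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor  # runs a Java jar, hence the JRE requirement
from pycocoevalcap.cider.cider import Cider

preds = json.load(open("msrvtt_caption.json"))
refs = json.load(open("test_datasets/anno_downstream/msrvtt_caption_test.json"))

tok = PTBTokenizer()
res = tok.tokenize({k: [{"caption": c} for c in v] for k, v in preds.items()})
gts = tok.tokenize({k: [{"caption": c} for c in v] for k, v in refs.items()})

for name, scorer in [("BLEU-4", Bleu(4)), ("ROUGE-L", Rouge()),
                     ("METEOR", Meteor()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(score, list):  # Bleu(4) returns BLEU-1..4
        score = score[-1]
    print(f"{name}: {score:.3f}")
```

BertScore is not part of `pycocoevalcap`; it is typically computed with the separate `bert-score` package.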
The code for video captioning is built upon Video-LLaMA. Thanks for sharing the great work!