forked from facebookresearch/VMZ
Major update codebase, adding CSN models, with IG65M pre-trained models
Showing 30 changed files with 2,035 additions and 547 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,47 +1,47 @@ | ||
# R(2+1)D and Mixed-Convolutions for Action Recognition | ||
# VMZ: Model Zoo for Video Modeling | ||
|
||
 | ||
VMZ is a Caffe2 codebase for video modeling developed by the Computer Vision team at Facebook AI. The aim of this codebase is to help other researchers and industry practitioners: | ||
+ reproduce some of our research results and | ||
+ leverage our very strong pre-trained models. | ||
|
||
[[project page](https://dutran.github.io/R2Plus1D/)] [[paper](https://arxiv.org/abs/1711.11248)] | ||
Currently, this codebase supports the following models: | ||
+ R(2+1)D, MCx models [[1]](https://research.fb.com/wp-content/uploads/2018/04/a-closer-look-at-spatiotemporal-convolutions-for-action-recognition.pdf). | ||
+ CSN models [[2]](https://arxiv.org/pdf/1904.02811.pdf). | ||
+ R(2+1)D and CSN models pre-trained on large-scale (65 million!) weakly-supervised public Instagram videos (**IG-65M**) [[3]](https://research.fb.com/wp-content/uploads/2019/05/Large-scale-weakly-supervised-pre-training-for-video-action-recognition.pdf). | ||
|
||
If you find this work helpful for your research, please cite our following paper: | ||
## Main Models | ||
|
||
D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun and M. Paluri. **A Closer Look at Spatiotemporal Convolutions for Action Recognition.** IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. | ||
We provide our latest video models, including R(2+1)D, ir-CSN, and ip-CSN (all with 152 layers), which are pre-trained on Sports-1M or **IG-65M** and then fine-tuned on Kinetics-400. Both pre-trained and fine-tuned models are provided in the tables below. We hope these models will serve as valuable baselines and feature extractors for related video modeling tasks such as action detection, video captioning, and video Q&A. More models, e.g., shallower ones or ones with shorter clip inputs, are also provided in the [model zoo](tutorials/model_zoo.md). | ||
|
||
``` | ||
@inproceedings{r2plus1d_cvpr18, | ||
title = {A Closer Look at Spatiotemporal Convolutions for Action Recognition}, | ||
author = {Du Tran and Heng Wang and Lorenzo Torresani and Jamie Ray and Yann LeCun and | ||
Manohar Paluri}, | ||
booktitle = {CVPR}, | ||
year = 2018 | ||
} | ||
``` | ||
If you have any questions or feedback about the code, please contact: [email protected], [email protected]. | ||
### R(2+1)D-152 | ||
|
||
## Requirements | ||
R2Plus1D requires the following dependencies: | ||
* [OpenCV](https://opencv.org) (tested on 3.4.1) and [ffmpeg](https://trac.ffmpeg.org). | ||
* [Caffe2](https://caffe2.ai) and its dependencies. | ||
* You will need to build from source and install with `USE_OPENCV=1 USE_FFMPEG=1 USE_LMDB=1 python setup.py install` for OpenCV, ffmpeg, and lmdb support. | ||
* And lmdb, python-lmdb, and pandas. | ||
| Input size | Pretrained dataset | Pretrained model | Video@1 Kinetics | Video@5 Kinetics | Finetuned model | GFLOPs | params(M) | | ||
| ---------- | --------| ---- | ------- | ------- | -------- | ----- | ------ | | ||
| 32x112x112 | Sports1M | [link](https://www.dropbox.com/s/w5cdqeyqukuaqt7/r2plus1d_152_sports1m_from_scratch_f127111290.pkl?dl=0) | 79.5 | 94.0 | [link](https://www.dropbox.com/s/twvcpe30rxuaf45/r2plus1d_152_ft_kinetics_from_sports1m_f128957437.pkl?dl=0) | 329.1 | 118.0 | | ||
| 32x112x112 | IG-65M | [link](https://www.dropbox.com/s/oqdg176p7nqc84v/r2plus1d_152_ig65m_from_scratch_f106380637.pkl?dl=0) | 81.6 | 95.3 | [link](https://www.dropbox.com/s/tmxuae8ubo5gipy/r2plus1d_152_ft_kinetics_from_ig65m_f107107466.pkl?dl=0) | 329.1 | 118.0 | | ||
|
||
## Installation | ||
* You need to install ffmpeg, OpenCV, and caffe2. Caffe2 source build instructions can be found [here](https://caffe2.ai/docs/getting-started.html?configuration=compile) but make sure you install with `USE_OPENCV=1 USE_FFMPEG=1 USE_LMDB=1 python setup.py install`. You also need to install lmdb, python-lmdb, and pandas. | ||
|
||
### ir-CSN-152 | ||
| Input size | Pretrained dataset | Pretrained model | Video@1 Kinetics | Video@5 Kinetics | Finetuned model | GFLOPs | params(M) | | ||
| ---------- | ------| ------ | ------- | ------- | -------- | ----- | ------ | | ||
| 32x224x224 | Sports1M | [link](https://www.dropbox.com/s/woh99y2hll1mlqv/irCSN_152_Sports1M_from_scratch_f99918785.pkl?dl=0) | 78.2 | 93.0 | [link](https://www.dropbox.com/s/zuoj1aqouh6bo6k/irCSN_152_ft_kinetics_from_Sports1M_f101599884.pkl?dl=0) | 96.7 | 29.6 | | ||
| 32x224x224 | IG-65M | [link](https://www.dropbox.com/s/r0kppq7ox6c57no/irCSN_152_ig65m_from_scratch_f125286141.pkl?dl=0) | 82.6 | 95.3 | [link](https://www.dropbox.com/s/gmd8r87l3wmkn3h/irCSN_152_ft_kinetics_from_ig65m_f126851907.pkl?dl=0) | 96.7 | 29.6 | | ||
|
||
## Tutorials | ||
We provide some basic tutorials to help you get familiar with the code and tools. | ||
* [Installation Guide](tutorials/Installation_guide.md) | ||
* [Training Kinetics from scratch](tutorials/kinetics_train.md) | ||
* [Finetuning R(2+1)D on HMDB51](tutorials/hmdb51_finetune.md) | ||
* [Dense prediction](tutorials/dense_prediction.md) | ||
* [Feature extraction](tutorials/feature_extraction.md) | ||
* [Download and evaluate pre-trained models](tutorials/models.md) | ||
### ip-CSN-152 | ||
| Input size | Pretrained dataset | Pretrained model | Video@1 Kinetics | Video@5 Kinetics | Finetuned model | GFLOPs | params(M) | | ||
| ---------- | ------ | ------ | ------- | ------- | -------- | ----- | ------ | | ||
| 32x224x224 | Sports1M | [link](https://www.dropbox.com/s/70di7o7qz6gjq6x/ipCSN_152_Sports1M_from_scratch_f111018543.pkl?dl=0) | 78.8 | 93.5 | [link](https://www.dropbox.com/s/ir7cr0hda36knux/ipCSN_152_ft_kinetics_from_Sports1M_f111279053.pkl?dl=0) | 108.8 | 32.8 | | ||
| 32x224x224 | IG-65M | TBD | TBD | TBD | TBD | 108.8 | 32.8 | | ||
|
||
|
||
## References | ||
1. D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun and M. Paluri. **A Closer Look at Spatiotemporal Convolutions for Action Recognition.** CVPR 2018. | ||
2. D. Tran, H. Wang, L. Torresani and M. Feiszli. **Video Classification with Channel-Separated Convolutional Networks.** ICCV 2019. | ||
3. D. Ghadiyaram, M. Feiszli, D. Tran, X. Yan, H. Wang and D. Mahajan, **Large-scale weakly-supervised pre-training for video action recognition.** CVPR 2019. | ||
|
||
|
||
## License | ||
R2Plus1D is Apache 2.0 licensed, as found in the LICENSE file. | ||
VMZ is Apache 2.0 licensed, as found in the LICENSE file. | ||
|
||
### Acknowledgements | ||
The authors would like to thank Ahmed Taei, Aarti Basant, Aapo Kyrola, and the Facebook Caffe2 team for their help in implementing ND-convolution, in optimizing video I/O, and in providing support for distributed training. We are grateful to Joao Carreira for sharing I3D results on the Kinetics validation set. | ||
## Supporting Team | ||
This codebase is actively supported by some members of the CV team (Facebook AI): @CHJoanna, @weiyaowang, @bjuncek, @hengcv, @deeptigp, and @dutran. |
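
As a reading aid for the model tables in this diff, the following is a minimal sketch that compares the reported numbers. The rows are copied from the tables above; the names `ROWS`, `top1_gain`, and `gflops_ratio` are illustrative and not part of the VMZ codebase. The ip-CSN-152 IG-65M row is listed as TBD in the source, so that model is skipped in the gain comparison.

```python
# Rows copied from the README tables in this diff:
# (model, pre-training dataset, Video@1, Video@5, GFLOPs, params in millions)
ROWS = [
    ("R(2+1)D-152", "Sports1M", 79.5, 94.0, 329.1, 118.0),
    ("R(2+1)D-152", "IG-65M", 81.6, 95.3, 329.1, 118.0),
    ("ir-CSN-152", "Sports1M", 78.2, 93.0, 96.7, 29.6),
    ("ir-CSN-152", "IG-65M", 82.6, 95.3, 96.7, 29.6),
    ("ip-CSN-152", "Sports1M", 78.8, 93.5, 108.8, 32.8),  # IG-65M row is TBD
]


def top1_gain(rows):
    """Video@1 gain of IG-65M over Sports1M pre-training, per model."""
    by_model = {}
    for model, dataset, top1, *_ in rows:
        by_model.setdefault(model, {})[dataset] = top1
    return {
        model: round(scores["IG-65M"] - scores["Sports1M"], 1)
        for model, scores in by_model.items()
        if "IG-65M" in scores and "Sports1M" in scores  # skip incomplete models
    }


def gflops_ratio(rows, model_a, model_b):
    """Ratio of reported GFLOPs between two models (first matching row each)."""
    gflops = {}
    for model, _dataset, _top1, _top5, flops, _params in rows:
        gflops.setdefault(model, flops)
    return round(gflops[model_a] / gflops[model_b], 1)


if __name__ == "__main__":
    print(top1_gain(ROWS))
    print(gflops_ratio(ROWS, "R(2+1)D-152", "ir-CSN-152"))
```

On the reported numbers, IG-65M pre-training improves Video@1 for both models that have complete rows, and ir-CSN-152 needs roughly a third of the GFLOPs of R(2+1)D-152. Note that the two architectures are listed with different input sizes (32x224x224 versus 32x112x112).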