Major update codebase, adding CSN models, with IG65M pre-trained models

bjuncek · Aug 1, 2019 · a5235b3
1 parent f93bea5

Showing 30 changed files with 2,035 additions and 547 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -1,4 +1,4 @@
-# Contributing to R2Plus1D
+# Contributing to VMZ
 We want to make contributing to this project as easy and transparent as
 possible.
 
@@ -35,5 +35,5 @@ outlined on that page and do not file a public issue.
 * ...
 
 ## License
-By contributing to R2Plus1D, you agree that your contributions will be licensed
+By contributing to VMZ, you agree that your contributions will be licensed
 under the LICENSE file in the root directory of this source tree.
diff --git a/README.md b/README.md
@@ -1,47 +1,47 @@
-# R(2+1)D and Mixed-Convolutions for Action Recognition
+# VMZ: Model Zoo for Video Modeling
 
-![r2plus1d1](https://raw.githubusercontent.com/dutran/R2Plus1D/master/r2plus1d.png)
+VMZ is a Caffe2 codebase for video modeling developed by the Computer Vision team at Facebook AI. The aim of this codebase is to help other researchers and industry practitioners:
++ reproduce some of our research results and 
++ leverage our very strong pre-trained models. 
 
-[[project page](https://dutran.github.io/R2Plus1D/)] [[paper](https://arxiv.org/abs/1711.11248)]
+Currently, this codebase supports the following models:
++ R(2+1)D, MCx models [[1]](https://research.fb.com/wp-content/uploads/2018/04/a-closer-look-at-spatiotemporal-convolutions-for-action-recognition.pdf).
++ CSN models [[2]](https://arxiv.org/pdf/1904.02811.pdf).
++ R(2+1)D and CSN models pre-trained on large-scale (65 million!) weakly-supervised public Instagram videos (**IG-65M**) [[3]](https://research.fb.com/wp-content/uploads/2019/05/Large-scale-weakly-supervised-pre-training-for-video-action-recognition.pdf).
 
-If you find this work helpful for your research, please cite our following paper:
+## Main Models
 
-D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun and M. Paluri. **A Closer Look at Spatiotemporal Convolutions for Action Recognition.** IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
+We provide our latest video models, including R(2+1)D, ir-CSN, and ip-CSN (all with 152 layers), which are pre-trained on Sports-1M or **IG-65M** and then fine-tuned on Kinetics-400. Both pre-trained and fine-tuned models are provided in the tables below. We hope these models will serve as valuable baselines and feature extractors for related video modeling tasks such as action detection, video captioning, and video Q&A. More models, e.g. shallower ones or ones with shorter clip inputs, are also provided in the [model zoo](tutorials/model_zoo.md).
 
-```
-@inproceedings{r2plus1d_cvpr18,
-    title = {A Closer Look at Spatiotemporal Convolutions for Action Recognition},
-    author = {Du Tran and Heng Wang and Lorenzo Torresani and Jamie Ray and Yann LeCun and
-               Manohar Paluri},
-    booktitle = {CVPR},
-    year = 2018
-}
-```
-If you have any question or feedback about the code, please contact: [email protected], [email protected]. 
+### R(2+1)D-152
 
-## Requirements
-R2Plus1D requires the following dependencies:
-* [OpenCV](https://opencv.org) (tested on 3.4.1) and [ffmpeg](https://trac.ffmpeg.org).
-* [Caffe2](https://caffe2.ai) and its dependencies.
-  * You will need to build from source and install with `USE_OPENCV=1 USE_FFMPEG=1 USE_LMDB=1 python setup.py install` for OpenCV, ffmpeg, and lmdb support.
-* And lmdb, python-lmdb, and pandas.
+| Input size | Pretrained dataset | Pretrained model  | Video@1 Kinetics | Video@5 Kinetics | Finetuned model | GFLOPs | params(M) |
+| ---------- | --------| ---- | ------- | ------- | -------- | ----- | ------ |
+| 32x112x112 | Sports1M | [link](https://www.dropbox.com/s/w5cdqeyqukuaqt7/r2plus1d_152_sports1m_from_scratch_f127111290.pkl?dl=0) | 79.5   | 94.0    | [link](https://www.dropbox.com/s/twvcpe30rxuaf45/r2plus1d_152_ft_kinetics_from_sports1m_f128957437.pkl?dl=0)      | 329.1 | 118.0 |
+| 32x112x112 | IG-65M | [link](https://www.dropbox.com/s/oqdg176p7nqc84v/r2plus1d_152_ig65m_from_scratch_f106380637.pkl?dl=0)      | 81.6    | 95.3    | [link](https://www.dropbox.com/s/tmxuae8ubo5gipy/r2plus1d_152_ft_kinetics_from_ig65m_f107107466.pkl?dl=0)      | 329.1 | 118.0 |
 
-## Installation
-* You need to install ffmpeg, OpenCV, and caffe2. Caffe2 source build instructions can be found [here](https://caffe2.ai/docs/getting-started.html?configuration=compile) but make sure you install with `USE_OPENCV=1 USE_FFMPEG=1 USE_LMDB=1 python setup.py install`. You also need to install lmdb, python-lmdb, and pandas.
 
+### ir-CSN-152
+| Input size | Pretrained dataset | Pretrained model  | Video@1 Kinetics | Video@5 Kinetics | Finetuned model | GFLOPs | params(M) |
+| ---------- | ------| ------ | ------- | ------- | -------- | ----- | ------ |
+| 32x224x224 | Sports1M | [link](https://www.dropbox.com/s/woh99y2hll1mlqv/irCSN_152_Sports1M_from_scratch_f99918785.pkl?dl=0) | 78.2    | 93.0    | [link](https://www.dropbox.com/s/zuoj1aqouh6bo6k/irCSN_152_ft_kinetics_from_Sports1M_f101599884.pkl?dl=0) | 96.7 | 29.6 |
+| 32x224x224 | IG-65M | [link](https://www.dropbox.com/s/r0kppq7ox6c57no/irCSN_152_ig65m_from_scratch_f125286141.pkl?dl=0)      | 82.6       | 95.3       | [link](https://www.dropbox.com/s/gmd8r87l3wmkn3h/irCSN_152_ft_kinetics_from_ig65m_f126851907.pkl?dl=0)      | 96.7 | 29.6 |
 
-## Tutorials
-We provide some basic tutorials for you to get familar with the code and tools.
-* [Installation Guide](tutorials/Installation_guide.md)
-* [Training Kinetics from scratch](tutorials/kinetics_train.md)
-* [Finetuning R(2+1)D on HMDB51](tutorials/hmdb51_finetune.md)
-* [Dense prediction](tutorials/dense_prediction.md)
-* [Feature extraction](tutorials/feature_extraction.md)
-* [Download and evaluate pre-trained models](tutorials/models.md)
+### ip-CSN-152
+| Input size | Pretrained dataset | Pretrained model  | Video@1 Kinetics | Video@5 Kinetics | Finetuned model | GFLOPs | params(M) |
+| ---------- | ------ | ------ | ------- | ------- | -------- | ----- | ------ |
+| 32x224x224 | Sports1M | [link](https://www.dropbox.com/s/70di7o7qz6gjq6x/ipCSN_152_Sports1M_from_scratch_f111018543.pkl?dl=0) | 78.8    | 93.5    | [link](https://www.dropbox.com/s/ir7cr0hda36knux/ipCSN_152_ft_kinetics_from_Sports1M_f111279053.pkl?dl=0)      | 108.8 | 32.8 |
+| 32x224x224 | IG-65M | TBD | TBD    | TBD    | TBD   | 108.8 | 32.8 |
+
+
+## References
+1. D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun and M. Paluri. **A Closer Look at Spatiotemporal Convolutions for Action Recognition.** CVPR 2018.
+2. D. Tran, H. Wang, L. Torresani and M. Feiszli. **Video Classification with Channel-Separated Convolutional Networks.** ICCV 2019.
+3. D. Ghadiyaram, M. Feiszli, D. Tran, X. Yan, H. Wang and D. Mahajan. **Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition.** CVPR 2019.
 
 
 ## License
-R2Plus1D is Apache 2.0 licensed, as found in the LICENSE file.
+VMZ is Apache 2.0 licensed, as found in the LICENSE file.
 
-### Acknowledgements
-The authors would like to thank Ahmed Taei, Aarti Basant, Aapo Kyrola, and the Facebook Caffe2 team for their help in implementing ND-convolution, in optimizing video I/O, and in providing support for distributed training. We are grateful to Joao Carreira for sharing I3D results on the Kinetics validation set.
+## Supporting Team
+This codebase is actively supported by some members of the CV team (Facebook AI): @CHJoanna, @weiyaowang, @bjuncek, @hengcv, @deeptigp, and @dutran.
diff --git a/data/create_video_db.py b/data/create_video_db.py
@@ -11,6 +11,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+from __future__ import absolute_import, division, print_function, unicode_literals
+
 import lmdb
 import pandas
 import sys
@@ -37,7 +39,6 @@ def create_video_db(
     # read csv list file
     list = pandas.read_csv(list_file)
 
-
     # checking necessary fields of the provided csv file
     assert 'org_video' in list, \
         "The input list does not have org_video column"
@@ -121,17 +122,17 @@ def main():
     parser.add_argument("--output_file", type=str, default=None,
                         help="Path to output lmdb data")
     parser.add_argument("--use_list", type=int, default=0,
-                        help="0: write video encoded data to lmdb, " +
-                        "1: write only full path to local video files")
+                        help="0: write video encoded data to lmdb, "
+                        + "1: write only full path to local video files")
     parser.add_argument("--use_video_id", type=int, default=0,
-                        help="0: does not use video_id, " +
-                        "1: write also video_id to lmdb")
+                        help="0: does not use video_id, "
+                        + "1: write also video_id to lmdb")
     parser.add_argument("--use_start_frame", type=int, default=0,
-                        help="0: does not use start_frame, " +
-                        "1: write also start_frame to lmdb")
+                        help="0: does not use start_frame, "
+                        + "1: write also start_frame to lmdb")
     parser.add_argument("--num_epochs", type=int, default=1,
-                        help="Due to lmdb does not allow online shuffle" +
-                        "we can write multiple shuffled list")
+                        help="Because lmdb does not allow online shuffling, "
+                        + "we can write multiple shuffled lists")
     args = parser.parse_args()
     create_video_db(
         args.list_file,

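The command-line flags touched by the `data/create_video_db.py` hunk above can be exercised in isolation. The following is a minimal sketch, not the repository's actual `main()`: `build_parser` is a hypothetical helper name, the `--list_file` definition is assumed (only its use as `args.list_file` is visible in the hunk), the `--num_epochs` help text is paraphrased, and the file names in the usage lines are placeholders. It relies on implicit concatenation of adjacent string literals rather than `+`, with each fragment ending in a separator.

```python
import argparse


def build_parser():
    """Hypothetical helper mirroring the flags visible in the hunk above."""
    parser = argparse.ArgumentParser(
        description="Sketch of the create_video_db command-line flags")
    # Assumed definition: the hunk only shows `args.list_file` being used.
    parser.add_argument("--list_file", type=str, default=None,
                        help="Path to the input csv list (assumed)")
    parser.add_argument("--output_file", type=str, default=None,
                        help="Path to output lmdb data")
    # Integer switches, all defaulting to 0 as in the diff.
    parser.add_argument("--use_list", type=int, default=0,
                        help="0: write video encoded data to lmdb, "
                             "1: write only full path to local video files")
    parser.add_argument("--use_video_id", type=int, default=0,
                        help="0: does not use video_id, "
                             "1: write also video_id to lmdb")
    parser.add_argument("--use_start_frame", type=int, default=0,
                        help="0: does not use start_frame, "
                             "1: write also start_frame to lmdb")
    # Paraphrased help text; the default of 1 matches the diff.
    parser.add_argument("--num_epochs", type=int, default=1,
                        help="Number of shuffled copies of the list to write")
    return parser


if __name__ == "__main__":
    # Placeholder paths; parsing does not touch the filesystem.
    args = build_parser().parse_args(
        ["--list_file", "train.csv", "--output_file", "train_lmdb",
         "--use_list", "1"])
    print(args.use_list, args.use_video_id, args.num_epochs)
```

Flags that are not passed keep the defaults shown in the diff (`0` for the three `use_*` switches and `1` for `--num_epochs`).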
diff --git a/lib/models/builder/__init__.py b/lib/models/builder/__init__.py