Automatic video captioning, a final project for CS5422 Neural Networks and Deep Learning. This project uses a neural network to produce a simple, fixed three-word caption (`<noun> <verb> <noun>`) for each sequence of video frames.
The neural network is composed of four main components:
- Feature extractor using pretrained EfficientNet
- Object classifier, a linear layer whose sole purpose is to capture the presence of objects in each frame
- Encoder, a linear layer which helps capture the action happening in each frame
- Decoder, an RNN which produces the caption
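
The data flow through these components can be illustrated with a minimal, dependency-free sketch. It is not the project's actual implementation: the pretrained EfficientNet is stubbed out (per-frame feature vectors are assumed to be precomputed), the decoder is assumed to be a plain Elman-style RNN, and the way the object-classifier and encoder outputs are concatenated and fed to the decoder is a guess. The names `CaptionSketch`, `linear`, and `init_linear`, and all dimensions, are hypothetical.

```python
import math
import random


def linear(x, weight, bias):
    """Affine map y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, for illustration only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


class CaptionSketch:
    """Hypothetical wiring of object classifier, encoder, and RNN decoder."""

    def __init__(self, feat_dim, n_objects, hidden_dim, vocab_size, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # Object classifier: linear layer scoring object presence per frame.
        self.obj_clf = init_linear(rng, feat_dim, n_objects)
        # Encoder: linear layer summarising the action in each frame.
        self.encoder = init_linear(rng, feat_dim, hidden_dim)
        # Decoder (assumed Elman RNN): input, recurrent, and output weights.
        self.w_in = init_linear(rng, n_objects + hidden_dim, hidden_dim)
        self.w_hh = init_linear(rng, hidden_dim, hidden_dim)
        self.w_out = init_linear(rng, hidden_dim, vocab_size)

    def forward(self, frame_features, caption_len=3):
        """Map a sequence of per-frame feature vectors to token ids."""
        h = [0.0] * self.hidden_dim
        # Read the frames one by one, updating the recurrent state.
        for feat in frame_features:
            objects = linear(feat, *self.obj_clf)
            action = linear(feat, *self.encoder)
            x = objects + action  # list concatenation (an assumption)
            pre_in = linear(x, *self.w_in)
            pre_hh = linear(h, *self.w_hh)
            h = [math.tanh(a + b) for a, b in zip(pre_in, pre_hh)]
        # Emit a fixed number of tokens by greedy decoding.
        tokens = []
        for _ in range(caption_len):
            tokens.append(argmax(linear(h, *self.w_out)))
            h = [math.tanh(v) for v in linear(h, *self.w_hh)]
        return tokens


# Stand-in for EfficientNet output: 5 frames with 8 features each.
rng = random.Random(1)
frames = [[rng.random() for _ in range(8)] for _ in range(5)]
model = CaptionSketch(feat_dim=8, n_objects=4, hidden_dim=6, vocab_size=10)
caption_ids = model.forward(frames)
print(caption_ids)  # three token ids, to be mapped to <noun> <verb> <noun>
```

Because the weights here are random, the printed ids carry no meaning; the sketch only shows how a frame sequence of any length is reduced to exactly three tokens.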