# NLVL_DETR: Neural Network Architecture

NLVL_DETR (Natural Language Video Localization Detection Transformer) is a neural network that localizes moments in videos from natural language queries. The architecture integrates video and text processing modules and uses a Transformer-based encoder-decoder to predict the temporal span of the relevant video segment.

```
+--------------+    +--------------------+    +---------------------+    +---------------------+         +---------------------+    +-----------------+
|              |    |                    |    |      Kmeans or      |    |                     | context |                     |    |                 |
| Video Frames +--->| Vision Transformer +--->| Positional Encoding +--->| Transformer Encoder +-------->| Transformer Decoder +--->| Span Prediction |
|              |    |                    |    |                     |    |                     |         |                     |    |                 |
+--------------+    +--------------------+    +---------------------+    +---------------------+         +---------------------+    +-----------------+
                                                                                                                     ^
+------------+    +-------+                                                                                          |
|            |    |       |                        input sequence                                                    |
| Text Query +--->| Phi-2 +-------------------------------------------------------------------------------------------
|            |    |       |
+------------+    +-------+
```
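The data flow above can be sketched in PyTorch. This is a minimal illustration, not the project's actual implementation: the feature dimensions, layer counts, and the linear stand-ins for the Vision Transformer and Phi-2 backbones are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class NLVLDETRSketch(nn.Module):
    """Sketch of the NLVL_DETR flow: video tokens -> encoder (context),
    query tokens -> decoder input, span head on top. Dimensions are
    hypothetical; the real model uses ViT and Phi-2 feature extractors."""

    def __init__(self, d_model=256, nhead=8, num_layers=2, max_frames=512):
        super().__init__()
        # Stand-ins for the pretrained backbones: project frozen
        # backbone features down to the Transformer width.
        self.video_proj = nn.Linear(768, d_model)    # e.g. ViT frame embeddings
        self.text_proj = nn.Linear(2560, d_model)    # e.g. Phi-2 hidden states
        # Learned positional encoding for the video token sequence.
        self.pos_embed = nn.Parameter(torch.randn(1, max_frames, d_model) * 0.02)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Span head: normalized (start, end) of the predicted moment.
        self.span_head = nn.Linear(d_model, 2)

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, 768); text_feats: (B, L, 2560)
        v = self.video_proj(frame_feats) + self.pos_embed[:, : frame_feats.size(1)]
        context = self.encoder(v)                       # video context memory
        q = self.text_proj(text_feats)                  # query as input sequence
        h = self.decoder(q, context)                    # cross-attend to video
        span = self.span_head(h.mean(dim=1)).sigmoid()  # (B, 2) in [0, 1]
        return span


model = NLVLDETRSketch()
span = model(torch.randn(2, 32, 768), torch.randn(2, 7, 2560))
print(span.shape)  # (2, 2): start/end fractions per batch item
```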

To view training/eval loss metrics, run `tensorboard --logdir results`.

Link to the Charades-STA dataset. The "Data (scaled to 480p, 13GB)" download was used for this project.