Commit
Update chapters/en/unit7/video-processing/transformers-based-models.mdx
Co-authored-by: Woojun Jung <[email protected]>
mreraser and jungnerd authored Oct 8, 2024
1 parent 48f7543 commit 60ca8ed
Showing 1 changed file with 1 addition and 1 deletion.
@@ -123,7 +123,7 @@ In model 4, half of the attention heads are designed to operate with keys and va

After comparing Models 1, 2, 3, and 4, it is evident that Model 1 achieved the best performance but required the longest training time. In contrast, Model 2 demonstrated relatively high performance with shorter training times compared to Models 3 and 4, making it the most efficient model overall.

- The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer (ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model.
+ The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer(ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model.

## TimeSFormer[[timesformer]]

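The paragraph touched by this change describes ViViT's workaround for the scarcity of large video datasets: initialize the video model from a ViT pre-trained on a large image dataset. The sketch below is a minimal, hypothetical PyTorch illustration of one common way to do this, inflating the image model's 2D patch-embedding filters along the temporal axis to seed a tubelet embedding. The layer shapes, tubelet length, and variable names are illustrative assumptions, not the ViViT authors' code.

```python
import torch
import torch.nn as nn

hidden_size, patch_size, tubelet_frames = 768, 16, 2  # assumed, ViT-Base-like sizes

# Image model: 2D conv that maps 16x16 RGB patches to embeddings (as in ViT).
vit_patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)

# Video model: 3D conv that maps (2 x 16 x 16) RGB tubelets to embeddings (as in ViViT).
vivit_tubelet_embed = nn.Conv3d(
    3, hidden_size,
    kernel_size=(tubelet_frames, patch_size, patch_size),
    stride=(tubelet_frames, patch_size, patch_size),
)

with torch.no_grad():
    # "Filter inflation": copy the image filters to every temporal slice and divide by
    # the tubelet length, so a video made of repeated identical frames produces the
    # same embeddings as the image model.
    w2d = vit_patch_embed.weight                                   # (768, 3, 16, 16)
    w3d = w2d.unsqueeze(2).repeat(1, 1, tubelet_frames, 1, 1) / tubelet_frames
    vivit_tubelet_embed.weight.copy_(w3d)                          # (768, 3, 2, 16, 16)
    vivit_tubelet_embed.bias.copy_(vit_patch_embed.bias)

# Sanity check: a "static video" (the same frame repeated) should embed like the image.
frame = torch.randn(1, 3, 224, 224)
video = frame.unsqueeze(2).repeat(1, 1, tubelet_frames, 1, 1)      # (1, 3, 2, 224, 224)
img_tokens = vit_patch_embed(frame).flatten(2)                     # (1, 768, 196)
vid_tokens = vivit_tubelet_embed(video).flatten(2)                 # (1, 768, 196)
print(torch.allclose(img_tokens, vid_tokens, atol=1e-5))           # True, up to float error
```

In practice the pre-trained ViT weights would be loaded from a checkpoint rather than randomly initialized as here; the point of the sketch is only the weight-inflation step that lets image pre-training carry over to the video model.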
