diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx
index 1f98c2c0d..90bc6d219 100644
--- a/chapters/en/unit7/video-processing/transformers-based-models.mdx
+++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx
@@ -123,7 +123,7 @@ In model 4, half of the attention heads are designed to operate with keys and va
 
 After comparing Models 1, 2, 3, and 4, it is evident that Model 1 achieved the best performance but required the longest training time. In contrast, Model 2 demonstrated relatively high performance with shorter training times compared to Models 3 and 4, making it the most efficient model overall.
 
-The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer (ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model.
+The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer(ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model.
 
 ## TimeSFormer[[timesformer]]
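
The passage touched by this hunk describes initializing ViViT from a ViT pre-trained on a large image dataset. As a minimal sketch of what using such a model looks like in practice (not part of the patch itself), the snippet below loads a pretrained ViViT video classifier through the `transformers` library; it assumes the public `google/vivit-b-16x2-kinetics400` checkpoint and feeds a random dummy clip in place of a real video.

```python
import numpy as np
import torch
from transformers import VivitImageProcessor, VivitForVideoClassification

# Checkpoint assumed for illustration: a ViViT-B model fine-tuned on Kinetics-400,
# whose backbone was initialized from an image-pretrained ViT as the passage describes.
ckpt = "google/vivit-b-16x2-kinetics400"
processor = VivitImageProcessor.from_pretrained(ckpt)
model = VivitForVideoClassification.from_pretrained(ckpt)

# Dummy clip: 32 RGB frames of 224x224, standing in for a real video.
video = list(np.random.randint(0, 255, (32, 224, 224, 3), dtype=np.uint8))

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```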