Commit
Update chapters/en/unit7/video-processing/transformers-based-models.mdx
Co-authored-by: Woojun Jung <[email protected]>
mreraser and jungnerd authored Oct 8, 2024
1 parent 48f7543 commit 60ca8ed
Showing 1 changed file with 1 addition and 1 deletion.
@@ -123,7 +123,7 @@ In model 4, half of the attention heads are designed to operate with keys and va

After comparing Models 1, 2, 3, and 4, it is evident that Model 1 achieved the best performance but required the longest training time. In contrast, Model 2 demonstrated relatively high performance with shorter training times compared to Models 3 and 4, making it the most efficient model overall.

- The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer (ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model.
+ The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer(ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model.

## TimeSFormer[[timesformer]]

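The paragraph touched by this change describes ViViT's workaround for the scarcity of large video datasets: initialize the video model from a ViT pre-trained on a large image dataset. The sketch below is a minimal, hypothetical PyTorch illustration of one common way to do this, inflating the image model's 2D patch-embedding filters along the temporal axis to seed a tubelet embedding. The layer shapes, tubelet length, and variable names are illustrative assumptions, not the ViViT authors' code.

```python
import torch
import torch.nn as nn

hidden_size, patch_size, tubelet_frames = 768, 16, 2  # assumed, ViT-Base-like sizes

# Image model: 2D conv that maps 16x16 RGB patches to embeddings (as in ViT).
vit_patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)

# Video model: 3D conv that maps (2 x 16 x 16) RGB tubelets to embeddings (as in ViViT).
vivit_tubelet_embed = nn.Conv3d(
    3, hidden_size,
    kernel_size=(tubelet_frames, patch_size, patch_size),
    stride=(tubelet_frames, patch_size, patch_size),
)

with torch.no_grad():
    # "Filter inflation": copy the image filters to every temporal slice and divide by
    # the tubelet length, so a video made of repeated identical frames produces the
    # same embeddings as the image model.
    w2d = vit_patch_embed.weight                                   # (768, 3, 16, 16)
    w3d = w2d.unsqueeze(2).repeat(1, 1, tubelet_frames, 1, 1) / tubelet_frames
    vivit_tubelet_embed.weight.copy_(w3d)                          # (768, 3, 2, 16, 16)
    vivit_tubelet_embed.bias.copy_(vit_patch_embed.bias)

# Sanity check: a "static video" (the same frame repeated) should embed like the image.
frame = torch.randn(1, 3, 224, 224)
video = frame.unsqueeze(2).repeat(1, 1, tubelet_frames, 1, 1)      # (1, 3, 2, 224, 224)
img_tokens = vit_patch_embed(frame).flatten(2)                     # (1, 768, 196)
vid_tokens = vivit_tubelet_embed(video).flatten(2)                 # (1, 768, 196)
print(torch.allclose(img_tokens, vid_tokens, atol=1e-5))           # True, up to float error
```

In practice the pre-trained ViT weights would be loaded from a checkpoint rather than randomly initialized as here; the point of the sketch is only the weight-inflation step that lets image pre-training carry over to the video model.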
