diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx
index 1f98c2c0d..90bc6d219 100644
--- a/chapters/en/unit7/video-processing/transformers-based-models.mdx
+++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx
@@ -123,7 +123,7 @@ In model 4, half of the attention heads are designed to operate with keys and va
 
 After comparing Models 1, 2, 3, and 4, it is evident that Model 1 achieved the best performance but required the longest training time. In contrast, Model 2 demonstrated relatively high performance with shorter training times compared to Models 3 and 4, making it the most efficient model overall.
 
-The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer (ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model.
+The ViViT model fundamentally faces the issue of dataset sparsity. Like the Vision Transformer(ViT), ViViT requires an extremely large dataset to achieve good performance. However, such a scale of dataset is often unavailable for videos. Given that the learning task is more complex, the approach is to first pre-train on a large image dataset using ViT to initialize the model.
 
 ## TimeSFormer[[timesformer]]
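
The passage touched by this hunk describes initializing ViViT from a ViT pre-trained on a large image dataset. As a minimal sketch of what using such a model looks like in practice (not part of the patch itself), the snippet below loads a pretrained ViViT video classifier through the `transformers` library; it assumes the public `google/vivit-b-16x2-kinetics400` checkpoint and feeds a random dummy clip in place of a real video.

```python
import numpy as np
import torch
from transformers import VivitImageProcessor, VivitForVideoClassification

# Checkpoint assumed for illustration: a ViViT-B model fine-tuned on Kinetics-400,
# whose backbone was initialized from an image-pretrained ViT as the passage describes.
ckpt = "google/vivit-b-16x2-kinetics400"
processor = VivitImageProcessor.from_pretrained(ckpt)
model = VivitForVideoClassification.from_pretrained(ckpt)

# Dummy clip: 32 RGB frames of 224x224, standing in for a real video.
video = list(np.random.randint(0, 255, (32, 224, 224, 3), dtype=np.uint8))

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```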