diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx
index 1204dd41b..1f98c2c0d 100644
--- a/chapters/en/unit7/video-processing/transformers-based-models.mdx
+++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx
@@ -90,7 +90,7 @@ The approach in Model 1 was somewhat inefficient, as it contextualized all patch
 
 Factorised encoder (Model 2). Taken from the original paper.
 
-First, only spatial interactions are contextualized through Spatial Transformer Encoder (=ViT). Then, each frame is encoded to a single embedding, fed into the Temporal Transformer Encoder(=general transformer).
+First, only spatial interactions are contextualized through the Spatial Transformer Encoder (=ViT). Then, each frame is encoded into a single embedding, which is fed into the Temporal Transformer Encoder (=a general Transformer).
 
 **complexity : O(n_h^2 x n_w^2 + n_t^2)**
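
Note: to make the factorised-encoder description in the changed line concrete, here is a minimal PyTorch sketch of the two-stage design (spatial ViT per frame, then a temporal Transformer over per-frame embeddings). All names and hyperparameters (`FactorisedEncoder`, `dim`, `patch`, `frames`, `depth`, `heads`) are illustrative assumptions, and frame embeddings are obtained by mean-pooling patch tokens rather than a CLS token; this is not the ViViT reference implementation.

```python
# Sketch of ViViT's factorised encoder (Model 2); hyperparameters are illustrative.
import torch
import torch.nn as nn


class FactorisedEncoder(nn.Module):
    def __init__(self, dim=192, patch=16, frames=8, image_size=224, depth=4, heads=3):
        super().__init__()
        n_patches = (image_size // patch) ** 2  # n_h * n_w tokens per frame
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.spatial_pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.temporal_pos = nn.Parameter(torch.zeros(1, frames, dim))
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        # Spatial Transformer Encoder (=ViT): attention only within each frame.
        self.spatial = nn.TransformerEncoder(layer(), depth)
        # Temporal Transformer Encoder: attention only across frame embeddings.
        self.temporal = nn.TransformerEncoder(layer(), depth)

    def forward(self, video):                               # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        x = video.flatten(0, 1)                             # (B*T, 3, H, W)
        x = self.to_patches(x).flatten(2).transpose(1, 2)   # (B*T, n_h*n_w, dim)
        x = self.spatial(x + self.spatial_pos)              # O(n_h^2 * n_w^2) per frame
        frame_emb = x.mean(dim=1).view(b, t, -1)            # one embedding per frame
        out = self.temporal(frame_emb + self.temporal_pos)  # O(n_t^2)
        return out.mean(dim=1)                              # video-level representation


video = torch.randn(2, 8, 3, 224, 224)                      # (batch, frames, C, H, W)
print(FactorisedEncoder()(video).shape)                     # torch.Size([2, 192])
```

Because attention is never computed jointly over all spatio-temporal tokens, the cost is the sum of the two stages, O(n_h^2 x n_w^2 + n_t^2), rather than the O(n_h^2 x n_w^2 x n_t^2) of the joint Model 1.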