From 48f754302bdd0f93942a1f4f9ae790a68471ad8d Mon Sep 17 00:00:00 2001
From: Jiwook Han <33192762+mreraser@users.noreply.github.com>
Date: Tue, 8 Oct 2024 15:54:29 +0900
Subject: [PATCH] Update
 chapters/en/unit7/video-processing/transformers-based-models.mdx

Co-authored-by: Woojun Jung <46880056+jungnerd@users.noreply.github.com>
---
 .../en/unit7/video-processing/transformers-based-models.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/en/unit7/video-processing/transformers-based-models.mdx b/chapters/en/unit7/video-processing/transformers-based-models.mdx
index 1204dd41b..1f98c2c0d 100644
--- a/chapters/en/unit7/video-processing/transformers-based-models.mdx
+++ b/chapters/en/unit7/video-processing/transformers-based-models.mdx
@@ -90,7 +90,7 @@ The approach in Model 1 was somewhat inefficient, as it contextualized all patch

 Factorised encoder (Model 2). Taken from the original paper.

-First, only spatial interactions are contextualized through Spatial Transformer Encoder (=ViT). Then, each frame is encoded to a single embedding, fed into the Temporal Transformer Encoder(=general transformer).
+First, only spatial interactions are contextualized through Spatial Transformer Encoder(=ViT). Then, each frame is encoded to a single embedding, fed into the Temporal Transformer Encoder(=general transformer).
 **complexity : O(n_h^2 x n_w^2 + n_t^2)**
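
For context on the line this patch touches: below is a minimal PyTorch sketch of the factorised encoder (ViViT Model 2) that the edited sentence describes. It is not from the ViViT codebase or the course repo; all module and parameter names are illustrative assumptions. It shows why the cost factorises into the stated O(n_h^2 x n_w^2 + n_t^2) terms: spatial attention runs per frame over n_h x n_w patches, and temporal attention runs once over n_t frame embeddings.

```python
# A sketch (hypothetical names, not the official ViViT implementation) of the
# factorised encoder: a ViT-style spatial transformer contextualizes patches
# within each frame, each frame is pooled to one embedding, and a standard
# temporal transformer contextualizes the per-frame embeddings.
import torch
import torch.nn as nn

class FactorisedEncoder(nn.Module):
    def __init__(self, dim=768, n_heads=12, spatial_layers=12, temporal_layers=4):
        super().__init__()
        # Spatial Transformer Encoder (=ViT): attends over patches of one frame.
        spatial_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial_layer, num_layers=spatial_layers)
        # Temporal Transformer Encoder (=general transformer): attends over frames.
        temporal_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers=temporal_layers)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, n_t frames, n_h * n_w patches, dim)
        b, n_t, n_p, d = patch_tokens.shape
        # Fold frames into the batch so attention is per frame:
        # cost O(n_t * (n_h * n_w)^2) rather than O((n_t * n_h * n_w)^2).
        x = self.spatial(patch_tokens.reshape(b * n_t, n_p, d))
        # Mean-pool patches to one embedding per frame (the paper uses a CLS token).
        frame_emb = x.mean(dim=1).reshape(b, n_t, d)
        # Temporal attention over n_t frame embeddings: cost O(n_t^2).
        return self.temporal(frame_emb)

# Example: 2 clips, 8 frames, 14x14 = 196 patch tokens of dimension 768.
video = torch.randn(2, 8, 196, 768)
out = FactorisedEncoder()(video)  # shape: (2, 8, 768)
```

Mean pooling stands in here for the per-frame CLS token purely to keep the sketch short; the split between `spatial_layers` and `temporal_layers` is likewise an arbitrary choice, not the paper's configuration.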