Commit

diffusion video post
Lilian Weng committed Apr 15, 2024
1 parent 380d820 commit e8ca8c0
Showing 38 changed files with 1,259 additions and 93 deletions.
15 changes: 13 additions & 2 deletions archives/index.html
@@ -229,8 +229,19 @@
<h1>Archive</h1>
</header>
<div class="archive-year">
<h2 class="archive-year-header">2024<sup class="archive-count">&nbsp;&nbsp;1</sup>
<h2 class="archive-year-header">2024<sup class="archive-count">&nbsp;&nbsp;2</sup>
</h2>
+<div class="archive-month">
+<h3 class="archive-month-header">April<sup class="archive-count">&nbsp;&nbsp;1</sup></h3>
+<div class="archive-posts">
+<div class="archive-entry">
+<h3 class="archive-entry-title">Diffusion Models for Video Generation
+</h3>
+<div class="archive-meta">Date: April 12, 2024 | Estimated Reading Time: 20 min | Author: Lilian Weng</div>
+<a class="entry-link" aria-label="post link to Diffusion Models for Video Generation" href="https://lilianweng.github.io/posts/2024-04-12-diffusion-video/"></a>
+</div>
+</div>
+</div>
<div class="archive-month">
<h3 class="archive-month-header">February<sup class="archive-count">&nbsp;&nbsp;1</sup></h3>
<div class="archive-posts">
@@ -376,7 +387,7 @@ <h3 class="archive-month-header">July<sup class="archive-count">&nbsp;&nbsp;1</s
<div class="archive-entry">
<h3 class="archive-entry-title">What are Diffusion Models?
</h3>
-<div class="archive-meta">Date: July 11, 2021 | Estimated Reading Time: 31 min | Author: Lilian Weng</div>
+<div class="archive-meta">Date: July 11, 2021 | Estimated Reading Time: 32 min | Author: Lilian Weng</div>
<a class="entry-link" aria-label="post link to What are Diffusion Models?" href="https://lilianweng.github.io/posts/2021-07-11-diffusion-models/"></a>
</div>
</div>
27 changes: 14 additions & 13 deletions index.html
@@ -235,6 +235,19 @@ <h1>👋 Welcome to Lil&rsquo;Log</h1>
</footer>
</article>

+<article class="post-entry">
+<header class="entry-header">
+<h2>Diffusion Models for Video Generation
+</h2>
+</header>
+<section class="entry-content">
+<p>Diffusion models have demonstrated strong results on image synthesis in past years. Now the research community has started working on a harder task—using it for video generation. The task itself is a superset of the image case, since an image is a video of 1 frame, and it is much more challenging because
+It has extra requirements on temporal consistency across frames, which naturally demand more world knowledge to be encoded into the model....</p>
+</section>
+<footer class="entry-footer">Date: April 12, 2024 | Estimated Reading Time: 20 min | Author: Lilian Weng</footer>
+<a class="entry-link" aria-label="post link to Diffusion Models for Video Generation" href="https://lilianweng.github.io/posts/2024-04-12-diffusion-video/"></a>
+</article>

<article class="post-entry">
<header class="entry-header">
<h2>Thinking about High-Quality Human Data
@@ -400,7 +413,7 @@ <h2>What are Diffusion Models?
<p>[Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of several key papers in the references)]. [Updated on 2022-08-27: Added classifier-free guidance, GLIDE, unCLIP and Imagen. [Updated on 2022-08-31: Added latent diffusion model. [Updated on 2024-04-13: Added progressive distillation, consistency models, and the Model Architecture section.
So far, I’ve written about three types of generative models, GAN, VAE, and Flow-based models. They have shown great success in generating high-quality samples, but each has some limitations of its own....</p>
</section>
-<footer class="entry-footer">Date: July 11, 2021 | Estimated Reading Time: 31 min | Author: Lilian Weng</footer>
+<footer class="entry-footer">Date: July 11, 2021 | Estimated Reading Time: 32 min | Author: Lilian Weng</footer>
<a class="entry-link" aria-label="post link to What are Diffusion Models?" href="https://lilianweng.github.io/posts/2021-07-11-diffusion-models/"></a>
</article>

@@ -482,18 +495,6 @@ <h2>Exploration Strategies in Deep Reinforcement Learning
<footer class="entry-footer">Date: June 7, 2020 | Estimated Reading Time: 36 min | Author: Lilian Weng</footer>
<a class="entry-link" aria-label="post link to Exploration Strategies in Deep Reinforcement Learning" href="https://lilianweng.github.io/posts/2020-06-07-exploration-drl/"></a>
</article>

-<article class="post-entry">
-<header class="entry-header">
-<h2>The Transformer Family
-</h2>
-</header>
-<section class="entry-content">
-<p>[Updated on 2023-01-27: After almost three years, I did a big refactoring update of this post to incorporate a bunch of new Transformer models since 2020. The enhanced version of this post is here: The Transformer Family Version 2.0. Please refer to that post on this topic.] It has been almost two years since my last post on attention. Recent progress on new and enhanced versions of Transformer motivates me to write another post on this specific topic, focusing on how the vanilla Transformer can be improved for longer-term attention span, less memory and computation consumption, RL task solving and more....</p>
-</section>
-<footer class="entry-footer">Date: April 7, 2020 | Estimated Reading Time: 25 min | Author: Lilian Weng</footer>
-<a class="entry-link" aria-label="post link to The Transformer Family" href="https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/"></a>
-</article>
<footer class="page-footer">
<nav class="pagination">
<a class="next" href="https://lilianweng.github.io/page/2/"> »</a>
2 changes: 1 addition & 1 deletion index.json

Large diffs are not rendered by default.

12 changes: 11 additions & 1 deletion index.xml
@@ -6,7 +6,17 @@
<description>Recent content on Lil&#39;Log</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
-<lastBuildDate>Mon, 05 Feb 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://lilianweng.github.io/index.xml" rel="self" type="application/rss+xml" />
+<lastBuildDate>Fri, 12 Apr 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://lilianweng.github.io/index.xml" rel="self" type="application/rss+xml" />
+<item>
+<title>Diffusion Models for Video Generation</title>
+<link>https://lilianweng.github.io/posts/2024-04-12-diffusion-video/</link>
+<pubDate>Fri, 12 Apr 2024 00:00:00 +0000</pubDate>
+
+<guid>https://lilianweng.github.io/posts/2024-04-12-diffusion-video/</guid>
+<description>Diffusion models have demonstrated strong results on image synthesis in past years. Now the research community has started working on a harder task&amp;mdash;using it for video generation. The task itself is a superset of the image case, since an image is a video of 1 frame, and it is much more challenging because
+It has extra requirements on temporal consistency across frames, which naturally demand more world knowledge to be encoded into the model.</description>
+</item>

<item>
<title>Thinking about High-Quality Human Data</title>
<link>https://lilianweng.github.io/posts/2024-02-05-human-data-quality/</link>
25 changes: 12 additions & 13 deletions page/2/index.html
@@ -191,6 +191,18 @@
</header>
<main class="main">

+<article class="post-entry">
+<header class="entry-header">
+<h2>The Transformer Family
+</h2>
+</header>
+<section class="entry-content">
+<p>[Updated on 2023-01-27: After almost three years, I did a big refactoring update of this post to incorporate a bunch of new Transformer models since 2020. The enhanced version of this post is here: The Transformer Family Version 2.0. Please refer to that post on this topic.] It has been almost two years since my last post on attention. Recent progress on new and enhanced versions of Transformer motivates me to write another post on this specific topic, focusing on how the vanilla Transformer can be improved for longer-term attention span, less memory and computation consumption, RL task solving and more....</p>
+</section>
+<footer class="entry-footer">Date: April 7, 2020 | Estimated Reading Time: 25 min | Author: Lilian Weng</footer>
+<a class="entry-link" aria-label="post link to The Transformer Family" href="https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/"></a>
+</article>

<article class="post-entry">
<header class="entry-header">
<h2>Curriculum for Reinforcement Learning
@@ -433,19 +445,6 @@ <h2>Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS
<footer class="entry-footer">Date: October 29, 2017 | Estimated Reading Time: 15 min | Author: Lilian Weng</footer>
<a class="entry-link" aria-label="post link to Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS" href="https://lilianweng.github.io/posts/2017-10-29-object-recognition-part-1/"></a>
</article>

-<article class="post-entry">
-<header class="entry-header">
-<h2>Learning Word Embedding
-</h2>
-</header>
-<section class="entry-content">
-<p>Human vocabulary comes in free text. In order to make a machine learning model understand and process the natural language, we need to transform the free-text words into numeric values. One of the simplest transformation approaches is to do a one-hot encoding in which each distinct word stands for one dimension of the resulting vector and a binary value indicates whether the word presents (1) or not (0).
-However, one-hot encoding is impractical computationally when dealing with the entire vocabulary, as the representation demands hundreds of thousands of dimensions....</p>
-</section>
-<footer class="entry-footer">Date: October 15, 2017 | Estimated Reading Time: 18 min | Author: Lilian Weng</footer>
-<a class="entry-link" aria-label="post link to Learning Word Embedding" href="https://lilianweng.github.io/posts/2017-10-15-word-embedding/"></a>
-</article>
<footer class="page-footer">
<nav class="pagination">
<a class="prev" href="https://lilianweng.github.io/">« </a>
13 changes: 13 additions & 0 deletions page/3/index.html
@@ -191,6 +191,19 @@
</header>
<main class="main">

+<article class="post-entry">
+<header class="entry-header">
+<h2>Learning Word Embedding
+</h2>
+</header>
+<section class="entry-content">
+<p>Human vocabulary comes in free text. In order to make a machine learning model understand and process the natural language, we need to transform the free-text words into numeric values. One of the simplest transformation approaches is to do a one-hot encoding in which each distinct word stands for one dimension of the resulting vector and a binary value indicates whether the word presents (1) or not (0).
+However, one-hot encoding is impractical computationally when dealing with the entire vocabulary, as the representation demands hundreds of thousands of dimensions....</p>
+</section>
+<footer class="entry-footer">Date: October 15, 2017 | Estimated Reading Time: 18 min | Author: Lilian Weng</footer>
+<a class="entry-link" aria-label="post link to Learning Word Embedding" href="https://lilianweng.github.io/posts/2017-10-15-word-embedding/"></a>
+</article>

<article class="post-entry">
<header class="entry-header">
<h2>Anatomize Deep Learning with Information Theory
19 changes: 10 additions & 9 deletions posts/2021-07-11-diffusion-models/index.html

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions posts/2024-02-05-human-data-quality/index.html
@@ -541,6 +541,11 @@ <h1 id="citation">Citation<a hidden class="anchor" aria-hidden="true" href="#cit
<li><a href="https://lilianweng.github.io/tags/human-ai/">human-ai</a></li>
</ul>
<nav class="paginav">
+<a class="prev" href="https://lilianweng.github.io/posts/2024-04-12-diffusion-video/">
+<span class="title">« </span>
+<br>
+<span>Diffusion Models for Video Generation</span>
+</a>
<a class="next" href="https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/">
<span class="title"> »</span>
<br>
Binary file added posts/2024-04-12-diffusion-video/3D-U-net.png
Binary file added posts/2024-04-12-diffusion-video/gen-1.png
Binary file added posts/2024-04-12-diffusion-video/imagen-video.png
765 changes: 765 additions & 0 deletions posts/2024-04-12-diffusion-video/index.html

Large diffs are not rendered by default.

Binary file added posts/2024-04-12-diffusion-video/lumiere.png
Binary file added posts/2024-04-12-diffusion-video/make-a-video.png
Binary file added posts/2024-04-12-diffusion-video/sora.png
Binary file added posts/2024-04-12-diffusion-video/v-param.png
Binary file added posts/2024-04-12-diffusion-video/video-LDM.png
27 changes: 14 additions & 13 deletions posts/index.html
@@ -194,6 +194,19 @@
<h1>Posts</h1>
</header>

+<article class="post-entry">
+<header class="entry-header">
+<h2>Diffusion Models for Video Generation
+</h2>
+</header>
+<section class="entry-content">
+<p>Diffusion models have demonstrated strong results on image synthesis in past years. Now the research community has started working on a harder task—using it for video generation. The task itself is a superset of the image case, since an image is a video of 1 frame, and it is much more challenging because
+It has extra requirements on temporal consistency across frames, which naturally demand more world knowledge to be encoded into the model....</p>
+</section>
+<footer class="entry-footer">Date: April 12, 2024 | Estimated Reading Time: 20 min | Author: Lilian Weng</footer>
+<a class="entry-link" aria-label="post link to Diffusion Models for Video Generation" href="https://lilianweng.github.io/posts/2024-04-12-diffusion-video/"></a>
+</article>

<article class="post-entry">
<header class="entry-header">
<h2>Thinking about High-Quality Human Data
@@ -359,7 +372,7 @@ <h2>What are Diffusion Models?
<p>[Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of several key papers in the references)]. [Updated on 2022-08-27: Added classifier-free guidance, GLIDE, unCLIP and Imagen. [Updated on 2022-08-31: Added latent diffusion model. [Updated on 2024-04-13: Added progressive distillation, consistency models, and the Model Architecture section.
So far, I’ve written about three types of generative models, GAN, VAE, and Flow-based models. They have shown great success in generating high-quality samples, but each has some limitations of its own....</p>
</section>
-<footer class="entry-footer">Date: July 11, 2021 | Estimated Reading Time: 31 min | Author: Lilian Weng</footer>
+<footer class="entry-footer">Date: July 11, 2021 | Estimated Reading Time: 32 min | Author: Lilian Weng</footer>
<a class="entry-link" aria-label="post link to What are Diffusion Models?" href="https://lilianweng.github.io/posts/2021-07-11-diffusion-models/"></a>
</article>

@@ -441,18 +454,6 @@ <h2>Exploration Strategies in Deep Reinforcement Learning
<footer class="entry-footer">Date: June 7, 2020 | Estimated Reading Time: 36 min | Author: Lilian Weng</footer>
<a class="entry-link" aria-label="post link to Exploration Strategies in Deep Reinforcement Learning" href="https://lilianweng.github.io/posts/2020-06-07-exploration-drl/"></a>
</article>

-<article class="post-entry">
-<header class="entry-header">
-<h2>The Transformer Family
-</h2>
-</header>
-<section class="entry-content">
-<p>[Updated on 2023-01-27: After almost three years, I did a big refactoring update of this post to incorporate a bunch of new Transformer models since 2020. The enhanced version of this post is here: The Transformer Family Version 2.0. Please refer to that post on this topic.] It has been almost two years since my last post on attention. Recent progress on new and enhanced versions of Transformer motivates me to write another post on this specific topic, focusing on how the vanilla Transformer can be improved for longer-term attention span, less memory and computation consumption, RL task solving and more....</p>
-</section>
-<footer class="entry-footer">Date: April 7, 2020 | Estimated Reading Time: 25 min | Author: Lilian Weng</footer>
-<a class="entry-link" aria-label="post link to The Transformer Family" href="https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/"></a>
-</article>
<footer class="page-footer">
<nav class="pagination">
<a class="next" href="https://lilianweng.github.io/posts/page/2/"> »</a>
12 changes: 11 additions & 1 deletion posts/index.xml
@@ -6,7 +6,17 @@
<description>Recent content in Posts on Lil&#39;Log</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
-<lastBuildDate>Mon, 05 Feb 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://lilianweng.github.io/posts/index.xml" rel="self" type="application/rss+xml" />
+<lastBuildDate>Fri, 12 Apr 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://lilianweng.github.io/posts/index.xml" rel="self" type="application/rss+xml" />
+<item>
+<title>Diffusion Models for Video Generation</title>
+<link>https://lilianweng.github.io/posts/2024-04-12-diffusion-video/</link>
+<pubDate>Fri, 12 Apr 2024 00:00:00 +0000</pubDate>
+
+<guid>https://lilianweng.github.io/posts/2024-04-12-diffusion-video/</guid>
+<description>Diffusion models have demonstrated strong results on image synthesis in past years. Now the research community has started working on a harder task&amp;mdash;using it for video generation. The task itself is a superset of the image case, since an image is a video of 1 frame, and it is much more challenging because
+It has extra requirements on temporal consistency across frames, which naturally demand more world knowledge to be encoded into the model.</description>
+</item>

<item>
<title>Thinking about High-Quality Human Data</title>
<link>https://lilianweng.github.io/posts/2024-02-05-human-data-quality/</link>
25 changes: 12 additions & 13 deletions posts/page/2/index.html
@@ -194,6 +194,18 @@
<h1>Posts</h1>
</header>

+<article class="post-entry">
+<header class="entry-header">
+<h2>The Transformer Family
+</h2>
+</header>
+<section class="entry-content">
+<p>[Updated on 2023-01-27: After almost three years, I did a big refactoring update of this post to incorporate a bunch of new Transformer models since 2020. The enhanced version of this post is here: The Transformer Family Version 2.0. Please refer to that post on this topic.] It has been almost two years since my last post on attention. Recent progress on new and enhanced versions of Transformer motivates me to write another post on this specific topic, focusing on how the vanilla Transformer can be improved for longer-term attention span, less memory and computation consumption, RL task solving and more....</p>
+</section>
+<footer class="entry-footer">Date: April 7, 2020 | Estimated Reading Time: 25 min | Author: Lilian Weng</footer>
+<a class="entry-link" aria-label="post link to The Transformer Family" href="https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/"></a>
+</article>

<article class="post-entry">
<header class="entry-header">
<h2>Curriculum for Reinforcement Learning
@@ -436,19 +448,6 @@ <h2>Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS
<footer class="entry-footer">Date: October 29, 2017 | Estimated Reading Time: 15 min | Author: Lilian Weng</footer>
<a class="entry-link" aria-label="post link to Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS" href="https://lilianweng.github.io/posts/2017-10-29-object-recognition-part-1/"></a>
</article>

-<article class="post-entry">
-<header class="entry-header">
-<h2>Learning Word Embedding
-</h2>
-</header>
-<section class="entry-content">
-<p>Human vocabulary comes in free text. In order to make a machine learning model understand and process the natural language, we need to transform the free-text words into numeric values. One of the simplest transformation approaches is to do a one-hot encoding in which each distinct word stands for one dimension of the resulting vector and a binary value indicates whether the word presents (1) or not (0).
-However, one-hot encoding is impractical computationally when dealing with the entire vocabulary, as the representation demands hundreds of thousands of dimensions....</p>
-</section>
-<footer class="entry-footer">Date: October 15, 2017 | Estimated Reading Time: 18 min | Author: Lilian Weng</footer>
-<a class="entry-link" aria-label="post link to Learning Word Embedding" href="https://lilianweng.github.io/posts/2017-10-15-word-embedding/"></a>
-</article>
<footer class="page-footer">
<nav class="pagination">
<a class="prev" href="https://lilianweng.github.io/posts/">« </a>