diff --git a/README.md b/README.md
index 8044462..40058ae 100755
--- a/README.md
+++ b/README.md
@@ -34,7 +34,7 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
- [Multimodal Regression](https://towardsdatascience.com/anchors-and-multi-bin-loss-for-multi-modal-target-regression-647ea1974617)
- [Paper Reading in 2019](https://towardsdatascience.com/the-200-deep-learning-papers-i-read-in-2019-7fb7034f05f7?source=friends_link&sk=7628c5be39f876b2c05e43c13d0b48a3)
-## 2024-03 (7)
+## 2024-03 (8)
- [Genie: Generative Interactive Environments](https://arxiv.org/abs/2402.15391) [[Notes](paper_notes/genie.md)] [DeepMind, World Model]
- [DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving](https://arxiv.org/abs/2309.09777) [[Notes](paper_notes/drive_dreamer.md)] [Jiwen Lu, World Model]
- [WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens](https://arxiv.org/abs/2401.09985) [[Notes](paper_notes/world_dreamer.md)] [Jiwen Lu, World Model]
@@ -45,7 +45,7 @@ I regularly update [my blog in Toward Data Science](https://medium.com/@patrickl
- [RPT: Robot Learning with Sensorimotor Pre-training](https://arxiv.org/abs/2306.10007) CoRL 2023 Oral
- [DriveGAN: Towards a Controllable High-Quality Neural Simulation](https://arxiv.org/abs/2104.15060) [[Notes](paper_notes/drive_gan.md)] CVPR 2021 oral [Nvidia, Sanja]
- [VideoGPT: Video Generation using VQ-VAE and Transformers](https://arxiv.org/abs/2104.10157) [[Notes](paper_notes/videogpt.md)] [Pieter Abbeel]
-- [LLM and vision intelligence, by Lu Jiang](https://mp.weixin.qq.com/s/Hamz5XMT1tSZHKdPaCBTKg) [Interview]
+- [LLM, Vision Tokenizer and Vision Intelligence, by Lu Jiang](https://mp.weixin.qq.com/s/Hamz5XMT1tSZHKdPaCBTKg) [[Notes](paper_notes/llm_vision_intel.md)] [Interview Lu Jiang]
- [LVM: Sequential Modeling Enables Scalable Learning for Large Vision Models](https://arxiv.org/abs/2312.00785) [Large Vision Models]
- [OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving](https://arxiv.org/abs/2311.16038) [Jiwen Lu, World Model]
- [GenAD: Generative End-to-End Autonomous Driving](https://arxiv.org/abs/2402.11502)
diff --git a/paper_notes/drive_dreamer.md b/paper_notes/drive_dreamer.md
index 4bf12f4..3814dfd 100644
--- a/paper_notes/drive_dreamer.md
+++ b/paper_notes/drive_dreamer.md
@@ -15,6 +15,8 @@ The dynamics of the world model is actually controlled by a simplistic RNN model
The model is mainly focused on single cam scenarios, but the authors demo'ed in the appendix that it can be easily expanded to multicam scenario. --> The first solid multicam work is [Drive WM (Drive into the Future)](drive_wm.md).
+[WorldDreamer](world_dreamer.md), from the same group, appears to be an extension of [DriveDreamer](drive_dreamer.md). Disappointingly, [WorldDreamer](world_dreamer.md) seems unfinished and rushed onto arXiv, without much comparison with contemporary work.
+
#### Key ideas
- Training is multi-stage. --> Seems that this is the norm for all world models, like GAIA-1.
- Stage 1: AutoDM (Autonomous driving diffusion model)
diff --git a/paper_notes/llm_vision_intel.md b/paper_notes/llm_vision_intel.md
new file mode 100644
index 0000000..6b0dd78
--- /dev/null
+++ b/paper_notes/llm_vision_intel.md
@@ -0,0 +1,27 @@
+# [LLM, Vision Tokenizer and Vision Intelligence, by Lu Jiang](https://mp.weixin.qq.com/s/Hamz5XMT1tSZHKdPaCBTKg)
+
+_March 2024_
+
+tl;dr: Interview with Lu Jiang on diffusion- vs language-model-based video generation, and on why a better visual tokenizer is the key to unlocking visual intelligence with LLMs.
+
+### LLM and Diffusion Models in Video Generation
+* There are two primary technological approaches in video generation: diffusion-based techniques and those based on language models.
+  * Very recently, VideoPoet uses a language model, while WALT uses diffusion. Both use the [MagVit V2](magvit_v2.md) tokenizer.
+* Diffusion iterations: pixel diffusion --> latent diffusion --> latent diffusion with a transformer backbone replacing the UNet backbone (DiT).
+* Diffusion now dominates ~90% of research, largely due to the open-sourced Stable Diffusion.
+* Language modeling of visual content actually predates diffusion, with early instances like ImageGPT and subsequent developments like DALL-E, although DALL-E 2 transitioned to diffusion.
+* While diffusion and large language models (LLMs) are categorized separately for ease of understanding, the boundary between them is increasingly blurred: diffusion methods are progressively incorporating techniques from language models.
+
+
+### LLMs and True Visual Intelligence
+- A more refined **visual tokenizer** integrated with LLMs can be a path to visual intelligence. The text modality already comes with a tokenizer and a "natural language" system; an analogous language system is needed for the visual domain (a toy sketch follows this list).
+- Although language models have great potential, they do not understand the specific goals of generation tasks. The tokenizer establishes connections between tokens that clarify the task at hand for the model, letting the LLM exploit its full potential. Thus, **if a model doesn't understand its current generation task, the issue lies not with the language model itself but in our failure to find a way for it to comprehend the task.** --> Why?
+- [MagVit V2](magvit_v2.md) shows that a good tokenizer connected to a language model can immediately achieve better results than the best diffusion models.
+- To enhance control, such as precise control and generation through conversation, and even to achieve the intelligence mentioned above, we need to train the **best foundation model** possible. **Precise control is typically a downstream issue.** The better the foundation model, the better the downstream tasks will be. Different foundation models have unique features, and problems present in one may simply disappear in a new foundation model.
+- **Stable Diffusion has not yet been scaled up successfully**. Transformers scale up more easily and have many established training recipes. The largest diffusion models have ~7-8B parameters, while the largest transformer models are two orders of magnitude larger, at ~1T parameters.
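+
+To make the tokenizer-plus-LLM argument above concrete, below is a minimal, illustrative PyTorch sketch (my own toy code, not MagVit V2 or anything from the interview; `ToyVQTokenizer`, `ToyVisualLM`, and all sizes are made-up assumptions): a VQ-style tokenizer turns patch features into discrete codebook indices, and a small transformer then models those indices exactly the way an LLM models text tokens.
+
+```python
+import torch
+import torch.nn as nn
+
+class ToyVQTokenizer(nn.Module):
+    """Maps patch features to nearest-codebook indices: discrete visual 'words'."""
+    def __init__(self, patch_dim=48, codebook_size=1024, latent_dim=32):
+        super().__init__()
+        self.encoder = nn.Linear(patch_dim, latent_dim)
+        self.codebook = nn.Embedding(codebook_size, latent_dim)
+
+    def forward(self, patches):                                  # (B, N, patch_dim)
+        z = self.encoder(patches)                                # (B, N, latent_dim)
+        # squared distance to every codebook entry, then nearest-neighbor lookup
+        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
+        return d.argmin(dim=-1)                                  # (B, N) integer token ids
+
+class ToyVisualLM(nn.Module):
+    """A tiny decoder-style transformer that models visual tokens like text tokens."""
+    def __init__(self, vocab=1024, d_model=128, n_layers=2, n_heads=4):
+        super().__init__()
+        self.embed = nn.Embedding(vocab, d_model)
+        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
+        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
+        self.head = nn.Linear(d_model, vocab)
+
+    def forward(self, tokens):                                   # (B, N)
+        n = tokens.size(1)
+        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
+        h = self.blocks(self.embed(tokens), mask=causal)
+        return self.head(h)                                      # next-token logits over the visual vocab
+
+patches = torch.randn(2, 16, 48)        # 2 toy clips, 16 patches each
+tokens = ToyVQTokenizer()(patches)      # a discrete "visual sentence"
+logits = ToyVisualLM()(tokens)          # LLM-style prediction over visual tokens
+print(tokens.shape, logits.shape)       # torch.Size([2, 16]) torch.Size([2, 16, 1024])
+```
+
+The only point of the sketch is the interface: once visual content is a sequence of discrete tokens, the "language system for the visual domain" is in place and standard LLM machinery applies unchanged.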
+
+
+### References
+* MaskGIT enhances image generation efficiency through parallel decoding, significantly accelerating the process while improving image quality.
+* Unlike diffusion or autoregressive models, Muse operates in a discrete token space, employing masked training that achieved state-of-the-art performance and efficiency at the time.
+* Magvit (Masked Generative Video Transformer): introduced a 3D tokenizer that quantizes videos into spatio-temporal visual tokens.
diff --git a/paper_notes/video_ldm.md b/paper_notes/video_ldm.md
index 3b43c16..b11f6ec 100644
--- a/paper_notes/video_ldm.md
+++ b/paper_notes/video_ldm.md
@@ -13,7 +13,7 @@ Two main advantages of video LDM is the computationally efficiency, and ability
It is also cited by Sora as one comparison baseline. Video LDM is widely used in research projects due to its simplicity and compute efficiency.
-The temporal consistency of the long drive video is still not very good, without fixed appearances for a given object. Similar to that in [Drive into the Future](drive_wm.md).
+The temporal consistency of the long driving videos is still NOT good: the appearance of a given object does not stay fixed over time. Similar to that in [Drive into the Future](drive_wm.md). --> This is significantly improved by [SVD: Stable Video Diffusion](https://arxiv.org/abs/2311.15127), which is a native video model.
Video generation displays multimodality, but not controllability (it is conditioned on simple weather conditions, and crowdedness, and optionally bbox). In this sense it is NOT a world model.
diff --git a/paper_notes/world_dreamer.md b/paper_notes/world_dreamer.md
index 61e24bc..9f8c3ce 100644
--- a/paper_notes/world_dreamer.md
+++ b/paper_notes/world_dreamer.md
@@ -9,9 +9,7 @@ The model takes in a variety of modalities such as image/video, text, actions, a
World models hold great promise for learning motion and physics in the genral world, essential for coherent and reasonable video generation.
-> During training, MaskGIT is trained on a similar proxy task to the mask prediction in BERT. At inference time, MaskGIT adopts a novel non-autoregressive decoding method to synthesize an image in constant number of steps.
-
-The paper seems unfinished and rushed to release on Arxiv, without much comparison with contemporary work. The paper is also heavily inspired by MaskGIT, especially the masked token prediction and parallel decoding.
+[WorldDreamer](world_dreamer.md) appears to be an extension of [DriveDreamer](drive_dreamer.md). Disappointingly, it seems unfinished and rushed onto arXiv, without much comparison with contemporary work. The paper is also heavily inspired by MaskGIT, especially the masked token prediction and parallel decoding.
#### Key ideas
- Architecture
@@ -34,7 +32,7 @@ The paper seems unfinished and rushed to release on Arxiv, without much comparis
- The key assumption underlying the effectiveness of the parallel decoding is a Markovian
property that many tokens are conditionally independent given other tokens. (From [MaskGIT](https://masked-generative-image-transformer.github.io/) and Muse)
- [PySceneDetect](https://github.com/Breakthrough/PySceneDetect) to detect scene switching
-
+- The idea of masked token prediction was first proposed for image generation in MaskGIT, then extended by Muse to text-to-image generation. During training, MaskGIT is trained on a proxy task similar to the mask prediction in BERT. At inference time, it adopts a novel non-autoregressive decoding method that synthesizes an image in a constant number of steps (a toy sketch of this decoding loop follows the list).
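+
+As a rough illustration of this MaskGIT-style parallel decoding (a toy sketch under my own assumptions, not the official MaskGIT or WorldDreamer code; the `parallel_decode` helper, the cosine schedule, and the random stand-in model are all made up): every step predicts all tokens in one forward pass, keeps the most confident ones, and re-masks the rest, so the whole token canvas is synthesized in a fixed number of steps rather than one token at a time.
+
+```python
+import math
+import torch
+
+def parallel_decode(model, num_tokens, mask_id, steps=8):
+    """Fill a fully-masked canvas in a constant number of parallel steps."""
+    tokens = torch.full((1, num_tokens), mask_id)             # everything starts as [MASK]
+    for t in range(steps):
+        logits = model(tokens)                                # one forward pass predicts ALL positions
+        logits[..., mask_id] = float("-inf")                  # never predict the mask token itself
+        conf, pred = logits.softmax(-1).max(-1)               # per-position confidence and best token
+        still_masked = tokens == mask_id
+        conf = torch.where(still_masked, conf, torch.full_like(conf, float("inf")))
+        tokens = torch.where(still_masked, pred, tokens)      # tentatively commit every prediction
+        # cosine schedule: how many of the least confident tokens get re-masked this step
+        n_mask = int(num_tokens * math.cos(math.pi / 2 * (t + 1) / steps))
+        if n_mask == 0:
+            break                                             # final step: everything stays committed
+        remask = conf.topk(n_mask, largest=False).indices     # least confident positions
+        tokens[0, remask[0]] = mask_id
+    return tokens
+
+# Toy stand-in "model": random logits instead of a trained masked transformer.
+vocab_size = 1024                                             # last id reserved as [MASK]
+toy_model = lambda toks: torch.randn(toks.shape[0], toks.shape[1], vocab_size)
+out = parallel_decode(toy_model, num_tokens=256, mask_id=vocab_size - 1)
+print(out.shape)  # torch.Size([1, 256]); 8 parallel steps instead of 256 autoregressive ones
+```
+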
#### Notes
- Questions and notes on how to improve/revise the current work