
Commit

Merge pull request #219 from dusty-nv/20241013-gsheets
fixed charts and tables
dusty-nv authored Oct 13, 2024
2 parents 4baba94 + 3391635 commit a9c45f5
Showing 8 changed files with 18 additions and 12 deletions.
1 change: 1 addition & 0 deletions docs/images/nano_vlm_benchmarks.svg
4 changes: 2 additions & 2 deletions docs/openvla.md
@@ -10,10 +10,10 @@
✅ On-device training with LoRAs on Jetson AGX Orin and full fine-tuning on A100/H100 instances
✅ 85% accuracy on an example block-stacking task with domain randomization
✅ Sample datasets and test models for reproducing results
🟩 sim2real with Isaac Sim and ROS2 integration
<!--🟩 sim2real with Isaac Sim and ROS2 integration
🟩 Multi-frame/multi-camera image inputs with prior state
🟩 Action windowing across multiple frames for larger timesteps
🟩 Similar test model for UGV rover along with onboard sim environment
🟩 Similar test model for UGV rover along with onboard sim environment-->

Thank you to OpenVLA, Open X-Embodiment, MimicGen, Robosuite, and the many others doing related work for sharing their promising research, models, and tools for advancing physical AI and robotics.

6 changes: 4 additions & 2 deletions docs/tutorial-intro.md
@@ -22,9 +22,10 @@ Give your locally running LLM access to vision!

| | |
| :---------- | :----------------------------------- |
| **[LLaVA](./tutorial_llava.md)** | [Large Language and Vision Assistant](https://llava-vl.github.io/), multimodal model that combines a vision encoder and LLM for visual and language understanding. |
| **[LLaVA](./tutorial_llava.md)** | Different ways to run [LLaVa](https://llava-vl.github.io/) vision/language model on Jetson for visual understanding. |
| **[Live LLaVA](./tutorial_live-llava.md)** | Run multimodal models interactively on live video streams over a repeating set of prompts. |
| **[NanoVLM](./tutorial_nano-vlm.md)** | Use mini vision/language models and the optimized multimodal pipeline for live streaming. |
| **[Llama 3.2 Vision](./llama_vlm.md)** | Run Meta's multimodal Llama-3.2-11B-Vision model on Orin with HuggingFace Transformers. |

### Vision Transformers

@@ -42,7 +43,8 @@ Give your locally running LLM access to vision!
| :---------- | :----------------------------------- |
| **[Flux + ComfyUI](./tutorial_comfyui_flux.md)** | Set up and run the ComfyUI with Flux model for image generation on Jetson Orin. |
| **[Stable Diffusion](./tutorial_stable-diffusion.md)** | Run AUTOMATIC1111's [`stable-diffusion-webui`](https://github.com/AUTOMATIC1111/stable-diffusion-webui) to generate images from prompts. |
| **[Stable Diffusion XL](./tutorial_stable-diffusion-xl.md)** | A newer ensemble pipeline consisting of a base model and refiner that results in significantly enhanced and detailed image generation capabilities. |
| **[SDXL](./tutorial_stable-diffusion-xl.md)** | Ensemble pipeline consisting of a base model and refiner with enhanced image generation. |
| **[nerfstudio](./nerf.md)** | Experience neural reconstruction and rendering with nerfstudio and onboard training. |


### Audio
4 changes: 2 additions & 2 deletions docs/tutorial_live-llava.md
Expand Up @@ -2,7 +2,7 @@

!!! abstract "Recommended"

Follow the chat-based [LLaVA](tutorial_llava.md) and [NanoVLM](tutorial_nano-vlm.md) tutorials to familiarize yourself with vision/language models and test the models first.
Follow the [NanoVLM](tutorial_nano-vlm.md) tutorial first to familiarize yourself with vision/language models, and see [Agent Studio](agent_studio.md) for an interactive pipeline editor built from live VLMs.

This multimodal agent runs a vision-language model on a live camera feed or video stream, repeatedly applying the same prompts to it:

@@ -54,7 +54,7 @@ jetson-containers run $(autotag nano_llm) \
--video-output webrtc://@:8554/output
```

<a href="https://youtu.be/wZq7ynbgRoE" target="_blank"><img width="49%" style="display: inline;" src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava_horse.jpg"> <img width="49%" style="display: inline;" src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava_espresso.jpg"></a>
<!--<a href="https://youtu.be/wZq7ynbgRoE" target="_blank"><img width="49%" style="display: inline;" src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava_horse.jpg"> <img width="49%" style="display: inline;" src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava_espresso.jpg"></a>-->

This uses [`jetson_utils`](https://github.com/dusty-nv/jetson-utils) for video I/O, and for options related to protocols and file formats, see [Camera Streaming and Multimedia](https://github.com/dusty-nv/jetson-inference/blob/master/docs/aux-streaming.md). In the example above, it captures a V4L2 USB webcam connected to the Jetson (under the device `/dev/video0`) and outputs a WebRTC stream.
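Under the hood, a minimal sketch of that capture/render loop using the `jetson_utils` Python bindings might look like the following (the device and stream URIs are assumptions carried over from the example above, and the VLM step is elided):

```python
from jetson_utils import videoSource, videoOutput

# A minimal sketch of the video I/O layer the Live LLaVA agent builds on.
# The URIs below are assumptions matching the example above -- swap in your
# own camera device or output stream as needed.
camera = videoSource("/dev/video0")              # V4L2 USB webcam
stream = videoOutput("webrtc://@:8554/output")   # WebRTC stream viewable in-browser

while True:
    img = camera.Capture()
    if img is None:        # capture timeout, no frame received yet
        continue
    # ... the VLM agent would run its recurring prompts on `img` here ...
    stream.Render(img)
    if not camera.IsStreaming() or not stream.IsStreaming():
        break
```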

1 change: 1 addition & 0 deletions docs/tutorial_nano-llm.md
@@ -73,6 +73,7 @@ Here's an index of the various tutorials & examples using NanoLLM on Jetson AI L
| **[Live LLaVA](./tutorial_live-llava.md){:target="_blank"}** | Realtime live-streaming vision/language models on recurring prompts. |
| **[Nano VLM](./tutorial_nano-vlm.md){:target="_blank"}** | Efficient multimodal pipeline with one-shot image tagging and RAG support. |
| **[Agent Studio](./agent_studio.md){:target="_blank"}** | Rapidly design and experiment with creating your own automation agents. |
| **[OpenVLA](./openvla.md){:target="_blank"}** | Robot learning with Vision/Language Action models and manipulation in simulator. |

<div><iframe width="500" height="280" src="https://www.youtube.com/embed/UOjqF3YCGkY" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<iframe width="500" height="280" src="https://www.youtube.com/embed/wZq7ynbgRoE" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
6 changes: 3 additions & 3 deletions docs/tutorial_nano-vlm.md
@@ -6,11 +6,11 @@ There are 3 model families currently supported: [Llava](https://llava-vl.github

## VLM Benchmarks

<iframe width="719" height="446" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=88720541&amp;format=interactive"></iframe>
<img src="images/nano_vlm_benchmarks.svg">

This FPS measures the end-to-end pipeline performance for continuous streaming, as with [Live Llava](tutorial_live-llava.md) (on a yes/no question).

<iframe width="1000px" height="325px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubhtml?gid=642302170&amp;single=true&amp;widget=true&amp;headers=false"></iframe>
<iframe width="1000px" height="325px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR0o4iAkwm4vRnDy2LdlSNhGf9sn7zzAf4RN7oLOLUSnTyLO5x94BrN8tq_uChRzQR-fSHNYmkZwO8v/pubhtml?gid=642302170&amp;single=true&amp;widget=true&amp;headers=false"></iframe>

## Multimodal Chat
@@ -77,7 +77,7 @@ You can also use [`--prompt /data/prompts/images.json`](https://github.com/dusty

### Results

<iframe width="1325px" height="905px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubhtml?gid=816702382&amp;single=true&amp;widget=true&amp;headers=false"></iframe>
<iframe width="1325px" height="905px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR0o4iAkwm4vRnDy2LdlSNhGf9sn7zzAf4RN7oLOLUSnTyLO5x94BrN8tq_uChRzQR-fSHNYmkZwO8v/pubhtml?gid=816702382&amp;single=true&amp;widget=true&amp;headers=false"></iframe>

<small>• &nbsp; The model responses were generated with 4-bit quantization enabled, and are truncated to 128 tokens for brevity.</small>
<small>• &nbsp; These chat questions and images are from [`/data/prompts/images.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/images.json){:target="_blank"} (found in jetson-containers).</small>
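These results come from the multimodal chat pipeline above; a rough sketch of driving it programmatically through NanoLLM's Python API is shown below. The model name, keyword arguments, and image path are assumptions recalled from the NanoLLM documentation and may differ from the current API, so treat it as illustrative only.

```python
from nano_llm import NanoLLM, ChatHistory

# Illustrative sketch only -- model choice and argument names are assumptions;
# consult the NanoLLM documentation for the authoritative interface.
model = NanoLLM.from_pretrained(
    "Efficient-Large-Model/VILA1.5-3b",   # example 4-bit quantized VLM
    api="mlc",
    quantization="q4f16_ft",
)

chat = ChatHistory(model)
chat.append(role="user", image="/data/images/hoover.jpg")   # hypothetical test image
chat.append(role="user", msg="What does the image show?")

embedding, _ = chat.embed_chat()
reply = model.generate(
    embedding,
    streaming=False,
    kv_cache=chat.kv_cache,
    max_new_tokens=128,
)
print(reply)
```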
2 changes: 1 addition & 1 deletion docs/tutorial_slm.md
@@ -10,7 +10,7 @@ This tutorial shows how to run optimized SLMs with quantization using the [`Nano

![](./svgs/SLM%20Text%20Generation%20Rate.svg)

![alt text](images/Small%20Language%20Models%20(4-bit%20Quantization).png)
<iframe width="1125px" height="350px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR0o4iAkwm4vRnDy2LdlSNhGf9sn7zzAf4RN7oLOLUSnTyLO5x94BrN8tq_uChRzQR-fSHNYmkZwO8v/pubhtml?gid=921468602&amp;single=true&amp;widget=true&amp;headers=false"></iframe>

> <small>• &nbsp; The HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard){:target="_blank"} is a collection of multitask benchmarks including reasoning & comprehension, math, coding, history, geography, etc.</small>
> <small>• &nbsp; The model's memory footprint includes 4-bit weights and KV cache at full context length (factor in extra for process overhead, library code, etc.)</small>
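As a back-of-envelope illustration of that note, here is a small sketch of the memory estimate (the shape parameters are illustrative assumptions, not NanoLLM's own accounting):

```python
def estimate_memory_gb(params_billion, layers, kv_heads, head_dim,
                       context_len, weight_bits=4, kv_bytes=2):
    """Rough footprint: quantized weights plus fp16 KV cache (K and V)
    at full context length. Process/library overhead is not included."""
    weights  = params_billion * 1e9 * weight_bits / 8
    kv_cache = 2 * layers * kv_heads * head_dim * context_len * kv_bytes
    return (weights + kv_cache) / 1e9

# Example with Llama-2-7B-like shape parameters (illustrative assumptions):
print(f"{estimate_memory_gb(7, 32, 32, 128, 4096):.1f} GB")   # ~5.6 GB
```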
6 changes: 4 additions & 2 deletions mkdocs.yml
@@ -54,6 +54,8 @@ markdown_extensions:
- md_in_html
- tables
- pymdownx.emoji:
#emoji_index: !!python/name:materialx.emoji.twemoji
#emoji_generator: !!python/name:materialx.emoji.to_svg
emoji_index: !!python/name:material.extensions.emoji.twemoji
emoji_generator: !!python/name:material.extensions.emoji.to_svg
- pymdownx.critic
@@ -88,13 +90,13 @@ nav:
- ollama: tutorial_ollama.md
- llamaspeak: tutorial_llamaspeak.md
- NanoLLM: tutorial_nano-llm.md
- Small LLM (SLM) 🆕: tutorial_slm.md
- Small LLM (SLM): tutorial_slm.md
- API Examples: tutorial_api-examples.md
- Text + Vision (VLM):
- LLaVA: tutorial_llava.md
- Live LLaVA: tutorial_live-llava.md
- NanoVLM: tutorial_nano-vlm.md
- Llama 3.2 Vision 🆕: llama_vlm.md
- Llama 3.2 Vision: llama_vlm.md
- Vision Transformers (ViT):
- vit/index.md
- EfficientViT: vit/tutorial_efficientvit.md
