From 3391635d0a3235b8b2ad2144c722c88fa9cae4d5 Mon Sep 17 00:00:00 2001
From: Dustin Franklin
Date: Sun, 13 Oct 2024 03:27:26 -0400
Subject: [PATCH] fixed charts and tables

---
 docs/images/nano_vlm_benchmarks.svg | 1 +
 docs/openvla.md                     | 4 ++--
 docs/tutorial-intro.md              | 6 ++++--
 docs/tutorial_live-llava.md         | 4 ++--
 docs/tutorial_nano-llm.md           | 1 +
 docs/tutorial_nano-vlm.md           | 6 +++---
 docs/tutorial_slm.md                | 2 +-
 mkdocs.yml                          | 6 ++++--
 8 files changed, 18 insertions(+), 12 deletions(-)
 create mode 100644 docs/images/nano_vlm_benchmarks.svg

diff --git a/docs/images/nano_vlm_benchmarks.svg b/docs/images/nano_vlm_benchmarks.svg
new file mode 100644
index 00000000..12f2de5c
--- /dev/null
+++ b/docs/images/nano_vlm_benchmarks.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/openvla.md b/docs/openvla.md
index d1e22551..f3d4ec23 100644
--- a/docs/openvla.md
+++ b/docs/openvla.md
@@ -10,10 +10,10 @@
 ✅ On-device training with LoRA's on Jetson AGX Orin and full fine-tuning on A100/H100 instances
 ✅ 85% accuracy on an example block-stacking task with domain randomization
 ✅ Sample datasets and test models for reproducing results
-
 🟩 sim2real with Isaac Sim and ROS2 integration
+
 Thank you to OpenVLA, Open X-Embodiment, MimicGen, Robosuite and many others with related work for sharing their promising research, models, and tools for advancing physical AI and robotics.
diff --git a/docs/tutorial-intro.md b/docs/tutorial-intro.md
index 83e0577b..bd91d9d3 100644
--- a/docs/tutorial-intro.md
+++ b/docs/tutorial-intro.md
@@ -22,9 +22,10 @@ Give your locally running LLM an access to vision!
 |  |  |
 | :---------- | :----------------------------------- |
-| **[LLaVA](./tutorial_llava.md)** | [Large Language and Vision Assistant](https://llava-vl.github.io/), multimodal model that combines a vision encoder and LLM for visual and language understanding. |
+| **[LLaVA](./tutorial_llava.md)** | Different ways to run the [LLaVa](https://llava-vl.github.io/) vision/language model on Jetson for visual understanding. |
 | **[Live LLaVA](./tutorial_live-llava.md)** | Run multimodal models interactively on live video streams over a repeating set of prompts. |
 | **[NanoVLM](./tutorial_nano-vlm.md)** | Use mini vision/language models and the optimized multimodal pipeline for live streaming. |
+| **[Llama 3.2 Vision](./llama_vlm.md)** | Run Meta's multimodal Llama-3.2-11B-Vision model on Orin with HuggingFace Transformers. |
 ### Vision Transformers
@@ -42,7 +43,8 @@ Give your locally running LLM an access to vision!
 | :---------- | :----------------------------------- |
 | **[Flux + ComfyUI](./tutorial_comfyui_flux.md)** | Set up and run the ComfyUI with Flux model for image generation on Jetson Orin. |
 | **[Stable Diffusion](./tutorial_stable-diffusion.md)** | Run AUTOMATIC1111's [`stable-diffusion-webui`](https://github.com/AUTOMATIC1111/stable-diffusion-webui) to generate images from prompts |
-| **[Stable Diffusion XL](./tutorial_stable-diffusion-xl.md)** | A newer ensemble pipeline consisting of a base model and refiner that results in significantly enhanced and detailed image generation capabilities. |
+| **[SDXL](./tutorial_stable-diffusion-xl.md)** | Ensemble pipeline consisting of a base model and refiner with enhanced image generation. |
+| **[nerfstudio](./nerf.md)** | Experience neural reconstruction and rendering with nerfstudio and onboard training. |
 ### Audio
diff --git a/docs/tutorial_live-llava.md b/docs/tutorial_live-llava.md
index edae91e1..be1d8141 100644
--- a/docs/tutorial_live-llava.md
+++ b/docs/tutorial_live-llava.md
@@ -2,7 +2,7 @@
 !!! abstract "Recommended"
-    Follow the chat-based [LLaVA](tutorial_llava.md) and [NanoVLM](tutorial_nano-vlm.md) tutorials to familiarize yourself with vision/language models and test the models first.
+    Follow the [NanoVLM](tutorial_nano-vlm.md) tutorial first to familiarize yourself with vision/language models, and see [Agent Studio](agent_studio.md) for an interactive pipeline editor built from live VLMs.
 This multimodal agent runs a vision-language model on a live camera feed or video stream, repeatedly applying the same prompts to it:
@@ -54,7 +54,7 @@ jetson-containers run $(autotag nano_llm) \
   --video-output webrtc://@:8554/output
 ```
-
+
 This uses [`jetson_utils`](https://github.com/dusty-nv/jetson-utils) for video I/O, and for options related to protocols and file formats, see [Camera Streaming and Multimedia](https://github.com/dusty-nv/jetson-inference/blob/master/docs/aux-streaming.md). In the example above, it captures a V4L2 USB webcam connected to the Jetson (under the device `/dev/video0`) and outputs a WebRTC stream.
diff --git a/docs/tutorial_nano-llm.md b/docs/tutorial_nano-llm.md
index c47ef576..9791edf5 100644
--- a/docs/tutorial_nano-llm.md
+++ b/docs/tutorial_nano-llm.md
@@ -73,6 +73,7 @@ Here's an index of the various tutorials & examples using NanoLLM on Jetson AI L
 | **[Live LLaVA](./tutorial_live-llava.md){:target="_blank"}** | Realtime live-streaming vision/language models on recurring prompts. |
 | **[Nano VLM](./tutorial_nano-vlm.md){:target="_blank"}** | Efficient multimodal pipeline with one-shot image tagging and RAG support. |
 | **[Agent Studio](./agent_studio.md){:target="_blank"}** | Rapidly design and experiment with creating your own automation agents. |
+| **[OpenVLA](./openvla.md){:target="_blank"}** | Robot learning with Vision/Language Action models and manipulation in simulation. |
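The Live LLaVA hunk above notes that video I/O goes through [`jetson_utils`](https://github.com/dusty-nv/jetson-utils), capturing a V4L2 USB webcam at `/dev/video0` and publishing a WebRTC stream at `webrtc://@:8554/output`. As a minimal sketch of just that capture/render path (no VLM inference, and assuming the `jetson_utils` Python bindings are available in the container), the loop looks roughly like this:

```python
from jetson_utils import videoSource, videoOutput

camera = videoSource("/dev/video0")              # V4L2 USB webcam input
output = videoOutput("webrtc://@:8554/output")   # WebRTC stream viewable in a browser

while camera.IsStreaming() and output.IsStreaming():
    img = camera.Capture()    # returns None on timeout
    if img is None:
        continue
    output.Render(img)        # the VLM agent would run its recurring prompts on img here
```

The agent launched by the `jetson-containers` command in the hunk wraps this same I/O path around the vision/language model; protocol and format options for the input and output URIs are covered in the Camera Streaming and Multimedia page linked there.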
diff --git a/docs/tutorial_nano-vlm.md b/docs/tutorial_nano-vlm.md
index 3c7c09af..1c50bcc2 100644
--- a/docs/tutorial_nano-vlm.md
+++ b/docs/tutorial_nano-vlm.md
@@ -6,11 +6,11 @@ There are 3 model families currently supported: [Llava](https://llava-vl.github
 ## VLM Benchmarks
-
+
 This FPS measures the end-to-end pipeline performance for continuous streaming like with [Live Llava](tutorial_live-llava.md) (on yes/no question)
-
+
 ## Multimodal Chat
@@ -77,7 +77,7 @@ You can also use [`--prompt /data/prompts/images.json`](https://github.com/dusty
 ### Results
-
+
 •&nbsp;&nbsp; The model responses are with 4-bit quantization enabled, and are truncated to 128 tokens for brevity.
 •&nbsp;&nbsp; These chat questions and images are from [`/data/prompts/images.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/images.json){:target="_blank"} (found in jetson-containers)
diff --git a/docs/tutorial_slm.md b/docs/tutorial_slm.md
index 5ac51cfd..e6ba3b11 100644
--- a/docs/tutorial_slm.md
+++ b/docs/tutorial_slm.md
@@ -10,7 +10,7 @@ This tutorial shows how to run optimized SLMs with quantization using the [`Nano
 ![](./svgs/SLM%20Text%20Generation%20Rate.svg)
-![alt text](images/Small%20Language%20Models%20(4-bit%20Quantization).png)
+
 > •&nbsp;&nbsp; The HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard){:target="_blank"} is a collection of multitask benchmarks including reasoning & comprehension, math, coding, history, geography, ect.
 > •&nbsp;&nbsp; The model's memory footprint includes 4-bit weights and KV cache at full context length (factor in extra for process overhead, library code, ect)
diff --git a/mkdocs.yml b/mkdocs.yml
index 7436b0a9..7f632610 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -54,6 +54,8 @@ markdown_extensions:
   - md_in_html
   - tables
   - pymdownx.emoji:
+      #emoji_index: !!python/name:materialx.emoji.twemoji
+      #emoji_generator: !!python/name:materialx.emoji.to_svg
       emoji_index: !!python/name:material.extensions.emoji.twemoji
       emoji_generator: !!python/name:material.extensions.emoji.to_svg
   - pymdownx.critic
@@ -88,13 +90,13 @@ nav:
   - ollama: tutorial_ollama.md
   - llamaspeak: tutorial_llamaspeak.md
   - NanoLLM: tutorial_nano-llm.md
-  - Small LLM (SLM) 🆕: tutorial_slm.md
+  - Small LLM (SLM): tutorial_slm.md
   - API Examples: tutorial_api-examples.md
   - Text + Vision (VLM):
     - LLaVA: tutorial_llava.md
     - Live LLaVA: tutorial_live-llava.md
     - NanoVLM: tutorial_nano-vlm.md
-    - Llama 3.2 Vision 🆕: llama_vlm.md
+    - Llama 3.2 Vision: llama_vlm.md
   - Vision Transformers (ViT):
     - vit/index.md
     - EfficientViT: vit/tutorial_efficientvit.md
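The SLM benchmark note above states that a model's reported memory footprint is its 4-bit weights plus the KV cache at full context length, with extra for process overhead and library code. As a rough back-of-envelope illustration of that accounting (the parameter count, layer/head geometry, and fp16 cache size below are hypothetical values, not figures from the benchmark charts):

```python
def slm_memory_gb(params_billions, n_layers, n_kv_heads, head_dim,
                  context_len, weight_bits=4, kv_bytes=2):
    """Rough footprint estimate: quantized weights + K/V caches at full context.

    Assumes fp16 cache entries (kv_bytes=2) and n_kv_heads key/value heads;
    real usage adds process overhead, library code, and activation buffers.
    """
    weights = params_billions * 1e9 * weight_bits / 8
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes  # K and V
    return (weights + kv_cache) / 1e9

# Example: a hypothetical ~3B-parameter SLM with 26 layers, 8 KV heads,
# 128-dim heads, and a 4096-token context window.
print(f"{slm_memory_gb(3.0, 26, 8, 128, 4096):.2f} GB before overhead")   # ~1.94 GB
```

The KV cache term grows linearly with context length, so at longer context windows it can approach the size of the quantized weights themselves.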