
Commit

Merge pull request #219 from dusty-nv/20241013-gsheets
fixed charts and tables
dusty-nv authored Oct 13, 2024
2 parents 4baba94 + 3391635 commit a9c45f5
Showing 8 changed files with 18 additions and 12 deletions.
1 change: 1 addition & 0 deletions docs/images/nano_vlm_benchmarks.svg
4 changes: 2 additions & 2 deletions docs/openvla.md
@@ -10,10 +10,10 @@
✅ On-device training with LoRAs on Jetson AGX Orin and full fine-tuning on A100/H100 instances
✅ 85% accuracy on an example block-stacking task with domain randomization
✅ Sample datasets and test models for reproducing results
🟩 sim2real with Isaac Sim and ROS2 integration
<!--🟩 sim2real with Isaac Sim and ROS2 integration
🟩 Multi-frame/multi-camera image inputs with prior state
🟩 Action windowing across multiple frames for larger timesteps
🟩 Similar test model for UGV rover along with onboard sim environment
🟩 Similar test model for UGV rover along with onboard sim environment-->

Thank you to OpenVLA, Open X-Embodiment, MimicGen, Robosuite, and the many others doing related work for sharing their promising research, models, and tools for advancing physical AI and robotics.

6 changes: 4 additions & 2 deletions docs/tutorial-intro.md
@@ -22,9 +22,10 @@ Give your locally running LLM access to vision!

| | |
| :---------- | :----------------------------------- |
| **[LLaVA](./tutorial_llava.md)** | [Large Language and Vision Assistant](https://llava-vl.github.io/), multimodal model that combines a vision encoder and LLM for visual and language understanding. |
| **[LLaVA](./tutorial_llava.md)** | Different ways to run [LLaVa](https://llava-vl.github.io/) vision/language model on Jetson for visual understanding. |
| **[Live LLaVA](./tutorial_live-llava.md)** | Run multimodal models interactively on live video streams over a repeating set of prompts. |
| **[NanoVLM](./tutorial_nano-vlm.md)** | Use mini vision/language models and the optimized multimodal pipeline for live streaming. |
| **[Llama 3.2 Vision](./llama_vlm.md)** | Run Meta's multimodal Llama-3.2-11B-Vision model on Orin with HuggingFace Transformers. |

### Vision Transformers

@@ -42,7 +43,8 @@ Give your locally running LLM access to vision!
| :---------- | :----------------------------------- |
| **[Flux + ComfyUI](./tutorial_comfyui_flux.md)** | Set up and run the ComfyUI with Flux model for image generation on Jetson Orin. |
| **[Stable Diffusion](./tutorial_stable-diffusion.md)** | Run AUTOMATIC1111's [`stable-diffusion-webui`](https://github.com/AUTOMATIC1111/stable-diffusion-webui) to generate images from prompts. |
| **[Stable Diffusion XL](./tutorial_stable-diffusion-xl.md)** | A newer ensemble pipeline consisting of a base model and refiner that results in significantly enhanced and detailed image generation capabilities. |
| **[SDXL](./tutorial_stable-diffusion-xl.md)** | Ensemble pipeline consisting of a base model and refiner with enhanced image generation. |
| **[nerfstudio](./nerf.md)** | Experience neural reconstruction and rendering with nerfstudio and onboard training. |


### Audio
4 changes: 2 additions & 2 deletions docs/tutorial_live-llava.md
Expand Up @@ -2,7 +2,7 @@

!!! abstract "Recommended"

Follow the chat-based [LLaVA](tutorial_llava.md) and [NanoVLM](tutorial_nano-vlm.md) tutorials to familiarize yourself with vision/language models and test the models first.
Follow the [NanoVLM](tutorial_nano-vlm.md) tutorial first to familiarize yourself with vision/language models, and see [Agent Studio](agent_studio.md) for an interactive pipeline editor built from live VLMs.

This multimodal agent runs a vision-language model on a live camera feed or video stream, repeatedly applying the same prompts to it:

@@ -54,7 +54,7 @@ jetson-containers run $(autotag nano_llm) \
--video-output webrtc://@:8554/output
```

<a href="https://youtu.be/wZq7ynbgRoE" target="_blank"><img width="49%" style="display: inline;" src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava_horse.jpg"> <img width="49%" style="display: inline;" src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava_espresso.jpg"></a>
<!--<a href="https://youtu.be/wZq7ynbgRoE" target="_blank"><img width="49%" style="display: inline;" src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava_horse.jpg"> <img width="49%" style="display: inline;" src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava_espresso.jpg"></a>-->

This uses [`jetson_utils`](https://github.com/dusty-nv/jetson-utils) for video I/O, and for options related to protocols and file formats, see [Camera Streaming and Multimedia](https://github.com/dusty-nv/jetson-inference/blob/master/docs/aux-streaming.md). In the example above, it captures a V4L2 USB webcam connected to the Jetson (under the device `/dev/video0`) and outputs a WebRTC stream.
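Under the hood, a minimal sketch of that capture/render loop using the `jetson_utils` Python bindings might look like the following (the device and stream URIs are assumptions carried over from the example above, and the VLM step is elided):

```python
from jetson_utils import videoSource, videoOutput

# A minimal sketch of the video I/O layer the Live LLaVA agent builds on.
# The URIs below are assumptions matching the example above -- swap in your
# own camera device or output stream as needed.
camera = videoSource("/dev/video0")              # V4L2 USB webcam
stream = videoOutput("webrtc://@:8554/output")   # WebRTC stream viewable in-browser

while True:
    img = camera.Capture()
    if img is None:        # capture timeout, no frame received yet
        continue
    # ... the VLM agent would run its recurring prompts on `img` here ...
    stream.Render(img)
    if not camera.IsStreaming() or not stream.IsStreaming():
        break
```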

1 change: 1 addition & 0 deletions docs/tutorial_nano-llm.md
@@ -73,6 +73,7 @@ Here's an index of the various tutorials & examples using NanoLLM on Jetson AI L
| **[Live LLaVA](./tutorial_live-llava.md){:target="_blank"}** | Realtime live-streaming vision/language models on recurring prompts. |
| **[Nano VLM](./tutorial_nano-vlm.md){:target="_blank"}** | Efficient multimodal pipeline with one-shot image tagging and RAG support. |
| **[Agent Studio](./agent_studio.md){:target="_blank"}** | Rapidly design and experiment with creating your own automation agents. |
| **[OpenVLA](./openvla.md){:target="_blank"}** | Robot learning with Vision/Language Action models and manipulation in simulator. |

<div><iframe width="500" height="280" src="https://www.youtube.com/embed/UOjqF3YCGkY" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
<iframe width="500" height="280" src="https://www.youtube.com/embed/wZq7ynbgRoE" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
6 changes: 3 additions & 3 deletions docs/tutorial_nano-vlm.md
@@ -6,11 +6,11 @@ There are 3 model families currently supported: [Llava](https://llava-vl.github

## VLM Benchmarks

<iframe width="719" height="446" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=88720541&amp;format=interactive"></iframe>
<img src="images/nano_vlm_benchmarks.svg">

This FPS measures the end-to-end pipeline performance for continuous streaming, as with [Live Llava](tutorial_live-llava.md) (on a yes/no question).

<iframe width="1000px" height="325px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubhtml?gid=642302170&amp;single=true&amp;widget=true&amp;headers=false"></iframe>
<iframe width="1000px" height="325px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR0o4iAkwm4vRnDy2LdlSNhGf9sn7zzAf4RN7oLOLUSnTyLO5x94BrN8tq_uChRzQR-fSHNYmkZwO8v/pubhtml?gid=642302170&amp;single=true&amp;widget=true&amp;headers=false"></iframe>

## Multimodal Chat
@@ -77,7 +77,7 @@ You can also use [`--prompt /data/prompts/images.json`](https://github.com/dusty

### Results

<iframe width="1325px" height="905px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubhtml?gid=816702382&amp;single=true&amp;widget=true&amp;headers=false"></iframe>
<iframe width="1325px" height="905px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR0o4iAkwm4vRnDy2LdlSNhGf9sn7zzAf4RN7oLOLUSnTyLO5x94BrN8tq_uChRzQR-fSHNYmkZwO8v/pubhtml?gid=816702382&amp;single=true&amp;widget=true&amp;headers=false"></iframe>

<small>• &nbsp; The model responses were generated with 4-bit quantization enabled, and are truncated to 128 tokens for brevity.</small>
<small>• &nbsp; These chat questions and images are from [`/data/prompts/images.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/images.json){:target="_blank"} (found in jetson-containers).</small>
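These results come from the multimodal chat pipeline above; a rough sketch of driving it programmatically through NanoLLM's Python API is shown below. The model name, keyword arguments, and image path are assumptions recalled from the NanoLLM documentation and may differ from the current API, so treat it as illustrative only.

```python
from nano_llm import NanoLLM, ChatHistory

# Illustrative sketch only -- model choice and argument names are assumptions;
# consult the NanoLLM documentation for the authoritative interface.
model = NanoLLM.from_pretrained(
    "Efficient-Large-Model/VILA1.5-3b",   # example 4-bit quantized VLM
    api="mlc",
    quantization="q4f16_ft",
)

chat = ChatHistory(model)
chat.append(role="user", image="/data/images/hoover.jpg")   # hypothetical test image
chat.append(role="user", msg="What does the image show?")

embedding, _ = chat.embed_chat()
reply = model.generate(
    embedding,
    streaming=False,
    kv_cache=chat.kv_cache,
    max_new_tokens=128,
)
print(reply)
```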
2 changes: 1 addition & 1 deletion docs/tutorial_slm.md
@@ -10,7 +10,7 @@ This tutorial shows how to run optimized SLMs with quantization using the [`Nano

![](./svgs/SLM%20Text%20Generation%20Rate.svg)

![alt text](images/Small%20Language%20Models%20(4-bit%20Quantization).png)
<iframe width="1125px" height="350px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vR0o4iAkwm4vRnDy2LdlSNhGf9sn7zzAf4RN7oLOLUSnTyLO5x94BrN8tq_uChRzQR-fSHNYmkZwO8v/pubhtml?gid=921468602&amp;single=true&amp;widget=true&amp;headers=false"></iframe>

> <small>• &nbsp; The HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard){:target="_blank"} is a collection of multitask benchmarks including reasoning & comprehension, math, coding, history, geography, etc.</small>
> <small>• &nbsp; The model's memory footprint includes 4-bit weights and KV cache at full context length (factor in extra for process overhead, library code, etc.)</small>
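As a back-of-envelope illustration of that note, here is a small sketch of the memory estimate (the shape parameters are illustrative assumptions, not NanoLLM's own accounting):

```python
def estimate_memory_gb(params_billion, layers, kv_heads, head_dim,
                       context_len, weight_bits=4, kv_bytes=2):
    """Rough footprint: quantized weights plus fp16 KV cache (K and V)
    at full context length. Process/library overhead is not included."""
    weights  = params_billion * 1e9 * weight_bits / 8
    kv_cache = 2 * layers * kv_heads * head_dim * context_len * kv_bytes
    return (weights + kv_cache) / 1e9

# Example with Llama-2-7B-like shape parameters (illustrative assumptions):
print(f"{estimate_memory_gb(7, 32, 32, 128, 4096):.1f} GB")   # ~5.6 GB
```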
6 changes: 4 additions & 2 deletions mkdocs.yml
@@ -54,6 +54,8 @@ markdown_extensions:
- md_in_html
- tables
- pymdownx.emoji:
#emoji_index: !!python/name:materialx.emoji.twemoji
#emoji_generator: !!python/name:materialx.emoji.to_svg
emoji_index: !!python/name:material.extensions.emoji.twemoji
emoji_generator: !!python/name:material.extensions.emoji.to_svg
- pymdownx.critic
@@ -88,13 +90,13 @@ nav:
- ollama: tutorial_ollama.md
- llamaspeak: tutorial_llamaspeak.md
- NanoLLM: tutorial_nano-llm.md
- Small LLM (SLM) 🆕: tutorial_slm.md
- Small LLM (SLM): tutorial_slm.md
- API Examples: tutorial_api-examples.md
- Text + Vision (VLM):
- LLaVA: tutorial_llava.md
- Live LLaVA: tutorial_live-llava.md
- NanoVLM: tutorial_nano-vlm.md
- Llama 3.2 Vision 🆕: llama_vlm.md
- Llama 3.2 Vision: llama_vlm.md
- Vision Transformers (ViT):
- vit/index.md
- EfficientViT: vit/tutorial_efficientvit.md
