Commit 3d29acf

Upstream changes for v0.3.0 release (#29)
- Detailed changes mentioned in CHANGELOG.md file
1 parent 5c1b121 commit 3d29acf

126 files changed

+5459
-1033
lines changed

.gitignore

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# Python Exclusions
+.venv
+**__pycache__**
+
+# Helm Exclusions
+**/charts/*.tgz
+
+# project temp files
+deploy/*.log
+deploy/*.txt
+
+# Docker Compose exclusions
+volumes/
+uploaded_files/

CHANGELOG.md

Lines changed: 33 additions & 6 deletions
@@ -3,25 +3,52 @@ All notable changes to this project will be documented in this file.

 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.3.0] - 2024-01-22
+
+### Added
+
+- [New dedicated example](./docs/rag/aiplayground.md) showcasing Nvidia AI Playground based models using Langchain connectors.
+- [New example](./RetrievalAugmentedGeneration/README.md#5-qa-chatbot-with-task-decomposition-example----a100h100l40s) demonstrating query decomposition.
+- Support for using [PG Vector as a vector database in the developer RAG canonical example.](./RetrievalAugmentedGeneration/README.md#deploying-with-pgvector-vector-store)
+- Support for using Speech-in Speech-out interface in the sample frontend leveraging RIVA Skills.
+- New tool showcasing [RAG observability support.](./tools/observability/)
+- Support for on-prem deployment of [TRTLLM based nemotron models.](./RetrievalAugmentedGeneration/README.md#6-qa-chatbot----nemotron-model)
+
+### Changed
+
+- Upgraded Langchain and llamaindex dependencies for all containers.
+- Restructured [README](./README.md) files for better intuitiveness.
+- Added provision to plug in multiple examples using [a common base class](./RetrievalAugmentedGeneration/common/base.py).
+- Changed `minio` service's port to `9010` from `9000` in docker based deployment.
+- Moved `evaluation` directory from top level to under `tools` and created a [dedicated compose file](./deploy/compose/docker-compose-evaluation.yaml).
+- Added an [experimental directory](./experimental/) for plugging in experimental features.
+- Modified notebooks to use TRTLLM and Nvidia AI foundation based connectors from langchain.
+- Changed `ai-playground` model engine name to `nv-ai-foundation` in configurations.
+
+### Fixed
+
+- [Fixed issue #19](https://github.com/NVIDIA/GenerativeAIExamples/issues/19)
+

 ## [0.2.0] - 2023-12-15

 ### Added

-- Support for using [Nvidia AI Foundational LLM models](./docs/rag/aiplayground.md#using-nvdia-cloud-based-llms)
-- Support for using [Nvidia AI Foundational embedding models](./docs/rag/aiplayground.md#using-nvidia-cloud-based-embedding-models)
+- Support for using [Nvidia AI Playground based LLM models](./docs/rag/aiplayground.md)
+- Support for using [Nvidia AI Playground based embedding models](./docs/rag/aiplayground.md)
 - Support for [deploying and using quantized LLM models](./docs/rag/llm_inference_server.md#quantized-llama2-model-deployment)
-- Support for [evaluating RAG pipeline](./evaluation/README.md)
+- Support for Kubernetes deployment using Helm charts
+- Support for [evaluating RAG pipeline](./tools/evaluation/README.md)

 ### Changed

 - Repository restructuring to allow better open source contributions
 - [Upgraded dependencies](./RetrievalAugmentedGeneration/Dockerfile) for chain server container
-- [Upgraded NeMo Inference Framework container version](./RetrievalAugmentedGeneration/llm-inference-server/Dockerfile), no separate sign up needed now for access.
+- [Upgraded NeMo Inference Framework container version](./RetrievalAugmentedGeneration/llm-inference-server/Dockerfile), no separate sign up needed for access.
 - Main [README](./README.md) now provides more details.
 - Documentation improvements.
-- Better error handling and reporting mechanism for corner cases.
-- Renamed `triton-inference-server` container and service to `llm-inference-server`
+- Better error handling and reporting mechanism for corner cases
+- Renamed `triton-inference-server` container to `llm-inference-server`

 ### Fixed
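The 0.3.0 entry about plugging in multiple examples through a common base class can be pictured with a small sketch. The contents of `common/base.py` are not shown in this diff, so every name below (`BaseExample`, `ingest_docs`, `rag_chain`, `EXAMPLES`, `load_example`) is an assumption made for illustration, not the repository's actual interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Type


class BaseExample(ABC):
    """Hypothetical contract shared by every pluggable example."""

    @abstractmethod
    def ingest_docs(self, file_path: str) -> None:
        """Register a document so later queries can draw on it."""

    @abstractmethod
    def rag_chain(self, query: str) -> Iterator[str]:
        """Answer a query from ingested documents, yielding text chunks."""


class KeywordExample(BaseExample):
    """Toy implementation: 'retrieval' is a substring match on file names."""

    def __init__(self) -> None:
        self.docs: List[str] = []

    def ingest_docs(self, file_path: str) -> None:
        self.docs.append(file_path)

    def rag_chain(self, query: str) -> Iterator[str]:
        hits = [d for d in self.docs if query.lower() in d.lower()]
        yield f"{len(hits)} matching document(s): "
        yield ", ".join(hits)


# Hypothetical registry: a server would pick one example by name at startup.
EXAMPLES: Dict[str, Type[BaseExample]] = {"keyword": KeywordExample}


def load_example(name: str) -> BaseExample:
    """Instantiate the example registered under `name`."""
    try:
        return EXAMPLES[name]()
    except KeyError:
        raise ValueError(f"unknown example: {name!r}") from None


example = load_example("keyword")
example.ingest_docs("gpu_guide.pdf")
example.ingest_docs("release_notes.txt")
print("".join(example.rag_chain("guide")))
```

Because the caller depends only on the abstract methods, a new example can be added by registering another subclass without touching the calling code.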

README.md

Lines changed: 48 additions & 21 deletions
@@ -8,40 +8,67 @@ Generative AI Examples uses resources from the [NVIDIA NGC AI Development Catalo

 Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to access:

-- The GPU-optimized NVIDIA containers, models, scripts, and tools used in these examples
-- The latest NVIDIA upstream contributions to the respective programming frameworks
-- The latest NVIDIA Deep Learning and LLM software libraries
-- Release notes for each of the NVIDIA optimized containers
-- Links to developer documentation
+- GPU-optimized containers used in these examples
+- Release notes and developer documentation

 ## Retrieval Augmented Generation (RAG)

-A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to a Large Language Model. RAG lets users use an LLM to chat with their own data.
+A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to an LLM. RAG lets users chat with their data!

-| Name | Description | LLM | Framework | Multi-GPU | Multi-node | Embedding | TRT-LLM | Triton | VectorDB | K8s |
-|---------------|-----------------------|------------|-------------------------|-----------|------------|-------------|---------|--------|----------|-----|
-| [Linux developer RAG](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/RetrievalAugmentedGeneration) | Single VM, single GPU | llama2-13b | Langchain + Llama Index | No | No | e5-large-v2 | Yes | Yes | Milvus | No |
-| [Windows developer RAG](https://github.com/NVIDIA/trt-llm-rag-windows) | RAG on Windows | llama2-13b | Llama Index | No | No | NA | Yes | No | FAISS | NA |
-| [Developer LLM Operator for Kubernetes](./docs/developer-llm-operator/) | Single node, single GPU | llama2-13b | Langchain + Llama Index | No | No | e5-large-v2 | Yes | Yes | Milvus | Yes |
+### Developer RAG Examples

+The developer RAG examples run on a single VM. They demonstrate how to combine NVIDIA GPU acceleration with popular LLM programming frameworks using NVIDIA's [open source connectors](#open-source-integrations). The examples are easy to deploy via [Docker Compose](https://docs.docker.com/compose/).

-## Large Language Models
-NVIDIA LLMs are optimized for building enterprise generative AI applications.
+Examples support local and remote inference endpoints. If you have a GPU, you can run inference locally via [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). If you don't have a GPU, you can run inference and embedding remotely via [NVIDIA AI Foundations endpoints](https://www.nvidia.com/en-us/ai-data-science/foundation-models/).

-| Name | Description | Type | Context Length | Example | License |
-|---------------|-----------------------|------------|----------------|---------|---------|
-| [nemotron-3-8b-qa-4k](https://huggingface.co/nvidia/nemotron-3-8b-qa-4k) | Q&A LLM customized on knowledge bases | Text Generation | 4096 | No | [NVIDIA AI Foundation Models Community License Agreement](https://developer.nvidia.com/downloads/nv-ai-foundation-models-license) |
-| [nemotron-3-8b-chat-4k-steerlm](https://huggingface.co/nvidia/nemotron-3-8b-chat-4k-steerlm) | Best out-of-the-box chat model with flexible alignment at inference | Text Generation | 4096 | No | [NVIDIA AI Foundation Models Community License Agreement](https://developer.nvidia.com/downloads/nv-ai-foundation-models-license) |
-| [nemotron-3-8b-chat-4k-rlhf](https://huggingface.co/nvidia/nemotron-3-8b-chat-4k-rlhf) | Best out-of-the-box chat model performance| Text Generation | 4096 | No | [NVIDIA AI Foundation Models Community License Agreement](https://developer.nvidia.com/downloads/nv-ai-foundation-models-license) |
+| Model | Embedding | Framework | Description | Multi-GPU | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
+|---------------|-----------------------|------------|-------------------------|-----------|------------|-------------|---------|--------|
+| llama-2 | e5-large-v2 | Llamaindex | Canonical QA Chatbot | [YES](RetrievalAugmentedGeneration/README.md#3-qa-chatbot-multi-gpu----a100h100l40s) | [YES](RetrievalAugmentedGeneration/README.md#2-qa-chatbot----a100h100l40s-gpu) | No | YES | Milvus/[PGVector](RetrievalAugmentedGeneration/README.md#2-qa-chatbot----a100h100l40s-gpu)|
+| mixtral_8x7b | nvolveqa_40k | Langchain | [Nvidia AI foundation based QA Chatbot](RetrievalAugmentedGeneration/README.md#1-qa-chatbot----nvidia-ai-foundation-inference-endpoint) | No | No | YES | YES | FAISS|
+| llama-2 | all-MiniLM-L6-v2 | Llama Index | [QA Chatbot, GeForce, Windows](https://github.com/NVIDIA/trt-llm-rag-windows/tree/release/1.0) | NO | YES | NO | NO | FAISS |
+| llama-2 | nvolveqa_40k | Langchain | [QA Chatbot, Task Decomposition Agent](./RetrievalAugmentedGeneration/README.md#5-qa-chatbot-with-task-decomposition-example----a100h100l40s) | No | No | YES | YES | FAISS |
+| mixtral_8x7b | nvolveqa_40k | Langchain | [Minimalistic example showcasing RAG using Nvidia AI foundation models](./examples/README.md#rag-in-5-minutes-example) | No | No | YES | YES | FAISS|


-## Integration Examples
+
+### Enterprise RAG Examples
+
+The enterprise RAG examples run as microservices distributed across multiple VMs and GPUs. They show how RAG pipelines can be orchestrated with [Kubernetes](https://kubernetes.io/) and deployed with [Helm](https://helm.sh/).
+
+Enterprise RAG examples include a [Kubernetes operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) for LLM lifecycle management. It is compatible with the [NVIDIA GPU operator](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/gpu-operator) that automates GPU discovery and lifecycle management in a Kubernetes cluster.
+
+Enterprise RAG examples also support local and remote inference via [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [NVIDIA AI Foundations endpoints](https://www.nvidia.com/en-us/ai-data-science/foundation-models/).
+
+| Model | Embedding | Framework | Description | Multi-GPU | Multi-node | TRT-LLM | NVIDIA AI Foundation | Triton | Vector Database |
+|---------------|-----------------------|------------|--------|-------------------------|-----------|------------|-------------|---------|--------|
+| llama-2 | NV-Embed-QA-003 | Llamaindex | QA Chatbot, Helm, k8s | NO | NO | [YES](./docs/developer-llm-operator/) | NO | YES | Milvus|
+
+## Tools
+
+Example tools and tutorials to enhance LLM development and productivity when using NVIDIA RAG pipelines.
+
+| Name | Description | Deployment | Tutorial |
+|------|-------------|------|--------|
+| Evaluation | Example open source RAG eval tool that uses synthetic data generation and LLM-as-a-judge | [Docker compose file](./deploy/compose/docker-compose-evaluation.yaml) | [README](./docs/rag/evaluation.md) |
+| Observability | Observability serves as an efficient mechanism for both monitoring and debugging RAG pipelines. | [Docker compose file](./deploy/compose/docker-compose-observability.yaml) | [README](./docs/rag/observability.md) |
+
+## Open Source Integrations
+
+These are open source connectors for NVIDIA-hosted and self-hosted API endpoints. These open source connectors are maintained and tested by NVIDIA engineers.
+
+| Name | Framework | Chat | Text Embedding | Python | Description |
+|------|-----------|------|-----------|--------|-------------|
+|[NVIDIA AI Foundation Endpoints](https://python.langchain.com/docs/integrations/providers/nvidia) | [Langchain](https://www.langchain.com/) |[YES](https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints)|[YES](https://python.langchain.com/docs/integrations/text_embedding/nvidia_ai_endpoints)|[YES](https://pypi.org/project/langchain-nvidia-ai-endpoints/)|Easy access to NVIDIA hosted models. Supports chat, embedding, code generation, steerLM, multimodal, and RAG.|
+|[NVIDIA Triton + TensorRT-LLM](https://github.com/langchain-ai/langchain/tree/master/libs/partners/nvidia-trt) | [Langchain](https://www.langchain.com/) |[YES](https://github.com/langchain-ai/langchain/blob/master/libs/partners/nvidia-trt/docs/llms.ipynb)|[YES](https://github.com/langchain-ai/langchain/blob/master/libs/partners/nvidia-trt/docs/llms.ipynb)|[YES](https://pypi.org/project/langchain-nvidia-trt/)|This connector allows Langchain to remotely interact with a Triton inference server over gRPC or HTTP for optimized LLM inference.|
+|[NVIDIA Triton Inference Server](https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_triton.html) | [LlamaIndex](https://www.llamaindex.ai/) |YES|YES|NO|Triton inference server provides API access to hosted LLM models over gRPC. |
+|[NVIDIA TensorRT-LLM](https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_tensorrt.html) | [LlamaIndex](https://www.llamaindex.ai/) |YES|YES|NO|TensorRT-LLM provides a Python API to build TensorRT engines with state-of-the-art optimizations for LLM inference on NVIDIA GPUs. |
+

 ## NVIDIA support
-In each of the READMEs, we indicate the level of support provided.
+In each example README we indicate the level of support provided.

 ## Feedback / Contributions
-We're posting these examples on GitHub to better support the community, facilitate feedback, as well as collect and implement contributions using GitHub Issues and pull requests. We welcome all contributions!
+We're posting these examples on GitHub to support the NVIDIA LLM community and facilitate feedback. We invite contributions via GitHub Issues or pull requests!

 ## Known issues
 - In each of the READMEs, we indicate any known issues and encourage the community to provide feedback.
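The new README text says the developer examples run inference locally via TensorRT-LLM when a GPU is present and fall back to NVIDIA AI Foundation endpoints otherwise. A minimal sketch of that decision might look like the following. The `nv-ai-foundation` engine name comes from the CHANGELOG; the `tensorrt-llm` label, the `InferenceBackend` type, and the `nvidia-smi` detection heuristic are assumptions for illustration and are not taken from the repository.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceBackend:
    engine: str    # label for the model engine (partly hypothetical, see above)
    location: str  # "local" or "remote"


def gpu_available() -> bool:
    """Heuristic (assumption): treat a GPU as usable if `nvidia-smi` is on PATH."""
    return shutil.which("nvidia-smi") is not None


def choose_backend(has_gpu: bool) -> InferenceBackend:
    """Prefer local TensorRT-LLM inference; otherwise use hosted endpoints."""
    if has_gpu:
        return InferenceBackend(engine="tensorrt-llm", location="local")
    return InferenceBackend(engine="nv-ai-foundation", location="remote")


backend = choose_backend(gpu_available())
print(f"Using {backend.engine} ({backend.location} inference)")
```

Keeping detection separate from selection makes the choice easy to override, for example to force remote inference on a machine whose GPU is too small for the model.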

RetrievalAugmentedGeneration/.gitattributes

Lines changed: 0 additions & 1 deletion
This file was deleted.

RetrievalAugmentedGeneration/.gitignore

Lines changed: 0 additions & 25 deletions
This file was deleted.
Lines changed: 10 additions & 2 deletions
@@ -1,14 +1,22 @@
 ARG BASE_IMAGE_URL=nvcr.io/nvidia/pytorch
 ARG BASE_IMAGE_TAG=23.08-py3

-
 FROM ${BASE_IMAGE_URL}:${BASE_IMAGE_TAG}
+
+ARG EXAMPLE_NAME
 COPY RetrievalAugmentedGeneration/__init__.py /opt/RetrievalAugmentedGeneration/
 COPY RetrievalAugmentedGeneration/common /opt/RetrievalAugmentedGeneration/common
-COPY RetrievalAugmentedGeneration/examples /opt/RetrievalAugmentedGeneration/examples
+COPY RetrievalAugmentedGeneration/examples/${EXAMPLE_NAME} /opt/RetrievalAugmentedGeneration/example
 COPY integrations /opt/integrations
+COPY tools /opt/tools
+RUN apt-get update && apt-get install -y libpq-dev
 RUN --mount=type=bind,source=RetrievalAugmentedGeneration/requirements.txt,target=/opt/requirements.txt \
     python3 -m pip install --no-cache-dir -r /opt/requirements.txt

+RUN if [ -f "/opt/RetrievalAugmentedGeneration/example/requirements.txt" ] ; then \
+    python3 -m pip install --no-cache-dir -r /opt/RetrievalAugmentedGeneration/example/requirements.txt ; else \
+    echo "Skipping example dependency installation, since requirements.txt was not found" ; \
+    fi
+
 WORKDIR /opt
 ENTRYPOINT ["uvicorn", "RetrievalAugmentedGeneration.common.server:app"]
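With this change the image copies a single example, selected by the `EXAMPLE_NAME` build argument, instead of the whole `examples` directory. The sketch below assembles a matching `docker build` invocation without running it. The Dockerfile path, the image tag, and the example name `developer_rag` are assumptions for illustration, since the diff does not name them. The build context is the repository root because the `COPY` sources are relative to it.

```python
import shlex
from typing import List


def docker_build_command(
    example_name: str,
    tag: str,
    dockerfile: str = "RetrievalAugmentedGeneration/Dockerfile",  # assumed path
    context: str = ".",  # repository root, so the COPY sources resolve
) -> List[str]:
    """Return the argv for building the image with one example baked in."""
    if not example_name or "/" in example_name:
        raise ValueError("example_name must be a single directory name")
    return [
        "docker", "build",
        "--build-arg", f"EXAMPLE_NAME={example_name}",
        "-f", dockerfile,
        "-t", tag,
        context,
    ]


# Hypothetical example name and tag; print the command instead of executing it.
cmd = docker_build_command("developer_rag", "chain-server:dev")
print(shlex.join(cmd))
```

The `RUN --mount=type=bind` instruction in the Dockerfile requires BuildKit, so the command should be run with a Docker version that has BuildKit enabled. Passing the list to `subprocess.run(cmd, check=True)` would execute it.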
