
# Deprecate IaaS and TaaS Nvidia experience #2803


Merged · 4 commits · Apr 14, 2025
inference-dgx-cloud.md: 16 changes (9 additions & 7 deletions)
@@ -8,6 +8,8 @@ authors:

# Serverless Inference with Hugging Face and NVIDIA NIM

+ > **Update:** This service is deprecated and no longer available as of April 10th, 2025. For an alternative, you should consider [Inference Providers](https://huggingface.co/docs/inference-providers/en/index).

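For readers migrating, here is a minimal sketch of the suggested alternative, assuming the model is still served by at least one Inference Provider and that a valid token is exported as `HF_TOKEN` (an illustrative variable name, not mandated by the docs):

```python
import os

from huggingface_hub import InferenceClient

# Route the request through Inference Providers instead of the
# deprecated NIM API (serverless). Availability of this model on any
# given provider is an assumption; check the model page for support.
client = InferenceClient(api_key=os.environ["HF_TOKEN"])

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Count to 10"}],
    max_tokens=100,
)
print(completion.choices[0].message.content)
```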
Today, we are thrilled to announce the launch of **Hugging Face** **NVIDIA NIM API (serverless)**, a new service on the Hugging Face Hub, available to Enterprise Hub organizations. This new service makes it easy to use open models with the accelerated compute platform of [NVIDIA DGX Cloud](https://www.nvidia.com/en-us/data-center/dgx-cloud) for inference serving. We built this solution so that Enterprise Hub users can easily access the latest NVIDIA AI technology in a serverless way to run inference on popular Generative AI models including Llama and Mistral, using standardized APIs and a few lines of code within the [Hugging Face Hub](https://huggingface.co/models).


@@ -25,7 +27,7 @@ NVIDIA NIM API (serverless) complements [Train on DGX Cloud](https://huggingface

## How it works

- Running serverless inference with Hugging Face models has never been easier. Heres a step-by-step guide to get you started:
+ Running serverless inference with Hugging Face models has never been easier. Here's a step-by-step guide to get you started:

_Note: You need access to an Organization with a [Hugging Face Enterprise Hub](https://huggingface.co/enterprise) subscription to run Inference._

@@ -36,15 +38,15 @@ Before you begin, ensure you meet the following requirements:

### Create a Fine-Grained Token

- Fine-grained tokens allow users to create tokens with specific permissions for precise access control to resources and namespaces. First, go to[ Hugging Face Access Tokens](https://huggingface.co/settings/tokens) and click on Create new Token and select fine-grained.
+ Fine-grained tokens allow users to create tokens with specific permissions for precise access control to resources and namespaces. First, go to [Hugging Face Access Tokens](https://huggingface.co/settings/tokens), click on "Create new Token", and select "fine-grained".

<div align="center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/inference-dgx-cloud/fine-grained-token-1.png" alt="Create Token">
</div>



- Enter a Token name and select your Enterprise organization in org permissions as scope and then click Create token. You dont need to select any additional scopes.
+ Enter a "Token name" and select your Enterprise organization in "org permissions" as scope, and then click "Create token". You don't need to select any additional scopes.


<div align="center">
@@ -57,9 +59,9 @@ Now, make sure to save this token value to authenticate your requests later.
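As a hedged aside (the variable name `HF_TOKEN` is illustrative, not required by the service), one way to keep the token out of your source code is to export it as an environment variable and confirm it resolves to your account before sending requests:

```python
import os

from huggingface_hub import whoami

# Read the fine-grained token from the environment
# (set beforehand with, e.g.: export HF_TOKEN=<your token>)
token = os.environ["HF_TOKEN"]

# whoami() raises if the token is invalid; otherwise it returns account
# details, confirming the token works before any inference calls.
print(whoami(token=token)["name"])
```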

### **Find your NIM**

- You can find NVIDIA NIM API (serverless) on the model page of supported Generative AI models. You can find all supported models in this [NVIDIA NIM Collection](https://huggingface.co/collections/nvidia/nim-66a3c6fcdcb5bbc6e975b508), and in the Pricing section.
+ You can find "NVIDIA NIM API (serverless)" on the model page of supported Generative AI models. You can find all supported models in this [NVIDIA NIM Collection](https://huggingface.co/collections/nvidia/nim-66a3c6fcdcb5bbc6e975b508), and in the Pricing section.

- We will use the `meta-llama/Meta-Llama-3-8B-Instruct`. Go the [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model card open Deploy menu, and select NVIDIA NIM API (serverless) - this will open an interface with pre-generated code snippets for Python, Javascript or Curl.
+ We will use `meta-llama/Meta-Llama-3-8B-Instruct`. Go to the [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model card, open the "Deploy" menu, and select "NVIDIA NIM API (serverless)" - this will open an interface with pre-generated code snippets for Python, JavaScript, or curl.
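As a sketch (assuming a `huggingface_hub` release recent enough to include the collections API), the supported models can also be listed programmatically from the collection linked above:

```python
from huggingface_hub import get_collection

# Fetch the NVIDIA NIM collection by its slug (taken from the link above)
collection = get_collection("nvidia/nim-66a3c6fcdcb5bbc6e975b508")

# Print every model in the collection
for item in collection.items:
    if item.item_type == "model":
        print(item.item_id)
```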



@@ -70,7 +72,7 @@ We will use the `meta-llama/Meta-Llama-3-8B-Instruct`. Go the [meta-llama/Meta-L

### **Send your requests**

- NVIDIA NIM API (serverless) is standardized on the OpenAI API. This allows you to use the `openai` sdk for inference. Replace the `YOUR_FINE_GRAINED_TOKEN_HERE` with your fine-grained token and you are ready to run inference.
+ NVIDIA NIM API (serverless) is standardized on the OpenAI API. This allows you to use the `openai` SDK for inference. Replace `YOUR_FINE_GRAINED_TOKEN_HERE` with your fine-grained token and you are ready to run inference.

```python
from openai import OpenAI

# Base URL for the Hub's NIM API (serverless) endpoint, as shown in the
# pre-generated snippets (assumption: unchanged up to the deprecation)
client = OpenAI(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="YOUR_FINE_GRAINED_TOKEN_HERE",
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=100,
)

# Print streamed tokens as they arrive (the final chunk's delta may be empty)
for chunk in chat_completion:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
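Because the endpoint follows the OpenAI chat-completions schema, existing OpenAI-based code can in principle be pointed at it by changing only `base_url` and `api_key`; dropping `stream=True` would return a single response object instead of a token stream.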
@@ -159,7 +161,7 @@ The total cost for a request will depend on the model size, the number of GPUs r
</table>


- Usage fees accrue to your Enterprise Hub Organizations current monthly billing cycle. You can check your current and past usage at any time within the billing settings of your Enterprise Hub Organization.
+ Usage fees accrue to your Enterprise Hub Organization's current monthly billing cycle. You can check your current and past usage at any time within the billing settings of your Enterprise Hub Organization.

**Supported Models**

train-dgx-cloud.md: 2 changes (2 additions & 0 deletions)
@@ -11,6 +11,8 @@ authors:

# Easily Train Models with H100 GPUs on NVIDIA DGX Cloud

+ > **Update:** This service is deprecated and no longer available as of April 10th, 2025.

Today, we are thrilled to announce the launch of **Train on DGX Cloud**, a new service on the Hugging Face Hub, available to Enterprise Hub organizations. Train on DGX Cloud makes it easy to use open models with the accelerated compute infrastructure of NVIDIA DGX Cloud. Together, we built Train on DGX Cloud so that Enterprise Hub users can easily access the latest NVIDIA H100 Tensor Core GPUs to fine-tune popular Generative AI models like Llama, Mistral, and Stable Diffusion in just a few clicks within the [Hugging Face Hub](https://huggingface.co/models).

<div align="center">