Merge branch 'latest-readme' into 'master'
Update to README to include info on MIG and deployment via helm

See merge request nvidia/kubernetes/gpu-feature-discovery!36
nvjmayo committed Jul 14, 2020
2 parents c359dd5 + f03c7fd commit fc3d5bf
Showing 3 changed files with 386 additions and 50 deletions.
306 changes: 256 additions & 50 deletions README.md

## Overview

NVIDIA GPU Feature Discovery for Kubernetes is a software component that allows
you to automatically generate labels for the set of GPUs available on a node.
It leverages the [Node Feature
Discovery](https://github.com/kubernetes-sigs/node-feature-discovery)
to perform this labeling.

## Beta Version

This tool should be considered beta until it reaches `v1.0.0`. As such, we may
break the API before reaching `v1.0.0`, but we will set up a deprecation policy
to ease the transition.

## Prerequisites

* [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) version > 2.0 and its [prerequisites](https://github.com/nvidia/nvidia-docker/wiki/Installation-\(version-2.0\))
* docker configured with nvidia as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime).
* Kubernetes version >= 1.10
* NVIDIA device plugin for Kubernetes (see how to [setup](https://github.com/NVIDIA/k8s-device-plugin))
* NFD deployed on each node you want to label with the local source configured
* When deploying GPU feature discovery with helm (as described below) we provide a way to automatically deploy NFD for you
* To deploy NFD yourself, please see https://github.com/kubernetes-sigs/node-feature-discovery

## Quick Start

The following assumes you have at least one node in your cluster with GPUs and
the standard NVIDIA [drivers](https://www.nvidia.com/Download/index.aspx) have
already been installed on it.

### Node Feature Discovery (NFD)

The first step is to make sure that [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery)
is running on every node you want to label. NVIDIA GPU Feature Discovery uses
the `local` source, so be sure that NFD mounts the directory GFD writes its
feature file to (`/etc/kubernetes/node-feature-discovery/features.d/` by
default). See https://github.com/kubernetes-sigs/node-feature-discovery for
more details.
You also need to configure the `Node Feature Discovery` to only expose vendor
IDs in the PCI source. To do so, please refer to the Node Feature Discovery
documentation.
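Depending on your NFD version, the relevant part of the `nfd-worker`
configuration might look like the following sketch (key names as documented by
NFD; adjust to your version):

```yaml
# nfd-worker.conf (sketch) -- restrict PCI labels to vendor IDs only
sources:
  pci:
    deviceClassWhitelist:
      - "03"          # display controllers (GPUs)
    deviceLabelFields:
      - vendor        # only expose the vendor ID field
```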

The following command will deploy NFD with the minimum required set of
parameters to run `gpu-feature-discovery`.

```shell
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.2.0-rc.1/deployments/static/nfd.yaml
```

**Note:** This is a simple static daemonset meant to demonstrate the basic
features required of `node-feature-discovery` in order to successfully run
`gpu-feature-discovery`. Please see the instructions below for [Deployment via
`helm`](#deployment-via-helm) when deploying in a production setting.
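You can verify that NFD is labeling your nodes by looking for labels with the
`feature.node.kubernetes.io` prefix (the node name below is a placeholder):

```shell
$ kubectl describe node <your-node-name> | grep feature.node.kubernetes.io
```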

### Preparing your GPU Nodes

Be sure that [nvidia-docker2](https://github.com/NVIDIA/nvidia-docker) is
installed on your GPU nodes and Docker default runtime is set to `nvidia`. See
https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime.
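For reference, a typical `/etc/docker/daemon.json` with `nvidia` as the default
runtime looks like the sketch below (the runtime path may differ on your
distribution; restart docker after editing):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```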

### Deploy NVIDIA GPU Feature Discovery
### Deploy NVIDIA GPU Feature Discovery (GFD)

The next step is to run NVIDIA GPU Feature Discovery on each node, either as a
Daemonset or as a Job.

#### Daemonset

```shell
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.2.0-rc.1/deployments/static/gpu-feature-discovery-daemonset.yaml
```

**Note:** This is a simple static daemonset meant to demonstrate the basic
features required of `gpu-feature-discovery`. Please see the instructions below
for [Deployment via `helm`](#deployment-via-helm) when deploying in a
production setting.

#### Job

To label a specific node with a one-shot Job, set `NODE_NAME` to the name of the
node you want to label:

```shell
$ export NODE_NAME=<your-node-name>
$ sed "s/NODE_NAME/${NODE_NAME}/" gpu-feature-discovery-job.yaml.template > gpu-feature-discovery-job.yaml
$ curl https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.2.0-rc.1/deployments/static/gpu-feature-discovery-job.yaml.template \
| sed "s/NODE_NAME/${NODE_NAME}/" > gpu-feature-discovery-job.yaml
$ kubectl apply -f gpu-feature-discovery-job.yaml
```

**Note:** This method should only be used for testing and not deployed in a
production setting.
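You can check that the Job ran to completion (assuming the template names the
Job `gpu-feature-discovery`, which may differ):

```shell
$ kubectl get job gpu-feature-discovery
```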

### Verifying Everything Works

With both NFD and GFD deployed and running, you should now be able to see
GPU-related labels appearing on any nodes that have GPUs installed.

```
$ kubectl get nodes -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Node
  metadata:
    ...
    labels:
      nvidia.com/cuda.driver.major: "455"
      nvidia.com/cuda.driver.minor: "06"
      nvidia.com/cuda.driver.rev: ""
      nvidia.com/cuda.runtime.major: "11"
      nvidia.com/cuda.runtime.minor: "1"
      nvidia.com/gfd.timestamp: "1594644571"
      nvidia.com/gpu.compute.major: "8"
      nvidia.com/gpu.compute.minor: "0"
      nvidia.com/gpu.count: "1"
      nvidia.com/gpu.family: ampere
      nvidia.com/gpu.machine: NVIDIA DGX-2H
      nvidia.com/gpu.memory: "39538"
      nvidia.com/gpu.product: A100-SXM4-40GB
  ...
...
```

## The GFD Command line interface

Available options:
```
gpu-feature-discovery:
Usage:
gpu-feature-discovery [--mig-strategy=<strategy>] [--oneshot | --sleep-interval=<seconds>] [--output-file=<file> | -o <file>]
gpu-feature-discovery -h | --help
gpu-feature-discovery --version
Options:
-h --help Show this help message and exit
--version Display version and exit
--oneshot Label once and exit
--sleep-interval=<seconds> Time to sleep between labeling [Default: 60s]
--mig-strategy=<strategy> Strategy to use for MIG-related labels [Default: none]
-o <file> --output-file=<file> Path to output file
[Default: /etc/kubernetes/node-feature-discovery/features.d/gfd]
```

You can also use environment variables:

| Env Variable | Option | Example |
| ------------------ | ---------------- | ------- |
| GFD_MIG_STRATEGY | --mig-strategy | none |
| GFD_ONESHOT | --oneshot | TRUE |
| GFD_OUTPUT_FILE | --output-file | output |
| GFD_SLEEP_INTERVAL | --sleep-interval | 10s |

Environment variables override the command line options if they conflict.
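For example, in the following sketch the environment variable wins and labeling
runs every 10 seconds rather than every 60:

```shell
# GFD_SLEEP_INTERVAL takes precedence over the conflicting flag below.
$ export GFD_SLEEP_INTERVAL=10s
$ gpu-feature-discovery --sleep-interval=60s
```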

## Generated Labels

This is the list of the labels generated by NVIDIA GPU Feature Discovery and
their meaning:
| Label Name                    | Value Type | Meaning                             | Example        |
| ----------------------------- | ---------- | ----------------------------------- | -------------- |
| nvidia.com/cuda.driver.major  | Integer    | Major version of the NVIDIA driver  | 455            |
| nvidia.com/cuda.driver.minor  | Integer    | Minor version of the NVIDIA driver  | 06             |
| nvidia.com/cuda.driver.rev    | String     | Revision of the NVIDIA driver       | ""             |
| nvidia.com/cuda.runtime.major | Integer    | Major version of CUDA               | 11             |
| nvidia.com/cuda.runtime.minor | Integer    | Minor version of CUDA               | 1              |
| nvidia.com/gfd.timestamp      | Integer    | Timestamp of the generated labels   | 1594644571     |
| nvidia.com/gpu.compute.major  | Integer    | Major of the GPU compute capability | 8              |
| nvidia.com/gpu.compute.minor  | Integer    | Minor of the GPU compute capability | 0              |
| nvidia.com/gpu.count          | Integer    | Number of GPUs                      | 1              |
| nvidia.com/gpu.family         | String     | Architecture family of the GPU      | ampere         |
| nvidia.com/gpu.machine        | String     | Machine type                        | NVIDIA DGX-2H  |
| nvidia.com/gpu.memory         | Integer    | Memory of the GPU in MB             | 2048           |
| nvidia.com/gpu.product        | String     | Model of the GPU                    | GeForce-GT-710 |
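These labels can then be used to steer pods toward nodes with the GPUs they
need. Below is a minimal sketch (pod name and image are hypothetical; the GPU
resource request assumes the NVIDIA device plugin is deployed):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-on-ampere            # hypothetical name
spec:
  nodeSelector:
    nvidia.com/gpu.family: ampere # schedule only onto Ampere nodes
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base  # hypothetical image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # exposed by the NVIDIA device plugin
```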

Depending on the MIG strategy used, the following set of labels may also be
available (or override the default values for some of the labels listed above):

### MIG 'single' strategy

With this strategy, the single `nvidia.com/gpu` label is overloaded to provide
information about MIG devices on the node, rather than full GPUs. This assumes
all GPUs on the node have been divided into identical partitions of the same
size. The example below shows info for a system with 8 full GPUs, each of which
is partitioned into 7 equal-sized MIG devices (56 total).

| Label Name | Value Type | Meaning | Example |
| ----------------------------------- | ---------- | ---------------------------------------- | ------------------------- |
| nvidia.com/mig.strategy | String | MIG strategy in use | single |
| nvidia.com/gpu.product (overridden) | String | Model of the GPU (with MIG info added) | A100-SXM4-40GB-MIG-1g.5gb |
| nvidia.com/gpu.count (overridden) | Integer | Number of MIG devices | 56 |
| nvidia.com/gpu.memory (overridden)  | Integer    | Memory of each MIG device in MB          | 5120                      |
| nvidia.com/gpu.multiprocessors | Integer | Number of Multiprocessors for MIG device | 14 |
| nvidia.com/gpu.slices.gi | Integer | Number of GPU Instance slices | 1 |
| nvidia.com/gpu.slices.ci | Integer | Number of Compute Instance slices | 1 |
| nvidia.com/gpu.engines.copy | Integer | Number of DMA engines for MIG device | 1 |
| nvidia.com/gpu.engines.decoder | Integer | Number of decoders for MIG device | 1 |
| nvidia.com/gpu.engines.encoder | Integer | Number of encoders for MIG device | 1 |
| nvidia.com/gpu.engines.jpeg | Integer | Number of JPEG engines for MIG device | 0 |
| nvidia.com/gpu.engines.ofa | Integer | Number of OfA engines for MIG device | 0 |
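With this strategy, selecting nodes by the overloaded product label is enough
to land on MIG-enabled nodes; for example (value taken from the table above):

```shell
$ kubectl get nodes -l nvidia.com/gpu.product=A100-SXM4-40GB-MIG-1g.5gb
```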

### MIG 'mixed' strategy

With this strategy, a separate set of labels for each MIG device type is
generated. The name of each MIG device type is defined as follows:
```
MIG_TYPE=mig-<slice_count>g.<memory_size>gb
e.g. MIG_TYPE=mig-3g.20gb
```

| Label Name | Value Type | Meaning | Example |
| ------------------------------------ | ---------- | ---------------------------------------- | -------------- |
| nvidia.com/mig.strategy | String | MIG strategy in use | mixed |
| nvidia.com/MIG\_TYPE.count | Integer | Number of MIG devices of this type | 2 |
| nvidia.com/MIG\_TYPE.memory          | Integer    | Memory of MIG device type in MB           | 10240          |
| nvidia.com/MIG\_TYPE.multiprocessors | Integer | Number of Multiprocessors for MIG device | 14 |
| nvidia.com/MIG\_TYPE.slices.gi       | Integer    | Number of GPU Instance slices             | 1              |
| nvidia.com/MIG\_TYPE.slices.ci       | Integer    | Number of Compute Instance slices         | 1              |
| nvidia.com/MIG\_TYPE.engines.copy | Integer | Number of DMA engines for MIG device | 1 |
| nvidia.com/MIG\_TYPE.engines.decoder | Integer | Number of decoders for MIG device | 1 |
| nvidia.com/MIG\_TYPE.engines.encoder | Integer | Number of encoders for MIG device | 1 |
| nvidia.com/MIG\_TYPE.engines.jpeg | Integer | Number of JPEG engines for MIG device | 0 |
| nvidia.com/MIG\_TYPE.engines.ofa | Integer | Number of OfA engines for MIG device | 0 |
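To find the nodes that expose a given MIG device type, you can select on the
existence of its `count` label (a sketch; the key follows the MIG_TYPE naming
scheme above):

```shell
# Selects nodes carrying any value for the mig-3g.20gb count label.
$ kubectl get nodes -l 'nvidia.com/mig-3g.20gb.count'
```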

## Deployment via `helm`

The preferred method to deploy `gpu-feature-discovery` is as a daemonset using `helm`.
Instructions for installing `helm` can be found
[here](https://helm.sh/docs/intro/install/).

The `helm` chart for the latest release of GFD (`v0.2.0-rc.1`) includes a number
of customizable values. The most commonly overridden ones are:

```
sleepInterval:
    time to sleep between labeling (default "60s")
migStrategy:
    pass the desired strategy for labeling MIG devices on GPUs that support it
    [none | single | mixed] (default "none")
nfd.deploy:
    when set to true, deploy NFD as a subchart with all of the proper
    parameters set for it (default "true")
```
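These values can also be collected in a file and passed to `helm install` with
`-f`; the file below is a hypothetical example using only the keys listed
above:

```yaml
# custom-values.yaml (hypothetical) -- overrides for the GFD chart
sleepInterval: 30s
migStrategy: single
nfd:
  deploy: false   # assumes NFD is already running on the cluster
```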

**Note:** The following document provides more information on the available MIG
strategies and how they should be used: [Supporting Multi-Instance GPUs (MIG) in
Kubernetes](https://docs.google.com/document/d/1mdgMQ8g7WmaI_XVVRrCvHPFPOMCm5LQD5JefgAh6N8g).

Please take a look in the following `values.yaml` files to see the full set of
overridable parameters for both the top-level `gpu-feature-discovery` chart and
the `node-feature-discovery` subchart.

* https://github.com/NVIDIA/gpu-feature-discovery/blob/v0.2.0-rc.1/deployments/helm/gpu-feature-discovery/values.yaml
* https://github.com/NVIDIA/gpu-feature-discovery/blob/v0.2.0-rc.1/deployments/helm/gpu-feature-discovery/charts/node-feature-discovery/values.yaml

#### Installing via `helm install` from the `gpu-feature-discovery` `helm` repository

The preferred method of deployment is with `helm install` via the
`gpu-feature-discovery` `helm` repository.

This repository can be added as follows:
```shell
$ helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
$ helm repo update
```

Once this repo is updated, you can begin installing packages from it to deploy
the `gpu-feature-discovery` daemonset and (optionally) the
`node-feature-discovery` daemonset. Below are some examples of deploying these
components with the various flags from above.

**Note:** Since this is a pre-release version, you will need to pass the
`--devel` flag to `helm search repo` in order to see this release listed.
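For example (assuming Helm v3):

```shell
$ helm search repo nvgfd --devel
```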

Using the default values for all flags:
```shell
$ helm install \
    --version=0.2.0-rc.1 \
    --generate-name \
    nvgfd/gpu-feature-discovery
```

Disabling auto-deployment of NFD and running with a MIG strategy of 'mixed' in
the default namespace:
```shell
$ helm install \
    --version=0.2.0-rc.1 \
    --generate-name \
    --set nfd.deploy=false \
    --set migStrategy=mixed \
    --set namespace=default \
    nvgfd/gpu-feature-discovery
```

#### Deploying via `helm install` with a direct URL to the `helm` package

If you prefer not to install from the `gpu-feature-discovery` `helm` repo, you can
run `helm install` directly against the tarball of the component's `helm` package.
The examples below install the same daemonsets as the method above, except that
they use direct URLs to the `helm` package instead of the `helm` repo.

Using the default values for the flags:
```shell
$ helm install \
    --generate-name \
    https://nvidia.github.io/gpu-feature-discovery/stable/gpu-feature-discovery-0.2.0-rc.1.tgz
```

Disabling auto-deployment of NFD and running with a MIG strategy of 'mixed' in
the default namespace:
```shell
$ helm install \
    --generate-name \
    --set nfd.deploy=false \
    --set migStrategy=mixed \
    --set namespace=default \
    https://nvidia.github.io/gpu-feature-discovery/stable/gpu-feature-discovery-0.2.0-rc.1.tgz
```

## Building and running locally with Docker

Download the source code:
```shell
git clone https://github.com/NVIDIA/gpu-feature-discovery
```

If docker is not set up with `nvidia` as the default runtime, you can also use
the `--runtime=nvidia` option:
```shell
docker run --runtime=nvidia nvidia/gpu-feature-discovery:${GFD_VERSION}
```

## Building and running locally on your native machine

Download the source code:
```shell
git clone https://github.com/NVIDIA/gpu-feature-discovery
```

Build and start the development container, then build the binary inside it:
```shell
docker build . -f Dockerfile.devel -t gfd-devel
docker run -it gfd-devel
# inside the container:
go build -ldflags "-X main.Version=devel"
```

Run it:
```shell
./gpu-feature-discovery --output-file=$(pwd)/gfd
```
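To verify the output, you can run once and inspect the generated file; each
line should hold one `label=value` pair for NFD's `local` source to pick up:

```shell
$ ./gpu-feature-discovery --oneshot --output-file=$(pwd)/gfd
$ cat $(pwd)/gfd
```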
19 changes: 19 additions & 0 deletions RELEASE.md
# Release Process

The `gpu-feature-discovery` component consists of two artifacts:
- The `gpu-feature-discovery` container
- The `gpu-feature-discovery` helm chart

Publishing the container is automated through gitlab-ci and only requires one to tag the commit and push it to GitLab.
Publishing the helm chart is currently manual, and we should move to an automated process ASAP.
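A sketch of the container release trigger (hypothetical version tag; assumes
the gitlab remote is named `origin`):

```shell
$ git tag v0.2.0
$ git push origin v0.2.0   # gitlab-ci builds and publishes the container for this tag
```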

# Release Process Checklist
- [ ] Update the README to change occurrences of the old version (e.g. `v0.1.0`) to the new version
- [ ] Commit, Tag and Push to Gitlab
- [ ] Build a new helm package with `helm package ./deployments/helm/gpu-feature-discovery`
- [ ] Switch to the `gh-pages` branch and move the newly generated package into the appropriate helm repo (e.g. `stable`)
- [ ] Run the `./build-index.sh` script to rebuild the indices for each repo
- [ ] Commit and push the `gh-pages` branch to GitHub
- [ ] Wait for the [CI job associated with your tag](https://gitlab.com/nvidia/kubernetes/gpu-feature-discovery/-/pipelines) to complete
- [ ] Update the [README on dockerhub](https://hub.docker.com/r/nvidia/gpu-feature-discovery) with the latest tag information
- [ ] Create a [new release](https://github.com/NVIDIA/gpu-feature-discovery/releases) on Github with the changelog