
Commit

- [Docs] Added TPUs docs
- [Docs] Minor improvements
peterschmidt85 committed Jun 27, 2024
1 parent f6395c6 commit 4aabb50
Showing 15 changed files with 100 additions and 54 deletions.
7 changes: 5 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -14,11 +14,14 @@

</div>

`dstack` is an open-source container orchestration engine designed for running AI workloads across any cloud or data center.
dstack is an open-source container orchestration engine designed for running AI workloads across any cloud or data
center. It simplifies dev environments, running tasks on clusters, and deployment.

The supported cloud providers include AWS, GCP, Azure, OCI, Lambda, TensorDock, Vast.ai, RunPod, and CUDO.
You can also use `dstack` to run workloads on on-prem servers.
You can also use `dstack` to run workloads on on-prem clusters.

`dstack` natively supports NVIDIA GPU and Google Cloud TPU accelerator chips.

## Latest news ✨

- [2024/05] [dstack 0.18.3: OCI, and more](https://github.com/dstackai/dstack/releases/tag/0.18.3) (Release)
8 changes: 5 additions & 3 deletions docs/assets/stylesheets/extra.css
@@ -1119,7 +1119,7 @@ html .md-footer-meta.md-typeset a:is(:focus,:hover) {
visibility: visible;
}

.md-tabs__item:nth-child(3) .md-tabs__link:after, .md-tabs__item:nth-child(4) .md-tabs__link:after, .md-tabs__item:nth-child(8) .md-tabs__link:after {
.md-tabs__item:nth-child(4) .md-tabs__link:after, .md-tabs__item:nth-child(8) .md-tabs__link:after {
content: url('data:image/svg+xml,<svg fill="black" xmlns="http://www.w3.org/2000/svg" width="20px" height="20px" viewBox="0 0 16 16"><polygon points="5 4.31 5 5.69 9.33 5.69 2.51 12.51 3.49 13.49 10.31 6.67 10.31 11 11.69 11 11.69 4.31 5 4.31" data-v-e1bdab2c=""></polygon></svg>');
line-height: 14px;
margin-left: 4px;
@@ -1400,10 +1400,12 @@ html .md-footer-meta.md-typeset a:is(:focus,:hover) {
*/

[dir=ltr] .md-typeset blockquote {
border: 1px solid black;
/*border: 1px solid black;*/
border: none;
color: var(--md-default-fg-color);
padding: 8px 25px;
border-radius: 12px;
border-radius: 6px;
background: -webkit-linear-gradient(45deg, rgba(0, 42, 255, 0.1), rgb(0 114 255 / 1%), rgba(0, 42, 255, 0.05));
}

a.md-go-to-action.secondary {
4 changes: 2 additions & 2 deletions docs/docs/concepts/pools.md
@@ -51,7 +51,7 @@ For more details on policies and their defaults, refer to [`.dstack/profiles.yml
??? info "Limitations"
The `dstack pool add` command is not supported for Kubernetes, VastAI, and RunPod backends yet.

### Adding on-prem servers
### Adding on-prem clusters

Any on-prem server that can be accessed via SSH can be added to a pool and used to run workloads.

@@ -73,7 +73,7 @@ The command accepts the same arguments as the standard `ssh` command.

Once the instance is provisioned, you'll see it in the pool and will be able to run workloads on it.

#### Network
#### Clusters

If you want on-prem instances to run multi-node tasks, ensure these on-prem servers share the same private network.
Additionally, you need to pass the `--network` option to `dstack pool add-ssh`:
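As a sketch, such an invocation might look like the following (the SSH key, user, host, and subnet values are all illustrative):

```shell
# Add an on-prem server to the pool, pointing dstack at the private
# subnet shared by the cluster nodes (all values are illustrative)
dstack pool add-ssh -i ~/.ssh/id_rsa ubuntu@192.168.100.17 \
  --network 192.168.100.0/24
```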
2 changes: 1 addition & 1 deletion docs/docs/concepts/services.md
@@ -42,7 +42,7 @@ model:
If you don't specify your Docker image, `dstack` uses the [base](https://hub.docker.com/r/dstackai/base/tags) image
(pre-configured with Python, Conda, and essential CUDA drivers).

!!! info "Replicas and scaling"
!!! info "Auto-scaling"
By default, the service is deployed to a single instance. However, you can specify the
[number of replicas and scaling policy](../reference/dstack.yml/service.md#replicas-and-auto-scaling).
In this case, `dstack` auto-scales it based on the load.
2 changes: 1 addition & 1 deletion docs/docs/concepts/tasks.md
@@ -36,7 +36,7 @@ If you don't specify your Docker image, `dstack` uses the [base](https://hub.doc
(pre-configured with Python, Conda, and essential CUDA drivers).


!!! info "Nodes"
!!! info "Distributed tasks"
By default, tasks run on a single instance. However, you can specify
the [number of nodes](../reference/dstack.yml/task.md#_nodes).
In this case, `dstack` provisions a cluster of instances.
21 changes: 11 additions & 10 deletions docs/docs/index.md
@@ -1,31 +1,32 @@
# What is dstack?

`dstack` is an open-source container orchestration engine designed for running AI workloads across any cloud or data center.
It simplifies dev environments, running tasks on clusters, and deployment.

> The supported cloud providers include AWS, GCP, Azure, OCI, Lambda, TensorDock, Vast.ai, RunPod, and CUDO.
> You can also use `dstack` to run workloads on on-prem servers.
`dstack` supports dev environments, running tasks on clusters, and deployment with auto-scaling and
authorization out of the box.
> `dstack` is easy to use with any cloud provider (e.g. AWS, GCP, Azure, OCI, Lambda, TensorDock, Vast.ai, RunPod, CUDO, etc.)
> as well as on-prem clusters.
>
> `dstack` natively supports NVIDIA GPU and Google Cloud TPU accelerator chips.
## Why use dstack?

1. Simplifies development, training, and deployment of AI
1. Simplifies development, training, and deployment for AI teams
2. Can be used with any cloud providers and data centers
3. Leverages the open-source AI ecosystem of libraries, frameworks, and models
4. Reduces GPU costs and improves workload efficiency
3. Very easy to use with any open-source training or serving frameworks
4. Reduces compute costs and improves workload efficiency
5. Much simpler compared to Kubernetes

## How does it work?

!!! info "Installation"
Before using `dstack`, either set up the open-source server, or sign up
with [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"}.
with `dstack Sky`.
See [Installation](installation/index.md) for more details.

1. Define configurations such as [dev environments](concepts/dev-environments.md), [tasks](concepts/tasks.md),
and [services](concepts/services.md).
2. Run configurations via `dstack`'s CLI or API.
3. Use [pools](concepts/pools.md) to manage cloud instances and on-prem servers.
3. Use [pools](concepts/pools.md) to manage cloud instances and on-prem clusters.

## Where do I start?

10 changes: 8 additions & 2 deletions docs/docs/installation/index.md
@@ -29,8 +29,9 @@ projects:
</div>
> See the [server/config.yml reference](../reference/server/config.yml.md#examples)
> for details on how to configure backends for all supported cloud providers.
> Go to the [server/config.yml reference](../reference/server/config.yml.md#examples)
> for details on how to configure backends for AWS, GCP, Azure, OCI, Lambda,
> TensorDock, Vast.ai, RunPod, CUDO, Kubernetes, etc.
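For reference, a minimal `~/.dstack/server/config.yml` with a single backend might look like the sketch below (the project name, backend type, and credentials mode are illustrative):

```yaml
projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default  # use the default AWS credential chain
```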
### Start the server
@@ -93,6 +94,11 @@ Configuration is updated at ~/.dstack/config.yml

This configuration is stored in `~/.dstack/config.yml`.

### Adding on-prem clusters

If you'd like to use `dstack` to run workloads on your on-prem clusters,
check out the [dstack pool add-ssh](../concepts/pools.md#adding-on-prem-clusters) command.

## dstack Sky

### Set up the CLI
4 changes: 2 additions & 2 deletions docs/docs/quickstart.md
@@ -99,8 +99,8 @@ or `train.dstack.yml` are both acceptable).

## Run configuration

Run a configuration using the [`dstack run`](reference/cli/index.md#dstack-run) command, followed by the working directory path (e.g., `.`), the path to the
configuration file, and run options (e.g., configuring hardware resources, spot policy, etc.)
Run a configuration using the [`dstack run`](reference/cli/index.md#dstack-run) command, followed by the working directory path (e.g., `.`),
and the path to the configuration file.

<div class="termy">

12 changes: 12 additions & 0 deletions docs/docs/reference/dstack.yml/dev-environment.md
@@ -92,6 +92,18 @@ and their quantity. Examples: `A100` (one A100), `A10G,A100` (either A10G or A10
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).

??? info "Google Cloud TPU"
To use a TPU, specify its architecture prefixed with `tpu-` via the `gpu` property.

```yaml
type: dev-environment
ide: vscode
resources:
gpu: tpu-v2-8
```

??? info "Shared memory"
If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure
`shm_size`, e.g. set it to `16GB`.
2 changes: 1 addition & 1 deletion docs/docs/reference/dstack.yml/service.md
@@ -144,7 +144,7 @@ and `openai` (if you are using Text Generation Inference or vLLM with OpenAI-com
If you encounter any other issues, please make sure to file a [GitHub issue](https://github.com/dstackai/dstack/issues/new/choose).


### Replicas and auto-scaling
### Auto-scaling

By default, `dstack` runs a single replica of the service.
You can configure the number of replicas as well as the auto-scaling rules.
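As a sketch, a service configuration with a replica range and an auto-scaling rule could look like this (the image, port, and scaling values are illustrative):

```yaml
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
port: 80
replicas: 1..4  # scale between one and four replicas
scaling:
  metric: rps   # scale on requests per second
  target: 10    # target 10 RPS per replica
```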
22 changes: 21 additions & 1 deletion docs/docs/reference/dstack.yml/task.md
@@ -125,6 +125,26 @@ and their quantity. Examples: `A100` (one A100), `A10G,A100` (either A10G or A10
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).

??? info "Google Cloud TPU"
To use a TPU, specify its architecture prefixed with `tpu-` via the `gpu` property.

```yaml
type: task
python: "3.11"
commands:
- pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
- git clone --recursive https://github.com/pytorch/xla.git
- python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
gpu: tpu-v2-8
```

!!! info "Limitations"
Multi-node tasks aren't supported with TPUs yet; this support is coming soon.

??? info "Shared memory"
If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure
`shm_size`, e.g. set it to `16GB`.
@@ -167,7 +187,7 @@ The following environment variables are available in any run and are passed by `
| `DSTACK_NODE_RANK` | The rank of the node |
| `DSTACK_MASTER_NODE_IP` | The internal IP address of the master node |

### Nodes { #_nodes }
### Distributed tasks { #_nodes }

By default, the task runs on a single node. However, you can run it on a cluster of nodes.
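A distributed task sketch, wiring the environment variables from the table above into `torchrun` (the script name, port, and node count are illustrative):

```yaml
type: task
nodes: 2
commands:
  # Launch one torchrun process per node; dstack injects the rank
  # and master-node address via environment variables
  - torchrun
    --nnodes=2
    --node_rank=$DSTACK_NODE_RANK
    --master_addr=$DSTACK_MASTER_NODE_IP
    --master_port=29500
    train.py
```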

54 changes: 28 additions & 26 deletions docs/docs/reference/server/config.yml.md
@@ -668,31 +668,32 @@ In case of a self-managed cluster, also specify the IP address of any node in th
backends:
- type: kubernetes
kubeconfig:
filename: ~/.kube/config
filename: ~/.kube/config
networking:
ssh_host: localhost # The external IP address of any node
ssh_port: 32000 # Any port accessible outside of the cluster
ssh_host: localhost # The external IP address of any node
ssh_port: 32000 # Any port accessible outside of the cluster
```

</div>

The port specified as `ssh_port` must be accessible from outside the cluster.

For example, if you are using Kind, make sure to add it via `extraPortMappings`:

<div editor-title="installation/kind-config.yml">

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
extraPortMappings:
- containerPort: 32000 # Must be same as `ssh_port`
hostPort: 32000 # Must be same as `ssh_port`
```
</div>
??? info "Kind"
For example, if you are using Kind, make sure to add it via `extraPortMappings`:

<div editor-title="installation/kind-config.yml">

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
extraPortMappings:
- containerPort: 32000 # Must be same as `ssh_port`
hostPort: 32000 # Must be same as `ssh_port`
```
</div>
[//]: # (TODO: Elaborate on Kind's IP address on Linux)
@@ -707,21 +708,22 @@ In case of a self-managed cluster, also specify the IP address of any node in th
backends:
- type: kubernetes
kubeconfig:
filename: ~/.kube/config
filename: ~/.kube/config
networking:
ssh_port: 32000 # Any port accessible outside of the cluster
ssh_port: 32000 # Any port accessible outside of the cluster
```
</div>
The port specified as `ssh_port` must be accessible from outside the cluster.

For example, if you are using EKS, make sure to add it via an ingress rule
of the corresponding security group:

```shell
aws ec2 authorize-security-group-ingress --group-id <cluster-security-group-id> --protocol tcp --port 32000 --cidr 0.0.0.0/0
```
??? info "EKS"
For example, if you are using EKS, make sure to add it via an ingress rule
of the corresponding security group:

```shell
aws ec2 authorize-security-group-ingress --group-id <cluster-security-group-id> --protocol tcp --port 32000 --cidr 0.0.0.0/0
```

[//]: # (TODO: Elaborate on gateways, and what backends allow configuring them)

2 changes: 1 addition & 1 deletion docs/index.md
@@ -1,6 +1,6 @@
---
template: home.html
title: Orchestrate AI workloads in any cloud
title: AI container orchestration platform for everyone
hide:
- navigation
- toc
2 changes: 1 addition & 1 deletion docs/pricing.md
@@ -1,6 +1,6 @@
---
template: pricing.html
title: Orchestrate AI workloads in any cloud
title: AI container orchestration platform for everyone
hide:
- navigation
- toc
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -207,7 +207,7 @@ nav:
- API:
- Python API: docs/reference/api/python/index.md
- REST API: docs/reference/api/rest/index.md
- Examples: https://github.com/dstackai/dstack/tree/master/examples" target="_blank
- Examples: /#examples
- Changelog: https://github.com/dstackai/dstack/releases" target="_blank
- Blog:
- blog/index.md
