
Commit

- [Docs] Added TPUs docs
- [Docs] Minor improvements
peterschmidt85 committed Jun 27, 2024
1 parent f6395c6 commit 4aabb50
Showing 15 changed files with 100 additions and 54 deletions.
7 changes: 5 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -14,11 +14,14 @@

</div>

`dstack` is an open-source container orchestration engine designed for running AI workloads across any cloud or data center.
dstack is an open-source container orchestration engine designed for running AI workloads across any cloud or data
center. It simplifies dev environments, running tasks on clusters, and deployment.

The supported cloud providers include AWS, GCP, Azure, OCI, Lambda, TensorDock, Vast.ai, RunPod, and CUDO.
You can also use `dstack` to run workloads on on-prem servers.
You can also use `dstack` to run workloads on on-prem clusters.

`dstack` natively supports NVIDIA GPU and Google Cloud TPU accelerator chips.

## Latest news ✨

- [2024/05] [dstack 0.18.3: OCI, and more](https://github.com/dstackai/dstack/releases/tag/0.18.3) (Release)
8 changes: 5 additions & 3 deletions docs/assets/stylesheets/extra.css
@@ -1119,7 +1119,7 @@ html .md-footer-meta.md-typeset a:is(:focus,:hover) {
visibility: visible;
}

.md-tabs__item:nth-child(3) .md-tabs__link:after, .md-tabs__item:nth-child(4) .md-tabs__link:after, .md-tabs__item:nth-child(8) .md-tabs__link:after {
.md-tabs__item:nth-child(4) .md-tabs__link:after, .md-tabs__item:nth-child(8) .md-tabs__link:after {
content: url('data:image/svg+xml,<svg fill="black" xmlns="http://www.w3.org/2000/svg" width="20px" height="20px" viewBox="0 0 16 16"><polygon points="5 4.31 5 5.69 9.33 5.69 2.51 12.51 3.49 13.49 10.31 6.67 10.31 11 11.69 11 11.69 4.31 5 4.31" data-v-e1bdab2c=""></polygon></svg>');
line-height: 14px;
margin-left: 4px;
@@ -1400,10 +1400,12 @@ html .md-footer-meta.md-typeset a:is(:focus,:hover) {
*/

[dir=ltr] .md-typeset blockquote {
border: 1px solid black;
/*border: 1px solid black;*/
border: none;
color: var(--md-default-fg-color);
padding: 8px 25px;
border-radius: 12px;
border-radius: 6px;
background: -webkit-linear-gradient(45deg, rgba(0, 42, 255, 0.1), rgb(0 114 255 / 1%), rgba(0, 42, 255, 0.05));
}

a.md-go-to-action.secondary {
4 changes: 2 additions & 2 deletions docs/docs/concepts/pools.md
@@ -51,7 +51,7 @@ For more details on policies and their defaults, refer to [`.dstack/profiles.yml
??? info "Limitations"
The `dstack pool add` command is not supported for Kubernetes, VastAI, and RunPod backends yet.

### Adding on-prem servers
### Adding on-prem clusters

Any on-prem server that can be accessed via SSH can be added to a pool and used to run workloads.

@@ -73,7 +73,7 @@ The command accepts the same arguments as the standard `ssh` command.

Once the instance is provisioned, you'll see it in the pool and will be able to run workloads on it.

#### Network
#### Clusters

If you want on-prem instances to run multi-node tasks, ensure these on-prem servers share the same private network.
Additionally, you need to pass the `--network` option to `dstack pool add-ssh`:
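As a sketch, such an invocation might look like the following (the SSH key, user, host, and subnet values are all illustrative):

```shell
# Add an on-prem server to the pool, pointing dstack at the private
# subnet shared by the cluster nodes (all values are illustrative)
dstack pool add-ssh -i ~/.ssh/id_rsa ubuntu@192.168.100.17 \
  --network 192.168.100.0/24
```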
2 changes: 1 addition & 1 deletion docs/docs/concepts/services.md
@@ -42,7 +42,7 @@ model:
If you don't specify your Docker image, `dstack` uses the [base](https://hub.docker.com/r/dstackai/base/tags) image
(pre-configured with Python, Conda, and essential CUDA drivers).

!!! info "Replicas and scaling"
!!! info "Auto-scaling"
By default, the service is deployed to a single instance. However, you can specify the
[number of replicas and scaling policy](../reference/dstack.yml/service.md#replicas-and-auto-scaling).
In this case, `dstack` auto-scales it based on the load.
2 changes: 1 addition & 1 deletion docs/docs/concepts/tasks.md
@@ -36,7 +36,7 @@ If you don't specify your Docker image, `dstack` uses the [base](https://hub.doc
(pre-configured with Python, Conda, and essential CUDA drivers).


!!! info "Nodes"
!!! info "Distributed tasks"
By default, tasks run on a single instance. However, you can specify
the [number of nodes](../reference/dstack.yml/task.md#_nodes).
In this case, `dstack` provisions a cluster of instances.
21 changes: 11 additions & 10 deletions docs/docs/index.md
@@ -1,31 +1,32 @@
# What is dstack?

`dstack` is an open-source container orchestration engine designed for running AI workloads across any cloud or data center.
It simplifies dev environments, running tasks on clusters, and deployment.

> The supported cloud providers include AWS, GCP, Azure, OCI, Lambda, TensorDock, Vast.ai, RunPod, and CUDO.
> You can also use `dstack` to run workloads on on-prem servers.
`dstack` supports dev environments, running tasks on clusters, and deployment with auto-scaling and
authorization out of the box.
> `dstack` is easy to use with any cloud provider (e.g. AWS, GCP, Azure, OCI, Lambda, TensorDock, Vast.ai, RunPod, CUDO, etc.)
> as well as on-prem clusters.
>
> `dstack` natively supports NVIDIA GPU and Google Cloud TPU accelerator chips.
## Why use dstack?

1. Simplifies development, training, and deployment of AI
1. Simplifies development, training, and deployment for AI teams
2. Can be used with any cloud providers and data centers
3. Leverages the open-source AI ecosystem of libraries, frameworks, and models
4. Reduces GPU costs and improves workload efficiency
3. Very easy to use with any open-source training or serving frameworks
4. Reduces compute costs and improves workload efficiency
5. Much simpler compared to Kubernetes

## How does it work?

!!! info "Installation"
Before using `dstack`, either set up the open-source server, or sign up
with [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"}.
with `dstack Sky`.
See [Installation](installation/index.md) for more details.

1. Define configurations such as [dev environments](concepts/dev-environments.md), [tasks](concepts/tasks.md),
and [services](concepts/services.md).
2. Run configurations via `dstack`'s CLI or API.
3. Use [pools](concepts/pools.md) to manage cloud instances and on-prem servers.
3. Use [pools](concepts/pools.md) to manage cloud instances and on-prem clusters.

## Where do I start?

10 changes: 8 additions & 2 deletions docs/docs/installation/index.md
@@ -29,8 +29,9 @@ projects:
</div>
> See the [server/config.yml reference](../reference/server/config.yml.md#examples)
> for details on how to configure backends for all supported cloud providers.
> Go to the [server/config.yml reference](../reference/server/config.yml.md#examples)
> for details on how to configure backends for AWS, GCP, Azure, OCI, Lambda,
> TensorDock, Vast.ai, RunPod, CUDO, Kubernetes, etc.
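For reference, a minimal `~/.dstack/server/config.yml` with a single backend might look like the sketch below (the project name, backend type, and credentials mode are illustrative):

```yaml
projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default  # use the default AWS credential chain
```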
### Start the server
@@ -93,6 +94,11 @@ Configuration is updated at ~/.dstack/config.yml

This configuration is stored in `~/.dstack/config.yml`.

### Adding on-prem clusters

If you'd like to use `dstack` to run workloads on your on-prem clusters,
check out the [dstack pool add-ssh](../concepts/pools.md#adding-on-prem-clusters) command.

## dstack Sky

### Set up the CLI
4 changes: 2 additions & 2 deletions docs/docs/quickstart.md
@@ -99,8 +99,8 @@ or `train.dstack.yml` are both acceptable).

## Run configuration

Run a configuration using the [`dstack run`](reference/cli/index.md#dstack-run) command, followed by the working directory path (e.g., `.`), the path to the
configuration file, and run options (e.g., configuring hardware resources, spot policy, etc.)
Run a configuration using the [`dstack run`](reference/cli/index.md#dstack-run) command, followed by the working directory path (e.g., `.`),
and the path to the configuration file.

<div class="termy">

12 changes: 12 additions & 0 deletions docs/docs/reference/dstack.yml/dev-environment.md
@@ -92,6 +92,18 @@ and their quantity. Examples: `A100` (one A100), `A10G,A100` (either A10G or A10
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).

??? info "Google Cloud TPU"
To use a TPU, specify its architecture prefixed with `tpu-` via the `gpu` property.

```yaml
type: dev-environment
ide: vscode
resources:
gpu: tpu-v2-8
```

??? info "Shared memory"
If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure
`shm_size`, e.g. set it to `16GB`.
2 changes: 1 addition & 1 deletion docs/docs/reference/dstack.yml/service.md
@@ -144,7 +144,7 @@ and `openai` (if you are using Text Generation Inference or vLLM with OpenAI-com
If you encounter any other issues, please make sure to file a [GitHub issue](https://github.com/dstackai/dstack/issues/new/choose).


### Replicas and auto-scaling
### Auto-scaling

By default, `dstack` runs a single replica of the service.
You can configure the number of replicas as well as the auto-scaling rules.
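As a sketch, a service configuration with a replica range and an auto-scaling rule could look like this (the image, port, and scaling values are illustrative):

```yaml
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
port: 80
replicas: 1..4  # scale between one and four replicas
scaling:
  metric: rps   # scale on requests per second
  target: 10    # target 10 RPS per replica
```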
22 changes: 21 additions & 1 deletion docs/docs/reference/dstack.yml/task.md
@@ -125,6 +125,26 @@ and their quantity. Examples: `A100` (one A100), `A10G,A100` (either A10G or A10
`A100:80GB` (one A100 of 80GB), `A100:2` (two A100), `24GB..40GB:2` (two GPUs between 24GB and 40GB),
`A100:40GB:2` (two A100 GPUs of 40GB).

??? info "Google Cloud TPU"
To use a TPU, specify its architecture prefixed with `tpu-` via the `gpu` property.

```yaml
type: task
python: "3.11"
commands:
- pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
- git clone --recursive https://github.com/pytorch/xla.git
- python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
gpu: tpu-v2-8
```

!!! info "Limitations"
Multi-node tasks aren't supported with TPUs yet; this support is coming soon.

??? info "Shared memory"
If you are using parallel communicating processes (e.g., dataloaders in PyTorch), you may need to configure
`shm_size`, e.g. set it to `16GB`.
@@ -167,7 +187,7 @@ The following environment variables are available in any run and are passed by `
| `DSTACK_NODE_RANK` | The rank of the node |
| `DSTACK_MASTER_NODE_IP` | The internal IP address of the master node |

### Nodes { #_nodes }
### Distributed tasks { #_nodes }

By default, the task runs on a single node. However, you can run it on a cluster of nodes.
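A distributed task sketch, wiring the environment variables from the table above into `torchrun` (the script name, port, and node count are illustrative):

```yaml
type: task
nodes: 2
commands:
  # Launch one torchrun process per node; dstack injects the rank
  # and master-node address via environment variables
  - torchrun
    --nnodes=2
    --node_rank=$DSTACK_NODE_RANK
    --master_addr=$DSTACK_MASTER_NODE_IP
    --master_port=29500
    train.py
```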

54 changes: 28 additions & 26 deletions docs/docs/reference/server/config.yml.md
@@ -668,31 +668,32 @@ In case of a self-managed cluster, also specify the IP address of any node in th
backends:
- type: kubernetes
kubeconfig:
filename: ~/.kube/config
filename: ~/.kube/config
networking:
ssh_host: localhost # The external IP address of any node
ssh_port: 32000 # Any port accessible outside of the cluster
ssh_host: localhost # The external IP address of any node
ssh_port: 32000 # Any port accessible outside of the cluster
```

</div>

The port specified as `ssh_port` must be accessible from outside the cluster.

For example, if you are using Kind, make sure to add it via `extraPortMappings`:

<div editor-title="installation/kind-config.yml">

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
extraPortMappings:
- containerPort: 32000 # Must be same as `ssh_port`
hostPort: 32000 # Must be same as `ssh_port`
```
</div>
??? info "Kind"
For example, if you are using Kind, make sure to add it via `extraPortMappings`:

<div editor-title="installation/kind-config.yml">

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
extraPortMappings:
- containerPort: 32000 # Must be same as `ssh_port`
hostPort: 32000 # Must be same as `ssh_port`
```
</div>
[//]: # (TODO: Elaborate on Kind's IP address on Linux)
@@ -707,21 +708,22 @@ In case of a self-managed cluster, also specify the IP address of any node in th
backends:
- type: kubernetes
kubeconfig:
filename: ~/.kube/config
filename: ~/.kube/config
networking:
ssh_port: 32000 # Any port accessible outside of the cluster
ssh_port: 32000 # Any port accessible outside of the cluster
```
</div>
The port specified as `ssh_port` must be accessible from outside the cluster.

For example, if you are using EKS, make sure to add it via an ingress rule
of the corresponding security group:

```shell
aws ec2 authorize-security-group-ingress --group-id <cluster-security-group-id> --protocol tcp --port 32000 --cidr 0.0.0.0/0
```
??? info "EKS"
For example, if you are using EKS, make sure to add it via an ingress rule
of the corresponding security group:

```shell
aws ec2 authorize-security-group-ingress --group-id <cluster-security-group-id> --protocol tcp --port 32000 --cidr 0.0.0.0/0
```

[//]: # (TODO: Elaborate on gateways, and what backends allow configuring them)

2 changes: 1 addition & 1 deletion docs/index.md
@@ -1,6 +1,6 @@
---
template: home.html
title: Orchestrate AI workloads in any cloud
title: AI container orchestration platform for everyone
hide:
- navigation
- toc
2 changes: 1 addition & 1 deletion docs/pricing.md
@@ -1,6 +1,6 @@
---
template: pricing.html
title: Orchestrate AI workloads in any cloud
title: AI container orchestration platform for everyone
hide:
- navigation
- toc
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -207,7 +207,7 @@ nav:
- API:
- Python API: docs/reference/api/python/index.md
- REST API: docs/reference/api/rest/index.md
- Examples: https://github.com/dstackai/dstack/tree/master/examples" target="_blank
- Examples: /#examples
- Changelog: https://github.com/dstackai/dstack/releases" target="_blank
- Blog:
- blog/index.md
