Releases: dstackai/dstack
0.18.6
Major fixes
- Support GitLab authorization when the repo uses HTTP/HTTPS by @jvstme in #1412
- Add a multi-node case to the Hugging Face Alignment Handbook example by @deep-diver in #1409
- Fix the issue where idle instances weren't offered when a GPU name was in uppercase by @jvstme in #1417
- Fix the issue where an exception is thrown for non-standard Git repo host URLs using HTTP/HTTPS by @jvstme in #1410
- Support H100 with the gcp backend by @jvstme in #1405
Warning
If you have idle instances in your pool, it is recommended to re-create them after upgrading to version 0.18.6. Otherwise, there is a risk that these instances won't be able to execute jobs.
Other
- [Internal] Add script for checking OCI images by @jvstme in #1408
- Fix repos migration on PostgreSQL by @jvstme in #1414
- [Internal] Fix dstack-runner repo tests by @jvstme in #1418
- Fix OCI listing not found errors by @jvstme in #1407
Full changelog: 0.18.5...0.18.6
0.18.5
Read below about the new features and bug fixes in this release.
Volumes
When you run anything with dstack, it allows you to configure the disk size. However, once the run is finished, if you haven't stored your data in any external storage, all the data on disk will be erased. With 0.18.5, we're adding support for network volumes that allow data to persist across runs.
Once you've created a volume (e.g. named my-new-volume), you can attach it to a dev environment, task, or service.
type: dev-environment
ide: vscode
volumes:
  - name: my-new-volume
    path: /volume_data
The data stored in the volume will persist across runs.
dstack allows you to create new volumes and register existing ones. To learn more about how volumes work, check out the docs.
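For illustration, a volume configuration might look like the following sketch (the exact property set is described in the volumes docs; the region and size values here are illustrative):

type: volume
name: my-new-volume
backend: aws
# Illustrative values; pick the region where your runs are provisioned.
region: eu-west-1
size: 100GB

To register an existing volume instead of creating a new one, you would reference its backend volume ID (for example, via a volume_id property) rather than a size.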
Important
Volumes are currently experimental and only work with the aws backend. Support for other backends is coming soon.
PostgreSQL
By default, dstack stores its state in ~/.dstack/server/data using SQLite. With this update, it's now possible to configure dstack to store its state in PostgreSQL. Just pass the DSTACK_DATABASE_URL environment variable.
DSTACK_DATABASE_URL="postgresql+asyncpg://myuser:mypassword@localhost:5432/mydatabase" dstack server
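If you run the server via Docker, the same variable can be passed to the container. A sketch, assuming the dstackai/dstack image and the default server port of 3000:

docker run -p 3000:3000 \
  -e DSTACK_DATABASE_URL="postgresql+asyncpg://myuser:mypassword@localhost:5432/mydatabase" \
  dstackai/dstack

Note that localhost inside the container refers to the container itself, so point the URL at a host the container can actually reach.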
Important
Despite PostgreSQL support, dstack still requires that you run only one instance of the dstack server. However, this requirement will be lifted in a future update.
On-prem clusters
Previously, dstack didn't allow the use of on-prem clusters (added via dstack pool add-ssh) if there were no backends configured. This update fixes that bug. Now, you don't have to configure any backends if you only plan to use on-prem clusters.
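For example, with no backends configured, adding an on-prem instance to the pool still works (the command is the same one shown in the 0.18.1 notes below):

dstack pool add-ssh -i ~/.ssh/id_rsa [email protected]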
Supported GPUs
Previously, dstack didn't support L4 and H100 GPUs with AWS. Now you can use them.
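For example, a run configuration can now request these GPUs on AWS. A minimal sketch, using the name:count GPU syntax from the resources format:

type: task
commands:
  - nvidia-smi
resources:
  gpu: H100:8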
Full changelog
- Support dstack volumes by @r4victor in #1364
- Filter pool instances with respect to volumes availability zone by @r4victor in #1368
- Support AWS L4 GPU by @jvstme in #1365
- Add Concepts->Volumes by @r4victor in #1370
- Improve Overview page by @r4victor in #1377
- Add volumes prices by @r4victor in #1382
- Wait for GCP VM no capacity error by @r4victor in #1387
- Disallow mounting volumes inside /workflow by @r4victor in #1388
- Support NVIDIA NVSwitch in dstack VM images by @jvstme in #1389
- Optimize loading dstack Docker images by @jvstme in #1391
- Improve Contributing by @r4victor in #1392
- Support running dstack server with Postgres by @r4victor in #1398
- Support H100 GPU on AWS by @jvstme in #1394
- Fix possible server freeze after pool add-ssh by @jvstme in #1396
- Add OCI eu-milan-1 region by @jvstme in #1400
- Prepare future OCI spot instances support by @jvstme in #1401
- Remove if backends configured check by @r4victor in #1404
- Include project_name in Instance and Volume by @r4victor in #1390
See more: 0.18.4...0.18.5
0.18.5rc1
This is a release candidate build of the 0.18.5 release. Its release notes and changelog match the final 0.18.5 release above.
See more: 0.18.4...0.18.5rc1
0.18.4
Google Cloud TPU
This update introduces initial support for Google Cloud TPU.
To request a TPU, specify the TPU architecture prefixed by tpu- (in gpu under resources):
type: task
python: "3.11"
commands:
  - pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  - git clone --recursive https://github.com/pytorch/xla.git
  - python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
resources:
  gpu: tpu-v2-8
Important
Currently, only 8 TPU cores can be requested, meaning only single TPU device workloads are supported. Support for multiple TPU devices is coming soon.
Private subnets with GCP
Additionally, the update allows configuring the gcp
backend to use only private subnets. To achieve this, set public_ips
to false
.
projects:
  - name: main
    backends:
      - type: gcp
        creds:
          type: default
        public_ips: false
Major bug-fixes
Besides TPU, the update fixes a few important bugs.
- Fix cudo backend stuck && Improve docs for cudo by @smokfyz in #1347
- Fix nvidia-smi not available on lambda by @r4victor in #1357
- Respect registry_auth for RunPod by @smokfyz in #1333
- Support multi-node tasks on oci by @jvstme in #1334
Other
- Show warning on required ssh version by @loghijiaha in #1313
- Add OCI packer templates by @jvstme in #1316
- Support oci Bare Metal instances by @jvstme in #1325
- Support oci BM.Optimized3.36 instance by @jvstme in #1328
- [Docs] Update dstack pool docs by @jvstme in #1329
- Add TPU support in gcp by @Bihan in #1323
- Fix failing runner-test workflow by @r4victor in #1336
- Document OCI permissions by @jvstme in #1338
- Limit the gateway's open ports to 22, 80, and 443 by @smokfyz in #1335
- Update serve.dstack.yml - infinity by @michaelfeil in #1340
- Support instances without public IP for GCP by @smokfyz in #1341
- [Internal] Automate OCI images publishing by @jvstme in #1346
- Fix slow /api/pools/list_instances by @r4victor in #1320
- Respect gcp VPC config when provisioning TPUs by @r4victor in #1332
- [Internal] Fix linter errors by @jvstme in #1322
- TPU support enhancements by @r4victor in #1330
- TPU initial release by @Bihan in #1354
- TPUs fixes by @r4victor in #1360
- Minor refactoring to support custom backends in dstack Sky by @r4victor in #1319
- Even more flexible OCI client credentials by @jvstme in #1317
New contributors
- @loghijiaha made their first contribution in #1313
- @smokfyz made their first contribution in #1333
- @michaelfeil made their first contribution in #1340
Full changelog: 0.18.3...0.18.4
0.18.4rc3
This is a preview build of the upcoming 0.18.4 release. Its feature highlights and changelog match the final 0.18.4 release above. You're very welcome to try the initial TPU support and share your feedback.
Full changelog: 0.18.3...0.18.4rc3
0.18.3
Oracle Cloud Infrastructure
With the new update, it is now possible to run workloads with your Oracle Cloud Infrastructure (OCI) account. The backend is called oci and can be configured as follows:
projects:
  - name: main
    backends:
      - type: oci
        creds:
          type: default
The supported credential types include default and client. In case default is used, dstack automatically picks the default OCI credentials from ~/.oci/config.
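For reference, a client credentials configuration might look like the sketch below. The field names are assumptions based on standard OCI API key authentication; check the oci backend reference for the exact schema:

projects:
  - name: main
    backends:
      - type: oci
        creds:
          type: client
          # All values below are illustrative placeholders, and the
          # field names are assumptions; see the oci reference.
          user: ocid1.user.oc1..example
          tenancy: ocid1.tenancy.oc1..example
          region: eu-frankfurt-1
          fingerprint: aa:bb:cc:dd
          key_file: ~/.oci/private_key.pem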
Just like other backends, oci supports dev environments, tasks, and services.
Note
Support for spot instances, multi-node tasks, and gateways is coming soon.
Find more documentation on using Oracle Cloud Infrastructure on the reference page.
Retry policy
We have reworked how to configure the retry policy and how it is applied to runs. Here's an example:
type: task
commands:
  - python train.py
retry:
  on_events: [no-capacity]
  duration: 2h
Now, if you run such a task, dstack will keep trying to find capacity within 2 hours. Once capacity is found, dstack will run the task.
The on_events property also supports error (in case the run fails with an error) and interruption (if the run is using a spot instance and it was interrupted).
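For example, combining the events above, the retry block of a run configuration could look like this:

retry:
  on_events: [no-capacity, interruption]
  duration: 2h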
Previously, dstack only allowed retries when spot instances were interrupted.
RunPod
Previously, the runpod backend only allowed the use of Docker images with /bin/bash or /bin/sh as the entrypoint. Thanks to a fix on RunPod's side, dstack now allows the use of any Docker image.
Additionally, the runpod backend now also supports spot instances.
GCP
The gcp backend now also allows configuring VPCs:
projects:
  - name: main
    backends:
      - type: gcp
        project_id: my-awesome-project
        creds:
          type: default
        vpc_name: my-custom-vpc
The VPC should belong to the same project. If you would like to use a shared VPC from another project, you can also specify vpc_project_id.
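For example, a shared VPC setup might look like this sketch (the project IDs are illustrative):

projects:
  - name: main
    backends:
      - type: gcp
        project_id: my-awesome-project
        creds:
          type: default
        vpc_name: my-custom-vpc
        # The host project that owns the shared VPC (illustrative).
        vpc_project_id: my-host-project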
AWS
Last but not least, for the aws backend, it is now possible to configure VPCs for selected regions and use the default VPC in other regions:
projects:
  - name: main
    backends:
      - type: aws
        creds:
          type: default
        vpc_ids:
          us-east-1: vpc-0a2b3c4d5e6f7g8h
        default_vpcs: true
You just need to set default_vpcs to true.
Other changes
- Fix reverse server-gateway ssh tunnel by @r4victor in #1303
- Respect run filters for the ssh backend by @r4victor in #1278
- Support resubmitted runs in dstack run attached mode by @r4victor in #1285
- Do not run jobs on unreachable instances by @r4victor in #1286
- Show job termination reason in dstack ps -v by @r4victor in #1301
- Rename dstack destroy to dstack delete by @r4victor in #1275
- Prepare OCI backend for release by @jvstme in #1308
- [Docs] Improve the documentation of the Pydantic models #1293 by @peterschmidt85 in #1295
- [Docs] Fix Authorization header by @jvstme in #1305
0.18.3rc1
The release notes for this RC build match the final 0.18.3 release above.
Full changelog: 0.18.2...0.18.3rc1
Warning
This is an RC build. Please report any bugs to the issue tracker. The final release is planned for later this week, and the official documentation and examples will be updated then.
0.18.2
On-prem clusters
Network
The dstack pool add-ssh command now supports the --network argument. Use this argument if you want to use multiple instances that share the same private network as a cluster to run multi-node tasks.
The --network argument accepts the IP address range (CIDR) of the private network of the instance.
Example:
dstack pool add-ssh -i ~/.ssh/id_rsa [email protected] --network 10.0.0.0/24
Once you've added multiple instances with the same network value, you'll be able to use them as a cluster to run multi-node tasks.
Private subnets
By default, dstack uses public IPs for SSH access to running instances, requiring public subnets in the VPC. The new update allows AWS instances to use private subnets instead.
To create instances only in private subnets, set public_ips to false in the AWS backend settings:
type: aws
creds:
  type: default
vpc_ids:
  ...
public_ips: false
Note
- Both dstack server and the dstack CLI should have access to the private subnet to access instances.
- If you want running instances to access the Internet, the private subnets need to have a NAT gateway.
Gateways
dstack apply
Previously, to create or update gateways, one had to use the dstack gateway create or dstack gateway update commands.
Now, it's possible to define a gateway configuration via YAML and create or update it using the dstack apply command.
Example:
type: gateway
name: example-gateway
backend: gcp
region: europe-west1
domain: example.com
dstack apply -f examples/deployment/gateway.dstack.yml
For now, the dstack apply command only supports the gateway configuration type. Soon, it will also support dev-environment, task, and service, replacing the dstack run command.
The dstack destroy command can be used to delete resources.
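For example, the gateway created above could presumably be deleted by pointing dstack destroy at the same file (a sketch mirroring the dstack apply usage above; check the CLI reference for the exact flags):

dstack destroy -f examples/deployment/gateway.dstack.yml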
Private gateways
By default, gateways are deployed using public subnets. Since 0.18.2, it is possible to deploy gateways using private subnets. To do this, you need to set public_ip to false and specify the ARN of a certificate from AWS Certificate Manager.
type: gateway
name: example-gateway
backend: aws
region: eu-west-1
domain: "example.com"
public_ip: false
certificate:
  type: acm
  arn: "arn:aws:acm:eu-west-1:3515152512515:certificate/3251511125--1241-1224-121251515125"
In this case, dstack will deploy the gateway in a private subnet behind a load balancer using the specified certificate.
Note
Private gateways are currently supported only for AWS.
What's changed
- Support multi-node tasks with dstack pool add-ssh instances by @TheBits in #1189
- Fixed the JSON schema errors by @r4victor in #1193
- Support spot instances with runpod by @Bihan in #1119
- Speed up AWS VPC validation by @r4victor in #1196
- [Internal] Optimize ProjectModel loading by @r4victor in #1199
- Support provisioning instances without public IPs on AWS by @r4victor in #1203
- Minor improvements of dstack pool add-ssh by @TheBits in #1202
- Instances cannot be reused by other users by @TheBits in #1204
- Do not create AWS instance profile when launching instances by @r4victor in #1212
- Allow running services without https by @r4victor in #1217
- Implement dstack apply for gateways by @r4victor in #1223
- Support gateways without public IPs on AWS by @r4victor in #1224
- Support --network with dstack pool add-ssh by @TheBits in #1225
- [Internal] Make gateway creation async by @r4victor in #1236
- Using a more resourceful VM type by default for GCP gateway by @r4victor in #1237
- Handle properly if the network passed to dstack pool add-ssh is not correct by @TheBits in #1233
- Use valid GCP resource names by @r4victor in #1248
- Always try to restart dstack-shim.service with dstack pool add-ssh by @TheBits in #1253
- [Internal] Improve instance processing by @r4victor in #1251
- Changed dstack pool remove to rm by @muddi900 in #1258
- Support gateways behind ALB with ACM certificate by @r4victor in #1264
- Support IP addresses with --network by @TheBits in #1263
- [Internal] Fix double unlocking when processing runs and instances by @r4victor in #1268
- Add dstack destroy command and improve dstack apply by @r4victor in #1271
- Fix instances from pools ignoring regions by @r4victor in #1272
- Add the axolotl example by @deep-diver in #1187
New Contributors
- @muddi900 made their first contribution in #1258
Full Changelog: 0.18.1...0.18.2
0.18.1
On-prem servers
Now you can add your own servers as pool instances:
dstack pool add-ssh -i ~/.ssh/id_rsa [email protected]
Note
The server should be pre-installed with CUDA 12.1 and NVIDIA Docker.
Configuration
All .dstack/profiles.yml properties can now be specified via run configurations:
type: dev-environment
ide: vscode
spot_policy: auto
backends: ["aws"]
regions: ["eu-west-1", "eu-west-2"]
instance_types: ["p3.8xlarge", "p3.16xlarge"]
max_price: 2.0
max_duration: 1d
New examples 🔥🔥
Thanks to contributions from @deep-diver, we got two new examples.
Other
- Configuring VPCs using their IDs (via vpc_ids in server/config.yml)
- Support for global profiles (via ~/.dstack/profiles.yml)
- Updated the default environment variables (DSTACK_RUN_NAME, DSTACK_GPUS_NUM, DSTACK_NODES_NUM, DSTACK_NODE_RANK, and DSTACK_MASTER_NODE_IP); see the sketch after this list
- It's now possible to use the NVIDIA A10 GPU on Azure
- More granular permissions for Azure
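As an illustration of the environment variables above, a task can read them at runtime. A minimal sketch:

type: task
nodes: 2
commands:
  - echo "Node $DSTACK_NODE_RANK of $DSTACK_NODES_NUM in run $DSTACK_RUN_NAME with $DSTACK_GPUS_NUM GPUs and master $DSTACK_MASTER_NODE_IP"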
What's changed
- Fix server freeze on terminate instance by @jvstme in #1132
- Support profile params in run configurations by @r4victor in #1131
- Support global .dstack/profiles.yml by @r4victor in #1134
- Fix No such profile: None when missing .dstack/profiles.yml by @r4victor in #1135
- Make Azure permissions more granular by @r4victor in #1139
- Validate min disk size by @r4victor in #1146
- Fix unexpected error if system Python version is unknown by @r4victor in #1147
- Add request timeouts to prevent code freezes by @jvstme in #1140
- Refactor backends to wait for instance IP address outside run_job/create_instance by @r4victor in #1149
- Fix provisioning Azure instances with A10 GPU by @jvstme in #1150
- [Internal] Move packer -> scripts/packer by @jvstme in #1153
- Added the ability to add your own instances by @TheBits in #1115
- Fix an issue with the executor_error check being falsely positive by @TheBits in #1160
- Make user project quota configurable by @r4victor in #1161
- Configure CORS headers on gateway by @r4victor in #1166
- Allow configuring AWS vpc_ids by @r4victor in #1170
- [Internal] Show dstack version in Sentry issues by @jvstme in #1167
- Fix KeyError: 'IpPermissions' when using AWS by @jvstme in #1169
- Create the public SSH key if it does not exist in dstack pool add-ssh by @TheBits in #1173
- Fixed the environment file upload by @TheBits in #1175
- Updated shim status processing by @TheBits in #1174
- Fix bugs in dstack pool add-ssh by @TheBits in #1178
- Fix Cudo Create VM response error by @Bihan in #1179
- Implement API for configuring backends via yaml by @r4victor in #1181
- Allow running gated models with HUGGING_FACE_HUB_TOKEN by @r4victor in #1184
- Pass all dstack runner envs as DSTACK_* by @r4victor in #1185
- Improve the retries in the get_host_info and get_shim_healthcheck by @TheBits in #1183
- Example/h4alignment handbook by @deep-diver in #1180
- The deploy is launched in ThreadPoolExecutor by @TheBits in #1186
Full Changelog: 0.18.0...0.18.1
0.18.0
RunPod
The update adds the long-awaited integration with RunPod, a distributed GPU cloud that offers GPUs at affordable prices.
To use RunPod, specify your RunPod API key in ~/.dstack/server/config.yml:
projects:
  - name: main
    backends:
      - type: runpod
        creds:
          type: api_key
          api_key: US9XTPDIV8AR42MMINY8TCKRB8S4E7LNRQ6CAUQ9
Once the server is restarted, go ahead and run workloads.
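For example, a minimal task and the command to run it (the file name and GPU spec here are illustrative):

type: task
commands:
  - nvidia-smi
resources:
  gpu: 24GB

dstack run . -f task.dstack.yml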
Clusters
Another major change with the update is the ability to run multi-node tasks over an interconnected cluster of instances.
type: task
nodes: 2
commands:
  - git clone https://github.com/r4victor/pytorch-distributed-resnet.git
  - cd pytorch-distributed-resnet
  - mkdir -p data
  - cd data
  - wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  - tar -xvzf cifar-10-python.tar.gz
  - cd ..
  - pip3 install -r requirements.txt torch
  - mkdir -p saved_models
  - torchrun --nproc_per_node=$DSTACK_GPUS_PER_NODE --node_rank=$DSTACK_NODE_RANK --nnodes=$DSTACK_NODES_NUM --master_addr=$DSTACK_MASTER_NODE_IP --master_port=8008 resnet_ddp.py --num_epochs 20
resources:
  gpu: 1
Currently supported providers for this feature include AWS, GCP, and Azure.
Other
- The commands property is no longer required for tasks and services if you use an image that has a default entrypoint configured; see the sketch after this list.
- The permissions required for using dstack with GCP are more granular.
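For instance, here's a sketch of a service that relies entirely on the image's default entrypoint (the image and port are illustrative):

type: service
# No commands needed; the image's default entrypoint is used (illustrative image).
image: nginx
port: 80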
What's changed
- Add username filter to /api/runs/list by @r4victor in #1068
- Inherit core models from DualBaseModel by @r4victor in #967
- Fixed the YAML schema validation for replicas by @peterschmidt85 in #1055
- Improve the server/config.yml reference documentation by @peterschmidt85 in #1077
- Add the runpod backend by @Bihan in #1063
- Support JSON log handler by @TheBits in #1085
- Added lock to the terminate_idle_instance by @TheBits in #1081
- dstack init doesn't work with a remote Git repo by @peterschmidt85 in #1090
- Minor improvements of dstack server output by @peterschmidt85 in #1088
- Return error information from dstack-shim by @TheBits in #1061
- Replace RetryPolicy.limit with RetryPolicy.duration by @TheBits in #1074
- Make dstack version configurable when deploying docs by @peterschmidt85 in #1095
- dstack init doesn't work with a local Git repo by @peterschmidt85 in #1096
- Fix infinite create_instance() on the cudo provider by @r4victor in #1082
- Do not update the latest Docker image and YAML scheme for pre-release builds by @peterschmidt85 in #1099
- Support multi-node tasks by @r4victor in #1103
- Make commands optional in run configurations by @jvstme in #1104
- Allow the cudo backend to use non-GPU instances by @Bihan in #1092
- Make GCP permissions more granular by @r4victor in #1107
Full changelog: 0.17.0...0.18.0