Activity
fix: more EOF protection fixes
Pull request merge
fix: update s3fs pvc to prevent its caching mechanism from consuming …
Pull request merge
fix: improve cpu utilization metrics, and retry on ray head initConta…
Pull request merge
fix: multinic detection was broken; also was hard-wiring name of reso…
Pull request merge
fix: custodian pods can linger forever
Pull request merge
fix: avoid downloading helm chart on every run (cache it)
Pull request merge
fix: improve torchx support for running multiple gpus per pod
Pull request merge
fix: codeflare top may fail due to Array(fractionalnumber)
Pull request merge
feat: improve multinic and NCCL performance
Pull request merge
fix: use initContainer to wait for ray workers
Pull request merge
fix: increase ray gcs rpc timeout to 30s
Pull request merge
fix: increase resilience to network disconnects for torchx
Pull request merge
fix: wait for ray workers prior to server-side job submit
Pull request merge
fix: increase resilience to network disconnects, restore helm delete …
Pull request merge
fix: add websocat to custodian to avoid having to wget it every time
Pull request merge
fix: avoid helm delete in custodian for now
Pull request merge