Activity

Release 4.12.6

starpit pushed 1 commit to main • 1506ef2…d5eb230 • on May 25, 2023

fix: more EOF protection fixes

Pull request merge
starpit pushed 1 commit to main • edcb654…1506ef2 • on May 25, 2023

Release 4.12.5

starpit pushed 1 commit to main • a88602b…edcb654 • on May 13, 2023

fix: update s3fs pvc to prevent its caching mechanism from consuming …

Pull request merge
starpit pushed 1 commit to main • 16de598…a88602b • on May 13, 2023

Release 4.12.4

starpit pushed 1 commit to main • 9f35d12…16de598 • on May 11, 2023

fix: improve cpu utilization metrics, and retry on ray head initConta…

Pull request merge
starpit pushed 1 commit to main • fd0556a…9f35d12 • on May 11, 2023

Release 4.12.3

starpit pushed 1 commit to main • da9341b…fd0556a • on May 9, 2023

fix: multinic detection was broken; also was hard-wiring name of reso…

Pull request merge
starpit pushed 1 commit to main • 8dca76c…da9341b • on May 9, 2023

Release 4.12.2

starpit pushed 1 commit to main • 9276869…8dca76c • on May 5, 2023

fix: custodian pods can linger forever

Pull request merge
starpit pushed 1 commit to main • 17ac5da…9276869 • on May 5, 2023

fix: avoid downloading helm chart on every run (cache it)

Pull request merge
starpit pushed 1 commit to main • 6a8f5e4…17ac5da • on May 3, 2023

Release 4.12.1

starpit pushed 1 commit to main • 70d522e…6a8f5e4 • on May 3, 2023

fix: improve torchx support for running multiple gpus per pod

Pull request merge
starpit pushed 1 commit to main • 1c0b69b…70d522e • on May 3, 2023

Release 4.12.0

starpit pushed 1 commit to main • cee32ec…1c0b69b • on May 3, 2023

fix: codeflare top may fail due to Array(fractionalnumber)

Pull request merge
starpit pushed 1 commit to main • f2f9a56…cee32ec • on May 3, 2023

feat: improve multinic and NCCL performance

Pull request merge
starpit pushed 1 commit to main • af05d1b…f2f9a56 • on May 2, 2023

Release 4.11.12

starpit pushed 1 commit to main • 3abe1e3…af05d1b • on May 2, 2023

fix: use initContainer to wait for ray workers

Pull request merge
starpit pushed 1 commit to main • e8fbe02…3abe1e3 • on May 2, 2023

Release 4.11.11

starpit pushed 1 commit to main • 1c62309…e8fbe02 • on May 2, 2023

fix: increase ray gcs rpc timeout to 30s

Pull request merge
starpit pushed 1 commit to main • 409fd7d…1c62309 • on May 1, 2023

Release 4.11.10

starpit pushed 1 commit to main • 9097cd0…409fd7d • on May 1, 2023

fix: increase resilience to network disconnects for torchx

Pull request merge
starpit pushed 1 commit to main • fc12056…9097cd0 • on May 1, 2023

Release 4.11.9

starpit pushed 1 commit to main • 0d8731f…fc12056 • on May 1, 2023

fix: wait for ray workers prior to server-side job submit

Pull request merge
starpit pushed 1 commit to main • 4f7ae57…0d8731f • on May 1, 2023

Release 4.11.8

starpit pushed 1 commit to main • 5a47eb7…4f7ae57 • on Apr 28, 2023

fix: increase resilience to network disconnects, restore helm delete …

Pull request merge
starpit pushed 1 commit to main • 2334b1d…5a47eb7 • on Apr 28, 2023

fix: add websocat to custodian to avoid having to wget it every time

Pull request merge
starpit pushed 1 commit to main • aefc221…2334b1d • on Apr 28, 2023

Release 4.11.7

starpit pushed 1 commit to main • 6a12f44…aefc221 • on Apr 27, 2023

fix: avoid helm delete in custodian for now

Pull request merge
starpit pushed 1 commit to main • c7f2f19…6a12f44 • on Apr 27, 2023

fix: increase memory for runtime-env custodian pod

Pull request merge
starpit pushed 1 commit to main • 85385f1…c7f2f19 • on Apr 27, 2023