v0.6.0

@danielvegamyhre danielvegamyhre released this 20 Aug 16:20
· 33 commits to main since this release
d66f1d5

Highlights

  • New JobSet Failure Policy API - allows users to configure different behavior for different types of errors, enabling them to use compute resources more efficiently and improve ML training goodput.
  • Add Coordinator field to JobSet spec, enabling users to define a global coordinator pod for distributed ML/HPC workloads. The stable network endpoint for this pod is added as a label and annotation to every Job and Pod in the JobSet so that application code can look it up easily. A common use case is TPU Multislice training with multiple different Job templates. See linked issue for details.
  • Add global Job index label/annotation to every Job and Pod, which is needed to support TPU Multislice training with multiple different Job templates. See linked issue for details.
  • Added new metrics
  • Improved test coverage
  • Bug fixes
  • New examples and documentation
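
To make the first two highlights more concrete, here is a rough Python sketch of what the new spec fields might look like when a manifest is built programmatically. It is not taken from the release itself: the field names (`failurePolicy`, `rules`, `action`, `onJobFailureReasons`, `targetReplicatedJobs`, `coordinator`, `replicatedJob`, `jobIndex`, `podIndex`) and the action values reflect my understanding of the JobSet API and should be checked against the API reference. The rule names, the `leader` replicated job, the `demo` JobSet name, and the endpoint format in `coordinator_endpoint` are all illustrative assumptions.

```python
import json


def build_spec_fragment():
    """Return only the spec fields discussed in the highlights.

    Field names are assumed from the JobSet API; rule names and the
    "leader" replicated job are hypothetical.
    """
    return {
        "failurePolicy": {
            # Upper bound on restarts of the whole JobSet.
            "maxRestarts": 3,
            "rules": [
                {
                    # Hypothetical rule: treat this failure reason as fatal.
                    "name": "fail-on-pod-failure-policy",
                    "action": "FailJobSet",
                    "onJobFailureReasons": ["PodFailurePolicy"],
                },
                {
                    # Hypothetical rule: restart without counting against maxRestarts.
                    "name": "restart-leader-freely",
                    "action": "RestartJobSetAndIgnoreMaxRestarts",
                    "targetReplicatedJobs": ["leader"],
                },
            ],
        },
        # Selects one pod of one Job as the global coordinator.
        "coordinator": {"replicatedJob": "leader", "jobIndex": 0, "podIndex": 0},
    }


def coordinator_endpoint(jobset_name, coordinator, subdomain=None):
    """Build the coordinator hostname.

    ASSUMPTION: the endpoint has the form
    <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>, and the
    subdomain defaults to the JobSet name. Verify before relying on it.
    """
    subdomain = subdomain or jobset_name
    return "{}-{}-{}-{}.{}".format(
        jobset_name,
        coordinator["replicatedJob"],
        coordinator["jobIndex"],
        coordinator["podIndex"],
        subdomain,
    )


if __name__ == "__main__":
    spec = build_spec_fragment()
    print(json.dumps(spec, indent=2))
    print(coordinator_endpoint("demo", spec["coordinator"]))
```

In practice, application code would normally read the endpoint from the label or annotation that JobSet sets on each Pod rather than rebuilding it; the helper above only illustrates what a stable per-pod endpoint could look like.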

What's Changed

New Contributors

Full Changelog: v0.6.0-devel...v0.6.0