v0.6.0
Highlights
- New JobSet Failure Policy API: lets users configure different behavior for different types of errors, so they can use compute resources more efficiently and improve ML training goodput.
- Add Coordinator field to JobSet spec, enabling users to define a global coordinator pod for distributed ML/HPC workloads. The stable network endpoint for this pod is added as a label and annotation to every Job and Pod in the JobSet for easy use in application code. A common use case is TPU Multislice training with multiple different Job templates. See linked issue for details.
- Add global Job index label/annotation to every Job and Pod, which is needed to support TPU Multislice training with multiple different Job templates. See linked issue for details.
- New metrics
- Improved test coverage
- Bug fixes
- New examples and documentation
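As a rough illustration of the Coordinator and global Job index highlights above, the sketch below shows how a stable coordinator endpoint and a global Job index could be derived from a JobSet's layout. It is not the controller's code. The hostname pattern `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>`, the subdomain defaulting to the JobSet name, and the rule that global indexes follow the order in which replicated jobs are listed are all assumptions made for this example. The `ReplicatedJob` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicatedJob:
    """Hypothetical stand-in for one replicated job entry in a JobSet spec."""
    name: str
    replicas: int


def coordinator_endpoint(jobset_name: str, replicated_job: str,
                         job_index: int, pod_index: int,
                         subdomain: Optional[str] = None) -> str:
    """Build the coordinator pod's hostname under the assumed naming pattern."""
    # Assumption: the headless service subdomain defaults to the JobSet name.
    subdomain = subdomain or jobset_name
    return f"{jobset_name}-{replicated_job}-{job_index}-{pod_index}.{subdomain}"


def global_job_index(replicated_jobs: List[ReplicatedJob],
                     replicated_job: str, job_index: int) -> int:
    """Return a JobSet-wide index for one Job.

    Assumption: Jobs are numbered in the order the replicated jobs are
    listed, so the offset is the total replicas of all earlier entries.
    """
    offset = 0
    for rj in replicated_jobs:
        if rj.name == replicated_job:
            if not 0 <= job_index < rj.replicas:
                raise ValueError(f"job index {job_index} out of range for {rj.name}")
            return offset + job_index
        offset += rj.replicas
    raise KeyError(f"unknown replicated job: {replicated_job}")
```

For a JobSet named `train` with one `driver` Job followed by four `workers` Jobs, `coordinator_endpoint("train", "driver", 0, 0)` gives `train-driver-0-0.train`, and the third `workers` Job (index 2) gets global index 3. In a real workload, application code would read these values from the labels and annotations that JobSet sets rather than recompute them.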
What's Changed
- feat: add e2e test for ttl seconds after finished in jobset by @dejanzele in #511
- add publish not ready headless service to jobset by @kannon92 in #505
- use kube-openapi rather than code generator openapi-gen by @kannon92 in #522
- Allow passing args to ginkgo for integration tests by @danielvegamyhre in #525
- Refactor create jobs by @danielvegamyhre in #516
- Do not default the managedBy field by @mimowo in #528
- feat: add event recorder event by @googs1025 in #507
- use t.Errorf instead of t.Fatalf by @googs1025 in #532
- Fix path for the error when attempting to mutate managedBy by @mimowo in #527
- Fix bug when checking if a JobSet is active during tests. by @jedwins1998 in #531
- Correct typo in configurable failure policy KEP. by @jedwins1998 in #539
- fix: fix ci error caused by typo by @googs1025 in #544
- Bump the kubernetes group with 4 updates by @dependabot in #542
- Bump github.com/onsi/gomega from 1.32.0 to 1.33.0 by @dependabot in #543
- docs: fix site url not found by @googs1025 in #541
- use hugo param to define variables in md language by @googs1025 in #540
- add unit tests for createHeadlessSvcIfNecessary by @dejanzele in #526
- test: add pod controller unit test by @googs1025 in #490
- Add comment explaining why we don't unconditionally compute firstFailedJob by @danielvegamyhre in #549
- Bump github.com/onsi/ginkgo/v2 from 2.17.1 to 2.17.2 by @dependabot in #552
- Track which features in roadmap have been released by @danielvegamyhre in #554
- docs: using kustomize for adjusting resources by @omerap12 in #558
- Bump github.com/onsi/gomega from 1.33.0 to 1.33.1 by @dependabot in #560
- Don't reconcile JobSets with deletion timestamp set by @danielvegamyhre in #562
- Improve the API generated docs for managedBy by @mimowo in #565
- chore: Upgrade e2e local image by @googs1025 in #567
- Bump github.com/onsi/ginkgo/v2 from 2.17.2 to 2.17.3 by @dependabot in #569
- Add support for feature gates by @googs1025 in #557
- Implement configurable failure policy. by @jedwins1998 in #537
- Update the JobSet version to 0.5.1 for installation by @mimowo in #577
- Bump github.com/onsi/ginkgo/v2 from 2.17.3 to 2.19.0 by @dependabot in #581
- Relax validation on ReplicatedJob PodTemplates of suspended JobSets by @danielvegamyhre in #580
- update makefile kind version to v1.30.0 by @googs1025 in #589
- Propagate Job pod template updates to suspended jobs when resuming by @danielvegamyhre in #590
- docs: update to v0.5.2 by @googs1025 in #593
- fix: fix log to avoid panic by @googs1025 in #595
- avoid log panic by @googs1025 in #598
- Add omitempty to annotation of OnJobFailureReasons. by @jedwins1998 in #596
- update readme docs e2e test version to v1.30 by @googs1025 in #602
- Update _index.md MASTER_ADDR by @song-william in #604
- Add client-go example by @danielvegamyhre in #606
- Wait for the webhook service to be listening before advertising the Jobset replica as ready. by @mbobrovskyi in #608
- docs: add simple example for network field by @googs1025 in #550
- feat: add terminalState to jobset status by @googs1025 in #594
- Integration test improvement: rename "update" to "step" by @danielvegamyhre in #610
- docs: add argo workflow example for jobset by @googs1025 in #612
- docs: add JobSet API reference by @googs1025 in #611
- docs: fix typo, Github -> GitHub by @highpon in #615
- Allow mutating schedulingGates when the Jobset is suspended by @mimowo in #623
- Add Coordinator field to JobSet spec by @danielvegamyhre in #618
- Validation for Coordinator field by @danielvegamyhre in #627
- Add example for coordinator by @danielvegamyhre in #628
- docs: add prometheus-operator example for jobset by @googs1025 in #629
- Bump github.com/onsi/gomega from 1.33.1 to 1.34.0 by @dependabot in #631
- Bump github.com/onsi/ginkgo/v2 from 2.19.0 to 2.19.1 by @dependabot in #632
- feat: add metrics for jobset by @googs1025 in #614
- docs: update metrics info for site by @googs1025 in #633
- chore: add github issue, pr template by @googs1025 in #634
- Bump github.com/onsi/gomega from 1.34.0 to 1.34.1 by @dependabot in #638
- fix error output by @googs1025 in #636
- Bump k8s dependencies to 1.30 dependencies and modify update-codegen.sh to be compatible with new code-generator by @danielvegamyhre in #641
- Fix bug in replicatedJobByName by @danielvegamyhre in #645
- Allow to update JobSets on suspend by @mimowo in #644
- Refactor jobset webhook by @danielvegamyhre in #646
- add the unparam linter to golangci and fix those issues flagged by @kannon92 in #643
- drop job-name from labels as it is not used by @kannon92 in #642
- Bump github.com/onsi/ginkgo/v2 from 2.19.1 to 2.20.0 by @dependabot in #647
- Add new job-id annotation to assign globally unique job index to each job by @danielvegamyhre in #650
- Bump github.com/prometheus/client_golang from 1.19.1 to 1.20.0 by @dependabot in #653
- update to k8s 0.30.4 by @kannon92 in #654
New Contributors
- @mimowo made their first contribution in #528
- @omerap12 made their first contribution in #558
- @song-william made their first contribution in #604
- @mbobrovskyi made their first contribution in #608
- @highpon made their first contribution in #615
Full Changelog: v0.6.0-devel...v0.6.0