This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Cell as SKU in intra-vc scheduler #34

Status: Open. Wants to merge 26 commits into base: master.
Showing changes from 20 commits.

Commits (26):
b8e332b Add new interface for pod group (abuccts, Jan 28, 2021)
76fd537 Add conversion, defauling, validation for v2 spec (abuccts, Jan 28, 2021)
e56fe3b Convert affinity group to pod group (abuccts, Mar 4, 2021)
3b69b82 Add cell type to level mapping for intra-vc (abuccts, Mar 12, 2021)
c509e4c Fix pod group placement type in scheduling status (abuccts, Mar 12, 2021)
9ce58b4 Convert old test cases to v2 schema (#37) (hzy46, Mar 18, 2021)
d041610 Update placements, binding info, and webserver (abuccts, Apr 6, 2021)
7d1275d Update intra vc scheduler (abuccts, Apr 7, 2021)
3a7d752 Fix legacy unit tests (abuccts, Apr 20, 2021)
ae26632 Fix bugs in new v2 test cases (abuccts, May 12, 2021)
0f86361 Fix comments (abuccts, May 12, 2021)
fb90972 Update according to comments (abuccts, Jul 14, 2021)
8080214 Revert .vscode (abuccts, Jul 14, 2021)
f906dfd Update comment for PodMinNumber and PodMaxNumber (abuccts, Jul 14, 2021)
5dd3201 Rename PodGroup and BindInfo (abuccts, Jul 14, 2021)
94272a2 Rename PodPlacementsInfo and PodPlacementsInfoList (abuccts, Jul 14, 2021)
ccb6561 Fix bugs in corner cases (abuccts, Jul 21, 2021)
e6c74e1 Add backward compatibility for bind info (abuccts, Aug 2, 2021)
91121c0 Add new test cases for V2 schema (#38) (hzy46, Aug 2, 2021)
c5e1bf1 Update request spec, examples, and design docs (abuccts, Aug 10, 2021)
ed7577c Update according to comments (abuccts, Nov 8, 2021)
35949b6 Fix GitHub Action config (abuccts, Nov 8, 2021)
e71a5f6 Change `PodGroup.Pods` to `PodGroup.Pod` (abuccts, Nov 8, 2021)
8a7f52d Add early return, sort cells, and sorting comments (abuccts, Nov 8, 2021)
cdaeea3 Update config in examples accordingly (abuccts, Nov 9, 2021)
5c1e5c8 Skip build for legacy topo aware scheduler. (abuccts, Nov 9, 2021)

128 changes: 64 additions & 64 deletions doc/design/state-machine.md

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions example/feature/README.md
@@ -53,9 +53,9 @@ This is similar to [K8S Labels and Selectors](https://kubernetes.io/docs/concept
### Description
A set of pods is scheduled as a gang, i.e. in an all-or-nothing fashion.

-The gang is treated as an `AffinityGroup`, the scheduling unit of HiveD.
+The gang is treated as an `PodGroup`, the scheduling unit of HiveD.

-A job can specify all its pods are in the same `AffinityGroup`, so the whole job is gang scheduled.
+A job can specify all its pods are in the same `PodGroup`, so the whole job is gang scheduled.

This is useful for jobs that cannot perform any useful work, such as making progress or serving, until all pods are running. A typical example in deep learning workloads is [distributed training](#TensorFlow-Distributed-Training).

@@ -76,7 +76,7 @@ This is useful for jobs that cannot perform any useful work, such as making prog
### Description
A set of pods is scheduled regardless of each other, i.e. does not require [Gang Scheduling](#Gang-Scheduling).

-A job can specify its pods in different `AffinityGroups`, so the whole job is incrementally scheduled (one `AffinityGroup` each time).
+A job can specify its pods in different `PodGroups`, so the whole job is incrementally scheduled (one `PodRootGroup` each time).

This is used for jobs that can still perform useful works, such as making progress or serving, even if only one pod is running.

@@ -138,11 +138,11 @@ One VC's [Guaranteed Job](#Guaranteed-Job) can preempt other VCs' [Opportunistic

## Topology-Aware Intra-VC Scheduling
### Description
-Within one VC, HiveD chooses nearest leaf cells for one `AffinityGroup` in best effort.
+Within one VC, HiveD chooses nearest leaf cells for one `PodGroup` in best effort.

### Reproduce Steps
1. Use [hived-config-2](file/hived-config-2.yaml).
-2. Submit job [itc-buddy](file/itc-buddy.yaml), which requests for 2 single GPU tasks in the same `AffinityGroup`, tasks will be allocated to 2 buddy GPUs.
+2. Submit job [itc-buddy](file/itc-buddy.yaml), which requests for 2 single GPU tasks in the same `PodGroup`, tasks will be allocated to 2 buddy GPUs.

<img src="file/itc-buddy-1.png" width="600"/>
<img src="file/itc-buddy-2.png" width="600"/>
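
Reviewer note: the README hunks above distinguish gang scheduling (all pods of one `PodGroup` placed together or not at all) from incremental scheduling (pods split across several groups, each scheduled on its own). The following is a minimal Python sketch of that difference only. It is not HiveD's implementation: `schedule_gang`, `schedule_incremental`, and the flat list of free cells are hypothetical simplifications, and cell topology is ignored entirely.

```python
from typing import Dict, List, Optional


def schedule_gang(free_cells: List[str], pods: List[str]) -> Optional[Dict[str, str]]:
    """Place every pod of one group, or none of them (all-or-nothing).

    Hypothetical helper: one pod needs exactly one free cell.
    """
    if len(pods) > len(free_cells):
        return None  # not enough cells, so the whole group keeps waiting
    return dict(zip(pods, free_cells))


def schedule_incremental(
    free_cells: List[str], groups: Dict[str, List[str]]
) -> Dict[str, Dict[str, str]]:
    """Schedule each group independently, one group at a time.

    Groups that fit are placed; groups that do not fit are skipped
    without blocking the groups that come after them.
    """
    remaining = list(free_cells)
    placements: Dict[str, Dict[str, str]] = {}
    for name, pods in groups.items():
        placed = schedule_gang(remaining, pods)
        if placed is not None:
            placements[name] = placed
            remaining = remaining[len(pods):]  # consume the cells just used
    return placements
```

With two free cells, a single three-pod group gets nothing under `schedule_gang`, while three separate groups of sizes 1, 2, and 1 under `schedule_incremental` would see the first and third placed and the second left waiting.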