diff --git a/.github/workflows/build.yaml b/.github/workflows/build.yaml index f9e469e..b03542f 100644 --- a/.github/workflows/build.yaml +++ b/.github/workflows/build.yaml @@ -11,7 +11,7 @@ on: jobs: build: name: Build and Test - runs-on: ubuntu-16.04 + runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - run: build/hivedscheduler/docker-build.sh test @@ -21,7 +21,7 @@ jobs: strategy: matrix: go: [1.12.6] - os: [ubuntu-16.04] + os: [ubuntu-latest] steps: - name: Checkout uses: actions/checkout@v2 diff --git a/doc/design/img/AG-state-machine.png b/doc/design/img/PG-state-machine.png similarity index 100% rename from doc/design/img/AG-state-machine.png rename to doc/design/img/PG-state-machine.png diff --git a/doc/design/state-machine.md b/doc/design/state-machine.md index a3bbef7..652ef78 100644 --- a/doc/design/state-machine.md +++ b/doc/design/state-machine.md @@ -1,36 +1,36 @@ # HiveD State Machines This document presents the state machines of HiveD, and explains the life cycles of jobs and resources in HiveD. -We will describe the state machines of our scheduling unit, Affinity Group, and that of our resource unit, Cell, respectively. +We will describe the state machines of our scheduling unit, Pod Group, and that of our resource unit, Cell, respectively. -## Affinity Group (AG) State Machine +## Pod Group (PG) State Machine -An affinity group (AG) is a set of gang-scheduled pods. It is the basic scheduling unit in HiveD. The figure below shows the state machine of an affinity group. +A pod group (PG) is a set of gang-scheduled pods. It is the basic scheduling unit in HiveD. The figure below shows the state machine of a pod group. -Note that the AG state machine has interactions with the cell state machine (elaborated later). In our design, AGs influence each other only via their overlapping cells: for example, an event sent to an AG may trigger an event sent to a cell, and that cell may further trigger another event sent to another AG which is currently associated with the cell. -Therefore, the state machines of multiple AGs are effectively bridged by the state machines of their overlapping cells. +Note that the PG state machine has interactions with the cell state machine (elaborated later). In our design, PGs influence each other only via their overlapping cells: for example, an event sent to an PG may trigger an event sent to a cell, and that cell may further trigger another event sent to another PG which is currently associated with the cell. +Therefore, the state machines of multiple PGs are effectively bridged by the state machines of their overlapping cells.

- AG + PG

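The five states described in the next section can be summarized as a small Go state type. This is only a sketch to accompany the documentation: the constants this diff adds in `pkg/algorithm/constants.go` define only the internally tracked `Allocated`, `Preempting`, and `BeingPreempted` states, while the `Pending` and `Deleted` names below are assumptions made here to mirror the prose.

```go
// Illustrative only: the PG life cycle expressed as a Go state type.
// podGroupPending and podGroupDeleted are assumed names for this sketch;
// the real constants in this diff cover only the other three states.
package pgstatesketch

type PodGroupState string

const (
	podGroupPending        PodGroupState = "Pending"        // persistent: a recovery point of HiveD
	podGroupPreempting     PodGroupState = "Preempting"     // volatile: reset after scheduler restart
	podGroupAllocated      PodGroupState = "Allocated"      // persistent: a recovery point of HiveD
	podGroupBeingPreempted PodGroupState = "BeingPreempted" // volatile: reset after scheduler restart
	podGroupDeleted        PodGroupState = "Deleted"        // persistent: all cells released
)
```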
### States -__`Pending`__: the AG is waiting to be scheduled. +__`Pending`__: the PG is waiting to be scheduled. -__`Preempting`__: the AG has reserved cells but is waiting for the completion of preemptions of other AGs. +__`Preempting`__: the PG has reserved cells but is waiting for the completion of preemptions of other PGs. -__`Allocated`__: the AG is fully allocated cells. +__`Allocated`__: the PG is fully allocated cells. -__`Being preempted`__: the AG is being preempted by other AGs via their overlapping cells (the preemption is still ongoing). +__`Being preempted`__: the PG is being preempted by other PGs via their overlapping cells (the preemption is still ongoing). -__`Deleted`__: the AG is fully deleted and all of its cells are released. +__`Deleted`__: the PG is fully deleted and all of its cells are released. -Note that only ``Pending``, `Allocated`, and `Deleted` are persistent, thus they are the recovery points of HiveD. While the other AG states (`Preempting`, `Being preempted`) are volatile, so they will transition to others after scheduler crash and restart (i.e., ec in the state machine). +Note that only ``Pending``, `Allocated`, and `Deleted` are persistent, thus they are the recovery points of HiveD. While the other PG states (`Preempting`, `Being preempted`) are volatile, so they will transition to others after scheduler crash and restart (i.e., ec in the state machine). Also note that `Allocated` state includes updating pod annotation (pod binding) and pod running. We assume once pod annotation has been updated (pod bound to a node), the pod running is handled by K8s. We hence only describe the cell allocation state in the state machine, and do not care about the pods' real running state. -### Common AG life cycles +### Common PG life cycles __No preemption involved__: `Pending` -> `Allocated` -> `Deleted` @@ -48,85 +48,85 @@ Operation: none. __e0__: -Condition: all cells that the cell allocation algorithm decides to allocate to the AG are in `Free` or `Reserved` states (we assume the cell allocation algorithm has ensured that for every `Reserved` cell this AG's priority is higher than that of the AG currently associated with the cell, e.g., a `Preempting` AG). +Condition: all cells that the cell allocation algorithm decides to allocate to the PG are in `Free` or `Reserved` states (we assume the cell allocation algorithm has ensured that for every `Reserved` cell this PG's priority is higher than that of the PG currently associated with the cell, e.g., a `Preempting` PG). Operation: -For all cells allocated to this AG: +For all cells allocated to this PG: -`Free` -> `Used` (by this AG) (e0 in cell state machine); +`Free` -> `Used` (by this PG) (e0 in cell state machine); -`Reserved` (by another AG) -> `Used` (by this AG) (e8 in cell state machine). +`Reserved` (by another PG) -> `Used` (by this PG) (e8 in cell state machine). __e1__: -Condition: there is at least one cell among those the algorithm decides to allocate to this AG in `Used` or `Reserving` states (we assume the cell allocation algorithm has ensured that for every `Used` or `Reserving` cell this AG's priority is higher than that of the AG currently associated with the cell). +Condition: there is at least one cell among those the algorithm decides to allocate to this PG in `Used` or `Reserving` states (we assume the cell allocation algorithm has ensured that for every `Used` or `Reserving` cell this PG's priority is higher than that of the PG currently associated with the cell). 
Operation: -For all cells currently associated with other AGs: +For all cells currently associated with other PGs: -`Used` (by other AGs) -> `Reserving` (by this AG) (e2 in cell state machine); +`Used` (by other PGs) -> `Reserving` (by this PG) (e2 in cell state machine); -`Reserving`/`Reserved` (by other AGs) -> `Reserving`/`Reserved` (by this AG) (e3/e6 in cell state machine); +`Reserving`/`Reserved` (by other PGs) -> `Reserving`/`Reserved` (by this PG) (e3/e6 in cell state machine); For free cells: -`Free` -> `Reserved` (by this AG) (e5 in cell state machine). +`Free` -> `Reserved` (by this PG) (e5 in cell state machine). __e2__: -Condition: all the cells that the cell allocation algorithm decided to allocate to this AG are `Reserved`. +Condition: all the cells that the cell allocation algorithm decided to allocate to this PG are `Reserved`. -Operation: for all cells of this AG: +Operation: for all cells of this PG: -`Reserved` (by this AG) -> `Used` (by this AG) (e8 in cell state machine). +`Reserved` (by this PG) -> `Used` (by this PG) (e8 in cell state machine). __e3__: -Condition: all pods of this AG are deleted. +Condition: all pods of this PG are deleted. -Operation: for all cells of this AG: +Operation: for all cells of this PG: -`Used` (by this AG) -> `Free` (e1 in cell state machine). +`Used` (by this PG) -> `Free` (e1 in cell state machine). __e4__: -Condition: all pods of this AG are deleted. +Condition: all pods of this PG are deleted. Operation: -all cells `Used` (by this AG) -> `Free` (e1 in cell state machine). +all cells `Used` (by this PG) -> `Free` (e1 in cell state machine). __e5__: -Condition: a `Reserving` or `Reserved` cell in this AG is being overwritten by another AG (e3 or e6 in cell state machine). +Condition: a `Reserving` or `Reserved` cell in this PG is being overwritten by another PG (e3 or e6 in cell state machine). Operation: -All the other `Reserving` cells (by this AG) -> `Used` (by the `Being preempted` AG currently associated with the cell) (e4 in cell state machine); +All the other `Reserving` cells (by this PG) -> `Used` (by the `Being preempted` PG currently associated with the cell) (e4 in cell state machine); -All the other `Reserved` cells (by this AG) -> `Free` (e7 in cell state machine). +All the other `Reserved` cells (by this PG) -> `Free` (e7 in cell state machine). __e6__: -Condition: a cell allocated to this AG from `Used` (by this AG) to `Reserving` (by another AG) (e2 in cell state machine) +Condition: a cell allocated to this PG from `Used` (by this PG) to `Reserving` (by another PG) (e2 in cell state machine) Operation: none. __e7__: -Condition: all pods of this AG are deleted. +Condition: all pods of this PG are deleted. Operation: -All the `Reserving` cells (by this AG) -> `Used` (by the `Being preempted` AG currently associated with the cell) (e4 in cell state machine). +All the `Reserving` cells (by this PG) -> `Used` (by the `Being preempted` PG currently associated with the cell) (e4 in cell state machine). -All the `Reserved` cells (by this AG) -> `Free` (e7 in cell state machine); +All the `Reserved` cells (by this PG) -> `Free` (e7 in cell state machine); __e8__: -Condition: all pods of this AG are deleted. +Condition: all pods of this PG are deleted. Operation: none. @@ -140,17 +140,17 @@ Cell is the resource unit in HiveD. The figure below shows the state machine of ### States -__`Free`__: no AG is associated with this cell. +__`Free`__: no PG is associated with this cell. 
-__`Used`__: only an `Allocated` or a `Being preempted` AG is associated with the cell. +__`Used`__: only an `Allocated` or a `Being preempted` PG is associated with the cell. -__`Reserved`__: only a `Preempting` AG is associated with the cell. +__`Reserved`__: only a `Preempting` PG is associated with the cell. -__`Reserving`__: a `Preempting` and a `Being preempted` AG are associated with the cell. +__`Reserving`__: a `Preempting` and a `Being preempted` PG are associated with the cell. -Note that all states are volatile; the `Free` and `Used` states are derived from the AG state machine. +Note that all states are volatile; the `Free` and `Used` states are derived from the PG state machine. -Also note that the reservation of cells (`Reserved` and `Reserving` states) is not necessarily designed for preemptions (i.e., reserving resources for the `Preempting` AGs), despite the state definitions involving preemptions above. In the future it is possible that we extend this mechanism to support other features that need reservation, such as reservation during waiting to achieve strict FIFO and fairness for larger AGs. +Also note that the reservation of cells (`Reserved` and `Reserving` states) is not necessarily designed for preemptions (i.e., reserving resources for the `Preempting` PGs), despite the state definitions involving preemptions above. In the future it is possible that we extend this mechanism to support other features that need reservation, such as reservation during waiting to achieve strict FIFO and fairness for larger PGs. ### Common cell life cycles @@ -164,7 +164,7 @@ __Preemption involved__: Note: "Allocate/Reserve/Release a cell" in the below descriptions means modifying the in-memory data structures for scheduling, e.g., free cell list, cell bindings, cell priorities. Allocating/reserving or releasing a cell in the cell view will modify the free cell list, and split or merge the cell, create or destroy cell bindings, and set or reset the cell priority. -These changes are immediately visible to the cell allocation algorithm when scheduling subsequent AGs. +These changes are immediately visible to the cell allocation algorithm when scheduling subsequent PGs. __ec__: @@ -174,71 +174,71 @@ Operation: none. __e0__: -Condition: triggered by AG from `Pending` to `Allocated` (e0 in AG state machine). +Condition: triggered by PG from `Pending` to `Allocated` (e0 in PG state machine). -Operation: allocate the cell to the AG. +Operation: allocate the cell to the PG. __e1__: -Condition: the `Allocated` AG on this is deleted (e3 in AG state machine). +Condition: the `Allocated` PG on this is deleted (e3 in PG state machine). Operation: release the cell. __e2__: -Condition: triggered by another AG from `Pending` to `Preempting` (i.e., that AG is preempting the `Allocated` AG currently associated with this cell) (e1 in AG state machine). +Condition: triggered by another PG from `Pending` to `Preempting` (i.e., that PG is preempting the `Allocated` PG currently associated with this cell) (e1 in PG state machine). Operation: -The `Allocated` AG on this cell -> `Being preempted` (e6 in AG state machine); +The `Allocated` PG on this cell -> `Being preempted` (e6 in PG state machine); -Release the cell, and then reserve it for the `Preempting` AG. +Release the cell, and then reserve it for the `Preempting` PG. 
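To make the e2 transition just described concrete, here is a self-contained toy sketch of the cell/PG interaction in Go. All type, field, and helper names are assumptions made for this illustration; the scheduler's actual implementation in `pkg/algorithm` differs in structure and detail.

```go
// Toy model of cell event e2: a Preempting PG takes a Used cell away from an
// Allocated PG. Names here are illustrative, not the real hivedscheduler types.
package main

import "fmt"

type cellState string
type podGroupState string

const (
	cellUsed      cellState = "Used"
	cellReserving cellState = "Reserving"

	pgAllocated      podGroupState = "Allocated"
	pgPreempting     podGroupState = "Preempting"
	pgBeingPreempted podGroupState = "BeingPreempted"
)

type podGroup struct {
	name  string
	state podGroupState
}

type cell struct {
	state     cellState
	using     *podGroup // the Allocated (soon BeingPreempted) PG on this cell
	reserving *podGroup // the Preempting PG reserving this cell
}

// e2: another PG goes Pending -> Preempting and preempts the Allocated PG using this cell.
func (c *cell) e2(preemptor *podGroup) {
	c.using.state = pgBeingPreempted // e6 in the PG state machine
	c.reserving = preemptor          // "release the cell, then reserve it for the Preempting PG"
	c.state = cellReserving          // Used -> Reserving: both PGs are now associated with the cell
}

func main() {
	victim := &podGroup{name: "victim", state: pgAllocated}
	preemptor := &podGroup{name: "preemptor", state: pgPreempting}
	c := &cell{state: cellUsed, using: victim}
	c.e2(preemptor)
	fmt.Println(c.state, c.using.state, c.reserving.name) // Reserving BeingPreempted preemptor
}
```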
__e3__: -Condition: triggered by another AG from `Pending` to `Preempting` (i.e., that AG cancels the preemption of the `Preempting` AG currently associated with this cell, and continues to preempt the `Being preempted` AG associated with this cell) (e1 in AG state machine). +Condition: triggered by another PG from `Pending` to `Preempting` (i.e., that PG cancels the preemption of the `Preempting` PG currently associated with this cell, and continues to preempt the `Being preempted` PG associated with this cell) (e1 in PG state machine). Operation: -The original `Preempting` AG on this cell -> `Pending` (e5 in AG state machine); +The original `Preempting` PG on this cell -> `Pending` (e5 in PG state machine); -Release the cell, and then reserve it for the new `Preempting` AG. +Release the cell, and then reserve it for the new `Preempting` PG. __e4__: -Condition: triggered by the `Preempting` AG currently associated with this cell to `Pending` (e5 in AG state machine) or to `Deleted` (e7 in AG state machine). +Condition: triggered by the `Preempting` PG currently associated with this cell to `Pending` (e5 in PG state machine) or to `Deleted` (e7 in PG state machine). -Operation: release the cell, and then allocate it to the `Being preempted` AG on this cell (i.e., the preemption victim). +Operation: release the cell, and then allocate it to the `Being preempted` PG on this cell (i.e., the preemption victim). __e5__: -Condition: triggered by AG from `Pending` to `Preempting` (e1 in AG state machine). +Condition: triggered by PG from `Pending` to `Preempting` (e1 in PG state machine). -Operation: reserve the cell for the `Preempting` AG. +Operation: reserve the cell for the `Preempting` PG. __e6__: -Condition: triggered by another AG from `Pending` to `Preempting` (i.e., that AG cancels the preemption of the `Preempting` AG currently associated with this cell) (e1 in AG state machine). +Condition: triggered by another PG from `Pending` to `Preempting` (i.e., that PG cancels the preemption of the `Preempting` PG currently associated with this cell) (e1 in PG state machine). Operation: -The original `Preempting` AG on this cell -> `Pending` (e5 in AG state machine). +The original `Preempting` PG on this cell -> `Pending` (e5 in PG state machine). -Release the cell, and then reserve it for the new `Preempting` AG. +Release the cell, and then reserve it for the new `Preempting` PG. __e7__: -Condition: triggered by the `Preempting` AG currently associated with this cell to `Pending` (e5 in AG state machine) or to `Deleted` (e7 in AG state machine). +Condition: triggered by the `Preempting` PG currently associated with this cell to `Pending` (e5 in PG state machine) or to `Deleted` (e7 in PG state machine). Operation: release the cell. __e8__: -Condition: triggered by (i) there is currently a `Preempting` AG on this cell but another `Allocated` AG is now associated with the cell (e0 in AG state machine); OR (ii) the `Preempting` AG currently associated with this cell transitions to `Allocated` (e2 in AG state machine). +Condition: triggered by (i) there is currently a `Preempting` PG on this cell but another `Allocated` PG is now associated with the cell (e0 in PG state machine); OR (ii) the `Preempting` PG currently associated with this cell transitions to `Allocated` (e2 in PG state machine). Operation: -For (i): the `Preempting` AG on this cell -> `Pending` (e5 in AG state machine); release the cell and then allocate it to the new `Allocated` AG. 
+For (i): the `Preempting` PG on this cell -> `Pending` (e5 in PG state machine); release the cell and then allocate it to the new `Allocated` PG. For (ii): none. diff --git a/example/feature/README.md b/example/feature/README.md index 866a229..0ccddd6 100644 --- a/example/feature/README.md +++ b/example/feature/README.md @@ -53,9 +53,9 @@ This is similar to [K8S Labels and Selectors](https://kubernetes.io/docs/concept ### Description A set of pods is scheduled as a gang, i.e. in an all-or-nothing fashion. -The gang is treated as an `AffinityGroup`, the scheduling unit of HiveD. +The gang is treated as an `PodGroup`, the scheduling unit of HiveD. -A job can specify all its pods are in the same `AffinityGroup`, so the whole job is gang scheduled. +A job can specify all its pods are in the same `PodGroup`, so the whole job is gang scheduled. This is useful for jobs that cannot perform any useful work, such as making progress or serving, until all pods are running. A typical example in deep learning workloads is [distributed training](#TensorFlow-Distributed-Training). @@ -76,7 +76,7 @@ This is useful for jobs that cannot perform any useful work, such as making prog ### Description A set of pods is scheduled regardless of each other, i.e. does not require [Gang Scheduling](#Gang-Scheduling). -A job can specify its pods in different `AffinityGroups`, so the whole job is incrementally scheduled (one `AffinityGroup` each time). +A job can specify its pods in different `PodGroups`, so the whole job is incrementally scheduled (one `PodRootGroup` each time). This is used for jobs that can still perform useful works, such as making progress or serving, even if only one pod is running. @@ -138,11 +138,11 @@ One VC's [Guaranteed Job](#Guaranteed-Job) can preempt other VCs' [Opportunistic ## Topology-Aware Intra-VC Scheduling ### Description -Within one VC, HiveD chooses nearest leaf cells for one `AffinityGroup` in best effort. +Within one VC, HiveD chooses nearest leaf cells for one `PodGroup` in best effort. ### Reproduce Steps 1. Use [hived-config-2](file/hived-config-2.yaml). -2. Submit job [itc-buddy](file/itc-buddy.yaml), which requests for 2 single GPU tasks in the same `AffinityGroup`, tasks will be allocated to 2 buddy GPUs. +2. Submit job [itc-buddy](file/itc-buddy.yaml), which requests for 2 single GPU tasks in the same `PodGroup`, tasks will be allocated to 2 buddy GPUs. 
diff --git a/example/request/basic/request.yaml b/example/request/basic/request.yaml index bb1f3df..73ef311 100644 --- a/example/request/basic/request.yaml +++ b/example/request/basic/request.yaml @@ -7,9 +7,10 @@ jobPriorityClass: PROD taskRoles: a: taskNumber: 5 - leafCellType: K80 - leafCellNumber: 1 - affinityGroupName: null + resourcePerInstance: + skuNum: 1 + skuType: K80 + withinOne: null --- jobVC: VC2 jobName: demo2nopinned @@ -17,9 +18,10 @@ jobPriorityClass: PROD taskRoles: a: taskNumber: 5 - leafCellType: K80 - leafCellNumber: 1 - affinityGroupName: null + resourcePerInstance: + skuNum: 1 + skuType: K80 + withinOne: null --- jobVC: VC2 jobName: demo2pinned @@ -28,8 +30,31 @@ taskRoles: a: taskNumber: 5 pinnedCellId: VC2-K80 - leafCellNumber: 1 - affinityGroupName: null + resourcePerInstance: + skuNum: 1 + withinOne: null +--- +jobVC: VC3 +jobName: demo3within +jobPriorityClass: PROD +taskRoles: + a: + taskNumber: 1 + resourcePerInstance: + skuNum: 2 + skuType: K80 + withinOne: K80-SWITCH + b: + taskNumber: 1 + resourcePerInstance: + skuNum: 1 + skuType: K80 + withinOne: null +taskRoleGroups: + - taskRoles: + - a + - b + withinOne: K80-NODE --- ################################################################################ @@ -69,11 +94,12 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: K80 - leafCellNumber: 1 - affinityGroup: null + cellType: K80 + cellNumber: 1 + podRootGroup: null spec: schedulerName: hivedscheduler restartPolicy: Never @@ -116,11 +142,12 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC2 priority: 1000 - leafCellType: K80 - leafCellNumber: 1 - affinityGroup: null + cellType: K80 + cellNumber: 1 + podRootGroup: null spec: schedulerName: hivedscheduler restartPolicy: Never @@ -163,11 +190,141 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC2 priority: 1000 pinnedCellId: VC2-K80 - leafCellNumber: 1 - affinityGroup: null + cellNumber: 1 + podRootGroup: null + spec: + schedulerName: hivedscheduler + restartPolicy: Never + priority: 1000 + containers: + - name: ubuntu + image: ubuntu:trusty + command: ["sh", "-c", "nvidia-smi -L ; printenv ; sleep infinity"] + resources: + limits: + hivedscheduler.microsoft.com/pod-scheduling-enable: 1 + cpu: 4 + memory: 8Gi + env: + - name: NVIDIA_VISIBLE_DEVICES + valueFrom: + fieldRef: + fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] +--- +apiVersion: frameworkcontroller.microsoft.com/v1 +kind: Framework +metadata: + name: demo3within +spec: + executionType: Start + retryPolicy: + fancyRetryPolicy: true + maxRetryCount: 0 + taskRoles: + - name: a + taskNumber: 1 + frameworkAttemptCompletionPolicy: + minFailedTaskCount: 1 + minSucceededTaskCount: 1 + task: + retryPolicy: + fancyRetryPolicy: false + maxRetryCount: 0 + pod: + metadata: + annotations: + hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 + virtualCluster: VC3 + priority: 1000 + cellType: K80 + cellNumber: 2 + podRootGroup: + name: demo3within/group_0 + withinOneCell: K80-NODE + pod: null + childGroups: + - withinOneCell: K80-SWITCH + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + cellNumber: 2 + containsCurrentPod: true + childGroups: null + - withinOneCell: null + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + 
cellNumber: 1 + containsCurrentPod: false + childGroups: null + spec: + schedulerName: hivedscheduler + restartPolicy: Never + priority: 1000 + containers: + - name: ubuntu + image: ubuntu:trusty + command: ["sh", "-c", "nvidia-smi -L ; printenv ; sleep infinity"] + resources: + limits: + hivedscheduler.microsoft.com/pod-scheduling-enable: 1 + cpu: 4 + memory: 8Gi + env: + - name: NVIDIA_VISIBLE_DEVICES + valueFrom: + fieldRef: + fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] + - name: b + taskNumber: 1 + frameworkAttemptCompletionPolicy: + minFailedTaskCount: 1 + minSucceededTaskCount: 1 + task: + retryPolicy: + fancyRetryPolicy: false + maxRetryCount: 0 + pod: + metadata: + annotations: + hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 + virtualCluster: VC3 + priority: 1000 + cellType: K80 + cellNumber: 1 + podRootGroup: + name: demo3within/group_0 + withinOneCell: K80-NODE + pod: null + childGroups: + - withinOneCell: K80-SWITCH + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + cellNumber: 2 + containsCurrentPod: false + childGroups: null + - withinOneCell: null + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + cellNumber: 1 + containsCurrentPod: true + childGroups: null spec: schedulerName: hivedscheduler restartPolicy: Never @@ -198,11 +355,12 @@ metadata: name: demo1-a-0 annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: K80 - leafCellNumber: 1 - affinityGroup: null + cellType: K80 + cellNumber: 1 + podRootGroup: null spec: schedulerName: hivedscheduler restartPolicy: Never @@ -229,11 +387,12 @@ metadata: name: demo2nopinned-a-0 annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC2 priority: 1000 - leafCellType: K80 - leafCellNumber: 1 - affinityGroup: null + cellType: K80 + cellNumber: 1 + podRootGroup: null spec: schedulerName: hivedscheduler restartPolicy: Never @@ -260,11 +419,118 @@ metadata: name: demo2pinned-a-0 annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC2 priority: 1000 pinnedCellId: VC2-K80 - leafCellNumber: 1 - affinityGroup: null + cellNumber: 1 + podRootGroup: null +spec: + schedulerName: hivedscheduler + restartPolicy: Never + priority: 1000 + containers: + - name: ubuntu + image: ubuntu:trusty + command: ["sh", "-c", "nvidia-smi -L ; printenv ; sleep infinity"] + resources: + limits: + hivedscheduler.microsoft.com/pod-scheduling-enable: 1 + cpu: 4 + memory: 8Gi + env: + - name: NVIDIA_VISIBLE_DEVICES + valueFrom: + fieldRef: + fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] +--- +apiVersion: v1 +kind: Pod +metadata: + name: demo3within-a-0 + annotations: + hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 + virtualCluster: VC3 + priority: 1000 + cellType: K80 + cellNumber: 2 + podRootGroup: + name: demo3within/group_0 + withinOneCell: K80-NODE + pod: null + childGroups: + - withinOneCell: K80-SWITCH + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + cellNumber: 2 + containsCurrentPod: true + childGroups: null + - withinOneCell: null + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + cellNumber: 1 + containsCurrentPod: false + childGroups: null +spec: + schedulerName: hivedscheduler + restartPolicy: Never + priority: 1000 + containers: + - name: ubuntu + image: ubuntu:trusty + 
command: ["sh", "-c", "nvidia-smi -L ; printenv ; sleep infinity"] + resources: + limits: + hivedscheduler.microsoft.com/pod-scheduling-enable: 1 + cpu: 4 + memory: 8Gi + env: + - name: NVIDIA_VISIBLE_DEVICES + valueFrom: + fieldRef: + fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] +--- +apiVersion: v1 +kind: Pod +metadata: + name: demo3within-b-0 + annotations: + hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 + virtualCluster: VC3 + priority: 1000 + cellType: K80 + cellNumber: 1 + podRootGroup: + name: demo3within/group_0 + withinOneCell: K80-NODE + pod: null + childGroups: + - withinOneCell: K80-SWITCH + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + cellNumber: 2 + containsCurrentPod: false + childGroups: null + - withinOneCell: null + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + cellNumber: 1 + containsCurrentPod: true + childGroups: null spec: schedulerName: hivedscheduler restartPolicy: Never diff --git a/example/request/design/request.yaml b/example/request/design/request.yaml index 43f9ffc..870ce3f 100644 --- a/example/request/design/request.yaml +++ b/example/request/design/request.yaml @@ -2,21 +2,20 @@ # [Optional]: Job User -> RestServer Request # # Constrains: -# 1. For one task, only need to specify leafCellType or pinnedCellId, not both. -# 2. All leafCellTypes or pinnedCellIds under the same affinityGroup must be the same. +# 1. For one task, only need to specify cellType or pinnedCellId, not both. +# 2. All cellTypes or pinnedCellIds under the same podGroup must be the same. # -# affinityGroupName: -# An affinityGroup forms a cell request and scheduler will try all candidate +# A podGroup forms a cell request and scheduler will try all candidate # cellTypes and physicalCells for the cell to allocate. # 1. All candidate cellTypes: -# All the sufficient cellTypes with the smallest cellLevel. +# All the sufficient cellTypes with the cellLevel no higher than node level. # 2. All candidate physicalCells: # If pinnedCellId not specified: # All the sufficient physicalCells, except for all pinnedCells. # Else: # All the sufficient physicalCells, only within the specified pinned cell. # -# Allocate task within its affinityGroup cell: +# Allocate task within its podGroup cell: # 1. Avoid allocating one task across multiple nodes: # Using buddy allocation. ################################################################################ @@ -24,67 +23,105 @@ jobVC: VC1 jobName: JOBX jobPriorityClass: PROD taskRoles: - # All tasks in role A, B, C should be within the same cell named PCN-ABC. + # All tasks in role A, B, C should be within the same podGroup, referred by JOBX/GROUP-ABC. # - # Total request of PCN-ABC: - # leafCellType: DGX2-V100 - # leafCellNumber: 1 * 16 + 3 * 8 + 1 * 4 = 44 GPUs = 2.75 DGX2 nodes + # Total request of JOBX/GROUP-ABC: + # cellType: DGX2-V100 + # cellNumber: 1 * 16 + 3 * 8 + 1 * 4 = 44 GPUs = 2.75 DGX2 nodes # Candidate cellTypes: # 3-DGX2-NODE, 4-DGX2-NODE, 4-DIRECT-DGX2-NODE, 5-DGX2-NODE. 
A: taskNumber: 1 - leafCellType: DGX2-V100 - leafCellNumber: 16 - affinityGroupName: PCN-ABC + resourcePerInstance: + skuNum: 16 + skuType: DGX2-V100 B: taskNumber: 3 - leafCellType: DGX2-V100 - leafCellNumber: 8 - affinityGroupName: PCN-ABC + resourcePerInstance: + skuNum: 8 + skuType: DGX2-V100 C: taskNumber: 1 - leafCellType: DGX2-V100 - leafCellNumber: 4 - affinityGroupName: PCN-ABC + resourcePerInstance: + skuNum: 4 + skuType: DGX2-V100 +taskRoleGroups: + - taskRoles: + - A + - B + - C + withinOne: null - # All tasks in role D should be within the same cell named PCN-D. + # All tasks in role D should be within the same podGroup, referred by JOBX/GROUP-D. # - # Total request of PCN-D: - # leafCellType: null -> any leafCellType - # leafCellNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs + # Total request of JOBX/GROUP-D: + # cellType: null -> any leaf cellType + # cellNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs # Candidate cellTypes: # DGX1-P100-NODE, DGX1-V100-NODE, DGX2-NODE-8-GPU, IB-DGX2-NODE-8-GPU. D: taskNumber: 2 - leafCellType: null # null, empty or not specified -> any leafCellType - leafCellNumber: 3 - affinityGroupName: PCN-D + resourcePerInstance: + skuNum: 3 + skuType: null # null, empty or not specified -> any cellType + within: null # null, empty or not specified -> one podGroup by default and gang schedule - # Tasks in role E is not required to be within the same cell. + # Tasks in role E is not required to be within the same cell if `gang: false` is specified. # # Each task forms a cell request: - # leafCellType: DGX2-V100 - # leafCellNumber: 1 * 16 = 16 GPUs = 1 DGX2 node + # cellType: DGX2-V100 + # cellNumber: 1 * 16 = 16 GPUs = 1 DGX2 node # Candidate cellTypes: # DGX2-NODE. E: taskNumber: 2 - leafCellType: DGX2-V100 - leafCellNumber: 16 - affinityGroupName: null # null, empty or not specified -> no affinityGroup + resourcePerInstance: + skuNum: 16 + skuType: DGX2-V100 + gang: false - # All tasks in role F should be within the same cell named PCN-F. + # All tasks in role F should be within the same podGroup, referred by JOBX/GROUP-F. # - # Total request of PCN-F: + # Total request of JOBX/GROUP-F: # pinnedCellId: VC1-YQW-IB-DGX2 - # leafCellNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs + # cellNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs # Candidate physicalCells: # VC1-YQW-IB-DGX2. F: taskNumber: 2 pinnedCellId: VC1-YQW-IB-DGX2 - leafCellNumber: 3 - affinityGroupName: PCN-F + resourcePerInstance: + skuNum: 3 + skuType: null + + # All tasks in role G, H should be within the same podGroup, referred by JOBX/GROUP-GH. + # All tasks in role G should be within one DGX2 node, all tasks in role H should also be within one DGX2 node, + # role G and role H should be within one DGX2 rack at the same time. + # Cell types requested from role G, H can be different from leaf cell type DGX2-V100, + # role G requests 2 8-GPU cells while role H requests 4 4-GPU cells. 
+ # + # Total request of JOBX/GROUP-GH: + # cellType: DGX2-V100-NODE-8-GPU and DGX2-V100-NODE-4-GPU + # cellNumber: 2 * 8 + 4 * 4 = 32 GPUs = 2 DGX2 nodes + # Candidate cellTypes: + # 4-DGX2-NODE, 4-DIRECT-DGX2-NODE + G: + taskNumber: 2 + resourcePerInstance: + skuNum: 1 + skuType: DGX2-V100-NODE-8-GPU # 8-GPU inside one board in DGX2 + within: DGX2-NODE # 2x 8-GPU boards should be within one DGX2 node + H: + taskNumber: 4 + resourcePerInstance: + skuNum: 1 + skuType: DGX2-V100-NODE-4-GPU # 4-GPU inside one socket in DGX2 + within: DGX2-NODE # 4x 4-GPU socket should be within one DGX2 node +taskRoleGroups: + - taskRoles: + - G + - H + withinOne: 2-DGX2-V100-NODE # role G and role H should be within one 2-node DGX2 rack --- ################################################################################ @@ -94,10 +131,10 @@ taskRoles: # things in advance. # For example: # Pod Spec cpu, memory. -# 1. Given leafCellType or pinnedCellId, just pick the corresponding cpu, memory unit. -# 2. No leafCellType or pinnedCellId is given, choose the minimal cpu, memory unit. +# 1. Given cellType or pinnedCellId, just pick the corresponding cpu, memory unit. +# 2. No cellType or pinnedCellId is given, choose the minimal cpu, memory unit. physicalCluster: - leafCellTypes: + cellTypes: # Check resource value format in # k8s.io/apimachinery/pkg/api/resource/quantity.go @@ -141,22 +178,43 @@ spec: pod: metadata: annotations: - # Format of affinityGroup.name : - # {jobName}/{affinityGroupName} + # Format of podRootGroup.name : + # {jobName}/group_{podRootGroupIndex} hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: DGX2-V100 - leafCellNumber: 16 - affinityGroup: - name: JOBX/PCN-ABC - members: - - podNumber: 1 - leafCellNumber: 16 - - podNumber: 3 - leafCellNumber: 8 - - podNumber: 1 - leafCellNumber: 4 + cellType: DGX2-V100 + cellNumber: 16 + podRootGroup: + name: JOBX/GROUP-ABC + withinOneCell: null + pod: null + childGroups: + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: null + - pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 8 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 4 + containsCurrentPod: false + childGroups: null spec: # See ../../run/deploy.yaml for why and how to specify the schedulerName. 
schedulerName: hivedscheduler @@ -198,19 +256,40 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: DGX2-V100 - leafCellNumber: 8 - affinityGroup: - name: JOBX/PCN-ABC - members: - - podNumber: 1 - leafCellNumber: 16 - - podNumber: 3 - leafCellNumber: 8 - - podNumber: 1 - leafCellNumber: 4 + cellType: DGX2-V100 + cellNumber: 8 + podRootGroup: + name: JOBX/GROUP-ABC + withinOneCell: null + pod: null + childGroups: + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 8 + containsCurrentPod: true + childGroups: null + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 4 + containsCurrentPod: false + childGroups: null spec: schedulerName: hivedscheduler priority: 1000 @@ -232,19 +311,40 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: DGX2-V100 - leafCellNumber: 4 - affinityGroup: - name: JOBX/PCN-ABC - members: - - podNumber: 1 - leafCellNumber: 16 - - podNumber: 3 - leafCellNumber: 8 - - podNumber: 1 - leafCellNumber: 4 + cellType: DGX2-V100 + cellNumber: 4 + podRootGroup: + name: JOBX/GROUP-ABC + withinOneCell: null + pod: null + childGroups: + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 8 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 4 + containsCurrentPod: true + childGroups: null spec: schedulerName: hivedscheduler priority: 1000 @@ -266,15 +366,22 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: null - leafCellNumber: 3 - affinityGroup: - name: JOBX/PCN-D - members: - - podNumber: 2 - leafCellNumber: 3 + cellType: null + cellNumber: 3 + podRootGroup: + name: JOBX/GROUP-D + withinOneCell: null + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: null + cellNumber: 3 + containsCurrentPod: true + childGroups: null spec: schedulerName: hivedscheduler priority: 1000 @@ -298,11 +405,12 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: DGX2-V100 - leafCellNumber: 16 - affinityGroup: null + cellType: DGX2-V100 + cellNumber: 1 + podRootGroup: null spec: schedulerName: hivedscheduler priority: 1000 @@ -324,15 +432,22 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 pinnedCellId: VC1-YQW-IB-DGX2 - leafCellNumber: 3 - affinityGroup: - name: JOBX/PCN-F - members: - - podNumber: 2 - leafCellNumber: 3 + cellNumber: 3 + podRootGroup: + name: JOBX/GROUP-F + withinOneCell: null + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: null + cellNumber: 3 + containsCurrentPod: true + childGroups: null spec: schedulerName: hivedscheduler priority: 1000 @@ -347,6 +462,100 @@ spec: valueFrom: fieldRef: fieldPath: 
metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] + - name: G + taskNumber: 2 + task: + pod: + metadata: + annotations: + hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 + virtualCluster: VC1 + priority: 1000 + cellType: DGX2-V100-NODE-8-GPU + cellNumber: 1 + podRootGroup: + name: JOBX/GROUP-GH + withinOneCell: 2-DGX2-V100-NODE + pod: null + childGroups: + - pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100-NODE-8-GPU + cellNumber: 1 + containsCurrentPod: true + childGroups: null + - pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: DGX2-V100-NODE-4-GPU + cellNumber: 1 + containsCurrentPod: false + childGroups: null + spec: + schedulerName: hivedscheduler + priority: 1000 + containers: + - resources: + limits: + hivedscheduler.microsoft.com/pod-scheduling-enable: 1 + cpu: 3 * 8 + memory: 96Gi * 8 + env: + - name: NVIDIA_VISIBLE_DEVICES + valueFrom: + fieldRef: + fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] + - name: H + taskNumber: 4 + task: + pod: + metadata: + annotations: + hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 + virtualCluster: VC1 + priority: 1000 + cellType: DGX2-V100-NODE-4-GPU + cellNumber: 1 + podRootGroup: + name: JOBX/GROUP-GH + withinOneCell: 2-DGX2-V100-NODE + pod: null + childGroups: + - pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100-NODE-8-GPU + cellNumber: 1 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: DGX2-V100-NODE-4-GPU + cellNumber: 1 + containsCurrentPod: true + childGroups: null + spec: + schedulerName: hivedscheduler + priority: 1000 + containers: + - resources: + limits: + hivedscheduler.microsoft.com/pod-scheduling-enable: 1 + cpu: 3 * 4 + memory: 96Gi * 4 + env: + - name: NVIDIA_VISIBLE_DEVICES + valueFrom: + fieldRef: + fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] --- ################################################################################ @@ -358,19 +567,40 @@ metadata: name: JOBX-A-0 annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: DGX2-V100 - leafCellNumber: 16 - affinityGroup: - name: JOBX/PCN-ABC - members: - - podNumber: 1 - leafCellNumber: 16 - - podNumber: 3 - leafCellNumber: 8 - - podNumber: 1 - leafCellNumber: 4 + cellType: DGX2-V100 + cellNumber: 16 + podRootGroup: + name: JOBX/GROUP-ABC + withinOneCell: null + pod: null + childGroups: + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: null + - pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 8 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 4 + containsCurrentPod: false + childGroups: null spec: schedulerName: hivedscheduler priority: 1000 @@ -393,19 +623,40 @@ metadata: name: JOBX-B-0 annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: DGX2-V100 - leafCellNumber: 8 - affinityGroup: - name: JOBX/PCN-ABC - members: - - podNumber: 1 - leafCellNumber: 16 - - podNumber: 3 - leafCellNumber: 8 - - podNumber: 1 - leafCellNumber: 4 + cellType: DGX2-V100 + cellNumber: 8 + podRootGroup: + name: JOBX/GROUP-ABC + 
withinOneCell: null + pod: null + childGroups: + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 8 + containsCurrentPod: true + childGroups: null + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 4 + containsCurrentPod: false + childGroups: null spec: schedulerName: hivedscheduler priority: 1000 @@ -427,19 +678,40 @@ metadata: name: JOBX-C-0 annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: DGX2-V100 - leafCellNumber: 4 - affinityGroup: - name: JOBX/PCN-ABC - members: - - podNumber: 1 - leafCellNumber: 16 - - podNumber: 3 - leafCellNumber: 8 - - podNumber: 1 - leafCellNumber: 4 + cellType: DGX2-V100 + cellNumber: 4 + podRootGroup: + name: JOBX/GROUP-ABC + withinOneCell: null + pod: null + childGroups: + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 8 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 4 + containsCurrentPod: true + childGroups: null spec: schedulerName: hivedscheduler priority: 1000 @@ -462,15 +734,22 @@ metadata: name: JOBX-D-0 annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: null - leafCellNumber: 3 - affinityGroup: - name: JOBX/PCN-D - members: - - podNumber: 2 - leafCellNumber: 3 + cellType: null + cellNumber: 3 + podRootGroup: + name: JOBX/GROUP-D + withinOneCell: null + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: null + cellNumber: 3 + containsCurrentPod: true + childGroups: null spec: schedulerName: hivedscheduler priority: 1000 @@ -493,11 +772,12 @@ metadata: name: JOBX-E-0 annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 - leafCellType: DGX2-V100 - leafCellNumber: 16 - affinityGroup: null + cellType: DGX2-V100 + cellNumber: 1 + podRootGroup: null spec: schedulerName: hivedscheduler priority: 1000 @@ -520,15 +800,22 @@ metadata: name: JOBX-F-0 annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC1 priority: 1000 pinnedCellId: VC1-YQW-IB-DGX2 - leafCellNumber: 3 - affinityGroup: - name: JOBX/PCN-F - members: - - podNumber: 2 - leafCellNumber: 3 + cellNumber: 3 + podRootGroup: + name: JOBX/GROUP-F + withinOneCell: null + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: null + cellNumber: 3 + containsCurrentPod: true + childGroups: null spec: schedulerName: hivedscheduler priority: 1000 @@ -543,3 +830,99 @@ spec: valueFrom: fieldRef: fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] +--- +apiVersion: v1 +kind: Pod +metadata: + # JOBX-G-1 is the same + name: JOBX-G-0 + annotations: + hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 + virtualCluster: VC1 + priority: 1000 + cellType: DGX2-V100-NODE-8-GPU + cellNumber: 1 + podRootGroup: + name: JOBX/GROUP-GH + withinOneCell: 2-DGX2-V100-NODE + pod: null + childGroups: + - pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: 
DGX2-V100-NODE-8-GPU + cellNumber: 1 + containsCurrentPod: true + childGroups: null + - pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: DGX2-V100-NODE-4-GPU + cellNumber: 1 + containsCurrentPod: false + childGroups: null +spec: + schedulerName: hivedscheduler + priority: 1000 + containers: + - resources: + limits: + hivedscheduler.microsoft.com/pod-scheduling-enable: 1 + cpu: 3 * 8 + memory: 96Gi * 8 + env: + - name: NVIDIA_VISIBLE_DEVICES + valueFrom: + fieldRef: + fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] +--- +apiVersion: v1 +kind: Pod +metadata: + # JOBX-H-1, JOBX-H-2, JOBX-H-3 are the same + name: JOBX-H-0 + annotations: + hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 + virtualCluster: VC1 + priority: 1000 + cellType: DGX2-V100-NODE-4-GPU + cellNumber: 1 + podRootGroup: + name: JOBX/GROUP-GH + withinOneCell: 2-DGX2-V100-NODE + pod: null + childGroups: + - pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100-NODE-8-GPU + cellNumber: 1 + containsCurrentPod: false + childGroups: null + - pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: DGX2-V100-NODE-4-GPU + cellNumber: 1 + containsCurrentPod: true + childGroups: null +spec: + schedulerName: hivedscheduler + priority: 1000 + containers: + - resources: + limits: + hivedscheduler.microsoft.com/pod-scheduling-enable: 1 + cpu: 3 * 4 + memory: 96Gi * 4 + env: + - name: NVIDIA_VISIBLE_DEVICES + valueFrom: + fieldRef: + fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation'] diff --git a/example/request/tf/request.yaml b/example/request/tf/request.yaml index 9890a39..a76060a 100644 --- a/example/request/tf/request.yaml +++ b/example/request/tf/request.yaml @@ -24,11 +24,12 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC2 priority: 1000 - leafCellType: K80 - leafCellNumber: 1 - affinityGroup: null + cellType: K80 + cellNumber: 1 + podRootGroup: null spec: schedulerName: hivedscheduler restartPolicy: Never @@ -93,11 +94,12 @@ spec: metadata: annotations: hivedscheduler.microsoft.com/pod-scheduling-spec: |- + version: v2 virtualCluster: VC2 priority: 1000 - leafCellType: K80 - leafCellNumber: 1 - affinityGroup: null + cellType: K80 + cellNumber: 1 + podRootGroup: null spec: schedulerName: hivedscheduler restartPolicy: Never diff --git a/pkg/algorithm/cell.go b/pkg/algorithm/cell.go index 7b5647c..d93b07c 100644 --- a/pkg/algorithm/cell.go +++ b/pkg/algorithm/cell.go @@ -129,13 +129,13 @@ func (c *GenericCell) IncreaseUsedLeafCellNumAtPriority(p CellPriority, delta in // PhysicalCell defines a cell in the physical cluster. 
type PhysicalCell struct { GenericCell - nodes []string // node names inside the cell - leafCellIndices []int32 // [-1] for cells at levels higher than node - usingGroup *AlgoAffinityGroup // affinity group using this cell - reservingOrReservedGroup *AlgoAffinityGroup // affinity group that is reserving, or has reserved the cell (e.g., waiting for preemption) - virtualCell *VirtualCell // points to the bound virtual cell - split bool // true when the cell has been split - pinned bool // true when this is a pinned cell + nodes []string // node names inside the cell + leafCellIndices []int32 // [-1] for cells at levels higher than node + usingGroup *PodGroupSchedulingStatus // pod group using this cell + reservingOrReservedGroup *PodGroupSchedulingStatus // pod group that is reserving, or has reserved the cell (e.g., waiting for preemption) + virtualCell *VirtualCell // points to the bound virtual cell + split bool // true when the cell has been split + pinned bool // true when this is a pinned cell // This status only contains the statuses that need to be exposed to external, // and should not be used for internal status management apiStatus *api.PhysicalCellStatus @@ -216,46 +216,46 @@ func (c *PhysicalCell) SetPhysicalResources(nodes []string, leafCellIndices []in c.leafCellIndices = leafCellIndices } -func (c *PhysicalCell) AddUsingGroup(g *AlgoAffinityGroup) { +func (c *PhysicalCell) AddUsingGroup(g *PodGroupSchedulingStatus) { if c.usingGroup != nil { - klog.Errorf("Found another using affinity group %v when adding "+ - "using affinity group %v to cell %v", c.usingGroup.name, g.name, c.address) + klog.Errorf("Found another using pod group %v when adding "+ + "using pod group %v to cell %v", c.usingGroup.name, g.name, c.address) } c.usingGroup = g - klog.Infof("Cell %v is now used by affinity group %v", c.address, g.name) + klog.Infof("Cell %v is now used by pod group %v", c.address, g.name) } -func (c *PhysicalCell) DeleteUsingGroup(g *AlgoAffinityGroup) { +func (c *PhysicalCell) DeleteUsingGroup(g *PodGroupSchedulingStatus) { if c.usingGroup == nil || c.usingGroup.name != g.name { - klog.Errorf("Using affinity group %v not found when deleting it from cell %v", g.name, c.address) + klog.Errorf("Using pod group %v not found when deleting it from cell %v", g.name, c.address) } c.usingGroup = nil - klog.Infof("Cell %v is no longer used by affinity group %v", c.address, g.name) + klog.Infof("Cell %v is no longer used by pod group %v", c.address, g.name) } -func (c *PhysicalCell) GetUsingGroup() *AlgoAffinityGroup { +func (c *PhysicalCell) GetUsingGroup() *PodGroupSchedulingStatus { return c.usingGroup } -func (c *PhysicalCell) AddReservingOrReservedGroup(g *AlgoAffinityGroup) { +func (c *PhysicalCell) AddReservingOrReservedGroup(g *PodGroupSchedulingStatus) { if c.reservingOrReservedGroup != nil { - klog.Errorf("Found another reserving or reserved affinity group %v when adding "+ - "reserving or reserved affinity group %v to cell %v", c.reservingOrReservedGroup.name, g.name, c.address) + klog.Errorf("Found another reserving or reserved pod group %v when adding "+ + "reserving or reserved pod group %v to cell %v", c.reservingOrReservedGroup.name, g.name, c.address) } c.reservingOrReservedGroup = g - klog.Infof("Cell %v is now reserved (or being reserved) by affinity group %v", c.address, g.name) + klog.Infof("Cell %v is now reserved (or being reserved) by pod group %v", c.address, g.name) } -func (c *PhysicalCell) DeleteReservingOrReservedGroup(g *AlgoAffinityGroup) { +func (c 
*PhysicalCell) DeleteReservingOrReservedGroup(g *PodGroupSchedulingStatus) { if c.reservingOrReservedGroup == nil || c.reservingOrReservedGroup.name != g.name { - klog.Errorf("Reserving or reserved affinity group %v not found when deleting it from cell %v", + klog.Errorf("Reserving or reserved pod group %v not found when deleting it from cell %v", g.name, c.address) } c.reservingOrReservedGroup = nil - klog.Infof("Cell %v is no longer reserved by affinity group %v", c.address, g.name) + klog.Infof("Cell %v is no longer reserved by pod group %v", c.address, g.name) } -func (c *PhysicalCell) GetReservingOrReservedGroup() *AlgoAffinityGroup { +func (c *PhysicalCell) GetReservingOrReservedGroup() *PodGroupSchedulingStatus { return c.reservingOrReservedGroup } diff --git a/pkg/algorithm/config.go b/pkg/algorithm/config.go index ba62f82..0714d1c 100644 --- a/pkg/algorithm/config.go +++ b/pkg/algorithm/config.go @@ -417,10 +417,14 @@ func parseCellChainInfo( chains []CellChain) ( map[CellChain]map[CellLevel]int32, map[CellChain]map[CellLevel]api.CellType, + map[CellChain]map[api.CellType]CellLevel, + map[string][]CellChain, map[string][]CellChain) { cellLevelToLeafCellNum := map[CellChain]map[CellLevel]int32{} cellLevelToType := map[CellChain]map[CellLevel]api.CellType{} + cellTypeToLevel := map[CellChain]map[api.CellType]CellLevel{} + cellTypeToChain := map[string][]CellChain{} leafCellTypeToChain := map[string][]CellChain{} for _, chain := range chains { ce := cellChainElements[api.CellType(chain)] @@ -428,15 +432,19 @@ func parseCellChainInfo( cellLevelToLeafCellNum[chain] = map[CellLevel]int32{} cellLevelToType[chain] = map[CellLevel]api.CellType{} + cellTypeToLevel[chain] = map[api.CellType]CellLevel{} ce, ok := cellChainElements[api.CellType(chain)] for ok { cellLevelToLeafCellNum[chain][ce.level] = ce.leafCellNumber cellLevelToType[chain][ce.level] = ce.cellType + cellTypeToLevel[chain][ce.cellType] = ce.level + if !ce.isMultiNodes { + cellTypeToChain[string(ce.cellType)] = append(cellTypeToChain[string(ce.cellType)], chain) + } ce, ok = cellChainElements[ce.childCellType] } } - return cellLevelToLeafCellNum, cellLevelToType, leafCellTypeToChain - + return cellLevelToLeafCellNum, cellLevelToType, cellTypeToLevel, cellTypeToChain, leafCellTypeToChain } func ParseConfig(sConfig *api.Config) ( @@ -449,7 +457,9 @@ func ParseConfig(sConfig *api.Config) ( physicalPinnedCells map[api.VirtualClusterName]map[api.PinnedCellId]*PhysicalCell, // vc:pinnedCellId:PhysicalCell cellLevelToLeafCellNum map[CellChain]map[CellLevel]int32, // chain:level:leafCellNumber leafCellTypeToChain map[string][]CellChain, // leafCellType:[]chain + cellTypeToChain map[string][]CellChain, // cellType:[]chain cellLevelToType map[CellChain]map[CellLevel]api.CellType, // chain:level:cellType + cellTypeToLevel map[CellChain]map[api.CellType]CellLevel, // chain:cellType:level ) { cellTypes := sConfig.PhysicalCluster.CellTypes @@ -471,7 +481,8 @@ func ParseConfig(sConfig *api.Config) ( for k := range physicalFullList { cellChains = append(cellChains, k) } - cellLevelToLeafCellNum, cellLevelToType, leafCellTypeToChain = parseCellChainInfo(cellChainElements, cellChains) + cellLevelToLeafCellNum, cellLevelToType, cellTypeToLevel, cellTypeToChain, leafCellTypeToChain = + parseCellChainInfo(cellChainElements, cellChains) return } diff --git a/pkg/algorithm/constants.go b/pkg/algorithm/constants.go index 9479bbc..451f2bb 100644 --- a/pkg/algorithm/constants.go +++ b/pkg/algorithm/constants.go @@ -23,8 +23,9 @@ package algorithm 
import ( - "github.com/microsoft/hivedscheduler/pkg/api" "math" + + "github.com/microsoft/hivedscheduler/pkg/api" ) const ( @@ -57,15 +58,15 @@ const ( // will respect the group that reserved the cell, i.e., a group with a non-higher priority cannot get this cell. cellReserved CellState = "Reserved" - // internal affinity group states + // internal pod group states - // The affinity group has been allocated cells. + // The pod group has been allocated cells. // All cells in the group must be in Used state. - groupAllocated AffinityGroupState = "Allocated" - // The affinity group is preempting other groups to get free resource. + podGroupAllocated PodGroupState = "Allocated" + // The pod group is preempting other groups to get free resource. // Cells in the group must be in either Reserving or Reserved states. - groupPreempting AffinityGroupState = "Preempting" - // The affinity group is being preempted by some other groups. + podGroupPreempting PodGroupState = "Preempting" + // The pod group is being preempted by some other groups. // Cells in the group must be in either Used or Reserving states. - groupBeingPreempted AffinityGroupState = "BeingPreempted" + podGroupBeingPreempted PodGroupState = "BeingPreempted" ) diff --git a/pkg/algorithm/hived_algorithm.go b/pkg/algorithm/hived_algorithm.go index 8ad6b69..bfa0561 100644 --- a/pkg/algorithm/hived_algorithm.go +++ b/pkg/algorithm/hived_algorithm.go @@ -27,6 +27,7 @@ import ( "sync" "github.com/microsoft/hivedscheduler/pkg/api" + apiv2 "github.com/microsoft/hivedscheduler/pkg/api/v2" "github.com/microsoft/hivedscheduler/pkg/common" "github.com/microsoft/hivedscheduler/pkg/internal" core "k8s.io/api/core/v1" @@ -34,20 +35,20 @@ import ( "k8s.io/klog" ) -// HivedAlgorithm implements an internal.SchedulerAlgorithm. It schedules affinity groups using the algorithm of HiveD. -// Note that the topologyAwareScheduler used in this struct is not another implementation of SchedulerAlgorithm; +// HivedAlgorithm implements an internal.SchedulerAlgorithm. It schedules pod groups using the algorithm of HiveD. +// Note that the topologyGuaranteeScheduler used in this struct is not another implementation of SchedulerAlgorithm; // that is a specific algorithm for pod placement, used in intra-VC scheduling and opportunistic pod scheduling. type HivedAlgorithm struct { // scheduler in each VC vcSchedulers map[api.VirtualClusterName]intraVCScheduler // scheduler for opportunistic pods - opportunisticSchedulers map[CellChain]*topologyAwareScheduler + opportunisticSchedulers map[CellChain]*topologyGuaranteeScheduler // ChainCellLists of physical cells of each cell chain (including the children of the free cells) fullCellList map[CellChain]ChainCellList // ChainCellLists of free physical cells of each cell chain (used in buddy alloc) freeCellList map[CellChain]ChainCellList - // all affinity groups that have been allocated or are preempting other groups - affinityGroups map[string]*AlgoAffinityGroup + // all pod root groups that have been allocated or are preempting other groups + podGroups map[string]*PodGroupSchedulingStatus // vcFreeCellNum, allVCFreeCellNum, and totalLeftCellNum are used to track cell usage of the VCs. // Note that these numbers count both healthy and bad cells. @@ -63,7 +64,7 @@ type HivedAlgorithm struct { totalLeftCellNum map[CellChain]map[CellLevel]int32 // badFreeCells, vcDoomedBadCells, and allVCDoomedBadCellNum are used to track bad cells. 
- // Note that a cell is bad if ANY of its children is bad; so a cell may also contain healthy children. + // Note that a cell is bad if ANY of its children is bad; so a bad cell may also contain healthy children. // A preassigned cell in a VC is "doomed to be bad" when the healthy free cells in the physical cluster // is fewer than the VC's free cells (thus certain free cells in the VC will be inevitably bound @@ -82,22 +83,28 @@ type HivedAlgorithm struct { // number of doomed bad cells in all the VCs of each cell type allVCDoomedBadCellNum map[CellChain]map[CellLevel]int32 // Besides bad nodes, we also avoid using nodes not suggested by K8s (configured in - // ignoreK8sSuggestedNodes of an affinity group). + // ignoreK8sSuggestedNodes, which is true by default). // But note that we do NOT mark virtual cells as "doomed to be bound to non-suggested nodes" // (like what we do for "doomed to be bound to bad cells"), // because suggested nodes are not stable: they are provided by K8s during scheduling *each pod*, - // and may change across different pods. The consequence is that, even if ignoreK8sSuggestedNodes is false - // for an affinity group, the intra-VC scheduler may choose some placements that + // and may change across different pods. The consequence is that, even if ignoreK8sSuggestedNodes is false, + // the intra-VC scheduler may choose some placements that + // cannot be mapped to a physical placement fully within the suggested nodes. // TODO: introduce randomization in intra-VC scheduling to avoid always choosing the same placement // that cannot be mapped to suggested nodes // bad nodes in the physical cluster badNodes common.Set + // map each level in a chain to the leaf cell number + leafCellNums map[CellChain]map[CellLevel]int32 // map each leaf cell type to all chains that contain this type + leafCellChains map[string][]CellChain + // map each within-node cell type to all chains that contain this type cellChains map[string][]CellChain - // map each level in a chain to the specific cell type name + // map each level in a chain to the specific cell type cellTypes map[CellChain]map[CellLevel]api.CellType + // map each cell type in a chain to the specific cell level + cellLevels map[CellChain]map[api.CellType]CellLevel // cluster status exposed to external apiClusterStatus api.ClusterStatus // lock @@ -107,11 +114,11 @@ type HivedAlgorithm struct { // NewHivedAlgorithm initializes a HivedAlgorithm from the config file.
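// Editor's note (a hedged sketch, not part of this patch): the new lookup maps cached on
// HivedAlgorithm above are all produced by parseCellChainInfo (see the config.go hunk earlier
// in this patch) and are wired in by the NewHivedAlgorithm constructor below. For a
// hypothetical single chain "V100-NODE" whose hierarchy is level 1 = "V100" (the leaf cell)
// and level 2 = "V100-NODE" (a node holding 8 leaf cells), they would look roughly like the
// literals below; the cell-type names are illustrative, not taken from any real configuration.
var (
	exampleLeafCellNums   = map[CellChain]map[CellLevel]int32{"V100-NODE": {1: 1, 2: 8}}                       // level -> leaf cells per cell
	exampleCellTypes      = map[CellChain]map[CellLevel]api.CellType{"V100-NODE": {1: "V100", 2: "V100-NODE"}} // level -> cell type
	exampleCellLevels     = map[CellChain]map[api.CellType]CellLevel{"V100-NODE": {"V100": 1, "V100-NODE": 2}} // cell type -> level
	exampleLeafCellChains = map[string][]CellChain{"V100": {"V100-NODE"}}                                      // leaf cell type -> chains
	exampleCellChains     = map[string][]CellChain{"V100": {"V100-NODE"}, "V100-NODE": {"V100-NODE"}}          // within-node cell type -> chains
)
// With maps like exampleCellLevels and exampleCellChains, a pod that requests a cell type can
// be resolved to a concrete chain and level without re-walking the cell-type hierarchy.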
func NewHivedAlgorithm(sConfig *api.Config) *HivedAlgorithm { fullPcl, freePcl, vcFreeCellNum, nonPinnedFullVcl, nonPinnedFreeVcl, pinnedVcl, pinnedPcl, - leafCellNums, chains, cellTypes := ParseConfig(sConfig) + leafCellNums, leafCellChains, cellChains, cellTypes, cellLevels := ParseConfig(sConfig) h := &HivedAlgorithm{ vcSchedulers: map[api.VirtualClusterName]intraVCScheduler{}, - opportunisticSchedulers: map[CellChain]*topologyAwareScheduler{}, + opportunisticSchedulers: map[CellChain]*topologyGuaranteeScheduler{}, fullCellList: fullPcl, freeCellList: freePcl, vcFreeCellNum: vcFreeCellNum, @@ -121,9 +128,12 @@ func NewHivedAlgorithm(sConfig *api.Config) *HivedAlgorithm { vcDoomedBadCells: map[api.VirtualClusterName]map[CellChain]ChainCellList{}, allVCDoomedBadCellNum: map[CellChain]map[CellLevel]int32{}, badNodes: common.NewSet(), - cellChains: chains, + leafCellNums: leafCellNums, + leafCellChains: leafCellChains, + cellChains: cellChains, cellTypes: cellTypes, - affinityGroups: map[string]*AlgoAffinityGroup{}, + cellLevels: cellLevels, + podGroups: map[string]*PodGroupSchedulingStatus{}, apiClusterStatus: api.ClusterStatus{ PhysicalCluster: api.PhysicalClusterStatus{}, VirtualClusters: map[api.VirtualClusterName]api.VirtualClusterStatus{}, @@ -132,10 +142,10 @@ func NewHivedAlgorithm(sConfig *api.Config) *HivedAlgorithm { for vcName := range nonPinnedFullVcl { // TODO: Support per-VC configurable intra VC scheduling algo. h.vcSchedulers[vcName] = newDefaultIntraVCScheduler( - nonPinnedFullVcl[vcName], nonPinnedFreeVcl[vcName], pinnedVcl[vcName], leafCellNums) + nonPinnedFullVcl[vcName], nonPinnedFreeVcl[vcName], pinnedVcl[vcName], leafCellNums, cellLevels) } for chain, ccl := range h.fullCellList { - h.opportunisticSchedulers[chain] = NewTopologyAwareScheduler(ccl, leafCellNums[chain], false) + h.opportunisticSchedulers[chain] = NewTopologyGuaranteeScheduler(ccl, leafCellNums[chain], cellLevels[chain], false) } h.initCellNums() h.initAPIClusterStatus() @@ -186,39 +196,42 @@ func (h *HivedAlgorithm) Schedule( defer h.algorithmLock.Unlock() klog.Infof("[%v]: Scheduling pod in %v phase...", internal.Key(pod), phase) - s := internal.ExtractPodSchedulingSpec(pod) + podSchedSpec := internal.ExtractPodSchedulingSpec(pod) suggestedNodeSet := common.NewSet() for _, n := range suggestedNodes { suggestedNodeSet.Add(n) } var ( - groupPhysicalPlacement groupPhysicalPlacement // leaf cell number -> a set of pods -> a set of leaf cells of each pod - groupVirtualPlacement groupVirtualPlacement // leaf cell number -> a set of pods -> a set of leaf cells of each pod - preemptionVictims map[string]common.Set // node -> pods - waitReason string - podIndex int32 // index of current pod among those of the same leaf cell number in the group, 0 by default + physicalPlacement PodGroupPhysicalPlacement + virtualPlacement PodGroupVirtualPlacement + preemptionVictims map[string]common.Set // node -> pods + waitReason string + podGroupIndex int32 // index of child pod group for current pod + podIndex int32 // index of current pod in its pod group, 0 by default ) - if g := h.affinityGroups[s.AffinityGroup.Name]; g != nil { - groupPhysicalPlacement, groupVirtualPlacement, preemptionVictims, podIndex = - h.schedulePodFromExistingGroup(g, s, suggestedNodeSet, phase, pod) + podGroupSchedStatus := h.podGroups[podSchedSpec.PodRootGroup.Name] + if podGroupSchedStatus != nil { + physicalPlacement, virtualPlacement, preemptionVictims, podGroupIndex, podIndex = + h.schedulePodFromExistingGroup(podGroupSchedStatus, 
podSchedSpec, suggestedNodeSet, phase, pod) } // we need to re-evaluate the existence of the group here (instead of an "else") because it is // possible that the group was a preempting group and deleted in h.schedulePodFromExistingGroup - if h.affinityGroups[s.AffinityGroup.Name] == nil { - groupPhysicalPlacement, groupVirtualPlacement, preemptionVictims, waitReason = - h.schedulePodFromNewGroup(s, suggestedNodeSet, phase, pod) + if podGroupSchedStatus == nil { + physicalPlacement, virtualPlacement, preemptionVictims, waitReason = + h.schedulePodFromNewGroup(podSchedSpec, suggestedNodeSet, phase, pod) + podGroupIndex, _ = podSchedSpec.GetCurrentPod() } return generatePodScheduleResult( - groupPhysicalPlacement, - groupVirtualPlacement, + physicalPlacement, + virtualPlacement, preemptionVictims, waitReason, h.cellTypes, - s.LeafCellNumber, + podSchedSpec.CellNumber, + podGroupIndex, podIndex, - h.affinityGroups[s.AffinityGroup.Name], - s.AffinityGroup.Name, + h.podGroups[podSchedSpec.PodRootGroup.Name], suggestedNodeSet, pod) } @@ -230,16 +243,17 @@ func (h *HivedAlgorithm) DeleteUnallocatedPod(pod *core.Pod) { h.algorithmLock.Lock() defer h.algorithmLock.Unlock() - s := internal.ExtractPodSchedulingSpec(pod) - if g := h.affinityGroups[s.AffinityGroup.Name]; g != nil && g.state == groupPreempting { - if g.preemptingPods[pod.UID] != nil { - klog.Infof("[%v]: Deleting preempting pod from affinity group %v...", internal.Key(pod), g.name) - delete(g.preemptingPods, pod.UID) + podSchedSpec := internal.ExtractPodSchedulingSpec(pod) + podGroupSchedStatus := h.podGroups[podSchedSpec.PodRootGroup.Name] + if podGroupSchedStatus != nil && podGroupSchedStatus.state == podGroupPreempting { + if podGroupSchedStatus.preemptingPods[pod.UID] != nil { + klog.Infof("[%v]: Deleting preempting pod from pod group %v...", internal.Key(pod), podGroupSchedStatus.name) + delete(podGroupSchedStatus.preemptingPods, pod.UID) } - if len(g.preemptingPods) == 0 { - klog.Infof("[%v]: Canceling affinity group %v's preemption because its pods are all deleted", - internal.Key(pod), g.name) - h.deletePreemptingAffinityGroup(g, pod) + if len(podGroupSchedStatus.preemptingPods) == 0 { + klog.Infof("[%v]: Canceling pod group %v's preemption because its pods are all deleted", + internal.Key(pod), podGroupSchedStatus.name) + h.deletePreemptingPodGroup(podGroupSchedStatus, pod) } } } @@ -248,75 +262,79 @@ func (h *HivedAlgorithm) AddAllocatedPod(pod *core.Pod) { h.algorithmLock.Lock() defer h.algorithmLock.Unlock() - s := internal.ExtractPodSchedulingSpec(pod) + podSchedSpec := internal.ExtractPodSchedulingSpec(pod) info := internal.ExtractPodBindInfo(pod) - klog.Infof("[%v]: Adding allocated pod to affinity group %v...", internal.Key(pod), s.AffinityGroup.Name) + klog.Infof("[%v]: Adding allocated pod to pod group %v...", internal.Key(pod), podSchedSpec.PodRootGroup.Name) klog.Infof("[%v]: Adding to node %v, leaf cells %v", internal.Key(pod), info.Node, common.ToJson(info.LeafCellIsolation)) + podGroupIndex, _ := podSchedSpec.GetCurrentPod() podIndex := int32(0) - if g := h.affinityGroups[s.AffinityGroup.Name]; g != nil { - if g.state == groupPreempting { - h.allocatePreemptingAffinityGroup(g, pod) + podGroupSchedStatus := h.podGroups[podSchedSpec.PodRootGroup.Name] + if podGroupSchedStatus != nil { + if podGroupSchedStatus.state == podGroupPreempting { + h.allocatePreemptingPodGroup(podGroupSchedStatus, pod) } - if podIndex = getAllocatedPodIndex(info, s.LeafCellNumber); podIndex == -1 { + if podIndex = getAllocatedPodIndex(info, 
podGroupIndex); podIndex == -1 { klog.Errorf("[%v]: Pod placement not found in group %v: node %v, leaf cells %v", - internal.Key(pod), s.AffinityGroup.Name, info.Node, info.LeafCellIsolation) + internal.Key(pod), podSchedSpec.PodRootGroup.Name, info.Node, info.LeafCellIsolation) return } } else { - h.createAllocatedAffinityGroup(s, info, pod) + h.createAllocatedPodGroup(podSchedSpec, info, pod) } - h.affinityGroups[s.AffinityGroup.Name].allocatedPods[s.LeafCellNumber][podIndex] = pod + h.podGroups[podSchedSpec.PodRootGroup.Name].allocatedPodGroup.SetPod(pod, podGroupIndex, podIndex) } func (h *HivedAlgorithm) DeleteAllocatedPod(pod *core.Pod) { h.algorithmLock.Lock() defer h.algorithmLock.Unlock() - s := internal.ExtractPodSchedulingSpec(pod) + podSchedSpec := internal.ExtractPodSchedulingSpec(pod) info := internal.ExtractPodBindInfo(pod) - klog.Infof("[%v]: Deleting allocated pod from affinity group %v...", internal.Key(pod), s.AffinityGroup.Name) + klog.Infof("[%v]: Deleting allocated pod from pod group %v...", internal.Key(pod), podSchedSpec.PodRootGroup.Name) klog.Infof("[%v]: Deleting from node %v, leaf cells %v", internal.Key(pod), info.Node, common.ToJson(info.LeafCellIsolation)) - if g := h.affinityGroups[s.AffinityGroup.Name]; g == nil { - klog.Errorf("[%v]: Group %v not found when deleting pod", internal.Key(pod), s.AffinityGroup.Name) + podGroupSchedStatus := h.podGroups[podSchedSpec.PodRootGroup.Name] + if podGroupSchedStatus == nil { + klog.Errorf("[%v]: Group %v not found when deleting pod", internal.Key(pod), podSchedSpec.PodRootGroup.Name) return } else { - if podIndex := getAllocatedPodIndex(info, s.LeafCellNumber); podIndex == -1 { + podGroupIndex, _ := podSchedSpec.GetCurrentPod() + if podIndex := getAllocatedPodIndex(info, podGroupIndex); podIndex == -1 { klog.Errorf("[%v]: Pod placement not found in group %v: node %v, leaf cells %v", - internal.Key(pod), s.AffinityGroup.Name, info.Node, info.LeafCellIsolation) + internal.Key(pod), podSchedSpec.PodRootGroup.Name, info.Node, info.LeafCellIsolation) return } else { - g.allocatedPods[s.LeafCellNumber][podIndex] = nil + podGroupSchedStatus.allocatedPodGroup.SetPod(nil, podGroupIndex, podIndex) } - if allPodsReleased(g.allocatedPods) { - h.deleteAllocatedAffinityGroup(g, pod) + if allPodsReleased(podGroupSchedStatus.allocatedPodGroup) { + h.deleteAllocatedPodGroup(podGroupSchedStatus, pod) } } } -func (h *HivedAlgorithm) GetAllAffinityGroups() api.AffinityGroupList { +func (h *HivedAlgorithm) GetAllPodGroups() apiv2.PodGroupList { h.algorithmLock.RLock() defer h.algorithmLock.RUnlock() - ags := api.AffinityGroupList{} - for _, aag := range h.affinityGroups { - ags.Items = append(ags.Items, aag.ToAffinityGroup()) + podGroupList := apiv2.PodGroupList{} + for _, podGroup := range h.podGroups { + podGroupList.Items = append(podGroupList.Items, podGroup.DumpPodGroup()) } - return ags + return podGroupList } -func (h *HivedAlgorithm) GetAffinityGroup(name string) api.AffinityGroup { +func (h *HivedAlgorithm) GetPodGroup(name string) apiv2.PodGroup { h.algorithmLock.RLock() defer h.algorithmLock.RUnlock() - if aag := h.affinityGroups[name]; aag != nil { - return aag.ToAffinityGroup() + if podGroup := h.podGroups[name]; podGroup != nil { + return podGroup.DumpPodGroup() } panic(internal.NewBadRequestError(fmt.Sprintf( - "Affinity group %v does not exist since it is not allocated or preempting", + "Pod group %v does not exist since it is not allocated or preempting", name))) } @@ -652,216 +670,212 @@ func (h *HivedAlgorithm) 
tryUnbindDoomedBadCell(c CellChain, l CellLevel) { } } -// schedulePodFromExistingGroup schedules a pod from an allocated or preempting affinity group. +// schedulePodFromExistingGroup schedules a pod from an allocated or preempting pod group. // If it is from an allocated group, we will schedule the pod to the corresponding placement. // If it is from a preempting group, we will continue its preemption, or schedule it when the preemption is done. func (h *HivedAlgorithm) schedulePodFromExistingGroup( - g *AlgoAffinityGroup, - s *api.PodSchedulingSpec, + podGroupSchedStatus *PodGroupSchedulingStatus, + podSchedSpec *apiv2.PodSchedulingSpec, suggestedNodes common.Set, phase internal.SchedulingPhase, pod *core.Pod) ( - groupPhysicalPlacement groupPhysicalPlacement, - groupVirtualPlacement groupVirtualPlacement, + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, preemptionVictims map[string]common.Set, + podGroupIndex int32, podIndex int32) { badOrNonSuggestedNodes := collectBadOrNonSuggestedNodes( - g.physicalLeafCellPlacement, suggestedNodes, g.ignoreK8sSuggestedNodes) + podGroupSchedStatus.physicalPlacement, suggestedNodes, true) // state of an existing group can be either Allocated or Preempting - if g.state == groupAllocated { - klog.Infof("[%v]: Pod is from an affinity group that is already allocated: %v", - internal.Key(pod), s.AffinityGroup.Name) - groupPhysicalPlacement = g.physicalLeafCellPlacement - groupVirtualPlacement = g.virtualLeafCellPlacement + if podGroupSchedStatus.state == podGroupAllocated { + klog.Infof("[%v]: Pod is from a pod group that is already allocated: %v", + internal.Key(pod), podSchedSpec.PodRootGroup.Name) + physicalPlacement = podGroupSchedStatus.physicalPlacement + virtualPlacement = podGroupSchedStatus.virtualPlacement if !badOrNonSuggestedNodes.IsEmpty() { // for an allocated group, we always insist the previous scheduling decision // even if some pods are now bad or not within suggested nodes - klog.Warningf("[%v]: Some nodes allocated to affinity group %v are no longer "+ - "healthy and within K8s suggested nodes: %v", internal.Key(pod), g.name, badOrNonSuggestedNodes) + klog.Warningf("[%v]: Some nodes allocated to pod group %v are no longer "+ + "healthy and within K8s suggested nodes: %v", internal.Key(pod), podGroupSchedStatus.name, badOrNonSuggestedNodes) } - if podIndex = getNewPodIndex(g.allocatedPods[s.LeafCellNumber]); podIndex == -1 { + var currentPod apiv2.PodGroupMemberSpec + podGroupIndex, currentPod = podSchedSpec.GetCurrentPod() + if podIndex = getNewPodIndex(podGroupSchedStatus.allocatedPodGroup, podGroupIndex); podIndex == -1 { panic(internal.NewBadRequestError(fmt.Sprintf( - "Requesting more pods than the configured number for %v leaf cells (%v pods) in affinity group %v", - s.LeafCellNumber, g.totalPodNums[s.LeafCellNumber], s.AffinityGroup.Name))) + "Requesting more pods than the configured number for %v cells (%v pods) in pod group %v", + podSchedSpec.CellNumber, currentPod.PodMinNumber, podSchedSpec.PodRootGroup.Name))) } } else { // groupPreempting - klog.Infof("[%v]: Pod is from an affinity group that is preempting others: %v", - internal.Key(pod), s.AffinityGroup.Name) + klog.Infof("[%v]: Pod is from a pod group that is preempting others: %v", + internal.Key(pod), podSchedSpec.PodRootGroup.Name) if phase == internal.PreemptingPhase && !badOrNonSuggestedNodes.IsEmpty() { // If we find a preempting group's placement is not fully healthy and within suggested nodes, // we should cancel the 
preemption so as to reschedule it to other places. // We should do this only in Preempting phase // because only suggested nodes of this phase consider preemption. - klog.Infof("[%v]: Canceling affinity group %v's preemption because its placement is "+ + klog.Infof("[%v]: Canceling pod group %v's preemption because its placement is "+ "no longer fully healthy and within Preempting-phase suggested nodes: %v", - internal.Key(pod), g.name, badOrNonSuggestedNodes) - h.deletePreemptingAffinityGroup(g, pod) + internal.Key(pod), podGroupSchedStatus.name, badOrNonSuggestedNodes) + h.deletePreemptingPodGroup(podGroupSchedStatus, pod) } else { - groupPhysicalPlacement = g.physicalLeafCellPlacement - groupVirtualPlacement = g.virtualLeafCellPlacement - preemptionVictims, _ = collectPreemptionVictims(groupPhysicalPlacement) + physicalPlacement = podGroupSchedStatus.physicalPlacement + virtualPlacement = podGroupSchedStatus.virtualPlacement + preemptionVictims, _ = collectPreemptionVictims(physicalPlacement) if len(preemptionVictims) == 0 { klog.Infof( - "Preemption victims have been cleaned up for the preemptor affinity group %v", g.name) + "Preemption victims have been cleaned up for the preemptor pod group %v", podGroupSchedStatus.name) } - g.preemptingPods[pod.UID] = pod + podGroupSchedStatus.preemptingPods[pod.UID] = pod } } - return groupPhysicalPlacement, groupVirtualPlacement, preemptionVictims, podIndex + return physicalPlacement, virtualPlacement, preemptionVictims, podGroupIndex, podIndex } -// schedulePodFromNewGroup schedules a pod from a new affinity group, find placement for the group, +// schedulePodFromNewGroup schedules a pod from a new pod group, find placement for the group, // and checks if the group needs preemption. func (h *HivedAlgorithm) schedulePodFromNewGroup( - s *api.PodSchedulingSpec, + podSchedSpec *apiv2.PodSchedulingSpec, suggestedNodes common.Set, phase internal.SchedulingPhase, pod *core.Pod) ( - groupPhysicalPlacement groupPhysicalPlacement, - groupVirtualPlacement groupVirtualPlacement, + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, preemptionVictims map[string]common.Set, waitReason string) { - groupPhysicalPlacement, groupVirtualPlacement, waitReason = h.scheduleNewAffinityGroup( - pod, s, suggestedNodes) - if groupPhysicalPlacement == nil { - return nil, nil, nil, waitReason + physicalPlacement, virtualPlacement, waitReason = h.scheduleNewPodGroup( + pod, podSchedSpec, suggestedNodes) + if PodGroupPlacement(physicalPlacement).IsEmpty() { + return PodGroupPhysicalPlacement{}, PodGroupVirtualPlacement{}, nil, waitReason } - preemptionVictims, overlappingPreemptors := collectPreemptionVictims(groupPhysicalPlacement) + preemptionVictims, overlappingPreemptors := collectPreemptionVictims(physicalPlacement) // we allow a new preemption only when in Preempting phase // and the placement is fully within suggested nodes if phase == internal.PreemptingPhase { // first cancel preemption of other groups whose resources overlap with the current group for preemptor := range overlappingPreemptors.Items() { - klog.Infof("[%v]: Canceling affinity group %v's preemption because it is "+ - "further preempted by a higher-priority affinity group %v", - internal.Key(pod), preemptor.(*AlgoAffinityGroup).name, s.AffinityGroup.Name) - h.deletePreemptingAffinityGroup(preemptor.(*AlgoAffinityGroup), pod) + klog.Infof("[%v]: Canceling pod group %v's preemption because it is "+ + "further preempted by a higher-priority pod group %v", + 
internal.Key(pod), preemptor.(*PodGroupSchedulingStatus).name, podSchedSpec.PodRootGroup.Name) + h.deletePreemptingPodGroup(preemptor.(*PodGroupSchedulingStatus), pod) } if len(preemptionVictims) != 0 { // create preemption state to avoid resource contention among multiple preemptors - h.createPreemptingAffinityGroup(s, groupPhysicalPlacement, groupVirtualPlacement, pod) + h.createPreemptingPodGroup(podSchedSpec, physicalPlacement, virtualPlacement, pod) } } else if len(preemptionVictims) != 0 { // here we won't create preemption state since we call preempt only in Preempting phase klog.Infof("[%v]: Found preemption victims %v in non-Preempting phase, skipping it", internal.Key(pod), victimsToString(preemptionVictims)) } - return groupPhysicalPlacement, groupVirtualPlacement, preemptionVictims, waitReason + return physicalPlacement, virtualPlacement, preemptionVictims, waitReason } -// scheduleNewAffinityGroup schedules each pod of a new affinity group to a set of leaf cells +// scheduleNewPodGroup schedules each pod of a new pod group to a set of leaf cells // (in both the physical cluster and the VC). This is the entrance of a new scheduling attempt. -func (h *HivedAlgorithm) scheduleNewAffinityGroup( +func (h *HivedAlgorithm) scheduleNewPodGroup( pod *core.Pod, - s *api.PodSchedulingSpec, + podSchedSpec *apiv2.PodSchedulingSpec, suggestedNodes common.Set) ( - physicalPlacement groupPhysicalPlacement, - virtualPlacement groupVirtualPlacement, + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, failedReason string) { - klog.Infof("[%v]: Scheduling new affinity group %v", internal.Key(pod), s.AffinityGroup.Name) - priority := CellPriority(s.Priority) - sr := schedulingRequest{ - vc: s.VirtualCluster, - pinnedCellId: s.PinnedCellId, - priority: priority, - affinityGroupName: s.AffinityGroup.Name, - affinityGroupPodNums: map[int32]int32{}, - suggestedNodes: suggestedNodes, - ignoreSuggestedNodes: s.IgnoreK8sSuggestedNodes, - } - for _, m := range s.AffinityGroup.Members { - // we will merge group members with same leaf cell number - sr.affinityGroupPodNums[m.LeafCellNumber] += m.PodNumber - } - h.validateSchedulingRequest(sr, pod) - if sr.pinnedCellId != "" { - klog.Infof("Using pinned cell %v", s.PinnedCellId) - physicalPlacement, virtualPlacement, failedReason = h.handleSchedulingRequest(sr) - } else if s.LeafCellType != "" { - if _, ok := h.cellChains[s.LeafCellType]; !ok { + klog.Infof("[%v]: Scheduling new pod group %v", internal.Key(pod), podSchedSpec.PodRootGroup.Name) + podGroupSchedRequest := PodGroupSchedulingRequest{ + vc: podSchedSpec.VirtualCluster, + pinnedCellId: podSchedSpec.PinnedCellId, + podRootGroup: *podSchedSpec.PodRootGroup, + priority: CellPriority(podSchedSpec.Priority), + } + h.validateSchedulingRequest(podGroupSchedRequest, pod) + if podGroupSchedRequest.pinnedCellId != "" { + klog.Infof("Using pinned cell %v", podGroupSchedRequest.pinnedCellId) + physicalPlacement, virtualPlacement, failedReason = h.handleSchedulingRequest(podGroupSchedRequest) + } else if podSchedSpec.CellType != "" { + if _, ok := h.cellChains[podSchedSpec.CellType]; !ok { panic(internal.NewBadRequestError(fmt.Sprintf( - "[%v]: Pod requesting leaf cell type %v which the whole cluster does not have", - internal.Key(pod), s.LeafCellType))) + "[%v]: Pod requesting cell type %v which the whole cluster does not have", + internal.Key(pod), podSchedSpec.CellType))) } - klog.Infof("Using specified leaf cell type %v", s.LeafCellType) - physicalPlacement, 
virtualPlacement, failedReason = h.scheduleAffinityGroupForLeafCellType( - sr, s.LeafCellType, pod, true) + klog.Infof("Using specified cell type %v", podSchedSpec.CellType) + physicalPlacement, virtualPlacement, failedReason = h.schedulePodGroupForCellType( + podGroupSchedRequest, podSchedSpec.CellType, pod, true) } else { - physicalPlacement, virtualPlacement, failedReason = h.scheduleAffinityGroupForAnyLeafCellType(sr, pod) + physicalPlacement, virtualPlacement, failedReason = h.schedulePodGroupForAnyLeafCellType(podGroupSchedRequest, pod) } return physicalPlacement, virtualPlacement, failedReason } -// scheduleAffinityGroupForLeafCellType schedules an affinity group in a certain cell chain -// that matches the given leaf cell type. -func (h *HivedAlgorithm) scheduleAffinityGroupForLeafCellType( - sr schedulingRequest, - leafCellType string, +// schedulePodGroupForCellType schedules a pod group in a certain cell chain +// that matches the given cell type. +func (h *HivedAlgorithm) schedulePodGroupForCellType( + podGroupSchedRequest PodGroupSchedulingRequest, + cellType string, pod *core.Pod, typeSpecified bool) ( - physicalPlacement groupPhysicalPlacement, - virtualPlacement groupVirtualPlacement, + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, failedReason string) { vcHasType := false - for _, chain := range h.cellChains[leafCellType] { - if sr.priority < minGuaranteedPriority || - h.vcSchedulers[sr.vc].getNonPinnedPreassignedCells()[chain] != nil { + for _, chain := range h.cellChains[cellType] { + if podGroupSchedRequest.priority < minGuaranteedPriority || + h.vcSchedulers[podGroupSchedRequest.vc].getNonPinnedPreassignedCells()[chain] != nil { vcHasType = true klog.Infof("Searching chain %v", chain) - sr.chain = chain + podGroupSchedRequest.chain = chain physicalPlacement, virtualPlacement, failedReason = - h.handleSchedulingRequest(sr) - if physicalPlacement != nil { + h.handleSchedulingRequest(podGroupSchedRequest) + if !PodGroupPlacement(physicalPlacement).IsEmpty() { return physicalPlacement, virtualPlacement, "" } } } - if typeSpecified && sr.priority >= minGuaranteedPriority && !vcHasType { + if typeSpecified && podGroupSchedRequest.priority >= minGuaranteedPriority && !vcHasType { panic(internal.NewBadRequestError(fmt.Sprintf( - "[%v]: Pod requesting leaf cell type %v which VC %v does not have", - internal.Key(pod), leafCellType, sr.vc))) + "[%v]: Pod requesting cell type %v which VC %v does not have", + internal.Key(pod), cellType, podGroupSchedRequest.vc))) } - return nil, nil, failedReason + return PodGroupPhysicalPlacement{}, PodGroupVirtualPlacement{}, failedReason } -// scheduleAffinityGroupForAnyLeafCellType schedules an affinity group in every possible leaf cell type -// (when the user does not specify a leaf cell type). -func (h *HivedAlgorithm) scheduleAffinityGroupForAnyLeafCellType( - sr schedulingRequest, +// schedulePodGroupForAnyLeafCellType schedules a pod group in every possible leaf cell type +// (when the user does not specify a cell type). 
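// Editor's note (a hedged sketch, not part of this patch): scheduleNewPodGroup above narrows
// the search space in a fixed order before any placement is attempted;
// schedulePodGroupForAnyLeafCellType below covers the last case. The helper name and the
// return strings are hypothetical and assume the algorithm package context.
func describeSearchSpace(spec *apiv2.PodSchedulingSpec) string {
	switch {
	case spec.PinnedCellId != "":
		// a pinned cell fixes both the VC cells and the physical cells to search
		return fmt.Sprintf("pinned cell %v", spec.PinnedCellId)
	case spec.CellType != "":
		// an explicit cell type restricts the search to the chains containing it;
		// a type unknown to the cluster is rejected with a bad-request panic
		return fmt.Sprintf("chains containing cell type %v", spec.CellType)
	default:
		// otherwise every leaf cell type (and hence every chain) is tried in turn
		return "all leaf cell types in the cluster"
	}
}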
+func (h *HivedAlgorithm) schedulePodGroupForAnyLeafCellType( + podGroupSchedRequest PodGroupSchedulingRequest, pod *core.Pod) ( - groupPhysicalPlacement, - groupVirtualPlacement, + PodGroupPhysicalPlacement, + PodGroupVirtualPlacement, string) { var failedReason string - for leafCellType := range h.cellChains { + for leafCellType := range h.leafCellChains { klog.Infof("Searching leaf cell type %v", leafCellType) + podGroupSchedRequest.podRootGroup.SetCellType(leafCellType) typePhysicalPlacement, typeVirtualPlacement, typeFailedReason := - h.scheduleAffinityGroupForLeafCellType(sr, leafCellType, pod, false) - if typePhysicalPlacement != nil { + h.schedulePodGroupForCellType(podGroupSchedRequest, leafCellType, pod, false) + if !PodGroupPlacement(typePhysicalPlacement).IsEmpty() { return typePhysicalPlacement, typeVirtualPlacement, "" } if typeFailedReason != "" { failedReason = typeFailedReason } } - return nil, nil, failedReason + return PodGroupPhysicalPlacement{}, PodGroupVirtualPlacement{}, failedReason } // validateSchedulingRequest checks the existence of VC and pinned cell, and the legality of priority. -func (h *HivedAlgorithm) validateSchedulingRequest(sr schedulingRequest, pod *core.Pod) { +func (h *HivedAlgorithm) validateSchedulingRequest(podGroupSchedRequest PodGroupSchedulingRequest, pod *core.Pod) { var message string - if h.vcSchedulers[sr.vc] == nil { - message = fmt.Sprintf("VC %v does not exists!", sr.vc) - } else if sr.pinnedCellId != "" { - if h.vcSchedulers[sr.vc].getPinnedCells()[sr.pinnedCellId] == nil { - message = fmt.Sprintf("VC %v does not have pinned cell %v", sr.vc, sr.pinnedCellId) - } else if sr.priority == opportunisticPriority { - message = fmt.Sprintf("opportunistic pod not supported to use pinned cell %v", sr.pinnedCellId) + if h.vcSchedulers[podGroupSchedRequest.vc] == nil { + message = fmt.Sprintf("VC %v does not exists!", podGroupSchedRequest.vc) + } else if podGroupSchedRequest.pinnedCellId != "" { + if h.vcSchedulers[podGroupSchedRequest.vc].getPinnedCells()[podGroupSchedRequest.pinnedCellId] == nil { + message = fmt.Sprintf("VC %v does not have pinned cell %v", podGroupSchedRequest.vc, podGroupSchedRequest.pinnedCellId) + } else if podGroupSchedRequest.priority == opportunisticPriority { + message = fmt.Sprintf("opportunistic pod not supported to use pinned cell %v", podGroupSchedRequest.pinnedCellId) } } if message != "" { @@ -871,92 +885,84 @@ func (h *HivedAlgorithm) validateSchedulingRequest(sr schedulingRequest, pod *co // handleSchedulingRequest feeds a request to a VC scheduler or the opportunistic scheduler depending on its priority. 
func (h *HivedAlgorithm) handleSchedulingRequest( - sr schedulingRequest) ( - physicalPlacement groupPhysicalPlacement, - virtualPlacement groupVirtualPlacement, + podGroupSchedRequest PodGroupSchedulingRequest) ( + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, failedReason string) { - str := fmt.Sprintf("chain %v", sr.chain) - if sr.pinnedCellId != "" { - str = fmt.Sprintf("pinned cell %v", sr.pinnedCellId) + str := fmt.Sprintf("chain %v", podGroupSchedRequest.chain) + if podGroupSchedRequest.pinnedCellId != "" { + str = fmt.Sprintf("pinned cell %v", podGroupSchedRequest.pinnedCellId) } - klog.Infof("Processing scheduling request: %v, leaf cell numbers %v, priority %v", - str, common.ToJson(sr.affinityGroupPodNums), sr.priority) - if sr.priority >= minGuaranteedPriority { - physicalPlacement, virtualPlacement, failedReason = h.scheduleGuaranteedAffinityGroup(sr) + klog.Infof("Processing scheduling request: %v, pod root group %v, priority %v", + str, common.ToJson(podGroupSchedRequest.podRootGroup), podGroupSchedRequest.priority) + if podGroupSchedRequest.priority >= minGuaranteedPriority { + physicalPlacement, virtualPlacement, failedReason = h.scheduleGuaranteedPodGroup(podGroupSchedRequest) } else { - physicalPlacement, failedReason = h.scheduleOpportunisticAffinityGroup(sr) + physicalPlacement, failedReason = h.scheduleOpportunisticPodGroup(podGroupSchedRequest) } - if physicalPlacement == nil { + if PodGroupPlacement(physicalPlacement).IsEmpty() { klog.Infof("Cannot find placement in %v: %v", str, failedReason) - return nil, nil, failedReason + return PodGroupPhysicalPlacement{}, PodGroupVirtualPlacement{}, failedReason } klog.Infof("Found placement in %v: %v", str, physicalPlacement) return physicalPlacement, virtualPlacement, "" } -// scheduleGuaranteedAffinityGroup schedules an affinity group in its VC, +// scheduleGuaranteedPodGroup schedules a pod group in its VC, // and then maps the placement in VC to the physical cluster. 
-func (h *HivedAlgorithm) scheduleGuaranteedAffinityGroup( - sr schedulingRequest) ( - physicalPlacement groupPhysicalPlacement, - virtualPlacement groupVirtualPlacement, +func (h *HivedAlgorithm) scheduleGuaranteedPodGroup( + podGroupSchedRequest PodGroupSchedulingRequest) ( + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, failedReason string) { // schedule in VC - virtualPlacement, failedReason = h.vcSchedulers[sr.vc].schedule(sr) - if virtualPlacement == nil { - return nil, nil, failedReason + virtualPlacement, failedReason = h.vcSchedulers[podGroupSchedRequest.vc].schedule(podGroupSchedRequest) + if PodGroupPlacement(virtualPlacement).IsEmpty() { + return PodGroupPhysicalPlacement{}, PodGroupVirtualPlacement{}, failedReason } // map the vc placement to the physical cluster bindings := map[api.CellAddress]*PhysicalCell{} - leafCellNums := common.Int32MapKeys(sr.affinityGroupPodNums) - common.SortInt32(leafCellNums) - lazyPreemptedGroups := h.tryLazyPreempt(virtualPlacement, leafCellNums, sr.affinityGroupName) - preassignedCells, nonPreassignedCells := virtualPlacement.toBindingPaths(leafCellNums, bindings) + lazyPreemptedGroups := h.tryLazyPreempt(virtualPlacement, podGroupSchedRequest.podRootGroup.Name) + preassignedCells, nonPreassignedCells := virtualPlacement.toBindingPaths(bindings) // make a copy of freeCellNum, may change its values during allocation freeCellNumCopy := map[CellLevel]int32{} - for k, v := range h.allVCFreeCellNum[sr.chain] { + for k, v := range h.allVCFreeCellNum[podGroupSchedRequest.chain] { freeCellNumCopy[k] = v } if ok := mapVirtualPlacementToPhysical( preassignedCells, nonPreassignedCells, - h.freeCellList[sr.chain].shallowCopy(), + h.freeCellList[podGroupSchedRequest.chain].shallowCopy(), freeCellNumCopy, - sr.suggestedNodes, - sr.ignoreSuggestedNodes, + common.NewSet(), + true, bindings); ok { - return virtualPlacement.toPhysicalPlacement(bindings, leafCellNums), virtualPlacement, "" + return virtualPlacement.toPhysicalPlacement(bindings), virtualPlacement, "" } for groupName, placement := range lazyPreemptedGroups { - h.revertLazyPreempt(h.affinityGroups[groupName], placement) + h.revertLazyPreempt(h.podGroups[groupName], placement) } - failedNodeType := "bad or non-suggested" - if sr.ignoreSuggestedNodes { - failedNodeType = "bad" - } - return nil, nil, fmt.Sprintf( + // ignore suggested nodes globally + failedNodeType := "bad" + return PodGroupPhysicalPlacement{}, PodGroupVirtualPlacement{}, fmt.Sprintf( "Mapping the virtual placement would need to use at least one %v node "+ "(virtual placement : %v)", failedNodeType, virtualPlacement) } -// tryLazyPreempt tries to lazy preempt the affinity groups found on a placement. +// tryLazyPreempt tries to lazy preempt the pod groups found on a placement. 
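// Editor's note (not part of this patch): scheduleGuaranteedPodGroup above is a two-phase
// operation. Phase 1 places the group on virtual cells inside its VC; phase 2 binds those
// virtual cells to free physical cells via mapVirtualPlacementToPhysical, lazily preempting
// (through tryLazyPreempt below) any group whose physical cells are already bound to the
// chosen virtual cells and that has lazy preemption enabled. With this patch the
// suggested-node set is no longer threaded through the request (an empty set and
// ignoreSuggestedNodes = true are passed), so only bad nodes can make the mapping fail; on
// failure the lazy preemptions are reverted and the failure is reported against the virtual
// placement.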
func (h *HivedAlgorithm) tryLazyPreempt( - p groupVirtualPlacement, - leafCellNums []int32, - groupName string) map[string]groupVirtualPlacement { - - preemptedGroups := map[string]groupVirtualPlacement{} - for _, podLeafCellNum := range leafCellNums { - podPlacements := p[podLeafCellNum] - for _, pod := range podPlacements { - for _, leafCell := range pod { - if pLeafCell := leafCell.(*VirtualCell).GetPhysicalCell(); pLeafCell != nil { - if pLeafCell.GetState() == cellUsed && pLeafCell.GetUsingGroup().lazyPreemptionEnable { - preemptedGroups[pLeafCell.GetUsingGroup().name] = h.lazyPreemptAffinityGroup( - pLeafCell.GetUsingGroup(), groupName) - } + virtualPlacement PodGroupVirtualPlacement, + groupName string) map[string]PodGroupVirtualPlacement { + + preemptedGroups := map[string]PodGroupVirtualPlacement{} + for iter := PodGroupPlacement(virtualPlacement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + if pLeafCell := leafCell.(*VirtualCell).GetPhysicalCell(); pLeafCell != nil { + if pLeafCell.GetState() == cellUsed && pLeafCell.GetUsingGroup().lazyPreemptionEnable { + preemptedGroups[pLeafCell.GetUsingGroup().name] = h.lazyPreemptPodGroup( + pLeafCell.GetUsingGroup(), groupName) } } } @@ -964,262 +970,268 @@ func (h *HivedAlgorithm) tryLazyPreempt( return preemptedGroups } -// scheduleOpportunisticAffinityGroup calls the opportunistic pod scheduler to schedule an affinity group. -func (h *HivedAlgorithm) scheduleOpportunisticAffinityGroup( - sr schedulingRequest) ( - placement groupPhysicalPlacement, +// scheduleOpportunisticPodGroup calls the opportunistic pod scheduler to schedule a pod group. +func (h *HivedAlgorithm) scheduleOpportunisticPodGroup( + podGroupSchedRequest PodGroupSchedulingRequest) ( + physicalPlacement PodGroupPhysicalPlacement, failedReason string) { - placement, failedReason = h.opportunisticSchedulers[sr.chain].Schedule( - sr.affinityGroupPodNums, opportunisticPriority, sr.suggestedNodes, sr.ignoreSuggestedNodes) - if placement == nil { - return nil, fmt.Sprintf("%v when scheduling in physical cluster", failedReason) + var placement PodGroupPlacement + placement, failedReason = h.opportunisticSchedulers[podGroupSchedRequest.chain].Schedule( + &podGroupSchedRequest.podRootGroup, opportunisticPriority) + physicalPlacement = PodGroupPhysicalPlacement(placement) + if PodGroupPlacement(physicalPlacement).IsEmpty() { + return PodGroupPhysicalPlacement{}, fmt.Sprintf("%v when scheduling in physical cluster", failedReason) } - return placement, "" + return physicalPlacement, "" } -// createAllocatedAffinityGroup creates a new affinity group and allocate the resources. -func (h *HivedAlgorithm) createAllocatedAffinityGroup(s *api.PodSchedulingSpec, info *api.PodBindInfo, pod *core.Pod) { - klog.Infof("[%v]: Creating new allocated affinity group: %v", internal.Key(pod), s.AffinityGroup.Name) - newGroup := newAlgoAffinityGroup( - s.AffinityGroup, s.VirtualCluster, s.LazyPreemptionEnable, s.Priority, groupAllocated) +// createAllocatedPodGroup creates a new pod group and allocate the resources. 
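// Editor's note (a hedged sketch, not part of this patch): the old groupPhysicalPlacement /
// groupVirtualPlacement maps (leaf cell number -> pod index -> leaf cells) are replaced by
// PodGroupPhysicalPlacement / PodGroupVirtualPlacement, which are walked with a flat
// iterator, as in tryLazyPreempt above and the allocation / preemption functions below.
// The helper name countLeafCells is hypothetical; it only illustrates the iteration pattern
// and assumes the algorithm package context.
func countLeafCells(placement PodGroupPhysicalPlacement) int32 {
	n := int32(0)
	// Iterator() visits every pod of the (possibly nested) pod root group;
	// *iter.Next() is that pod's slice of leaf cells. Entries may be nil
	// (some callers below skip them), so the sketch checks for nil as well.
	for iter := PodGroupPlacement(placement).Iterator(); iter.HasNext(); {
		for _, leafCell := range *iter.Next() {
			if leafCell != nil {
				n++
			}
		}
	}
	return n
}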
+func (h *HivedAlgorithm) createAllocatedPodGroup(podSchedSpec *apiv2.PodSchedulingSpec, info *apiv2.PodBindInfo, pod *core.Pod) { + klog.Infof("[%v]: Creating new allocated pod group: %v", internal.Key(pod), podSchedSpec.PodRootGroup.Name) + newPodGroupSchedStatus := newPodGroupSchedulingStatus( + podSchedSpec, h.leafCellNums[CellChain(info.CellChain)], h.cellLevels[CellChain(info.CellChain)], podGroupAllocated) shouldLazyPreempt := false - for _, gms := range info.AffinityGroupBindInfo { - leafCellNumber := int32(len(gms.PodPlacements[0].PhysicalLeafCellIndices)) - for podIndex := int32(0); podIndex < int32(len(gms.PodPlacements)); podIndex++ { - node := gms.PodPlacements[podIndex].PhysicalNode - for leafCellIndex := int32(0); leafCellIndex < int32( - len(gms.PodPlacements[podIndex].PhysicalLeafCellIndices)); leafCellIndex++ { - pLeafCell, vLeafCell, lazyPreempt := h.findAllocatedLeafCell( - leafCellIndex, - gms.PodPlacements[podIndex].PhysicalLeafCellIndices, - gms.PodPlacements[podIndex].PreassignedCellTypes, - CellChain(info.CellChain), node, shouldLazyPreempt, s, newGroup, pod) - if pLeafCell == nil { - // pLeafCell not being found means that this leaf cell address does not exist in the spec. - // we simply ignore this leaf cell, and let the job run normally - // (but we cannot ignore the other leaf cells of this pod that are still in the spec, - // otherwise it may cause resource conflicts) - continue - } else { - newGroup.physicalLeafCellPlacement[leafCellNumber][podIndex][leafCellIndex] = pLeafCell - if lazyPreempt == nil { - newGroup.virtualLeafCellPlacement = nil - } else if vLeafCell != nil { - newGroup.virtualLeafCellPlacement[leafCellNumber][podIndex][leafCellIndex] = vLeafCell - if inFreeCellList(pLeafCell) && vLeafCell.GetPreassignedCell().GetPriority() > freePriority { - // This means we decide to bind this cell to a virtual cell whose preassigned cell - // has been bound (in cases like reconfiguration and the VC's cells are fewer than before). - // We need to destroy the previous binding, by lazy preempting all the groups - // in the preassigned cell - h.lazyPreemptCell(vLeafCell.GetPreassignedCell(), newGroup.name) - } - } else { - shouldLazyPreempt = shouldLazyPreempt || *lazyPreempt - } - // Even if we have successfully found the vLeafCell and pLeafCell, there is still one possibility - // that we should not bind them: allocating the physical cell may lead to broken safety. - // Such case won't happen by design as buddy alloc guarantees safety; but this could - // happen due to inconsistency of VC assignments for reasons like reconfiguration. - // In this case, we will lazy preempt this affinity group. 
- safetyOk, reason := h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(s.Priority), newGroup.vc) - pLeafCell.AddUsingGroup(newGroup) - setCellState(pLeafCell, cellUsed) - if !safetyOk { - shouldLazyPreempt = true - klog.Warningf("[%v]: %v", internal.Key(pod), reason) + + infoIter := info.PodRootGroupBindInfo.Iterator() + pIter := PodGroupPlacement(newPodGroupSchedStatus.physicalPlacement).Iterator() + vIter := PodGroupPlacement(newPodGroupSchedStatus.virtualPlacement).Iterator() + for infoIter.HasNext() { + podPlacementInfo := infoIter.Next() + pLeafCells := *pIter.Next() + vLeafCells := *vIter.Next() + + node := podPlacementInfo.PhysicalNode + for leafCellIndex := int32(0); leafCellIndex < int32( + len(podPlacementInfo.PhysicalLeafCellIndices)); leafCellIndex++ { + pLeafCell, vLeafCell, lazyPreempt := h.findAllocatedLeafCell( + leafCellIndex, + podPlacementInfo.PhysicalLeafCellIndices, + podPlacementInfo.PreassignedCellTypes, + CellChain(info.CellChain), node, shouldLazyPreempt, + podSchedSpec, newPodGroupSchedStatus, pod) + if pLeafCell == nil { + // pLeafCell not being found means that this leaf cell address does not exist in the spec. + // we simply ignore this leaf cell, and let the job run normally + // (but we cannot ignore the other leaf cells of this pod that are still in the spec, + // otherwise it may cause resource conflicts) + continue + } else { + pLeafCells[leafCellIndex] = pLeafCell + if lazyPreempt == nil { + newPodGroupSchedStatus.virtualPlacement = PodGroupVirtualPlacement{} + } else if vLeafCell != nil { + vLeafCells[leafCellIndex] = vLeafCell + if inFreeCellList(pLeafCell) && vLeafCell.GetPreassignedCell().GetPriority() > freePriority { + // This means we decide to bind this cell to a virtual cell whose preassigned cell + // has been bound (in cases like reconfiguration and the VC's cells are fewer than before). + // We need to destroy the previous binding, by lazy preempting all the groups + // in the preassigned cell + h.lazyPreemptCell(vLeafCell.GetPreassignedCell(), newPodGroupSchedStatus.name) } + } else { + shouldLazyPreempt = shouldLazyPreempt || *lazyPreempt + } + // Even if we have successfully found the vLeafCell and pLeafCell, there is still one possibility + // that we should not bind them: allocating the physical cell may lead to broken safety. + // Such case won't happen by design as buddy alloc guarantees safety; but this could + // happen due to inconsistency of VC assignments for reasons like reconfiguration. + // In this case, we will lazy preempt this affinity group. 
+ safetyOk, reason := h.allocateLeafCell(pLeafCell, vLeafCell, newPodGroupSchedStatus.priority, newPodGroupSchedStatus.vc) + pLeafCell.AddUsingGroup(newPodGroupSchedStatus) + setCellState(pLeafCell, cellUsed) + if !safetyOk { + shouldLazyPreempt = true + klog.Warningf("[%v]: %v", internal.Key(pod), reason) } } } } if shouldLazyPreempt { - h.lazyPreemptAffinityGroup(newGroup, newGroup.name) + h.lazyPreemptPodGroup(newPodGroupSchedStatus, newPodGroupSchedStatus.name) } - h.affinityGroups[s.AffinityGroup.Name] = newGroup - klog.Infof("[%v]: New allocated affinity group created: %v", internal.Key(pod), s.AffinityGroup.Name) + h.podGroups[podSchedSpec.PodRootGroup.Name] = newPodGroupSchedStatus + klog.Infof("[%v]: New allocated pod group created: %v", internal.Key(pod), podSchedSpec.PodRootGroup.Name) } -// deleteAllocatedAffinityGroup deletes a new affinity group and release the resources (that are not +// deleteAllocatedPodGroup deletes a new pod group and release the resources (that are not // allocated to a preempting group). -func (h *HivedAlgorithm) deleteAllocatedAffinityGroup(g *AlgoAffinityGroup, pod *core.Pod) { - klog.Infof("[%v]: All pods complete, deleting allocated affinity group: %v", - internal.Key(pod), g.name) - for _, podPlacements := range g.physicalLeafCellPlacement { - for _, podPlacement := range podPlacements { - for _, leafCell := range podPlacement { - if leafCell == nil { - continue - } - pLeafCell := leafCell.(*PhysicalCell) - pLeafCell.DeleteUsingGroup(g) - // state of pLeafCell can be either Used or Reserving - if pLeafCell.GetState() == cellUsed { - h.releaseLeafCell(pLeafCell, g.vc) - setCellState(pLeafCell, cellFree) - } else { // cellReserving - // When pLeafCell is in Reserving state, we shouldn't call h.releaseLeafCell - // because it must have been allocated to the reserving group before - setCellState(pLeafCell, cellReserved) - } +func (h *HivedAlgorithm) deleteAllocatedPodGroup(podGroupSchedStatus *PodGroupSchedulingStatus, pod *core.Pod) { + klog.Infof("[%v]: All pods complete, deleting allocated pod group: %v", + internal.Key(pod), podGroupSchedStatus.name) + for iter := PodGroupPlacement(podGroupSchedStatus.physicalPlacement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + if leafCell == nil { + continue + } + pLeafCell := leafCell.(*PhysicalCell) + pLeafCell.DeleteUsingGroup(podGroupSchedStatus) + // state of pLeafCell can be either Used or Reserving + if pLeafCell.GetState() == cellUsed { + h.releaseLeafCell(pLeafCell, podGroupSchedStatus.vc) + setCellState(pLeafCell, cellFree) + } else { // cellReserving + // When pLeafCell is in Reserving state, we shouldn't call h.releaseLeafCell + // because it must have been allocated to the reserving group before + setCellState(pLeafCell, cellReserved) } } } - delete(h.affinityGroups, g.name) - klog.Infof("[%v]: Allocated affinity group deleted: %v", internal.Key(pod), g.name) + delete(h.podGroups, podGroupSchedStatus.name) + klog.Infof("[%v]: Allocated pod group deleted: %v", internal.Key(pod), podGroupSchedStatus.name) } -// createPreemptingAffinityGroup creates a new affinity group that is preempting some other groups. +// createPreemptingPodGroup creates a new pod group that is preempting some other groups. // Its resources are immediately allocated to the group (even if the preemption victims have not yet been deleted), // so that other groups will not be scheduled to the same placement (unless they have higher priorities). 
// This avoids the case where multiple groups preempt the same victims simultaneously, which may cause resource deadlock. -func (h *HivedAlgorithm) createPreemptingAffinityGroup( - s *api.PodSchedulingSpec, - physicalPlacement groupPhysicalPlacement, - virtualPlacement groupVirtualPlacement, +func (h *HivedAlgorithm) createPreemptingPodGroup( + podSchedSpec *apiv2.PodSchedulingSpec, + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, pod *core.Pod) { - klog.Infof("[%v]: Creating new preempting affinity group: %v", internal.Key(pod), s.AffinityGroup.Name) - newGroup := newAlgoAffinityGroup( - s.AffinityGroup, s.VirtualCluster, s.LazyPreemptionEnable, s.Priority, groupPreempting) - newGroup.physicalLeafCellPlacement = physicalPlacement - newGroup.virtualLeafCellPlacement = virtualPlacement - for leafCellNum := range physicalPlacement { - for podIndex := range physicalPlacement[leafCellNum] { - for leafCellIndex, leafCell := range physicalPlacement[leafCellNum][podIndex] { - pLeafCell := leafCell.(*PhysicalCell) - vLeafCell := virtualPlacement[leafCellNum][podIndex][leafCellIndex].(*VirtualCell) - if pLeafCell.GetState() == cellUsed { - usingGroup := pLeafCell.GetUsingGroup() - h.releaseLeafCell(pLeafCell, usingGroup.vc) - usingGroup.state = groupBeingPreempted - } - h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(s.Priority), newGroup.vc) - pLeafCell.AddReservingOrReservedGroup(newGroup) - // state of pLeafCell can be either Used or Free (if it was Reserving or Reserved, - // we must have canceled the ongoing preemption before, in h.Schedule) - if pLeafCell.GetState() == cellUsed { - setCellState(pLeafCell, cellReserving) - } else { // cellFree - setCellState(pLeafCell, cellReserved) - } + klog.Infof("[%v]: Creating new preempting pod group: %v", internal.Key(pod), podSchedSpec.PodRootGroup.Name) + newPodGroupSchedStatus := newPodGroupSchedulingStatus( + podSchedSpec, map[CellLevel]int32{}, map[api.CellType]CellLevel{}, podGroupPreempting) + newPodGroupSchedStatus.physicalPlacement = physicalPlacement + newPodGroupSchedStatus.virtualPlacement = virtualPlacement + + pIter := PodGroupPlacement(physicalPlacement).Iterator() + vIter := PodGroupPlacement(virtualPlacement).Iterator() + for pIter.HasNext() { + pLeafCells := *pIter.Next() + vLeafCells := *vIter.Next() + for leafCellIndex := range pLeafCells { + pLeafCell := pLeafCells[leafCellIndex].(*PhysicalCell) + vLeafCell := vLeafCells[leafCellIndex].(*VirtualCell) + if pLeafCell.GetState() == cellUsed { + usingGroup := pLeafCell.GetUsingGroup() + h.releaseLeafCell(pLeafCell, usingGroup.vc) + usingGroup.state = podGroupBeingPreempted + } + h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(podSchedSpec.Priority), newPodGroupSchedStatus.vc) + pLeafCell.AddReservingOrReservedGroup(newPodGroupSchedStatus) + // state of pLeafCell can be either Used or Free (if it was Reserving or Reserved, + // we must have canceled the ongoing preemption before, in h.Schedule) + if pLeafCell.GetState() == cellUsed { + setCellState(pLeafCell, cellReserving) + } else { // cellFree + setCellState(pLeafCell, cellReserved) } } } - newGroup.preemptingPods[pod.UID] = pod - h.affinityGroups[s.AffinityGroup.Name] = newGroup - klog.Infof("[%v]: New preempting affinity group created: %v", internal.Key(pod), newGroup.name) + + newPodGroupSchedStatus.preemptingPods[pod.UID] = pod + h.podGroups[podSchedSpec.PodRootGroup.Name] = newPodGroupSchedStatus + klog.Infof("[%v]: New preempting pod group created: %v", internal.Key(pod), 
newPodGroupSchedStatus.name) } -// deletePreemptingAffinityGroup revokes a preemption and deletes the affinity group that is +// deletePreemptingPodGroup revokes a preemption and deletes the pod group that is // still waiting for the completion of the preemption. -func (h *HivedAlgorithm) deletePreemptingAffinityGroup(g *AlgoAffinityGroup, pod *core.Pod) { - for leafCellNum := range g.physicalLeafCellPlacement { - for podIndex := range g.physicalLeafCellPlacement[leafCellNum] { - for _, leafCell := range g.physicalLeafCellPlacement[leafCellNum][podIndex] { - pLeafCell := leafCell.(*PhysicalCell) - h.releaseLeafCell(pLeafCell, g.vc) - pLeafCell.DeleteReservingOrReservedGroup(pLeafCell.GetReservingOrReservedGroup()) - // state of pLeafCell can be either Reserving or Reserved - if pLeafCell.GetState() == cellReserving { - setCellState(pLeafCell, cellUsed) - // return the cell to the group being preempted - beingPreemptedGroup := pLeafCell.GetUsingGroup() - var beingPreemptedVLeafCell *VirtualCell - if beingPreemptedGroup.virtualLeafCellPlacement != nil { - beingPreemptedVLeafCell = retrieveVirtualCell( - beingPreemptedGroup.physicalLeafCellPlacement, - beingPreemptedGroup.virtualLeafCellPlacement, pLeafCell) - } - h.allocateLeafCell( - pLeafCell, beingPreemptedVLeafCell, CellPriority(beingPreemptedGroup.priority), beingPreemptedGroup.vc) - } else { // cellReserved - setCellState(pLeafCell, cellFree) +func (h *HivedAlgorithm) deletePreemptingPodGroup(podGroupSchedStatus *PodGroupSchedulingStatus, pod *core.Pod) { + for iter := PodGroupPlacement(podGroupSchedStatus.physicalPlacement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + pLeafCell := leafCell.(*PhysicalCell) + h.releaseLeafCell(pLeafCell, podGroupSchedStatus.vc) + pLeafCell.DeleteReservingOrReservedGroup(pLeafCell.GetReservingOrReservedGroup()) + // state of pLeafCell can be either Reserving or Reserved + if pLeafCell.GetState() == cellReserving { + setCellState(pLeafCell, cellUsed) + // return the cell to the group being preempted + beingPreemptedGroup := pLeafCell.GetUsingGroup() + var beingPreemptedVLeafCell *VirtualCell + if !PodGroupPlacement(beingPreemptedGroup.virtualPlacement).IsEmpty() { + beingPreemptedVLeafCell = retrieveVirtualCell( + beingPreemptedGroup.physicalPlacement, + beingPreemptedGroup.virtualPlacement, pLeafCell) } + h.allocateLeafCell( + pLeafCell, beingPreemptedVLeafCell, CellPriority(beingPreemptedGroup.priority), beingPreemptedGroup.vc) + } else { // cellReserved + setCellState(pLeafCell, cellFree) } } } - delete(h.affinityGroups, g.name) - klog.Infof("[%v]: Preempting affinity group %v deleted", internal.Key(pod), g.name) + delete(h.podGroups, podGroupSchedStatus.name) + klog.Infof("[%v]: Preempting pod group %v deleted", internal.Key(pod), podGroupSchedStatus.name) } -// allocatePreemptingAffinityGroup lets a preemptor affinity group whose preemption has completed +// allocatePreemptingPodGroup lets a preemptor pod group whose preemption has completed // transition to allocated state. 
-func (h *HivedAlgorithm) allocatePreemptingAffinityGroup(g *AlgoAffinityGroup, pod *core.Pod) { - for leafCellNum := range g.physicalLeafCellPlacement { - for podIndex := range g.physicalLeafCellPlacement[leafCellNum] { - for _, leafCell := range g.physicalLeafCellPlacement[leafCellNum][podIndex] { - pLeafCell := leafCell.(*PhysicalCell) - pLeafCell.DeleteReservingOrReservedGroup(g) - pLeafCell.AddUsingGroup(g) - setCellState(pLeafCell, cellUsed) - } +func (h *HivedAlgorithm) allocatePreemptingPodGroup(podGroupSchedStatus *PodGroupSchedulingStatus, pod *core.Pod) { + for iter := PodGroupPlacement(podGroupSchedStatus.physicalPlacement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + pLeafCell := leafCell.(*PhysicalCell) + pLeafCell.DeleteReservingOrReservedGroup(podGroupSchedStatus) + pLeafCell.AddUsingGroup(podGroupSchedStatus) + setCellState(pLeafCell, cellUsed) } } - g.state = groupAllocated - g.preemptingPods = nil - klog.Infof("[%v]: Preempting affinity group %v transitioned to allocated", internal.Key(pod), g.name) + podGroupSchedStatus.state = podGroupAllocated + podGroupSchedStatus.preemptingPods = nil + klog.Infof("[%v]: Preempting pod group %v transitioned to allocated", internal.Key(pod), podGroupSchedStatus.name) } -// lazyPreemptAffinityGroup removes an affinity group from its VC, clears it virtual placement, +// lazyPreemptPodGroup removes a pod group from its VC, clears it virtual placement, // and exposes this decision. -func (h *HivedAlgorithm) lazyPreemptAffinityGroup( - victim *AlgoAffinityGroup, - preemptor string) (originalVirtualPlacement groupVirtualPlacement) { - for _, podVirtualPlacements := range victim.virtualLeafCellPlacement { - for _, podVirtualPlacement := range podVirtualPlacements { - for _, leafCell := range podVirtualPlacement { - if leafCell != nil { - vLeafCell := leafCell.(*VirtualCell) - pLeafCell := vLeafCell.GetPhysicalCell() - h.releaseLeafCell(pLeafCell, victim.vc) - h.allocateLeafCell(pLeafCell, nil, opportunisticPriority, victim.vc) - } +func (h *HivedAlgorithm) lazyPreemptPodGroup( + victim *PodGroupSchedulingStatus, + preemptor string) (originalVirtualPlacement PodGroupVirtualPlacement) { + for iter := PodGroupPlacement(victim.virtualPlacement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + if leafCell != nil { + vLeafCell := leafCell.(*VirtualCell) + pLeafCell := vLeafCell.GetPhysicalCell() + h.releaseLeafCell(pLeafCell, victim.vc) + h.allocateLeafCell(pLeafCell, nil, opportunisticPriority, victim.vc) } } } - originalVirtualPlacement = victim.virtualLeafCellPlacement - victim.virtualLeafCellPlacement = nil + originalVirtualPlacement = victim.virtualPlacement + victim.virtualPlacement = PodGroupVirtualPlacement{} victim.lazyPreemptionStatus = &api.LazyPreemptionStatus{ Preemptor: preemptor, PreemptionTime: meta.Now(), } - klog.Infof("Affinity group %v is lazy preempted from VC by %v", victim.name, preemptor) + klog.Infof("Pod group %v is lazy preempted from VC by %v", victim.name, preemptor) return originalVirtualPlacement } -// lazyPreemptCell lazy preempts all the affinity groups inside a virtual cell (and its children). +// lazyPreemptCell lazy preempts all the pod groups inside a virtual cell (and its children). 
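// Editor's note (a hedged sketch, not part of this patch): a compact summary of the
// cell-state bookkeeping done by the functions above. The hypothetical table below mirrors
// the setCellState calls in createPreemptingPodGroup, deletePreemptingPodGroup,
// allocatePreemptingPodGroup and deleteAllocatedPodGroup.
var exampleCellStateAfter = map[string]map[CellState]CellState{
	// a new preemptor reserves its placement immediately
	"createPreemptingPodGroup": {cellUsed: cellReserving, cellFree: cellReserved},
	// revoking a preemption returns Reserving cells to the group being preempted
	"deletePreemptingPodGroup": {cellReserving: cellUsed, cellReserved: cellFree},
	// once all victims are gone, the preemptor starts using its cells
	"allocatePreemptingPodGroup": {cellReserving: cellUsed, cellReserved: cellUsed},
	// releasing an allocated group frees Used cells but keeps Reserving cells for the preemptor
	"deleteAllocatedPodGroup": {cellUsed: cellFree, cellReserving: cellReserved},
}
// On the pod-group side, createPreemptingPodGroup moves the victims to podGroupBeingPreempted
// and allocatePreemptingPodGroup moves the preemptor from podGroupPreempting to
// podGroupAllocated. lazyPreemptPodGroup (above) is different: it does not change any cell
// state; it re-allocates the victim's physical cells at opportunisticPriority in the same VC,
// clears the virtual placement, and records an api.LazyPreemptionStatus so the decision is
// visible through the status API.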
func (h *HivedAlgorithm) lazyPreemptCell(c *VirtualCell, preemptor string) { if c.GetLevel() == lowestLevel && c.GetState() == cellUsed { - h.lazyPreemptAffinityGroup(c.GetPhysicalCell().GetUsingGroup(), preemptor) + h.lazyPreemptPodGroup(c.GetPhysicalCell().GetUsingGroup(), preemptor) } for _, child := range c.GetChildren() { h.lazyPreemptCell(child.(*VirtualCell), preemptor) } } -// revertLazyPreempt reverts the lazy preemption of an affinity group. -func (h *HivedAlgorithm) revertLazyPreempt(g *AlgoAffinityGroup, virtualPlacement groupVirtualPlacement) { - for leafCellNum := range g.physicalLeafCellPlacement { - for podIndex := range g.physicalLeafCellPlacement[leafCellNum] { - for leafCellIndex, leafCell := range g.physicalLeafCellPlacement[leafCellNum][podIndex] { - if leafCell == nil { - continue - } - pLeafCell := leafCell.(*PhysicalCell) - vLeafCell := virtualPlacement[leafCellNum][podIndex][leafCellIndex].(*VirtualCell) - h.releaseLeafCell(pLeafCell, g.vc) - h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(g.priority), g.vc) +// revertLazyPreempt reverts the lazy preemption of a pod group. +func (h *HivedAlgorithm) revertLazyPreempt(podGroupSchedStatus *PodGroupSchedulingStatus, virtualPlacement PodGroupVirtualPlacement) { + pIter := PodGroupPlacement(podGroupSchedStatus.physicalPlacement).Iterator() + vIter := PodGroupPlacement(virtualPlacement).Iterator() + for pIter.HasNext() { + pLeafCells := *pIter.Next() + vLeafCells := *vIter.Next() + for leafCellIndex := range pLeafCells { + if pLeafCells[leafCellIndex] == nil { + continue } + pLeafCell := pLeafCells[leafCellIndex].(*PhysicalCell) + vLeafCell := vLeafCells[leafCellIndex].(*VirtualCell) + h.releaseLeafCell(pLeafCell, podGroupSchedStatus.vc) + h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(podGroupSchedStatus.priority), podGroupSchedStatus.vc) } } - g.virtualLeafCellPlacement = virtualPlacement - g.lazyPreemptionStatus = nil - klog.Infof("Lazy preemption of affinity group %v is reverted", g.name) + podGroupSchedStatus.virtualPlacement = virtualPlacement + podGroupSchedStatus.lazyPreemptionStatus = nil + klog.Infof("Lazy preemption of pod group %v is reverted", podGroupSchedStatus.name) } // findAllocatedLeafCell finds the physical and virtual leaf cells in the full cell lists for an allocate pod. -// The boolean return value indicates whether the affinity group should be lazy-preempted. +// The boolean return value indicates whether the pod group should be lazy-preempted. // The bool being nil means the group is OT and has no virtual placement. 
func (h *HivedAlgorithm) findAllocatedLeafCell( index int32, @@ -1228,11 +1240,11 @@ func (h *HivedAlgorithm) findAllocatedLeafCell( chain CellChain, node string, lazyPreempted bool, - s *api.PodSchedulingSpec, - group *AlgoAffinityGroup, + podSchedSpec *apiv2.PodSchedulingSpec, + podGroupSchedStatus *PodGroupSchedulingStatus, pod *core.Pod) (*PhysicalCell, *VirtualCell, *bool) { - priority := CellPriority(s.Priority) + priority := CellPriority(podSchedSpec.Priority) physicalLeafCellIndex := physicalLeafCellIndices[index] if pLeafCell := findPhysicalLeafCell(h.fullCellList, chain, node, physicalLeafCellIndex); pLeafCell == nil { klog.Warningf( @@ -1245,7 +1257,7 @@ func (h *HivedAlgorithm) findAllocatedLeafCell( klog.Warningf("[%v]: Cannot find virtual cell: preassigned cell not found in pod bind info", internal.Key(pod)) return pLeafCell, nil, common.PtrBool(true) } - if group.virtualLeafCellPlacement != nil && !lazyPreempted { + if !PodGroupPlacement(podGroupSchedStatus.virtualPlacement).IsEmpty() && !lazyPreempted { preassignedType := preassignedCellTypes[index] if preassignedType != "" { var preassignedLevel CellLevel @@ -1259,17 +1271,17 @@ func (h *HivedAlgorithm) findAllocatedLeafCell( var message string if !typeFound { message = fmt.Sprintf("Preassigned cell type %v not found in chain %v", preassignedType, pLeafCell.GetChain()) - } else if vcs := h.vcSchedulers[s.VirtualCluster]; vcs == nil { - message = fmt.Sprintf("VC %v not found", s.VirtualCluster) + } else if vcs := h.vcSchedulers[podSchedSpec.VirtualCluster]; vcs == nil { + message = fmt.Sprintf("VC %v not found", podSchedSpec.VirtualCluster) } else { vccl := vcs.getNonPinnedPreassignedCells()[pLeafCell.GetChain()] str := string(pLeafCell.GetChain()) - if s.PinnedCellId != "" { - vccl = vcs.getPinnedCells()[s.PinnedCellId] - str = string(s.PinnedCellId) + if podSchedSpec.PinnedCellId != "" { + vccl = vcs.getPinnedCells()[podSchedSpec.PinnedCellId] + str = string(podSchedSpec.PinnedCellId) } if vccl == nil { - message = fmt.Sprintf("VC %v has no cell for %v", s.VirtualCluster, str) + message = fmt.Sprintf("VC %v has no cell for %v", podSchedSpec.VirtualCluster, str) } else { vLeafCell, message = mapPhysicalCellToVirtual(pLeafCell, vccl, preassignedLevel, priority) } diff --git a/pkg/algorithm/hived_algorithm_tester.go b/pkg/algorithm/hived_algorithm_tester.go new file mode 100644 index 0000000..efa1cb4 --- /dev/null +++ b/pkg/algorithm/hived_algorithm_tester.go @@ -0,0 +1,302 @@ +// MIT License +// +// Copyright (c) Microsoft Corporation. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE + +package algorithm + +import ( + "fmt" + "io/ioutil" + "reflect" + "sort" + "testing" + + "github.com/microsoft/hivedscheduler/pkg/api" + apiv2 "github.com/microsoft/hivedscheduler/pkg/api/v2" + "github.com/microsoft/hivedscheduler/pkg/common" + "github.com/microsoft/hivedscheduler/pkg/internal" + core "k8s.io/api/core/v1" + meta "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/types" +) + +type bindResult struct { + Node string `yaml:"node"` + LeafCellIsolation []int32 `yaml:"leafCellIsolation"` +} + +type preemptResult struct { + VictimPods []string `yaml:"victimPods"` +} + +type waitResult struct { + Reason string `yaml:"reason"` +} + +type step struct { + Method string `yaml:"method"` + Paramaters map[interface{}]interface{} `yaml:"parameters"` +} + +type stepList []step + +type HivedAlgorithmTester interface { + SetAllNodesToHealthy() + SetAllNodesToBad() + SetNodeToBad(nodeName string) + SetNodeToHealthy(nodeName string) + + SchedulePod(podName string, podGroupSchedulingRequest apiv2.PodSchedulingSpec, phase internal.SchedulingPhase) + DeallocatePod(podName string) + + AssertPodBindResult(podName string, expectedResult bindResult) + AssertPodPreemptResult(podName string, expectedResult preemptResult) + AssertPodWait(podName string) + AssertPodPanic(podName string) + ExecuteCaseFromYamlFile(filePath string) +} + +type GenericHivedAlgorithmTester struct { + h *HivedAlgorithm + t *testing.T + // allNodes records all node names in the configuration + allNodes []string + // podScheduleResult map podName to internal.PodScheduleResult. + // pods which have valid scheduling result will be kept in this map. + podScheduleResult map[string]internal.PodScheduleResult + // panicPodNames record the pods which panic during scheduling, + // not including those with preempt info or wait info. + panicPodNames map[string]bool + // pods record pod definition. + pods map[string]*core.Pod +} + +func NewHivedAlgorithmTester(t *testing.T, configFilePath string) *GenericHivedAlgorithmTester { + sConfig := api.NewConfig(api.InitRawConfig(&configFilePath)) + h := NewHivedAlgorithm(sConfig) + var allNodes []string + for _, ccl := range h.fullCellList { + for _, c := range ccl[CellLevel(len(ccl))] { + allNodes = append(allNodes, c.(*PhysicalCell).nodes...) 
+ } + } + // sort chains of each leaf cell type for stability of the test + for _, chains := range h.cellChains { + var chainsTemp []string + for _, c := range chains { + chainsTemp = append(chainsTemp, string(c)) + } + sort.Strings(chainsTemp) + for i := range chains { + chains[i] = CellChain(chainsTemp[len(chainsTemp)-i-1]) + } + } + tester := GenericHivedAlgorithmTester{h: h, t: t, allNodes: allNodes} + tester.SetAllNodesToHealthy() + tester.pods = make(map[string]*core.Pod) + tester.podScheduleResult = make(map[string]internal.PodScheduleResult) + tester.panicPodNames = make(map[string]bool) + return &tester +} + +func (tester *GenericHivedAlgorithmTester) SetAllNodesToHealthy() { + h := tester.h + for _, nodeName := range tester.allNodes { + h.setHealthyNode(nodeName) + } +} + +func (tester *GenericHivedAlgorithmTester) SetNodeToBad(nodeName string) { + h := tester.h + h.setBadNode(nodeName) +} + +func (tester *GenericHivedAlgorithmTester) SetNodeToHealthy(nodeName string) { + h := tester.h + h.setHealthyNode(nodeName) +} + +func (tester *GenericHivedAlgorithmTester) SchedulePod(podName string, podGroupSchedulingRequest apiv2.PodSchedulingSpec, phase internal.SchedulingPhase) { + h := tester.h + t := tester.t + defer func() { + if err := recover(); err != nil { + // record the panic + t.Logf("Panic detected for pod %v. Details: %v", podName, err) + tester.panicPodNames[podName] = true + } + }() + pod := &core.Pod{ + ObjectMeta: meta.ObjectMeta{ + Name: podName, + Namespace: "test", + UID: types.UID(podName), + Annotations: map[string]string{}, + }, + } + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podGroupSchedulingRequest) + psr := h.Schedule(pod, tester.allNodes, phase) + if psr.PodBindInfo != nil { + allocatedPod := internal.NewBindingPod(pod, psr.PodBindInfo) + h.AddAllocatedPod(allocatedPod) + tester.podScheduleResult[podName] = psr + tester.pods[podName] = allocatedPod + } else if psr.PodPreemptInfo != nil { + for _, victimPod := range psr.PodPreemptInfo.VictimPods { + h.DeleteAllocatedPod(victimPod) + } + tester.pods[podName] = pod + tester.podScheduleResult[podName] = psr + } else { + tester.pods[podName] = pod + tester.podScheduleResult[podName] = psr + } +} + +func (tester *GenericHivedAlgorithmTester) DeallocatePod(podName string) { + h := tester.h + pod, ok := tester.pods[podName] + if ok { + h.DeleteAllocatedPod(pod) + delete(tester.pods, podName) + } else { + panic("Cannot find pod " + podName) + } +} + +func (tester *GenericHivedAlgorithmTester) AssertPodBindResult(podName string, expectedResult bindResult) { + t := tester.t + psr, ok := tester.podScheduleResult[podName] + if ok { + if psr.PodBindInfo == nil { + t.Errorf("AssertPodBindResult failed for pod %v: Cannot find valid PodBindInfo!", podName) + } else { + podBindInfo := *psr.PodBindInfo + if expectedResult.Node != podBindInfo.Node { + t.Errorf("AssertPodBindResult failed for pod %v: Expected node is %v but got node %v", podName, expectedResult.Node, podBindInfo.Node) + } else { + if !reflect.DeepEqual(expectedResult.LeafCellIsolation, podBindInfo.LeafCellIsolation) { + t.Errorf("AssertPodBindResult failed for pod %v: Expected LeafCellIsolation is %v but got %v", podName, + expectedResult.LeafCellIsolation, podBindInfo.LeafCellIsolation) + } else { + t.Logf("AssertPodBindResult ok for pod %v.", podName) + } + } + } + } else { + t.Errorf("AssertPodBindResult failed for pod %v: Cannot find valid schedule result!", podName) + } +} + +func (tester *GenericHivedAlgorithmTester) 
AssertPodPreemptResult(podName string, expectedResult preemptResult) { + t := tester.t + psr, ok := tester.podScheduleResult[podName] + if ok { + if psr.PodPreemptInfo == nil { + t.Errorf("AssertPodPreemptResult failed for pod %v: Cannot find valid PodPreemptInfo!", podName) + } else { + victimPodNames := []string{} + for _, pod := range psr.PodPreemptInfo.VictimPods { + victimPodNames = append(victimPodNames, pod.Name) + } + expectedVictimPodNames := []string{} + for _, podName := range expectedResult.VictimPods { + expectedVictimPodNames = append(expectedVictimPodNames, podName) + } + sort.Strings(expectedVictimPodNames) + sort.Strings(victimPodNames) + if !reflect.DeepEqual(expectedVictimPodNames, victimPodNames) { + t.Errorf("AssertPodPreemptResult failed for pod %v: Expected victim pods are %v but got %v", podName, + expectedVictimPodNames, victimPodNames) + } else { + t.Logf("AssertPodPreemptResult ok for pod %v.", podName) + } + } + } else { + t.Errorf("AssertPodPreemptResult failed for pod %v: Cannot find valid schedule result!", podName) + } +} + +func (tester *GenericHivedAlgorithmTester) AssertPodWait(podName string) { + t := tester.t + psr, ok := tester.podScheduleResult[podName] + if ok { + if psr.PodWaitInfo == nil { + t.Errorf("AssertPodWait failed for pod %v: Cannot find valid PodWaitInfo!", podName) + } else { + t.Logf("AssertPodWait ok for pod %v.", podName) + } + } else { + t.Errorf("AssertPodWait failed for pod %v: Cannot find valid schedule result!", podName) + } +} + +func (tester *GenericHivedAlgorithmTester) AssertPodPanic(podName string) { + t := tester.t + _, ok := tester.panicPodNames[podName] + if !ok { + t.Errorf("AssertPodPanic failed for pod %v .", podName) + } else { + t.Logf("AssertPodPanic ok for pod %v.", podName) + } +} + +func (tester *GenericHivedAlgorithmTester) ExecuteCaseFromYamlFile(filePath string) { + yamlBytes, err := ioutil.ReadFile(filePath) + if err != nil { + panic(fmt.Errorf("Failed to read test case file: %v, %v", filePath, err)) + } + steps := stepList{} + common.FromYaml(string(yamlBytes), &steps) + for _, step := range steps { + if step.Method == "SchedulePod" { + var podName = step.Paramaters["podName"].(string) + var phase = internal.SchedulingPhase(step.Paramaters["phase"].(string)) + podGroupSchedulingRequest := apiv2.PodSchedulingSpec{} + common.FromYaml(common.ToYaml(step.Paramaters["podGroupSchedulingRequest"]), &podGroupSchedulingRequest) + tester.SchedulePod(podName, podGroupSchedulingRequest, phase) + } else if step.Method == "DeallocatePod" { + var podName = step.Paramaters["podName"].(string) + tester.DeallocatePod(podName) + } else if step.Method == "AssertPodBindResult" { + var podName = step.Paramaters["podName"].(string) + expectedResult := bindResult{} + common.FromYaml(common.ToYaml(step.Paramaters["expectedResult"]), &expectedResult) + tester.AssertPodBindResult(podName, expectedResult) + } else if step.Method == "AssertPodPreemptResult" { + var podName = step.Paramaters["podName"].(string) + expectedResult := preemptResult{} + common.FromYaml(common.ToYaml(step.Paramaters["expectedResult"]), &expectedResult) + tester.AssertPodPreemptResult(podName, expectedResult) + } else if step.Method == "AssertPodWait" { + var podName = step.Paramaters["podName"].(string) + tester.AssertPodWait(podName) + } else if step.Method == "AssertPodPanic" { + var podName = step.Paramaters["podName"].(string) + tester.AssertPodPanic(podName) + } else if step.Method == "SetNodeToBad" { + var nodeName = step.Paramaters["nodeName"].(string) + 
tester.SetNodeToBad(nodeName) + } else { + panic(fmt.Errorf("The method %v is not implemented!", step.Method)) + } + } +} diff --git a/pkg/algorithm/hived_algorithm_v2_new_test.go b/pkg/algorithm/hived_algorithm_v2_new_test.go new file mode 100644 index 0000000..0946d60 --- /dev/null +++ b/pkg/algorithm/hived_algorithm_v2_new_test.go @@ -0,0 +1,61 @@ +// MIT License +// +// Copyright (c) Microsoft Corporation. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE + +package algorithm + +import ( + "io/ioutil" + "path/filepath" + "strings" + "testing" +) + +var testConfigRootPath, _ = filepath.Abs("../../test/config") + +func ExecuteHivedAlgorithmTestGroup(t *testing.T, groupFolder string) { + groupSettingPath := filepath.Join(groupFolder, "setting.yaml") + fileInfo, err := ioutil.ReadDir(groupFolder) + if err != nil { + panic(err) + } + for _, file := range fileInfo { + if (file.IsDir() == false) && (strings.HasPrefix(file.Name(), "case")) { + caseFileName := file.Name() + caseFilePath := filepath.Join(groupFolder, caseFileName) + t.Logf("Will execute %v", caseFilePath) + tester := NewHivedAlgorithmTester(t, groupSettingPath) + tester.ExecuteCaseFromYamlFile(caseFilePath) + } + } +} + +func TestHivedAlgorithmGroup1(t *testing.T) { + ExecuteHivedAlgorithmTestGroup(t, filepath.Join(testConfigRootPath, "group1")) +} + +func TestHivedAlgorithmGroup2(t *testing.T) { + ExecuteHivedAlgorithmTestGroup(t, filepath.Join(testConfigRootPath, "group2")) +} + +func TestHivedAlgorithmGroup3(t *testing.T) { + ExecuteHivedAlgorithmTestGroup(t, filepath.Join(testConfigRootPath, "group3")) +} diff --git a/pkg/algorithm/hived_algorithm_test.go b/pkg/algorithm/hived_algorithm_v2_test.go similarity index 51% rename from pkg/algorithm/hived_algorithm_test.go rename to pkg/algorithm/hived_algorithm_v2_test.go index 61d6519..2f35914 100644 --- a/pkg/algorithm/hived_algorithm_test.go +++ b/pkg/algorithm/hived_algorithm_v2_test.go @@ -29,6 +29,7 @@ import ( "testing" "github.com/microsoft/hivedscheduler/pkg/api" + apiv2 "github.com/microsoft/hivedscheduler/pkg/api/v2" "github.com/microsoft/hivedscheduler/pkg/common" "github.com/microsoft/hivedscheduler/pkg/internal" core "k8s.io/api/core/v1" @@ -40,7 +41,8 @@ var allPods = map[string]*core.Pod{} func init() { common.InitAll() - for i := 1; i <= len(pss); i++ { + common.FromYaml(podSchedulingSpecTestData, &podSchedulingSpec) + for i := 1; i <= len(podSchedulingSpec); i++ { podName := fmt.Sprintf("pod%v", i) allPods[podName] = 
&core.Pod{ ObjectMeta: meta.ObjectMeta{ @@ -63,483 +65,945 @@ func initNodes(h *HivedAlgorithm) { } } -var group1, group2, group3, group4, group5, group6, group7, group8, group9, group10, group11, group12, group13, group14, - group15, group16, group17, group18, group19, group20, group21, group22, group23, group24, group25, group26, group27, - group28, group29, group30, group31, group32, group33, group34 = &api.AffinityGroupSpec{ - Name: "group1", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}}, -}, &api.AffinityGroupSpec{ - Name: "group2", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}}, -}, &api.AffinityGroupSpec{ - Name: "group3", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 8}}, -}, &api.AffinityGroupSpec{ - Name: "group4", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}}, -}, &api.AffinityGroupSpec{ - Name: "group5", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group6", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}}, -}, &api.AffinityGroupSpec{ - Name: "group7", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 3, LeafCellNumber: 8}}, -}, &api.AffinityGroupSpec{ - Name: "group8", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 8}}, -}, &api.AffinityGroupSpec{ - Name: "group9", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 7}, {PodNumber: 1, LeafCellNumber: 5}}, -}, &api.AffinityGroupSpec{ - Name: "group10", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}}, -}, &api.AffinityGroupSpec{ - Name: "group11", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group12", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group13", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group14", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group15", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 2}}, -}, &api.AffinityGroupSpec{ - Name: "group16", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 2}}, -}, &api.AffinityGroupSpec{ - Name: "group17", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 2}}, -}, &api.AffinityGroupSpec{ - Name: "group18", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group19", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group20", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group21", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group22", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group23", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group24", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group25", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, 
&api.AffinityGroupSpec{ - Name: "group26", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group27", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group28", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group29", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 4, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group30", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group31", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group32", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group33", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -}, &api.AffinityGroupSpec{ - Name: "group34", - Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}}, -} +type podSchedulingSpecType map[types.UID]apiv2.PodSchedulingSpec -var pss = map[types.UID]api.PodSchedulingSpec{ - "pod1": { - VirtualCluster: "VC1", - Priority: 0, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 1, - AffinityGroup: group1, - }, "pod2": { // buddy of pod1 - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 1, - AffinityGroup: group2, - }, "pod3": { // non-buddy of pod 1 & 2 (avoidance of preemption) - VirtualCluster: "VC1", - Priority: 2, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 8, - AffinityGroup: group3, - }, "pod4": { // opportunistic pod (will stay away from the guaranteed pods) - VirtualCluster: "VC1", - Priority: -1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 1, - AffinityGroup: group4, - }, "pod5": { // use pinned cell - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group5, - }, "pod6": { // use pinned cell - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group5, - }, "pod7": { // insufficient VC cells; should return PodWaitInfo - VirtualCluster: "VC2", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX1-P100", - LeafCellNumber: 8, - AffinityGroup: group7, - }, "pod8": { // any leaf cell type; heterogeneous affinity group - VirtualCluster: "VC2", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "", - LeafCellNumber: 7, - AffinityGroup: group9, - }, "pod9": { // any leaf cell type; heterogeneous affinity group - VirtualCluster: "VC2", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "", - LeafCellNumber: 5, - AffinityGroup: group9, - }, "pod10": { // use a leaf cell type that the VC does not have; should User Error Panic - VirtualCluster: "VC2", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 1, - AffinityGroup: group6, - }, "pod11": { // invalid affinity group configuration - VirtualCluster: "VC2", - Priority: 1, - 
LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX1-P100", - LeafCellNumber: 2, - AffinityGroup: group8, - }, "pod12": { // invalid affinity group configuration - VirtualCluster: "VC2", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX1-P100", - LeafCellNumber: 2, - AffinityGroup: group8, - }, "pod13": { // invalid VC - VirtualCluster: "surprise!", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX1-P100", - LeafCellNumber: 1, - AffinityGroup: group10, - }, "pod14": { // invalid pinned cell - VirtualCluster: "VC2", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "surprise!", - LeafCellType: "DGX1-P100", - LeafCellNumber: 1, - AffinityGroup: group10, - }, "pod15": { // invalid priority - VirtualCluster: "VC2", - Priority: 1001, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX1-P100", - LeafCellNumber: 1, - AffinityGroup: group10, - }, "pod16": { // trigger preemption - VirtualCluster: "VC1", - Priority: 2, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group11, - }, "pod17": { // trigger preemption - VirtualCluster: "VC1", - Priority: 2, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group11, - }, "pod18": { // used for test splitting physical cell hierarchies in reconfiguration - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group12, - }, "pod19": { // used for test splitting physical cell hierarchies in reconfiguration - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group12, - }, "pod20": { // guaranteed pod in splitting physical cell hierarchies - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group13, - }, "pod21": { // guaranteed pod in splitting physical cell hierarchies - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group13, - }, "pod22": { // opportunistic pod in splitting physical cell hierarchies - VirtualCluster: "VC1", - Priority: -1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group14, - }, "pod23": { // opportunistic pod in splitting physical cell hierarchies - VirtualCluster: "VC1", - Priority: -1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group14, - }, "pod24": { // used for triggering intra-VC preemption - VirtualCluster: "VC2", - Priority: 0, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "CT1", - LeafCellNumber: 2, - AffinityGroup: group15, - }, "pod25": { // trigger intra-VC preemption - VirtualCluster: "VC2", - Priority: 1, - LazyPreemptionEnable: false, - PinnedCellId: "", - LeafCellType: "CT1", - LeafCellNumber: 2, - AffinityGroup: group16, - }, "pod26": { // will preempt pod25 immediately (as lazy preemption is not enabled) - VirtualCluster: "VC2", - Priority: 2, - LazyPreemptionEnable: false, - PinnedCellId: "", - LeafCellType: "CT1", - LeafCellNumber: 2, - AffinityGroup: group17, - 
}, "pod27": { // will be rejected because one of the pod in this group is allocated a non-suggested node - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: false, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group18, - }, "pod28": { // used for stateful preemption test - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: false, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group19, - }, "pod29": { // will try to preempt pod28 - VirtualCluster: "VC1", - Priority: 2, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group20, - }, "pod30": { // cannot get scheduled because pod28's still holding the resource - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group21, - }, "pod31": { // will try to preempt pod28, and will be scheduled to a different node from pod29 - VirtualCluster: "VC1", - Priority: 2, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group22, - }, "pod32": { // cannot get scheduled because VC1-YQW-DGX2 has been used up by pod29 and pod31 - VirtualCluster: "VC1", - Priority: 2, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group23, - }, "pod33": { // will cancel pod29 and pod31's preemption, and continue to preempt pod28 - VirtualCluster: "VC1", - Priority: 3, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group24, - }, "pod34": { // will cancel pod33's preemption, and get scheduled immediately (because pod28 has been deleted) - VirtualCluster: "VC1", - Priority: 4, - LazyPreemptionEnable: false, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group25, - }, "pod35": { // will preempt pod34, and will be deleted before the preemption is done (so the preemption will be canceled) - VirtualCluster: "VC1", - Priority: 5, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group26, - }, "pod36": { // will iterate the leaf cell types until find a placement within suggested nodes - VirtualCluster: "VC1", - Priority: -1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "", - LeafCellNumber: 1, - AffinityGroup: group1, - }, "pod37": { // used for test aware of suggested nodes in VC - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 1, - AffinityGroup: group1, - }, "pod38": { // used for test aware of suggested nodes in VC - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "VC1-YQW-DGX2", - LeafCellType: "DGX2-V100", - LeafCellNumber: 1, - AffinityGroup: group2, - }, "pod39": { // used for triggering backtrack cell search - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group27, - }, "pod40": { // backtrack cell search - VirtualCluster: "VC1", - Priority: 1, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - 
LeafCellNumber: 16, - AffinityGroup: group28, - }, "pod41": { // revert lazy preemption in backtrack cell search - VirtualCluster: "VC1", - Priority: 2, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group29, - }, "pod42": { // doomed bad cell test - VirtualCluster: "VC1", - Priority: 0, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group30, - }, "pod43": { // doomed bad cell test - VirtualCluster: "VC2", - Priority: 0, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group31, - }, "pod44": { // safe relaxed buddy allocate for bad node test - VirtualCluster: "VC1", - Priority: 0, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group32, - }, "pod45": { // safe relaxed buddy allocate for bad node test - VirtualCluster: "VC1", - Priority: 0, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group33, - }, "pod46": { // safe relaxed buddy allocate safety test - VirtualCluster: "VC1", - Priority: 0, - LazyPreemptionEnable: true, - PinnedCellId: "", - LeafCellType: "DGX2-V100", - LeafCellNumber: 16, - AffinityGroup: group34, - }, -} +var podSchedulingSpecTestData = `pod1: + version: v2 + virtualCluster: VC1 + priority: 0 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod2: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod3: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 8 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group3 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 8 + containsCurrentPod: true + childGroups: [] +pod4: + version: v2 + virtualCluster: VC1 + priority: -1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group4 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod5: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group5 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod6: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group5 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: 
DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod7: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: DGX1-P100 + cellNumber: 8 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group7 + withinOneCell: "" + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: DGX1-P100 + cellNumber: 8 + containsCurrentPod: true + childGroups: [] +pod8: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: "" + cellNumber: 7 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group9 + withinOneCell: "" + childGroups: + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: "" + cellNumber: 7 + containsCurrentPod: true + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: "" + cellNumber: 5 + containsCurrentPod: false +pod9: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: "" + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group9 + withinOneCell: "" + childGroups: + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: "" + cellNumber: 7 + containsCurrentPod: false + - pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: "" + cellNumber: 5 + containsCurrentPod: true +pod10: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group6 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod11: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: DGX1-P100 + cellNumber: 2 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group8 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX1-P100 + cellNumber: 8 + containsCurrentPod: false + childGroups: [] +pod12: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: DGX1-P100 + cellNumber: 2 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group8 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX1-P100 + cellNumber: 8 + containsCurrentPod: false + childGroups: [] +pod13: + version: v2 + virtualCluster: surprise! + priority: 1 + pinnedCellId: "" + cellType: DGX1-P100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group10 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX1-P100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod14: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: surprise! 
+ cellType: DGX1-P100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group10 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX1-P100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod15: + version: v2 + virtualCluster: VC2 + priority: 1001 + pinnedCellId: "" + cellType: DGX1-P100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group10 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX1-P100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod16: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group11 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod17: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group11 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod18: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group12 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod19: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group12 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod20: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group13 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod21: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group13 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod22: + version: v2 + virtualCluster: VC1 + priority: -1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group14 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod23: + version: v2 + virtualCluster: VC1 + priority: -1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group14 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + 
cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod24: + version: v2 + virtualCluster: VC2 + priority: 0 + pinnedCellId: "" + cellType: CT1 + cellNumber: 2 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group15 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: CT1 + cellNumber: 2 + containsCurrentPod: true + childGroups: [] +pod25: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: CT1 + cellNumber: 2 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group16 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: CT1 + cellNumber: 2 + containsCurrentPod: true + childGroups: [] +pod26: + version: v2 + virtualCluster: VC2 + priority: 2 + pinnedCellId: "" + cellType: CT1 + cellNumber: 2 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group17 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: CT1 + cellNumber: 2 + containsCurrentPod: true + childGroups: [] +pod27: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group18 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod28: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group19 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod29: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group20 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod30: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group21 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod31: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group22 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod32: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group23 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod33: + version: v2 + virtualCluster: VC1 + priority: 3 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: 
true + podRootGroup: + name: group24 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod34: + version: v2 + virtualCluster: VC1 + priority: 4 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group25 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod35: + version: v2 + virtualCluster: VC1 + priority: 5 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group26 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod36: + version: v2 + virtualCluster: VC1 + priority: -1 + pinnedCellId: "" + cellType: "" + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: "" + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod37: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod38: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: VC1-YQW-DGX2 + cellType: DGX2-V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +pod39: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group27 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod40: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group28 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod41: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group29 + withinOneCell: "" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod42: + version: v2 + virtualCluster: VC1 + priority: 0 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group30 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod43: + version: v2 + virtualCluster: 
VC2 + priority: 0 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group31 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod44: + version: v2 + virtualCluster: VC1 + priority: 0 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group32 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod45: + version: v2 + virtualCluster: VC1 + priority: 0 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group33 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +pod46: + version: v2 + virtualCluster: VC1 + priority: 0 + pinnedCellId: "" + cellType: DGX2-V100 + cellNumber: 16 + gangReleaseEnable: false + lazyPreemptionEnable: true + podRootGroup: + name: group34 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: DGX2-V100 + cellNumber: 16 + containsCurrentPod: true + childGroups: [] +` + +var podSchedulingSpec = podSchedulingSpecType{} var casesThatShouldSucceed = []string{ "pod1", "pod2", "pod3", "pod4", "pod5", "pod6", "pod7", "pod8", "pod9", "pod16", "pod17", "pod18", "pod19", "pod20", @@ -570,8 +1034,8 @@ var expectedBindInfos = map[string]result{ "pod4": {node: "0.0.5.0", leafCellIsolation: []int32{0}}, "pod5": {node: "0.0.3.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}}, "pod6": {node: "0.0.3.1", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}}, - "pod8": {node: "1.0.0.0", leafCellIsolation: []int32{1, 3, 4, 7, 0, 2, 6}}, - "pod9": {node: "1.0.0.2", leafCellIsolation: []int32{0, 1, 2, 3, 4}}, + "pod8": {node: "1.0.0.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6}}, + "pod9": {node: "1.0.0.0", leafCellIsolation: []int32{1, 3, 4, 7, 0}}, "pod18": {node: "0.0.3.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}}, "pod19": {node: "0.0.3.3", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}}, "pod20": {node: "0.0.4.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}}, @@ -687,7 +1151,7 @@ func testCasesThatShouldSucceed(t *testing.T, h *HivedAlgorithm) { var psr internal.PodScheduleResult for _, podName := range casesThatShouldSucceed { pod := allPods[podName] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, allNodes, internal.PreemptingPhase) compareSchedulingResult(t, pod, psr) if psr.PodBindInfo != nil { @@ -717,7 +1181,7 @@ func testOneCaseThatShouldFail(t *testing.T, h *HivedAlgorithm, podNames []strin var psr internal.PodScheduleResult for _, podName := range podNames { pod := allPods[podName] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, allNodes, internal.PreemptingPhase) allocatedPod 
:= internal.NewBindingPod(pod, psr.PodBindInfo) h.AddAllocatedPod(allocatedPod) @@ -736,21 +1200,22 @@ func testDeletePods(t *testing.T, h *HivedAlgorithm) { h.DeleteAllocatedPod(allocatedPods[i]) } for _, pod := range allocatedPods { - if g, ok := h.affinityGroups[pss[pod.UID].AffinityGroup.Name]; ok { - t.Errorf("Group %v is expected to be deleted in scheduler, but not", g.name) + if _, ok := h.podGroups[podSchedulingSpec[pod.UID].PodRootGroup.Name]; ok { + t.Errorf("Group %v is expected to be deleted in scheduler, but not", podSchedulingSpec[pod.UID].PodRootGroup.Name) } } for i := len(preemptingPods) - 1; i >= 0; i-- { h.DeleteUnallocatedPod(preemptingPods[i]) } for _, pod := range preemptingPods { - if g, ok := h.affinityGroups[pss[pod.UID].AffinityGroup.Name]; ok { - t.Errorf("Group %v is expected to be deleted in scheduler, but not", g.name) + if _, ok := h.podGroups[podSchedulingSpec[pod.UID].PodRootGroup.Name]; ok { + t.Errorf("Group %v is expected to be deleted in scheduler, but not", podSchedulingSpec[pod.UID].PodRootGroup.Name) } } } func testSuggestedNodes(t *testing.T, configFilePath string) { + t.Skip("Do not support suggested nodes any more!") sConfig := api.NewConfig(api.InitRawConfig(&configFilePath)) h := NewHivedAlgorithm(sConfig) for _, chains := range h.cellChains { @@ -758,18 +1223,18 @@ func testSuggestedNodes(t *testing.T, configFilePath string) { } setHealthyNodes(h) pod := allPods["pod36"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr := h.Schedule(pod, []string{"0.0.1.0"}, internal.PreemptingPhase) compareSchedulingResult(t, pod, psr) pod = allPods["pod37"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, []string{"0.0.3.0"}, internal.PreemptingPhase) compareSchedulingResult(t, pod, psr) allocatedPod := internal.NewBindingPod(pod, psr.PodBindInfo) h.AddAllocatedPod(allocatedPod) pod = allPods["pod38"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, []string{"0.0.3.1"}, internal.PreemptingPhase) compareSchedulingResult(t, pod, psr) h.DeleteAllocatedPod(allocatedPod) @@ -781,38 +1246,38 @@ func testSuggestedNodes(t *testing.T, configFilePath string) { } } pod = allPods["pod27"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, nodes, internal.PreemptingPhase) compareSchedulingResult(t, pod, psr) nodes = append(nodes, "0.0.3.1") - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) // this time scheduling will succeed psr = h.Schedule(pod, nodes, internal.PreemptingPhase) h.AddAllocatedPod(internal.NewBindingPod(pod, psr.PodBindInfo)) pod = allPods["pod33"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, nodes, internal.FilteringPhase) // group should not be preempting because this is 
Filtering phase - if g := h.affinityGroups[pss[pod.UID].AffinityGroup.Name]; g != nil { - t.Errorf("Group %v should not exist but it does", g.name) + if g := h.podGroups[podSchedulingSpec[pod.UID].PodRootGroup.Name]; g != nil { + t.Errorf("Group %v should not exist but it does", podSchedulingSpec[pod.UID].PodRootGroup.Name) } psr = h.Schedule(pod, nodes[:len(nodes)-1], internal.PreemptingPhase) // group should not be preempting because the placement is not fully within Preempting-phase suggested nodes - if g := h.affinityGroups[pss[pod.UID].AffinityGroup.Name]; g != nil { - t.Errorf("Group %v should not exist but it does", g.name) + if g := h.podGroups[podSchedulingSpec[pod.UID].PodRootGroup.Name]; g != nil { + t.Errorf("Group %v should not exist but it does", podSchedulingSpec[pod.UID].PodRootGroup.Name) } // this time group will be preempting psr = h.Schedule(pod, nodes, internal.PreemptingPhase) - if g := h.affinityGroups[pss[pod.UID].AffinityGroup.Name]; g == nil { + if g := h.podGroups[podSchedulingSpec[pod.UID].PodRootGroup.Name]; g == nil { t.Errorf("Group %v should be preempting but does not exist", - pss[pod.UID].AffinityGroup.Name) - } else if g.state != groupPreempting { - t.Errorf("Group %v should be in Preempting state but not", g.name) + podSchedulingSpec[pod.UID].PodRootGroup.Name) + } else if g.state != podGroupPreempting { + t.Errorf("Group %v should be in Preempting state but not", podSchedulingSpec[pod.UID].PodRootGroup.Name) } psr = h.Schedule(pod, nodes[:len(nodes)-1], internal.PreemptingPhase) // group should have been deleted because the placement is not within Preempting-phase suggested nodes - if g := h.affinityGroups[pss[pod.UID].AffinityGroup.Name]; g != nil { - t.Errorf("Group %v should have been deleted, but not", g.name) + if g := h.podGroups[podSchedulingSpec[pod.UID].PodRootGroup.Name]; g != nil { + t.Errorf("Group %v should have been deleted, but not", podSchedulingSpec[pod.UID].PodRootGroup.Name) } // test backtracking search for cell binding @@ -825,30 +1290,27 @@ func testSuggestedNodes(t *testing.T, configFilePath string) { } setHealthyNodes(h) pod = allPods["pod39"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, []string{"0.0.3.2", "0.0.3.3"}, internal.PreemptingPhase) compareSchedulingResult(t, pod, psr) h.AddAllocatedPod(internal.NewBindingPod(pod, psr.PodBindInfo)) pod = allPods["pod40"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, []string{"0.0.4.3"}, internal.PreemptingPhase) compareSchedulingResult(t, pod, psr) h.AddAllocatedPod(internal.NewBindingPod(pod, psr.PodBindInfo)) pod = allPods["pod41"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, []string{"0.0.3.2", "0.0.3.3", "0.0.4.3"}, internal.PreemptingPhase) // the pod tries to lazy preempt group27 and group28, but is reverted - if g := h.affinityGroups["group27"]; g == nil { - t.Errorf("Group %v should be allocated but does not exist", - pss[pod.UID].AffinityGroup.Name) - } else if g.state != groupAllocated { - t.Errorf("Group %v should be in Allocated state but not", g.name) - } - if g := h.affinityGroups["group28"]; g == nil { 
- t.Errorf("Group %v should be allocated but does not exist", - pss[pod.UID].AffinityGroup.Name) - } else if g.state != groupAllocated { - t.Errorf("Group %v should be in Allocated state but not", g.name) + lazyPreemptedGroupList := []string{"group27", "group28"} + for _, groupName := range lazyPreemptedGroupList { + if g := h.podGroups[groupName]; g == nil { + t.Errorf("Group %v should be allocated but does not exist", + podSchedulingSpec[pod.UID].PodRootGroup.Name) + } else if g.state != podGroupAllocated { + t.Errorf("Group %v should be in Allocated state but not", groupName) + } } } @@ -863,7 +1325,7 @@ func testStatefulPreemption(t *testing.T, configFilePath string) { var psr internal.PodScheduleResult for _, podName := range casesForStatefulPreemption { pod := allPods[podName] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, allNodes, internal.PreemptingPhase) compareSchedulingResult(t, pod, psr) if psr.PodBindInfo != nil { @@ -875,30 +1337,32 @@ func testStatefulPreemption(t *testing.T, configFilePath string) { h.DeleteAllocatedPod(allocatedPods[0]) } if podName == "pod35" { - p := &groupPhysicalPlacement{} - *p = h.affinityGroups[pss[pod.UID].AffinityGroup.Name].physicalLeafCellPlacement + p := &PodGroupPhysicalPlacement{} + *p = h.podGroups[podSchedulingSpec[pod.UID].PodRootGroup.Name].physicalPlacement h.DeleteUnallocatedPod(pod) // test correctness of preemption cancellation - for _, podPlacements := range *p { - for _, podLeafCells := range podPlacements { - for _, leafCell := range podLeafCells { - pLeafCell := leafCell.(*PhysicalCell) - if pLeafCell.GetState() == cellUsed { - if int32(pLeafCell.GetPriority()) != pss["pod34"].Priority { - t.Errorf("Cell %v's priority should be pod34's priority, but is %v", - pLeafCell.GetAddress(), pLeafCell.GetPriority()) - } - } else if pLeafCell.GetState() != cellFree { - t.Errorf("Cell %v should be in Free state, but is %v", - pLeafCell.GetAddress(), pLeafCell.GetState()) + // The physicalPlacement shouldn't have child groups, and p.podsPlacement will be the leaf cell. 
+ if len(p.childGroupsPlacement) != 0 { + t.Errorf("Group %v should not contain childGroupsPlacement but it does", podSchedulingSpec[pod.UID].PodRootGroup.Name) + } + for _, podLeafCells := range p.podsPlacement { + for _, leafCell := range podLeafCells { + pLeafCell := leafCell.(*PhysicalCell) + if pLeafCell.GetState() == cellUsed { + if int32(pLeafCell.GetPriority()) != podSchedulingSpec["pod34"].Priority { + t.Errorf("Cell %v's priority should be pod34's priority, but is %v", + pLeafCell.GetAddress(), pLeafCell.GetPriority()) } + } else if pLeafCell.GetState() != cellFree { + t.Errorf("Cell %v should be in Free state, but is %v", + pLeafCell.GetAddress(), pLeafCell.GetState()) } } } } if deletedGroups := deletedPreemptorGroups[podName]; deletedGroups != nil { for _, g := range deletedGroups { - if _, ok := h.affinityGroups[g]; ok { + if _, ok := h.podGroups[g]; ok { t.Errorf("Group %v is expected to be deleted in scheduler, but not", g) } } @@ -917,7 +1381,7 @@ func testBadNodes(t *testing.T, configFilePath string) { allocatedPods = []*core.Pod{} pod := allPods["pod42"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr := h.Schedule(pod, []string{"0.0.2.0"}, internal.PreemptingPhase) bindingPod := internal.NewBindingPod(pod, psr.PodBindInfo) h.AddAllocatedPod(bindingPod) @@ -936,7 +1400,7 @@ func testBadNodes(t *testing.T, configFilePath string) { } } pod = allPods["pod43"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, []string{"0.0.2.2"}, internal.PreemptingPhase) bindingPod = internal.NewBindingPod(pod, psr.PodBindInfo) h.AddAllocatedPod(bindingPod) @@ -1013,7 +1477,7 @@ func testSafeRelaxedBuddyAlloc(t *testing.T, configFilePath string) { allocatedPods = []*core.Pod{} pod := allPods["pod44"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr := h.Schedule(pod, []string{"0.0.3.2", "0.0.3.3", "0.0.4.2", "0.0.4.3"}, internal.PreemptingPhase) bindingPod := internal.NewBindingPod(pod, psr.PodBindInfo) h.AddAllocatedPod(bindingPod) @@ -1022,7 +1486,7 @@ func testSafeRelaxedBuddyAlloc(t *testing.T, configFilePath string) { h.setBadNode("0.0.3.3") pod = allPods["pod45"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, []string{"0.0.3.2", "0.0.3.3", "0.0.4.2", "0.0.4.3"}, internal.PreemptingPhase) if psr.PodBindInfo == nil { t.Errorf("Cannot split higher level cells when requested level cell is bad") @@ -1034,7 +1498,7 @@ func testSafeRelaxedBuddyAlloc(t *testing.T, configFilePath string) { h.setBadNode("0.0.4.3") pod = allPods["pod46"] - pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(pss[pod.UID]) + pod.Annotations[api.AnnotationKeyPodSchedulingSpec] = common.ToYaml(podSchedulingSpec[pod.UID]) psr = h.Schedule(pod, []string{"0.0.3.2", "0.0.3.3", "0.0.4.0", "0.0.4.1", "0.0.4.2", "0.0.4.3"}, internal.PreemptingPhase) compareSchedulingResult(t, pod, psr) } @@ -1083,9 +1547,9 @@ func testReconfiguration(t *testing.T, configFilePath string) { } for _, podName := range casesThatShouldBeLazyPreempted { 
pod := allPods[podName] - g := h.affinityGroups[pss[pod.UID].AffinityGroup.Name] - if g.virtualLeafCellPlacement != nil { - t.Errorf("Group %v is expected to be lazy preempted, but not", g.name) + g := h.podGroups[podSchedulingSpec[pod.UID].PodRootGroup.Name] + if !PodGroupPlacement(g.virtualPlacement).IsEmpty() { + t.Errorf("Group %v is expected to be lazy preempted, but not", podSchedulingSpec[pod.UID].PodRootGroup.Name) } } testDeletePods(t, h) diff --git a/pkg/algorithm/intra_vc_scheduler.go b/pkg/algorithm/intra_vc_scheduler.go index 0c91534..a7670b1 100644 --- a/pkg/algorithm/intra_vc_scheduler.go +++ b/pkg/algorithm/intra_vc_scheduler.go @@ -32,41 +32,42 @@ import ( // intraVCScheduler is an interface for scheduling pods inside a VC. // It stores two maps of ChainCellList, one for pinned cells, the other for non-pinned ones. -// It should be able to return a set of leaf cell placements in the VC for a scheduling request. +// It should be able to return a set of cell placements in the VC for a scheduling request. type intraVCScheduler interface { getNonPinnedFullCellList() map[CellChain]ChainCellList getNonPinnedPreassignedCells() map[CellChain]ChainCellList getPinnedCells() map[api.PinnedCellId]ChainCellList - // Schedule an affinity group inside a VC. We use topologyAwareScheduler by default. - schedule(schedulingRequest) (groupVirtualPlacement, string) + // Schedule a pod group inside a VC. We use topologyGuaranteeScheduler. + schedule(PodGroupSchedulingRequest) (PodGroupVirtualPlacement, string) } type defaultIntraVCScheduler struct { nonPinnedFullCellList map[CellChain]ChainCellList nonPinnedPreassignedCells map[CellChain]ChainCellList pinnedCells map[api.PinnedCellId]ChainCellList - // Currently we create a topologyAwareScheduler for each cluster view (each chain, each pinned cell). + // Currently we create a topologyGuaranteeScheduler for each cluster view (each chain, each pinned cell). // We plan to support multiple cluster views in one scheduler, and to support schedule pods // across different cluster views. - // TODO: Support an affinity group can relax to be allocated across multiple chains. - nonPinnedCellSchedulers map[CellChain]*topologyAwareScheduler - pinnedCellSchedulers map[api.PinnedCellId]*topologyAwareScheduler + // TODO: Support a pod group can relax to be allocated across multiple chains. 
+ nonPinnedCellSchedulers map[CellChain]*topologyGuaranteeScheduler + pinnedCellSchedulers map[api.PinnedCellId]*topologyGuaranteeScheduler } func newDefaultIntraVCScheduler( nonPinnedFullList map[CellChain]ChainCellList, nonPinnedFreeList map[CellChain]ChainCellList, pinnedList map[api.PinnedCellId]ChainCellList, - leafCellNums map[CellChain]map[CellLevel]int32) *defaultIntraVCScheduler { + leafCellNums map[CellChain]map[CellLevel]int32, + cellLevels map[CellChain]map[api.CellType]CellLevel) *defaultIntraVCScheduler { - snr := map[CellChain]*topologyAwareScheduler{} - sr := map[api.PinnedCellId]*topologyAwareScheduler{} + snr := map[CellChain]*topologyGuaranteeScheduler{} + sr := map[api.PinnedCellId]*topologyGuaranteeScheduler{} for chain, ccl := range nonPinnedFullList { - snr[chain] = NewTopologyAwareScheduler(ccl, leafCellNums[chain], true) + snr[chain] = NewTopologyGuaranteeScheduler(ccl, leafCellNums[chain], cellLevels[chain], true) } for pid, ccl := range pinnedList { - sr[pid] = NewTopologyAwareScheduler(ccl, leafCellNums[ccl[CellLevel(1)][0].GetChain()], true) + sr[pid] = NewTopologyGuaranteeScheduler(ccl, leafCellNums[ccl[CellLevel(1)][0].GetChain()], cellLevels[ccl[CellLevel(1)][0].GetChain()], true) } return &defaultIntraVCScheduler{ nonPinnedFullCellList: nonPinnedFullList, @@ -90,28 +91,29 @@ func (s *defaultIntraVCScheduler) getPinnedCells() map[api.PinnedCellId]ChainCel } func (s *defaultIntraVCScheduler) schedule( - sr schedulingRequest) ( - placement groupVirtualPlacement, + podGroupSchedRequest PodGroupSchedulingRequest) ( + virtualPlacement PodGroupVirtualPlacement, failedReason string) { - scheduler := s.nonPinnedCellSchedulers[sr.chain] - str := fmt.Sprintf("chain %v", sr.chain) - if sr.pinnedCellId != "" { - scheduler = s.pinnedCellSchedulers[sr.pinnedCellId] - str = fmt.Sprintf("pinned cell %v", sr.pinnedCellId) + scheduler := s.nonPinnedCellSchedulers[podGroupSchedRequest.chain] + str := fmt.Sprintf("chain %v", podGroupSchedRequest.chain) + if podGroupSchedRequest.pinnedCellId != "" { + scheduler = s.pinnedCellSchedulers[podGroupSchedRequest.pinnedCellId] + str = fmt.Sprintf("pinned cell %v", podGroupSchedRequest.pinnedCellId) } - klog.Infof("Processing scheduling request in VC %v: %v, leaf cell numbers %v, priority %v", - sr.vc, str, common.ToJson(sr.affinityGroupPodNums), sr.priority) + klog.Infof("Processing scheduling request in VC %v: %v, pod group %v, priority %v", + podGroupSchedRequest.vc, str, common.ToJson(podGroupSchedRequest.podRootGroup), podGroupSchedRequest.priority) if scheduler != nil { + var placement PodGroupPlacement placement, failedReason = scheduler.Schedule( - sr.affinityGroupPodNums, - sr.priority, - sr.suggestedNodes, - sr.ignoreSuggestedNodes) + &podGroupSchedRequest.podRootGroup, + podGroupSchedRequest.priority, + ) + virtualPlacement = PodGroupVirtualPlacement(placement) } - if placement == nil { - return nil, fmt.Sprintf("%v when scheduling in VC %v", failedReason, sr.vc) + if PodGroupPlacement(virtualPlacement).IsEmpty() { + return PodGroupVirtualPlacement{}, fmt.Sprintf("%v when scheduling in VC %v", failedReason, podGroupSchedRequest.vc) } - klog.Infof("Found placement in VC %v: %v", sr.vc, placement) - return placement, "" + klog.Infof("Found placement in VC %v: %v", podGroupSchedRequest.vc, virtualPlacement) + return virtualPlacement, "" } diff --git a/pkg/algorithm/topology_aware_scheduler.go b/pkg/algorithm/topology_aware_scheduler.go index 9d4392f..faf00e4 100644 --- a/pkg/algorithm/topology_aware_scheduler.go +++ 
b/pkg/algorithm/topology_aware_scheduler.go @@ -20,6 +20,8 @@ // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE // SOFTWARE +// +build ignore + package algorithm import ( @@ -440,27 +442,6 @@ func removePickedLeafCells(leafCells CellList, indices []int32) CellList { return leafCells[:len(leafCells)-len(indices)] } -// findLCA finds the lowest common ancestor of two cells (nil if they have no LCA). -func findLCA(lower Cell, higher Cell) Cell { - for lower.GetLevel() < higher.GetLevel() { - if lower.GetParent() == nil { - return nil - } - lower = lower.GetParent() - } - if CellEqual(lower, higher) { - return lower - } - for !CellEqual(lower.GetParent(), higher.GetParent()) { - if lower.GetParent() == nil || higher.GetParent() == nil { - return nil - } - lower = lower.GetParent() - higher = higher.GetParent() - } - return lower.GetParent() -} - // getLeafCellsFromNode collects free leaf cells and preemptible leaf cells according to the priority. func getLeafCellsFromNode(c Cell, p CellPriority, freeLeafCells CellList, preemptibleLeafCells CellList) (CellList, CellList) { if c.GetLevel() > 1 { diff --git a/pkg/algorithm/topology_guarantee_scheduler.go b/pkg/algorithm/topology_guarantee_scheduler.go new file mode 100644 index 0000000..418ccd5 --- /dev/null +++ b/pkg/algorithm/topology_guarantee_scheduler.go @@ -0,0 +1,578 @@ +// MIT License +// +// Copyright (c) Microsoft Corporation. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+// SOFTWARE
+
+package algorithm
+
+import (
+	"fmt"
+	"sort"
+
+	"github.com/microsoft/hivedscheduler/pkg/api"
+	apiv2 "github.com/microsoft/hivedscheduler/pkg/api/v2"
+)
+
+// skuCell type for selected level cell in virtual cluster view
+type skuCell struct {
+	cell                          Cell            // within cell, whose level may be higher or lower than node level
+	freeLeafCellNumAtPriority     int32           // free leaf cell number at the priority of the pod to be scheduled (lower priority considered as free)
+	usedLeafCellNumAtPriority     int32           // used leaf cell number at the priority of the pod to be scheduled
+	usedLeafCellNumHigherPriority int32           // used leaf cell number by priorities higher than that of the pod to be scheduled
+	healthy                       bool            // if the cell is healthy
+	address                       api.CellAddress // used for logging the cell address when bad or not suggested
+}
+
+// virtual cluster view
+type skuClusterView []*skuCell
+
+// topologyGuaranteeScheduler can schedule a group of pods with a guaranteed affinity requirement (withinOneCell, e.g., within one rack),
+// and each pod can specify arbitrary cell types (e.g., non-leaf cell).
+// It first tries to place the pod group without preemption, and enables preemption only if that attempt fails.
+// For each try, it first finds a within cell for each pod group, then finds cells for each pod with the best affinity.
+type topologyGuaranteeScheduler struct {
+	// cell list for each level in a chain.
+	chainCellList ChainCellList
+	// leaf cell number at each level in the cell hierarchy. we use this to
+	// calculate the optimal affinity for a given leaf cell number.
+	levelLeafCellNum map[CellLevel]int32
+	// cell type to cell level in a chain.
+	cellLevels map[api.CellType]CellLevel
+	// pack pods across different priorities, or inside each priority. the former is for intra-VC scheduling,
+	// because high-priority can avoid preemption in the whole cluster view,
+	// and hence we can pack pods with different priorities.
+	// the latter is for opportunistic pod scheduling (stay away from guaranteed pods),
+	// because guaranteed pods can avoid preempting opportunistic pods only among buddy cells (this is decided
+	// by the buddy cell allocation algorithm).
+ crossPriorityPack bool +} + +// NewTopologyGuaranteeScheduler initializes the scheduler +func NewTopologyGuaranteeScheduler( + chainCellList ChainCellList, + levelLeafCellNum map[CellLevel]int32, + cellLevels map[api.CellType]CellLevel, + crossPriorityPack bool) *topologyGuaranteeScheduler { + + return &topologyGuaranteeScheduler{ + chainCellList: chainCellList, + levelLeafCellNum: levelLeafCellNum, + cellLevels: cellLevels, + crossPriorityPack: crossPriorityPack, + } +} + +func (s *topologyGuaranteeScheduler) Schedule( + podRootGroup *apiv2.PodGroupSpec, + priority CellPriority) ( + placement PodGroupPlacement, + failedReason string) { + + // sort pods in descending order by counting leaf cell number + s.sortPodGroup(podRootGroup) + + // disable preemption first to reduce preemption, try to schedule + placement, failedReason = s.findCellsForPodGroup(podRootGroup, opportunisticPriority, nil, &placement) + + // enable preemption if scheduling failed + if failedReason != "" && priority > opportunisticPriority { + placement, failedReason = s.findCellsForPodGroup(podRootGroup, priority, nil, &placement) + } + + // convert cells to leaf cells in placement + if failedReason == "" { + for iter := placement.Iterator(); iter.HasNext(); { + cells, leafCells := iter.Next(), CellList{} + for _, c := range *cells { + currLevelCells := CellList{c} + for currLevelCells[0].GetLevel() > CellLevel(1) { + childLevelCells := CellList{} + for _, cc := range currLevelCells { + childLevelCells = append(childLevelCells, cc.GetChildren()...) + } + currLevelCells = childLevelCells + } + leafCells = append(leafCells, currLevelCells...) + } + *cells = leafCells + } + } + + return placement, failedReason +} + +func (s *topologyGuaranteeScheduler) sortPodGroup(podGroup *apiv2.PodGroupSpec) { + sort.SliceStable(podGroup.Pods, func(i, j int) bool { + return s.countLeafCellNums(podGroup.Pods[i]) > s.countLeafCellNums(podGroup.Pods[j]) + }) + sortedPods := []apiv2.PodGroupMemberSpec{} + for _, p := range podGroup.Pods { + for i := int32(0); i < p.PodMinNumber; i++ { + sortedPods = append(sortedPods, p) + } + } + podGroup.Pods = sortedPods + + sort.SliceStable(podGroup.ChildGroups, func(i, j int) bool { + return s.countLeafCellNums(podGroup.ChildGroups[i]) > s.countLeafCellNums(podGroup.ChildGroups[j]) + }) + for _, g := range podGroup.ChildGroups { + s.sortPodGroup(g) + } +} + +func (s *topologyGuaranteeScheduler) countLeafCellNums(x interface{}) int32 { + count := int32(0) + switch p := x.(type) { + case apiv2.PodGroupMemberSpec: + count = s.levelLeafCellNum[s.cellLevels[p.CellsPerPod.CellType]] * p.CellsPerPod.CellNumber + case []apiv2.PodGroupMemberSpec: + for _, pp := range p { + count += s.countLeafCellNums(pp) + } + case *apiv2.PodGroupSpec: + count += s.countLeafCellNums(p.Pods) + s.countLeafCellNums(p.ChildGroups) + case []*apiv2.PodGroupSpec: + for _, pp := range p { + count += s.countLeafCellNums(pp) + } + } + return count +} + +func (s *topologyGuaranteeScheduler) findCellsForPodGroup( + podGroup *apiv2.PodGroupSpec, + priority CellPriority, + within *skuCell, + allocated *PodGroupPlacement) ( + placement PodGroupPlacement, + failedReason string) { + + placement, failedReason = PodGroupPlacement{}, "no matched cells in vc" + if _, ok := s.cellLevels[podGroup.WithinOneCell]; !ok && podGroup.WithinOneCell != "" { + return placement, fmt.Sprintf( + "%v, unknown withinOneCell %v", failedReason, podGroup.WithinOneCell) + } + + cv := skuClusterView{nil} + if level, ok := s.cellLevels[podGroup.WithinOneCell]; ok { + 
cv = s.createSkuClusterView(within, level, priority) + } else if within != nil { + cv = s.createSkuClusterView(within, within.cell.GetLevel(), priority) + } + + for _, withinCell := range cv { + if len(podGroup.Pods) > 0 && withinCell != nil && !withinCell.healthy { + return PodGroupPlacement{}, fmt.Sprintf( + "have to use at least one bad cell %v", withinCell.address) + } + placement.podsPlacement, failedReason = s.findCellsForPods(podGroup.Pods, priority, withinCell, allocated) + if failedReason == "" { + for _, childGroup := range podGroup.ChildGroups { + childPodsPlacement, childFailedReason := s.findCellsForPodGroup(childGroup, priority, withinCell, &placement) + if childFailedReason != "" { + placement.childGroupsPlacement, failedReason = nil, childFailedReason + break + } + if placement.childGroupsPlacement == nil { + placement.childGroupsPlacement = []*PodGroupPlacement{} + } + placement.childGroupsPlacement = append(placement.childGroupsPlacement, &childPodsPlacement) + } + if failedReason == "" { + break + } + } + } + return placement, failedReason +} + +func (s *topologyGuaranteeScheduler) findCellsForPods( + pods []apiv2.PodGroupMemberSpec, + priority CellPriority, + within *skuCell, + allocated *PodGroupPlacement) ( + placement []CellList, + failedReason string) { + + placement, failedReason = []CellList{}, "" + if pods == nil || len(pods) == 0 { + return placement, failedReason + } + + allocatedCells := CellList{} + for iter := allocated.Iterator(); iter.HasNext(); { + for _, c := range *iter.Next() { + allocatedCells = append(allocatedCells, c) + } + } + + cv := skuClusterView{within} + nodeLevel := s.getNodeLevel() + if within == nil || within.cell.GetLevel() > nodeLevel { + cv = s.createSkuClusterView(within, nodeLevel, priority) + } + + withinCellIndex, podIndex := 0, 0 + for podIndex < len(pods) { + if withinCellIndex >= len(cv) { + return nil, "insufficient capacity" + } + withinCell := cv[withinCellIndex] + if !withinCell.healthy { + return nil, fmt.Sprintf( + "have to use at least one bad cell %v", withinCell.address) + } + podPlacement := s.findCellsForSinglePod(pods[podIndex], priority, withinCell, allocatedCells) + if podPlacement == nil { + withinCellIndex++ + } else { + placement = append(placement, podPlacement) + allocatedCells = append(allocatedCells, podPlacement...) + podIndex++ + } + } + + return placement, failedReason +} + +// findCellsForSinglePod finds a set of cells with the best affinity in a node for a pod in best effort. +func (s *topologyGuaranteeScheduler) findCellsForSinglePod( + pod apiv2.PodGroupMemberSpec, + priority CellPriority, + withinCell *skuCell, + allocatedCells CellList) CellList { + + currLevel := s.cellLevels[pod.CellsPerPod.CellType] + availableCells, preemptibleCells := CellList{}, CellList{} + availableCells, preemptibleCells = getFreeCellsAtLevel( + withinCell.cell, currLevel, priority, allocatedCells, availableCells, preemptibleCells) + // free leaf cells will be used first (before preemptible leaf cells) + availableCells = append(availableCells, preemptibleCells...) 
+ if pod.CellsPerPod.CellNumber > int32(len(availableCells)) { + return nil + } + + var freeCell Cell + freeCellIndex, searchCellIndex := int32(0), int32(0) + // indices of the currently picked cells + currentCellIndices := make([]int32, pod.CellsPerPod.CellNumber) + // affinity of the currently picked cells, defined as the lowest common ancestor + // of the leaf cells in the cell hierarchy (lower level means better affinity) + currentAffinity := make(CellList, pod.CellsPerPod.CellNumber) + // cells with the best affinity ever seen + bestAffinityCells := make(CellList, pod.CellsPerPod.CellNumber) + // indices of the cells with the best affinity ever seen + bestAffinityCellIndices := make([]int32, pod.CellsPerPod.CellNumber) + // the best affinity ever seen (i.e., lowest level of lowest common ancestor of a set of cells) + bestAffinity := highestLevel + // the optimal affinity for the cell number, i.e., the lowest possible of the lowest common ancestor of cells + optimalAffinity := CellLevel(1) + for l := CellLevel(currLevel); l <= CellLevel(len(s.levelLeafCellNum)); l++ { + if s.levelLeafCellNum[l] >= s.levelLeafCellNum[currLevel]*pod.CellsPerPod.CellNumber { + optimalAffinity = l + break + } + } + + for { + for freeCellIndex < int32(len(availableCells)) { + freeCell = availableCells[freeCellIndex] + currentCellIndices[searchCellIndex] = freeCellIndex + if searchCellIndex == 0 { + currentAffinity[searchCellIndex] = freeCell + } else { + currentAffinity[searchCellIndex] = findLCA(freeCell, currentAffinity[searchCellIndex-1]) + // pruning: if the current LCA has been higher than the lowest ever, + // the node will be skipped + if (currentAffinity[searchCellIndex] == nil && bestAffinity < highestLevel) || + (currentAffinity[searchCellIndex] != nil && currentAffinity[searchCellIndex].GetLevel() > bestAffinity) { + freeCellIndex++ + continue + } + } + if searchCellIndex == pod.CellsPerPod.CellNumber-1 { + foundOptimalAffinity := false + bestAffinity, foundOptimalAffinity = checkOptimalAffinityForCells( + currentAffinity[len(currentAffinity)-1].GetLevel(), + availableCells, + currentCellIndices, + bestAffinity, + bestAffinityCells, + bestAffinityCellIndices, + optimalAffinity) + if foundOptimalAffinity { + // early stop: return if the solution is optimal (i.e., all buddies) + return bestAffinityCells + } + } else { + searchCellIndex++ + } + freeCellIndex++ + } + searchCellIndex-- + if searchCellIndex < 0 { + if bestAffinity == highestLevel { + // Unreachable + panic(fmt.Sprintf("Assert Failure: failed to allocate %v cells in cell %v", pod.CellsPerPod.CellNumber, withinCell.address)) + } + return bestAffinityCells + } + freeCellIndex = currentCellIndices[searchCellIndex] + 1 + } +} + +func (s *topologyGuaranteeScheduler) getNodeLevel() CellLevel { + for l := CellLevel(1); l <= CellLevel(len(s.chainCellList)); l++ { + if s.chainCellList[l][0].AtOrHigherThanNode() { + return l + } + } + return -1 +} + +// getFreeCellsAtLevel collects free cells and preemptible cells at given level according to the priority. +// Sort cells when splitting so that cells need higher level split can be used later. 
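As a rough illustration of the classification rule described above (not part of this patch), the sketch below treats priorities as plain integers and assumes 0 as the free-priority sentinel: cells at free priority are collected as available, and cells held at a priority lower than the incoming pod's are collected as preemptible. The overlap check against already-allocated cells that the real function performs is omitted here.

package main

import "fmt"

const freePriority = 0 // assumed sentinel for an unused cell in this sketch

// classify returns the indices of available and preemptible sibling cells.
func classify(cellPriorities []int, podPriority int) (available, preemptible []int) {
	for i, p := range cellPriorities {
		switch {
		case p == freePriority:
			available = append(available, i)
		case p < podPriority:
			preemptible = append(preemptible, i)
		}
	}
	return available, preemptible
}

func main() {
	// Four sibling cells: two free, one held at priority 1, one at priority 5.
	avail, preempt := classify([]int{0, 1, 0, 5}, 3)
	fmt.Println(avail, preempt) // [0 2] [1]
}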
+func getFreeCellsAtLevel( + cell Cell, + level CellLevel, + priority CellPriority, + allocatedCells CellList, + availableCells CellList, + preemptibleCells CellList) ( + CellList, CellList) { + + if cell.GetLevel() > level { + cellChildren := cell.GetChildren() + usedCellNums := make([]int32, len(cellChildren)) + for i, c := range cellChildren { + usedCellNums[i] = 0 + for p, num := range c.GetUsedLeafCellNumAtPriorities() { + if p >= priority { + usedCellNums[i] += num + } + } + } + sort.SliceStable(cellChildren, func(i, j int) bool { + return usedCellNums[i] > usedCellNums[j] + }) + for _, c := range cellChildren { + availableCells, preemptibleCells = getFreeCellsAtLevel( + c, level, priority, allocatedCells, availableCells, preemptibleCells) + } + } else if cell.GetLevel() == level { + isAllocated := false + for _, c := range allocatedCells { + if isAncestor(cell, c) || isAncestor(c, cell) { + isAllocated = true + break + } + } + if !isAllocated { + if cell.GetPriority() == freePriority { + availableCells = append(availableCells, cell) + } else if cell.GetPriority() < priority { + preemptibleCells = append(preemptibleCells, cell) + } + } + } + return availableCells, preemptibleCells +} + +// checkOptimalAffinityForCells checks if the currently picked cells have the lowest LCA. +// It also checks if the solution is optimal (if the leaf cells are all buddies). +func checkOptimalAffinityForCells( + affinity CellLevel, + availableCells CellList, + currentCellIndices []int32, + bestAffinity CellLevel, + bestAffinityCells CellList, + bestAffinityCellIndices []int32, + optimalAffinity CellLevel) (CellLevel, bool) { + + if affinity < bestAffinity { + copy(bestAffinityCellIndices, currentCellIndices) + for i := 0; i < len(currentCellIndices); i++ { + bestAffinityCells[i] = availableCells[currentCellIndices[i]] + } + if affinity == optimalAffinity { + return affinity, true + } else { + return affinity, false + } + } + return bestAffinity, false +} + +// findLCA finds the lowest common ancestor of two cells (nil if they have no LCA). +func findLCA(lower Cell, higher Cell) Cell { + for lower.GetLevel() < higher.GetLevel() { + if lower.GetParent() == nil { + return nil + } + lower = lower.GetParent() + } + if CellEqual(lower, higher) { + return lower + } + for !CellEqual(lower.GetParent(), higher.GetParent()) { + if lower.GetParent() == nil || higher.GetParent() == nil { + return nil + } + lower = lower.GetParent() + higher = higher.GetParent() + } + return lower.GetParent() +} + +// createSkuClusterView returns list of sku cells within +// the given cell, level and priority in virtual cluster view. 
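The returned view is sorted with sort.Stable using the Less method further below. As a self-contained illustration of that preference order (healthy first, lower level first, more usage at the same priority first, less usage at higher priorities first, then lower address), the sketch below copies the relevant skuCell fields into a plain struct; the final tiebreak on the raw cell address is omitted, and this is an example rather than scheduler code.

package main

import (
	"fmt"
	"sort"
)

type viewEntry struct {
	healthy                       bool
	level                         int
	usedLeafCellNumAtPriority     int32
	usedLeafCellNumHigherPriority int32
	address                       string
}

// less mirrors the ordering of skuClusterView.Less in simplified form.
func less(a, b viewEntry) bool {
	if a.healthy != b.healthy {
		return a.healthy
	}
	if a.level != b.level {
		return a.level < b.level
	}
	if a.usedLeafCellNumAtPriority != b.usedLeafCellNumAtPriority {
		return a.usedLeafCellNumAtPriority > b.usedLeafCellNumAtPriority
	}
	if a.usedLeafCellNumHigherPriority != b.usedLeafCellNumHigherPriority {
		return a.usedLeafCellNumHigherPriority < b.usedLeafCellNumHigherPriority
	}
	return a.address < b.address
}

func main() {
	view := []viewEntry{
		{true, 3, 2, 0, "0.0.2"},
		{true, 3, 6, 0, "0.0.1"},  // more packed at this priority, preferred
		{false, 3, 8, 0, "0.0.0"}, // unhealthy, always last
	}
	sort.SliceStable(view, func(i, j int) bool { return less(view[i], view[j]) })
	fmt.Println(view[0].address, view[1].address, view[2].address) // 0.0.1 0.0.2 0.0.0
}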
+func (s *topologyGuaranteeScheduler) createSkuClusterView(
+	withinCell *skuCell,
+	withinLevel CellLevel,
+	priority CellPriority) skuClusterView {
+
+	cv := skuClusterView{}
+	for l := withinLevel; l >= CellLevel(1); l-- {
+		for _, c := range s.chainCellList[l] {
+			if (withinCell != nil && !isAncestor(withinCell.cell, c)) ||
+				cv.contains(ancestorNoLowerThanLevel(withinLevel, c)) {
+				continue
+			}
+			cell := &skuCell{
+				cell:                          c,
+				freeLeafCellNumAtPriority:     c.GetTotalLeafCellNum(),
+				usedLeafCellNumAtPriority:     0,
+				usedLeafCellNumHigherPriority: 0,
+				healthy:                       true,
+				address:                       "",
+			}
+			for p, num := range c.GetUsedLeafCellNumAtPriorities() {
+				if p >= priority {
+					cell.freeLeafCellNumAtPriority -= num
+				}
+				if s.crossPriorityPack {
+					cell.usedLeafCellNumAtPriority += num
+				} else {
+					if p == priority {
+						cell.usedLeafCellNumAtPriority += num
+					}
+					if p > priority {
+						cell.usedLeafCellNumHigherPriority += num
+					}
+				}
+			}
+			switch v := c.(type) {
+			case *PhysicalCell:
+				cell.healthy = v.IsHealthy()
+				cell.address = c.GetAddress()
+			case *VirtualCell:
+				if pn := v.GetPhysicalCell(); pn != nil {
+					cell.healthy = pn.IsHealthy()
+					cell.address = pn.GetAddress()
+				}
+			}
+			cv = append(cv, cell)
+		}
+	}
+	sort.Stable(cv)
+	return cv
+}
+
+// Len method for sorting sku cells in cluster view.
+func (cv skuClusterView) Len() int {
+	return len(cv)
+}
+
+// Less method for sorting sku cells in cluster view.
+// It sorts in the following order:
+// 1. cell health (prefer healthy)
+// 2. cell level (prefer lower)
+// 3. usedLeafCellNumAtPriority (prefer higher)
+// 4. usedLeafCellNumHigherPriority (prefer lower)
+// 5. cell physical/virtual address (prefer lower)
+//
+// When crossPriorityPack is not enabled, we count the cell numbers used by the current
+// priority (usedLeafCellNumAtPriority), and the higher priorities (usedLeafCellNumHigherPriority), respectively.
+// When sorting the sku cells, cells with higher usedLeafCellNumAtPriority and lower usedLeafCellNumHigherPriority
+// will be preferred (i.e., pack pods inside the same priority, and stay away from higher priorities).
+// Note that in this case, the sku cells may NOT be ordered in terms of total used leaf cell number,
+// which may result in feasible pod placements not being found.
+//
+// Otherwise, usedLeafCellNumAtPriority is set to the total used leaf cell number,
+// so that nodes with more used leaf cells will be preferred (i.e., pack pods globally across priorities).
+// In this case a feasible pod placement is guaranteed to be found (as long as all nodes are in suggested nodes).
+func (cv skuClusterView) Less(i, j int) bool {
+	if cv[i].healthy != cv[j].healthy {
+		return cv[i].healthy
+	}
+	if cv[i].cell.GetLevel() != cv[j].cell.GetLevel() {
+		return cv[i].cell.GetLevel() < cv[j].cell.GetLevel()
+	}
+	if cv[i].usedLeafCellNumAtPriority != cv[j].usedLeafCellNumAtPriority {
+		return cv[i].usedLeafCellNumAtPriority > cv[j].usedLeafCellNumAtPriority
+	}
+	if cv[i].usedLeafCellNumHigherPriority != cv[j].usedLeafCellNumHigherPriority {
+		return cv[i].usedLeafCellNumHigherPriority < cv[j].usedLeafCellNumHigherPriority
+	}
+	if cv[i].address != cv[j].address {
+		return cv[i].address < cv[j].address
+	}
+	if cv[i].cell.GetAddress() != cv[j].cell.GetAddress() {
+		return cv[i].cell.GetAddress() < cv[j].cell.GetAddress()
+	}
+	// equal entries compare as not-less, so Less satisfies the sort.Interface contract
+	return false
+}
+
+// Swap method for sorting sku cells in cluster view.
+func (cv skuClusterView) Swap(i int, j int) { + cv[i], cv[j] = cv[j], cv[i] +} + +func (cv skuClusterView) contains(cell Cell) bool { + for _, withinCell := range cv { + if CellEqual(cell, withinCell.cell) { + return true + } + } + return false +} + +// ancestorNoLowerThanLevel returns the ancestor of the given cell +// and its level is no lower than given cell's level. +func ancestorNoLowerThanLevel(withinLevel CellLevel, cell Cell) Cell { + if cell.GetLevel() >= withinLevel || cell.GetParent() == nil { + return cell + } else { + return ancestorNoLowerThanLevel(withinLevel, cell.GetParent()) + } +} + +// isAncestor determines whether the given ancestor +// is the ancestor of the given cell. +func isAncestor(ancestor Cell, cell Cell) bool { + if CellEqual(ancestor, cell) { + return true + } + if cell.GetLevel() >= ancestor.GetLevel() || cell.GetParent() == nil { + return false + } + return isAncestor(ancestor, cell.GetParent()) +} diff --git a/pkg/algorithm/types.go b/pkg/algorithm/types.go index 8d769aa..a38ae53 100644 --- a/pkg/algorithm/types.go +++ b/pkg/algorithm/types.go @@ -25,11 +25,6 @@ package algorithm import ( "fmt" "strings" - - "github.com/microsoft/hivedscheduler/pkg/api" - "github.com/microsoft/hivedscheduler/pkg/common" - core "k8s.io/api/core/v1" - "k8s.io/apimachinery/pkg/types" ) type ( @@ -40,17 +35,6 @@ type ( AffinityGroupState string ) -type schedulingRequest struct { - vc api.VirtualClusterName - pinnedCellId api.PinnedCellId - chain CellChain - affinityGroupName string - affinityGroupPodNums map[int32]int32 // leaf cell number -> pod number - priority CellPriority - suggestedNodes common.Set - ignoreSuggestedNodes bool -} - // CellList is a list of cells at a certain level of a chain. type CellList []Cell @@ -129,216 +113,6 @@ func (ccl ChainCellList) shallowCopy() ChainCellList { return copied } -// AlgoAffinityGroup is the algorithm-internal representation of an affinity group. -type AlgoAffinityGroup struct { - name string - vc api.VirtualClusterName - lazyPreemptionEnable bool - // Whether we should ignore K8s suggested nodes. If false, we will avoid binding cells to non-suggested nodes. - // Note that we always avoid using bad nodes; avoiding non-suggested nodes is optional and best-effort. 
- ignoreK8sSuggestedNodes bool - priority int32 - totalPodNums map[int32]int32 // LeafCellNum -> PodNum - allocatedPods map[int32][]*core.Pod // LeafCellNum -> a list of allocated pods - preemptingPods map[types.UID]*core.Pod - physicalLeafCellPlacement groupPhysicalPlacement - virtualLeafCellPlacement groupVirtualPlacement - state AffinityGroupState - lazyPreemptionStatus *api.LazyPreemptionStatus -} - -func newAlgoAffinityGroup( - g *api.AffinityGroupSpec, - vc api.VirtualClusterName, - lazyPreemptionEnable bool, - priority int32, - state AffinityGroupState) *AlgoAffinityGroup { - - podNums := make(map[int32]int32) - for _, m := range g.Members { - podNums[m.LeafCellNumber] += m.PodNumber - } - group := &AlgoAffinityGroup{ - name: g.Name, - vc: vc, - lazyPreemptionEnable: lazyPreemptionEnable, - priority: priority, - totalPodNums: podNums, - allocatedPods: map[int32][]*core.Pod{}, - physicalLeafCellPlacement: groupPhysicalPlacement{}, - virtualLeafCellPlacement: groupVirtualPlacement{}, - state: state, - } - if state == groupPreempting { - group.preemptingPods = map[types.UID]*core.Pod{} - } - for leafCellNum, podNum := range podNums { - group.physicalLeafCellPlacement[leafCellNum] = make([]CellList, podNum) - group.virtualLeafCellPlacement[leafCellNum] = make([]CellList, podNum) - group.allocatedPods[leafCellNum] = make([]*core.Pod, podNum) - for i := int32(0); i < podNum; i++ { - group.physicalLeafCellPlacement[leafCellNum][i] = make(CellList, leafCellNum) - group.virtualLeafCellPlacement[leafCellNum][i] = make(CellList, leafCellNum) - } - } - return group -} - -func (aag *AlgoAffinityGroup) ToAffinityGroup() api.AffinityGroup { - ag := api.AffinityGroup{ - ObjectMeta: api.ObjectMeta{Name: aag.name}, - Status: api.AffinityGroupStatus{ - VC: aag.vc, - Priority: aag.priority, - State: api.AffinityGroupState(aag.state), - LazyPreemptionStatus: aag.lazyPreemptionStatus, - }, - } - if aag.physicalLeafCellPlacement != nil { - ag.Status.PhysicalPlacement = aag.physicalLeafCellPlacement.nodeToLeafCellIndices() - } - if aag.virtualLeafCellPlacement != nil { - ag.Status.VirtualPlacement = aag.virtualLeafCellPlacement.preassignedCellToLeafCells() - } - for _, pods := range aag.allocatedPods { - for _, p := range pods { - if p != nil { - ag.Status.AllocatedPods = append(ag.Status.AllocatedPods, p.UID) - } - } - } - for p := range aag.preemptingPods { - ag.Status.PreemptingPods = append(ag.Status.PreemptingPods, p) - } - return ag -} - -type groupPhysicalPlacement map[int32][]CellList // LeafCellNum -> a list of pods -> a list of physical leaf cells of each pod -type groupVirtualPlacement map[int32][]CellList // LeafCellNum -> a list of pods -> a list of virtual leaf cells of each pod - -func (p groupPhysicalPlacement) String() string { - return common.ToJson(p.nodeToLeafCellIndices()) -} - -func (p groupPhysicalPlacement) nodeToLeafCellIndices() map[string][]int32 { - nodeToLeafCellIndices := map[string][]int32{} - for _, podPlacements := range p { - for _, podPlacement := range podPlacements { - for _, leafCell := range podPlacement { - pLeafCell := leafCell.(*PhysicalCell) - nodes, leafCellIndices := pLeafCell.GetPhysicalPlacement() - if _, ok := nodeToLeafCellIndices[nodes[0]]; !ok { - nodeToLeafCellIndices[nodes[0]] = []int32{} - } - nodeToLeafCellIndices[nodes[0]] = append(nodeToLeafCellIndices[nodes[0]], leafCellIndices[0]) - } - } - } - return nodeToLeafCellIndices -} - -func (p groupVirtualPlacement) String() string { - return common.ToJson(p.preassignedCellToLeafCells()) -} - -func (p 
groupVirtualPlacement) preassignedCellToLeafCells() map[api.CellAddress][]api.CellAddress { - preassignedCellToLeafCells := map[api.CellAddress][]api.CellAddress{} - for _, podPlacements := range p { - for _, podPlacement := range podPlacements { - for _, leafCell := range podPlacement { - vLeafCell := leafCell.(*VirtualCell) - address := vLeafCell.GetAddress() - preassignedAddress := vLeafCell.GetPreassignedCell().GetAddress() - if _, ok := preassignedCellToLeafCells[preassignedAddress]; !ok { - preassignedCellToLeafCells[preassignedAddress] = []api.CellAddress{} - } - preassignedCellToLeafCells[preassignedAddress] = append( - preassignedCellToLeafCells[preassignedAddress], address) - } - } - } - return preassignedCellToLeafCells -} - -func (p groupVirtualPlacement) toPhysicalPlacement( - bindings map[api.CellAddress]*PhysicalCell, - leafCellNums []int32) groupPhysicalPlacement { - - physicalPlacement := groupPhysicalPlacement{} - for _, podLeafCellNum := range leafCellNums { - podPlacements := p[podLeafCellNum] - physicalPlacement[podLeafCellNum] = make([]CellList, len(podPlacements)) - for i, podPlacement := range podPlacements { - physicalPlacement[podLeafCellNum][i] = make(CellList, len(podPlacement)) - for j, leafCell := range podPlacement { - pLeafCell := bindings[leafCell.GetAddress()] - physicalPlacement[podLeafCellNum][i][j] = pLeafCell - } - } - } - return physicalPlacement -} - -// A binding path is a tree consisting of all cells that should be bound for binding a set of -// lowest-level cells in a physical placement. It is generated by collecting all the unbound -// ancestors for these cells and group them in a tree. -func (p groupVirtualPlacement) toBindingPaths( - leafCellNums []int32, - bindings map[api.CellAddress]*PhysicalCell) ( - preassignedCells []*cellBindingPathVertex, - nonPreassignedCells [][]*cellBindingPathVertex) { - - allBindingPathVertices := map[api.CellAddress]*cellBindingPathVertex{} - for _, podLeafCellNum := range leafCellNums { - podPlacements := p[podLeafCellNum] - for _, podPlacement := range podPlacements { - for _, leafCell := range podPlacement { - if pLeafCell := leafCell.(*VirtualCell).GetPhysicalCell(); pLeafCell != nil { - bindings[leafCell.GetAddress()] = pLeafCell - continue - } - var bindingPath []*VirtualCell - for c := leafCell; c != nil; c = c.GetParent() { - vc := c.(*VirtualCell) - if vc.GetPhysicalCell() != nil || allBindingPathVertices[vc.GetAddress()] != nil { - break - } - bindingPath = append(bindingPath, vc) - } - pathRoot := bindingPath[len(bindingPath)-1] - n := &cellBindingPathVertex{cell: pathRoot} - allBindingPathVertices[pathRoot.GetAddress()] = n - if parent := pathRoot.GetParent(); parent == nil { - preassignedCells = append(preassignedCells, n) - } else if parent.(*VirtualCell).GetPhysicalCell() != nil { - buddyExist := false - for i := range nonPreassignedCells { - if CellEqual(parent, nonPreassignedCells[i][0].cell.GetParent()) { - buddyExist = true - nonPreassignedCells[i] = append(nonPreassignedCells[i], n) - break - } - } - if !buddyExist { - nonPreassignedCells = append(nonPreassignedCells, []*cellBindingPathVertex{n}) - } - } else { - parentNode := allBindingPathVertices[pathRoot.GetParent().GetAddress()] - parentNode.childrenToBind = append(parentNode.childrenToBind, n) - } - for i := len(bindingPath) - 2; i >= 0; i-- { - c := bindingPath[i] - n := &cellBindingPathVertex{cell: c} - parentNode := allBindingPathVertices[c.GetParent().GetAddress()] - parentNode.childrenToBind = append(parentNode.childrenToBind, n) - 
allBindingPathVertices[c.GetAddress()] = n - } - } - } - } - return preassignedCells, nonPreassignedCells -} - // cellBindingPathVertex is a single vertex in the tree of a cell binding path, // containing the vertices of its children to bind. type cellBindingPathVertex struct { diff --git a/pkg/algorithm/types_v1.go b/pkg/algorithm/types_v1.go new file mode 100644 index 0000000..f03139d --- /dev/null +++ b/pkg/algorithm/types_v1.go @@ -0,0 +1,253 @@ +// MIT License +// +// Copyright (c) Microsoft Corporation. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE + +// +build ignore + +package algorithm + +import ( + "github.com/microsoft/hivedscheduler/pkg/api" + "github.com/microsoft/hivedscheduler/pkg/common" + core "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/types" +) + +type schedulingRequest struct { + vc api.VirtualClusterName + pinnedCellId api.PinnedCellId + chain CellChain + affinityGroupName string + affinityGroupPodNums map[int32]int32 // leaf cell number -> pod number + priority CellPriority + suggestedNodes common.Set + ignoreSuggestedNodes bool +} + +// AlgoAffinityGroup is the algorithm-internal representation of an affinity group. +type AlgoAffinityGroup struct { + name string + vc api.VirtualClusterName + lazyPreemptionEnable bool + // Whether we should ignore K8s suggested nodes. If false, we will avoid binding cells to non-suggested nodes. + // Note that we always avoid using bad nodes; avoiding non-suggested nodes is optional and best-effort. 
+ ignoreK8sSuggestedNodes bool + priority int32 + totalPodNums map[int32]int32 // LeafCellNum -> PodNum + allocatedPods map[int32][]*core.Pod // LeafCellNum -> a list of allocated pods + preemptingPods map[types.UID]*core.Pod + physicalLeafCellPlacement groupPhysicalPlacement + virtualLeafCellPlacement groupVirtualPlacement + state AffinityGroupState + lazyPreemptionStatus *api.LazyPreemptionStatus +} + +func newAlgoAffinityGroup( + g *api.AffinityGroupSpec, + vc api.VirtualClusterName, + lazyPreemptionEnable bool, + priority int32, + state AffinityGroupState) *AlgoAffinityGroup { + + podNums := make(map[int32]int32) + for _, m := range g.Members { + podNums[m.LeafCellNumber] += m.PodNumber + } + group := &AlgoAffinityGroup{ + name: g.Name, + vc: vc, + lazyPreemptionEnable: lazyPreemptionEnable, + priority: priority, + totalPodNums: podNums, + allocatedPods: map[int32][]*core.Pod{}, + physicalLeafCellPlacement: groupPhysicalPlacement{}, + virtualLeafCellPlacement: groupVirtualPlacement{}, + state: state, + } + if state == AffinityGroupState("Preempting") { + group.preemptingPods = map[types.UID]*core.Pod{} + } + for leafCellNum, podNum := range podNums { + group.physicalLeafCellPlacement[leafCellNum] = make([]CellList, podNum) + group.virtualLeafCellPlacement[leafCellNum] = make([]CellList, podNum) + group.allocatedPods[leafCellNum] = make([]*core.Pod, podNum) + for i := int32(0); i < podNum; i++ { + group.physicalLeafCellPlacement[leafCellNum][i] = make(CellList, leafCellNum) + group.virtualLeafCellPlacement[leafCellNum][i] = make(CellList, leafCellNum) + } + } + return group +} + +func (aag *AlgoAffinityGroup) ToAffinityGroup() api.AffinityGroup { + ag := api.AffinityGroup{ + ObjectMeta: api.ObjectMeta{Name: aag.name}, + Status: api.AffinityGroupStatus{ + VC: aag.vc, + Priority: aag.priority, + State: api.AffinityGroupState(aag.state), + LazyPreemptionStatus: aag.lazyPreemptionStatus, + }, + } + if aag.physicalLeafCellPlacement != nil { + ag.Status.PhysicalPlacement = aag.physicalLeafCellPlacement.nodeToLeafCellIndices() + } + if aag.virtualLeafCellPlacement != nil { + ag.Status.VirtualPlacement = aag.virtualLeafCellPlacement.preassignedCellToLeafCells() + } + for _, pods := range aag.allocatedPods { + for _, p := range pods { + if p != nil { + ag.Status.AllocatedPods = append(ag.Status.AllocatedPods, p.UID) + } + } + } + for p := range aag.preemptingPods { + ag.Status.PreemptingPods = append(ag.Status.PreemptingPods, p) + } + return ag +} + +type groupPhysicalPlacement map[int32][]CellList // LeafCellNum -> a list of pods -> a list of physical leaf cells of each pod +type groupVirtualPlacement map[int32][]CellList // LeafCellNum -> a list of pods -> a list of virtual leaf cells of each pod + +func (p groupPhysicalPlacement) String() string { + return common.ToJson(p.nodeToLeafCellIndices()) +} + +func (p groupPhysicalPlacement) nodeToLeafCellIndices() map[string][]int32 { + nodeToLeafCellIndices := map[string][]int32{} + for _, podPlacements := range p { + for _, podPlacement := range podPlacements { + for _, leafCell := range podPlacement { + pLeafCell := leafCell.(*PhysicalCell) + nodes, leafCellIndices := pLeafCell.GetPhysicalPlacement() + if _, ok := nodeToLeafCellIndices[nodes[0]]; !ok { + nodeToLeafCellIndices[nodes[0]] = []int32{} + } + nodeToLeafCellIndices[nodes[0]] = append(nodeToLeafCellIndices[nodes[0]], leafCellIndices[0]) + } + } + } + return nodeToLeafCellIndices +} + +func (p groupVirtualPlacement) String() string { + return common.ToJson(p.preassignedCellToLeafCells()) 
+} + +func (p groupVirtualPlacement) preassignedCellToLeafCells() map[api.CellAddress][]api.CellAddress { + preassignedCellToLeafCells := map[api.CellAddress][]api.CellAddress{} + for _, podPlacements := range p { + for _, podPlacement := range podPlacements { + for _, leafCell := range podPlacement { + vLeafCell := leafCell.(*VirtualCell) + address := vLeafCell.GetAddress() + preassignedAddress := vLeafCell.GetPreassignedCell().GetAddress() + if _, ok := preassignedCellToLeafCells[preassignedAddress]; !ok { + preassignedCellToLeafCells[preassignedAddress] = []api.CellAddress{} + } + preassignedCellToLeafCells[preassignedAddress] = append( + preassignedCellToLeafCells[preassignedAddress], address) + } + } + } + return preassignedCellToLeafCells +} + +func (p groupVirtualPlacement) toPhysicalPlacement( + bindings map[api.CellAddress]*PhysicalCell, + leafCellNums []int32) groupPhysicalPlacement { + + physicalPlacement := groupPhysicalPlacement{} + for _, podLeafCellNum := range leafCellNums { + podPlacements := p[podLeafCellNum] + physicalPlacement[podLeafCellNum] = make([]CellList, len(podPlacements)) + for i, podPlacement := range podPlacements { + physicalPlacement[podLeafCellNum][i] = make(CellList, len(podPlacement)) + for j, leafCell := range podPlacement { + pLeafCell := bindings[leafCell.GetAddress()] + physicalPlacement[podLeafCellNum][i][j] = pLeafCell + } + } + } + return physicalPlacement +} + +// A binding path is a tree consisting of all cells that should be bound for binding a set of +// lowest-level cells in a physical placement. It is generated by collecting all the unbound +// ancestors for these cells and group them in a tree. +func (p groupVirtualPlacement) toBindingPaths( + leafCellNums []int32, + bindings map[api.CellAddress]*PhysicalCell) ( + preassignedCells []*cellBindingPathVertex, + nonPreassignedCells [][]*cellBindingPathVertex) { + + allBindingPathVertices := map[api.CellAddress]*cellBindingPathVertex{} + for _, podLeafCellNum := range leafCellNums { + podPlacements := p[podLeafCellNum] + for _, podPlacement := range podPlacements { + for _, leafCell := range podPlacement { + if pLeafCell := leafCell.(*VirtualCell).GetPhysicalCell(); pLeafCell != nil { + bindings[leafCell.GetAddress()] = pLeafCell + continue + } + var bindingPath []*VirtualCell + for c := leafCell; c != nil; c = c.GetParent() { + vc := c.(*VirtualCell) + if vc.GetPhysicalCell() != nil || allBindingPathVertices[vc.GetAddress()] != nil { + break + } + bindingPath = append(bindingPath, vc) + } + pathRoot := bindingPath[len(bindingPath)-1] + n := &cellBindingPathVertex{cell: pathRoot} + allBindingPathVertices[pathRoot.GetAddress()] = n + if parent := pathRoot.GetParent(); parent == nil { + preassignedCells = append(preassignedCells, n) + } else if parent.(*VirtualCell).GetPhysicalCell() != nil { + buddyExist := false + for i := range nonPreassignedCells { + if CellEqual(parent, nonPreassignedCells[i][0].cell.GetParent()) { + buddyExist = true + nonPreassignedCells[i] = append(nonPreassignedCells[i], n) + break + } + } + if !buddyExist { + nonPreassignedCells = append(nonPreassignedCells, []*cellBindingPathVertex{n}) + } + } else { + parentNode := allBindingPathVertices[pathRoot.GetParent().GetAddress()] + parentNode.childrenToBind = append(parentNode.childrenToBind, n) + } + for i := len(bindingPath) - 2; i >= 0; i-- { + c := bindingPath[i] + n := &cellBindingPathVertex{cell: c} + parentNode := allBindingPathVertices[c.GetParent().GetAddress()] + parentNode.childrenToBind = 
append(parentNode.childrenToBind, n) + allBindingPathVertices[c.GetAddress()] = n + } + } + } + } + return preassignedCells, nonPreassignedCells +} diff --git a/pkg/algorithm/types_v2.go b/pkg/algorithm/types_v2.go new file mode 100644 index 0000000..527d168 --- /dev/null +++ b/pkg/algorithm/types_v2.go @@ -0,0 +1,399 @@ +// MIT License +// +// Copyright (c) Microsoft Corporation. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE + +package algorithm + +import ( + "github.com/microsoft/hivedscheduler/pkg/api" + apiv2 "github.com/microsoft/hivedscheduler/pkg/api/v2" + "github.com/microsoft/hivedscheduler/pkg/common" + core "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/types" +) + +// PodGroupState represents the internal state of pod group. +type PodGroupState string + +// PodGroupSchedulingRequest represents request of pod group. +type PodGroupSchedulingRequest struct { + vc api.VirtualClusterName + pinnedCellId api.PinnedCellId + podRootGroup apiv2.PodGroupSpec + chain CellChain + priority CellPriority +} + +// PodGroupSchedulingStatus represents internal scheduling status of pod group. 
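For orientation, the sketch below shows the nested pod group shape that this status tracks. The field names (a root group with ChildGroups, Pods, PodMinNumber, CellsPerPod, WithinOneCell) follow the apiv2 spec used elsewhere in this change, but the local structs and the cell type names ("rack", "node", "gpu") are simplified, hypothetical stand-ins rather than the real apiv2 types.

package main

import "fmt"

type cellsPerPod struct {
	CellType   string
	CellNumber int32
}

type podGroupMember struct {
	PodMinNumber int32
	CellsPerPod  cellsPerPod
}

type podGroupSpec struct {
	Name          string
	WithinOneCell string // affinity boundary for everything under this group
	Pods          []podGroupMember
	ChildGroups   []*podGroupSpec
}

// countPods returns the total pod count of a group and all of its child groups.
func countPods(g *podGroupSpec) int32 {
	n := int32(0)
	for _, p := range g.Pods {
		n += p.PodMinNumber
	}
	for _, c := range g.ChildGroups {
		n += countPods(c)
	}
	return n
}

func main() {
	// A root group kept within one rack, with two child groups each kept within one node.
	root := &podGroupSpec{
		Name:          "group1",
		WithinOneCell: "rack",
		ChildGroups: []*podGroupSpec{
			{WithinOneCell: "node", Pods: []podGroupMember{{PodMinNumber: 2, CellsPerPod: cellsPerPod{"gpu", 4}}}},
			{WithinOneCell: "node", Pods: []podGroupMember{{PodMinNumber: 2, CellsPerPod: cellsPerPod{"gpu", 4}}}},
		},
	}
	fmt.Println(countPods(root)) // 4
}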
+type PodGroupSchedulingStatus struct { + name string + vc api.VirtualClusterName + priority CellPriority + lazyPreemptionEnable bool + preemptingPods map[types.UID]*core.Pod + allocatedPodGroup AllocatedPodGroup + virtualPlacement PodGroupVirtualPlacement + physicalPlacement PodGroupPhysicalPlacement + state PodGroupState + lazyPreemptionStatus *api.LazyPreemptionStatus +} + +func (podGroupSchedStatus *PodGroupSchedulingStatus) DumpPodGroup() apiv2.PodGroup { + podGroup := apiv2.PodGroup{ + ObjectMeta: api.ObjectMeta{Name: podGroupSchedStatus.name}, + Status: apiv2.PodGroupStatus{ + VC: podGroupSchedStatus.vc, + Priority: int32(podGroupSchedStatus.priority), + State: apiv2.PodGroupState(podGroupSchedStatus.state), + LazyPreemptionStatus: podGroupSchedStatus.lazyPreemptionStatus, + }, + } + if !PodGroupPlacement(podGroupSchedStatus.physicalPlacement).IsEmpty() { + podGroup.Status.PhysicalPlacement = podGroupSchedStatus.physicalPlacement.nodeToLeafCellIndices() + } + if !PodGroupPlacement(podGroupSchedStatus.virtualPlacement).IsEmpty() { + podGroup.Status.VirtualPlacement = podGroupSchedStatus.virtualPlacement.preassignedCellToLeafCells() + } + for iter := podGroupSchedStatus.allocatedPodGroup.Iterator(); iter.HasNext(); { + pod := iter.Next() + if pod != nil { + podGroup.Status.AllocatedPods = append(podGroup.Status.AllocatedPods, pod.UID) + } + } + for uid := range podGroupSchedStatus.preemptingPods { + podGroup.Status.PreemptingPods = append(podGroup.Status.PreemptingPods, uid) + } + return podGroup +} + +func newPodGroupSchedulingStatus( + podSchedSpec *apiv2.PodSchedulingSpec, + leafCellNums map[CellLevel]int32, + cellLevel map[api.CellType]CellLevel, + state PodGroupState) *PodGroupSchedulingStatus { + + podGroupSchedStatus := &PodGroupSchedulingStatus{ + name: podSchedSpec.PodRootGroup.Name, + vc: podSchedSpec.VirtualCluster, + priority: CellPriority(podSchedSpec.Priority), + lazyPreemptionEnable: podSchedSpec.LazyPreemptionEnable, + allocatedPodGroup: AllocatedPodGroup{}, + virtualPlacement: PodGroupVirtualPlacement{}, + physicalPlacement: PodGroupPhysicalPlacement{}, + state: state, + } + if state == podGroupPreempting { + podGroupSchedStatus.preemptingPods = map[types.UID]*core.Pod{} + } + podGroupSpecQueue := []*apiv2.PodGroupSpec{podSchedSpec.PodRootGroup} + allocatedPodGroupQueue := []*AllocatedPodGroup{&podGroupSchedStatus.allocatedPodGroup} + virtualPlacementQueue := []*PodGroupPlacement{(*PodGroupPlacement)(&podGroupSchedStatus.virtualPlacement)} + physicalPlacementQueue := []*PodGroupPlacement{(*PodGroupPlacement)(&podGroupSchedStatus.physicalPlacement)} + for len(podGroupSpecQueue) > 0 { + newPodGroupSpecQueue := []*apiv2.PodGroupSpec{} + newAllocatedPodGroupQueue := []*AllocatedPodGroup{} + newVirtualPlacementQueue := []*PodGroupPlacement{} + newPhysicalPlacementQueue := []*PodGroupPlacement{} + for index, podGroup := range podGroupSpecQueue { + podNum := int32(0) + for _, pod := range podGroup.Pods { + podNum += pod.PodMinNumber + } + allocatedPodGroupQueue[index].pods = make([]*core.Pod, podNum) + virtualPlacementQueue[index].podsPlacement = make([]CellList, podNum) + physicalPlacementQueue[index].podsPlacement = make([]CellList, podNum) + podNumIndex := int32(0) + for _, pod := range podGroup.Pods { + for i := int32(0); i < pod.PodMinNumber; i++ { + if level, ok := cellLevel[pod.CellsPerPod.CellType]; ok { + virtualPlacementQueue[index].podsPlacement[podNumIndex] = make(CellList, pod.CellsPerPod.CellNumber*leafCellNums[level]) + 
physicalPlacementQueue[index].podsPlacement[podNumIndex] = make(CellList, pod.CellsPerPod.CellNumber*leafCellNums[level]) + } else { + virtualPlacementQueue[index].podsPlacement[podNumIndex] = make(CellList, pod.CellsPerPod.CellNumber) + physicalPlacementQueue[index].podsPlacement[podNumIndex] = make(CellList, pod.CellsPerPod.CellNumber) + } + podNumIndex++ + } + } + if podGroup.ChildGroups != nil { + allocatedPodGroupQueue[index].allocatedChildGroup = make([]*AllocatedPodGroup, len(podGroup.ChildGroups)) + virtualPlacementQueue[index].childGroupsPlacement = make([]*PodGroupPlacement, len(podGroup.ChildGroups)) + physicalPlacementQueue[index].childGroupsPlacement = make([]*PodGroupPlacement, len(podGroup.ChildGroups)) + for childIndex := range podGroup.ChildGroups { + allocatedPodGroupQueue[index].allocatedChildGroup[childIndex] = &AllocatedPodGroup{} + virtualPlacementQueue[index].childGroupsPlacement[childIndex] = &PodGroupPlacement{} + physicalPlacementQueue[index].childGroupsPlacement[childIndex] = &PodGroupPlacement{} + } + } + newPodGroupSpecQueue = append(newPodGroupSpecQueue, podGroup.ChildGroups...) + newAllocatedPodGroupQueue = append(newAllocatedPodGroupQueue, allocatedPodGroupQueue[index].allocatedChildGroup...) + newVirtualPlacementQueue = append(newVirtualPlacementQueue, virtualPlacementQueue[index].childGroupsPlacement...) + newPhysicalPlacementQueue = append(newPhysicalPlacementQueue, physicalPlacementQueue[index].childGroupsPlacement...) + } + podGroupSpecQueue = newPodGroupSpecQueue + allocatedPodGroupQueue = newAllocatedPodGroupQueue + virtualPlacementQueue = newVirtualPlacementQueue + physicalPlacementQueue = newPhysicalPlacementQueue + } + return podGroupSchedStatus +} + +// AllocatedPodGroup represents a tree structure of allocated pod group. +type AllocatedPodGroup struct { + pods []*core.Pod + allocatedChildGroup []*AllocatedPodGroup +} + +type allocatedPodGroupIterator struct { + pods []*core.Pod + index int + length int +} + +// Next returns the next item in iteration. +func (i *allocatedPodGroupIterator) Next() *core.Pod { + i.index++ + return i.pods[i.index-1] +} + +// HasNext return true if iteration not finishes. +func (i *allocatedPodGroupIterator) HasNext() bool { + return i.index < i.length +} + +// Iterator returns a stateful iterator for AllocatedPodGroup +func (podRootGroup AllocatedPodGroup) Iterator(args ...int32) *allocatedPodGroupIterator { + index := int32(0) + pods := []*core.Pod{} + queue := []*AllocatedPodGroup{&podRootGroup} + for len(queue) > 0 { + newQueue := []*AllocatedPodGroup{} + for _, podGroup := range queue { + if len(args) == 1 && args[0] == index { + return &allocatedPodGroupIterator{podGroup.pods, 0, len(podGroup.pods)} + } + pods = append(pods, podGroup.pods...) + index++ + newQueue = append(newQueue, podGroup.allocatedChildGroup...) + } + queue = newQueue + } + return &allocatedPodGroupIterator{pods, 0, len(pods)} +} + +// SetPod sets allocated pod in AllocatedPodGroup. +func (podRootGroup AllocatedPodGroup) SetPod(pod *core.Pod, podGroupIndex int32, podIndex int32) { + index := int32(0) + queue := []*AllocatedPodGroup{&podRootGroup} + for len(queue) > 0 { + newQueue := []*AllocatedPodGroup{} + for _, podGroup := range queue { + if index == podGroupIndex { + podGroup.pods[podIndex] = pod + } + index++ + newQueue = append(newQueue, podGroup.allocatedChildGroup...) + } + queue = newQueue + } +} + +// PodGroupPlacement represents a tree structure of intra VC scheduled placement. 
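A minimal sketch of the level-order traversal that the placement iterator below performs: groups are visited level by level, and each group contributes its own podsPlacement entries before the next level of child groups is visited. String slices stand in for CellList, and the types are simplified stand-ins rather than the real ones.

package main

import "fmt"

type placement struct {
	podsPlacement        [][]string
	childGroupsPlacement []*placement
}

// iterate flattens a placement tree in level order, mirroring the BFS queue logic.
func iterate(root *placement) [][]string {
	var out [][]string
	queue := []*placement{root}
	for len(queue) > 0 {
		var next []*placement
		for _, p := range queue {
			out = append(out, p.podsPlacement...)
			next = append(next, p.childGroupsPlacement...)
		}
		queue = next
	}
	return out
}

func main() {
	root := &placement{
		podsPlacement: [][]string{{"root-pod"}},
		childGroupsPlacement: []*placement{
			{podsPlacement: [][]string{{"child0-pod"}}},
			{podsPlacement: [][]string{{"child1-pod"}}},
		},
	}
	fmt.Println(iterate(root)) // [[root-pod] [child0-pod] [child1-pod]]
}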
+type PodGroupPlacement struct { + podsPlacement []CellList + childGroupsPlacement []*PodGroupPlacement +} + +// PodGroupPhysicalPlacement represents physical placement of pod group. +type PodGroupPhysicalPlacement PodGroupPlacement + +// PodGroupVirtualPlacement represents virtual placement of pod group. +type PodGroupVirtualPlacement PodGroupPlacement + +// IsEmpty checks whether PodGroupPlacement is empty +func (placement PodGroupPlacement) IsEmpty() bool { + return ((placement.podsPlacement == nil || len(placement.podsPlacement) == 0) && + (placement.childGroupsPlacement == nil || len(placement.childGroupsPlacement) == 0)) +} + +type podGroupPlacementIterator struct { + cellLists []*CellList + index int + length int +} + +// Next returns the next item in iteration. +func (i *podGroupPlacementIterator) Next() *CellList { + i.index++ + return i.cellLists[i.index-1] +} + +// HasNext return true if iteration not finishes. +func (i *podGroupPlacementIterator) HasNext() bool { + return i.index < i.length +} + +// Iterator returns a stateful iterator for PodGroupPlacement +func (placement PodGroupPlacement) Iterator() *podGroupPlacementIterator { + cellLists := []*CellList{} + queue := []*PodGroupPlacement{&placement} + for len(queue) > 0 { + newQueue := []*PodGroupPlacement{} + for _, groupPlacement := range queue { + for podIndex := range groupPlacement.podsPlacement { + cellLists = append(cellLists, &groupPlacement.podsPlacement[podIndex]) + } + newQueue = append(newQueue, groupPlacement.childGroupsPlacement...) + } + queue = newQueue + } + return &podGroupPlacementIterator{cellLists, 0, len(cellLists)} +} + +func (physicalPlacement PodGroupPhysicalPlacement) String() string { + return common.ToJson(physicalPlacement.nodeToLeafCellIndices()) +} + +func (physicalPlacement PodGroupPhysicalPlacement) nodeToLeafCellIndices() map[string][]int32 { + nodeToLeafCellIndices := map[string][]int32{} + for iter := PodGroupPlacement(physicalPlacement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + pLeafCell := leafCell.(*PhysicalCell) + nodes, leafCellIndices := pLeafCell.GetPhysicalPlacement() + if _, ok := nodeToLeafCellIndices[nodes[0]]; !ok { + nodeToLeafCellIndices[nodes[0]] = []int32{} + } + nodeToLeafCellIndices[nodes[0]] = append(nodeToLeafCellIndices[nodes[0]], leafCellIndices[0]) + } + } + return nodeToLeafCellIndices +} + +func (virtualPlacement PodGroupVirtualPlacement) String() string { + return common.ToJson(virtualPlacement.preassignedCellToLeafCells()) +} + +func (virtualPlacement PodGroupVirtualPlacement) preassignedCellToLeafCells() map[api.CellAddress][]api.CellAddress { + preassignedCellToLeafCells := map[api.CellAddress][]api.CellAddress{} + for iter := PodGroupPlacement(virtualPlacement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + vLeafCell := leafCell.(*VirtualCell) + address := vLeafCell.GetAddress() + preassignedAddress := vLeafCell.GetPreassignedCell().GetAddress() + if _, ok := preassignedCellToLeafCells[preassignedAddress]; !ok { + preassignedCellToLeafCells[preassignedAddress] = []api.CellAddress{} + } + preassignedCellToLeafCells[preassignedAddress] = append( + preassignedCellToLeafCells[preassignedAddress], address) + } + } + return preassignedCellToLeafCells +} + +func (virtualPlacement PodGroupVirtualPlacement) toPhysicalPlacement( + bindings map[api.CellAddress]*PhysicalCell) PodGroupPhysicalPlacement { + + physicalPlacement := PodGroupPhysicalPlacement{} + + virtualPlacementQueue := 
[]*PodGroupPlacement{(*PodGroupPlacement)(&virtualPlacement)} + physicalPlacementQueue := []*PodGroupPlacement{(*PodGroupPlacement)(&physicalPlacement)} + for len(virtualPlacementQueue) > 0 { + newVirtualPlacementQueue := []*PodGroupPlacement{} + newPhysicalPlacementQueue := []*PodGroupPlacement{} + for index, placement := range virtualPlacementQueue { + physicalPlacementQueue[index].podsPlacement = make([]CellList, len(placement.podsPlacement)) + for i, podPlacement := range placement.podsPlacement { + physicalPlacementQueue[index].podsPlacement[i] = make(CellList, len(podPlacement)) + for j, leafCell := range podPlacement { + pLeafCell := bindings[leafCell.GetAddress()] + physicalPlacementQueue[index].podsPlacement[i][j] = pLeafCell + } + } + if placement.childGroupsPlacement != nil { + physicalPlacementQueue[index].childGroupsPlacement = make([]*PodGroupPlacement, len(placement.childGroupsPlacement)) + for childIndex := range placement.childGroupsPlacement { + physicalPlacementQueue[index].childGroupsPlacement[childIndex] = &PodGroupPlacement{} + } + } + newVirtualPlacementQueue = append(newVirtualPlacementQueue, virtualPlacementQueue[index].childGroupsPlacement...) + newPhysicalPlacementQueue = append(newPhysicalPlacementQueue, physicalPlacementQueue[index].childGroupsPlacement...) + } + virtualPlacementQueue = newVirtualPlacementQueue + physicalPlacementQueue = newPhysicalPlacementQueue + } + return physicalPlacement +} + +// A binding path is a tree consisting of all cells that should be bound for binding a set of +// lowest-level cells in a physical placement. It is generated by collecting all the unbound +// ancestors for these cells and group them in a tree. +func (virtualPlacement PodGroupVirtualPlacement) toBindingPaths( + bindings map[api.CellAddress]*PhysicalCell) ( + preassignedCells []*cellBindingPathVertex, + nonPreassignedCells [][]*cellBindingPathVertex) { + + allBindingPathVertices := map[api.CellAddress]*cellBindingPathVertex{} + for iter := PodGroupPlacement(virtualPlacement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + if pLeafCell := leafCell.(*VirtualCell).GetPhysicalCell(); pLeafCell != nil { + bindings[leafCell.GetAddress()] = pLeafCell + continue + } + var bindingPath []*VirtualCell + for c := leafCell; c != nil; c = c.GetParent() { + vc := c.(*VirtualCell) + if vc.GetPhysicalCell() != nil || allBindingPathVertices[vc.GetAddress()] != nil { + break + } + bindingPath = append(bindingPath, vc) + } + pathRoot := bindingPath[len(bindingPath)-1] + n := &cellBindingPathVertex{cell: pathRoot} + allBindingPathVertices[pathRoot.GetAddress()] = n + if parent := pathRoot.GetParent(); parent == nil { + preassignedCells = append(preassignedCells, n) + } else if parent.(*VirtualCell).GetPhysicalCell() != nil { + buddyExist := false + for i := range nonPreassignedCells { + if CellEqual(parent, nonPreassignedCells[i][0].cell.GetParent()) { + buddyExist = true + nonPreassignedCells[i] = append(nonPreassignedCells[i], n) + break + } + } + if !buddyExist { + nonPreassignedCells = append(nonPreassignedCells, []*cellBindingPathVertex{n}) + } + } else { + parentNode := allBindingPathVertices[pathRoot.GetParent().GetAddress()] + parentNode.childrenToBind = append(parentNode.childrenToBind, n) + } + for i := len(bindingPath) - 2; i >= 0; i-- { + c := bindingPath[i] + n := &cellBindingPathVertex{cell: c} + parentNode := allBindingPathVertices[c.GetParent().GetAddress()] + parentNode.childrenToBind = append(parentNode.childrenToBind, n) + 
allBindingPathVertices[c.GetAddress()] = n + } + } + } + return preassignedCells, nonPreassignedCells +} diff --git a/pkg/algorithm/utils.go b/pkg/algorithm/utils.go index 5e2d65f..86ed9c5 100644 --- a/pkg/algorithm/utils.go +++ b/pkg/algorithm/utils.go @@ -27,6 +27,7 @@ import ( "math/rand" "github.com/microsoft/hivedscheduler/pkg/api" + apiv2 "github.com/microsoft/hivedscheduler/pkg/api/v2" "github.com/microsoft/hivedscheduler/pkg/common" "github.com/microsoft/hivedscheduler/pkg/internal" core "k8s.io/api/core/v1" @@ -36,26 +37,26 @@ import ( // generatePodScheduleResult writes the scheduling result into a PodScheduleResult. func generatePodScheduleResult( - groupPhysicalPlacement groupPhysicalPlacement, - groupVirtualPlacement groupVirtualPlacement, + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, preemptionVictims map[string]common.Set, waitReason string, cellLevelToType map[CellChain]map[CellLevel]api.CellType, - currentLeafCellNum int32, + currentCellNum int32, + currentPodGroupIndex int32, currentPodIndex int32, - group *AlgoAffinityGroup, - groupName string, + podGroupSchedStatus *PodGroupSchedulingStatus, suggestedNodes common.Set, pod *core.Pod) internal.PodScheduleResult { klog.V(4).Infof("[%v]: Got K8s suggested nodes: %v", internal.Key(pod), suggestedNodes) - if groupPhysicalPlacement == nil { + if PodGroupPlacement(physicalPlacement).IsEmpty() { klog.Infof("[%v]: Pod needs to wait, reason: %v", internal.Key(pod), waitReason) return internal.PodScheduleResult{PodWaitInfo: &internal.PodWaitInfo{Reason: waitReason}} } - klog.Infof("[%v]: Physical placement: %v", internal.Key(pod), groupPhysicalPlacement) - if groupVirtualPlacement != nil { - klog.Infof("[%v]: Virtual placement: %v", internal.Key(pod), groupVirtualPlacement) + klog.Infof("[%v]: Physical placement: %v", internal.Key(pod), physicalPlacement) + if !PodGroupPlacement(virtualPlacement).IsEmpty() { + klog.Infof("[%v]: Virtual placement: %v", internal.Key(pod), virtualPlacement) } if len(preemptionVictims) > 0 { return internal.PodScheduleResult{ @@ -64,16 +65,17 @@ func generatePodScheduleResult( } // we find the selected node after the preemption is done, otherwise the preemption victims // may cause the selected node to be excluded from the suggested nodes - affinityGroupBindInfo, selectedNode, selectedLeafCellIndices, cellChain := generateAffinityGroupBindInfo( - groupPhysicalPlacement, groupVirtualPlacement, cellLevelToType, currentLeafCellNum, currentPodIndex, group, groupName) + podRootGroupBindInfo, selectedNode, selectedLeafCellIndices, cellChain := generatePodGroupBindInfo( + physicalPlacement, virtualPlacement, cellLevelToType, currentCellNum, currentPodGroupIndex, currentPodIndex, podGroupSchedStatus) klog.Infof("[%v]: pod is decided to be scheduled to node %v, leaf cells %v", internal.Key(pod), selectedNode, common.ToJson(selectedLeafCellIndices)) return internal.PodScheduleResult{ - PodBindInfo: &api.PodBindInfo{ - Node: selectedNode, - LeafCellIsolation: selectedLeafCellIndices, - CellChain: cellChain, - AffinityGroupBindInfo: affinityGroupBindInfo, + PodBindInfo: &apiv2.PodBindInfo{ + Version: "v2", + Node: selectedNode, + LeafCellIsolation: selectedLeafCellIndices, + CellChain: cellChain, + PodRootGroupBindInfo: podRootGroupBindInfo, }, } } @@ -102,132 +104,148 @@ func generatePodPreemptInfo(preemptionVictims map[string]common.Set, pod *core.P return &internal.PodPreemptInfo{VictimPods: victimPods} } -// generateAffinityGroupBindInfo translates the physical and 
virtual placements of an affinity group -// into a a series of AffinityGroupMemberBindInfos, and also returns the allocated node and leaf cell addresses +// generatePodGroupBindInfo translates the physical and virtual placements of a pod group +// into PodGroupBindInfo, and also returns the allocated node and leaf cell addresses // of the current pod. -func generateAffinityGroupBindInfo( - groupPhysicalPlacement groupPhysicalPlacement, - groupVirtualPlacement groupVirtualPlacement, +func generatePodGroupBindInfo( + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, cellLevelToType map[CellChain]map[CellLevel]api.CellType, currentLeafCellNum int32, + currentPodGroupIndex int32, currentPodIndex int32, - group *AlgoAffinityGroup, - groupName string) ( - affinityGroupBindInfo []api.AffinityGroupMemberBindInfo, + podGroupSchedStatus *PodGroupSchedulingStatus) ( + podRootGroupBindInfo *apiv2.PodGroupBindInfo, selectedNode string, selectedLeafCellIndices []int32, chain string) { - affinityGroupBindInfo = make([]api.AffinityGroupMemberBindInfo, len(groupPhysicalPlacement)) - groupMemberIndex := 0 - for podLeafCellNum, podPhysicalPlacements := range groupPhysicalPlacement { - mbi := api.AffinityGroupMemberBindInfo{ - PodPlacements: make([]api.PodPlacementInfo, len(podPhysicalPlacements)), - } - for podIndex := int32(0); podIndex < int32(len(podPhysicalPlacements)); podIndex++ { - mbi.PodPlacements[podIndex].PhysicalLeafCellIndices = make([]int32, podLeafCellNum) - mbi.PodPlacements[podIndex].PreassignedCellTypes = make([]api.CellType, podLeafCellNum) - for leafCellIndex := int32(0); leafCellIndex < podLeafCellNum; leafCellIndex++ { - pLeafCell := podPhysicalPlacements[podIndex][leafCellIndex] - if pLeafCell == nil { - if group == nil || group.state == groupPreempting { - panic(fmt.Sprintf("The first pod in group %v was allocated invalid resource", groupName)) - } - // if the physical placement of this pod is not found (e.g., removed due to reconfiguration), - // we will insist the decision by retrieving it from other pods - mbi.PodPlacements[podIndex], chain = retrieveMissingPodPlacement(group, podLeafCellNum, podIndex) - klog.Warningf( - "pod placement has been invalid and is retrieved from annotation of other pods: node %v, leaf cell %v", - mbi.PodPlacements[podIndex].PhysicalNode, mbi.PodPlacements[podIndex].PhysicalLeafCellIndices[leafCellIndex]) - } else { - nodes, leafCellIndices := pLeafCell.(*PhysicalCell).GetPhysicalPlacement() - // here each cell (i.e., pLeafCell) is only one leaf cell, hence we takes the first element - // in its "nodes" and "leafCellIndices" as the node and leaf cell address - if mbi.PodPlacements[podIndex].PhysicalNode == "" { - mbi.PodPlacements[podIndex].PhysicalNode = nodes[0] - } - mbi.PodPlacements[podIndex].PhysicalLeafCellIndices[leafCellIndex] = leafCellIndices[0] - if groupVirtualPlacement != nil { - vLeafCell := groupVirtualPlacement[podLeafCellNum][podIndex][leafCellIndex].(*VirtualCell) - mbi.PodPlacements[podIndex].PreassignedCellTypes[leafCellIndex] = - cellLevelToType[vLeafCell.GetChain()][vLeafCell.GetPreassignedCell().GetLevel()] + podGroupIndex := int32(0) + podRootGroupBindInfo = &apiv2.PodGroupBindInfo{} + + physicalPlacementQueue := []*PodGroupPlacement{(*PodGroupPlacement)(&physicalPlacement)} + virtualPlacementQueue := []*PodGroupPlacement{(*PodGroupPlacement)(&virtualPlacement)} + podGroupBindInfoQueue := []*apiv2.PodGroupBindInfo{podRootGroupBindInfo} + for len(physicalPlacementQueue) > 0 { + 
newPhysicalPlacementQueue := []*PodGroupPlacement{} + newVirtualPlacementQueue := []*PodGroupPlacement{} + newPodGroupBindInfoQueue := []*apiv2.PodGroupBindInfo{} + for index, placement := range physicalPlacementQueue { + podGroupBindInfoQueue[index].PodPlacements = make([]apiv2.PodPlacementInfo, len(placement.podsPlacement)) + for podIndex, podPlacement := range placement.podsPlacement { + podLeafCellNum := len(podPlacement) + podGroupBindInfoQueue[index].PodPlacements[podIndex].PhysicalLeafCellIndices = make([]int32, podLeafCellNum) + podGroupBindInfoQueue[index].PodPlacements[podIndex].PreassignedCellTypes = make([]api.CellType, podLeafCellNum) + for leafCellIndex, pLeafCell := range podPlacement { + if pLeafCell == nil { + if podGroupSchedStatus == nil || podGroupSchedStatus.state == podGroupPreempting { + panic(fmt.Sprintf("The first pod in group %v was allocated invalid resource", podGroupSchedStatus.name)) + } + // if the physical placement of this pod is not found (e.g., removed due to reconfiguration), + // we will insist the decision by retrieving it from other pods + podGroupBindInfoQueue[index].PodPlacements[podIndex], chain = + retrieveMissingPodPlacement(podGroupSchedStatus, podGroupIndex, int32(podIndex)) + klog.Warningf( + "pod placement has been invalid and is retrieved from annotation of other pods: node %v, leaf cell %v", + podGroupBindInfoQueue[index].PodPlacements[podIndex].PhysicalNode, + podGroupBindInfoQueue[index].PodPlacements[podIndex].PhysicalLeafCellIndices[leafCellIndex]) } else { - mbi.PodPlacements[podIndex].PreassignedCellTypes[leafCellIndex] = "" + nodes, leafCellIndices := pLeafCell.(*PhysicalCell).GetPhysicalPlacement() + // here each cell (i.e., pLeafCell) is only one leaf cell, hence we takes the first element + // in its "nodes" and "leafCellIndices" as the node and leaf cell address + if podGroupBindInfoQueue[index].PodPlacements[podIndex].PhysicalNode == "" { + podGroupBindInfoQueue[index].PodPlacements[podIndex].PhysicalNode = nodes[0] + } + podGroupBindInfoQueue[index].PodPlacements[podIndex].PhysicalLeafCellIndices[leafCellIndex] = leafCellIndices[0] + if !PodGroupPlacement(virtualPlacement).IsEmpty() { + vLeafCell := virtualPlacementQueue[index].podsPlacement[podIndex][leafCellIndex].(*VirtualCell) + podGroupBindInfoQueue[index].PodPlacements[podIndex].PreassignedCellTypes[leafCellIndex] = + cellLevelToType[vLeafCell.GetChain()][vLeafCell.GetPreassignedCell().GetLevel()] + } else { + podGroupBindInfoQueue[index].PodPlacements[podIndex].PreassignedCellTypes[leafCellIndex] = "" + } } } } - } - if podLeafCellNum == currentLeafCellNum { - selectedNode = mbi.PodPlacements[currentPodIndex].PhysicalNode - selectedLeafCellIndices = mbi.PodPlacements[currentPodIndex].PhysicalLeafCellIndices - if pLeafCell := groupPhysicalPlacement[currentLeafCellNum][currentPodIndex][0]; pLeafCell != nil { - chain = string(pLeafCell.GetChain()) + if podGroupIndex == currentPodGroupIndex && len(podGroupBindInfoQueue[index].PodPlacements) > 0 { + selectedNode = podGroupBindInfoQueue[index].PodPlacements[currentPodIndex].PhysicalNode + selectedLeafCellIndices = podGroupBindInfoQueue[index].PodPlacements[currentPodIndex].PhysicalLeafCellIndices + if pLeafCell := physicalPlacementQueue[index].podsPlacement[currentPodIndex][0]; pLeafCell != nil { + chain = string(pLeafCell.GetChain()) + } } + if placement.childGroupsPlacement != nil { + podGroupBindInfoQueue[index].ChildGroupBindingInfo = make([]*apiv2.PodGroupBindInfo, len(placement.childGroupsPlacement)) + for childIndex := 
range placement.childGroupsPlacement { + podGroupBindInfoQueue[index].ChildGroupBindingInfo[childIndex] = &apiv2.PodGroupBindInfo{} + } + } + podGroupIndex++ + newPhysicalPlacementQueue = append(newPhysicalPlacementQueue, physicalPlacementQueue[index].childGroupsPlacement...) + newVirtualPlacementQueue = append(newVirtualPlacementQueue, virtualPlacementQueue[index].childGroupsPlacement...) + newPodGroupBindInfoQueue = append(newPodGroupBindInfoQueue, podGroupBindInfoQueue[index].ChildGroupBindingInfo...) } - affinityGroupBindInfo[groupMemberIndex] = mbi - groupMemberIndex++ + physicalPlacementQueue = newPhysicalPlacementQueue + virtualPlacementQueue = newVirtualPlacementQueue + podGroupBindInfoQueue = newPodGroupBindInfoQueue } - return affinityGroupBindInfo, selectedNode, selectedLeafCellIndices, chain + + return podRootGroupBindInfo, selectedNode, selectedLeafCellIndices, chain } // collectBadOrNonSuggestedNodes collects all the nodes that are not within the suggested nodes -// in the physical placement of an affinity group. +// in the physical placement of a pod group. func collectBadOrNonSuggestedNodes( - placement groupPhysicalPlacement, + placement PodGroupPhysicalPlacement, suggestedNodes common.Set, ignoreSuggestedNodes bool) ( badOrNonSuggestedNodes common.Set) { badOrNonSuggestedNodes = common.NewSet() - for leafCellNum := range placement { - for podIndex := range placement[leafCellNum] { - for _, leafCell := range placement[leafCellNum][podIndex] { - if leafCell == nil { - continue - } - nodes, _ := leafCell.(*PhysicalCell).GetPhysicalPlacement() - if !leafCell.(*PhysicalCell).IsHealthy() || - (!ignoreSuggestedNodes && !suggestedNodes.Contains(nodes[0])) { - badOrNonSuggestedNodes.Add(nodes[0]) - } + for iter := PodGroupPlacement(placement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + if leafCell == nil { + continue + } + nodes, _ := leafCell.(*PhysicalCell).GetPhysicalPlacement() + if !leafCell.(*PhysicalCell).IsHealthy() || + (!ignoreSuggestedNodes && !suggestedNodes.Contains(nodes[0])) { + badOrNonSuggestedNodes.Add(nodes[0]) } } } return badOrNonSuggestedNodes } -// collectPreemptionVictims collects preemption victims of an affinity group. +// collectPreemptionVictims collects preemption victims of a pod group. // If any of the leaf cells allocated for the whole group is still used by a pod, // we will wait for the preemption, as a group is gang-scheduled. 
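// Illustrative sketch (editor's addition): the old triple loop over
// placement[leafCellNum][podIndex][leafCellIndex] is replaced throughout this file by the
// PodGroupPlacement iterator. A minimal walk over every leaf cell of a physical placement
// looks like the hypothetical helper below (Cell is the element type of CellList in this package).
func forEachLeafCell(placement PodGroupPhysicalPlacement, visit func(leafCell Cell)) {
	for iter := PodGroupPlacement(placement).Iterator(); iter.HasNext(); {
		for _, leafCell := range *iter.Next() { // one CellList per pod
			if leafCell != nil { // a slot may be empty, e.g., after reconfiguration
				visit(leafCell)
			}
		}
	}
}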
-func collectPreemptionVictims(placement groupPhysicalPlacement) ( +func collectPreemptionVictims(placement PodGroupPhysicalPlacement) ( victimPods map[string]common.Set, overlappingPreemptorGroups common.Set) { victimPods = map[string]common.Set{} // node -> pods overlappingPreemptorGroups = common.NewSet() - for leafCellNum := range placement { - for podIndex := range placement[leafCellNum] { - for _, leafCell := range placement[leafCellNum][podIndex] { - if leafCell == nil { - continue - } - pLeafCell := leafCell.(*PhysicalCell) - state := pLeafCell.GetState() - if state == cellUsed || state == cellReserving { - // for any victim pod, gang-preempt all the other pods from the same affinity group - for _, pods := range pLeafCell.GetUsingGroup().allocatedPods { - for _, v := range pods { - if v != nil { - if _, ok := victimPods[v.Spec.NodeName]; !ok { - victimPods[v.Spec.NodeName] = common.NewSet() - } - victimPods[v.Spec.NodeName].Add(v) - } + for iter := PodGroupPlacement(placement).Iterator(); iter.HasNext(); { + for _, leafCell := range *iter.Next() { + if leafCell == nil { + continue + } + pLeafCell := leafCell.(*PhysicalCell) + state := pLeafCell.GetState() + if state == cellUsed || state == cellReserving { + // for any victim pod, gang-preempt all the other pods from the same pod group + for iter := pLeafCell.GetUsingGroup().allocatedPodGroup.Iterator(); iter.HasNext(); { + pod := iter.Next() + if pod != nil { + if _, ok := victimPods[pod.Spec.NodeName]; !ok { + victimPods[pod.Spec.NodeName] = common.NewSet() } + victimPods[pod.Spec.NodeName].Add(pod) } } - if state == cellReserving || state == cellReserved { - overlappingPreemptorGroups.Add(pLeafCell.GetReservingOrReservedGroup()) - } + } + if state == cellReserving || state == cellReserved { + overlappingPreemptorGroups.Add(pLeafCell.GetReservingOrReservedGroup()) } } } @@ -245,77 +263,78 @@ func victimsToString(victimPods map[string]common.Set) string { return common.ToJson(s) } -// retrieveMissingPodPlacement finds the placement of a pod from the annotation of other pods in the same group +// retrieveMissingPodPlacement finds the placement of a pod from the annotation of other pods in the same pod group // when the pod's placement has been invalid (i.e., not found in the spec). 
-func retrieveMissingPodPlacement(g *AlgoAffinityGroup, leafCellNum int32, podIndex int32) (api.PodPlacementInfo, string) { - for _, pods := range g.allocatedPods { - for _, p := range pods { - if p != nil { - info := internal.ExtractPodBindInfo(p) - for _, mbi := range info.AffinityGroupBindInfo { - if leafCellNum == int32(len(mbi.PodPlacements[0].PhysicalLeafCellIndices)) { - return mbi.PodPlacements[podIndex], info.CellChain - } +func retrieveMissingPodPlacement(podGroupSchedStatus *PodGroupSchedulingStatus, podGroupIndex int32, podIndex int32) (apiv2.PodPlacementInfo, string) { + for iter := podGroupSchedStatus.allocatedPodGroup.Iterator(); iter.HasNext(); { + pod := iter.Next() + if pod != nil { + info := internal.ExtractPodBindInfo(pod) + index := int32(0) + for infoIter := info.PodRootGroupBindInfo.Iterator(podGroupIndex); infoIter.HasNext(); { + podPlacementInfo := infoIter.Next() + if index == podIndex { + return *podPlacementInfo, info.CellChain } + index++ } } } panic(fmt.Sprintf( - "No allocated pod found in an allocated group %v when retrieving placement for pod %v with leaf cell number %v", g.name, podIndex, leafCellNum)) + "No allocated pod found in an allocated group %v when retrieving placement for pod group %v pod %v", podGroupSchedStatus.name, podGroupIndex, podIndex)) } -// retrieveVirtualCell finds the corresponding virtual cell for a physical cell in the placements of an affinity group. +// retrieveVirtualCell finds the corresponding virtual cell for a physical cell in the placements of a pod group. func retrieveVirtualCell( - physicalPlacement groupPhysicalPlacement, - virtualPlacement groupVirtualPlacement, + physicalPlacement PodGroupPhysicalPlacement, + virtualPlacement PodGroupVirtualPlacement, pLeafCell *PhysicalCell) (vLeafCell *VirtualCell) { - for leafCellNum := range physicalPlacement { - for podIndex := range physicalPlacement[leafCellNum] { - for leafCellIndex, leafCell := range physicalPlacement[leafCellNum][podIndex] { - if leafCell != nil && CellEqual(leafCell, pLeafCell) { - return virtualPlacement[leafCellNum][podIndex][leafCellIndex].(*VirtualCell) - } + pIter := PodGroupPlacement(physicalPlacement).Iterator() + vIter := PodGroupPlacement(virtualPlacement).Iterator() + for pIter.HasNext() { + pLeafCells := *pIter.Next() + vLeafCells := *vIter.Next() + for leafCellIndex, leafCell := range pLeafCells { + if leafCell != nil && CellEqual(leafCell, pLeafCell) { + return vLeafCells[leafCellIndex].(*VirtualCell) } } } return nil } -// getAllocatedPodIndex assigns a new index for a new pod in an affinity group. -func getNewPodIndex(pods []*core.Pod) int32 { - podIndex := int32(-1) - for i, p := range pods { - if p == nil { - podIndex = int32(i) - break +// getAllocatedPodIndex assigns a new index for a new pod in a pod group. +func getNewPodIndex(allocatedPodGroup AllocatedPodGroup, podGroupIndex int32) int32 { + podIndex := int32(0) + for iter := allocatedPodGroup.Iterator(podGroupIndex); iter.HasNext(); { + if iter.Next() == nil { + return podIndex } + podIndex++ } - return podIndex + return -1 } // getAllocatedPodIndex finds the index of an allocated pod in its group according to its placement. 
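// Illustrative sketch (editor's addition): getNewPodIndex above scans one pod group
// (addressed by its level-order index) for the first unassigned slot, and SetPod fills it;
// getAllocatedPodIndex below does the reverse lookup from a bind annotation. placeAllocatedPod
// is a hypothetical caller combining the first two.
func placeAllocatedPod(status *PodGroupSchedulingStatus, podGroupIndex int32, pod *core.Pod) bool {
	podIndex := getNewPodIndex(status.allocatedPodGroup, podGroupIndex)
	if podIndex < 0 {
		return false // every slot of this pod group is already occupied
	}
	status.allocatedPodGroup.SetPod(pod, podGroupIndex, podIndex)
	return true
}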
-func getAllocatedPodIndex(info *api.PodBindInfo, leafCellNum int32) int32 { - for _, gms := range info.AffinityGroupBindInfo { - if leafCellNumber := int32(len(gms.PodPlacements[0].PhysicalLeafCellIndices)); leafCellNumber == leafCellNum { - for podIndex, placement := range gms.PodPlacements { - if placement.PhysicalNode == info.Node && common.Int32SliceContains( - placement.PhysicalLeafCellIndices, info.LeafCellIsolation[0]) { - return int32(podIndex) - } - } +func getAllocatedPodIndex(info *apiv2.PodBindInfo, podGroupIndex int32) int32 { + podIndex := int32(0) + for iter := info.PodRootGroupBindInfo.Iterator(podGroupIndex); iter.HasNext(); { + podPlacementInfo := iter.Next() + if podPlacementInfo.PhysicalNode == info.Node && common.Int32SliceContains( + podPlacementInfo.PhysicalLeafCellIndices, info.LeafCellIsolation[0]) { + return podIndex } + podIndex++ } return -1 } // allPodsReleased checks if all the pods of an affinity group were released. -func allPodsReleased(allocatedPods map[int32][]*core.Pod) bool { - for _, pods := range allocatedPods { - for _, p := range pods { - if p != nil { - return false - } +func allPodsReleased(allocatedPodGroup AllocatedPodGroup) bool { + for iter := allocatedPodGroup.Iterator(); iter.HasNext(); { + if iter.Next() != nil { + return false } } return true diff --git a/pkg/api/constants.go b/pkg/api/constants.go index 4141920..6583dd6 100644 --- a/pkg/api/constants.go +++ b/pkg/api/constants.go @@ -81,10 +81,10 @@ const ( // Scheduler Inspect API: API to inspect current scheduling status // Notes: - // 1. Both Binding and Bound AffinityGroups/Pods are considered as Allocated. + // 1. Both Binding and Bound PodGroups/Pods are considered as Allocated. InspectPath = VersionPath + "/inspect" - // Inspect current allocated AffinityGroup(s) - AffinityGroupsPath = InspectPath + "/affinitygroups/" + // Inspect current allocated PodGroup(s) + PodGroupsPath = InspectPath + "/podgroups/" // Inspect current cluster status ClusterStatusPath = InspectPath + "/clusterstatus" // Inspect current physical cluster status diff --git a/pkg/api/types.go b/pkg/api/types.go index c604794..a696473 100644 --- a/pkg/api/types.go +++ b/pkg/api/types.go @@ -38,6 +38,9 @@ type ( PinnedCellId string ) +// GeneralSpec represents a generic key-value spec. +type GeneralSpec map[string]interface{} + // Physical cluster definition type PhysicalClusterSpec struct { CellTypes map[CellType]CellTypeSpec `yaml:"cellTypes"` diff --git a/pkg/api/v2/types.go b/pkg/api/v2/types.go new file mode 100644 index 0000000..4040be8 --- /dev/null +++ b/pkg/api/v2/types.go @@ -0,0 +1,336 @@ +// MIT License +// +// Copyright (c) Microsoft Corporation. All rights reserved. +// +// Permission is hereby granted, free of charge, to any person obtaining a copy +// of this software and associated documentation files (the "Software"), to deal +// in the Software without restriction, including without limitation the rights +// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +// copies of the Software, and to permit persons to whom the Software is +// furnished to do so, subject to the following conditions: +// +// The above copyright notice and this permission notice shall be included in all +// copies or substantial portions of the Software. +// +// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +// SOFTWARE + +package v2 + +import ( + "fmt" + + "github.com/microsoft/hivedscheduler/pkg/api" + core "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/types" +) + +// PodSchedulingSpec represents HiveD scheduling spec in k8s pod request. +type PodSchedulingSpec struct { + Version string `yaml:"version"` // version of HiveD PodSchedulingSpec, currently supports v1, v2. + VirtualCluster api.VirtualClusterName `yaml:"virtualCluster"` // virtual cluster for pod to be scheduled in + Priority int32 `yaml:"priority"` // pod priority + PinnedCellId api.PinnedCellId `yaml:"pinnedCellId"` // pinned cell id to be scheduled + CellType string `yaml:"cellType"` // cell type to be used in pod, can be leaf or non-leaf cell defined in config, no higher than node level + CellNumber int32 `yaml:"cellNumber"` // cell number to be used in pod, cannot exceed node resource limit + GangReleaseEnable bool `yaml:"gangReleaseEnable"` // whether release in gang (all pods released at the same time) or not + LazyPreemptionEnable bool `yaml:"lazyPreemptionEnable"` // whether lazy preempt or not + PodRootGroup *PodGroupSpec `yaml:"podRootGroup"` // the hierarchical structure for whole pod group +} + +// PodGroupSpec represents a tree structure of pod group spec. +type PodGroupSpec struct { + Name string `yaml:"name"` // pod group name + WithinOneCell api.CellType `yaml:"withinOneCell"` // within cell for all cells in current group, e.g., two GPU-cell within one numa-cell, two node-cell within one rack-cell + Pod *PodGroupMemberSpec `yaml:"pod"` // pod for current group, PodMinNumber instances of pod are gang with each other + ChildGroups []*PodGroupSpec `yaml:"childGroups"` // child group in the hierarchical structure, any two child groups in the list and pods in them are always gang with each other, ChildGroups are order sensitive + Pods []PodGroupMemberSpec // internal structure for PodGroupMemberSpec, flattened Pod in a list +} + +// PodGroupMemberSpec represents content of each node in tree-structured pod group. +// It contains pod number and cell spec for the pod. +type PodGroupMemberSpec struct { + PodMinNumber int32 `yaml:"podMinNumber"` // minimum number of pods to be gang scheduled in the group, TODO: only PodMinNumber=PodMaxNumber is supported currently + PodMaxNumber int32 `yaml:"podMaxNumber"` // total number of pods to be scheduled in the group, may not be gang scheduled + CellsPerPod PodGroupMemberCellSpec `yaml:"cellsPerPod"` // number of cells in each k8s pod + ContainsCurrentPod bool `yaml:"containsCurrentPod"` // whether current pod group member contains current pod request +} + +// PodGroupMemberCellSpec represents cell spec for each pod in pod group.
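// Illustrative sketch (editor's addition): a v2 spec built in Go, with one root group holding
// two gang-scheduled child groups; the names and sizes are made up. The YAML annotation form
// of the same structure appears in the test configs under test/config/group1.
func examplePodSchedulingSpec() *PodSchedulingSpec {
	return &PodSchedulingSpec{
		Version:        "v2",
		VirtualCluster: "VC1",
		Priority:       1,
		PodRootGroup: &PodGroupSpec{
			Name: "default/job-demo",
			ChildGroups: []*PodGroupSpec{
				{Pod: &PodGroupMemberSpec{
					PodMinNumber: 1, PodMaxNumber: 1,
					CellsPerPod:        PodGroupMemberCellSpec{CellType: "V100", CellNumber: 4},
					ContainsCurrentPod: true,
				}},
				{Pod: &PodGroupMemberSpec{
					PodMinNumber: 2, PodMaxNumber: 2,
					CellsPerPod: PodGroupMemberCellSpec{CellType: "V100", CellNumber: 1},
				}},
			},
		},
	}
}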
+type PodGroupMemberCellSpec struct { + CellType api.CellType `yaml:"cellType"` // cell type to be used in pod, differnt group can have differnt cell types in the same chain, TODO: not support multiple chains yet + CellNumber int32 `yaml:"cellNumber"` // cell number to be used in pod, cannot exceed node resource limit +} + +type PodGroupState string + +type PodGroupStatus struct { + VC api.VirtualClusterName `json:"vc"` + Priority int32 `json:"priority"` + State PodGroupState `json:"state"` + PhysicalPlacement map[string][]int32 `json:"physicalPlacement,omitempty"` // node -> leaf cell indices + VirtualPlacement map[api.CellAddress][]api.CellAddress `json:"virtualPlacement,omitempty"` // preassigned cell -> leaf cells + AllocatedPods []types.UID `json:"allocatedPods,omitempty"` + PreemptingPods []types.UID `json:"preemptingPods,omitempty"` + LazyPreemptionStatus *api.LazyPreemptionStatus `json:"lazyPreemptionStatus,omitempty"` +} + +type podGroupSpecIterator struct { + pods []*PodGroupMemberSpec + index int + length int +} + +// Next returns the next item in iteration. +func (i *podGroupSpecIterator) Next() *PodGroupMemberSpec { + i.index++ + return i.pods[i.index-1] +} + +// HasNext return true if iteration not finishes. +func (i *podGroupSpecIterator) HasNext() bool { + return i.index < i.length +} + +// Iterator returns a stateful iterator for PodGroupSpec +func (podRootGroup *PodGroupSpec) Iterator() *podGroupSpecIterator { + pods := []*PodGroupMemberSpec{} + queue := []*PodGroupSpec{podRootGroup} + for len(queue) > 0 { + newQueue := []*PodGroupSpec{} + for _, podGroup := range queue { + for podIndex := range podGroup.Pods { + pods = append(pods, &podGroup.Pods[podIndex]) + } + newQueue = append(newQueue, podGroup.ChildGroups...) + } + queue = newQueue + } + return &podGroupSpecIterator{pods, 0, len(pods)} +} + +// SetCellType sets cell type for all pods in pod group. +func (podRootGroup *PodGroupSpec) SetCellType(cellType string) { + for iter := podRootGroup.Iterator(); iter.HasNext(); { + iter.Next().CellsPerPod.CellType = api.CellType(cellType) + } +} + +// GetCurrentPod returns level traverse index and current pod in pod group. +func (obj *PodSchedulingSpec) GetCurrentPod() (int32, PodGroupMemberSpec) { + index := int32(0) + queue := []*PodGroupSpec{obj.PodRootGroup} + for len(queue) > 0 { + newQueue := []*PodGroupSpec{} + for _, podGroup := range queue { + for _, pod := range podGroup.Pods { + if pod.ContainsCurrentPod == true { + return index, pod + } + } + index++ + newQueue = append(newQueue, podGroup.ChildGroups...) + } + queue = newQueue + } + return int32(-1), PodGroupMemberSpec{} +} + +// ConvertFromV1 converts a v1 pod scheduling request to v2 spec. 
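// Illustrative sketch (editor's addition): converting a legacy v1 request with ConvertFromV1
// below; each v1 affinity group member becomes one child PodGroupSpec under the root group,
// and ContainsCurrentPod is set on the member whose leaf cell number matches the requesting
// pod's CellNumber.
func convertLegacySpec(v1Spec *api.PodSchedulingSpec) *PodSchedulingSpec {
	v2Spec := &PodSchedulingSpec{}
	v2Spec.ConvertFromV1(v1Spec)
	return v2Spec
}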
+func (obj *PodSchedulingSpec) ConvertFromV1(objV1 *api.PodSchedulingSpec) { + obj.Version = "v2" + obj.VirtualCluster = objV1.VirtualCluster + obj.Priority = objV1.Priority + obj.PinnedCellId = objV1.PinnedCellId + obj.CellType = objV1.LeafCellType + obj.CellNumber = objV1.LeafCellNumber + obj.GangReleaseEnable = objV1.GangReleaseEnable + obj.LazyPreemptionEnable = objV1.LazyPreemptionEnable + if objV1.AffinityGroup != nil { + childGroups := []*PodGroupSpec{} + for _, memberV1 := range objV1.AffinityGroup.Members { + member := &PodGroupMemberSpec{ + PodMinNumber: memberV1.PodNumber, + PodMaxNumber: memberV1.PodNumber, + CellsPerPod: PodGroupMemberCellSpec{ + CellType: api.CellType(obj.CellType), + CellNumber: memberV1.LeafCellNumber, + }, + ContainsCurrentPod: bool(obj.CellNumber == memberV1.LeafCellNumber), + } + childGroups = append(childGroups, &PodGroupSpec{Pod: member}) + } + obj.PodRootGroup = &PodGroupSpec{ + Name: objV1.AffinityGroup.Name, + ChildGroups: childGroups, + } + } +} + +// SetDefaults sets default values for PodSchedulingSpec. +func (obj *PodSchedulingSpec) SetDefaults(pod *core.Pod) { + if obj.PodRootGroup == nil { + obj.PodRootGroup = &PodGroupSpec{ + Name: fmt.Sprintf("%v/%v", pod.Namespace, pod.Name), + Pod: &PodGroupMemberSpec{ + PodMinNumber: 1, + PodMaxNumber: 1, + CellsPerPod: PodGroupMemberCellSpec{ + CellType: api.CellType(obj.CellType), + CellNumber: obj.CellNumber, + }, + ContainsCurrentPod: true, + }, + } + } +} + +// Validate checks whether PodSchedulingSpec is ok. +func (obj *PodSchedulingSpec) Validate() (msg string, ok bool) { + if obj.VirtualCluster == "" { + return "VirtualCluster is empty", false + } + if obj.Priority < api.OpportunisticPriority { + return fmt.Sprintf("Priority is less than %v", api.OpportunisticPriority), false + } + if obj.Priority > api.MaxGuaranteedPriority { + return fmt.Sprintf("Priority is greater than %v", api.MaxGuaranteedPriority), false + } + if obj.CellNumber <= 0 { + return "CellNumber is non-positive", false + } + if obj.PodRootGroup.Name == "" { + return "PodRootGroup.Name is empty", false + } + + isPodInGroup := false + queue := []*PodGroupSpec{obj.PodRootGroup} + for len(queue) > 0 { + newQueue := []*PodGroupSpec{} + for _, podGroup := range queue { + if podGroup.Pods != nil && len(podGroup.Pods) > 0 { + return "Do not support PodGroup.Pods field, please specify PodGroup.Pod", false + } + if p := podGroup.Pod; p != nil { + if p.PodMinNumber <= 0 { + return "PodGroup.Pod have non-positive PodMinNumber", false + } + if p.PodMaxNumber <= 0 { + return "PodGroup.Pod have non-positive PodMaxNumber", false + } + if p.CellsPerPod.CellNumber <= 0 { + return "PodGroup.Pod have non-positive CellsPerPod.CellNumber", false + } + if p.ContainsCurrentPod == true { + if isPodInGroup == false { + isPodInGroup = true + } else { + return "PodGroup.Pod have multiple ContainsCurrentPod", false + } + } + podGroup.Pods = []PodGroupMemberSpec{*p} + } + newQueue = append(newQueue, podGroup.ChildGroups...) + } + queue = newQueue + } + if !isPodInGroup { + return "PodGroup.Pod does not contain current Pod", false + } + return "", true +} + +type PodBindInfo struct { + Version string `yaml:"version"` // version of HiveD PodBindInfo, currently supports v1, v2. 
+ Node string `yaml:"node"` // k8s node name to bind + LeafCellIsolation []int32 `yaml:"leafCellIsolation"` // leaf cells for current pod's placement to bind + CellChain string `yaml:"cellChain"` // cell chain selected + PodRootGroupBindInfo *PodGroupBindInfo `yaml:"podRootGroupBindInfo"` // whole pod group bind info +} + +type PodGroupBindInfo struct { + PodPlacements []PodPlacementInfo `yaml:"podPlacements"` // pod placements in current group + ChildGroupBindingInfo []*PodGroupBindInfo `yaml:"childGroupBindingInfo"` // child pod group bind info +} + +type PodPlacementInfo struct { + PhysicalNode string `yaml:"physicalNode"` + PhysicalLeafCellIndices []int32 `yaml:"physicalLeafCellIndices"` + // preassigned cell types used by the pods. used to locate the virtual cells + // when adding an allocated pod + PreassignedCellTypes []api.CellType `yaml:"preassignedCellTypes"` +} + +type podGroupBindInfoIterator struct { + podPlacements []*PodPlacementInfo + index int + length int +} + +// Next returns the next item in iteration. +func (i *podGroupBindInfoIterator) Next() *PodPlacementInfo { + i.index++ + return i.podPlacements[i.index-1] +} + +// HasNext return true if iteration not finishes. +func (i *podGroupBindInfoIterator) HasNext() bool { + return i.index < i.length +} + +// Iterator returns a stateful iterator for PodGroupBindInfo +func (podRootGroupBindInfo *PodGroupBindInfo) Iterator(args ...int32) *podGroupBindInfoIterator { + index := int32(0) + podPlacements := []*PodPlacementInfo{} + queue := []*PodGroupBindInfo{podRootGroupBindInfo} + for len(queue) > 0 { + newQueue := []*PodGroupBindInfo{} + for _, podGroupBindInfo := range queue { + if len(args) == 1 && args[0] == index { + podPlacements = []*PodPlacementInfo{} + } + for podIndex := range podGroupBindInfo.PodPlacements { + podPlacements = append(podPlacements, &podGroupBindInfo.PodPlacements[podIndex]) + } + if len(args) == 1 && args[0] == index { + return &podGroupBindInfoIterator{podPlacements, 0, len(podPlacements)} + } + index++ + newQueue = append(newQueue, podGroupBindInfo.ChildGroupBindingInfo...) + } + queue = newQueue + } + return &podGroupBindInfoIterator{podPlacements, 0, len(podPlacements)} +} + +// ConvertFromV1 converts a v1 pod bind info to v2 spec. +func (obj *PodBindInfo) ConvertFromV1(objV1 *api.PodBindInfo) { + obj.Version = "v2" + obj.Node = objV1.Node + obj.LeafCellIsolation = append([]int32{}, objV1.LeafCellIsolation...) 
+ obj.CellChain = objV1.CellChain + obj.PodRootGroupBindInfo = &PodGroupBindInfo{ + PodPlacements: []PodPlacementInfo{}, + ChildGroupBindingInfo: []*PodGroupBindInfo{}, + } + for _, affinityGroupMemberBindInfo := range objV1.AffinityGroupBindInfo { + for _, podPlacementInfo := range affinityGroupMemberBindInfo.PodPlacements { + obj.PodRootGroupBindInfo.PodPlacements = + append(obj.PodRootGroupBindInfo.PodPlacements, PodPlacementInfo(podPlacementInfo)) + } + } +} + +type PodGroupList struct { + Items []PodGroup `json:"items"` +} + +type PodGroup struct { + api.ObjectMeta `json:"metadata"` + Status PodGroupStatus `json:"status"` +} diff --git a/pkg/internal/types.go b/pkg/internal/types.go index f8e0fd0..418e12c 100644 --- a/pkg/internal/types.go +++ b/pkg/internal/types.go @@ -24,7 +24,9 @@ package internal import ( "fmt" + si "github.com/microsoft/hivedscheduler/pkg/api" + apiv2 "github.com/microsoft/hivedscheduler/pkg/api/v2" core "k8s.io/api/core/v1" "k8s.io/apimachinery/pkg/types" ei "k8s.io/kubernetes/pkg/scheduler/api" @@ -46,8 +48,8 @@ type ExtenderHandlers struct { } type InspectHandlers struct { - GetAllAffinityGroupsHandler func() si.AffinityGroupList - GetAffinityGroupHandler func(groupName string) si.AffinityGroup + GetAllPodGroupsHandler func() apiv2.PodGroupList + GetPodGroupHandler func(groupName string) apiv2.PodGroup GetClusterStatusHandler func() si.ClusterStatus GetPhysicalClusterStatusHandler func() si.PhysicalClusterStatus GetAllVirtualClustersStatusHandler func() map[si.VirtualClusterName]si.VirtualClusterStatus @@ -91,8 +93,8 @@ type SchedulerAlgorithm interface { DeleteAllocatedPod(pod *core.Pod) // Expose current scheduling status - GetAllAffinityGroups() si.AffinityGroupList - GetAffinityGroup(name string) si.AffinityGroup + GetAllPodGroups() apiv2.PodGroupList + GetPodGroup(name string) apiv2.PodGroup GetClusterStatus() si.ClusterStatus GetPhysicalClusterStatus() si.PhysicalClusterStatus GetAllVirtualClustersStatus() map[si.VirtualClusterName]si.VirtualClusterStatus @@ -132,7 +134,7 @@ const ( type PodScheduleResult struct { PodWaitInfo *PodWaitInfo PodPreemptInfo *PodPreemptInfo - PodBindInfo *si.PodBindInfo + PodBindInfo *apiv2.PodBindInfo } // PodUID -> PodScheduleStatus diff --git a/pkg/internal/utils.go b/pkg/internal/utils.go index a480cde..dcc02f6 100644 --- a/pkg/internal/utils.go +++ b/pkg/internal/utils.go @@ -28,6 +28,7 @@ import ( "strings" si "github.com/microsoft/hivedscheduler/pkg/api" + apiv2 "github.com/microsoft/hivedscheduler/pkg/api/v2" "github.com/microsoft/hivedscheduler/pkg/common" core "k8s.io/api/core/v1" meta "k8s.io/apimachinery/pkg/apis/meta/v1" @@ -169,7 +170,7 @@ func IsNodeHealthy(node *core.Node) bool { return false } -func NewBindingPod(pod *core.Pod, podBindInfo *si.PodBindInfo) *core.Pod { +func NewBindingPod(pod *core.Pod, podBindInfo *apiv2.PodBindInfo) *core.Pod { bindingPod := pod.DeepCopy() bindingPod.Spec.NodeName = podBindInfo.Node @@ -197,8 +198,8 @@ func convertOldAnnotation(annotation string) string { } // PodBindInfo comes from internal, so just need to assert when deserialization. 
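// Illustrative sketch (editor's addition): ExtractPodBindInfo below decides between the v1 and
// v2 annotation formats by first deserializing only the "version" key through GeneralSpec,
// defaulting to v1 when the key is absent. detectAnnotationVersion is a hypothetical standalone
// version of that probe.
func detectAnnotationVersion(annotation string) string {
	generalSpec := si.GeneralSpec{"version": "v1"} // keys absent from the YAML keep their defaults
	common.FromYaml(annotation, &generalSpec)
	if version, ok := generalSpec["version"].(string); ok {
		return version
	}
	return "v1"
}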
-func ExtractPodBindInfo(allocatedPod *core.Pod) *si.PodBindInfo { - podBindInfo := si.PodBindInfo{} +func ExtractPodBindInfo(allocatedPod *core.Pod) *apiv2.PodBindInfo { + podBindInfo := apiv2.PodBindInfo{Version: "v2"} annotation := convertOldAnnotation(allocatedPod.Annotations[si.AnnotationKeyPodBindInfo]) if annotation == "" { @@ -207,7 +208,22 @@ func ExtractPodBindInfo(allocatedPod *core.Pod) *si.PodBindInfo { si.AnnotationKeyPodBindInfo)) } - common.FromYaml(annotation, &podBindInfo) + generalSpec := si.GeneralSpec{"version": "v1"} + common.FromYaml(annotation, &generalSpec) + switch generalSpec["version"] { + case "v1": + podBindInfoV1 := si.PodBindInfo{} + common.FromYaml(annotation, &podBindInfoV1) + // convert to v2 + podBindInfo.ConvertFromV1(&podBindInfoV1) + case "v2": + common.FromYaml(annotation, &podBindInfo) + default: + panic(fmt.Errorf( + "Pod contains unknown version %v in annotation: %v", + generalSpec["version"], si.AnnotationKeyPodBindInfo)) + } + return &podBindInfo } @@ -225,64 +241,82 @@ func ExtractPodBindAnnotations(allocatedPod *core.Pod) map[string]string { } } -// PodSchedulingSpec comes from external, so need more Defaulting and Validation -// when deserialization. -func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec { +// ExtractPodSchedulingSpec extracts pod scheduling request from k8s pod request. +// TODO: Need more defaulting and validation when deserialization, for example, +// check cell type hierarchies, cell type no higher than node level, cell number limit, etc. +func ExtractPodSchedulingSpec(pod *core.Pod) *apiv2.PodSchedulingSpec { // Consider all panics are BadRequestPanic. defer AsBadRequestPanic() errPfx := fmt.Sprintf("Pod annotation %v: ", si.AnnotationKeyPodSchedulingSpec) - podSchedulingSpec := si.PodSchedulingSpec{IgnoreK8sSuggestedNodes: true} - annotation := convertOldAnnotation(pod.Annotations[si.AnnotationKeyPodSchedulingSpec]) if annotation == "" { panic(fmt.Errorf(errPfx + "Annotation does not exist or is empty")) } - common.FromYaml(annotation, &podSchedulingSpec) - - // Defaulting - if podSchedulingSpec.AffinityGroup == nil { - podSchedulingSpec.AffinityGroup = &si.AffinityGroupSpec{ - Name: fmt.Sprintf("%v/%v", pod.Namespace, pod.Name), - Members: []si.AffinityGroupMemberSpec{{ - PodNumber: 1, - LeafCellNumber: podSchedulingSpec.LeafCellNumber}, - }, + podSchedulingSpec := apiv2.PodSchedulingSpec{Version: "v2"} + + generalSpec := si.GeneralSpec{"version": "v1"} + common.FromYaml(annotation, &generalSpec) + switch generalSpec["version"] { + case "v1": + podSchedulingSpecV1 := si.PodSchedulingSpec{IgnoreK8sSuggestedNodes: true} + common.FromYaml(annotation, &podSchedulingSpecV1) + // v1 Defaulting + if podSchedulingSpecV1.AffinityGroup == nil { + podSchedulingSpecV1.AffinityGroup = &si.AffinityGroupSpec{ + Name: fmt.Sprintf("%v/%v", pod.Namespace, pod.Name), + Members: []si.AffinityGroupMemberSpec{{ + PodNumber: 1, + LeafCellNumber: podSchedulingSpecV1.LeafCellNumber}, + }, + } } - } - - // Validation - if podSchedulingSpec.VirtualCluster == "" { - panic(fmt.Errorf(errPfx + "VirtualCluster is empty")) - } - if podSchedulingSpec.Priority < si.OpportunisticPriority { - panic(fmt.Errorf(errPfx+"Priority is less than %v", si.OpportunisticPriority)) - } - if podSchedulingSpec.Priority > si.MaxGuaranteedPriority { - panic(fmt.Errorf(errPfx+"Priority is greater than %v", si.MaxGuaranteedPriority)) - } - if podSchedulingSpec.LeafCellNumber <= 0 { - panic(fmt.Errorf(errPfx + "LeafCellNumber is non-positive")) - } - if 
podSchedulingSpec.AffinityGroup.Name == "" { - panic(fmt.Errorf(errPfx + "AffinityGroup.Name is empty")) - } - - isPodInGroup := false - for _, member := range podSchedulingSpec.AffinityGroup.Members { - if member.PodNumber <= 0 { - panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive PodNumber")) + // v1 Validation + if podSchedulingSpecV1.VirtualCluster == "" { + panic(fmt.Errorf(errPfx + "VirtualCluster is empty")) + } + if podSchedulingSpecV1.Priority < si.OpportunisticPriority { + panic(fmt.Errorf(errPfx+"Priority is less than %v", si.OpportunisticPriority)) + } + if podSchedulingSpecV1.Priority > si.MaxGuaranteedPriority { + panic(fmt.Errorf(errPfx+"Priority is greater than %v", si.MaxGuaranteedPriority)) } - if member.LeafCellNumber <= 0 { - panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive LeafCellNumber")) + if podSchedulingSpecV1.LeafCellNumber <= 0 { + panic(fmt.Errorf(errPfx + "LeafCellNumber is non-positive")) } - if member.LeafCellNumber == podSchedulingSpec.LeafCellNumber { - isPodInGroup = true + if podSchedulingSpecV1.AffinityGroup.Name == "" { + panic(fmt.Errorf(errPfx + "AffinityGroup.Name is empty")) } + isPodInGroup := false + for _, member := range podSchedulingSpecV1.AffinityGroup.Members { + if member.PodNumber <= 0 { + panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive PodNumber")) + } + if member.LeafCellNumber <= 0 { + panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive LeafCellNumber")) + } + if member.LeafCellNumber == podSchedulingSpecV1.LeafCellNumber { + isPodInGroup = true + } + } + if !isPodInGroup { + panic(fmt.Errorf(errPfx + "AffinityGroup.Members does not contains current Pod")) + } + // convert to v2 + podSchedulingSpec.ConvertFromV1(&podSchedulingSpecV1) + case "v2": + common.FromYaml(annotation, &podSchedulingSpec) + default: + panic(fmt.Errorf(errPfx+"Unknown version %v", generalSpec["version"])) } - if !isPodInGroup { - panic(fmt.Errorf(errPfx + "AffinityGroup.Members does not contains current Pod")) + + // Defaulting + podSchedulingSpec.SetDefaults(pod) + + // Validation + if msg, ok := podSchedulingSpec.Validate(); !ok { + panic(fmt.Errorf(errPfx + msg)) } return &podSchedulingSpec diff --git a/pkg/scheduler/scheduler.go b/pkg/scheduler/scheduler.go index 2a087a2..1e37f65 100644 --- a/pkg/scheduler/scheduler.go +++ b/pkg/scheduler/scheduler.go @@ -29,6 +29,7 @@ import ( "github.com/microsoft/hivedscheduler/pkg/algorithm" si "github.com/microsoft/hivedscheduler/pkg/api" + apiv2 "github.com/microsoft/hivedscheduler/pkg/api/v2" "github.com/microsoft/hivedscheduler/pkg/common" "github.com/microsoft/hivedscheduler/pkg/internal" "github.com/microsoft/hivedscheduler/pkg/webserver" @@ -181,8 +182,8 @@ func NewHivedScheduler() *HivedScheduler { PreemptHandler: s.preemptRoutine, }, internal.InspectHandlers{ - GetAllAffinityGroupsHandler: s.getAllAffinityGroups, - GetAffinityGroupHandler: s.getAffinityGroup, + GetAllPodGroupsHandler: s.getAllPodGroups, + GetPodGroupHandler: s.getPodGroup, GetClusterStatusHandler: s.getClusterStatus, GetPhysicalClusterStatusHandler: s.getPhysicalClusterStatus, GetAllVirtualClustersStatusHandler: s.getAllVirtualClustersStatus, @@ -383,7 +384,7 @@ func (s *HivedScheduler) generalScheduleAdmissionCheck( } func (s *HivedScheduler) validatePodBindInfo( - podBindInfo *si.PodBindInfo, suggestedNodes []string) error { + podBindInfo *apiv2.PodBindInfo, suggestedNodes []string) error { node := podBindInfo.Node // Check against existing nodes @@ -720,12 +721,12 @@ func (s 
*HivedScheduler) preemptRoutine(args ei.ExtenderPreemptionArgs) *ei.Exte } } -func (s *HivedScheduler) getAllAffinityGroups() si.AffinityGroupList { - return s.schedulerAlgorithm.GetAllAffinityGroups() +func (s *HivedScheduler) getAllPodGroups() apiv2.PodGroupList { + return s.schedulerAlgorithm.GetAllPodGroups() } -func (s *HivedScheduler) getAffinityGroup(name string) si.AffinityGroup { - return s.schedulerAlgorithm.GetAffinityGroup(name) +func (s *HivedScheduler) getPodGroup(name string) apiv2.PodGroup { + return s.schedulerAlgorithm.GetPodGroup(name) } func (s *HivedScheduler) getClusterStatus() si.ClusterStatus { diff --git a/pkg/webserver/webserver.go b/pkg/webserver/webserver.go index f8a0791..7a2d59c 100644 --- a/pkg/webserver/webserver.go +++ b/pkg/webserver/webserver.go @@ -26,16 +26,17 @@ import ( "context" "encoding/json" "fmt" + "net" + "net/http" + "strings" + "time" + si "github.com/microsoft/hivedscheduler/pkg/api" "github.com/microsoft/hivedscheduler/pkg/common" "github.com/microsoft/hivedscheduler/pkg/internal" "k8s.io/apimachinery/pkg/types" "k8s.io/klog" ei "k8s.io/kubernetes/pkg/scheduler/api" - "net" - "net/http" - "strings" - "time" ) const ( @@ -78,7 +79,7 @@ func NewWebServer(sConfig *si.Config, ws.route(si.FilterPath, ws.serve(ws.serveFilterPath)) ws.route(si.BindPath, ws.serve(ws.serveBindPath)) ws.route(si.PreemptPath, ws.serve(ws.servePreemptPath)) - ws.route(si.AffinityGroupsPath, ws.serve(ws.serveAffinityGroups)) + ws.route(si.PodGroupsPath, ws.serve(ws.servePodGroups)) ws.route(si.ClusterStatusPath, ws.serve(ws.serveClusterStatus)) ws.route(si.PhysicalClusterPath, ws.serve(ws.servePhysicalClusterStatus)) ws.route(si.VirtualClustersPath, ws.serve(ws.serveVirtualClustersStatus)) @@ -239,16 +240,16 @@ func (ws *WebServer) servePreemptPath(w http.ResponseWriter, r *http.Request) { w.Write(common.ToJsonBytes(ws.eHandlers.PreemptHandler(args))) } -func (ws *WebServer) serveAffinityGroups(w http.ResponseWriter, r *http.Request) { - name := strings.TrimPrefix(r.URL.Path, si.AffinityGroupsPath) +func (ws *WebServer) servePodGroups(w http.ResponseWriter, r *http.Request) { + name := strings.TrimPrefix(r.URL.Path, si.PodGroupsPath) if name == "" { if r.Method == http.MethodGet { - w.Write(common.ToJsonBytes(ws.iHandlers.GetAllAffinityGroupsHandler())) + w.Write(common.ToJsonBytes(ws.iHandlers.GetAllPodGroupsHandler())) return } } else { if r.Method == http.MethodGet { - w.Write(common.ToJsonBytes(ws.iHandlers.GetAffinityGroupHandler(name))) + w.Write(common.ToJsonBytes(ws.iHandlers.GetPodGroupHandler(name))) return } } diff --git a/test/config/group1/case1.yaml b/test/config/group1/case1.yaml new file mode 100644 index 0000000..70bf4b6 --- /dev/null +++ b/test/config/group1/case1.yaml @@ -0,0 +1,150 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + 
cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-NODE" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group2_pod2 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-NODE" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod2 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [1] +- method: SchedulePod + parameters: + podName: group2_pod3 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-NODE" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod3 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [2] +- method: SchedulePod + parameters: + podName: group2_pod4 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-NODE" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod4 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [3] diff --git a/test/config/group1/case2.yaml b/test/config/group1/case2.yaml new file mode 100644 index 0000000..bd514b2 --- /dev/null +++ b/test/config/group1/case2.yaml @@ -0,0 +1,150 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-SOCKET" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: 
[4] +- method: SchedulePod + parameters: + podName: group2_pod2 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-SOCKET" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod2 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [5] +- method: SchedulePod + parameters: + podName: group2_pod3 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-SOCKET" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod3 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [6] +- method: SchedulePod + parameters: + podName: group2_pod4 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-SOCKET" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod4 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [7] diff --git a/test/config/group1/case3.yaml b/test/config/group1/case3.yaml new file mode 100644 index 0000000..309d96e --- /dev/null +++ b/test/config/group1/case3.yaml @@ -0,0 +1,120 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [2, 3] +- method: SchedulePod + parameters: + podName: group2_pod2 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + 
cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod2 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [4, 5] +- method: SchedulePod + parameters: + podName: group2_pod3 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod3 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [6, 7] diff --git a/test/config/group1/case4.yaml b/test/config/group1/case4.yaml new file mode 100644 index 0000000..068443e --- /dev/null +++ b/test/config/group1/case4.yaml @@ -0,0 +1,150 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 2 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-NODE" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 2 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [0, 1] +- method: SchedulePod + parameters: + podName: group2_pod2 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 2 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-NODE" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 2 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod2 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [2, 3] +- method: SchedulePod + parameters: + podName: group2_pod3 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100 + cellNumber: 2 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-NODE" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 2 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod3 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [4, 5] +- method: SchedulePod + parameters: + podName: group2_pod4 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + 
cellType: V100 + cellNumber: 2 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-NODE" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + cellType: V100 + cellNumber: 2 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod4 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [6, 7] diff --git a/test/config/group1/case5.yaml b/test/config/group1/case5.yaml new file mode 100644 index 0000000..76bf06b --- /dev/null +++ b/test/config/group1/case5.yaml @@ -0,0 +1,90 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100-NODE + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-RACK" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-NODE + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.3 + leafCellIsolation: [0, 1, 2, 3, 4, 5, 6, 7] +- method: SchedulePod + parameters: + podName: group2_pod2 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100-NODE + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-RACK" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-NODE + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod2 + expectedResult: + node: 0.0.0.4 + leafCellIsolation: [0, 1, 2, 3, 4, 5, 6, 7] diff --git a/test/config/group1/case6.yaml b/test/config/group1/case6.yaml new file mode 100644 index 0000000..e24f918 --- /dev/null +++ b/test/config/group1/case6.yaml @@ -0,0 +1,60 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100-SOCKET + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "V100-SOCKET" + pod: + 
podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100-SOCKET + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [4, 5, 6, 7] diff --git a/test/config/group1/case7.yaml b/test/config/group1/case7.yaml new file mode 100644 index 0000000..6686b22 --- /dev/null +++ b/test/config/group1/case7.yaml @@ -0,0 +1,123 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group3_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group3 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group3_pod1 + expectedResult: + node: 0.0.0.3 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: DeallocatePod + parameters: + podName: group2_pod1 +- method: SchedulePod + parameters: + podName: group4_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100-NODE + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group4 + withinOneCell: "V100-RACK" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-NODE + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodPreemptResult + parameters: + podName: group4_pod1 + expectedResult: + victimPods: + - group1_pod1 diff --git a/test/config/group1/case8.yaml b/test/config/group1/case8.yaml new file mode 100644 index 0000000..438759d --- /dev/null +++ b/test/config/group1/case8.yaml @@ -0,0 +1,153 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- 
method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group3_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group3 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group3_pod1 + expectedResult: + node: 0.0.0.3 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: DeallocatePod + parameters: + podName: group2_pod1 +- method: SchedulePod + parameters: + podName: group4_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100-NODE + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group4 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-NODE + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group4_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [0, 1, 2, 3, 4, 5, 6, 7] +- method: SchedulePod + parameters: + podName: group4_pod2 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 2 + pinnedCellId: "" + cellType: V100-NODE + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group4 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-NODE + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group4_pod2 + expectedResult: + node: 0.0.0.4 + leafCellIsolation: [0, 1, 2, 3, 4, 5, 6, 7] diff --git a/test/config/group1/case9.yaml b/test/config/group1/case9.yaml new file mode 100644 index 0000000..c86cb3d --- /dev/null +++ b/test/config/group1/case9.yaml @@ -0,0 +1,120 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 100 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + 
priority: 100 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group3_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 100 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group3 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group3_pod1 + expectedResult: + node: 0.0.0.3 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: DeallocatePod + parameters: + podName: group2_pod1 +- method: SchedulePod + parameters: + podName: group4_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-NODE + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group4 + withinOneCell: "V100-RACK" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-NODE + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodWait + parameters: + podName: group4_pod1 diff --git a/test/config/group1/setting.yaml b/test/config/group1/setting.yaml new file mode 100644 index 0000000..8980075 --- /dev/null +++ b/test/config/group1/setting.yaml @@ -0,0 +1,34 @@ +kubeApiServerAddress: http://10.10.10.10:8080 + +physicalCluster: + + cellTypes: + V100-SWITCH: + childCellType: V100 + childCellNumber: 2 + V100-SOCKET: + childCellType: V100-SWITCH + childCellNumber: 2 + V100-NODE: + childCellType: V100-SOCKET + childCellNumber: 2 + isNodeLevel: true + V100-RACK: + childCellType: V100-NODE + childCellNumber: 2 + + physicalCells: + - cellType: V100-RACK + cellChildren: + - cellAddress: 0.0.0.1 + - cellAddress: 0.0.0.2 + - cellType: V100-RACK + cellChildren: + - cellAddress: 0.0.0.3 + - cellAddress: 0.0.0.4 + +virtualClusters: + VC1: + virtualCells: + - cellType: V100-RACK + cellNumber: 2 diff --git a/test/config/group2/case1.yaml b/test/config/group2/case1.yaml new file mode 100644 index 0000000..51ae73a --- /dev/null +++ b/test/config/group2/case1.yaml @@ -0,0 +1,237 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SOCKET + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: 
group2 + withinOneCell: "" + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SOCKET + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [0, 1, 2, 3] +- method: SchedulePod + parameters: + podName: group2_pod2 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SOCKET + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SOCKET + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod2 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [4, 5, 6, 7] +- method: SchedulePod + parameters: + podName: group2_pod3 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SOCKET + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SOCKET + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod3 + expectedResult: + node: 0.0.0.3 + leafCellIsolation: [0, 1, 2, 3] +- method: SchedulePod + parameters: + podName: group3_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group3 + withinOneCell: "V100-RACK" + pod: + podMinNumber: 13 + podMaxNumber: 13 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodWait + parameters: + podName: group3_pod1 +- method: SchedulePod + parameters: + podName: group4_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group4 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group4_pod1 + expectedResult: + node: 0.0.0.5 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group4_pod2 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 5 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group4 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100 + cellNumber: 5 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group4_pod2 + expectedResult: + node: 0.0.0.6 + leafCellIsolation: [0, 1, 2, 3, 4] +- method: SchedulePod + parameters: + podName: group5_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group5 + 
withinOneCell: "" + pod: + podMinNumber: 13 + podMaxNumber: 13 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group5_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [5] diff --git a/test/config/group2/case2.yaml b/test/config/group2/case2.yaml new file mode 100644 index 0000000..7edec5c --- /dev/null +++ b/test/config/group2/case2.yaml @@ -0,0 +1,60 @@ +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: K80 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 1.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: K80 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: K80 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 1.0.0.3 + leafCellIsolation: [0] diff --git a/test/config/group2/case3.yaml b/test/config/group2/case3.yaml new file mode 100644 index 0000000..9c0a852 --- /dev/null +++ b/test/config/group2/case3.yaml @@ -0,0 +1,96 @@ +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.4 +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.5 +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.6 +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.3 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC2 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group3_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SOCKET + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group3 + withinOneCell: "" + pod: + podMinNumber: 4 + podMaxNumber: 4 + cellsPerPod: + 
cellType: V100-SOCKET + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodWait + parameters: + podName: group3_pod1 diff --git a/test/config/group2/setting.yaml b/test/config/group2/setting.yaml new file mode 100644 index 0000000..b6e8a6f --- /dev/null +++ b/test/config/group2/setting.yaml @@ -0,0 +1,68 @@ +kubeApiServerAddress: http://10.10.10.10:8080 + +physicalCluster: + + cellTypes: + V100-SWITCH: + childCellType: V100 + childCellNumber: 2 + V100-SOCKET: + childCellType: V100-SWITCH + childCellNumber: 2 + V100-NODE: + childCellType: V100-SOCKET + childCellNumber: 2 + isNodeLevel: true + V100-RACK: + childCellType: V100-NODE + childCellNumber: 2 + + K80-SWITCH: + childCellType: K80 + childCellNumber: 2 + K80-SOCKET: + childCellType: K80-SWITCH + childCellNumber: 2 + K80-NODE: + childCellType: K80-SOCKET + childCellNumber: 2 + isNodeLevel: true + K80-RACK: + childCellType: K80-NODE + childCellNumber: 2 + + physicalCells: + - cellType: V100-RACK + cellChildren: + - cellAddress: 0.0.0.1 + - cellAddress: 0.0.0.2 + - cellType: V100-RACK + cellChildren: + - cellAddress: 0.0.0.3 + - cellAddress: 0.0.0.4 + - cellType: V100-RACK + cellChildren: + - cellAddress: 0.0.0.5 + - cellAddress: 0.0.0.6 + - cellType: K80-RACK + cellChildren: + - cellAddress: 1.0.0.1 + - cellAddress: 1.0.0.2 + - cellType: K80-RACK + cellChildren: + - cellAddress: 1.0.0.3 + - cellAddress: 1.0.0.4 + +virtualClusters: + VC1: + virtualCells: + - cellType: V100-RACK + cellNumber: 2 + - cellType: K80-RACK + cellNumber: 1 + VC2: + virtualCells: + - cellType: V100-RACK + cellNumber: 1 + - cellType: K80-RACK + cellNumber: 1 diff --git a/test/config/group3/case1.yaml b/test/config/group3/case1.yaml new file mode 100644 index 0000000..9b7e176 --- /dev/null +++ b/test/config/group3/case1.yaml @@ -0,0 +1,149 @@ +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.5 +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.6 +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.7 +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: V100-RACK + pod: + podMinNumber: 11 + podMaxNumber: 11 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: group2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group2 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group2_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [3] +- method: SchedulePod + parameters: + podName: group3_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group3 + withinOneCell: "" + pod: + podMinNumber: 1 + podMaxNumber: 1 + cellsPerPod: 
+ cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group3_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [4] +- method: DeallocatePod + parameters: + podName: group2_pod1 +- method: SchedulePod + parameters: + podName: multi_cascade_subgroup1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: multi_cascade + withinOneCell: V100-RACK + childGroups: + - name: multi_cascade_subgroup1 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true + - name: multi_cascade_subgroup2 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup3 + withinOneCell: V100-NODE + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false +- method: AssertPodWait + parameters: + podName: multi_cascade_subgroup1_pod1 diff --git a/test/config/group3/case2.yaml b/test/config/group3/case2.yaml new file mode 100644 index 0000000..8150a94 --- /dev/null +++ b/test/config/group3/case2.yaml @@ -0,0 +1,189 @@ +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.5 +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.6 +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.7 +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: V100-RACK + pod: + podMinNumber: 11 + podMaxNumber: 11 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: multi_cascade_subgroup1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: multi_cascade + withinOneCell: V100-RACK + childGroups: + - name: multi_cascade_subgroup1 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true + - name: multi_cascade_subgroup2 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup3 + withinOneCell: V100-NODE + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false +- method: AssertPodBindResult + parameters: + podName: multi_cascade_subgroup1_pod1 + expectedResult: + node: 0.0.0.3 + leafCellIsolation: [0, 1] +- method: SchedulePod + parameters: + podName: multi_cascade_subgroup2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + 
gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: multi_cascade + withinOneCell: V100-RACK + childGroups: + - name: multi_cascade_subgroup1 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup2 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true + - name: multi_cascade_subgroup3 + withinOneCell: V100-NODE + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false +- method: AssertPodBindResult + parameters: + podName: multi_cascade_subgroup2_pod1 + expectedResult: + node: 0.0.0.4 + leafCellIsolation: [0, 1] +- method: SchedulePod + parameters: + podName: multi_cascade_subgroup3_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: multi_cascade + withinOneCell: V100-RACK + childGroups: + - name: multi_cascade_subgroup1 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup2 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup3 + withinOneCell: V100-NODE + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true +- method: AssertPodBindResult + parameters: + podName: multi_cascade_subgroup3_pod1 + expectedResult: + node: 0.0.0.2 + leafCellIsolation: [4, 5] \ No newline at end of file diff --git a/test/config/group3/case3.yaml b/test/config/group3/case3.yaml new file mode 100644 index 0000000..5c2d57c --- /dev/null +++ b/test/config/group3/case3.yaml @@ -0,0 +1,240 @@ +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.5 +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.6 +- method: SetNodeToBad + parameters: + nodeName: 0.0.0.7 +- method: SchedulePod + parameters: + podName: group1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100 + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: group1 + withinOneCell: V100-RACK + pod: + podMinNumber: 15 + podMaxNumber: 15 + cellsPerPod: + cellType: V100 + cellNumber: 1 + containsCurrentPod: true + childGroups: [] +- method: AssertPodBindResult + parameters: + podName: group1_pod1 + expectedResult: + node: 0.0.0.1 + leafCellIsolation: [0] +- method: SchedulePod + parameters: + podName: multi_cascade_subgroup1_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: multi_cascade + withinOneCell: V100-RACK + childGroups: + - name: multi_cascade_subgroup1 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true + - name: multi_cascade_subgroup2 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + 
podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup3 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false +- method: AssertPodBindResult + parameters: + podName: multi_cascade_subgroup1_pod1 + expectedResult: + node: 0.0.0.3 + leafCellIsolation: [0, 1] +- method: SchedulePod + parameters: + podName: multi_cascade_subgroup2_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: multi_cascade + withinOneCell: V100-RACK + childGroups: + - name: multi_cascade_subgroup1 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup2 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true + - name: multi_cascade_subgroup3 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false +- method: AssertPodBindResult + parameters: + podName: multi_cascade_subgroup2_pod1 + expectedResult: + node: 0.0.0.4 + leafCellIsolation: [0, 1] +- method: SchedulePod + parameters: + podName: multi_cascade_subgroup3_pod1 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: multi_cascade + withinOneCell: V100-RACK + childGroups: + - name: multi_cascade_subgroup1 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup2 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup3 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true +- method: AssertPodBindResult + parameters: + podName: multi_cascade_subgroup3_pod1 + expectedResult: + node: 0.0.0.3 + leafCellIsolation: [6, 7] +- method: SchedulePod + parameters: + podName: multi_cascade_subgroup3_pod2 + phase: "Preempting" + podGroupSchedulingRequest: + version: v2 + virtualCluster: VC1 + priority: 1 + pinnedCellId: "" + cellType: V100-SWITCH + cellNumber: 1 + gangReleaseEnable: false + lazyPreemptionEnable: false + podRootGroup: + name: multi_cascade + withinOneCell: V100-RACK + childGroups: + - name: multi_cascade_subgroup1 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup2 + withinOneCell: V100-NODE + pod: + podMinNumber: 3 + podMaxNumber: 3 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: false + - name: multi_cascade_subgroup3 + withinOneCell: "" + pod: + podMinNumber: 2 + podMaxNumber: 2 + cellsPerPod: + cellType: V100-SWITCH + cellNumber: 1 + containsCurrentPod: true +- method: AssertPodBindResult + parameters: + 
podName: multi_cascade_subgroup3_pod2 + expectedResult: + node: 0.0.0.4 + leafCellIsolation: [6, 7] + \ No newline at end of file diff --git a/test/config/group3/setting.yaml b/test/config/group3/setting.yaml new file mode 100644 index 0000000..0456504 --- /dev/null +++ b/test/config/group3/setting.yaml @@ -0,0 +1,38 @@ +kubeApiServerAddress: http://10.10.10.10:8080 + +physicalCluster: + + cellTypes: + V100-SWITCH: + childCellType: V100 + childCellNumber: 2 + V100-SOCKET: + childCellType: V100-SWITCH + childCellNumber: 2 + V100-NODE: + childCellType: V100-SOCKET + childCellNumber: 2 + isNodeLevel: true + V100-RACK: + childCellType: V100-NODE + childCellNumber: 4 + + physicalCells: + - cellType: V100-RACK + cellChildren: + - cellAddress: 0.0.0.1 + - cellAddress: 0.0.0.2 + - cellAddress: 0.0.0.3 + - cellAddress: 0.0.0.4 + - cellType: V100-RACK + cellChildren: + - cellAddress: 0.0.0.5 + - cellAddress: 0.0.0.6 + - cellAddress: 0.0.0.7 + - cellAddress: 0.0.0.8 + +virtualClusters: + VC1: + virtualCells: + - cellType: V100-RACK + cellNumber: 2
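
Note (editor's sketch, not part of the patch): the new v2 test fixtures above all share one request shape. The annotated YAML below is a minimal, hypothetical example inferred from group1/case1.yaml and group1/setting.yaml; the pod and group names are invented, and the field comments are assumptions drawn from how the fixtures use them, not authoritative schema documentation. It illustrates how a gang constrained by withinOneCell maps onto the V100-RACK > V100-NODE > V100-SOCKET > V100-SWITCH > V100 hierarchy (8 leaf V100 cells per node in group1/setting.yaml).

# Illustrative only -- not part of the diff above.
- method: SchedulePod                 # drive one scheduling decision for a single pod
  parameters:
    podName: example_pod1             # hypothetical pod name
    phase: "Preempting"               # scheduling phase exercised by these cases
    podGroupSchedulingRequest:
      version: v2                     # new PodGroup (v2) request format
      virtualCluster: VC1
      priority: 1
      pinnedCellId: ""                # empty: no pinned cell
      cellType: V100                  # leaf cell type requested
      cellNumber: 1
      gangReleaseEnable: false
      lazyPreemptionEnable: false
      podRootGroup:                   # gang-scheduled pod group (may nest childGroups)
        name: example_group           # hypothetical group name
        withinOneCell: "V100-NODE"    # constrain the whole group to a single V100-NODE cell
        pod:
          podMinNumber: 4             # gang size: all 4 pods must be placed together
          podMaxNumber: 4
          cellsPerPod:
            cellType: V100
            cellNumber: 1             # each pod gets one V100 leaf cell
          containsCurrentPod: true    # the pod being scheduled belongs to this group
        childGroups: []               # no nested sub-groups in this sketch
- method: AssertPodBindResult         # assert the expected placement
  parameters:
    podName: example_pod1
    expectedResult:
      node: 0.0.0.2                   # as in case1: with group1 holding 0.0.0.1, the gang packs onto 0.0.0.2
      leafCellIsolation: [0]          # leaf V100 indices assigned on that node (0..7 per node)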