From 3bc2f1fe72982e5c8dc4b28eba3422a6e3b06085 Mon Sep 17 00:00:00 2001
From: "Kobayashi, Daisuke"
Date: Fri, 20 Dec 2024 16:38:34 +0900
Subject: [PATCH] KEP-5007: DRA Device Attach Before Pod Scheduled

---
 .../README.md | 1043 +++++++++++++++++
 .../kep.yaml  |   46 +
 2 files changed, 1089 insertions(+)
 create mode 100644 keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
 create mode 100644 keps/sig-scheduling/5007-device-attach-before-pod-scheduled/kep.yaml

diff --git a/keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md b/keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
new file mode 100644
index 00000000000..e72280c67fa
--- /dev/null
+++ b/keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
@@ -0,0 +1,1043 @@
+
+# [KEP-5007](https://github.com/kubernetes/enhancements/issues/5007): DRA Device Attach Before Pod Scheduled
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories (Optional)](#user-stories-optional)
+    - [Story 1](#story-1)
+    - [Story 2](#story-2)
+  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [DRA scheduler plugin Design overview](#dra-scheduler-plugin-design-overview)
+  - [Composable Controller Design Overview](#composable-controller-design-overview)
+    - [Proposal 1: The composable controller publishes ResourceSlices with NodeName set within the pool](#proposal-1-the-composable-controller-publishes-resourceslices-with-nodename-set-within-the-pool)
+    - [Proposal 2: Attached devices are published by the vendor's plugin](#proposal-2-attached-devices-are-published-by-the-vendors-plugin)
+  - [Test Plan](#test-plan)
+      - [Prerequisite testing updates](#prerequisite-testing-updates)
+      - [Unit tests](#unit-tests)
+      - [Integration tests](#integration-tests)
+      - [e2e tests](#e2e-tests)
+  - [Graduation Criteria](#graduation-criteria)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation (e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes)
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+To manage fabric devices efficiently, we propose extending the Kubernetes scheduler's DRA plugin. Fabric devices are devices that are not directly connected to a server and must be attached to it before use.
+
+In the current DRA implementation, fabric devices are attached after the scheduling decision, which leads to the following issue:
+
+Fabric devices may be contested by other clusters. When attachment occurs after scheduling, the device may no longer be attachable at that point, leaving the container stuck in the `ContainerCreating` state.
+
+To address this issue, we propose a feature that allows the DRA scheduler plugin to wait for the device to be attached.
+
+## Motivation
+
+As AI and ML workloads become common in container (Kubernetes) environments, the demand for computational resources keeps growing. At the same time, a sustainable society requires better energy efficiency. Infrastructure is therefore expected to meet two conflicting requirements at once: higher performance and lower power consumption. Recently, a new server architecture called Composable Disaggregated Infrastructure has emerged.
+
+In a traditional server, hardware resources such as CPUs, memory, and GPUs reside within the server.
+Composable Disaggregated Infrastructure decomposes these hardware resources and makes them available as resource pools. These resources can be combined by software definition to create custom-made servers.
+
+A composable system consists of a resource pool and Composable Manager software. In the resource pool, all components are connected to PCIe or CXL switches. The Composable Manager controls the switches to create composed bare-metal servers by software definition. It exposes a Composable API, which an Operator or Kubernetes may call. Once a composed bare-metal server is created, the user can install any operating system or container infrastructure on it.
+
+This flexibility extends further with the use of fabric devices. Fabric devices can be used by multiple Kubernetes clusters, not just a single one. Each cluster exposes the device as a ResourceSlice, which allows the device to be utilized efficiently.
+
+In this scenario, ResourceSlices representing the same fabric device might be selected in multiple Kubernetes clusters simultaneously. If the attachment then fails in one cluster, the pod remains in a failed state on the kubelet.
+
+By having the scheduler wait for the fabric device to be attached, we can reschedule the pod if the attachment fails. This approach is preferable because it avoids unnecessary waiting and allows immediate rescheduling.
+
+### Goals
+
+1. **Enhance the DRA Scheduling Process**:
+Implement a feature that allows the scheduling process to wait for the completion of fabric device attachment. This ensures that pods are scheduled only once the necessary fabric devices are successfully attached, improving reliability and efficiency.
+
+2. **Attribute Information for Fabric Devices**:
+Add attribute information that clearly distinguishes fabric devices requiring attachment. This helps to accurately identify and manage these devices within the Kubernetes environment.
+
+3. **Prioritize Device Allocation**:
+Implement a prioritization mechanism for device allocation that favors devices directly connected to the node over attached fabric devices. This hierarchy ensures optimal performance and resource utilization. For example, the order of preference would be: Node-local devices > Attached fabric devices > Pre-attached fabric devices.
+
+### Non-Goals
+
+## Proposal
+
+The basic idea is as follows:
+
+1. **Add a Ready Flag to ResourceClaim**:
+   - Add a flag to `ResourceClaim` that indicates the readiness state of the device. The `PreBind` phase is held until this flag is set to "Ready".
+
+2. **Wait for Device Attachment Completion in the PreBind() Process**:
+   The overall flow of the PreBind() process is as follows:
+
+   - **Update ResourceClaim**:
+     - The scheduler updates the `ResourceClaim` to notify the vendor's driver that the device needs to be prepared. This step is the same as in the existing `PreBind`.
+     - After the `ResourceClaim` is updated, if the flag is set to "Preparing", the `PreBind` phase does not complete until the flag is set to "Ready".
+
+   - **Monitoring and Preparation by Composable DRA Controllers**:
+     - Composable DRA Controllers monitor the `ResourceClaim`. If a `ResourceSlice` that requires preparation is associated with the `ResourceClaim`, they perform the necessary preparations. Once the preparation is complete, they set the flag to "Ready".
+
+   - **Completion of the PreBind Phase**:
+     - Once the flag is set to "Ready", the `PreBind` phase completes and the scheduler proceeds to the next step.
+
+### User Stories (Optional)
+
+#### Story 1
+
+#### Story 2
+
+### Notes/Constraints/Caveats (Optional)
+
+### Risks and Mitigations
+
+## Design Details
+
+### DRA scheduler plugin Design overview
+
+Add a flag to the `Device` within `ResourceSlice` to indicate whether it represents a fabric device.
+The controller that publishes the `ResourceSlice` uses this flag to indicate whether the device is a fabric device. To avoid impacting existing DRA functionality, the default value of this flag is `false`.
+
+```go
+// Device represents one individual hardware instance that can be selected based
+// on its attributes. Besides the name, exactly one field must be set.
+type Device struct {
+	// Name is unique identifier among all devices managed by
+	// the driver in the pool. It must be a DNS label.
+	//
+	// +required
+	Name string
+
+	// Basic defines one device instance.
+	//
+	// +optional
+	// +oneOf=deviceType
+	Basic *BasicDevice
+
+	// FabricDevice represents whether this device is a fabric device or not.
+	// If true, it indicates that the device is connected via a fabric network.
+	// This flag helps in distinguishing fabric devices from other types of devices.
+	//
+	// +optional
+	FabricDevice string
+}
+```
+
+**Additions to the DRA Scheduler Plugin**
+
+In the current implementation, the `PreBind` phase waits until the `ResourceClaim` update is completed. This proposal adds functionality that blocks the completion of the `PreBind` phase until the device is attached, if the `ResourceClaim` includes a fabric device.
+
+To communicate the completion of fabric device attachment to the scheduler, a flag will be added to the `Status` of the `ResourceClaim`.
+
+```go
+// AllocatedDeviceStatus contains the status of an allocated device, if the
+// driver chooses to report it. This may include driver-specific information.
+type AllocatedDeviceStatus struct {
+...
+	// DeviceAttached represents whether the device has been successfully attached.
+	//
+	// +optional
+	DeviceAttached string
+}
+```
+
+This addition ensures that the scheduler proceeds only once the necessary fabric devices are properly attached, which improves the reliability and efficiency of the scheduling process.
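
The waiting behavior described above could look roughly like the following. This is a minimal, self-contained sketch and not the scheduler plugin's actual code: the `AllocatedDeviceStatus` type here is a trimmed-down stand-in for the proposed API type, the constants `devicePreparing` and `deviceReady` and the helpers `allAttached` and `waitForAttachment` are hypothetical names, and treating an empty `DeviceAttached` value as "no waiting needed" is an assumption made for illustration only.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical flag values. The proposal text mentions "Preparing" and
// "Ready" but does not define constants for them.
const (
	devicePreparing = "Preparing"
	deviceReady     = "Ready"
)

// AllocatedDeviceStatus is a trimmed-down stand-in for the proposed API type.
type AllocatedDeviceStatus struct {
	Device         string
	DeviceAttached string
}

// allAttached reports whether no device in the claim is still waiting for
// attachment. Assumption: an empty flag means a non-fabric device.
func allAttached(devices []AllocatedDeviceStatus) bool {
	for _, d := range devices {
		if d.DeviceAttached != "" && d.DeviceAttached != deviceReady {
			return false
		}
	}
	return true
}

// waitForAttachment polls getStatus until every device is attached or ctx
// is cancelled. A PreBind implementation could return the error so that
// the pod is rescheduled instead of waiting on the node.
func waitForAttachment(ctx context.Context, getStatus func() []AllocatedDeviceStatus, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if allAttached(getStatus()) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("fabric device not attached: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	start := time.Now()
	// Simulated controller: device1 becomes Ready after 30ms.
	getStatus := func() []AllocatedDeviceStatus {
		flag := devicePreparing
		if time.Since(start) > 30*time.Millisecond {
			flag = deviceReady
		}
		return []AllocatedDeviceStatus{{Device: "device1", DeviceAttached: flag}}
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println("attach error:", waitForAttachment(ctx, getStatus, 5*time.Millisecond))
}
```

A real plugin would more likely watch the `ResourceClaim` through an informer than poll, and the timeout would need to be chosen so that a failed attachment leads to rescheduling within a reasonable time.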
+
+To facilitate the discussion on this KEP, we would like to share the design of the composable controller that we are considering as a component that uses the fabric-oriented scheduler function. We believe that sharing it will deepen the discussion on the optimal implementation of the scheduler function. We would also like to verify whether the controller design matches the DRA design.
+
+### Composable Controller Design Overview
+
+Our controller's philosophy is to utilize fabric devices efficiently. Therefore, we prefer to allocate devices directly connected to the node over attached fabric devices. (e.g., Node-local devices > Attached fabric devices > Pre-attached fabric devices)
+
+This design aims to utilize fabric devices efficiently while prioritizing node-local devices to improve performance. The composable controller manages fabric devices that can be attached and detached. Therefore, it publishes a list of fabric devices as ResourceSlices.
+
+The structure we are considering is as follows:
+
+```yaml
+# composable controller publishes this pool
+kind: ResourceSlice
+pool: composable-device
+driver: gpu.nvidia.com
+nodeSelector: fabric1
+devices:
+  - name: device1
+    ...
+  - name: device2
+    ...
+```
+
+The vendor's DRA kubelet plugin will also publish the devices managed by the vendor as ResourceSlices.
+
+```yaml
+# vendor DRA kubelet plugin publishes this pool
+kind: ResourceSlice
+pool: Node1
+driver: gpu.nvidia.com
+nodeName: Node1
+devices:
+  - name: device3
+    ...
+```
+
+Here, when the scheduler selects the fabric device `device1`, it waits for the attachment of the fabric device during PreBind. The composable controller checks the flag of the ResourceClaim and performs the attachment operation. After a successful attachment, the composable controller updates the flag of the ResourceClaim.
+
+We are considering the following two methods for handling ResourceSlices upon completion of the attachment.
+We would like to hear opinions on these two composable controller proposals and on their feasibility.
+
+#### Proposal 1: The composable controller publishes ResourceSlices with NodeName set within the pool
+
+Multiple ResourceSlices are published with the same pool name. One lists the devices that remain in the fabric, and the other lists the devices attached to the node.
+
+```yaml
+# composable controller publishes this pool
+kind: ResourceSlice
+pool: composable-device
+driver: gpu.nvidia.com
+nodeSelector: fabric1
+devices:
+  - name: device2
+    ...
+---
+kind: ResourceSlice
+pool: composable-device
+driver: gpu.nvidia.com
+nodeName: Node1
+devices:
+  - name: device1
+    ...
+```
+
+If the vendor's plugin responds to hotplug, `device1` will also appear in the ResourceSlice published by the vendor.
+
+```yaml
+# vendor DRA kubelet plugin publishes this pool
+kind: ResourceSlice
+pool: Node1
+driver: gpu.nvidia.com
+nodeName: Node1
+devices:
+  - name: device3
+    ...
+  - name: device1
+    ...
+```
+
+This may cause device duplication between ResourceSlices. To prevent multiple ResourceSlices from publishing the same device, we plan to define a deny list and standardize it with DRA.
+
+**Advantages**
+- No need for the scheduler or the composable controller to change the allocationResult.
+- Attached fabric devices can be distinguished, so prioritization is maintained.
+
+**Disadvantages**
+- ResourceSlices created by the composable controller may not be understood by the vendor kubelet plugin. (NVIDIA drivers use internal information, so cooperation is needed)
+- Attached and unattached fabric devices are mixed in one pool. (https://github.com/kubernetes/kubernetes/issues/124042#issuecomment-2527279157)
+- A mechanism to prevent device duplication is needed (e.g., deny list).
+
+#### Proposal 2: Attached devices are published by the vendor's plugin
+
+In this case, attached devices are removed from the composable-device pool.
+
+```yaml
+# composable controller publishes this pool
+kind: ResourceSlice
+pool: composable-device
+driver: gpu.nvidia.com
+nodeSelector: fabric1
+devices:
+  - name: device2
+    ...
+```
+
+If the vendor's plugin responds to hotplug, `device1` will appear in the ResourceSlice published by the vendor.
+
+```yaml
+# vendor DRA kubelet plugin publishes this pool
+kind: ResourceSlice
+pool: Node1
+driver: gpu.nvidia.com
+nodeName: Node1
+devices:
+  - name: device3
+    ...
+  - name: device1
+    ...
+```
+
+This breaks the linkage between the ResourceClaim and the ResourceSlice. Therefore, the AllocationResult of the ResourceClaim must be modified.
+
+**Advantages**
+- Simplifies device management.
+- Centralizes management, because the vendor's plugin publishes the devices directly.
+- No need for mechanisms to prevent device duplication (e.g., deny list).
+
+**Disadvantages**
+- Attached fabric devices cannot be distinguished, which makes prioritization difficult.
+- The linkage between ResourceClaim and ResourceSlice must be modified. This is expected to be done by the scheduler or the DRA controller; which of the two is more appropriate is an open question.
+- Until the linkage is fixed, a device that is already in use may be published in a ResourceSlice and reserved by other Pods.
+
+### Test Plan
+
+[ ] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+##### Integration tests
+
+##### e2e tests
+
+### Graduation Criteria
+
+### Upgrade / Downgrade Strategy
+
+### Version Skew Strategy
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [ ] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name:
+  - Components depending on the feature gate:
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
+
+###### Does enabling the feature change any default behavior?
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+###### Are there any tests for feature enablement/disablement?
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+###### What specific metrics should inform a rollback?
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [ ] Events
+  - Event Reason:
+- [ ] API .status
+  - Condition name:
+  - Other field:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [ ] Metrics
+  - Metric name:
+  - [Optional] Aggregation method:
+  - Components exposing the metric:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+###### Will enabling / using this feature result in introducing new API types?
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+## Drawbacks
+
+## Alternatives
+
+## Infrastructure Needed (Optional)
+
diff --git a/keps/sig-scheduling/5007-device-attach-before-pod-scheduled/kep.yaml b/keps/sig-scheduling/5007-device-attach-before-pod-scheduled/kep.yaml
new file mode 100644
index 00000000000..de10d155921
--- /dev/null
+++ b/keps/sig-scheduling/5007-device-attach-before-pod-scheduled/kep.yaml
@@ -0,0 +1,46 @@
+title: DRA Device Attach Before Pod Scheduled
+kep-number: 5007
+authors:
+  - "@KobayashiD27"
+owning-sig: sig-scheduling
+#participating-sigs:
+#  - sig-aaa
+#  - sig-bbb
+status: provisional
+#|implementable|implemented|deferred|rejected|withdrawn|replaced
+creation-date: 2024-12-20
+reviewers:
+  - "@pohly"
+approvers:
+  - TBD
+
+see-also:
+  - "/keps/sig-node/4381-dra-structured-parameters"
+  - https://github.com/kubernetes/kubernetes/issues/124042#issuecomment-2548068135
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+#|beta|stable
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.33"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.33"
+  beta: "v1.34"
+  stable: "v1.35"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: DRADeviceAttachBeforePodScheduled
+    components:
+      - kube-scheduler
+disable-supported: true
+
+# The following PRR answers are required at beta release
+metrics:
+#  - my_feature_metric