-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatically inject GPU resources and node selector for NIM deployment #134
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Shiva Krishna, Merla <[email protected]>
Signed-off-by: Shiva Krishna, Merla <[email protected]>
Signed-off-by: Shiva Krishna, Merla <[email protected]>
cafc346
to
d94c1a3
Compare
// Reusable error handler to log and return errors | ||
handleError := func(err error, reason string, resource string) (ctrl.Result, error) { | ||
logger.Error(err, fmt.Sprintf("Failed to reconcile %s", resource), "resource", resource) | ||
return ctrl.Result{}, err | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is not in line with Commit message or PR description
// Setup PVC for model store | ||
modelPVC, err := r.setupPVC(ctx, nimService) | ||
if err != nil { | ||
return handleError(err, "Failed to setup PVC", "PVC") | ||
} | ||
|
||
// Setup deployment parameters | ||
deploymentParams := nimService.GetDeploymentParams() | ||
if err := r.setupDeploymentParams(ctx, nimService, modelPVC, deploymentParams); err != nil { | ||
return handleError(err, "Failed to setup deployment parameters", "DeploymentParams") | ||
} | ||
|
||
// Sync deployment | ||
err = r.renderAndSyncResource(ctx, nimService, &renderer, &appsv1.Deployment{}, func() (client.Object, error) { | ||
return renderer.Deployment(deploymentParams) | ||
}, "deployment", conditions.ReasonDeploymentFailed) | ||
if err != nil { | ||
return ctrl.Result{}, err | ||
} | ||
|
||
// Wait for deployment readiness | ||
msg, ready, err := r.isDeploymentReady(ctx, &namespacedName) | ||
if err != nil { | ||
return handleError(err, "Failed to check deployment readiness", "Deployment") | ||
} | ||
|
||
// Update status | ||
if !ready { | ||
if err := r.updater.SetConditionsNotReady(ctx, nimService, conditions.NotReady, msg); err != nil { | ||
return handleError(err, "Failed to update NotReady status", "Status") | ||
} | ||
} else { | ||
if err := r.updater.SetConditionsReady(ctx, nimService, conditions.Ready, msg); err != nil { | ||
return handleError(err, "Failed to update Ready status", "Status") | ||
} | ||
} | ||
|
||
return ctrl.Result{}, nil | ||
} | ||
|
||
func (r *NIMServiceReconciler) setupPVC(ctx context.Context, nimService *appsv1alpha1.NIMService) (*appsv1alpha1.PersistentVolumeClaim, error) { | ||
logger := log.FromContext(ctx) | ||
|
||
// Initialize variable for the PVC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is not in line with Commit message or PR description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactoring was necessary as we had to customize deployment params for multiple reasons now. (PVC, node selector, resources etc). Earlier all were done in reconcileNIMService call as only PVC was customized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not saying I am against it, just please have it as a separate commit or a separate PR all together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, let me refactor this into TP change and one for node selector itself.
@@ -0,0 +1,81 @@ | |||
/* | |||
Copyright 2024. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copyright 2024. | |
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. |
} | ||
|
||
if !crdExists { | ||
return false, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return a new error saying that NFD CRD does not exist instead of nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the intent of this function is to check if a specific NodeFeatureRule CR exists, so if the CRD itself is missing we can assume CR is not present as well. Hence error is not thrown in this case.
// TODO: Make the resource name configurable | ||
const gpuResourceName = corev1.ResourceName("nvidia.com/gpu") | ||
|
||
deploymentParams.Resources.Requests[corev1.ResourceName(gpuResourceName)] = gpuQuantity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the customer set a gpu resource request / limit in the nim service spec? (I believe yes) and this logic may overwrite the value, is it correct? Also, do we have unit test coverage for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if a profile is selected, then we populate the number of GPUs and schedule them to matching GPU type. For generic profile selection, they can specify resources to be consumed.
|
||
func (r *NIMServiceReconciler) getDeviceIDByProfile(ctx context.Context, profile *appsv1alpha1.NIMProfile) (string, error) { | ||
deviceID := "" | ||
if device, exists := profile.Config["gpu_device"]; exists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from the model point of view, the same (model, profile) should be able to run on multiple types of gpus. But here the logic seem to indicate that it is always the 1:1 mapping. Does nim enforce the latter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slu2011 models are optimized for specific GPU SKUs, they cannot be run on other types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about different mig profiles from the same physical gpu? like 1g.10gb vs 2g.20gb. Does nim provide different model even for different mig profiles, or it would be the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slu2011 currently NIMs need a full GPU, they don't support MIG.
No description provided.