Automatically inject GPU resources and node selector for NIM deployment #134

shivamerla · 2024-09-10T03:43:03Z

No description provided.

Signed-off-by: Shiva Krishna, Merla <[email protected]>

ArangoGutierrez · 2024-09-10T08:59:39Z

internal/controller/platform/standalone/nimservice.go

+	// Reusable error handler to log and return errors
+	handleError := func(err error, reason string, resource string) (ctrl.Result, error) {
+		logger.Error(err, fmt.Sprintf("Failed to reconcile %s", resource), "resource", resource)
+		return ctrl.Result{}, err
+	}
+


This change is not in line with the commit message or PR description.

ArangoGutierrez · 2024-09-10T08:59:51Z

internal/controller/platform/standalone/nimservice.go

+	// Setup PVC for model store
+	modelPVC, err := r.setupPVC(ctx, nimService)
+	if err != nil {
+		return handleError(err, "Failed to setup PVC", "PVC")
+	}
+
+	// Setup deployment parameters
 	deploymentParams := nimService.GetDeploymentParams()
+	if err := r.setupDeploymentParams(ctx, nimService, modelPVC, deploymentParams); err != nil {
+		return handleError(err, "Failed to setup deployment parameters", "DeploymentParams")
+	}
+
+	// Sync deployment
+	err = r.renderAndSyncResource(ctx, nimService, &renderer, &appsv1.Deployment{}, func() (client.Object, error) {
+		return renderer.Deployment(deploymentParams)
+	}, "deployment", conditions.ReasonDeploymentFailed)
+	if err != nil {
+		return ctrl.Result{}, err
+	}
+
+	// Wait for deployment readiness
+	msg, ready, err := r.isDeploymentReady(ctx, &namespacedName)
+	if err != nil {
+		return handleError(err, "Failed to check deployment readiness", "Deployment")
+	}
+
+	// Update status
+	if !ready {
+		if err := r.updater.SetConditionsNotReady(ctx, nimService, conditions.NotReady, msg); err != nil {
+			return handleError(err, "Failed to update NotReady status", "Status")
+		}
+	} else {
+		if err := r.updater.SetConditionsReady(ctx, nimService, conditions.Ready, msg); err != nil {
+			return handleError(err, "Failed to update Ready status", "Status")
+		}
+	}
+
+	return ctrl.Result{}, nil
+}
+
+func (r *NIMServiceReconciler) setupPVC(ctx context.Context, nimService *appsv1alpha1.NIMService) (*appsv1alpha1.PersistentVolumeClaim, error) {
+	logger := log.FromContext(ctx)
+
+	// Initialize variable for the PVC


This change is not in line with the commit message or PR description.

The refactoring was necessary because we now have to customize the deployment params for several reasons (PVC, node selector, resources, etc.). Earlier, everything was done in the reconcileNIMService call because only the PVC was customized.

I am not saying I am against it; just please put it in a separate commit or a separate PR altogether.

Sounds good, let me split this into one change for TP and one for the node selector itself.

ArangoGutierrez · 2024-09-10T09:00:32Z

internal/nfdutil/nfdutil.go

@@ -0,0 +1,81 @@
+/*
+Copyright 2024.


Suggested change

-Copyright 2024.
+* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.

slu2011 · 2024-09-10T23:41:18Z

internal/nfdutil/nfdutil.go

+	}
+
+	if !crdExists {
+		return false, nil


Should this return a new error saying that the NFD CRD does not exist, instead of nil?

The intent of this function is to check whether a specific NodeFeatureRule CR exists, so if the CRD itself is missing we can assume the CR is not present either. Hence no error is returned in this case.
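To make the behaviour discussed above concrete, here is a minimal sketch of the "missing CRD means the CR is absent" check. It is not the code from `internal/nfdutil/nfdutil.go`: the `lookup` interface, `errNotFound`, `ruleExists` and `fakeCluster` are hypothetical names introduced only for illustration, and the API server is replaced by an in-memory fake so the logic can be read in isolation.

```go
package main

import "errors"

// errNotFound is a hypothetical sentinel standing in for a Kubernetes
// "not found" API error.
var errNotFound = errors.New("not found")

// lookup is a hypothetical abstraction over the API server (assumption,
// not part of the PR).
type lookup interface {
	CRDExists(name string) (bool, error)
	GetRule(name string) error // returns errNotFound when the CR is absent
}

// ruleExists reports whether the named NodeFeatureRule CR is present.
// A missing CRD is treated as "CR not present" rather than as an error.
func ruleExists(l lookup, crdName, ruleName string) (bool, error) {
	crdExists, err := l.CRDExists(crdName)
	if err != nil {
		return false, err
	}
	if !crdExists {
		// Without the CRD there cannot be any CR of that kind.
		return false, nil
	}
	err = l.GetRule(ruleName)
	if errors.Is(err, errNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

// fakeCluster is an in-memory lookup used to try the function out.
type fakeCluster struct {
	crds  map[string]bool
	rules map[string]bool
}

func (f fakeCluster) CRDExists(name string) (bool, error) { return f.crds[name], nil }

func (f fakeCluster) GetRule(name string) error {
	if !f.rules[name] {
		return errNotFound
	}
	return nil
}
```

With this shape, a caller that only wants to know "is the rule there?" does not need to special-case clusters where NFD is not installed. A caller that must distinguish the two cases would need a separate check, which is the trade-off raised in the question above.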

slu2011 · 2024-09-10T23:49:58Z

internal/controller/platform/standalone/nimservice.go

+		// TODO: Make the resource name configurable
+		const gpuResourceName = corev1.ResourceName("nvidia.com/gpu")
+
+		deploymentParams.Resources.Requests[corev1.ResourceName(gpuResourceName)] = gpuQuantity


Can the customer set a GPU resource request/limit in the NIMService spec? I believe so, in which case this logic may overwrite that value. Is that correct? Also, do we have unit test coverage for this?

Yes. If a profile is selected, we populate the number of GPUs and schedule the deployment onto nodes with the matching GPU type. For generic profile selection, users can specify the resources to be consumed.
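A rough sketch of the rule described in this reply, written so that it could be unit tested. It is an illustration under stated assumptions, not the operator's code: `resources` is a simplified stand-in for `corev1.ResourceRequirements` with integer quantities, `injectGPUs` is a hypothetical helper, and a `profileGPUs` value of zero or less is assumed to mean "no specific profile selected". Setting the limit together with the request is also an assumption; the diff above only shows the request being set.

```go
package main

// gpuResourceName mirrors the resource name used in the diff above.
const gpuResourceName = "nvidia.com/gpu"

// resources is a simplified stand-in for corev1.ResourceRequirements
// (assumption: plain integers instead of resource.Quantity).
type resources struct {
	Requests map[string]int64
	Limits   map[string]int64
}

// injectGPUs sets the GPU request and limit when a profile dictates a GPU
// count. It returns true when a different user-specified value was replaced.
// profileGPUs <= 0 means no specific profile, so user values stay untouched.
func injectGPUs(res *resources, profileGPUs int64) (overwritten bool) {
	if profileGPUs <= 0 {
		return false
	}
	// Writing to a nil map panics in Go, so initialize both maps first.
	if res.Requests == nil {
		res.Requests = map[string]int64{}
	}
	if res.Limits == nil {
		res.Limits = map[string]int64{}
	}
	prevReq, hadReq := res.Requests[gpuResourceName]
	prevLim, hadLim := res.Limits[gpuResourceName]
	overwritten = (hadReq && prevReq != profileGPUs) || (hadLim && prevLim != profileGPUs)

	// Kubernetes requires request == limit for extended resources such as
	// GPUs when both are given, so the sketch keeps them in sync.
	res.Requests[gpuResourceName] = profileGPUs
	res.Limits[gpuResourceName] = profileGPUs
	return overwritten
}
```

Returning whether a user value was replaced gives the caller a place to log or surface the override, which is one way to address the overwrite concern raised in the question.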

slu2011 · 2024-09-11T00:03:07Z

internal/controller/platform/standalone/nimservice.go

+
+func (r *NIMServiceReconciler) getDeviceIDByProfile(ctx context.Context, profile *appsv1alpha1.NIMProfile) (string, error) {
+	deviceID := ""
+	if device, exists := profile.Config["gpu_device"]; exists {


From the model's point of view, the same (model, profile) pair should be able to run on multiple types of GPUs, but the logic here seems to assume a 1:1 mapping. Does NIM enforce the latter?

@slu2011 Models are optimized for specific GPU SKUs; they cannot be run on other types.

How about different MIG profiles on the same physical GPU, like 1g.10gb vs 2g.20gb? Does NIM provide a different model for each MIG profile, or would it be the same?

@slu2011 Currently NIMs need a full GPU; they don't support MIG.
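For readers following the thread, the 1:1 mapping from a profile to a GPU type can be sketched roughly as below. This is only an illustration: `deviceIDFromConfig` and `nodeSelectorFor` are hypothetical helpers, the profile config is assumed to be a `map[string]string`, and `nodeLabelKey` is a made-up label key, since the label the operator actually matches on is not shown in this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeLabelKey is a hypothetical label key used only in this sketch.
const nodeLabelKey = "example.com/gpu-device-id"

// deviceIDFromConfig returns the "gpu_device" entry of a profile config,
// or "" when the profile does not declare one.
func deviceIDFromConfig(config map[string]string) string {
	if device, exists := config["gpu_device"]; exists {
		return strings.TrimSpace(device)
	}
	return ""
}

// nodeSelectorFor builds a single-entry node selector for the device ID.
// One profile maps to exactly one GPU type, matching the reply above.
func nodeSelectorFor(deviceID string) (map[string]string, error) {
	if deviceID == "" {
		return nil, fmt.Errorf("profile does not declare a gpu_device")
	}
	return map[string]string{nodeLabelKey: deviceID}, nil
}
```

Because the selector has a single entry per profile, supporting several GPU types or MIG slices for one profile would need a different structure (for example node affinity with multiple values), which is outside what this thread describes.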

shivamerla requested review from ArangoGutierrez, slu2011 and visheshtanksale as code owners September 10, 2024 03:43

shivamerla marked this pull request as draft September 10, 2024 03:43

shivamerla added 3 commits September 9, 2024 20:44

Automatically inject GPU resources and node selector for NIM deployment

b8c27f4

Signed-off-by: Shiva Krishna, Merla <[email protected]>

Add vendor dependencies

24e2117

Signed-off-by: Shiva Krishna, Merla <[email protected]>

Update comments and TODOs

d94c1a3

Signed-off-by: Shiva Krishna, Merla <[email protected]>

shivamerla force-pushed the detect_gpus branch from cafc346 to d94c1a3 Compare September 10, 2024 03:45

ArangoGutierrez requested changes Sep 10, 2024


slu2011 reviewed Sep 10, 2024


slu2011 reviewed Sep 11, 2024
