Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(csi): support multiple model registries #508

Conversation

Al-Pragliola
Copy link
Contributor

@Al-Pragliola Al-Pragliola commented Oct 22, 2024

Description

As it stands, CSI is designed around the idea that there is only one model registry on the cluster and each model's metadata is registered here. In this PR I have modified the code to allow users to specify a different MR url in the InferenceService to have support for multiple model registries, again this change does not break the current behavior because the url from the env var MODEL_REGISTRY_BASE_URL is used as a fallback.

Here's an example of an InferenceService using the new option:

    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "iris-model"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "model-registry://custom-mr-svc.test-namespace.svc.cluster.local:8080/iris/v1"

I have also added more scenarios to the e2e tests to check if every supported URI combination is working correctly:

  • model-registry://{modelName}/{modelVersion}
  • model-registry://{modelName}
  • model-registry://{modelRegistryUrl}/{modelName}/{modelVersion}
  • model-registry://{modelRegistryUrl}/{modelName}

How Has This Been Tested?

cd csi
kind create cluster
docker build . -f Dockerfile -t docker.io/mr/mr-csi:0.1.0
kind load docker-image docker.io/mr/mr-csi:0.1.0
KSERVE_VERSION=0.14 MR_CSI_IMG=docker.io/mr/mr-csi:0.1.0 ./test/e2e_test.sh

Merge criteria:

  • All the commits have been signed-off (To pass the DCO check)
  • The commits have meaningful messages; the author will squash them after approval or in case of manual merges will ask to merge with squash.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.
  • Code changes follow the kubeflow contribution guidelines.

@Al-Pragliola Al-Pragliola force-pushed the feat/multi-mr-registries-csi-support branch from 399abc3 to 1f3cb5d Compare October 24, 2024 11:55
@Al-Pragliola Al-Pragliola marked this pull request as ready for review October 24, 2024 12:46
Copy link
Contributor Author

@Al-Pragliola Al-Pragliola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Alessio Pragliola <[email protected]>
lampajr

This comment was marked as outdated.

Copy link
Member

@lampajr lampajr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this PR I have modified the code to allow users to specify a different MR url in the InferenceService to have support for multiple model registries, again this change does not break the current behavior because the url from the env var MODEL_REGISTRY_BASE_URL is used as a fallback.

I really like this approach and the fact that it is backward compatible by falling back to the default modelregistry.

I left just one comment that I think should be sorted out.

// Possible URIs:
// (1) model-registry://{modelName}
// (2) model-registry://{modelName}/{modelVersion}
// (3) model-registry://{modelRegistryUrl}/{modelName}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not completely sure this is actually supported, because if you have just 2 tokens (after the trim) it will return the default apiClient (i.e., using the default env model registry).

I think here the problem would be, how do you know whether the user is providing option 2 or 3?

I tried running make SOURCE_URI=model-registry://model-registry-url/model DEST_PATH=./ run and in fact model-registry-url is interpreted as being the registered model, which is actually not what the user was trying to do.

I think that we have two options here:

  1. If the model-registry-url is provided, users cannot omit the version
  2. We use a different delimiter for the model-registry-url, e.g., model-registry://{modelRegistryUrl}:{modelName}

Copy link
Contributor Author

@Al-Pragliola Al-Pragliola Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I approached this problem can be found here:

https://github.com/Al-Pragliola/model-registry/blob/feat/multi-mr-registries-csi-support/csi/pkg/modelregistry/api_client.go#L24

We try to reach (token 0) as a model registry and on failure we assume that it is a model name and not a valid mr url (falling back to the default url from env var), it's not perfect but I think it might be a fair compromise

by doing this in the function from this comment:

	// Check if the first token is the host and remove it so that we reduce cases (3) and (4) to (1) and (2)
	if len(tokens) >= 2 && p.Client.GetConfig().Host == tokens[0] {
		tokens = tokens[1:]
	}
  • case (1) stays the same
  • case (2) stays the same
  • case (3) by removing token[0] becomes (1)
  • case (4) by removing token[0] becomes (2)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry missed that, I tried and I can confirm that if the "host" is reachable it works as expected:

$ make SOURCE_URI=model-registry://localhost:8080/mymodel DEST_PATH=./ run                                                                                             [17:59:56]
"/usr/bin/go" fmt ./...
"/usr/bin/go" vet ./...
"/usr/bin/go" run ./main.go model-registry://localhost:8080/mymodel ./
2024/10/24 18:00:02 Initializing, args: src_uri [model-registry://localhost:8080/mymodel] dest_path[ [./]
2024/10/24 18:00:02 Download model indexed in model registry: modelName=, storageUri=model-registry://localhost:8080/mymodel, modelDir=./
2024/10/24 18:00:02 Fetching model: registeredModelName=mymodel, versionName=<nil>
2024/10/24 18:00:02 404 Not Found
exit status 1
make: *** [Makefile:59: run] Error 1

With a model registry running at localhost:8080.

My main concern with this assumption is that, if there is a model registry running but for any reason it is not accessible/reachable we are going to make the wrong assumption that it is a registeredModel name , right? And the error would be misleading to the user as it will find in the logs.

What do you think?

Copy link
Contributor Author

@Al-Pragliola Al-Pragliola Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that's true, the only hint that user have from the logs is this:

log.Printf("Falling back to base url %s for model registry service", cfg.Host)

We can improve this message telling the user that it failed to reach url from uri and it's going to use fallback url


I wanted this to be as retrocompatible as possible, otherwise there are many alternatives, like a query parameter approach:

model-registry://url?modelName=x&modelVersion=y
model-registry://?modelName=x&modelVersion=y

or the other delimiter you mentioned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment below for line 129 my2c

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw forgot to mention; options 3-4 also consistent with standard URLs and also KServe examples

    - regex: "https://(.+?).blob.core.windows.net/(.+)"
    - regex: "https://(.+?).file.core.windows.net/(.+)"

much appreciated @Al-Pragliola

Copy link
Member

@tarilabs tarilabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

with couple of comments; lmk your view?

Comment on lines 1 to 6
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
- role: worker
- role: worker
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've missed why this Kind context is needed/ introduced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I forgot to mention it in the PR description, when adding more scenarios in the e2e tests there are now 4 inference services with the associated deployments/pods etcetera and the scheduler in the CI complained with errors like Warning FailedScheduling 1s default-scheduler 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes available: 1 No preemption victims found for incoming pod.., so I had to add workers to the kind test cluster. Now that I think about it, I can also clean up inferenceservices after each scenario, wdyt @tarilabs ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it, I can also clean up inferenceservices after each scenario, wdyt @tarilabs ?

I believe this option is preferable in general, so to have a cleanup after each run, but it's nice to know in case one day if we want a more "extended" testings! my2c

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +125 to +129
// Check if the first token is the host and remove it so that we reduce cases (3) and (4) to (1) and (2)
if len(tokens) >= 2 && p.Client.GetConfig().Host == tokens[0] {
tokens = tokens[1:]
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern with this assumption is that, if there is a model registry running but for any reason it is not accessible/reachable we are going to make the wrong assumption that it is a registeredModel name , right? And the error would be misleading to the user as it will find in the logs.

We can improve this message telling the user that it failed to reach url from uri and it's going to use fallback url

Here (line 129) you have done all the storageURI parsing to determine in which case you are.
To me, add a Log info here that shows ~circa

Parsed storageUri=... as: modelRegistryUrl=... modelName=... modelVersion=...

This way, we're being explicit on how we interpret what we have received based on the documentation provided. wdyt?

Copy link
Member

@lampajr lampajr Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you @tarilabs @Al-Pragliola , I think that providing a more meaningful and explicit log message would be enough for this use case!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

	log.Printf("Parsed storageUri=%s as: modelRegistryUrl=%s, registeredModelName=%s, versionName=%v",
		storageUri,
		p.Client.GetConfig().Host,
		registeredModelName,
		versionName,
	)

// Possible URIs:
// (1) model-registry://{modelName}
// (2) model-registry://{modelName}/{modelVersion}
// (3) model-registry://{modelRegistryUrl}/{modelName}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment below for line 129 my2c

Copy link
Member

@lampajr lampajr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @Al-Pragliola 🚀

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Oct 25, 2024
Copy link
Member

@tarilabs tarilabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment then I think it's good to go; thanks a lot for keeping tabs on this

Comment on lines 5 to 6
- role: worker
- role: worker
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file left if one day we want to add feature gates, etc?

Because to my understanding, the way this file is right now, is a default setting by looking up in the doc. If confirmed, I'd suggest making a comment in this file (so we don't have one day to wonder "why is this config file necessary") or just leave the 2 workers even if the test scenarios can run efficiently without requiring 2 actual nodes. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third option, I forgot to remove it after successfully testing the e2e tests only with the control plane 😞 🙏🏼, I just removed the file in the last commit, I think it just brings confusion and adds nothing, if in the future we might need to add workers we'll recreate it at that time.

@google-oss-prow google-oss-prow bot removed the lgtm label Oct 28, 2024
Copy link
Member

@tarilabs tarilabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot @Al-Pragliola
and thanks to @lampajr for chiming in

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Oct 28, 2024
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lampajr, tarilabs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 2422e85 into kubeflow:main Oct 28, 2024
17 checks passed
@Al-Pragliola Al-Pragliola deleted the feat/multi-mr-registries-csi-support branch October 28, 2024 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants