feat(csi): support multiple model registries #508

Al-Pragliola · 2024-10-22T17:08:02Z

Description

As it stands, CSI is designed around the idea that there is only one model registry on the cluster and that every model's metadata is registered there. In this PR I have modified the code to allow users to specify a different MR URL in the InferenceService, which adds support for multiple model registries. This change does not break the current behavior, because the URL from the env var MODEL_REGISTRY_BASE_URL is used as a fallback.

Here's an example of an InferenceService using the new option:

    apiVersion: "serving.kserve.io/v1beta1"
    kind: "InferenceService"
    metadata:
      name: "iris-model"
    spec:
      predictor:
        model:
          modelFormat:
            name: sklearn
          storageUri: "model-registry://custom-mr-svc.test-namespace.svc.cluster.local:8080/iris/v1"

I have also added more scenarios to the e2e tests to check if every supported URI combination is working correctly:

model-registry://{modelName}/{modelVersion}
model-registry://{modelName}
model-registry://{modelRegistryUrl}/{modelName}/{modelVersion}
model-registry://{modelRegistryUrl}/{modelName}

How Has This Been Tested?

cd csi
kind create cluster
docker build . -f Dockerfile -t docker.io/mr/mr-csi:0.1.0
kind load docker-image docker.io/mr/mr-csi:0.1.0
KSERVE_VERSION=0.14 MR_CSI_IMG=docker.io/mr/mr-csi:0.1.0 ./test/e2e_test.sh

Merge criteria:

All the commits have been signed-off (To pass the DCO check)

The commits have meaningful messages; the author will squash them after approval or in case of manual merges will ask to merge with squash.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.
Code changes follow the kubeflow contribution guidelines.

Signed-off-by: Alessio Pragliola <[email protected]>

Al-Pragliola

cc @tarilabs @lampajr

Signed-off-by: Alessio Pragliola <[email protected]>

lampajr

In this PR I have modified the code to allow users to specify a different MR URL in the InferenceService, which adds support for multiple model registries. This change does not break the current behavior, because the URL from the env var MODEL_REGISTRY_BASE_URL is used as a fallback.

I really like this approach and the fact that it is backward compatible by falling back to the default modelregistry.

I left just one comment that I think should be sorted out.

lampajr · 2024-10-24T13:47:44Z

csi/pkg/storage/modelregistry_provider.go

+// Possible URIs:
+// (1) model-registry://{modelName}
+// (2) model-registry://{modelName}/{modelVersion}
+// (3) model-registry://{modelRegistryUrl}/{modelName}


I am not completely sure this is actually supported, because if you have just 2 tokens (after the trim) it will return the default apiClient (i.e., using the default env model registry).

I think here the problem would be, how do you know whether the user is providing option 2 or 3?

I tried running make SOURCE_URI=model-registry://model-registry-url/model DEST_PATH=./ run and in fact model-registry-url is interpreted as being the registered model, which is actually not what the user was trying to do.

I think that we have two options here:

If the model-registry-url is provided, users cannot omit the version

We use a different delimiter for the model-registry-url, e.g., model-registry://{modelRegistryUrl}:{modelName}

The way I approached this problem can be found here:

https://github.com/Al-Pragliola/model-registry/blob/feat/multi-mr-registries-csi-support/csi/pkg/modelregistry/api_client.go#L24

We try to reach (token 0) as a model registry; on failure we assume that it is a model name and not a valid MR URL, and fall back to the default URL from the env var. It's not perfect, but I think it might be a fair compromise.
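The probe-then-fall-back idea described above can be sketched as follows. This is a minimal illustration, not the code in the linked api_client.go: `resolveHost`, the injected `probe` function, `errUnreachable`, and the host names are all hypothetical, and the real implementation may check reachability differently.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errUnreachable is a placeholder error used by the fake probe below.
var errUnreachable = errors.New("unreachable")

// resolveHost decides which registry host to use for a storage URI whose
// first token is candidate. probe is a hypothetical reachability check that
// returns nil when candidate answers like a model registry.
func resolveHost(candidate, fallback string, probe func(host string) error) (host string, usedFallback bool) {
	if err := probe(candidate); err != nil {
		// Say explicitly why the fallback is used, so a registry that is
		// merely down is not silently treated as a model name.
		log.Printf("could not reach %q as a model registry (%v); falling back to base url %s",
			candidate, err, fallback)
		return fallback, true
	}
	return candidate, false
}

func main() {
	// Fake probe for illustration: only one host is "reachable".
	probe := func(host string) error {
		if host == "localhost:8080" {
			return nil
		}
		return errUnreachable
	}
	for _, candidate := range []string{"localhost:8080", "mymodel"} {
		host, fellBack := resolveHost(candidate, "default-mr:8080", probe)
		fmt.Printf("token0=%s -> host=%s fallback=%v\n", candidate, host, fellBack)
	}
}
```

Injecting the probe keeps the decision logic testable without a running registry.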

by doing this in the function from this comment:

    // Check if the first token is the host and remove it so that we reduce cases (3) and (4) to (1) and (2)
    if len(tokens) >= 2 && p.Client.GetConfig().Host == tokens[0] {
        tokens = tokens[1:]
    }

case (1) stays the same

case (2) stays the same

case (3) by removing tokens[0] becomes (1)

case (4) by removing tokens[0] becomes (2)
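The reduction of the four cases can be sketched in a self-contained form. The helper names `reduce` and `modelAndVersion` are hypothetical, and `clientHost` stands in for `p.Client.GetConfig().Host`; the host value is taken from the example in the PR description.

```go
package main

import (
	"fmt"
	"strings"
)

// reduce drops the leading token when it equals the host the client was
// configured with, so cases (3) and (4) collapse into (1) and (2).
func reduce(tokens []string, clientHost string) []string {
	if len(tokens) >= 2 && tokens[0] == clientHost {
		return tokens[1:]
	}
	return tokens
}

// modelAndVersion reads the remaining tokens as {modelName}[/{modelVersion}].
// A nil version means no version was given in the URI.
func modelAndVersion(tokens []string) (string, *string, error) {
	switch len(tokens) {
	case 1:
		return tokens[0], nil, nil
	case 2:
		return tokens[0], &tokens[1], nil
	}
	return "", nil, fmt.Errorf("expected 1 or 2 tokens, got %d", len(tokens))
}

func main() {
	const host = "custom-mr-svc.test-namespace.svc.cluster.local:8080"
	// Paths as they appear after the model-registry:// prefix, cases (1)-(4).
	paths := []string{"iris", "iris/v1", host + "/iris", host + "/iris/v1"}
	for _, path := range paths {
		name, version, err := modelAndVersion(reduce(strings.Split(path, "/"), host))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		v := "<none>"
		if version != nil {
			v = *version
		}
		fmt.Printf("%s -> model=%s version=%s\n", path, name, v)
	}
}
```

Note that the comparison only strips the first token when the client was actually built for that host, which is why an unreachable registry ends up being read as a model name.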

Sorry missed that, I tried and I can confirm that if the "host" is reachable it works as expected:

    $ make SOURCE_URI=model-registry://localhost:8080/mymodel DEST_PATH=./ run [17:59:56]
    "/usr/bin/go" fmt ./...
    "/usr/bin/go" vet ./...
    "/usr/bin/go" run ./main.go model-registry://localhost:8080/mymodel ./
    2024/10/24 18:00:02 Initializing, args: src_uri [model-registry://localhost:8080/mymodel] dest_path[ [./]
    2024/10/24 18:00:02 Download model indexed in model registry: modelName=, storageUri=model-registry://localhost:8080/mymodel, modelDir=./
    2024/10/24 18:00:02 Fetching model: registeredModelName=mymodel, versionName=<nil>
    2024/10/24 18:00:02 404 Not Found
    exit status 1
    make: *** [Makefile:59: run] Error 1

With a model registry running at localhost:8080.

My main concern with this assumption is that, if there is a model registry running but for any reason it is not accessible/reachable, we are going to wrongly assume that it is a registeredModel name, right? And the error the user finds in the logs would be misleading.

What do you think?

Yep that's true, the only hint the user has from the logs is this:

log.Printf("Falling back to base url %s for model registry service", cfg.Host)

We can improve this message by telling the user that it failed to reach the URL from the URI and that it is going to use the fallback URL.

I wanted this to be as backward compatible as possible; otherwise there are many alternatives, like a query parameter approach:

model-registry://url?modelName=x&modelVersion=y
model-registry://?modelName=x&modelVersion=y

or the other delimiter you mentioned.
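For comparison, the query-parameter alternative could be parsed with the standard `net/url` package. This is only a sketch of an option that was not adopted in this PR; `parseQueryForm` is a hypothetical helper, and treating an empty host as "use the default registry" is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// parseQueryForm reads the query-parameter style floated in the thread:
//
//	model-registry://{host}?modelName=x&modelVersion=y
//
// An empty host is taken to mean "use the default registry".
func parseQueryForm(raw string) (host, model, version string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", err
	}
	if u.Scheme != "model-registry" {
		return "", "", "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	q := u.Query()
	model = q.Get("modelName")
	if model == "" {
		return "", "", "", fmt.Errorf("modelName is required in %q", raw)
	}
	return u.Host, model, q.Get("modelVersion"), nil
}

func main() {
	for _, raw := range []string{
		"model-registry://localhost:8080?modelName=x&modelVersion=y",
		"model-registry://?modelName=x&modelVersion=y",
	} {
		host, model, version, err := parseQueryForm(raw)
		fmt.Printf("host=%q model=%q version=%q err=%v\n", host, model, version, err)
	}
}
```

This form removes the ambiguity between a registry host and a model name, at the cost of changing the URI shape that existing users rely on.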

Added comment below for line 129 my2c

btw forgot to mention: options 3-4 are also consistent with standard URLs and with the KServe examples

    - regex: "https://(.+?).blob.core.windows.net/(.+)"
    - regex: "https://(.+?).file.core.windows.net/(.+)"

much appreciated @Al-Pragliola

tarilabs

/lgtm

with couple of comments; lmk your view?

tarilabs · 2024-10-25T07:27:36Z

csi/test/kind_config.yaml

+apiVersion: kind.x-k8s.io/v1alpha4
+kind: Cluster
+nodes:
+  - role: control-plane
+  - role: worker
+  - role: worker


I've missed why this Kind context is needed/introduced?

Sorry, I forgot to mention it in the PR description. With the extra scenarios, the e2e tests now create 4 inference services with their associated deployments, pods, etc., and the scheduler in CI complained with errors like:

    Warning FailedScheduling 1s default-scheduler 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes available: 1 No preemption victims found for incoming pod..

So I had to add workers to the kind test cluster. Now that I think about it, I can also clean up inferenceservices after each scenario, wdyt @tarilabs ?

Now that I think about it, I can also clean up inferenceservices after each scenario, wdyt @tarilabs ?

I believe this option is preferable in general, so that we have a cleanup after each run, but it's nice to know about the extra workers in case one day we want more "extended" testing! my2c

tarilabs · 2024-10-25T07:35:51Z

csi/pkg/storage/modelregistry_provider.go

+	// Check if the first token is the host and remove it so that we reduce cases (3) and (4) to (1) and (2)
+	if len(tokens) >= 2 && p.Client.GetConfig().Host == tokens[0] {
+		tokens = tokens[1:]
+	}
+


My main concern with this assumption is that, if there is a model registry running but for any reason it is not accessible/reachable, we are going to wrongly assume that it is a registeredModel name, right? And the error the user finds in the logs would be misleading.

We can improve this message by telling the user that it failed to reach the URL from the URI and that it is going to use the fallback URL.

Here (line 129) you have done all the storageUri parsing needed to determine which case you are in.
I'd add an info log here that shows roughly:

Parsed storageUri=... as: modelRegistryUrl=... modelName=... modelVersion=...

This way, we're being explicit on how we interpret what we have received based on the documentation provided. wdyt?

I agree with you @tarilabs @Al-Pragliola , I think that providing a more meaningful and explicit log message would be enough for this use case!

added

    log.Printf("Parsed storageUri=%s as: modelRegistryUrl=%s, registeredModelName=%s, versionName=%v",
        storageUri,
        p.Client.GetConfig().Host,
        registeredModelName,
        versionName,
    )
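One detail worth checking with this log line: the thread does not show the type of versionName, but the `versionName=<nil>` in the earlier output suggests it may be a pointer. If it is a `*string` (an assumption), `%v` prints an address rather than the version for non-nil values. A minimal sketch of a dereferencing helper, with hypothetical names `derefOr` and `parsedLine`:

```go
package main

import "fmt"

// derefOr returns *s, or fallback when s is nil.
func derefOr(s *string, fallback string) string {
	if s == nil {
		return fallback
	}
	return *s
}

// parsedLine builds the "Parsed storageUri=..." message, assuming the
// version is carried as a *string (not confirmed by the thread).
func parsedLine(storageUri, host, model string, version *string) string {
	return fmt.Sprintf(
		"Parsed storageUri=%s as: modelRegistryUrl=%s, registeredModelName=%s, versionName=%s",
		storageUri, host, model, derefOr(version, "<nil>"),
	)
}

func main() {
	v := "v1"
	fmt.Println(parsedLine("model-registry://iris/v1", "mr:8080", "iris", &v))
	fmt.Println(parsedLine("model-registry://iris", "mr:8080", "iris", nil))
}
```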

tarilabs · 2024-10-25T07:37:29Z

csi/pkg/storage/modelregistry_provider.go

+// Possible URIs:
+// (1) model-registry://{modelName}
+// (2) model-registry://{modelName}/{modelVersion}
+// (3) model-registry://{modelRegistryUrl}/{modelName}


Added comment below for line 129 my2c

lampajr

Thanks a lot @Al-Pragliola 🚀

/lgtm

tarilabs

small comment then I think it's good to go; thanks a lot for keeping tabs on this

tarilabs · 2024-10-28T09:29:23Z

csi/test/kind_config.yaml

-  - role: worker
-  - role: worker


Is this file being kept in case one day we want to add feature gates, etc.?

From what I can see in the docs, the file as it stands right now just reproduces the default settings. If confirmed, I'd suggest either adding a comment in this file (so we don't one day have to wonder "why is this config file necessary") or just leaving the 2 workers, even if the test scenarios can run efficiently without requiring 2 actual nodes. wdyt?

Third option: I forgot to remove it after successfully running the e2e tests with only the control plane 😞 🙏🏼. I just removed the file in the last commit; I think it only brings confusion and adds nothing. If we need to add workers in the future, we'll recreate it at that time.

Signed-off-by: Alessio Pragliola <[email protected]>

tarilabs

thanks a lot @Al-Pragliola
and thanks to @lampajr for chiming in

/lgtm
/approve

google-oss-prow · 2024-10-28T10:41:44Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lampajr, tarilabs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [tarilabs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot added the do-not-merge/work-in-progress label Oct 22, 2024

google-oss-prow bot requested review from andreyvelich, ckadner and Tomcli October 22, 2024 17:08

github-actions bot added the Area/CSI label Oct 22, 2024

google-oss-prow bot added the size/L label Oct 22, 2024

Al-Pragliola force-pushed the feat/multi-mr-registries-csi-support branch 2 times, most recently from 57cd0ea to 648f4a0 Compare October 23, 2024 22:05

github-actions bot added the Area/GitHub label Oct 23, 2024

Al-Pragliola force-pushed the feat/multi-mr-registries-csi-support branch from 55ae816 to 5ad3b2b Compare October 23, 2024 22:33

google-oss-prow bot added size/XL and removed size/L labels Oct 24, 2024

feat(csi): support multiple model registries

1f3cb5d

Signed-off-by: Alessio Pragliola <[email protected]>

Al-Pragliola force-pushed the feat/multi-mr-registries-csi-support branch from 399abc3 to 1f3cb5d Compare October 24, 2024 11:55

chore(csi): add info about the uri in readme

1278ab9

Signed-off-by: Alessio Pragliola <[email protected]>

Al-Pragliola marked this pull request as ready for review October 24, 2024 12:46

google-oss-prow bot removed the do-not-merge/work-in-progress label Oct 24, 2024

Al-Pragliola commented Oct 24, 2024

fix(csi): remove unnecessary comment

1a6a412

Signed-off-by: Alessio Pragliola <[email protected]>

lampajr suggested changes Oct 24, 2024

tarilabs reviewed Oct 25, 2024

google-oss-prow bot assigned tarilabs Oct 25, 2024

google-oss-prow bot added the lgtm label Oct 25, 2024

chore(csi): cleanup after each test + improve log message after uri p…

4d9780f

…arsing Signed-off-by: Alessio Pragliola <[email protected]>

google-oss-prow bot removed the lgtm label Oct 25, 2024

Al-Pragliola requested review from lampajr and tarilabs October 25, 2024 13:30

lampajr approved these changes Oct 25, 2024

google-oss-prow bot assigned lampajr Oct 25, 2024

google-oss-prow bot added the lgtm label Oct 25, 2024

tarilabs reviewed Oct 28, 2024

chore(csi): remove unnecessary kind external config in CI

d057a31

Signed-off-by: Alessio Pragliola <[email protected]>

google-oss-prow bot removed the lgtm label Oct 28, 2024

tarilabs approved these changes Oct 28, 2024

google-oss-prow bot added the lgtm label Oct 28, 2024

google-oss-prow bot added the approved label Oct 28, 2024

google-oss-prow bot merged commit 2422e85 into kubeflow:main Oct 28, 2024
17 checks passed

tarilabs mentioned this pull request Oct 28, 2024

periodic sync upstream KF to midstream ODH opendatahub-io/model-registry#141

Merged

Al-Pragliola deleted the feat/multi-mr-registries-csi-support branch October 28, 2024 11:06
