Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: simplify token audiences config #947

Open
wants to merge 27 commits into
base: incubation
Choose a base branch
from

Conversation

cam-garrison
Copy link
Contributor

https://issues.redhat.com/browse/RHOAIENG-4855

Description

This PR brings automation to populating the audiences field for token review. Previously, if the audiences needed were different than the default kubernetes.default.svc, the user had to configure it manually thru the AuthSpec in the DSCI.

Now, if the user does not specify audiences in their DSCI, we will use an audiences value fetched while bootstrapping the operator. If the user specifies some non-default audiences, that value will be used over what is fetched while bootstrapping the operator.

This specifically should make testing/deploying on ROSA easier, as previously we needed to manually fetch the audiences and set this in the DSCI, and now if you leave it as default the operator will fetch it for you.

How Has This Been Tested?

Tested on ROSA created thru clusterbot - created a DSCI without specifying audiences under .spec.serviceMesh.auth.audiences. Then, checked created ConfigMap auth-refs to ensure that AUTH_AUDIENCE was populated with the correct audience for ROSA cluster, not the default. Then, did this basic test https://gist.github.com/israel-hdez/af374562ef9e5b9d80890aa6f0bce20d to ensure kserve functionality. (note - used odh-model-controller release v.0.12.0 in operator build to test this aspect).

Also, tested with a specified audience to ensure that if the user specifies some audiences in the DSCI authspec, they are not overwritten.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

pkg/feature/servicemesh/resources.go Outdated Show resolved Hide resolved
controllers/dscinitialization/servicemesh_setup.go Outdated Show resolved Hide resolved
Copy link
Contributor

@bartoszmajsak bartoszmajsak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is great improvement.

I shared a few things which we should think through before merging it.

And on the related note: would it be difficult to write a test for it?

controllers/dscinitialization/servicemesh_setup.go Outdated Show resolved Hide resolved
controllers/dscinitialization/servicemesh_setup.go Outdated Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
@@ -67,6 +67,7 @@ type DSCInitializationReconciler struct { //nolint:golint,revive // Readability
Log logr.Logger
Recorder record.EventRecorder
ApplicationsNamespace string
Audiences []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if there's some other way of passing this config down the chain. Somewhat does not feel like this should belong here, it's unrelated to the reconcile at this level. Let's think about possible ways and sync tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved it! here
c99d6d7

@cam-garrison
Copy link
Contributor Author

@bartoszmajsak added some basic unit tests, not mocking the full functionality here

controllers/dscinitialization/servicemesh_setup.go Outdated Show resolved Hide resolved
controllers/dscinitialization/servicemesh_setup.go Outdated Show resolved Hide resolved
controllers/dscinitialization/servicemesh_test.go Outdated Show resolved Hide resolved
controllers/dscinitialization/servicemesh_setup.go Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should stick to the correct terminology. We are providing canned responses, so this is stubbing, not mocking.

A bit more about it in this great blog post https://martinfowler.com/articles/mocksArentStubs.html

pkg/cluster/audiences_test.go Outdated Show resolved Hide resolved
@openshift-ci openshift-ci bot added the lgtm label Apr 10, 2024
Copy link

openshift-ci bot commented Apr 10, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bartoszmajsak
Once this PR has been reviewed and has the lgtm label, please assign asanzgom for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot removed the lgtm label Apr 15, 2024
Copy link

openshift-ci bot commented Apr 15, 2024

New changes are detected. LGTM label has been removed.

@zdtsw zdtsw requested a review from israel-hdez April 15, 2024 15:09
@zdtsw
Copy link
Member

zdtsw commented Apr 15, 2024

I do not have too many opinions on this PR, @israel-hdez if you have time do a review on it?

@cam-garrison
Copy link
Contributor Author

/retest-required

Copy link
Contributor

@israel-hdez israel-hdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving some questions.

I haven't tested. I'd test after having answers. This is, anyway, going to be OK for me as long as auth is still working (and I do see that it was already tested).

},
}

if err = cli.Create(context.Background(), tokenReview, &client.CreateOptions{}); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to use the ServiceAccount mounted by the cluster, as a pivot to find the audientes.

I wonder if it isn't enough to parse the ServiceAccount token?
Just by parsing the one from CRC, I see the audiences claim in the payload:

  "aud": [
    "https://kubernetes.default.svc"
  ],

I'm not sure how it would be for ROSA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, originally we were using opaque token for the bash "workaround" and this solution is based on this, so perhaps, using SA (jwt) token aud is simpler and more efficient. We could replace this impl and use https://github.com/golang-jwt/jwt instead to fetch standard claims without verification (so passing nil as the key func ). @israel-hdez @aslakknutsen do you think it's better to rely on the fact that SA token is JWT?

I just checked in ROSA and we can also parse JWT instead. For opaque (openshift oauth) tokens it's the same audience. We can get it by using the exact call to TokenReview API.

This behavior is only invoked when you do not specify anything in DSCI.ServiceMesh.Auth part. If you later do, it will update underlying config map which is now used by odh-model-controller

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@israel-hdez @aslakknutsen do you think it's better to rely on the fact that SA token is JWT?

AFAIK, core Kubernetes only supports JWT (read 1st sentence of this). So, I'd expect any SA token to be JWT. I understand we have control about the mounted/projected token so, as long as the setup does not change, we should be safe to assume it is a JWT.

About the opaque tokens, my understanding is that in core Kubernetes these are only supported when a 3rd party issues the token and the API Server would validate the token via a Webhook to the 3rd party (i.e. Kubernetes does not validate by its own). OpenShift's OAuth server may fall in this category.

return defaultAudiences
}

return tokenReview.Status.Audiences
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading these docs: https://kubernetes.io/docs/reference/access-authn-authz/authentication/.

There is a note that the audiences field is optional and that, if not present, it may still mean the token is valid for the cluster. So, I'm not sure if this is safe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Underlying AuthConfig will be defaulted if not configured specifically.

// If omitted, Authorino will review tokens expecting the host name of the requested protected service amongst the audiences.

Current implementation defaults it to "https://kubernetes.default.svc" if not specified.

Copy link
Contributor

@israel-hdez israel-hdez Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the docs that I linked, there is this text over status.audiences of the TokenReview:

# If this is omitted, the token is considered to be valid to authenticate to the Kubernetes API server.

Underlying AuthConfig will be defaulted if not configured specifically.
Current implementation defaults it to "https://kubernetes.default.svc" if not specified.

IIRC, this default is on odh-model-controller side. This is OK.

My concern is that my understanding of the kubernetes docs is that the cluster may choose to omit the status.audiences field as an indication that the token is valid for the API Server. Yet, in that case, I see no statement that the audience would be https://kubernetes.default.svc in case the audiences field is omitted. Thus, we may wrongly assume it is https://kubernetes.default.svc.

The writing is somewhat confusing. So, I may be understanding things badly.

@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bartoszmajsak
Copy link
Contributor

bartoszmajsak commented Apr 19, 2024

Let's keep it on hold until we figure out the right path how reliably fetch audiences or fall back if they're not present.

Copy link

openshift-ci bot commented Apr 24, 2024

@cam-garrison: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ci-index 65887e4 link true /test ci-index
ci/prow/opendatahub-operator-pr-image-mirror 65887e4 link true /test opendatahub-operator-pr-image-mirror
ci/prow/opendatahub-operator-e2e 65887e4 link true /test opendatahub-operator-e2e
ci/prow/images 65887e4 link true /test images

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@israel-hdez
Copy link
Contributor

@cam-garrison @bartoszmajsak We saw an instance of this, and we'd prefer to have this PR merged soon, if possible.
cc @Jooho

@bartoszmajsak
Copy link
Contributor

bartoszmajsak commented Jun 13, 2024

@cam-garrison @bartoszmajsak We saw an instance of this, and we'd prefer to have this PR merged soon, if possible. cc @Jooho

@israel-hdez can you elaborate more on what you mean by "instance" and what is the requirement that this PR can be part of?

@israel-hdez
Copy link
Contributor

israel-hdez commented Jun 13, 2024

We saw in ROSA that the audience is not configured correctly. Thus, in the automated setup the inference requests do not go through for protected models.

We, anyway, need docs for the manual setup, but it is a weird experience that for the automated setup users need to be careful of setting the right audience.

@zdtsw
Copy link
Member

zdtsw commented Jun 14, 2024

@israel-hdez do we expect this PR for downstream 2.11?

@israel-hdez
Copy link
Contributor

@israel-hdez do we expect this PR for downstream 2.11?

AFAIK, we don't have a target. Also, I'm not sure how much we support ROSA. If we do support it, we should try to have this in 2.11 if it fits... But I see this as best effort for 2.11.

BTW, @Jooho provided some feedback in the Jira.

@bartoszmajsak
Copy link
Contributor

We saw in ROSA that the audience is not configured correctly. Thus, in the automated setup the inference requests do not go through for protected models.

We, anyway, need docs for the manual setup, but it is a weird experience that for the automated setup users need to be careful of setting the right audience.

Understood, but this is a known issue and the reason why audiences are configurable through DSCI. We will look at how to improve the experience but we need to prioritize this work first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

5 participants