[0.1.6] Deploying argoflow-aws #227
Comments
Wow, thanks a lot for this! Very helpful!
I would also add: Delete the … Might be possible to fix with: …
Any idea why …?
Yes, there are a few applications/scenarios that need to happen in the correct order, it seems. One is definitely the …
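One generic way to force ordering between ArgoCD Applications is sync waves; a minimal sketch, assuming an app-of-apps layout, with a placeholder app name and source (not necessarily how argoflow-aws handles ordering):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager            # placeholder: an app that must be synced first
  namespace: argocd
  annotations:
    # Lower wave numbers are synced, and must be healthy, before higher ones start.
    argocd.argoproj.io/sync-wave: "-1"
spec:
  project: default
  source:
    repoURL: https://github.com/example/argoflow-aws.git   # placeholder repo
    targetRevision: HEAD
    path: cert-manager                                      # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    automated: {}
```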
There are a few apps that are fighting with K8s - fields going out of sync - I had this with the Knative install in our regular compute cluster too. Below is the spec:
```yaml
ignoreDifferences:
  - group: rbac.authorization.k8s.io
    kind: ClusterRole
    jsonPointers:
      - /rules
  - group: admissionregistration.k8s.io
    kind: ValidatingWebhookConfiguration
    jsonPointers:
      - /webhooks/0/rules
  - group: admissionregistration.k8s.io
    kind: MutatingWebhookConfiguration
    jsonPointers:
      - /webhooks/0/rules
```

The argoflow-aws Knative ArgoCD Application is going out of sync on the following objects:
Am I the only one having these go out of sync? This isn't the only app - have a few of them, will post the list.
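For reference, a minimal sketch of where a fragment like the spec above sits inside an ArgoCD Application manifest; the repo URL, path, and destination below are placeholders, not the actual argoflow-aws values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: knative
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/argoflow-aws.git   # placeholder repo
    targetRevision: HEAD
    path: knative                                            # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: knative-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
  # Fields listed here are excluded from the diff, so ArgoCD stops flagging
  # the app OutOfSync when controllers rewrite them at runtime.
  ignoreDifferences:
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      jsonPointers:
        - /rules
    - group: admissionregistration.k8s.io
      kind: ValidatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/rules
    - group: admissionregistration.k8s.io
      kind: MutatingWebhookConfiguration
      jsonPointers:
        - /webhooks/0/rules
```

With automated selfHeal plus ignoreDifferences, ArgoCD keeps reconciling the rest of the app but tolerates the webhook rules and aggregated ClusterRole rules that Knative's controllers (and the RBAC aggregation controller) rewrite at runtime.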
@jai Thanks for the very detailed issue thread you've started here. Sadly I haven't had much time to dedicate to the ArgoFlow repositories since starting my new job. However, there are a lot of very big Kubeflow improvements I'm working on. Basically it's a completely redesigned architecture that simplifies Kubeflow and adds better security and more advanced features around User/Group/Project management. Regarding the KNative manifests, they are quite a pain, especially with Kustomize. I've got a Helm chart that should be usable instead, which should get rid of this continuous syncing issue. Would you like to help move the KNative deployment over to Helm? If so, I can clean up the chart a little bit and add it to a registry for you to depend on.
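For what it's worth, a rough sketch of what consuming such a Helm chart from ArgoCD could look like; the registry URL, chart name, version, and values are all hypothetical since no chart has been published yet:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: knative-serving
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.example.com     # hypothetical chart registry
    chart: knative-serving                  # hypothetical chart name
    targetRevision: 0.1.0                   # hypothetical chart version
    helm:
      values: |
        # hypothetical values; the real chart's interface may differ
        domain: example.com
  destination:
    server: https://kubernetes.default.svc
    namespace: knative-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```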
ArgoCD Applications that are flip-flopping - not sure what the technical term is. Basically ArgoCD installs one manifest, then the cluster seems to override some values, causing an update tug-of-war kind of thing. I will post details of which resources are causing this:
Does argoflow/argoflow-aws use vanilla Knative? If I understand what you're saying, we would have to maintain a Helm repo with the Knative manifests, which sounds like one more thing to maintain. Is there a way we can point it at the Knative Operator and then just install a CRD? I might be way off base since I've only been working with Argoflow/Kubeflow for a couple of weeks 😂
What you're saying is completely correct. The Knative Operator is probably a good fit to reduce the maintenance overhead. However, I haven't yet had time to look into it. The Istio <-> Knative <-> KFServing interplay is very fragile and took a couple weeks to get working properly (which also hasn't been upstreamed yet), so implementing the Knative Operator would need some special attention and testing.
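To make the Operator route concrete, a rough sketch of the KnativeServing custom resource the Knative Operator would reconcile once installed; the API version, domain, and Istio gateway wiring below are assumptions, not tested argoflow-aws settings:

```yaml
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  # Assumption: route Knative traffic through the Istio ingress already
  # installed by argoflow-aws rather than a separate ingress layer.
  ingress:
    istio:
      enabled: true
  config:
    domain:
      # placeholder domain; the real value would come from the setup.conf values
      example.com: ""
    istio:
      gateway.knative-serving.knative-ingress-gateway: istio-ingressgateway.istio-system.svc.cluster.local
```

ArgoCD would then only track the operator manifests and this one resource, while the operator owns the webhook and RBAC objects that currently flip-flop.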
I'm at an early-stage startup so my availability is super patchy - I wouldn't want to start something and leave it hanging halfway. I will poke around at the KFServing/Knative parts and see what's going on - no promises I can take this on but I will always do what I can!
Update - also running into this issue: kserve/kserve#848
Update - I think I've whittled it down to stuff that can be addressed by ignoreDifferences.
Update - ignoreDifferences is done, I'm currently validating and will submit PRs. Sorry for the long lead time!
We're setting up Kubeflow (argoflow-aws) from scratch, including the infrastructure, and hit some stumbling blocks along the way. Wanted to document them all here (for now) and address them as needed with PRs etc.
I realize that #84 exists, happy to merge into there but I'm not sure that issue deals with the specific 0.1.6 tag. That might be part of my issue as well since some things are more up-to-date on the master branch.
Current issues (can be triaged and split into separate issues or merged into existing issues)
❌ OPEN ISSUES
These are mainly based off of broken functionality or application statuses in ArgoCD:
- knative
- mpi-operator (https://github.com/kubeflow/mpi-operator): crashes (Logs)
- aws-eks-resources (ignoreDifferences)

✅ SOLVED ISSUES
- [✅ SOLVED] oauth2-proxy (kubeflow_oidc_cookie_secret output variable)
- [✅ SOLVED] pipelines: values in setup.conf must NOT be quoted
- [✅ SOLVED] aws-load-balancer-controller
- [✅ SOLVED] Central Dashboard: aws-load-balancer-controller has an issue (see below); doesn't work with kubectl port-forward but rather needs to be accessed through the proper URL of <<__subdomain_dashboard__>>.<<__domain__>>
- [✅ SOLVED] kube-prometheus-stack: the kube-prometheus-stack-grafana ConfigMap and Secret are going out of sync (in ArgoCD), which causes checksums in the Deployment to go out of sync as well
  - master (b90cb8a)