
[0.1.6] Deploying argoflow-aws #227

Open
jai opened this issue Sep 14, 2021 · 13 comments
Comments

@jai
Contributor

jai commented Sep 14, 2021

We're setting up Kubeflow (argoflow-aws) from scratch, including the infrastructure, and hit some stumbling blocks along the way. I wanted to document them all here (for now) and address them as needed with PRs etc.

I realize that #84 exists and I'm happy to merge this into there, but I'm not sure that issue deals with the specific 0.1.6 tag. That might be part of my problem as well, since some things are more up to date on the master branch.

Current issues (can be triaged and split into separate issues or merged into existing issues)

❌ OPEN ISSUES

These are mainly based on broken functionality or application statuses in ArgoCD.

knative

mpi-operator (https://github.com/kubeflow/mpi-operator)

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.

  • Crashes

  • Logs

    flag provided but not defined: -kubectl-delivery-image
    Usage of /opt/mpi-operator:
      -add_dir_header
        	If true, adds the file directory to the header
    ...
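
For context, a "flag provided but not defined" error at startup usually means the container args were written for a different mpi-operator release than the image actually being pulled (newer releases dropped the kubectl-delivery helper and its flag). A minimal sketch of the kind of Deployment excerpt to check - the image tags and args below are illustrative assumptions, not taken from the argoflow-aws manifests:

# Hypothetical excerpt of the mpi-operator Deployment's pod spec (illustration only).
# The point: every flag passed in args must be defined by the binary in the image tag used.
containers:
  - name: mpi-operator
    image: mpioperator/mpi-operator:v0.2.3   # illustrative tag from the v1 era
    args:
      - -alsologtostderr
      # only valid on releases that still ship the kubectl-delivery helper;
      # drop it (or pin an older image) if the binary doesn't define the flag
      - -kubectl-delivery-image=mpioperator/kubectl-delivery:v0.2.3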

aws-eks-resources

  • Impact: Low
  • ArgoCD resources out of sync (probably needs ignoreDifferences; see the sketch below)
  • Auto Sync currently turned off to debug
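
For reference, both of the mitigations above live on the ArgoCD Application spec itself. A minimal sketch, with Auto Sync left off and a placeholder ignoreDifferences entry (the source path, destination namespace, and the drifting resource/field are illustrative assumptions, not the actual values from argoflow-aws):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: aws-eks-resources
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<your-fork>/argoflow-aws   # placeholder
    path: aws-eks-resources                                # placeholder path
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system                                 # placeholder namespace
  # no syncPolicy.automated block -> Auto Sync stays off while debugging
  ignoreDifferences:
    - group: ""              # "" = core API group (placeholder example)
      kind: ConfigMap
      name: aws-auth         # placeholder: whichever resource keeps drifting
      jsonPointers:
        - /data              # placeholder path for the field the cluster rewrites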

✅ SOLVED ISSUES

[✅ SOLVED] oauth2-proxy

  • Impact: Unknown
  • Problem - CreateContainerConfigError: secret "oauth2-proxy" not found
    • Solution - Secrets need to be manually updated in AWS Secrets Manager for oauth2-proxy (see the sketch after this list):
      • client-id and client-secret (GCP link)
      • cookie-secret (generated by Terraform - see the kubeflow_oidc_cookie_secret output variable)
  • Problem - cannot contact the Redis cluster
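
For reference, the CreateContainerConfigError means the pod references a Kubernetes Secret named oauth2-proxy that doesn't exist yet. Once the AWS Secrets Manager entries are filled in and synced, the resulting Secret should roughly take the shape below - the namespace and exact key names are assumptions based on the error and the bullets above, not copied from the manifests:

apiVersion: v1
kind: Secret
metadata:
  name: oauth2-proxy
  namespace: oauth2-proxy    # assumption: whatever namespace the oauth2-proxy Deployment runs in
type: Opaque
stringData:
  client-id: "<oidc client id>"                                  # from the OIDC provider
  client-secret: "<oidc client secret>"
  cookie-secret: "<kubeflow_oidc_cookie_secret Terraform output>"

In argoflow-aws this Secret is populated from AWS Secrets Manager rather than applied by hand, so the actual fix is updating the Secrets Manager entries as described above.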

[✅ SOLVED] pipelines

  • Impact: High
  • Crash Loop
  • Logs:
F0914 02:03:01.977497       7 main.go:240] Fatal error config file: While parsing config: invalid character 't' after object key:value pair
  • Solution - values in setup.conf must NOT be quoted (for example, enter a domain as example.com, not "example.com")

[✅ SOLVED] aws-load-balancer-controller

  • Impact: High
  • Blocks accessing UI/dashboard
  • Load Balancer isn't being created, logs:
2021/09/14 09:46:15 http: TLS handshake error from 172.31.39.152:54030: remote error: tls: bad certificate

{"level":"error","ts":1631613104.4709718,"logger":"controller","msg":"Reconciler error","controller":"service","name":"istio-ingressgateway","namespace":"istio-system","error":"Internal error occurred: failed calling webhook \"mtargetgroupbinding.elbv2.k8s.aws\": Post \"https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"aws-load-balancer-controller-ca\")"}

[✅ SOLVED] Central Dashboard

> [email protected] start /app
> npm run serve


> [email protected] serve /app
> node dist/server.js

Initializing Kubernetes configuration
Unable to fetch Application information: 404 page not found

"aws" is not a supported platform for Metrics
Using Profiles service at http://profiles-kfam.kubeflow:8081/kfam
Server listening on port http://localhost:8082 (in production mode)
Unable to fetch Application information: 404 page not found
2021-09-14T02:39:12.655692792Z
  • Update - it seems we shouldn't port-forward into the dashboard. However, aws-load-balancer-controller had an issue (see the aws-load-balancer-controller section above)
  • Solution: the dashboard cannot be accessed using kubectl port-forward; it needs to be accessed through the proper URL, <<__subdomain_dashboard__>>.<<__domain__>>

[✅ SOLVED] kube-prometheus-stack

  • Impact: Low
  • kube-prometheus-stack-grafana ConfigMap and Secret are going out of sync (in ArgoCD), which causes checksums in the Deployment to go out of sync as well
  • Was an issue on v0.1.6; resolved by deploying master (b90cb8a). For reference, a possible ignoreDifferences workaround is sketched below.
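
This one was resolved by moving to master, but for reference, checksum-annotation drift like this can also be silenced from the Application side. A rough sketch, assuming the drift shows up on the Grafana Deployment's pod-template annotations (the resource name and pointer path are assumptions):

spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      name: kube-prometheus-stack-grafana
      jsonPointers:
        # assumption: the checksum annotations that keep changing live on the pod template
        - /spec/template/metadata/annotations
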
@EKami

EKami commented Sep 19, 2021

Wow, thanks a lot for this! Very helpful!

@EKami

EKami commented Sep 19, 2021

I would also add:
Ext-dns record not created in Route53:

Delete the istio app in the ArgoCD dashboard; it will recreate the resource and update the DNS entries.
The deployment of ext-dns must happen in this order:
istio-operator -> external-dns -> istio-resources -> istio
to properly update the DNS entries.

Might be possible to fix with:

  annotations:
    argocd.argoproj.io/sync-wave: "2"

@EKami

EKami commented Sep 19, 2021

Any idea why knative is not synchronizing properly?

@jai
Contributor Author

jai commented Sep 21, 2021

> I would also add:
> Ext-dns record not created in Route53:
>
> Delete the istio app in the ArgoCD dashboard; it will recreate the resource and update the DNS entries.
> The deployment of ext-dns must happen in this order:
> istio-operator -> external-dns -> istio-resources -> istio
> to properly update the DNS entries.
>
> Might be possible to fix with:
>
>   annotations:
>     argocd.argoproj.io/sync-wave: "2"

Yes, there are a few applications/scenarios that seem to need to happen in the correct order. One is definitely external-dns - I am trying to be diligent and write down the others as I see them, but it's super hard to verify a single order of events given the sheer number of ArgoCD Applications!
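
For what it's worth, sync waves could make that ordering explicit, since waves are honoured across the child Applications when they are all created and synced by the parent app-of-apps. A sketch, assuming Application names matching the components above and with specs omitted - the wave numbers just need to be increasing:

# Sketch only - specs omitted; the annotation goes on each child ArgoCD Application.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio-operator
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: external-dns
  annotations:
    argocd.argoproj.io/sync-wave: "1"
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio-resources
  annotations:
    argocd.argoproj.io/sync-wave: "2"
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: istio
  annotations:
    argocd.argoproj.io/sync-wave: "3"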

@jai
Contributor Author

jai commented Sep 21, 2021

> Any idea why knative is not synchronizing properly?

There are a few apps that are fighting with K8s over fields that keep going out of sync - I had this with the Knative install in our regular compute cluster too. Below is the ignoreDifferences block we use for our Knative install:

spec:
  ignoreDifferences:
  - group: rbac.authorization.k8s.io
    kind: ClusterRole
    jsonPointers:
    - /rules
  - group: admissionregistration.k8s.io
    kind: ValidatingWebhookConfiguration
    jsonPointers:
    - /webhooks/0/rules
  - group: admissionregistration.k8s.io
    kind: MutatingWebhookConfiguration
    jsonPointers:
    - /webhooks/0/rules

The argoflow-aws Knative ArgoCD Application is going out of sync on the following objects (an ArgoCD diff screenshot was attached for each):

  • MutatingWebhookConfiguration
    • webhook.domainmapping.serving.knative.dev - webhooks.0.rules.0.resources.1 (domainmappings/status)
    • webhook.serving.knative.dev
  • ValidatingWebhookConfiguration
    • validation.webhook.domainmapping.serving.knative.dev
    • validation.webhook.serving.knative.dev
  • ClusterRole
    • knative-serving-admin
    • knative-serving-aggregated-addressable-resolver


Am I the only one seeing these go out of sync? This isn't the only app - I have a few of them and will post the list.

@davidspek
Member

@jai Thanks for the very detailed issue thread you've started here. Sadly I haven't had much time to dedicate to the ArgoFlow repositories since starting my new job. However, there are a lot of very big Kubeflow improvements I'm working on. Basically it's a completely redesigned architecture that simplifies Kubeflow and adds better security and more advanced features around User/Group/Project management.

Regarding the KNative manifests, they are quite a pain, especially with Kustomize. I've got a Helm chart that should be usable instead, which should get rid of this continuous syncing issue. Would you like to help move the KNative deployment over to Helm? If so, I can clean up the chart a little bit and add it to a registry for you to depend on.
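
For anyone following along, pointing an ArgoCD Application at a chart in a Helm repository looks roughly like the sketch below - the repoURL, chart name, and version are placeholders, since the chart mentioned above hasn't been published yet:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: knative
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.github.io/charts   # placeholder Helm repository URL
    chart: knative-serving                      # placeholder chart name
    targetRevision: 0.1.0                       # placeholder chart version
    helm:
      values: |
        # chart values would go here
  destination:
    server: https://kubernetes.default.svc
    namespace: knative-serving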

@jai
Contributor Author

jai commented Sep 21, 2021

ArgoCD Applications that are flip-flopping - not sure what the technical term is. Basically ArgoCD installs a manifest, then the cluster seems to override some values, causing an update tug-of-war kind of thing. I will post details of which resources are causing this:

  • aws-eks-resources
  • istio-resources
  • kfserving
  • knative
  • notebook-controller
  • pipelines
  • pod-defaults
  • roles

@jai
Contributor Author

jai commented Sep 21, 2021

> @jai Thanks for the very detailed issue thread you've started here. Sadly I haven't had much time to dedicate to the ArgoFlow repositories since starting my new job. However, there are a lot of very big Kubeflow improvements I'm working on. Basically it's a completely redesigned architecture that simplifies Kubeflow and adds better security and more advanced features around User/Group/Project management.
>
> Regarding the KNative manifests, they are quite a pain, especially with Kustomize. I've got a Helm chart that should be usable instead, which should get rid of this continuous syncing issue. Would you like to help move the KNative deployment over to Helm? If so, I can clean up the chart a little bit and add it to a registry for you to depend on.

Does argoflow/argoflow-aws use vanilla Knative? If I understand what you're saying, we would have to maintain a Helm repo with the Knative manifests, which sounds like one more thing to maintain. Is there a way we can point it at the Knative Operator and then just install a CRD? I might be way off base since I've only been working with Argoflow/Kubeflow for a couple of weeks 😂
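
To make the Knative Operator idea concrete: the operator watches a KnativeServing custom resource, so the Application would install the operator and then a single custom resource along the lines of the sketch below (a minimal sketch under stated assumptions; the version pin and domain config are illustrative, not what argoflow-aws would actually need):

apiVersion: operator.knative.dev/v1alpha1   # the operator API version current around this thread
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  version: "0.22.0"        # illustrative version pin
  config:
    domain:
      # assumption: the serving domain would come from the argoflow placeholders
      "<<__domain__>>": ""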

@davidspek
Member

What you're saying is completely correct. The Knative Operator is probably a good fit to reduce the maintenance overhead. However, I haven't yet had time to look into it. The Istio <-> Knative <-> KFServing interplay is very fragile and took a couple weeks to get working properly (which also hasn't been upstreamed yet), so implementing the Knative Operator would need some special attention and testing.

@jai
Contributor Author

jai commented Sep 21, 2021

> What you're saying is completely correct. The Knative Operator is probably a good fit to reduce the maintenance overhead. However, I haven't yet had time to look into it. The Istio <-> Knative <-> KFServing interplay is very fragile and took a couple weeks to get working properly (which also hasn't been upstreamed yet), so implementing the Knative Operator would need some special attention and testing.

I'm at an early-stage startup so my availability is super patchy - I wouldn't want to start something and leave it hanging halfway. I will poke around at the KFServing/Knative parts and see what's going on - no promises I can take this on but I will always do what I can!

@jai
Contributor Author

jai commented Sep 28, 2021

Update - also running into this issue: kserve/kserve#848

@jai
Contributor Author

jai commented Oct 25, 2021

Update - I've whittled it down to things that I think can be addressed by ignoreDifferences in the ArgoCD Application CRD. I'll open a draft PR to see if that's the best way to address these issues or if there's a better way to fix them upstream/in other areas.

@jai
Contributor Author

jai commented Jan 19, 2022

Update - ignoreDifferences is done; I'm currently validating and will submit PRs. Sorry for the long lead time!
