Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setup workload identity via terraform #1130

Merged
merged 86 commits into from
Apr 1, 2022

Conversation

yuvipanda
Copy link
Member

@yuvipanda yuvipanda commented Mar 18, 2022

The GKE config connector was helpful in letting us deploy
Google Cloud Service Accounts with permissions for cloud storage
directly just from helm. However, it has been difficult to debug,
and in #669
we decided to move away from it and towards creating these
cloud resources via Terraform.

This commit adds:

  • Terraform code that will create a Google Service Account,
    bind it to a given Kubernetes Service Account, for a list of
    hub namespaces passed in. This means that some hub initial deployments
    now can not be done just with CD, but need manual work with
    terraform. I think this would be any hub that wants to use
    requestor pays or scratch buckets. This would need to be
    documented.
  • Move meom-ige to use this new scheme. metadata concealment
    (https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata#concealment)
    which is what we were using earlier as alternative to config-connector
    • workload identity, is no longer supported by the terraform
      google provider. In b7b42ce,
      we changed the default from 'SECURE' to 'UNSPECIFIED', but
      it looks like 'UNSPECIFIED' really means 'use workload identity'
      haha. When Switch to using pd-balanced for all user & dask nodes #1124 was
      deployed to meom-ige yesterday, it seems to have enabled workload
      identity, causing cloud access to stop working, leading to
      https://2i2c.freshdesk.com/a/tickets/107. Further investigation on
      what happened here is needed, but I've currently fixed it by
      just deploying this change for meom-ige.
  • Documentation is provided for each 'feature', and how 2i2c engineers can
    enable these on a per-hub basis. They are designed to be 'feature flags'
    in terraform wherever possible. This still involves copying some terraform
    outputs to helm values files unfortunately, as we use helm to create namespaces
    (and terraform doesn't have a way to say 'create this namespace if required',
    unlike eksctl)
  • Add a 'features/' subdirectory to documentation, that lists specifically just
    hub features, with a corresponding entry in the topic page. Will slowly move
    other documentation about features to this structure.
  • Rename current cloud-access.md document to clarify it's about auth
    for 2i2c engineers, not about accessing cloud features from the hub
  • Remove scratch bucket from the 3 hubs that have it on the 2i2c cluster -
    dask-staging, ohw and catalyst-cooperative. None of these actually have
    used their bucket. requestor_pays is enabled for all of them.

TODO:

  • Separate requestor pays & scratch bucket access into feature flags
  • Document how to create new hubs that have either of these features
  • Figure out how to deal with creating namespace for the secret to live in.
    Perhaps we can move the Kubernetes secret creation away from terraform
    and into helm to simplify this?
  • Apply this to existing pangeo-hubs
  • Apply this to hubs that use this right now in 2i2c cluster

Ref #669
Ref #1046

@yuvipanda yuvipanda marked this pull request as draft March 18, 2022 10:00
@yuvipanda
Copy link
Member Author

This was caught by @sgibson91 in #1074, but then https://2i2c.freshdesk.com/a/tickets/107 happened today as well so I had to fix that. Automating this seemed the best way to go forward.

@yuvipanda
Copy link
Member Author

Looking at the cloud buckets in https://console.cloud.google.com/storage/browser?forceOnBucketsSortingFiltering=false&project=two-eye-two-see&prefix=&forceOnObjectsSortingFiltering=false, none of them actually have any content in them! So nobody is actually using scratch buckes in the default pilot-hubs cluster.

@sgibson91
Copy link
Member

Thank you for your work following this up @yuvipanda! 🙌

The GKE config connector was helpful in letting us deploy
Google Cloud Service Accounts with permissions for cloud storage
directly just from helm. However, it has been difficult to debug,
and in 2i2c-org#669
we decided to move away from it and towards creating these
cloud resources via Terraform.

This commit adds:
- Terraform code that will create a Google Service Account,
  bind it to a given Kubernetes Service Account, for a list of
  hub namespaces passed in. This means that some hub initial deployments
  now *can not be done just with CD*, but need manual work with
  terraform. I think this would be any hub that wants to use
  requestor pays or scratch buckets. This would need to be
  documented.
- Move meom-ige to use this new scheme. metadata concealment
  (https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata#concealment)
  which is what we were using earlier as alternative to config-connector
  + workload identity, is no longer supported by the terraform
  google provider. In b7b42ce,
  we changed the default from 'SECURE' to 'UNSPECIFIED', but
  it looks like 'UNSPECIFIED' really means 'use workload identity'
  haha. When 2i2c-org#1124 was
  deployed to meom-ige yesterday, it seems to have enabled workload
  identity, causing cloud access to stop working, leading to
  https://2i2c.freshdesk.com/a/tickets/107. Further investigation on
  what happened here is needed, but I've currently fixed it by
  just deploying this change for meom-ige.
- All hubs are given access to all buckets we create. This is
  inadequete, and needs to be more fine grained.

Ref 2i2c-org#669
Ref 2i2c-org#1046
Much cleaner than trying to just 'do it all' based on a
list of hubs.
Service Accounts have name length limitations that are
busted by us using the full hub name there (looking at you,
catalyst-cooperative hub). Instead, we explicitly specify the
hub namespace.
@yuvipanda yuvipanda force-pushed the no-config-connector branch 2 times, most recently from c03ab51 to 69ec676 Compare March 29, 2022 03:31
- Write docs on how to setup workload-identity
- Create features/ subsection of the docs
- Remove older instructions
@yuvipanda yuvipanda requested a review from a team March 29, 2022 03:41
@yuvipanda yuvipanda marked this pull request as ready for review March 29, 2022 03:41
Copy link
Member

@GeorgianaElena GeorgianaElena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really love the level of documentation in this PR and I'm very grateful for understanding what's happening here with little knowledge of the terraform syntax. 🌼 Thank you, Yuvi!

config/clusters/meom-ige/common.values.yaml Outdated Show resolved Hide resolved
terraform/gcp/buckets.tf Show resolved Hide resolved
Copy link
Member

@sgibson91 sgibson91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments/typo fixes, but this looks good to me! 🚀

docs/howto/features/cloud-access.md Outdated Show resolved Hide resolved
docs/howto/features/cloud-access.md Outdated Show resolved Hide resolved
docs/howto/features/cloud-access.md Outdated Show resolved Hide resolved
@@ -1,3 +1,7 @@
userServiceAccount:
enabled: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this default to true? I'm not sure if it's better to explicitly enable features or rely on default config

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment about that - but the short version is, we always have a service account attached to pods, regardless of wether they have any permissions assigned or not. I think if we don't, the default service account gets attached.

yuvipanda and others added 3 commits March 30, 2022 09:28
Note that we're *removing* the scratch bucket feature from
these hubs - those buckets are not being used at all. Simplify!
Co-authored-by: Sarah Gibson <[email protected]>
sgibson91 and others added 25 commits April 1, 2022 13:44
'Missing' means a prod hub job was generated but no support or staging
Its not something that is common in the codebase and we don't check them,
so we shouldn't use them for now. This decision can be reverted later if
required.
Co-authored-by: choldgraf <null>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
These are no longer being set by the automatic config-connector,
so we need to set them. But I think pangeo-hubs wasn't using
this anyway so maybe not a biggie.
@yuvipanda
Copy link
Member Author

pangeo-hubs I've preserved current access, as deploying this there would require #1153 to be done too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants