Document how to enable Requester Pays access to Pangeo data for GCP hubs #662

Merged on Sep 11, 2021 (14 commits)
125 changes: 125 additions & 0 deletions docs/howto/configure/data-access.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
# Data Access

Here we will document various ways to grant hubs access to external data.

## Data Access via Requester Pays

For some hubs, such as our Pangeo deployments, the communities they serve require access to data stored in other projects.
Normally, whoever _hosts_ the data pays the charges incurred when it is accessed.
However, there is a method that makes those requesting the data responsible for the charges instead: [Requester Pays](https://cloud.google.com/storage/docs/requester-pays).
This section demonstrates the steps required to set up this method.

### Setting up Requester Pays Access on GCP

```{note}
We may automate these steps in the future.
```

Make sure you are logged into the `gcloud` CLI and have set the default project to be the one you wish to work with.

```{note}
These steps should be run every time a new hub is added to a cluster, to avoid sharing of credentials.
```
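
As a quick check before starting, the active account and default project can be printed. This is a minimal sketch; the function name `show_gcloud_context` is hypothetical and not part of the original steps.

```bash
# Hypothetical helper: print the active gcloud account and the default
# project so both can be checked by eye before running the steps below.
show_gcloud_context() {
  gcloud auth list --filter=status:ACTIVE --format="value(account)"
  gcloud config get-value project
}
```

Running `show_gcloud_context` should print the account you are logged in as, followed by the project the remaining commands will act on.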

1. Create a new Service Account

```bash
gcloud iam service-accounts create {{ NAMESPACE }}-user-sa \
--description="Service Account to allow access to external data stored elsewhere in the cloud" \
--display-name="Requester Pays Service Account"
```

where:

- `{{ NAMESPACE }}-user-sa` will be the name of the Service Account, and
- `{{ NAMESPACE }}` is the name of the deployment, e.g. `staging`.

```{note}
We create a separate service account for this so as to avoid granting excessive permissions to any single service account.
We may change this policy in the future.
```
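
The later steps refer to this Service Account by its email address, which follows from the name chosen above. The sketch below derives that email and looks the account up. The function names `gsa_email` and `describe_gsa` are hypothetical, and `describe_gsa` assumes the default project is the one the account was created in.

```bash
# Hypothetical helper: build the Google Service Account email.
# Arguments: deployment namespace, Google Cloud project ID.
gsa_email() {
  printf '%s-user-sa@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Hypothetical helper: show the account's metadata. A non-zero exit
# status suggests the account was not created.
describe_gsa() {
  gcloud iam service-accounts describe "$(gsa_email "$1" "$2")"
}
```

For example, `gsa_email staging my-project-id` prints `staging-user-sa@my-project-id.iam.gserviceaccount.com`, where `my-project-id` is a made-up project ID.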

2. Grant the Service Account roles on the project

We will need to grant the [Service Usage Consumer](https://cloud.google.com/iam/docs/understanding-roles#service-usage-roles) and [Storage Object Viewer](https://cloud.google.com/iam/docs/understanding-roles#cloud-storage-roles) roles on the project to the new service account.

```bash
gcloud projects add-iam-policy-binding \
--role roles/serviceusage.serviceUsageConsumer \
--member "serviceAccount:{{ NAMESPACE }}-user-sa@{{ PROJECT_ID }}.iam.gserviceaccount.com" \
{{ PROJECT_ID }}

gcloud projects add-iam-policy-binding \
--role roles/storage.objectViewer \
--member "serviceAccount:{{ NAMESPACE }}-user-sa@{{ PROJECT_ID }}.iam.gserviceaccount.com" \
{{ PROJECT_ID }}
```

where:

- `{{ PROJECT_ID }}` is the ID of the Google Cloud project, **not** the display name!
- `{{ NAMESPACE }}` is the deployment namespace

````{note}
If you're not sure what `{{ PROJECT_ID }}` should be, you can run:

```bash
gcloud config get-value project
```
````
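
To confirm that both bindings were applied, the project's IAM policy can be filtered down to the roles held by the new Service Account. This is a sketch rather than part of the original procedure, and the function name `list_gsa_roles` is hypothetical.

```bash
# Hypothetical helper: list the project-level roles bound to the
# Requester Pays Service Account.
# Arguments: deployment namespace, Google Cloud project ID.
list_gsa_roles() {
  local namespace="$1" project_id="$2"
  local member="serviceAccount:${namespace}-user-sa@${project_id}.iam.gserviceaccount.com"
  gcloud projects get-iam-policy "${project_id}" \
    --flatten="bindings[].members" \
    --filter="bindings.members:${member}" \
    --format="value(bindings.role)"
}
```

If the two commands above succeeded, the output should include `roles/serviceusage.serviceUsageConsumer` and `roles/storage.objectViewer`.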

3. Grant the Service Account the `workloadIdentityUser` role on the cluster

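We will now grant the [Workload Identity User](https://cloud.google.com/iam/docs/understanding-roles#service-accounts-roles) role on the new Google Service Account to the hub's Kubernetes service account, so that the cluster can act as the Google Service Account on behalf of the users.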

```bash
gcloud iam service-accounts add-iam-policy-binding \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:{{ PROJECT_ID }}.svc.id.goog[{{ NAMESPACE }}/{{ SERVICE_ACCOUNT }}]" \
{{ NAMESPACE }}-user-sa@{{ PROJECT_ID }}.iam.gserviceaccount.com
```

where:

- `{{ PROJECT_ID }}` is the project ID of the Google Cloud Project.
Note: this is the **ID**, not the display name!
- `{{ NAMESPACE }}` is the Kubernetes namespace/deployment to grant access to
- `{{ SERVICE_ACCOUNT }}` is the _Kubernetes_ service account to grant access to.
Usually, this is `user-sa`.
Run `kubectl --namespace {{ NAMESPACE }} get serviceaccount` if you're not sure.
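
The `--member` value contains square brackets, which some shells (zsh, for example) treat as a glob pattern when left unquoted. The sketch below builds the string with `printf` so the brackets stay literal; the function name `wi_member` is hypothetical.

```bash
# Hypothetical helper: build the Workload Identity member string.
# Arguments: project ID, Kubernetes namespace, Kubernetes service account.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}
```

The result can then be passed quoted, e.g. `--member "$(wi_member my-project-id staging user-sa)"`, where `my-project-id` is a made-up project ID.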

4. Link the Google Service Account to the Kubernetes Service Account

We now link the two service accounts together so Kubernetes can use the Google API.

```bash
kubectl annotate serviceaccount \
--namespace {{ NAMESPACE }} \
{{ SERVICE_ACCOUNT }} \
iam.gke.io/gcp-service-account={{ NAMESPACE }}-user-sa@{{ PROJECT_ID }}.iam.gserviceaccount.com
```

where:

- `{{ NAMESPACE }}` is the target Kubernetes namespace
- `{{ SERVICE_ACCOUNT }}` is the target Kubernetes service account name.
Usually, this is `user-sa`.
Run `kubectl --namespace {{ NAMESPACE }} get serviceaccount` if you're not sure.
- `{{ PROJECT_ID }}` is the project ID of the Google Cloud Project.
Note: this is the **ID**, not the display name!
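
One way to check that the annotation landed is to read it back from the Kubernetes service account. This is a sketch with a hypothetical function name, `show_gsa_annotation`. The dots inside the annotation key are escaped because JSONPath would otherwise read them as path separators.

```bash
# Hypothetical helper: print the Google Service Account that a
# Kubernetes service account is annotated with.
# Arguments: Kubernetes namespace, Kubernetes service account name.
show_gsa_annotation() {
  kubectl get serviceaccount "$2" --namespace "$1" \
    -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'
}
```

`show_gsa_annotation staging user-sa` should print the Google Service Account email used in the `kubectl annotate` command above, and nothing if the annotation is missing.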

5. RESTART THE HUB

This is a very important step.
If you don't do this you won't see the changes applied.

You can restart the hub by heading to `https://{{ hub_url }}/hub/admin` (you need to be logged in as admin), clicking the "Shutdown Hub" button, and waiting for it to come back up.

You can now test the Requester Pays access by starting a server on the hub and running the code below in a script or Notebook.

```python
from intake import open_catalog

# Open the Pangeo ocean altimetry catalog
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean/altimetry.yaml")
# Lazily load the `j3` entry via Dask
ds = cat['j3'].to_dask()
```
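
If the Python test fails, it may help to check which Google identity the user server is running as. The sketch below queries the metadata server from a terminal on the hub. It assumes `curl` is available in the user image, and the function name `active_gsa` is hypothetical.

```bash
# Hypothetical helper: ask the metadata server which Google Service
# Account the current pod is acting as.
active_gsa() {
  curl --silent --header "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
}
```

With the setup above in place, `active_gsa` should print the `{{ NAMESPACE }}-user-sa@{{ PROJECT_ID }}.iam.gserviceaccount.com` address rather than a node-level default.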
1 change: 1 addition & 0 deletions docs/howto/configure/index.md
Original file line number Diff line number Diff line change
@@ -4,4 +4,5 @@
auth-management.md
update-env.md
culling.md
data-access.md
```