reboot with a simpler setup, using lessons from PyPI work
brabster committed Jan 4, 2024
1 parent e6d85e0 commit 89c7ab2
Showing 25 changed files with 264 additions and 138 deletions.
2 changes: 0 additions & 2 deletions .envs/README.md
@@ -1,3 +1 @@
This directory contains environment-specific configurations for use in pipeline deployment.

Example to follow...
2 changes: 1 addition & 1 deletion .envs/prod.env → .envs/prod/.env
@@ -1,3 +1,3 @@
export DBT_DATASET=pypi
export DBT_DATASET=bbc_news_example
export DBT_LOCATION=US
export DBT_PROJECT=pypi-408816
30 changes: 30 additions & 0 deletions .envs/prod/README.md
@@ -0,0 +1,30 @@
DBT does not directly manage datasets/schemas and their permissions.

If you want to manage your dataset ACL as part of the build,
provide a JSON document named `dataset_acl.json` describing the permissions you want,
then uncomment the `bq update` command in the dataset job of the workflow file (shown after the example document below).

See https://cloud.google.com/bigquery/docs/control-access-to-resources-iam#grant_access_to_a_dataset

```json
{
  "access": [
    {
      "role": "READER",
      "specialGroup": "projectReaders"
    },
    {
      "role": "WRITER",
      "specialGroup": "projectWriters"
    },
    {
      "role": "OWNER",
      "specialGroup": "projectOwners"
    }
  ]
}
```
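
For reference, this is the `bq update` command the workflow's dataset job uses to apply the ACL document above (a sketch of the invocation; it assumes `DBT_PROJECT` and `DBT_DATASET` are set, for example by sourcing `.envs/prod/.env`):

```console
# apply the ACL document to the prod dataset
bq update --source .envs/prod/dataset_acl.json "${DBT_PROJECT}:${DBT_DATASET}"
```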

Terraform is the other obvious option for managing datasets, but it adds complexity and a new toolset/supply chain.

3 changes: 0 additions & 3 deletions .envs/test.env

This file was deleted.

99 changes: 99 additions & 0 deletions .github/GCP_WIF_SETUP.md
@@ -0,0 +1,99 @@
Based on https://cloud.google.com/blog/products/identity-security/enabling-keyless-authentication-from-github-actions

Setting up Workload Identity Federation for GitHub Actions.
This assumes `$DBT_PROJECT` is set to the project you want the pool/provider in.
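
For example (a hypothetical project ID; in this repository the value comes from sourcing `.envs/prod/.env`):

```console
# hypothetical project ID - substitute your own GCP project
export DBT_PROJECT=my-gcp-project
```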

# Set up WIF in-project

I'm unsure whether setting up a WIF pool/provider in each project is the best approach, but it seems like the least risky option.

## Gather some info

```console
export WIF_PROJECT_NUMBER=$(gcloud projects describe "${DBT_PROJECT}" --format="value(projectNumber)")
export WIF_POOL=dbt-pool
export WIF_PROVIDER=dbt-provider
# assumes an SSH-style remote URL, e.g. git@github.com:owner/repo.git
export WIF_GITHUB_REPO=$(git remote get-url origin | cut -d: -f2 | cut -d. -f1)
export WIF_SERVICE_ACCOUNT=pypi-vulnerabilities
```

## Ensure the IAM Credentials API is enabled

```console
gcloud services enable iamcredentials.googleapis.com --project "${DBT_PROJECT}"
```

## Set up the Service Account

```console
gcloud iam service-accounts create "${WIF_SERVICE_ACCOUNT}" \
--project="${DBT_PROJECT}" \
--description="DBT service account" \
--display-name="${WIF_SERVICE_ACCOUNT}"
```

## Set up the Workload Identity Pool and Provider

```console
gcloud iam workload-identity-pools create "${WIF_POOL}" \
--project="${DBT_PROJECT}" \
--location="global" \
--display-name="DBT Pool"
```

```console
gcloud iam workload-identity-pools providers create-oidc "${WIF_PROVIDER}" \
--project="${DBT_PROJECT}" \
--location="global" \
--workload-identity-pool="${WIF_POOL}" \
--display-name="DBT provider" \
--attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository" \
--issuer-uri="https://token.actions.githubusercontent.com"
```

## Collect the IDs of the Workload Identity Pool and Provider

```console
export WIF_POOL_PROVIDER_ID=$(gcloud iam workload-identity-pools providers describe "${WIF_PROVIDER}" --location=global --project "${DBT_PROJECT}" --workload-identity-pool "${WIF_POOL}" --format="value(name)")
export WIF_POOL_ID=$(gcloud iam workload-identity-pools describe "${WIF_POOL}" --location=global --project "${DBT_PROJECT}" --format="value(name)")
```
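
The collected IDs are full resource names and should look something like this (illustrative values; your project number, pool and provider names will differ):

```console
# illustrative values only
WIF_POOL_ID=projects/123456789012/locations/global/workloadIdentityPools/dbt-pool
WIF_POOL_PROVIDER_ID=projects/123456789012/locations/global/workloadIdentityPools/dbt-pool/providers/dbt-provider
```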

## Set up IAM to allow GitHub to impersonate the Service Account

```console
gcloud iam service-accounts add-iam-policy-binding "${WIF_SERVICE_ACCOUNT}@${DBT_PROJECT}.iam.gserviceaccount.com" \
--project="${DBT_PROJECT}" \
--role="roles/iam.workloadIdentityUser" \
--member="principalSet://iam.googleapis.com/${WIF_POOL_ID}/attribute.repository/${WIF_GITHUB_REPO}"
```

```console
gcloud iam service-accounts add-iam-policy-binding "${WIF_SERVICE_ACCOUNT}@${DBT_PROJECT}.iam.gserviceaccount.com" \
--project="${DBT_PROJECT}" \
--role="roles/iam.serviceAccountTokenCreator" \
--member="serviceAccount:${WIF_SERVICE_ACCOUNT}@${DBT_PROJECT}.iam.gserviceaccount.com"
```

## Grant the Service Account BigQuery admin in the project

(You may want to make this role grant more specific! A narrower sketch follows the command below.)

```console
gcloud projects add-iam-policy-binding "${DBT_PROJECT}" \
--role="roles/bigquery.admin" \
--member="serviceAccount:${WIF_SERVICE_ACCOUNT}@${DBT_PROJECT}.iam.gserviceaccount.com"
```
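
If you want something narrower than `bigquery.admin`, a combination like the following may be enough for a dbt build (an untested sketch; adjust the roles to your own constraints):

```console
# narrower alternative: edit data and run query jobs, but no admin rights
gcloud projects add-iam-policy-binding "${DBT_PROJECT}" \
  --role="roles/bigquery.dataEditor" \
  --member="serviceAccount:${WIF_SERVICE_ACCOUNT}@${DBT_PROJECT}.iam.gserviceaccount.com"

gcloud projects add-iam-policy-binding "${DBT_PROJECT}" \
  --role="roles/bigquery.jobUser" \
  --member="serviceAccount:${WIF_SERVICE_ACCOUNT}@${DBT_PROJECT}.iam.gserviceaccount.com"
```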

## Recover Secrets for GitHub

Populate the GitHub Actions secrets for this build with the values printed below.

```console
echo "GitHub Secret: GCP_WORKLOAD_IDENTITY_PROVIDER"
gcloud iam workload-identity-pools providers describe "${WIF_PROVIDER}" --location=global --project "${DBT_PROJECT}" --workload-identity-pool "${WIF_POOL}" --format="value(name)"
```

```console
echo "GitHub Secret: GCP_SERVICE_ACCOUNT"
echo "${WIF_SERVICE_ACCOUNT}@${DBT_PROJECT}.iam.gserviceaccount.com"
```
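
If you use the GitHub CLI, the two secrets can be set directly from these values (a sketch, assuming `gh` is installed and authenticated against this repository):

```console
# requires the GitHub CLI with access to this repository's secrets
gh secret set GCP_WORKLOAD_IDENTITY_PROVIDER --body "${WIF_POOL_PROVIDER_ID}"
gh secret set GCP_SERVICE_ACCOUNT --body "${WIF_SERVICE_ACCOUNT}@${DBT_PROJECT}.iam.gserviceaccount.com"
```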

5 changes: 2 additions & 3 deletions .github/actions/dbt_build/action.yml
@@ -12,13 +12,12 @@ runs:
shell: bash
run: |
source .venv/bin/activate
source .envs/${{ inputs.env }}.env
source .envs/${{ inputs.env }}/.env
rm -rf logs
dbt clean
dbt deps
dbt debug
dbt run
echo "dbt test goes here"
dbt build
dbt docs generate
- name: upload target artifacts
uses: actions/upload-artifact@v3
7 changes: 6 additions & 1 deletion .github/actions/setup_dbt/action.yml
@@ -14,6 +14,11 @@ runs:
python --version
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -U setuptools pip safety
pip install -U -r requirements.txt
- name: safety-check
shell: bash
run: |
source .venv/bin/activate
safety check
15 changes: 9 additions & 6 deletions .github/workflows/deploy.yml
@@ -13,8 +13,6 @@ jobs:
pages: write
steps:
- uses: actions/checkout@v4
with:
base-ref: ref
- uses: ./.github/actions/setup_dbt
- uses: google-github-actions/auth@v2
with:
@@ -23,10 +21,15 @@
- uses: google-github-actions/setup-gcloud@v2
with:
version: '>= 363.0.0'
- uses: ./.github/actions/dbt_build
with:
env: test
- uses: ./.github/actions/dbt_build
- name: ensure prod dataset exists
run: |
source .venv/bin/activate
source .envs/prod/.env
dbt deps
dbt run-operation ensure_target_dataset_exists
# bq update --source .envs/prod/dataset_acl.json "${DBT_PROJECT}:${DBT_DATASET}"
- name: prod dbt build
uses: ./.github/actions/dbt_build
with:
env: prod
- uses: actions/upload-pages-artifact@v3
16 changes: 10 additions & 6 deletions .gitignore
@@ -1,15 +1,19 @@
# Python virtualenv files
.venv/
/.venv/

# User's environment settings
.env
/.env

# DBT logs
logs/
/logs/

# DBT target dir
target/
/target/

# DBT packages
dbt_packages/
package-lock.yml
/dbt_packages/
/package-lock.yml

# files that we don't want committed
/uncommitted/*
!/uncommitted/README.md
14 changes: 2 additions & 12 deletions .vscode/tasks.json
@@ -3,25 +3,15 @@
// for the documentation about the tasks.json format
"version": "2.0.0",
"tasks": [
{
"label": "init_venv",
"type": "shell",
"command": "python",
"args": ["-m", "venv", ".venv"]
},
{
"label": "ensure_pip_version",
"type": "shell",
"command": "pip",
"args": ["install", "--upgrade", "pip"],
"dependsOn": ["init_venv"]
"command": "pip install --upgrade pip"
},
{
"label": "ensure_python_deps_updated",
"type": "shell",
"command": "pip",
"args": ["install", "-U", "-r", "${workspaceFolder}/requirements.txt"],
"dependsOn": ["init_venv"]
"command": "pip install -U -r ${workspaceFolder}/requirements.txt"
},
{
"label": "load_user_env",
18 changes: 9 additions & 9 deletions dbt_project.yml
@@ -29,21 +29,21 @@ clean-targets: # directories to be removed by `dbt clean`
# columns: true

models:
+grant_access_to:
- project: '{{ target.project }}'
dataset: '{{ target.schema }}'
+labels:
# add labels to database objects
stability: stable
data_classification: public
+persist_docs:
# push any model/column descriptions to the target database
relation: true
columns: true

tests:
product:
daily:
# tests in the daily folder should have this clause appended to their where clause
# used here to limit the data scanned by the query
# note the predicate pushdown to the underlying timestamp column in the source data
+where: download_date BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 4 DAY) AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
# example used here to limit the data scanned by the query
# note the predicate pushdown to the underlying timestamp column in the source data if possible
# +where: download_date BETWEEN DATE_SUB(CURRENT_DATE, INTERVAL 4 DAY) AND DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)

on-run-start:
- "{{ ensure_target_dataset_exists() }}" # may or may not be appropriate for your environmental constraints
- "{{ ensure_target_dataset_exists() }}" # may or may not be appropriate for your environmental constraints
- "{{ ensure_udfs() }}"
12 changes: 10 additions & 2 deletions macros/ensure_target_dataset_exists.sql
@@ -4,11 +4,19 @@
{% set dataset_name = target.schema %}
{% set dataset_location = target.location %}

{% do log("Ensuring dataset " ~ project_id ~ "." ~ dataset_name ~ " exists in location " ~ dataset_location ) %}
{{ print("Ensuring dataset " ~ project_id ~ "." ~ dataset_name ~ " exists in location " ~ dataset_location ) }}

{% set create_dataset_query %}
CREATE SCHEMA IF NOT EXISTS `{{ project_id }}`.`{{ dataset_name }}`
OPTIONS (
location = '{{ dataset_location }}'
description = 'Exploring vulnerable PyPI downloads. Managed by https://github.com/brabster/pypi_vulnerabilities',
location = '{{ dataset_location }}',
labels = [('data_classification', 'public')]
)
{% endset %}

{% if execute %}
{% set results = run_query(create_dataset_query) %}
{% endif %}

{% endmacro %}
5 changes: 5 additions & 0 deletions macros/ensure_target_dataset_exists.yml
@@ -0,0 +1,5 @@
version: 2

macros:
- name: ensure_target_dataset_exists
description: Creates the specified dataset if it does not exist and the executor has permission
10 changes: 10 additions & 0 deletions macros/ensure_udfs.sql
@@ -0,0 +1,10 @@
{% macro ensure_udfs() %}
-- See https://www.equalexperts.com/blog/our-thinking/testing-and-deploying-udfs-with-dbt
CREATE OR REPLACE FUNCTION {{ target.schema }}.shout(say STRING)
RETURNS STRING
OPTIONS (description='Shouts the say string. NULL when argument is NULL')
AS (
UPPER(say) || '!'
);

{% endmacro %}
5 changes: 5 additions & 0 deletions macros/ensure_udfs.yml
@@ -0,0 +1,5 @@
version: 2

macros:
- name: ensure_udfs
description: Creates UDFs specified in the macro. Does not clean up any UDFs that are removed.
5 changes: 5 additions & 0 deletions models/categories.sql
@@ -0,0 +1,5 @@
SELECT
category,
COUNT(1) article_count
FROM {{ source('bbc_news', 'fulltext') }}
GROUP BY category
20 changes: 20 additions & 0 deletions models/categories.yml
@@ -0,0 +1,20 @@
version: 2

models:
- name: categories
description: News article categories and counts
columns:
- name: category
description: Category name
tests:
- dbt_utils.at_least_one
- unique
- not_null
- name: article_count
description: Number of articles in category
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0


14 changes: 0 additions & 14 deletions models/daily/package_downloads.sql

This file was deleted.
