Merge pull request #389 from cagov/info_arch
Reorganized documentation
britt-allen authored Oct 1, 2024
2 parents a4b293c + 65131cc commit 1899aa7
Showing 23 changed files with 527 additions and 542 deletions.
File renamed without changes.
6 changes: 3 additions & 3 deletions docs/codespaces.md → docs/code/codespaces.md
@@ -11,14 +11,14 @@ Go to the "Code" dropdown from the main repository page,
select the three-dot dropdown, and select "New with options..."
This will allow more configuration than the default codespace.

![Create a new codespace](images/create-new-codespace.png)
![Create a new codespace](../images/create-new-codespace.png)

In the codespace configuration form, you will have an option to add "Recommended Secrets".
This is where you can add your personal Snowflake credentials to your codespace,
allowing for development against our Snowflake warehouse, including using dbt.
You should only add credentials for accounts that are protected by multi-factor authentication (MFA).

![Add codespace secrets](images/codespace-secrets.png)
![Add codespace secrets](../images/codespace-secrets.png)

After you have added your secrets, click "Create Codespace".
Building it may take a few minutes,
@@ -30,7 +30,7 @@ Once your codespace is created, you should be able to launch it
without re-creating it every time using the "Code" dropdown,
going to "Open in...", and selecting "Open in browser":

![Launch codespace](images/launch-codespace.png)
![Launch codespace](../images/launch-codespace.png)

## Using a Codespace

2 changes: 1 addition & 1 deletion docs/local-setup.md → docs/code/local-setup.md
Expand Up @@ -179,7 +179,7 @@ dse_snowflake:
!!! note
The target name (`dev`) in the above example can be anything.
However, we treat targets named `prd` differently in generating
custom dbt schema names (see [here](./dbt.md#custom-schema-names)).
custom dbt schema names (see [here](../dbt/dbt.md#custom-schema-names)).
We recommend naming your local development target `dev`, and only
including a `prd` target in your profiles under rare circumstances.

@@ -1,52 +1,6 @@
# Cloud Infrastructure

The DSE team uses Terraform to manage cloud infrastructure.
Our stack includes:

* An [AWS Batch](https://aws.amazon.com/batch/) environment for running arbitrary containerized jobs
* A [Managed Workflows on Apache Airflow](https://aws.amazon.com/managed-workflows-for-apache-airflow/) environment for orchestrating jobs
* A VPC and subnets for the above
* An ECR repository for hosting Docker images storing code and libraries for jobs
* A bot user for running AWS operations in GitHub Actions
* An S3 scratch bucket

## Architecture

```mermaid
flowchart TD
subgraph AWS
J[GitHub CD\nbot user]
G[Artifact in S3]
subgraph VPC
subgraph Managed Airflow
K1[Scheduler]
K2[Worker]
K3[Webserver]
end
F[AWS Batch Job\n on Fargate]
end
E[AWS ECR Docker\nRepository]
end
subgraph GitHub
A[Code Repository]
end
E --> F
A -- Code quality check\n GitHub action --> A
A -- Job submission\nvia GitHub Action --> F
A -- Docker build \nGitHub Action --> E
A --> H[CalData\nadministrative\nuser]
H -- Terraform -----> AWS
K2 -- Job submission\nvia Airflow --> F
K1 <--> K2
K3 <--> K1
K3 <--> K2
F --> G
J -- Bot Credentials --> A
```

## Setup
# Terraform Setup

### Installation
## Installation

This project requires Terraform to run.
You might use a different package manager to install it depending on your system.
@@ -83,7 +37,7 @@ You can manually run the pre-commit checks using:
pre-commit run --all-files
```

### Bootstrapping remote state
## Bootstrapping remote state

When deploying a new version of your infrastructure, Terraform diffs the current state
against what you have specified in your infrastructure-as-code.
File renamed without changes.
100 changes: 0 additions & 100 deletions docs/dbt.md

This file was deleted.

4 changes: 2 additions & 2 deletions docs/dbt-performance.md → docs/dbt/dbt-performance.md
@@ -32,11 +32,11 @@ This is extremely useful during development in order to understand potential pro
### Getting model timing: dbt Cloud
dbt Cloud has a nicer interface for finding which models in a project are running longest. Visit the Deploy > Runs section of dbt Cloud. You'll see a full list of jobs and how long each one took. To drill down to the model timing level, click on a run name. You can expand the "Invoke dbt build" section under "Run Summary" to get a detailed summary of your run as well as timing for each model and test. There is also a "Debug logs" section for even more detail, including the exact queries run and an option to download the logs for easier viewing. Of course this is also where you go to find model and test errors and warnings!

![dbt Model Run Summary](images/dbt_run_summary.png)
![dbt Model Run Summary](../images/dbt_run_summary.png)

For a quick visual reference of which models take up the most time in a run, click on the "Model Timing" tab. If you hover over a model, you will see its specific timing.

![dbt Model Timing Graph](images/dbt_model_timing.png)
![dbt Model Timing Graph](../images/dbt_model_timing.png)

### Getting model timing: Snowflake
Snowflake has quite a lot of performance data readily available through its `information_schema.QUERY_HISTORY()` table function and several views in the Account Usage schema. This is useful not only for finding expensive queries regardless of their source, but also for all sorts of analytics on Snowflake usage, such as credit consumption.
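
As a quick sketch, a query like the following surfaces the slowest recent queries from that table function (the columns and `RESULT_LIMIT` argument are standard Snowflake; tune them to taste):

```sql
-- Sketch: surface the slowest queries from recent history.
SELECT
    query_id,
    query_text,
    warehouse_name,
    total_elapsed_time / 1000 AS elapsed_seconds
FROM TABLE(information_schema.QUERY_HISTORY(RESULT_LIMIT => 1000))
ORDER BY total_elapsed_time DESC
LIMIT 10;
```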
52 changes: 52 additions & 0 deletions docs/dbt/dbt.md
@@ -0,0 +1,52 @@
# dbt on the Data Services and Engineering team

## Architecture

We broadly follow the architecture described in
[this dbt blog post](https://www.getdbt.com/blog/how-we-configure-snowflake/)
for our Snowflake dbt project.

It is described in more detail in our [Snowflake docs](../infra/snowflake.md#architecture).

## Naming conventions

Models in a data warehouse do not follow the same naming conventions as [raw cloud resources](../learning/naming-conventions.md#general-approach),
as their most frequent consumers are analytics engineers and data analysts.

The following conventions are used where appropriate:

**Dimension tables** are prefixed with `dim_`.

**Fact tables** are prefixed with `fct_`.

**Staging tables** are prefixed with `stg_`.

**Intermediate tables** are prefixed with `int_`.

We may adopt additional conventions for denoting aggregations, column data types, etc. in the future.
If during the course of a project's model development we determine that simpler human-readable names
work better for our partners or downstream consumers, we may drop the above prefixing conventions.
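
As an illustration, a staging model following these conventions might look like the sketch below (the `benefits` source and its columns are hypothetical, not actual project models):

```sql
-- stg_benefits__applications.sql (hypothetical example)
-- Light cleanup of a raw source; downstream dim_/fct_ models would build on this.
select
    application_id,
    applicant_name,
    submitted_at::timestamp_ntz as submitted_at
from {{ source('benefits', 'applications') }}
```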

## Custom schema names

dbt's default method for generating [custom schema names](https://docs.getdbt.com/docs/build/custom-schemas)
works well for a single-database setup:

* It allows development work to occur in a separate schema from production models.
* It allows analytics engineers to develop side-by-side without stepping on each other's toes.

A downside of the default is that production models all get a prefix,
which may not be an ideal naming convention for end-users.

Because our architecture separates development and production databases,
and has strict permissions protecting the `RAW` database,
there is less danger of breaking production models.
So we use our own custom schema name following a modified
[approach from the GitLab Data Team](https://gitlab.com/gitlab-data/analytics/-/blob/master/transform/snowflake-dbt/macros/utils/override/generate_schema_name.sql).

In production, each schema is just the custom schema name without any prefix.
In non-production environments the default is used, where analytics engineers
get the custom schema name prefixed with their target schema name (i.e. `dbt_username_schemaname`),
and CI runs get the custom schema name prefixed with a CI job name.

This approach may be reevaluated as the project matures.
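
For illustration, a minimal sketch of a `generate_schema_name` override with this behavior (our actual macro may differ in its details):

```sql
-- Sketch only: approximates the schema-naming behavior described above.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {# No custom schema configured: fall back to the target schema #}
        {{ default_schema }}
    {%- elif target.name == 'prd' -%}
        {# Production: the custom schema name, with no prefix #}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {# Dev/CI: prefix with the target schema, e.g. dbt_username_schemaname #}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```
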
26 changes: 5 additions & 21 deletions docs/architecture.md → docs/infra/architecture.md
@@ -9,7 +9,7 @@ We follow an adapted version of the project architecture described in
for our Snowflake dbt project.

It is described in some detail in our
[Snowflake docs](https://cagov.github.io/data-infrastructure/snowflake/#architecture) as well.
[Snowflake docs](snowflake.md#architecture) as well.

```mermaid
flowchart TB
@@ -142,49 +142,33 @@ So we use our own custom schema name modified from the
In production, each schema is just the custom schema name without any prefix.
In non-production environments, dbt developers use their own custom schema based on their name: `dbt_username`.

## Developing against production data

Our Snowflake architecture allows for reasonably safe `SELECT`ing
from the production `RAW_PRD` database while developing models.
While this could be expensive for large tables,
it also allows for faster and more reliable model development.

To develop against raw production data, first you need someone with the `USERADMIN` role
to grant rights to the `TRANSFORMER_DEV` role
(this need only be done once, and can be revoked later):

```sql
USE ROLE USERADMIN;
GRANT ROLE RAW_PRD_READ TO ROLE TRANSFORMER_DEV;
```
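
The grant can later be reversed with a matching revoke (a sketch, using the same roles as above):

```sql
USE ROLE USERADMIN;
REVOKE ROLE RAW_PRD_READ FROM ROLE TRANSFORMER_DEV;
```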

## Examples

### User personae

To make the preceding more concrete, let's consider the six databases,
`RAW`, `TRANSFORM`, and `ANALYTICS`, for both `DEV` and `PRD`:

![databases](images/databases.png)
![databases](../images/databases.png)

If you are a developer, you are doing most of your work in `TRANSFORM_DEV`
and `ANALYTICS_DEV`, assuming the role `TRANSFORMER_DEV`.
*However*, you also have the ability to select the production data from `RAW_PRD` for your development.
So your data access looks like the following:

![developer](images/developer.png)
![developer](../images/developer.png)

Now let's consider the nightly production build. This service account builds the production models
in `TRANSFORM_PRD` and `ANALYTICS_PRD` based on the raw data in `RAW_PRD`.
The development environment effectively doesn't exist to this account, and data access looks like the following:

![nightly](images/nightly.png)
![nightly](../images/nightly.png)

Finally, let's consider an external consumer of a mart from PowerBI.
This user has no access to any of the raw or intermediate models (which might contain sensitive data!).
To them, the whole rest of the architecture doesn't exist, and they can only see the marts in `ANALYTICS_PRD`:

![consumer](images/consumers.png)
![consumer](../images/consumers.png)
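
As a sketch of how such narrowly scoped access might be granted (the `REPORTER` role below is hypothetical, not our actual setup):

```sql
-- Hypothetical sketch: a reporting role that can only read production marts.
USE ROLE SECURITYADMIN;
GRANT USAGE ON DATABASE ANALYTICS_PRD TO ROLE REPORTER;
GRANT USAGE ON ALL SCHEMAS IN DATABASE ANALYTICS_PRD TO ROLE REPORTER;
GRANT SELECT ON ALL TABLES IN DATABASE ANALYTICS_PRD TO ROLE REPORTER;
```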

### Scenario: adding a new data source

45 changes: 45 additions & 0 deletions docs/infra/cloud-infrastructure.md
@@ -0,0 +1,45 @@
# Cloud Infrastructure

The DSE team [uses Terraform](../code/terraform-local-setup.md) to manage cloud infrastructure.
Our stack includes:

* An [AWS Batch](https://aws.amazon.com/batch/) environment for running arbitrary containerized jobs
* A [Managed Workflows on Apache Airflow](https://aws.amazon.com/managed-workflows-for-apache-airflow/) environment for orchestrating jobs
* A VPC and subnets for the above
* An ECR repository for hosting Docker images storing code and libraries for jobs
* A bot user for running AWS operations in GitHub Actions
* An S3 scratch bucket

## Architecture

```mermaid
flowchart TD
subgraph AWS
J[GitHub CD\nbot user]
G[Artifact in S3]
subgraph VPC
subgraph Managed Airflow
K1[Scheduler]
K2[Worker]
K3[Webserver]
end
F[AWS Batch Job\n on Fargate]
end
E[AWS ECR Docker\nRepository]
end
subgraph GitHub
A[Code Repository]
end
E --> F
A -- Code quality check\n GitHub action --> A
A -- Job submission\nvia GitHub Action --> F
A -- Docker build \nGitHub Action --> E
A --> H[CalData\nadministrative\nuser]
H -- Terraform -----> AWS
K2 -- Job submission\nvia Airflow --> F
K1 <--> K2
K3 <--> K1
K3 <--> K2
F --> G
J -- Bot Credentials --> A
```
