reorganized documentation
JamesSLogan committed Oct 1, 2024
1 parent add3e10 commit dbc22f5
Showing 23 changed files with 532 additions and 547 deletions.
File renamed without changes.
6 changes: 3 additions & 3 deletions docs/codespaces.md → docs/code/codespaces.md
@@ -11,14 +11,14 @@ Go to the "Code" dropdown from the main repository page,
select the three dot dropdown, and select "New with options..."
This will allow more configuration than the default codespace.

![Create a new codespace](images/create-new-codespace.png)
![Create a new codespace](../images/create-new-codespace.png)

In the codespace configuration form, you will have an option to add "Recommended Secrets".
This is where you can add your personal Snowflake credentials to your codespace,
allowing for development against our Snowflake warehouse, including using dbt.
You should only add credentials for accounts that are protected by multi-factor authentication (MFA).

![Add codespace secrets](images/codespace-secrets.png)
![Add codespace secrets](../images/codespace-secrets.png)

After you have added your secrets, click "Create Codespace".
Building it may take a few minutes,
@@ -30,7 +30,7 @@ Once your codespace is created, you should be able to launch it
without re-creating it every time using the "Code" dropdown,
going to "Open in...", and selecting "Open in browser":

![Launch codespace](images/launch-codespace.png)
![Launch codespace](../images/launch-codespace.png)

## Using a Codespace

2 changes: 1 addition & 1 deletion docs/local-setup.md → docs/code/local-setup.md
@@ -179,7 +179,7 @@ dse_snowflake:
!!! note
The target name (`dev`) in the above example can be anything.
However, we treat targets named `prd` differently in generating
custom dbt schema names (see [here](./dbt.md#custom-schema-names)).
custom dbt schema names (see [here](../dbt/dbt.md#custom-schema-names)).
We recommend naming your local development target `dev`, and only
include a `prd` target in your profiles under rare circumstances.

@@ -1,52 +1,6 @@
# Cloud Infrastructure

The DSE team uses Terraform to manage cloud infrastructure.
Our stack includes:

* An [AWS Batch](https://aws.amazon.com/batch/) environment for running arbitrary containerized jobs
* A [Managed Workflows on Apache Airflow](https://aws.amazon.com/managed-workflows-for-apache-airflow/) environment for orchestrating jobs
* A VPC and subnets for the above
* An ECR repository for hosting Docker images storing code and libraries for jobs
* A bot user for running AWS operations in GitHub Actions
* An S3 scratch bucket

## Architecture

```mermaid
flowchart TD
subgraph AWS
J[GitHub CD\nbot user]
G[Artifact in S3]
subgraph VPC
subgraph Managed Airflow
K1[Scheduler]
K2[Worker]
K3[Webserver]
end
F[AWS Batch Job\n on Fargate]
end
E[AWS ECR Docker\nRepository]
end
subgraph GitHub
A[Code Repository]
end
E --> F
A -- Code quality check\n GitHub action --> A
A -- Job submission\nvia GitHub Action --> F
A -- Docker build \nGitHub Action --> E
A --> H[CalData\nadministrative\nuser]
H -- Terraform -----> AWS
K2 -- Job submission\nvia Airflow --> F
K1 <--> K2
K3 <--> K1
K3 <--> K2
F --> G
J -- Bot Credentials --> A
```

## Setup
# Terraform Setup

### Installation
## Installation

This project requires Terraform to run.
You might use a different package manager to install it depending on your system.
@@ -83,7 +37,7 @@ You can manually run the pre-commit checks using:
pre-commit run --all-files
```

### Bootstrapping remote state
## Bootstrapping remote state

When deploying a new version of your infrastructure, Terraform diffs the current state
against what you have specified in your infrastructure-as-code.
File renamed without changes.
100 changes: 0 additions & 100 deletions docs/dbt.md

This file was deleted.

4 changes: 2 additions & 2 deletions docs/dbt-performance.md → docs/dbt/dbt-performance.md
@@ -32,11 +32,11 @@ This is extremely useful during development in order to understand potential pro
### Getting model timing: dbt Cloud
dbt Cloud has a nicer interface for finding which models in a project run longest. Visit the Deploy > Runs section of dbt Cloud to see a full list of jobs and how long each one took. To drill down to the model timing level, click on a run name. You can expand the "Invoke dbt build" section under "Run Summary" to get a detailed summary of your run as well as timing for each model and test. There is also a "Debug logs" section for even more detail, including the exact queries run and an option to download the logs for easier viewing. Of course, this is also where you go to find model and test errors and warnings!

![dbt Model Run Summary](images/dbt_run_summary.png)
![dbt Model Run Summary](../images/dbt_run_summary.png)

For a quick visual reference of which models take up the most time in a run, click on the "Model Timing" tab. If you hover over a model you will be shown the specific timing.

![dbt Model Timing Graph](images/dbt_model_timing.png)
![dbt Model Timing Graph](../images/dbt_model_timing.png)

### Getting model timing: Snowflake
Snowflake has quite a lot of performance data readily available through its `information_schema.QUERY_HISTORY()` table function and several views in the Account Usage schema. This is great not only for finding expensive queries regardless of source, but also for all sorts of analytics on Snowflake usage, such as credit consumption.
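
For reference, a hedged sketch of pulling the slowest recent queries with the `information_schema.QUERY_HISTORY()` table function might look like the following (the user name filter and result limit are hypothetical; adjust for your own account):

```sql
-- Find the slowest recent queries issued by a (hypothetical) dbt service user.
-- total_elapsed_time is reported in milliseconds.
SELECT
    query_id,
    user_name,
    warehouse_name,
    total_elapsed_time / 1000 AS elapsed_seconds,
    query_text
FROM TABLE(information_schema.query_history(result_limit => 1000))
WHERE user_name = 'DBT_SERVICE_USER'  -- hypothetical service account
ORDER BY total_elapsed_time DESC
LIMIT 20;
```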
52 changes: 52 additions & 0 deletions docs/dbt/dbt.md
@@ -0,0 +1,52 @@
# dbt on the Data Services and Engineering team

## Architecture

We broadly follow the architecture described in
[this dbt blog post](https://www.getdbt.com/blog/how-we-configure-snowflake/)
for our Snowflake dbt project.

It is described in more detail in our [Snowflake docs](../infra/snowflake.md#architecture).

## Naming conventions

Models in a data warehouse do not follow the same naming conventions as [raw cloud resources](../learning/naming-conventions.md#general-approach),
as their most frequent consumers are analytics engineers and data analysts.

The following conventions are used where appropriate:

**Dimension tables** are prefixed with `dim_`.

**Fact tables** are prefixed with `fct_`.

**Staging tables** are prefixed with `stg_`.

**Intermediate tables** are prefixed with `int_`.

We may adopt additional conventions for denoting aggregations, column data types, etc. in the future.
If during the course of a project's model development we determine that simpler human-readable names
work better for our partners or downstream consumers, we may drop the above prefixing conventions.
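
As a purely hypothetical illustration of these prefixes, a dimension model might be built on top of a staging model roughly like so (the model and column names below are invented):

```sql
-- models/marts/dim_department.sql (hypothetical)
-- A dimension model selecting from a staging model, following the prefixes above.
SELECT
    department_id,
    department_name,
    agency_code
FROM {{ ref('stg_budget__departments') }}
```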

## Custom schema names

dbt's default method for generating [custom schema names](https://docs.getdbt.com/docs/build/custom-schemas)
works well for a single-database setup:

* It allows development work to occur in a separate schema from production models.
* It allows analytics engineers to develop side-by-side without stepping on each other's toes.

A downside of the default is that production models all get a prefix,
which may not be an ideal naming convention for end-users.

Because our architecture separates development and production databases,
and has strict permissions protecting the `RAW` database,
there is less danger of breaking production models.
So we use our own custom schema name following a modified
[approach from the GitLab Data Team](https://gitlab.com/gitlab-data/analytics/-/blob/master/transform/snowflake-dbt/macros/utils/override/generate_schema_name.sql).

In production, each schema is just the custom schema name without any prefix.
In non-production environments, the default is used: analytics engineers
get the custom schema name prefixed with their target schema name (e.g. `dbt_username_schemaname`),
and CI runs get the custom schema name prefixed with a CI job name.

This approach may be reevaluated as the project matures.
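
For illustration only, a minimal sketch of such an override (placed in a file like `macros/generate_schema_name.sql`) could look like the following; the real macro in this repository may differ, and the `prd` target name follows the convention described in the local setup docs:

```sql
{# A hedged sketch of a custom generate_schema_name macro, not the project's exact implementation. #}
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {# No custom schema configured: fall back to the target schema. #}
        {{ target.schema }}
    {%- elif target.name == "prd" -%}
        {# Production: use the custom schema name with no prefix. #}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {# Dev and CI: prefix with the target schema, e.g. dbt_username_schemaname. #}
        {{ target.schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```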
26 changes: 5 additions & 21 deletions docs/architecture.md → docs/infra/architecture.md
@@ -9,7 +9,7 @@ We follow an adapted version of the project architecture described in
for our Snowflake dbt project.

It is described in some detail in our
[Snowflake docs](https://cagov.github.io/data-infrastructure/snowflake/#architecture) as well.
[Snowflake docs](snowflake.md#architecture) as well.

```mermaid
flowchart TB
@@ -142,49 +142,33 @@ So we use our own custom schema name modified from the
In production, each schema is just the custom schema name without any prefix.
In non-production environments, dbt developers use their own custom schema based on their name: `dbt_username`.

## Developing against production data

Our Snowflake architecture allows for reasonably safe `SELECT`ing
from the production `RAW_PRD` database while developing models.
While this could be expensive for large tables,
it also allows for faster and more reliable model development.

To develop against raw production data, first you need someone with the `USERADMIN` role
to grant rights to the `TRANSFORMER_DEV` role
(this need only be done once, and can be revoked later):

```sql
USE ROLE USERADMIN;
GRANT ROLE RAW_PRD_READ TO ROLE TRANSFORMER_DEV;
```

## Examples

### User personae

To make the preceding more concrete, let's consider the six databases,
`RAW`, `TRANSFORM`, and `ANALYTICS`, for both `DEV` and `PRD`:

![databases](images/databases.png)
![databases](../images/databases.png)

If you are a developer, you are doing most of your work in `TRANSFORM_DEV`
and `ANALYTICS_DEV`, assuming the role `TRANSFORMER_DEV`.
*However*, you also have the ability to select the production data from `RAW_PRD` for your development.
So your data access looks like the following:

![developer](images/developer.png)
![developer](../images/developer.png)
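
For instance, a developer session might look roughly like this (the warehouse, schema, and table names here are hypothetical):

```sql
-- Assume the developer role (warehouse name is hypothetical).
USE ROLE TRANSFORMER_DEV;
USE WAREHOUSE TRANSFORMING_DEV;

-- Read production raw data...
SELECT COUNT(*) FROM RAW_PRD.EXAMPLE_SOURCE.EXAMPLE_TABLE;

-- ...while writing development models to the dev databases.
CREATE OR REPLACE TABLE TRANSFORM_DEV.DBT_JDOE.STG_EXAMPLE AS
SELECT * FROM RAW_PRD.EXAMPLE_SOURCE.EXAMPLE_TABLE;
```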

Now let's consider the nightly production build. This service account builds the production models
in `TRANSFORM_PRD` and `ANALYTICS_PRD` based on the raw data in `RAW_PRD`.
The development environment effectively doesn't exist to this account, and data access looks like the following:

![nightly](images/nightly.png)
![nightly](../images/nightly.png)

Finally, let's consider an external consumer of a mart from PowerBI.
This user has no access to any of the raw or intermediate models (which might contain sensitive data!).
To them, the whole rest of the architecture doesn't exist, and they can only see the marts in `ANALYTICS_PRD`:

![consumer](images/consumers.png)
![consumer](../images/consumers.png)

### Scenario: adding a new data source

45 changes: 45 additions & 0 deletions docs/infra/cloud-infrastructure.md
@@ -0,0 +1,45 @@
# Cloud Infrastructure

The DSE team [uses Terraform](../code/terraform-local-setup.md) to manage cloud infrastructure.
Our stack includes:

* An [AWS Batch](https://aws.amazon.com/batch/) environment for running arbitrary containerized jobs
* A [Managed Workflows on Apache Airflow](https://aws.amazon.com/managed-workflows-for-apache-airflow/) environment for orchestrating jobs
* A VPC and subnets for the above
* An ECR repository for hosting Docker images storing code and libraries for jobs
* A bot user for running AWS operations in GitHub Actions
* An S3 scratch bucket

## Architecture

```mermaid
flowchart TD
subgraph AWS
J[GitHub CD\nbot user]
G[Artifact in S3]
subgraph VPC
subgraph Managed Airflow
K1[Scheduler]
K2[Worker]
K3[Webserver]
end
F[AWS Batch Job\n on Fargate]
end
E[AWS ECR Docker\nRepository]
end
subgraph GitHub
A[Code Repository]
end
E --> F
A -- Code quality check\n GitHub action --> A
A -- Job submission\nvia GitHub Action --> F
A -- Docker build \nGitHub Action --> E
A --> H[CalData\nadministrative\nuser]
H -- Terraform -----> AWS
K2 -- Job submission\nvia Airflow --> F
K1 <--> K2
K3 <--> K1
K3 <--> K2
F --> G
J -- Bot Credentials --> A
```