# dbt on the Data Services and Engineering team

## Architecture

We broadly follow the architecture described in
[this dbt blog post](https://www.getdbt.com/blog/how-we-configure-snowflake/)
for our Snowflake dbt project.

It is described in more detail in our [Snowflake docs](../infra/snowflake.md#architecture).

## Naming conventions

Models in a data warehouse do not follow the same naming conventions as [raw cloud resources](../learning/naming-conventions.md#general-approach),
as their most frequent consumers are analytics engineers and data analysts.

The following conventions are used where appropriate:

**Dimension tables** are prefixed with `dim_`.

**Fact tables** are prefixed with `fct_`.

**Staging tables** are prefixed with `stg_`.

**Intermediate tables** are prefixed with `int_`.

We may adopt additional conventions for denoting aggregations, column data types, etc. in the future.
If, during the course of a project's model development, we determine that simpler human-readable names
work better for our partners or downstream consumers, we may drop the above prefixing conventions.

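For illustration, a hypothetical layout for a project's `models/` directory using these prefixes (all model and folder names below are invented):

```
models/
├── staging/
│   ├── stg_entities.sql
│   └── stg_budgets.sql
├── intermediate/
│   └── int_entities_joined_to_budgets.sql
└── marts/
    ├── dim_departments.sql
    └── fct_budget_line_items.sql
```
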
## Custom schema names

dbt's default method for generating [custom schema names](https://docs.getdbt.com/docs/build/custom-schemas)
works well for a single-database setup:

* It allows development work to occur in a separate schema from production models.
* It allows analytics engineers to develop side by side without stepping on each other's toes.

A downside of the default is that production models all get a prefix,
which may not be an ideal naming convention for end-users.

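For reference, the default behavior (documented at the dbt link above) concatenates the target schema with the custom schema name, roughly:

```sql
-- Sketch of dbt's documented default generate_schema_name macro:
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```

So a model configured with `schema: marts` builds into `dbt_username_marts` for a developer, and into a prefixed schema like `analytics_marts` in production (assuming a production target schema named `analytics`).
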
Because our architecture separates development and production databases,
and has strict permissions protecting the `RAW` database,
there is less danger of breaking production models.
So we generate custom schema names following a modified
[approach from the GitLab Data Team](https://gitlab.com/gitlab-data/analytics/-/blob/master/transform/snowflake-dbt/macros/utils/override/generate_schema_name.sql).

In production, each schema is just the custom schema name without any prefix.
In non-production environments, the default behavior is used: analytics engineers
get the custom schema name prefixed with their target schema name (e.g. `dbt_username_schemaname`),
and CI runs get the custom schema name prefixed with a CI job name.

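A minimal sketch of such an override, assuming a production target named `prd` (the target name and the file path are assumptions, not taken from this document):

```sql
-- macros/generate_schema_name.sql (hypothetical path)
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- elif target.name == "prd" -%}
        {# Production: the bare custom schema name, no prefix. #}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {# Dev and CI: keep the default prefixed behavior. #}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```

Under this sketch, a model with custom schema `marts` builds into `dbt_jsmith_marts` for a developer whose target schema is `dbt_jsmith`, and into just `marts` in production.
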
This approach may be reevaluated as the project matures.

# Cloud Infrastructure

The DSE team [uses Terraform](../code/terraform-local-setup.md) to manage cloud infrastructure.
Our stack includes:

* An [AWS Batch](https://aws.amazon.com/batch/) environment for running arbitrary containerized jobs
* A [Managed Workflows for Apache Airflow](https://aws.amazon.com/managed-workflows-for-apache-airflow/) environment for orchestrating jobs
* A VPC and subnets for the above
* An ECR repository for hosting Docker images that package code and libraries for jobs
* A bot user for running AWS operations in GitHub Actions
* An S3 scratch bucket

## Architecture

```mermaid
flowchart TD
  subgraph AWS
    J[GitHub CD\nbot user]
    G[Artifact in S3]
    subgraph VPC
      subgraph Managed Airflow
        K1[Scheduler]
        K2[Worker]
        K3[Webserver]
      end
      F[AWS Batch Job\non Fargate]
    end
    E[AWS ECR Docker\nRepository]
  end
  subgraph GitHub
    A[Code Repository]
  end
  E --> F
  A -- Code quality check\nGitHub Action --> A
  A -- Job submission\nvia GitHub Action --> F
  A -- Docker build\nGitHub Action --> E
  A --> H[CalData\nadministrative\nuser]
  H -- Terraform -----> AWS
  K2 -- Job submission\nvia Airflow --> F
  K1 <--> K2
  K3 <--> K1
  K3 <--> K2
  F --> G
  J -- Bot Credentials --> A
```