Solution to the Superside challenge for the Lead Data Engineer role.

This is the structure of the project:
```
.
├── .dockerignore
├── .env.dbt.local.example
├── .env.services.local.example
├── .github
│   └── workflows
│       └── pull-request.workflow.yaml
├── .gitignore
├── .markdownlint.json
├── .pre-commit-config.yaml
├── .python-version
├── .sqlfluffignore
├── .vscode
│   ├── extensions.json
│   └── settings.json
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── dags
│   ├── .airflowignore
│   ├── settings.py
│   └── transformations.py
├── dbt
│   └── superside
│       ├── .sqlfluff
│       ├── README.md
│       ├── analyses
│       │   └── .gitkeep
│       ├── dbt_project.yml
│       ├── logs
│       ├── macros
│       │   ├── .gitkeep
│       │   ├── generate_schema_name.sql
│       │   ├── intermediate
│       │   │   ├── extract_number.sql
│       │   │   └── map_categories.sql
│       │   └── macros.yml
│       ├── models
│       │   ├── intermediate
│       │   │   ├── _intermediate__models.yml
│       │   │   └── int_engagement_metrics.sql
│       │   ├── marts
│       │   │   ├── _marts__models.yml
│       │   │   ├── fct_engagement_metrics.sql
│       │   │   └── project_engagement.sql
│       │   └── staging
│       │       ├── _clients__models.yml
│       │       ├── _clients__sources.yml
│       │       └── stg_clients__engagement_metrics.sql
│       ├── packages.yml
│       ├── profiles.yml
│       ├── seeds
│       │   ├── .gitkeep
│       │   ├── marts
│       │   │   └── dim_project.csv
│       │   └── seeds.yml
│       ├── snapshots
│       │   └── .gitkeep
│       └── tests
│           └── .gitkeep
├── dbt_packages
├── diagrams
│   ├── airbyte.png
│   ├── diagram.py
│   └── kubernetes.png
├── docker-compose.yml
├── images
│   ├── airbyte_ui.png
│   ├── cluster.png
│   ├── make.png
│   ├── transformations.png
│   ├── uk.png
│   └── workspace.png
├── infra
│   ├── .terraform.lock.hcl
│   ├── airbyte-values.yml
│   ├── airbyte.tf
│   ├── airflow-values.yml
│   ├── airflow.tf
│   ├── providers.tf
│   └── variables.tf
├── mypy.ini
├── noxfile.py
├── poetry.lock
├── pyproject.toml
├── registries.yaml
├── scripts
│   ├── clients_postgres_init.sh
│   └── warehouse_postgres_init.sh
└── source_data
    └── engagement_metrics_raw.csv

26 directories, 68 files
```
📙 It's recommended to take your time and read this whole guide before running any command
This solution runs in a local Kubernetes cluster, so it is fully containerized. You'll need the following mandatory tools on your local machine:
- k3d for the local k8s cluster
- kubectl to manage the k8s cluster through the CLI
- Docker and docker-compose
  - Beware that you'll need around 10 GB of RAM available to allocate (this can be adjusted in Docker Desktop's resource settings)
- Terraform
- GNU Make
- poetry to handle Python dependencies
  - There's a useful make rule for this one, so you can skip its installation
Depending on your OS, the installation process will differ. If you're on macOS, run:
brew install k3d kubectl docker docker-compose tfenv
tfenv install 1.5.6
tfenv use 1.5.6
There are other optional dependencies:
- Lens to easily manage the k8s cluster
- DBeaver as a desktop SQL client
- The recommended VS Code extensions
The selected data stack is as follows:

- Airbyte for data movement
  - The Airbyte metadata DB is external to the cluster
- Airflow for workflow orchestration
  - The Airflow cluster is deployed with the `CeleryExecutor` and a Redis database working as a message broker between Celery and the workers
  - The Airflow metadata DB is external to the cluster
- dbt for data modeling
  - The Airflow and dbt integration was made through cosmos
- Postgres for data storage
  - This DB was selected just for simplicity
- A local registry for storing and distributing container images
Airbyte and Airflow are installed in the Kubernetes cluster via Helm, through its Terraform provider. The Airbyte source, destination, and connection are also managed by Terraform.
This is a simplified diagram of the architecture:
Each platform runs in its own node and namespace. These nodes are labeled with the `component: [platform]` label, where `platform` can be either `airbyte` or `airflow`. Then, the `nodeSelector` property is set to `component: [platform]` in each platform's values file.
Both platforms run their jobs in ephemeral pods, which are scheduled on a third node with the label `component: jobs`. This is convenient for these reasons:

- If using a node provisioner like Karpenter, this architecture allows you to provision ephemeral nodes just to run these workloads and then remove them, saving costs.
- As the pods run in isolated environments, any kind of disruption won't affect the other platform's components.
- The nodes and their pods' resources, requests, and limits can be managed separately.
- The ephemeral pods' resources can be modified through Airflow variables, as I've used the `KubernetesPodOperator` in the `transformations` DAG, which makes them easier to manage.
Moreover, the databases were deployed as external services (outside the cluster) in order to ensure the statelessness of the cluster and seamlessly persist the platforms' states.
The data flow is as follows (the provided raw data is in the `source_data` directory):

- The raw `engagement_metrics_raw.csv` data is loaded into the `clients` DB through the `scripts/clients_postgres_init.sh` script. This DB is considered a source.
  - This was done to better emulate a production environment and to allow me to use Airbyte; otherwise, I would have needed the Google Sheets credentials (which I don't have) to sync the data directly from the sheet.
  - Another option could have been a Python script running in an Airflow DAG that loads the data into the DB, but I considered Airbyte a much more robust, resilient, and practical option.
- Once Airbyte runs its sync, the raw data is moved to the `warehouse` DB, which is the destination. You'll find the data in the `clients.engagement_metrics` landing table.
- Then, Airflow triggers the dbt transformations and tests, and the models are materialized in the `warehouse` DB, in separate schemas:
  - `staging`: materialized as a view, where simple casting and renaming is done; it has a 1-to-1 relation with the landing table (see the sketch after this list).
  - `intermediate`: materialized as a view, where more complex transformations are done to normalize and prepare data for downstream consumption.
  - `marts`: materialized as a table, where the `dim_project.csv` data is loaded as a seed and then joined with the `fct_engagement_metrics` table in a model named `project_engagement`.
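As a reference, here is a minimal sketch of the staging pattern described above. It is illustrative only, not the actual `stg_clients__engagement_metrics.sql` model: the column list is a subset, the casts are assumptions, and the source name is inferred from the file naming convention.

```sql
-- Illustrative staging model: 1-to-1 with the landing table,
-- doing only renaming and light type casting.
with source as (

    select * from {{ source('clients', 'engagement_metrics') }}

),

renamed as (

    select
        cast(project_id as varchar) as project_id,
        cast(engagement_id as varchar) as engagement_id,
        customer_name,
        service,
        sub_service,
        engagement_date
    from source

)

select * from renamed
```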
Please go ahead and check the red arrows in the architecture diagram.
Before you begin, please disconnect any VPN you might be connected to, if any.
Please take into account that the whole process will take at least 30 minutes, depending on whether you have previously pulled the Docker images and on your internet connectivity.
Open a shell in your machine, and navigate to this directory. Then run:
make generate-dotenv
This will generate two `.env` files with predefined values. Please go ahead and open them! If you want to modify some values, just take into account that this may break things.

If you want to run any dbt command without the `--target prod` option, please fill in the `DBT_SCHEMA` variable in the `.env.dbt.local` file.
Run these commands in this sequence (beware if you already have `poetry` installed on your machine, as version 1.6.1 will be installed by default):
make install-poetry
make install-project
make dbt-install-pkgs
Optionally, if you've cloned the repo, you can run:
make install-pre-commit
This installs the pre-commit hooks so you can play around with them.
⚠️ Remember to assign the 10 GB of RAM in Docker Desktop.
Run this command and wait a while for the cluster to be ready:
make cluster-create
You can monitor its state with Lens, or check the pods' status from the terminal:
watch -d kubectl get pods -A
Wait until they're in the `Running` state.
This command also creates some useful services; you can verify they're running either with Docker Desktop or by running:
docker ps
Each time you run any cluster-related command through a make rule, the current context is switched to the local cluster to avoid conflicts with other contexts you may have in your `~/.kube/config` file.
To deploy Airbyte and Airflow in the cluster, run:
make cluster-install-apps
This will take a while (10-15 minutes, depending on your machine), but you can monitor the state the same way you did before. Please wait until this process finishes before continuing.
Go ahead and port-forward the following services to these local ports (verify that you aren't using them already):
- Airbyte web server: 8085
- Airbyte API server: 8001
- Airflow web server: 8090
You can do this manually with Lens, or by running:
make cluster-port-forward
Verify that you can access the web servers by going to http://localhost:8085 and http://localhost:8090.
Then, please complete Airbyte's initial setup.

Once done, copy the workspace id from the Airbyte UI, or fetch it by running:
curl -u airbyte:airbyte http://localhost:8001/api/public/v1/workspaces
Optionally, go to Settings and upgrade all the source and destination connectors' versions.
Then, fill in the `workspace_id` variable in the `infra/variables.tf` file and run:
make cluster-setup-airbyte
This will trigger some Airbyte jobs that will run in some pods, so it will take a while to complete (around 5 minutes).
Once finished, go to Airbyte's Connections page and you'll see a new connection named `Clients`. Please trigger a sync manually and wait until it finishes (around 5-10 minutes). You can trigger the sync as many times as you want, since the sync mode is Full Refresh | Overwrite.
As the dbt models will run in ephemeral pods via the `KubernetesPodOperator`, you'll need to provide an image for the containers. To build it, please run:
make cluster-local-image
💡 Any time you change something in a dbt model, you need to re-run this command to update the image.
Go to http://localhost:8090 and log in with the default credentials `airflow:airflow`.

Then, unpause the `transformations` DAG. You should see the dbt models and tests running in the ephemeral pods (scheduled on the node with the label `component=jobs`). Please check this with Lens, or by running:
watch -d kubectl get pods \
  -n airflow \
  --field-selector spec.nodeName=$(
    kubectl get nodes \
      -l component=jobs \
      -o jsonpath='{.items[0].metadata.name}' \
  )
Then, wait around 3 minutes until the models and tests run. The DAG looks as follows:
Open your SQL client and connect to the warehouse. These are the credentials:

- User: `warehouse`
- Password: `warehouse`
- Database: `warehouse`
- Host: `localhost`
- Port: `5470`
Then run the following query to fetch the results of the joined data:
SELECT * FROM marts.project_engagement;
Please go ahead and check the tables and views in the other schemas as well.
There are 3 more databases:

- Database: Clients
  - User: `clients`
  - Password: `clients`
  - Host: `localhost`
  - Port: `5460`
- Database: Airflow Metadata
  - User: `airflow`
  - Password: `airflow`
  - Host: `localhost`
  - Port: `5450`
- Database: Airbyte Metadata
  - User: `airbyte`
  - Password: `airbyte`
  - Host: `localhost`
  - Port: `5440`
Please run `make help` for more information.
The following command will stop both the cluster and the external services:
make cluster-stop
If you want to destroy these resources instead, you can directly run:
make cluster-delete
If you want to uninstall the deployed platforms, please run:
make cluster-uninstall-apps
Before starting to build the models, I explored the raw data to understand its nature and its relation to `dim_project` (which I verified is in a clean format, ready to be consumed as is). To do this, I loaded the raw data into a source table in the `Clients` DB, the `engagement_metrics` table. Some findings from this analysis are described in the following section.
If you want to check the data but have limited resources on your machine, you don't need to keep the k8s cluster up and running for this. You can stop it with:
make cluster-stop
And then start just the external services (databases and registry) with:
make up
This will set up a much lighter environment that you can use for data exploration.
I realized that the table's unique key is `(project_id, engagement_id)`, and that there are 2 cases where this key is repeated.

I checked these 2 cases with this query:
SELECT *
FROM engagement_metrics
WHERE project_id IN ('0bf50700-a93a-4e7d-8a04-30a6116acbeb', 'e34525ca-b73d-41b0-8e83-4ba9e983555d')
AND engagement_id IN ('035ae529-0ace-4a6b-b0bf-95c85ff5af03', '11089782-22e7-43fd-8ace-221538ea812a')
And noticed that:

- The row with `customer_name = 'Customer_305'` is duplicated, since all its columns have the same values.
- The rows with `customer_name = 'Customer_561'` have the same values in all columns, except for `service`, which differs between `Consulting` and `Design`, and `sub_service`, which differs between `Strateby` and `Frontend`.
Because of this, I've decided to deduplicate with the `ROW_NUMBER()` window function, since both cases look like duplicates to me.
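A minimal sketch of this deduplication pattern, assuming the key above as the partition; the actual logic lives in `int_engagement_metrics.sql` and may order the rows differently:

```sql
-- Illustrative deduplication: keep a single row per (project_id, engagement_id).
-- The ordering column is an assumption made for this sketch.
with ranked as (

    select
        *,
        row_number() over (
            partition by project_id, engagement_id
            order by engagement_date
        ) as row_num
    from {{ ref('stg_clients__engagement_metrics') }}

)

select * from ranked
where row_num = 1
```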
All these transformations happen in the intermediate layer.
After further exploring the data, I noticed that the dates come in various formats, which is another issue that needed to be fixed.
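As a hedged illustration of how such mixed formats can be normalized in Postgres (the formats below are assumptions for the sketch, not an inventory of what actually appears in the raw data):

```sql
-- Hypothetical date normalization: choose a parsing pattern based on the string's shape.
select
    case
        when engagement_date ~ '^\d{4}-\d{2}-\d{2}$' then to_date(engagement_date, 'YYYY-MM-DD')
        when engagement_date ~ '^\d{2}/\d{2}/\d{4}$' then to_date(engagement_date, 'DD/MM/YYYY')
        when engagement_date ~ '^[A-Za-z]+ \d{1,2}, \d{4}$' then to_date(engagement_date, 'FMMonth DD, YYYY')
    end as engagement_date
from {{ ref('stg_clients__engagement_metrics') }}
```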
Something similar happened with the clients' names: they had typos and inconsistencies, which were also fixed.
On the other hand, the `employee_count` column contained numbers, but in 2 cases it contained the words `fifty` and `hundred`, so these were replaced by their numeric equivalents, since I thought these values would be more useful as integers rather than `varchar`.
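A hedged sketch of that replacement; the real logic is handled by the project's macros, and this simplified version only covers the two cases mentioned above:

```sql
-- Illustrative conversion of employee_count to an integer,
-- mapping the two spelled-out values found in the data.
select
    case
        when lower(employee_count) = 'fifty' then 50
        when lower(employee_count) = 'hundred' then 100
        else cast(employee_count as integer)
    end as employee_count
from {{ ref('stg_clients__engagement_metrics') }}
```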
All the columns associated with monetary values had the same characteristics: in addition to the number, they contained the currency symbol and, in some cases, the suffix `k`. I assumed that:

- All currencies were the same
- The suffix `k` meant 10³ (i.e., thousands)
- These values would need to be converted to integers

So I adjusted this data accordingly.
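A minimal sketch of that normalization under the assumptions above. The column name `deal_value` is hypothetical, and the actual implementation is encapsulated in the project's macros:

```sql
-- Hypothetical conversion of values like '$12.5k' or '$3500' into integers:
-- strip everything but digits and dots, then scale by 1000 when the value ends in 'k'.
select
    cast(
        round(
            cast(regexp_replace(deal_value, '[^0-9.]', '', 'g') as numeric)
            * case when lower(deal_value) like '%k' then 1000 else 1 end
        )
        as integer
    ) as deal_value
from {{ ref('stg_clients__engagement_metrics') }}
```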
The most interesting part was that almost all columns seem to contain categorical values, but many of them have spelling errors at different positions in the string. To fix this issue more cleanly, I've used the Levenshtein distance.

As dbt doesn't provide an out-of-the-box package or method for this, I've installed the fuzzystrmatch extension in the `warehouse`.

After some tests, I found that all the misspelled categories were within a Levenshtein distance of less than 2 from their correct categories. You can find all these categories in the dbt project-level variables declared in the `dbt_project.yml` file.
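A hedged sketch of the mapping idea; the actual implementation lives in the `map_categories` macro and reads the accepted categories from the project variables, while this version hard-codes two of the `service` values mentioned earlier:

```sql
-- Illustrative category cleanup: map each raw value to the accepted category
-- that is within a Levenshtein distance of less than 2.
select
    raw.service as raw_service,
    accepted.category as clean_service
from {{ ref('stg_clients__engagement_metrics') }} as raw
left join (values ('Consulting'), ('Design')) as accepted (category)
    on levenshtein(lower(raw.service), lower(accepted.category)) < 2
```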
You can check the results of this process in the `warehouse.marts.project_engagement` table, for example with a query like the one below (analogous for the other columns):
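This is an illustrative query using the `service` column (assuming it is carried through to the mart; any other categorical column works the same way):

```sql
-- After the cleanup, only the accepted, correctly spelled categories should remain.
SELECT DISTINCT service
FROM marts.project_engagement;
```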
All this logic related to data normalization was essentially the same for almost all the columns, so I've encapsulated it in a few macros. Please go ahead and check the `dbt/superside/macros` directory.
I checked that all the values in the `warehouse.clients.engagement_metrics.project_id` column correspond to a value in the `warehouse.marts.dim_project.project_id` column, and vice versa. On the other hand, when I join the tables on this criterion, some values look odd. For example, in some cases the `engagement_date` is greater than the `date_project_ended` column, which is confusing.

Another thing I found is that, in the joined table, the `customer_id` differs from the one in `dim_project` for the same `project_id`. If I add `customer_id` as a join condition, only ~90 rows match. This also seemed odd to me, so I left that condition out.

After further analysis, and as there was no source documentation provided, I wasn't able to determine whether there was a problem with the data or whether I was missing something, so I decided to perform the join on `project_id` only.
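A hedged sketch of the kind of check that surfaces these oddities; which table each column ultimately lives in is an assumption here, and the exact queries I ran may have differed:

```sql
-- Engagements dated after the project had already ended.
SELECT
    fct.project_id,
    fct.engagement_date,
    dim.date_project_ended
FROM marts.fct_engagement_metrics AS fct
INNER JOIN marts.dim_project AS dim
    ON fct.project_id = dim.project_id
WHERE fct.engagement_date > dim.date_project_ended;
```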
I tried to enforce contracts in both the staging and marts layers, but it turned out that dbt 1.5 has a bug with this feature when the data types are UUID.

I have implemented tests to ensure unique keys, non-nullity, and accepted categories in the intermediate and final models. I could implement more, but I consider these enough to demonstrate my understanding of the subject.
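The tests themselves are declared as generic tests in the models' `.yml` files; purely as an illustration, the uniqueness check could also be written as a dbt singular test:

```sql
-- Hypothetical singular test: returns the rows that violate the
-- (project_id, engagement_id) uniqueness, so the test fails if any row comes back.
select
    project_id,
    engagement_id,
    count(*) as occurrences
from {{ ref('int_engagement_metrics') }}
group by project_id, engagement_id
having count(*) > 1
```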
Do you want to know more about the dbt project? Please go ahead and serve its documentation on your machine:
make dbt-docs-generate
make dbt-docs-serve port="9000"
This will open the documentation in your browser. Adjust the port number if needed.
Please take a look at the `scripts/hide_dbt_resources_from_docs.py` file. It contains a useful CLI to hide dbt resources from the docs that can't be hidden through dbt itself because of some dbt bugs.
I've deployed a simple CI pipeline to run the pre-commit hooks on a GitHub runner. Please go ahead and check the workflow file, as well as the previous runs in the repository.
If you're struggling with some commands, please run `make help` to get all the available commands and their usage examples.
I've used poetry to manage the project's dependencies. If you want to install it on your local machine, please run:
make install-poetry
And then run:
make install-project
Then you'll have all the dependencies installed and a virtual environment created in this very directory. This is useful, for example, if you're using VS Code and want to explore the code. You might also want to use pyenv to install Python 3.10.12.
All the code in this project has been linted and formatted with these tools:
Just cloned the repo and want to play around with the pre-commit framework? Then run:
make nox-hooks
Curious about nox? Check out the nox docs.