In this document, you'll find information and instructions about my solution to the analytics engineer assignment.
This is the structure of the project:
```
.
├── .dockerignore
├── .env.airflow.local.example
├── .env.dbt.local.example
├── .env.postgres.local.example
├── .gitignore
├── .pre-commit-config.yaml
├── .python-version
├── .sqlfluff
├── .sqlfluffignore
├── .vscode
│   ├── extensions.json
│   └── settings.json
├── Analytics Engineer Assessment.docx
├── Dockerfile.airflow
├── Dockerfile.dbt
├── LICENSE
├── Makefile
├── README.md
├── analytics_engineer_assessment.pdf
├── dags
│   ├── .airflowignore
│   ├── settings.py
│   └── transformations.py
├── dbt
│   ├── analyses
│   │   └── .gitkeep
│   ├── dbt_project.yml
│   ├── macros
│   │   ├── .gitkeep
│   │   ├── generate_raw_data.sql
│   │   ├── generate_schema_name.sql
│   │   └── macros.yml
│   ├── models
│   │   ├── intermediate
│   │   │   ├── int_commits.sql
│   │   │   ├── int_events.sql
│   │   │   ├── int_repos.sql
│   │   │   ├── int_users.sql
│   │   │   └── intermediate.yml
│   │   ├── landing.yml
│   │   ├── marts
│   │   │   ├── marts.yml
│   │   │   └── reporting
│   │   │       ├── dim_commits.sql
│   │   │       ├── dim_repos.sql
│   │   │       ├── dim_users.sql
│   │   │       ├── fct_events.sql
│   │   │       └── reporting.yml
│   │   └── staging
│   │       ├── staging.yml
│   │       ├── stg_actors.sql
│   │       ├── stg_commits.sql
│   │       ├── stg_events.sql
│   │       └── stg_repos.sql
│   ├── packages.yml
│   ├── profiles.yml
│   ├── seeds
│   │   ├── .gitkeep
│   │   ├── raw
│   │   │   ├── actors.csv
│   │   │   ├── commits.csv
│   │   │   ├── events.csv
│   │   │   └── repos.csv
│   │   └── seeds.yml
│   ├── snapshots
│   │   └── .gitkeep
│   └── tests
│       └── .gitkeep
├── docker-compose.yml
├── images
│   ├── airflow_dag.png
│   ├── lineage.png
│   └── raw_erd.png
├── mypy.ini
├── noxfile.py
├── poetry.lock
├── pyproject.toml
└── scripts
    └── postgres_init.sh

17 directories, 63 files
```
This solution is containerized, so you'll need to install Docker and Docker Compose.
It's also recommended to have a desktop SQL client like DBeaver.
As an optional later step, you can install the recommended VS Code extensions.
Let's dive into the setup process.
Open a shell on your machine and navigate to this directory. Then run:

```sh
make generate-dotenv
```

This will generate three `.env` files with predefined values. Please go ahead and open them! If you want to modify some values, just take into account that this may break some things.
Run these commands in this sequence:

```sh
make install-poetry
make install-project
make dbt-install-pkgs
```

Optionally, if you've cloned the repo, you can run:

```sh
make install-pre-commit
```

to install the pre-commit hooks and play around with them.
Run:

```sh
make build services="postgres bootstrap-dbt"
```

This will build all the required images.

Run:

```sh
make up services="postgres bootstrap-dbt"
```

This will create a PostgreSQL database and all the raw tables, and run a command that populates those tables with the provided data.
Open DBeaver and set up the connection to the database. If you didn't modify the `.env` files, you can use these credentials:

- User: `clara`
- Password: `clara`
- Host: `localhost`
- Port: `5440`
- DB: `clara`
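If you want a quick sanity check that the seeding worked, you can count the rows in the raw tables. A minimal sketch, assuming all four seeds land in the `raw` schema used by the exploration queries later in this document (`raw.events` and `raw.commits` are assumed to follow the same naming convention as `raw.actors` and `raw.repos`):

```sql
-- Row counts for the seeded raw tables.
SELECT 'actors' AS table_name, COUNT(*) AS num_rows FROM raw.actors
UNION ALL
SELECT 'repos', COUNT(*) FROM raw.repos
UNION ALL
SELECT 'events', COUNT(*) FROM raw.events
UNION ALL
SELECT 'commits', COUNT(*) FROM raw.commits;
```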
Then, please open the `queries.sql` and `view.sql` files and run the queries in DBeaver to verify the results.

If you don't have DBeaver, you can run your queries from PostgreSQL's terminal with psql. To do this, please run:

```sh
make execute-sql
```

Then you can run them from the terminal.
In this section, we'll materialize the data model with `dbt`.

In your terminal, run:

```sh
make dbt-run-model node="--target prod"
```

and wait until all the models are finished.
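Once the run is finished, you can check that the marts were materialized. A quick check, assuming the final models land in the `reporting` schema used by the queries later in this document:

```sql
-- List the relations that dbt built in the reporting schema.
SELECT
    table_schema
    , table_name
    , table_type
FROM information_schema.tables
WHERE table_schema = 'reporting'
ORDER BY table_name;
```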
First of all, I manually inspected the provided raw data. Then, I took a look at the GitHub Events API docs.
Once I had that in mind, I understood the relationships between the provided tables. Here's an ERD:

![Raw data ERD](images/raw_erd.png)
The relationship highlights are:
- One actor/user can have multiple events (e.g., `event_type = 'PushEvent'` and different commit SHAs)
- One repository can have multiple events
- One commit represents one single transaction
Taking a closer look at the raw data, I realized that there were some duplicates in the `repos` and `actors` tables, and I found (mainly) two strange things in those tables.

First, there are two different usernames with the same id (`59176384`):
```sql
SELECT
    id
    , COUNT(DISTINCT username) AS num_of_users_per_id
FROM raw.actors
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5
```
The usernames are:

| id | username |
|---|---|
| 59176384 | starnetwifi |
| 59176384 | starwifi88 |
So I decided to use `DISTINCT ON` in the pipeline as the deduplication logic, so that only the first row remains.
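As a rough sketch of that logic (the `ORDER BY` column used to pick the surviving row is an assumption here; the actual models may order differently):

```sql
-- Keep one row per actor id; DISTINCT ON keeps the first row of each group
-- according to the ORDER BY clause.
SELECT DISTINCT ON (id)
    id
    , username
FROM raw.actors
ORDER BY id, username
```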
Second, there are 14 repository IDs repeated with different names:

```sql
SELECT
    id
    , COUNT(DISTINCT name) AS num_of_names_per_id
FROM raw.repos
GROUP BY 1
ORDER BY 2 DESC
LIMIT 15
```
For example, the ID `230999134` has the following names:

| id | name |
|---|---|
| 230999134 | hseera/dynamodb-billing-mode |
| 230999134 | hseera/dynamodb-update-capacity |
| 230999134 | hseera/update-dynamodb-capacity |
So I applied the same deduplication logic to this table in the pipeline. These decisions were made because no further explanation was provided.

Another thing worth mentioning is that the `PullRequestEvent` events don't have payload data, so it's impossible to distinguish between `opened`, `edited`, `closed`, etc. I've assumed that every `PullRequestEvent` corresponds to a PR `opened` event.
This is because of the nature of the first question:

> Top 10 active users sorted by the amount of PRs created and commits pushed

The real question that I'll be answering is:

> Top 10 active users sorted by the amount of PR events and commits pushed

Please take into account that, as per the question, the commits do not necessarily have to be related to the same PR.
Finally, I understood the phrase *active users* as referring to accounts that are not bots.
I thought: *I have the questions that I need to answer, so... how might a SQL query that answers them look?*

I'm assuming that the data consumers are familiar with SQL. If this is not the case, the solution might be to create a specific report schema and tables with the results of the following queries.

Let's think about the first one:

> Top 10 active users sorted by the amount of PRs created and commits pushed

It will look something like this:
```sql
SELECT
    fct_events.user_id
    , dim_users.username
    , COUNT(*) AS num_prs_created_and_commits_pushed
FROM some_schema.fct_events
LEFT JOIN some_schema.dim_users
    ON fct_events.user_id = dim_users.id
WHERE fct_events."type" IN ('PushEvent', 'PullRequestEvent')
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10
```
Where:

- `dim_users` is a dimension table, containing the user ID and username
- `fct_events` is the fact table, containing all the events

At first sight, `dim_users` could be an SCD type 2, since the username rarely changes over time, but it can. That seemed like overkill for this specific case, so I decided to model it as a type 0.
Performing a similar exercise for the rest of the questions:

> Top 10 repositories sorted by the amount of commits pushed

> Top 10 repositories sorted by the amount of watch events

I realized that the queries would be quite similar to the previous one, and the other dimensions were very straightforward. So these tables were created too:

- `dim_commits` is a dimension table, containing the commit ID, the commit SHA, and the event ID
- `dim_repos` is a dimension table, containing the repo ID and name
I decided to use classic modular data modeling techniques and thought about these layers:

- `staging`: just a copy of the landing/source tables with some type casting (if needed), in order to standardize
- `intermediate`: here I'll place reusable models, with some deduplication logic
- `marts`: here I'll place the final models, in a star schema (facts surrounded by dimensions)
Since the raw data doesn't need much processing (just some deduplication), all of the models in the `staging` and `intermediate` layers are quite similar, and the only real difference between them is the deduplication logic. I've created a macro to apply the DRY principle in these layers.
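The actual macro lives in `dbt/macros/`; as a rough illustration of the idea (the macro name and arguments below are hypothetical, not the project's real implementation):

```sql
-- Hypothetical dedup macro: keep the first row per key using DISTINCT ON.
{% macro deduplicate(relation, key_column, order_by_column) %}
    SELECT DISTINCT ON ({{ key_column }}) *
    FROM {{ relation }}
    ORDER BY {{ key_column }}, {{ order_by_column }}
{% endmacro %}
```

With something like this, an intermediate model can be as small as `{{ deduplicate(ref('stg_actors'), 'id', 'username') }}`.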
The final lineage graph is as follows:

![dbt lineage graph](images/lineage.png)
Using the data model created with `dbt`, you can answer the required questions. Please run these queries in DBeaver to verify the results.
```sql
-- Top 10 active users sorted by the amount of PRs created and commits pushed
SELECT
    fct_events.user_id AS user_id
    , dim_users.username AS username
    , COUNT(*) AS num_prs_created_and_commits_pushed
FROM reporting.fct_events
LEFT JOIN reporting.dim_users
    ON fct_events.user_id = dim_users.id
WHERE fct_events."type" IN ('PushEvent', 'PullRequestEvent')
    AND NOT username ~* '-bot|\[bot\]|bot$'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10
```
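As mentioned earlier, if the data consumers aren't comfortable with SQL, a result like this could be persisted in a dedicated schema. A possible sketch (the `report` schema and view name below are illustrative, not part of the project):

```sql
-- Persist the "top 10 active users" answer as a view for non-SQL consumers.
CREATE SCHEMA IF NOT EXISTS report;

CREATE OR REPLACE VIEW report.top_10_active_users AS
SELECT
    fct_events.user_id AS user_id
    , dim_users.username AS username
    , COUNT(*) AS num_prs_created_and_commits_pushed
FROM reporting.fct_events
LEFT JOIN reporting.dim_users
    ON fct_events.user_id = dim_users.id
WHERE fct_events."type" IN ('PushEvent', 'PullRequestEvent')
    AND NOT username ~* '-bot|\[bot\]|bot$'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10;
```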
```sql
-- Top 10 repositories sorted by the amount of commits pushed
SELECT
    fct_events.repo_id AS repo_id
    , dim_repos.name AS repo_name
    , COUNT(*) AS num_commits_per_repo
FROM reporting.fct_events
LEFT JOIN reporting.dim_repos
    ON fct_events.repo_id = dim_repos.id
WHERE fct_events.commit_sha IS NOT NULL
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10
```
```sql
-- Top 10 repositories sorted by the amount of watch events
SELECT
    fct_events.repo_id AS repo_id
    , dim_repos.name AS repo_name
    , COUNT(*) AS num_watch_events_per_repo
FROM reporting.fct_events
LEFT JOIN reporting.dim_repos
    ON fct_events.repo_id = dim_repos.id
WHERE fct_events."type" = 'WatchEvent'
GROUP BY 1, 2
ORDER BY 3 DESC
LIMIT 10
```
I've added some tests in the `intermediate` and `reporting` layers to verify the correctness of the data and ensure data quality.
Generally speaking, the tests aim to ensure:
- No ID is missing
- Data types are as expected
- There are no duplicates
- Critical columns are present
To run the tests, open a terminal and run:

```sh
make dbt-test-model node="--target prod"
```
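For illustration, the "no duplicates" goal could also be expressed as a singular test in `dbt/tests/`. A minimal sketch, assuming `fct_events` exposes an `id` column for the event (the project's actual tests are defined in the layers' `.yml` files):

```sql
-- A dbt singular test fails if it returns any rows: here, event ids that
-- appear more than once in the fact table.
SELECT
    id
    , COUNT(*) AS num_rows
FROM {{ ref('fct_events') }}
GROUP BY 1
HAVING COUNT(*) > 1
```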
Also, there are some model contracts enforced in the `intermediate` and `reporting` layers, in order to avoid inserting duplicates, nulls, etc., and to enforce the relationships between models.
If you're an Airflow fan (like me), you can set up an environment to run the dbt pipeline in an Airflow DAG.
To do this, please run (these commands are similar to the setup process):

```sh
make build
make up
```
Then, go to http://localhost:8080/ and log in with the credentials `airflow:airflow`. You'll find a DAG named `transformations`; please go ahead and click on it. You'll see a DAG like this:

![Airflow DAG](images/airflow_dag.png)
If you want to test it, just click on the toggle button and run the pipeline.
Please note that the models are run before their tests in the same DAG.
If you're struggling with some commands, please run `make help` to get all the available commands.
I've used Poetry to manage the project's dependencies. If you want to install it on your local machine, please run:

```sh
make install-poetry
```

And then run:

```sh
make install-project
```

Then you'll have all the dependencies installed and a virtual environment created in this very directory. This is useful, for example, if you're using VS Code and want to explore the code. Also, you might want to use pyenv to install Python 3.10.12.
All the code in this project has been linted and formatted with the tools configured in the repo (see `.pre-commit-config.yaml` and `.sqlfluff`).
Just cloned the repo and want to play around with the pre-commit framework? Just run:

```sh
make nox-hooks
```