Airflow server for personal data integration and experimentation.
This repository contains a Docker Development Container for VSCode and the infrastructure and workflows for my personal Airflow instance. It has been deployed as a Python 3.12 application to Azure App Service on a relatively small instance with a small PostgreSQL metadata database.
The following data will be ingested from my personal systems into a BigQuery warehouse for automation and analysis.
- Notion
- HubSpot
- Google Contacts
```mermaid
graph TB
    %% Sources
    S1[Notion]
    S2[HubSpot]

    subgraph raw
        direction TB
        L1[Daily Habits]
        L2[Weekly Habits]
        L3[Contacts]
        L4[Companies]
        L5[Engagements]
    end

    %% Raw to Staging Flows
    S1 --> L1
    S1 --> L2
    S2 --> L3
    S2 --> L4
    S2 --> L5

    subgraph staging
        C1[Notion Habits]
    end

    %% Staging to Intermediate Flows
    L1 --> C1
    L2 --> C1
```
- Alembic database migrations for raw tables, whose schemas should generally match the source systems, run via an Airflow provider package
- Airflow to orchestrate data loading scripts and additional automated workflows
- dbt Core to define data models and transformations, again orchestrated by Airflow via the CLI / bash TaskFlow (see the sketch below)
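As an illustration of the dbt orchestration pattern mentioned above, here is a minimal sketch using Airflow's `@task.bash` TaskFlow decorator (Airflow 2.9+) to call the dbt CLI. The DAG name, schedule, and project path are hypothetical, not the repository's actual DAGs.

```python
import pendulum
from airflow.decorators import dag, task

# Hypothetical path to a dbt project inside the deployed repository
DBT_PROJECT_DIR = "/home/site/wwwroot/dbt/example_project"


@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def dbt_build_example():
    @task.bash
    def dbt_build() -> str:
        # The returned string is executed as a bash command by the TaskFlow bash task
        return f"cd {DBT_PROJECT_DIR} && dbt build"

    dbt_build()


dbt_build_example()
```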
The project has been structured and designed with inspiration from dbt project recommendations and other sources.
- dbt projects stored in a separate subdirectory from the DAGs (at least for now)
- DAGs and dbt projects organised at the top level by owner (should more people get involved)
- Further organisation by data source and / or function
- Naming generally follows the dbt recommendation `[layer]_[source]__[entity]`, adapted for Airflow DAGs with `__[refresh-type]` and other modifications as needed
While it generally isn't recommended to maintain infrastructure and workflows in the same repository, it is not a major concern for this basic setup. The `AIRFLOW_HOME` variable is mapped to the repository root so that updated configuration is loaded across environments from `airflow.cfg` and `webserver_config.py`, some of which is overridden by the environment variables outlined below.
To run Airflow on a single instance, I used Honcho to run multiple processes (webserver + scheduler) via a Procfile.
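The repository's Procfile isn't reproduced here, but for Honcho it would look roughly like the following sketch (the commands and port are assumptions):

```
web: airflow webserver --port 8000
scheduler: airflow scheduler
```

Honcho then starts both processes with a single `honcho start`.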
To deploy to Azure App Service:

- Create Web App + PostgreSQL with Python
- Turn on Application Insights, Logging
- Set relevant environment variables for Airflow (a key-generation sketch follows this list):
  - `AIRFLOW_HOME=/home/site/wwwroot` to run Airflow from the deployed application folder
  - `AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://{username}:{password}@{host}:{port}/{database}` from the Azure database
  - `AIRFLOW__CORE__FERNET_KEY={generated-key}` following this guidance to encrypt connection data
  - `AIRFLOW__CORE__INTERNAL_API_SECRET_KEY={generated-secret1}` following this guidance
  - `AIRFLOW__WEBSERVER__SECRET_KEY={generated-secret2}` following the guidance above
  - `AIRFLOW__WEBSERVER__INSTANCE_NAME=MY INSTANCE!`
- Generate Publish Profile file and deploy application code from GitHub
- Set the startup command to use the `startup.txt` file
- Run database migrations (`airflow db migrate`) and user setup (`airflow users create`) as a one-off admin process; the Procfile is just for the main processes
  - Reference the quick start for guidance on this setup process
  - It may be necessary to run these via the startup command to get the app to launch
- I referenced this workflow to deploy a Python app to App Service using Publish Profile basic authentication
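For the Fernet and secret keys referenced in the environment variables above, values can be generated with a short Python snippet along these lines (standard Airflow practice; the exact method used for this deployment is not shown in the repository):

```python
import secrets

from cryptography.fernet import Fernet

# Fernet key for AIRFLOW__CORE__FERNET_KEY (used to encrypt connection credentials)
print(Fernet.generate_key().decode())

# Random hex secrets for the webserver and internal API secret keys
print(secrets.token_hex(16))
print(secrets.token_hex(16))
```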
Airflow connections are configured for:

- Google Cloud BigQuery, using the Airflow BigQuery provider and dbt
- Notion, using the Notion Client (see the sketch below)
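As a rough sketch of how the Notion connection is used, the snippet below queries a Notion database with the Notion Client; the environment variable names and database ID are placeholders, not the repository's actual configuration.

```python
import os

from notion_client import Client

# Placeholder credentials; in Airflow these would come from a connection or variable
notion = Client(auth=os.environ["NOTION_TOKEN"])

# Query a (placeholder) habits database and list the returned page IDs
response = notion.databases.query(database_id=os.environ["NOTION_HABITS_DATABASE_ID"])
for page in response["results"]:
    print(page["id"])
```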
Unit testing and the local instance are connected to a separate Google Cloud Platform project for development purposes.
To develop locally:

- Build the Dev Container in VSCode; this will run `script/setup` to install dependencies (with dev)
- To run the server locally, run `honcho start` in the terminal
- Add connection settings in the interface or upload them via file
- Write and run unit tests for DAGs (see the sketch below)
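A common pattern for DAG unit tests is a DagBag import check like the one below; the repository's actual tests may go further, but this is the minimal version.

```python
from airflow.models import DagBag


def test_dags_import_without_errors():
    # include_examples=False keeps Airflow's bundled example DAGs out of the check
    dag_bag = DagBag(include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"
```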