Airflow server for personal data integration and experimentation.
This repository contains a Docker Development Container for VSCode and the infrastructure and workflows for my personal Airflow instance. It has been deployed as a Python 3.12 application to Azure App Service on a relatively small instance with a small PostgreSQL metadata database.
The following data will be ingested from my personal systems into a BigQuery warehouse for automation and analysis.
- Notion
- HubSpot
- Google Contacts
```mermaid
graph TB
    %% Sources
    S1[Notion]
    S2[HubSpot]

    subgraph raw
        direction TB
        L1[Daily Habits]
        L2[Weekly Habits]
        L3[Contacts]
        L4[Companies]
        L5[Engagements]
    end

    %% Raw to Staging Flows
    S1 --> L1
    S1 --> L2
    S2 --> L3
    S2 --> L4
    S2 --> L5

    subgraph staging
        C1[Notion Habits]
    end

    %% Staging to Intermediate Flows
    L1 --> C1
    L2 --> C1
```
- Alembic database migrations for raw tables, whose schemas should generally match the source systems, run via an Airflow provider package
- Airflow to orchestrate data loading scripts and additional automated workflows
- dbt Core to define data models and transformations, again orchestrated by Airflow via the CLI / bash TaskFlow (see the sketch below)
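As an illustration of the dbt orchestration pattern mentioned above, here is a minimal sketch using Airflow's `@task.bash` TaskFlow decorator (Airflow 2.9+) to call the dbt CLI. The DAG name, schedule, and project path are hypothetical, not the repository's actual DAGs.

```python
import pendulum
from airflow.decorators import dag, task

# Hypothetical path to a dbt project inside the deployed repository
DBT_PROJECT_DIR = "/home/site/wwwroot/dbt/example_project"


@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def dbt_build_example():
    @task.bash
    def dbt_build() -> str:
        # The returned string is executed as a bash command by the TaskFlow bash task
        return f"cd {DBT_PROJECT_DIR} && dbt build"

    dbt_build()


dbt_build_example()
```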
The project has been structured and designed with inspiration from dbt project recommendations and other sources.
- dbt projects stored in a separate subdirectory from the DAGs (at least for now)
- DAGs and dbt projects organised at the top level by owner (should more people get involved)
- Further organisation by data source and / or function
- Naming generally follows the dbt recommendation `[layer]_[source]__[entity]`, adapted for Airflow DAGs with `__[refresh-type]` and other modifications as needed
While it generally isn't recommended to maintain infrastructure and workflows in the same repository, it is not a major concern for this basic setup. The `AIRFLOW_HOME` variable is mapped to the repository root so that updated configuration is loaded across environments from `airflow.cfg` and `webserver_config.py`, some of which is overridden by the environment variables outlined below.
To run Airflow on a single instance, I used Honcho to run multiple processes (webserver + scheduler) via a Procfile.
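The repository's Procfile isn't reproduced here, but for Honcho it would look roughly like the following sketch (the commands and port are assumptions):

```
web: airflow webserver --port 8000
scheduler: airflow scheduler
```

Honcho then starts both processes with a single `honcho start`.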
To deploy to Azure App Service:

- Create Web App + PostgreSQL with Python
- Turn on Application Insights, Logging
- Set relevant environment variables for Airflow (a key-generation sketch follows this list):
  - `AIRFLOW_HOME=/home/site/wwwroot` to run Airflow from the deployed application folder
  - `AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://{username}:{password}@{host}:{port}/{database}` from the Azure database
  - `AIRFLOW__CORE__FERNET_KEY={generated-key}` following this guidance to encrypt connection data
  - `AIRFLOW__CORE__INTERNAL_API_SECRET_KEY={generated-secret1}` following this guidance
  - `AIRFLOW__WEBSERVER__SECRET_KEY={generated-secret2}` following the guidance above
  - `AIRFLOW__WEBSERVER__INSTANCE_NAME=MY INSTANCE!`
- Generate Publish Profile file and deploy application code from GitHub
- Set the startup command to use the `startup.txt` file
- Run database migrations (`airflow db migrate`) and user setup (`airflow users create`) as a one-off admin process; the Procfile is just for the main processes
  - Reference the quick start for guidance on this setup process
  - It may be necessary to run these via the startup command to get the app to launch
- I referenced this workflow to deploy a Python app to App Service using Publish Profile basic authentication
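For the Fernet and secret keys referenced in the environment variables above, values can be generated with a short Python snippet along these lines (standard Airflow practice; the exact method used for this deployment is not shown in the repository):

```python
import secrets

from cryptography.fernet import Fernet

# Fernet key for AIRFLOW__CORE__FERNET_KEY (used to encrypt connection credentials)
print(Fernet.generate_key().decode())

# Random hex secrets for the webserver and internal API secret keys
print(secrets.token_hex(16))
print(secrets.token_hex(16))
```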
Airflow connections are configured for:

- Google Cloud BigQuery, using the Airflow BigQuery provider and dbt
- Notion, using the Notion Client (see the sketch below)
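As a rough sketch of how the Notion connection is used, the snippet below queries a Notion database with the Notion Client; the environment variable names and database ID are placeholders, not the repository's actual configuration.

```python
import os

from notion_client import Client

# Placeholder credentials; in Airflow these would come from a connection or variable
notion = Client(auth=os.environ["NOTION_TOKEN"])

# Query a (placeholder) habits database and list the returned page IDs
response = notion.databases.query(database_id=os.environ["NOTION_HABITS_DATABASE_ID"])
for page in response["results"]:
    print(page["id"])
```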
Unit testing and the local instance are connected to a separate Google Cloud Platform project for development purposes.
To develop locally:

- Build the Dev Container in VSCode; this will run `script/setup` to install dependencies (with dev)
- To run the server locally, run `honcho start` in the terminal
- Add connection settings in the interface or upload them via file
- Write and run unit tests for DAGs (see the sketch below)
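A common pattern for DAG unit tests is a DagBag import check like the one below; the repository's actual tests may go further, but this is the minimal version.

```python
from airflow.models import DagBag


def test_dags_import_without_errors():
    # include_examples=False keeps Airflow's bundled example DAGs out of the check
    dag_bag = DagBag(include_examples=False)
    assert dag_bag.import_errors == {}, f"DAG import errors: {dag_bag.import_errors}"
```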