Lightweight Python ETL toolkit using Prefect.
- Free software: MIT license
- Documentation: https://dkapitan.github.io/nimbletl
Flexible Python ETL toolkit for data warehousing, based on Dask, Prefect and the pydata stack. It follows the original design principles of these libraries, combined with a functional programming approach to data engineering.
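The functional approach can be illustrated with a minimal sketch in plain Python (no Prefect required to run it; the step names below are hypothetical examples, not part of the nimbletl API). Pipelines are built by composing small, pure functions that each take and return plain data; in a real flow, each step would be wrapped as a Prefect task.

```python
from functools import reduce


def compose(*funcs):
    """Compose functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda data: reduce(lambda acc, f: f(acc), funcs, data)


# Hypothetical ETL steps for illustration only.
def extract(_):
    # Stand-in for reading source data (e.g. a CSV download or API call).
    return [
        {"city": "Amsterdam", "population": 872757},
        {"city": "Rotterdam", "population": 651446},
    ]


def transform(rows):
    # Pure transformation: derive a new column without mutating the input.
    return [
        {**row, "population_m": round(row["population"] / 1e6, 2)}
        for row in rows
    ]


def load(rows):
    # Stand-in for writing to staging (GCS) or the data vault (GBQ).
    return {"rows_loaded": len(rows), "rows": rows}


etl = compose(extract, transform, load)
result = etl(None)
```

Because every step is a pure function, the same composition can be tested locally with in-memory data and only later be bound to GCS and BigQuery.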
Google Cloud Platform (GCP) is used as the core infrastructure, particularly BigQuery (GBQ) and Cloud Storage (GCS) as the main storage engines. We follow Google's recommendations on how to use BigQuery for data warehouse applications with four layers:
- source data, in a production environment or file-based
- staging, on GCS
- data vault, on GBQ
- datamarts, on GBQ, using the ARRAY_AGG, STRUCT, UNNEST SQL pattern
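The datamart layer relies on BigQuery's nested and repeated fields: an `ARRAY_AGG(STRUCT(...))` query collapses a one-to-many relation into one nested row per key, and `UNNEST` expands it back to flat rows when querying. As a sketch of the pattern (the dataset, table and column names here are made up for illustration):

```python
# Illustrative BigQuery SQL using the ARRAY_AGG / STRUCT / UNNEST pattern;
# dataset, table and column names are hypothetical.
datamart_query = """
SELECT
  customer_id,
  ARRAY_AGG(STRUCT(order_id, order_date, amount)) AS orders
FROM staging.orders
GROUP BY customer_id
"""

# Reading the nested datamart back as flat rows uses UNNEST:
flatten_query = """
SELECT customer_id, o.order_id, o.amount
FROM datamart.customers, UNNEST(orders) AS o
"""
```

The nested form keeps each business key in a single row, which avoids join fan-out and fits BigQuery's columnar storage well.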
nimble (/ˈnɪmb(ə)l/): quick and light in movement or action; agile. Also a wink at Kimball, the godfather of the star schema.
pip install -e git+https://github.com/dkapitan/nimbletl.git#egg=nimbletl
A conda environment is included for convenience, containing the most commonly used packages.
conda env create -f environment.yml
Try nl-open-data to see nimbletl in action, creating a data warehouse from various Dutch open data sources.
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.