The goal of this project is to implement the skeleton of a robust ELT pipeline. Things to consider are:
- version control
- development flow
- project file structure
- unit testing
- logging
- documentation
- virtual environments/dependency management
- orchestration
- general best practices for data engineering
- containerization
- supporting downstream analytics/ML
Strava API --> Python --> BigQuery + dbt --> Tableau/ML in Jupyter Notebook
- light data transformation with Pandas (see the extract-and-load sketch after this list)
- orchestration through Google Cloud services
- data storage in BigQuery
- final data transformations (dimensional modeling + OBT) for downstream analytics through dbt
- containerization via Docker
- ELT job notifications sent through Slack
- downstream analytics supported by this pipeline
  - dashboard via Tableau
  - cycling ML model via Python/scikit-learn
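A minimal sketch of the extract-and-load step is shown below. It assumes a valid Strava access token and an existing BigQuery dataset; names such as `STRAVA_ACCESS_TOKEN`, `my-gcp-project`, and `strava.activities` are placeholders, and `pandas-gbq` is used here only as one possible load path.

```python
"""Sketch: pull activities from the Strava API, lightly clean with Pandas,
and append to BigQuery. Project, dataset, and env-var names are placeholders."""
import os

import pandas as pd
import pandas_gbq
import requests

STRAVA_ACTIVITIES_URL = "https://www.strava.com/api/v3/athlete/activities"


def extract_activities(access_token: str, per_page: int = 200) -> pd.DataFrame:
    """Fetch recent activities from the Strava API into a DataFrame."""
    response = requests.get(
        STRAVA_ACTIVITIES_URL,
        headers={"Authorization": f"Bearer {access_token}"},
        params={"per_page": per_page, "page": 1},
        timeout=30,
    )
    response.raise_for_status()
    return pd.DataFrame(response.json())


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Light cleanup only; heavier modeling is left to dbt downstream."""
    df = df.copy()
    df["start_date"] = pd.to_datetime(df["start_date"])
    keep = ["id", "name", "type", "distance", "moving_time", "start_date"]
    return df[keep]


def load(df: pd.DataFrame) -> None:
    """Append the batch to BigQuery (table is created if it does not exist)."""
    pandas_gbq.to_gbq(
        df,
        destination_table="strava.activities",  # placeholder dataset.table
        project_id="my-gcp-project",             # placeholder project id
        if_exists="append",
    )


if __name__ == "__main__":
    token = os.environ["STRAVA_ACCESS_TOKEN"]    # placeholder env var
    load(transform(extract_activities(token)))
```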
- Python application is containerized and pushed to Google Cloud Artifact Registry
- Container is then deployed on Cloud Run Jobs at a set schedule
- Every night at midnight, the ELT pipeline runs, checking for new data to upload to BigQuery
- At job completion, a Slack notification with job metadata and success status is sent (see the sketch below)
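The notification step could look like the sketch below, assuming a Slack incoming-webhook URL is available in a `SLACK_WEBHOOK_URL` environment variable (placeholder name); the metadata fields shown are illustrative.

```python
"""Sketch: post a job-completion summary to Slack via an incoming webhook."""
import os
from datetime import datetime, timezone

import requests


def notify_slack(job_name: str, rows_loaded: int, success: bool) -> None:
    """Send a short summary message to the configured Slack channel."""
    status = ":white_check_mark: success" if success else ":x: failure"
    message = (
        f"ELT job *{job_name}* finished with {status}\n"
        f"rows loaded: {rows_loaded}\n"
        f"finished at: {datetime.now(timezone.utc).isoformat()}"
    )
    response = requests.post(
        os.environ["SLACK_WEBHOOK_URL"],  # placeholder env var
        json={"text": message},
        timeout=10,
    )
    response.raise_for_status()


if __name__ == "__main__":
    notify_slack("strava-elt", rows_loaded=42, success=True)
```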
- configs: .yml file with API tokens, db user/password, and ELT parameters (a config-loading sketch follows this list)
- src: source code
- tests: unit tests
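Loading that config could be as simple as the sketch below, assuming a `configs/config.yml` file with hypothetical keys; in practice, secrets like API tokens and db passwords would typically come from a secret manager rather than a file checked into the repo.

```python
"""Sketch: read the pipeline configuration from a YAML file with PyYAML.
The path and key names are hypothetical."""
from pathlib import Path

import yaml  # PyYAML


def load_config(path: str = "configs/config.yml") -> dict:
    """Read the pipeline configuration into a plain dict."""
    with Path(path).open("r", encoding="utf-8") as f:
        return yaml.safe_load(f)


if __name__ == "__main__":
    config = load_config()
    # e.g. config["strava"]["access_token"], config["bigquery"]["project_id"]
    print(sorted(config.keys()))
```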