This repository has been archived by the owner on Sep 3, 2024. It is now read-only.

A project demonstrating an ETL pipeline, built primarily on AWS infrastructure, that loads data into a data warehouse

ajschofield/de-project-bentley

Note

Now that my team and I have graduated from the Northcoders Data Engineering course, this project has been archived and made read-only. I will be continuing this project solo, which you can find here, where I will be adding more features over time.

ToteSys - Data Engineering Project

Python, AWS, Terraform, PostgreSQL, GitHub Actions

Contributors

  • Ellie Symonds (ellsymonds)
  • Lianmei Manon-og (lian-manonog)
  • Tolu Ajibade (T-Aji)
  • Joslin Rashleigh (HastarTara)
  • Anzelika Belotelova (bulve-ad)
  • Alex Schofield (ajschofield)

Summary

The project aims to implement a data platform that can extract data from an operational database, archive it in a data lake, and make it easily accessible within a remodelled OLAP data warehouse.

The solution showcases our skills in:

  • Python
  • PostgreSQL
  • Database modelling
  • Amazon Web Services (AWS)
  • Agile methodologies

Main Objectives

Our goal is to create a reliable ETL (Extract, Transform, Load) pipeline that can:

  1. Extract the data from the totesys operational database
  2. Store the data in AWS S3 buckets, which will form our data lake
  3. Transform the data into a suitable schema for the data warehouse
  4. Load the transformed data into the data warehouse hosted on AWS
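The four steps above can be sketched in a few lines of Python. This is only an illustration of how the stages hand data to one another, not the project's actual code: plain dictionaries stand in for the S3 bucket and the warehouse, and the `staff` and `dim_staff` table names, the column names, and the key layout are all assumptions made for the example.

```python
import json
from datetime import datetime, timezone


def extract(source_rows):
    """Stand-in for querying the operational database: returns rows as dicts."""
    return [dict(row) for row in source_rows]


def store(lake, table, rows, now):
    """Write rows as JSON under a timestamped key.

    `lake` is a dict standing in for an S3 bucket; the key layout is hypothetical.
    """
    key = f"{table}/{now:%Y/%m/%d/%H%M%S}.json"
    lake[key] = json.dumps(rows)
    return key


def transform(lake, key):
    """Reshape raw staff rows into a hypothetical dim_staff layout."""
    rows = json.loads(lake[key])
    return [
        {
            "staff_id": row["staff_id"],
            "full_name": f"{row['first_name']} {row['last_name']}",
        }
        for row in rows
    ]


def load(warehouse, table, rows):
    """Append rows to a table; `warehouse` is a dict standing in for the database."""
    warehouse.setdefault(table, []).extend(rows)
    return len(rows)


def run_pipeline(source_rows, lake, warehouse, now=None):
    """Run extract, store, transform, and load in order."""
    now = now or datetime.now(timezone.utc)
    raw = extract(source_rows)
    key = store(lake, "staff", raw, now)
    dim_rows = transform(lake, key)
    loaded = load(warehouse, "dim_staff", dim_rows)
    return key, loaded
```

Keeping each stage as a separate function that only touches the data lake or the warehouse through its arguments makes it straightforward to test the stages in isolation before swapping the dictionaries for real AWS clients.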

Key Features

We aim for the project to have the following features, some of which have higher priority than others.

  • Automated data ingestion from the totesys database
  • Data storage for ingested and processed data in S3 buckets
  • Data transformation for data warehouse schema
  • Automated data loading into the data warehouse schema
  • Logging and monitoring with CloudWatch
  • Notifications for errors and successful runs (e.g. successful ingestion)
  • Visualisation of warehouse data
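As a rough illustration of the logging and notification features, the sketch below wraps a pipeline step so that both outcomes are logged and reported. It assumes the step runs somewhere whose log output is shipped to CloudWatch (AWS Lambda does this for standard logging output). The `run_with_notifications` helper and the `notify` callback are hypothetical names for this example; in a real deployment `notify` might publish to a messaging service, which is an assumption rather than something this README specifies.

```python
import logging

logger = logging.getLogger("etl")


def run_with_notifications(step_name, step, notify):
    """Run step(), log the outcome, and pass a short message to notify().

    `notify` is any callable that accepts a string (hypothetical hook).
    """
    try:
        result = step()
    except Exception as exc:
        # Log the full traceback, report the failure, then re-raise
        # so the caller still sees the original error.
        logger.exception("%s failed", step_name)
        notify(f"FAILURE: {step_name}: {exc}")
        raise
    logger.info("%s succeeded", step_name)
    notify(f"SUCCESS: {step_name}")
    return result
```

For example, `run_with_notifications("ingestion", ingest, send_alert)` would log and report a successful ingestion, where `ingest` and `send_alert` are placeholders for the real step and alerting function.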