This repository contains the code and configuration to build a fully automated, scalable data pipeline. The pipeline extracts data from a MySQL database on AWS RDS, transforms it into a star schema with AWS Glue, and loads the transformed data into Amazon S3 for querying with Amazon Athena. Terraform manages the infrastructure as code, ensuring repeatability and scalability.
## Contents

- Architecture
- Technologies Used
- Project Setup
- Prerequisites
- Installation
- Running the Pipeline
- Challenges and Solutions
- Conclusion
## Architecture

![Data Pipeline Architecture](images/ETL.drawio.png)
The data pipeline includes the following stages (a minimal Glue job sketch illustrating the flow follows the list):
- Data Ingestion: Extract data from MySQL on AWS RDS.
- Transformation: Use AWS Glue to clean and transform the data into a star schema.
- Data Storage: Load the transformed data into Amazon S3.
- Data Querying: Query the data using Amazon Athena.
- Visualization: Create interactive dashboards with Jupyter Lab for data exploration.
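The actual extract–transform–load logic lives in `glue_job.py` (see Installation below). As a rough orientation, a Glue job covering these stages typically looks like the sketch here; the catalog database, table, column names, and S3 bucket are placeholders, not this project's real values.

```python
# Hedged sketch of a Glue ETL job covering the three core stages.
# Database/table/column names and the S3 path are placeholders; the real logic is in glue_job.py.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# 1. Ingestion: read the source table from MySQL on RDS via the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",      # assumed catalog database
    table_name="orders",      # assumed source table
)

# 2. Transformation: map source columns onto the fact table of the star schema.
fact_orders = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("orderNumber", "int", "order_number", "int"),
        ("customerNumber", "int", "customer_number", "int"),
        ("orderAmount", "double", "order_amount", "double"),
        ("country", "string", "country", "string"),
    ],
)

# 3. Load: write the fact table to S3 as Parquet so Athena can query it.
glue_context.write_dynamic_frame.from_options(
    frame=fact_orders,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/star-schema/fact_orders/"},
    format="parquet",
)

job.commit()
```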
## Technologies Used

- AWS Glue: For ETL (Extract, Transform, Load) processing.
- Amazon RDS (MySQL): Source database for storing transactional data.
- Amazon S3: Scalable object storage for both raw and transformed data.
- Amazon Athena: Serverless query service to analyze data stored in S3.
- Jupyter Lab: For data visualization and analysis.
- Terraform: Infrastructure as Code (IaC) tool to automate AWS resource provisioning.
## Project Setup

### Prerequisites

- AWS account
- Python 3.9+
- Terraform installed (v1.0+)
- AWS CLI configured with appropriate permissions
### Installation

- Clone the repository:

  ```bash
  git clone https://github.com/your-repo/data-pipeline-aws.git
  cd data-pipeline-aws
  ```
- Initialize Terraform:

  ```bash
  terraform init
  ```
- Configure Terraform: Adjust the `variables.tf` file to set your AWS region, RDS instance details, and S3 bucket names. The `.tf` files declare the resources of the data architecture; you still need to tell AWS Glue how to perform the data extraction, transformation, and load. These steps are implemented in the Python script `glue_job.py`.
- Deploy the infrastructure:

  ```bash
  terraform apply
  ```
- Configure and start AWS Glue jobs: Follow the instructions in `glue_job.py` to configure your Glue job, then start it with the AWS CLI (a boto3 equivalent is sketched below):

  ```bash
  aws glue start-job-run --job-name <your-glue-job-name>
  ```
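If you prefer to trigger the job from Python instead of the CLI, a hedged boto3 equivalent might look like the following; the job name and region are placeholders.

```python
# Hedged boto3 equivalent of `aws glue start-job-run`; job name and region are placeholders.
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# Start the job run.
run_id = glue.start_job_run(JobName="your-glue-job-name")["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="your-glue-job-name", RunId=run_id)["JobRun"]["JobRunState"]
    print(f"Job run {run_id}: {state}")
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```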
## Running the Pipeline

Once the infrastructure is deployed and Glue jobs are running:
- Ingest Data: AWS Glue extracts data from MySQL.
- Transform Data: The data is transformed into a star schema and stored in Amazon S3.
- Query with Athena: Use Athena to run SQL queries on the transformed data (a boto3 example that runs this query programmatically follows the list). Example query:

  ```sql
  SELECT country, SUM(orderAmount) AS total_sales
  FROM fact_orders
  GROUP BY country
  ORDER BY total_sales DESC;
  ```
- Data Visualization: Use Jupyter Lab notebooks to visualize the data. Example code to visualize total sales per country:

  ```python
  import pandas as pd
  import seaborn as sns
  import matplotlib.pyplot as plt

  # Reading directly from S3 requires the s3fs package.
  df = pd.read_csv('s3://path-to-your-data/sales_data.csv')
  sns.barplot(x='Country', y='Total Sales', data=df)
  plt.show()
  ```
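To run the Athena query outside the console, a hedged boto3 sketch follows; the database name, results bucket, and region are placeholders.

```python
# Hedged boto3 sketch for running the example Athena query from Python.
# The database name, S3 results location, and region are placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

query = """
SELECT country, SUM(orderAmount) AS total_sales
FROM fact_orders
GROUP BY country
ORDER BY total_sales DESC
"""

# Start the query; Athena writes result files to the given S3 location.
query_id = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "sales_db"},  # assumed Glue/Athena database
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)["QueryExecutionId"]

# Wait for the query to reach a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print the result rows (the first row holds the column headers).
rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
for row in rows[1:]:
    print([field.get("VarCharValue") for field in row["Data"]])
```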
## Challenges and Solutions

- Optimizing Glue Job Performance: Initially, Glue jobs were slow due to resource limitations. Increasing the number of workers and using pushdown predicates improved performance (a sketch of these techniques follows this list).
- Efficient Querying: Partitioning data in S3 by date and country improved Athena’s query performance and reduced costs.
- Schema Management: AWS Glue’s DynamicFrames were used to handle schema evolution and mismatches between MySQL and the transformed data.
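The snippets below sketch how these three techniques are commonly expressed in a Glue job; the database, table, partition columns, and bucket names are assumptions, not the project's actual values (pushdown predicates apply to partitioned Data Catalog tables).

```python
# Hedged snippets showing how these techniques are typically written in a Glue job.
# Database, table, partition columns, and bucket names are assumptions.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Pushdown predicate: filter partitions at read time so less data is loaded.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",                         # assumed catalog database
    table_name="orders",                         # assumed source table
    push_down_predicate="order_year >= '2024'",  # assumed partition column
)

# Schema management: DynamicFrames can resolve type mismatches explicitly.
orders = orders.resolveChoice(specs=[("orderAmount", "cast:double")])

# Partitioned output: writing by date and country lets Athena prune partitions,
# which speeds up queries and reduces the data scanned.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={
        "path": "s3://your-bucket/star-schema/fact_orders/",
        "partitionKeys": ["order_date", "country"],
    },
    format="parquet",
)
```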
Let us know if you encounter any issues or have ideas for improvements! 🚀
Author: ✍️
- Freda Victor
- Project Blog.
- Date: 17 September 2024