This repository contains the code and configuration to build a fully automated, scalable data pipeline. The pipeline extracts data from a MySQL database on AWS RDS, transforms it into a star schema with AWS Glue, and loads the transformed data into Amazon S3 for querying with Amazon Athena. Terraform manages the infrastructure as code, ensuring repeatability and scalability.
## Contents

- Architecture
- Technologies Used
- Project Setup
- Prerequisites
- Installation
- Running the Pipeline
- Challenges and Solutions
- Conclusion
## Architecture

![Data Pipeline Architecture](images/ETL.drawio.png)
The data pipeline includes the following stages (a minimal Glue job sketch illustrating the flow follows the list):
- Data Ingestion: Extract data from MySQL on AWS RDS.
- Transformation: Use AWS Glue to clean and transform the data into a star schema.
- Data Storage: Load the transformed data into Amazon S3.
- Data Querying: Query the data using Amazon Athena.
- Visualization: Create interactive dashboards with Jupyter Lab for data exploration.
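The actual extract–transform–load logic lives in `glue_job.py` (see Installation below). As a rough orientation, a Glue job covering these stages typically looks like the sketch here; the catalog database, table, column names, and S3 bucket are placeholders, not this project's real values.

```python
# Hedged sketch of a Glue ETL job covering the three core stages.
# Database/table/column names and the S3 path are placeholders; the real logic is in glue_job.py.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# 1. Ingestion: read the source table from MySQL on RDS via the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",      # assumed catalog database
    table_name="orders",      # assumed source table
)

# 2. Transformation: map source columns onto the fact table of the star schema.
fact_orders = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("orderNumber", "int", "order_number", "int"),
        ("customerNumber", "int", "customer_number", "int"),
        ("orderAmount", "double", "order_amount", "double"),
        ("country", "string", "country", "string"),
    ],
)

# 3. Load: write the fact table to S3 as Parquet so Athena can query it.
glue_context.write_dynamic_frame.from_options(
    frame=fact_orders,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/star-schema/fact_orders/"},
    format="parquet",
)

job.commit()
```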
## Technologies Used

- AWS Glue: For ETL (Extract, Transform, Load) processing.
- Amazon RDS (MySQL): Source database for storing transactional data.
- Amazon S3: Scalable object storage for both raw and transformed data.
- Amazon Athena: Serverless query service to analyze data stored in S3.
- Jupyter Lab: For data visualization and analysis.
- Terraform: Infrastructure as Code (IaC) tool to automate AWS resource provisioning.
## Project Setup

### Prerequisites

- AWS account
- Python 3.9+
- Terraform installed (v1.0+)
- AWS CLI configured with appropriate permissions
### Installation

- Clone the repository:

  ```bash
  git clone https://github.com/your-repo/data-pipeline-aws.git
  cd data-pipeline-aws
  ```
- Initialize Terraform:

  ```bash
  terraform init
  ```
- Configure Terraform: Adjust the `variables.tf` file to set your AWS region, RDS instance details, and S3 bucket names. The `.tf` files declare the resources of the data architecture; you still need to tell AWS Glue how to perform the data extraction, transformation, and load. These steps are implemented in the Python script `glue_job.py`.
- Deploy the infrastructure:

  ```bash
  terraform apply
  ```
- Configure and start AWS Glue jobs: Follow the instructions in `glue_job.py` to configure your Glue job, then start it with the AWS CLI (a boto3 equivalent is sketched below):

  ```bash
  aws glue start-job-run --job-name <your-glue-job-name>
  ```
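If you prefer to trigger the job from Python instead of the CLI, a hedged boto3 equivalent might look like the following; the job name and region are placeholders.

```python
# Hedged boto3 equivalent of `aws glue start-job-run`; job name and region are placeholders.
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# Start the job run.
run_id = glue.start_job_run(JobName="your-glue-job-name")["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="your-glue-job-name", RunId=run_id)["JobRun"]["JobRunState"]
    print(f"Job run {run_id}: {state}")
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```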
## Running the Pipeline

Once the infrastructure is deployed and Glue jobs are running:
- Ingest Data: AWS Glue extracts data from MySQL.
- Transform Data: The data is transformed into a star schema and stored in Amazon S3.
- Query with Athena: Use Athena to run SQL queries on the transformed data (a boto3 example that runs this query programmatically follows the list). Example query:

  ```sql
  SELECT country, SUM(orderAmount) AS total_sales
  FROM fact_orders
  GROUP BY country
  ORDER BY total_sales DESC;
  ```
- Data Visualization: Use Jupyter Lab notebooks to visualize the data. Example code to visualize total sales per country:

  ```python
  import pandas as pd
  import seaborn as sns
  import matplotlib.pyplot as plt

  # Reading directly from S3 requires the s3fs package.
  df = pd.read_csv('s3://path-to-your-data/sales_data.csv')
  sns.barplot(x='Country', y='Total Sales', data=df)
  plt.show()
  ```
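To run the Athena query outside the console, a hedged boto3 sketch follows; the database name, results bucket, and region are placeholders.

```python
# Hedged boto3 sketch for running the example Athena query from Python.
# The database name, S3 results location, and region are placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

query = """
SELECT country, SUM(orderAmount) AS total_sales
FROM fact_orders
GROUP BY country
ORDER BY total_sales DESC
"""

# Start the query; Athena writes result files to the given S3 location.
query_id = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "sales_db"},  # assumed Glue/Athena database
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
)["QueryExecutionId"]

# Wait for the query to reach a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print the result rows (the first row holds the column headers).
rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
for row in rows[1:]:
    print([field.get("VarCharValue") for field in row["Data"]])
```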
## Challenges and Solutions

- Optimizing Glue Job Performance: Initially, Glue jobs were slow due to resource limitations. Increasing the number of workers and using pushdown predicates improved performance (a sketch of these techniques follows this list).
- Efficient Querying: Partitioning data in S3 by date and country improved Athena’s query performance and reduced costs.
- Schema Management: AWS Glue’s DynamicFrames were used to handle schema evolution and mismatches between MySQL and the transformed data.
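The snippets below sketch how these three techniques are commonly expressed in a Glue job; the database, table, partition columns, and bucket names are assumptions, not the project's actual values (pushdown predicates apply to partitioned Data Catalog tables).

```python
# Hedged snippets showing how these techniques are typically written in a Glue job.
# Database, table, partition columns, and bucket names are assumptions.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Pushdown predicate: filter partitions at read time so less data is loaded.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",                         # assumed catalog database
    table_name="orders",                         # assumed source table
    push_down_predicate="order_year >= '2024'",  # assumed partition column
)

# Schema management: DynamicFrames can resolve type mismatches explicitly.
orders = orders.resolveChoice(specs=[("orderAmount", "cast:double")])

# Partitioned output: writing by date and country lets Athena prune partitions,
# which speeds up queries and reduces the data scanned.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={
        "path": "s3://your-bucket/star-schema/fact_orders/",
        "partitionKeys": ["order_date", "country"],
    },
    format="parquet",
)
```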
Let us know if you encounter any issues or have ideas for improvements! 🚀
Author: ✍️
- Freda Victor
- Project Blog.
- Date: 17 September 2024