
End-to-End-YouTube-Data-Pipeline-with-AWS

Figure: End-to-End AWS YouTube Data Pipeline Architecture - Illustrating data flow from ingestion to analytics.

This project aims to securely manage, streamline, and analyze structured and semi-structured YouTube video data based on video categories and trending metrics.


Project Goals

  1. Data Ingestion
    Build a mechanism to ingest data from different sources.

  2. ETL System
    Transform raw data into a structured, analytics-ready format.

  3. Data Lake
    Centralize structured and semi-structured data from multiple sources into a repository.

  4. Scalability
    Ensure the system scales as the size of data increases.

  5. Cloud-Based Architecture
    Leverage AWS to process vast amounts of data efficiently.

  6. Reporting
    Create a dashboard to analyze key metrics and answer business questions.


Data Source

The data for this project is sourced from Kaggle. You can access it here.


Architecture Overview

Data Flow Summary

  1. Data ingestion is performed using CLI commands located in a file in the repository.

  2. Lambda is used to process the raw JSON files, transforming them into a structured format. These transformations are handled by a Lambda function that triggers whenever a new file is uploaded to the Raw S3 bucket. The script is located here; a rough sketch of such a handler appears after this list.

  3. Glue is used for further data transformation on the structured data (CSVs), leveraging a PySpark script located here. This transformation ensures the proper data types in the schema, enabling seamless joins with the processed JSON data.

  4. The pipeline uses three S3 buckets:

    • Raw: Stores unprocessed data.
    • Cleansed: Stores the cleaned, structured intermediate data.
    • Analytics: Hosts the final, joined data ready for reporting, so there is no need to write join queries at analysis time.
  5. Glue's Visual ETL process prepares the data for analysis by joining the processed CSV and JSON data. The architecture for this process is shown below:

    Figure: Glue Visual ETL process for data transformation.

  6. The final step connects the analytics-ready data to QuickSight for dashboard generation.
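
As a concrete illustration of step 2, below is a minimal sketch of the kind of S3-triggered Lambda handler described above. It is not the repository's actual script: the environment variables, bucket paths, and catalog names are placeholders, it assumes the raw JSON carries a nested items array (as the Kaggle category files do), and it assumes the AWS SDK for pandas (awswrangler) layer is attached to the function.

```python
import os

import awswrangler as wr
import pandas as pd

# Placeholder configuration; the real function would define its own environment variables.
OUTPUT_PATH = os.environ.get("s3_cleansed_layer", "s3://<cleansed-bucket>/youtube/")
GLUE_DB = os.environ.get("glue_catalog_db_name", "db_youtube_cleaned")
GLUE_TABLE = os.environ.get("glue_catalog_table_name", "cleaned_statistics_reference_data")


def lambda_handler(event, context):
    # The function fires whenever a new object lands in the Raw S3 bucket.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    # Read the raw JSON file and flatten the nested records into a tabular frame.
    raw_df = wr.s3.read_json(path=f"s3://{bucket}/{key}")
    flat_df = pd.json_normalize(raw_df["items"])

    # Write the structured data to the Cleansed bucket as Parquet and register it
    # in the Glue Data Catalog so it can be queried downstream.
    return wr.s3.to_parquet(
        df=flat_df,
        path=OUTPUT_PATH,
        dataset=True,
        database=GLUE_DB,
        table=GLUE_TABLE,
        mode="append",
    )
```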


Querying Raw JSON Data

When the raw JSON data was queried directly, an error occurred because of its improper format, as shown below:

Figure: Error encountered while querying unprocessed JSON data.

After Running the Lambda Function

After the raw JSON data was processed with the AWS Lambda function, its format was corrected and it was converted to Parquet, enabling successful queries. The query result on the cleansed data is shown below:

Figure: Query result after processing raw JSON data with Lambda.
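
The same cleansed table can also be queried programmatically rather than through the Athena console. The snippet below is a hedged sketch using boto3; the database name, table name, and query-results location are assumptions, not values taken from this repository.

```python
import time

import boto3

athena = boto3.client("athena")

# Placeholder names: substitute the actual Glue database, table, and an S3 results location.
DATABASE = "db_youtube_cleaned"
OUTPUT_LOCATION = "s3://<athena-query-results-bucket>/"


def run_query(sql: str) -> list:
    """Run an Athena query and return the result rows once it completes."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes (simplified; production code would add a timeout/backoff).
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


# Example: preview a few rows of the cleansed reference data (table name is assumed).
rows = run_query("SELECT * FROM cleaned_statistics_reference_data LIMIT 10")
```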

Final Reporting Version

To prepare the data for reporting, the CSV data was processed with AWS Glue to ensure compatibility with the cleansed JSON data. A Glue Visual ETL job was then used to join the processed CSV and Parquet data. The final, analytics-ready data is shown below:

Figure: Joined data ready for reporting after Glue processing and transformation.
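
The join itself is configured in the Glue Visual ETL editor rather than hand-written, but the logic it performs is roughly equivalent to the PySpark sketch below. The catalog database and table names, join columns, output path, and partition keys are assumptions for illustration only.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Load the two cleansed datasets from the Glue Data Catalog (names are placeholders).
stats = glue_context.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="raw_statistics"
)
categories = glue_context.create_dynamic_frame.from_catalog(
    database="db_youtube_cleaned", table_name="cleaned_statistics_reference_data"
)

# Align the join key's data type, mirroring the type fixes applied to the CSV data.
stats = stats.resolveChoice(specs=[("category_id", "cast:long")])

# Join per-video trending statistics to their category reference data
# (join columns are assumed: category_id in the CSV data, id in the reference JSON).
joined = Join.apply(stats, categories, "category_id", "id")

# Write the analytics-ready result to the Analytics bucket as partitioned Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={
        "path": "s3://<analytics-bucket>/youtube/",
        "partitionKeys": ["category_id"],
    },
    format="parquet",
)
job.commit()
```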

Dashboard Preview

Figure: Dashboard generated from the pipeline with QuickSight.


Key Components

AWS Services Used

  • S3: For data storage (Raw, Cleansed, Analytics buckets).
  • AWS Lambda: Event-driven transformations for incoming raw data.
  • AWS Glue: ETL processing using PySpark scripts and Visual ETL jobs; Glue also catalogs, organizes, and classifies the data, ensuring discoverability and metadata management (see the crawler sketch after this list).
  • AWS IAM: For access control and permissions.
  • AWS Athena: Enables SQL-based querying of the processed data directly from the Analytics S3 bucket.
  • Amazon QuickSight: Generates dashboards and insights from the analytics-ready data.
  • AWS CloudWatch: Monitors pipeline performance and sends alerts when needed.
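
To illustrate the cataloging role of Glue mentioned above, the sketch below creates and runs a crawler over the Cleansed bucket with boto3 so that the resulting table is discoverable from Athena. The crawler name, IAM role, database, and S3 path are placeholders, not values from this repository.

```python
import boto3

glue = boto3.client("glue")

# Placeholder identifiers; substitute real names, a Glue service role, and the Cleansed bucket path.
CRAWLER_NAME = "youtube-cleansed-crawler"
GLUE_ROLE = "arn:aws:iam::<account-id>:role/<glue-service-role>"
DATABASE = "db_youtube_cleaned"
TARGET_PATH = "s3://<cleansed-bucket>/youtube/"

# Create a crawler that classifies the data under the Cleansed bucket and
# registers (or updates) the corresponding table in the Glue Data Catalog.
glue.create_crawler(
    Name=CRAWLER_NAME,
    Role=GLUE_ROLE,
    DatabaseName=DATABASE,
    Targets={"S3Targets": [{"Path": TARGET_PATH}]},
)

# Run the crawler; once it finishes, the cataloged table can be queried from Athena.
glue.start_crawler(Name=CRAWLER_NAME)
```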

How It Works

  1. Ingestion:
    Raw data from Kaggle is uploaded to the Raw S3 bucket using CLI commands (an equivalent boto3 upload is sketched after this list).

  2. Transformation:

    • AWS Lambda processes the raw JSON data, transforming it into a structured Parquet format and moving it to the Cleansed S3 bucket.
    • AWS Glue PySpark and Visual ETL further transform and prepare the data for analytics. The output is stored in the Analytics S3 bucket.
  3. Analytics:

    • Data is queried using AWS Athena and visualized using Amazon QuickSight.
  4. Monitoring:
    AWS CloudWatch is set up to monitor the system's performance and send alerts if needed.
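
The ingestion itself is done with CLI commands, but for consistency with the other sketches here is a boto3 equivalent of the same step: uploading the downloaded Kaggle files into the Raw S3 bucket. The bucket name, local directory, key prefixes, and file-naming assumptions are placeholders, not the repository's actual layout.

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")

# Placeholder bucket name and local directory holding the Kaggle download.
RAW_BUCKET = "<raw-bucket>"
DATA_DIR = Path("./trending-youtube-videos")

# Upload the category reference JSON files and the per-region CSV files to the Raw bucket.
# The key prefixes below are illustrative; the repository's CLI commands define the real layout.
for json_file in DATA_DIR.glob("*.json"):
    s3.upload_file(str(json_file), RAW_BUCKET, f"youtube/raw_statistics_reference_data/{json_file.name}")

for csv_file in DATA_DIR.glob("*.csv"):
    region = csv_file.stem[:2].lower()  # e.g. "USvideos.csv" -> "us" (assumed naming convention)
    s3.upload_file(str(csv_file), RAW_BUCKET, f"youtube/raw_statistics/region={region}/{csv_file.name}")
```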


About

This pipeline is designed to process and analyze data on trending YouTube videos efficiently using AWS cloud services. It supports the ingestion, transformation, and visualization of data for actionable insights.
