Welcome to my 3rd project in the DevOps Challenge. This project automates the process of collecting, storing, and analyzing NBA player data on AWS. It fetches data from the Sportsdata.io API and sets up a data lake in AWS for easy querying and analytics.
I once again added my own challenge to the project by automating the process with GitHub Actions and logging it with CloudWatch.
- Fetch NBA Data: Gets player data from the Sportsdata.io API (a minimal fetch sketch follows this list).
- Store Data in S3: Saves the data in AWS S3 as JSON.
- Create a Data Lake: Sets up AWS Glue for data organization.
- Enable SQL Queries: Configures AWS Athena to query the data.
- Log Everything: Tracks all activities using AWS CloudWatch.
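To illustrate the fetch step, here is a minimal sketch using the `requests` library and Sportsdata.io's subscription-key header (the environment variable names mirror the GitHub Secrets described later; the real `nba_data_script.py` may differ):

```python
import os

import requests

# NBA_ENDPOINT and SPORTS_DATA_API_KEY come from GitHub Secrets,
# exposed to the script as environment variables.
NBA_ENDPOINT = os.environ["NBA_ENDPOINT"]
API_KEY = os.environ["SPORTS_DATA_API_KEY"]

def fetch_nba_players():
    """Fetch the raw player list from the Sportsdata.io API."""
    response = requests.get(
        NBA_ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": API_KEY},  # Sportsdata.io auth header
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors early
    return response.json()
```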
- Python 3.8
- AWS: S3, Glue, Athena, CloudWatch
- Sportsdata.io: NBA data API
- GitHub Actions: Automates deployment
- AWS account.
- IAM Role/Permissions: Ensure the user or role running the script has the following permissions (a policy sketch follows this list):
  - S3: `s3:CreateBucket`, `s3:PutObject`, `s3:DeleteBucket`, `s3:ListBucket`
  - Glue: `glue:CreateDatabase`, `glue:CreateTable`, `glue:DeleteDatabase`, `glue:DeleteTable`
  - Athena: `athena:StartQueryExecution`, `athena:GetQueryResults`
- Sportsdata.io API key.
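As a rough guide to the permissions above, here is a sketch of attaching a minimal inline policy with boto3 (the user name, policy name, and `"Resource": "*"` scope are illustrative assumptions; in practice you would scope resources down):

```python
import json

import boto3

iam = boto3.client("iam")

# Minimal policy mirroring the permission list above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Resource": "*",
         "Action": ["s3:CreateBucket", "s3:PutObject",
                    "s3:DeleteBucket", "s3:ListBucket"]},
        {"Effect": "Allow", "Resource": "*",
         "Action": ["glue:CreateDatabase", "glue:CreateTable",
                    "glue:DeleteDatabase", "glue:DeleteTable"]},
        {"Effect": "Allow", "Resource": "*",
         "Action": ["athena:StartQueryExecution", "athena:GetQueryResults"]},
    ],
}

iam.put_user_policy(
    UserName="nba-data-lake-user",       # illustrative IAM user
    PolicyName="nba-data-lake-minimal",  # illustrative policy name
    PolicyDocument=json.dumps(policy),
)
```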
- Add these secrets to your GitHub repository (Settings > Secrets and variables > Actions):

| Secret Name | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | AWS secret access key |
| `AWS_REGION` | AWS region (e.g., `us-east-1`) |
| `AWS_BUCKET_NAME` | Your S3 bucket name |
| `NBA_ENDPOINT` | Sportsdata.io API endpoint |
| `SPORTS_DATA_API_KEY` | Sportsdata.io API key |
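In the workflow, each secret is exposed to the script as an environment variable of the same name. A small sketch of how the script can fail fast when one is missing (the check itself is illustrative):

```python
import os

# One environment variable per GitHub Secret listed above.
REQUIRED = [
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION",
    "AWS_BUCKET_NAME", "NBA_ENDPOINT", "SPORTS_DATA_API_KEY",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")
```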
- Clone the Repo

```bash
git clone https://github.com/kingdave4/Nba_Data_Lake.git
cd Nba_Data_Lake
```
This project is meant to run the Python script automatically, with all its dependencies, through the GitHub Actions workflow located at `.github/workflows/deploy.yml`.
GitHub Actions will set up AWS resources and run the Python script `nba_data_script.py`, which handles configuration and initialization for the AWS services and then performs the following steps (a condensed sketch follows the list):
- **Creation of the S3 bucket.**
- **Creation of the Glue database.**
- **Fetching the NBA data.**
- **Converting it to JSON format.**
- **Uploading the data to S3.**
- **Creating the Glue table `nba_players`.**
- **Configuring Athena for querying.**
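A condensed sketch of those steps with boto3 (the Glue database name, object key, and column list are illustrative assumptions; the actual script's names and schema may differ):

```python
import json
import os

import boto3
import requests

REGION = os.environ["AWS_REGION"]
BUCKET = os.environ["AWS_BUCKET_NAME"]
DATABASE = "nba_data_lake"  # illustrative Glue database name
TABLE = "nba_players"

s3 = boto3.client("s3", region_name=REGION)
glue = boto3.client("glue", region_name=REGION)

# 1. Create the S3 bucket (us-east-1 must omit the location constraint).
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# 2. Create the Glue database.
glue.create_database(DatabaseInput={"Name": DATABASE})

# 3. Fetch the NBA data (see the fetch sketch earlier).
players = requests.get(
    os.environ["NBA_ENDPOINT"],
    headers={"Ocp-Apim-Subscription-Key": os.environ["SPORTS_DATA_API_KEY"]},
    timeout=30,
).json()

# 4-5. Convert to line-delimited JSON and upload under raw-data/.
body = "\n".join(json.dumps(player) for player in players)
s3.put_object(
    Bucket=BUCKET,
    Key="raw-data/nba_player_data.jsonl",  # illustrative object key
    Body=body,
)

# 6. Create the Glue table over raw-data/ so Athena can query it.
glue.create_table(
    DatabaseName=DATABASE,
    TableInput={
        "Name": TABLE,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [  # subset of the Sportsdata.io player fields
                {"Name": "FirstName", "Type": "string"},
                {"Name": "LastName", "Type": "string"},
                {"Name": "Position", "Type": "string"},
                {"Name": "Team", "Type": "string"},
            ],
            "Location": f"s3://{BUCKET}/raw-data/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    },
)
# 7. With the table in place, Athena can query it (see the query example below).
```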
This workflow is set up so that whenever a push is made to the repository, the pipeline runs automatically and deploys the script, which creates all of the resources.
Once the pipeline is complete:
- S3 Bucket: Data is stored under the `raw-data/` folder.
- AWS Glue: Manages the data schema.
- AWS Athena: Query the data using SQL.
Example Query (Athena):

```sql
SELECT FirstName, LastName, Position, Team
FROM nba_players
WHERE Position = 'SG';
```
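To run the same query programmatically, here is a minimal boto3 sketch (the database name and results location are assumptions; Athena requires an output location for query results):

```python
import boto3

athena = boto3.client("athena")  # region taken from AWS_REGION/credentials

response = athena.start_query_execution(
    QueryString=(
        "SELECT FirstName, LastName, Position, Team "
        "FROM nba_players WHERE Position = 'SG';"
    ),
    QueryExecutionContext={"Database": "nba_data_lake"},  # illustrative name
    ResultConfiguration={
        "OutputLocation": "s3://YOUR_BUCKET_NAME/athena-results/"  # assumed prefix
    },
)

# Poll get_query_execution until the state is SUCCEEDED,
# then fetch rows with get_query_results.
print("Query execution id:", response["QueryExecutionId"])
```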
🛡️ Error Tracking
- CloudWatch Logs: Tracks all activities (e.g., S3 uploads, API calls).
- Logs can help troubleshoot errors like missing API keys or AWS setup issues.
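A minimal sketch of emitting a log event to CloudWatch with boto3 (the log group and stream names are illustrative; the real script's logging setup may differ):

```python
import time

import boto3

logs = boto3.client("logs")  # region taken from AWS_REGION/credentials
GROUP = "/nba-data-lake/pipeline"  # illustrative log group
STREAM = "pipeline-run"            # illustrative log stream

# Create the group and stream once; ignore "already exists" errors.
for create, kwargs in (
    (logs.create_log_group, {"logGroupName": GROUP}),
    (logs.create_log_stream, {"logGroupName": GROUP, "logStreamName": STREAM}),
):
    try:
        create(**kwargs)
    except logs.exceptions.ResourceAlreadyExistsException:
        pass

logs.put_log_events(
    logGroupName=GROUP,
    logStreamName=STREAM,
    logEvents=[{
        "timestamp": int(time.time() * 1000),  # ms since epoch
        "message": "Uploaded player data to S3",
    }],
)
```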
What I Learned:
🌟 Used AWS tools like S3, Glue, Athena, and CloudWatch to build a system for storing and analyzing data.
🌟 Set up GitHub Actions to automate the pipeline so it runs every time new code is pushed.
🌟 Learned to keep sensitive information (like API keys and AWS credentials) safe using GitHub Secrets and .env files.
🌟 Learned to fetch real-world data from an API and save it in an organized format for analysis.
🌟 Used SQL to analyze the stored data with AWS Athena.
🌟 Set up logging in AWS CloudWatch to track the pipeline and quickly fix problems.
Future Enhancements:
🌟 Automate data ingestion with AWS Lambda
🌟 Implement a data transformation layer with AWS Glue ETL
🌟 Add advanced analytics and visualizations (AWS QuickSight)