Showcases skills in AWS, big data warehousing, Spark, data gathering, data analysis, dashboarding, and reporting.

PedGit025/Big-Data-Showcase

What

This project handles and processes big data and generates insights from it.

Process

  1. Get Movie Details from TMDB (API)

  2. Data processing: clean and format the data and remove stop words (NLTK module)

  3. Store the data in a data warehouse (AWS and local machine), use online analytical processing (OLAP), and simulate online transaction processing (OLTP) from S3 to the local machine using the Updater

  4. Analyze the data and generate a dashboard in Tableau
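The stop-word removal in step 2 can be illustrated with a minimal sketch. This is not the project's notebook code: the function name `remove_stop_words` and the small `SAMPLE_STOP_WORDS` set are assumptions made for illustration, standing in for the NLTK stop-word list the project uses.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch).
# With NLTK installed and its stopwords corpus downloaded, you could pass
# set(nltk.corpus.stopwords.words("english")) instead.
SAMPLE_STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}


def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


print(remove_stop_words("The Lord of the Rings: The Return of the King"))
# prints "lord rings return king"
```

Passing the stop-word set as a parameter keeps the sketch runnable without any downloads while leaving room to swap in a fuller list.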

Files

  1. Data Analysis.pdf - Data Analysis presentation

  2. IMDB Dashboard.png - Sample Image of Dashboard

  3. PreprocessF.ipynb - Preprocesses the scraped movie details using pandas

  4. SPARK, SQL, EMR.py - Uses AWS EMR to run Spark and SQL commands for big data analysis

  5. Scrape IMDB through API.ipynb - Gets movie details as a dataset with the columns 'Title', 'Year', 'Revenue', 'Budget', 'Runtime', 'Actors', 'Rating', 'Production_company', 'Genre', and 'IMDb_code'

  6. Update Scrape progress.ipynb - Checks whether new movies have been added to IMDB and adds them to the dataset. It loads the current data with the pickle module, then runs the scraper to check for updates, simulating online transaction processing (OLTP)

  7. Use Python data cleaning in AWS s3.ipynb - Uses AWS S3 and the CLI to remove stop words from the dataset before processing

  8. data warehousing.ipynb - Processes the cleaned data into a star schema

  9. main.py - Uses AWS EMR to run Spark and SQL commands that generate insights from the data
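The update step described for Update Scrape progress.ipynb (load the pickled dataset, then add only movies not yet stored) could look roughly like the sketch below. The function names, the list-of-dicts layout, the file name, and the sample codes are hypothetical; only the use of pickle and the 'IMDb_code' column come from the descriptions above.

```python
import os
import pickle
import tempfile


def load_dataset(path):
    """Return the pickled list of movie dicts, or an empty list if no file exists."""
    if not os.path.exists(path):
        return []
    with open(path, "rb") as fh:
        return pickle.load(fh)


def update_dataset(path, fetched_movies, key="IMDb_code"):
    """Append movies whose key is not stored yet, re-pickle, and return the additions."""
    current = load_dataset(path)
    known = {movie[key] for movie in current}
    added = []
    for movie in fetched_movies:
        if movie[key] not in known:
            added.append(movie)
            known.add(movie[key])
    current.extend(added)
    with open(path, "wb") as fh:
        pickle.dump(current, fh)
    return added


# Demo with placeholder records (hypothetical codes and titles).
demo_path = os.path.join(tempfile.mkdtemp(), "movies.pkl")
first_run = update_dataset(demo_path, [{"IMDb_code": "tt0000001", "Title": "Movie A"}])
second_run = update_dataset(
    demo_path,
    [
        {"IMDb_code": "tt0000001", "Title": "Movie A"},
        {"IMDb_code": "tt0000002", "Title": "Movie B"},
    ],
)
print(len(first_run), len(second_run), len(load_dataset(demo_path)))
```

Keying on 'IMDb_code' means re-running the scraper does not duplicate movies that are already in the dataset.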

Co-Creators

[email protected]

[email protected]

[email protected]
