This project handles and processes big data and generates insights from it
-
Get movie details from the TMDB API
-
Data processing - clean and format the data and remove stop words (NLTK module)
-
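
As a rough illustration of the stop-word step, the sketch below filters tokens against a stop-word set. The `remove_stop_words` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins so the example runs offline; the project itself presumably takes the list from NLTK (for example `nltk.corpus.stopwords.words("english")`).

```python
import re

# Stand-in stop-word list (an assumption) so the sketch runs without downloads.
# With NLTK installed and its corpus downloaded, this could instead be
# set(nltk.corpus.stopwords.words("english")).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


cleaned = remove_stop_words("The Lord of the Rings: The Return of the King")
print(cleaned)  # lord rings return king
```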
Store the data in a data warehouse (AWS and local machine), use online analytical processing (OLAP), and simulate online transaction processing (OLTP) from S3 to the local machine using the Updater
-
Analyze the data and generate a dashboard in Tableau
-
Data Analysis.pdf - Data Analysis presentation
-
IMDB Dashboard.png - Sample Image of Dashboard
-
PreprocessF.ipynb - Preprocessing of the scraped movie details using Pandas
-
SPARK, SQL, EMR.py - Uses AWS EMR to run Spark and SQL commands for big data analysis
-
Scrape IMDB through API.ipynb - Gets movie details as a dataset with the columns 'Title', 'Year', 'Revenue', 'Budget', 'Runtime', 'Actors', 'Rating', 'Production_company', 'Genre', and 'IMDb_code'
-
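
To make the column list concrete, here is a minimal sketch of flattening one API response into a dataset row. It assumes a TMDB v3 style movie details payload (`title`, `release_date`, `production_companies`, `genres`, `imdb_id`, and `credits.cast` when credits are appended to the response). The `movie_to_row` helper, the five-actor cutoff, and the sample payload are made up for illustration and are not taken from the notebook.

```python
COLUMNS = ["Title", "Year", "Revenue", "Budget", "Runtime", "Actors",
           "Rating", "Production_company", "Genre", "IMDb_code"]


def movie_to_row(details, max_actors=5):
    """Flatten a TMDB-style movie details dict into one dataset row."""
    cast = details.get("credits", {}).get("cast", [])
    companies = details.get("production_companies", [])
    genres = details.get("genres", [])
    return {
        "Title": details.get("title"),
        # Keep only the year part of a "YYYY-MM-DD" release date.
        "Year": (details.get("release_date") or "")[:4] or None,
        "Revenue": details.get("revenue"),
        "Budget": details.get("budget"),
        "Runtime": details.get("runtime"),
        "Actors": ", ".join(person["name"] for person in cast[:max_actors]),
        "Rating": details.get("vote_average"),
        "Production_company": ", ".join(c["name"] for c in companies),
        "Genre": ", ".join(g["name"] for g in genres),
        "IMDb_code": details.get("imdb_id"),
    }


# Made-up payload for illustration only; real values come from the API.
sample = {
    "title": "Example Movie",
    "release_date": "2020-05-01",
    "revenue": 1000,
    "budget": 400,
    "runtime": 95,
    "vote_average": 7.1,
    "imdb_id": "tt0000000",
    "production_companies": [{"name": "Studio X"}],
    "genres": [{"name": "Drama"}, {"name": "Comedy"}],
    "credits": {"cast": [{"name": "Actor One"}, {"name": "Actor Two"}]},
}
row = movie_to_row(sample)
print(row)
```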
Update Scrape progress.ipynb - Checks whether new movies have been added to IMDB and appends them to the dataset. It loads the current data with the pickle module, then reruns the scraping to look for updates, which simulates Online Transaction Processing (OLTP)
-
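
The update flow described above could look roughly like the sketch below: load the pickled dataset, keep only fetched rows whose `IMDb_code` is not stored yet, and write the result back. The `load_dataset` and `update_dataset` helpers, the file name, and the sample rows are assumptions for illustration; the notebook's actual storage layout is not shown here.

```python
import os
import pickle
import tempfile


def load_dataset(path):
    """Return the pickled list of movie rows, or an empty list on first run."""
    if not os.path.exists(path):
        return []
    with open(path, "rb") as f:
        return pickle.load(f)


def update_dataset(path, fetched_rows):
    """Append rows whose IMDb_code is not stored yet and re-pickle the data."""
    dataset = load_dataset(path)
    known = {row["IMDb_code"] for row in dataset}
    new_rows = []
    for row in fetched_rows:
        if row["IMDb_code"] not in known:
            known.add(row["IMDb_code"])
            new_rows.append(row)
    dataset.extend(new_rows)
    with open(path, "wb") as f:
        pickle.dump(dataset, f)
    return new_rows


# Made-up rows; a real run would pass freshly scraped rows instead.
path = os.path.join(tempfile.mkdtemp(), "movies.pkl")
first = update_dataset(path, [{"IMDb_code": "tt0000001", "Title": "Example A"}])
second = update_dataset(path, [
    {"IMDb_code": "tt0000001", "Title": "Example A"},
    {"IMDb_code": "tt0000002", "Title": "Example B"},
])
print(len(first), len(second), len(load_dataset(path)))  # 1 1 2
```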
Use Python data cleaning in AWS s3.ipynb - Uses AWS S3 and the CLI to remove stop words from the dataset before processing
-
data warehousing.ipynb - Processes the cleaned data into a star schema
-
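
For readers unfamiliar with star schemas, the sketch below splits a few flat movie rows into two dimension tables and one fact table, then runs an OLAP-style roll-up. It uses an in-memory SQLite database as a stand-in for the warehouse. The table names (`dim_genre`, `dim_company`, `fact_movie`), the `dim_key` helper, and the sample rows are hypothetical and not taken from the notebook.

```python
import sqlite3

# Made-up flat rows, shaped like the cleaned dataset.
rows = [
    {"Title": "Example A", "Year": 2019, "Genre": "Drama",
     "Production_company": "Studio X", "Revenue": 100, "Budget": 40},
    {"Title": "Example B", "Year": 2020, "Genre": "Comedy",
     "Production_company": "Studio X", "Revenue": 60, "Budget": 30},
    {"Title": "Example C", "Year": 2020, "Genre": "Drama",
     "Production_company": "Studio Y", "Revenue": 90, "Budget": 50},
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_genre (genre_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE dim_company (company_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE fact_movie (
        title TEXT, year INTEGER, revenue INTEGER, budget INTEGER,
        genre_id INTEGER REFERENCES dim_genre (genre_id),
        company_id INTEGER REFERENCES dim_company (company_id)
    );
""")


def dim_key(table, key_column, name):
    """Insert the name into a dimension table if missing and return its key."""
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    query = f"SELECT {key_column} FROM {table} WHERE name = ?"
    return conn.execute(query, (name,)).fetchone()[0]


# The fact table keeps the measures plus foreign keys into each dimension.
for r in rows:
    conn.execute(
        "INSERT INTO fact_movie VALUES (?, ?, ?, ?, ?, ?)",
        (r["Title"], r["Year"], r["Revenue"], r["Budget"],
         dim_key("dim_genre", "genre_id", r["Genre"]),
         dim_key("dim_company", "company_id", r["Production_company"])),
    )

# OLAP-style roll-up: total revenue per genre.
revenue_by_genre = conn.execute("""
    SELECT g.name, SUM(f.revenue)
    FROM fact_movie AS f JOIN dim_genre AS g USING (genre_id)
    GROUP BY g.name ORDER BY g.name
""").fetchall()
print(revenue_by_genre)  # [('Comedy', 60), ('Drama', 190)]
```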
main.py - Uses AWS EMR to run Spark and SQL commands that generate insights from the data