What

This project handles and processes big data and generates insights from it.

Process

Get movie details from the TMDB API
Data processing: clean and format the data and remove stop words (NLTK module)
Store the data in a data warehouse (AWS and local machine), use online analytical processing (OLAP), and simulate online transaction processing (OLTP) from S3 to the local machine using the Updater
Analyze the data and generate a dashboard in Tableau
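
The stop-word removal step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `remove_stop_words` and the small inline `STOP_WORDS` set are assumptions made so the example runs without downloads. With NLTK installed and its "stopwords" corpus downloaded, the set could be replaced by `set(nltk.corpus.stopwords.words("english"))`.

```python
import re

# Small illustrative stop-word set (an assumption for this sketch).
# The project itself relies on NLTK's stop-word list instead.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


cleaned = remove_stop_words("The Lord of the Rings: The Return of the King")
print(cleaned)  # lord rings return king
```

Passing the stop-word set as a parameter keeps the cleaning logic independent of where the list comes from, so the same function would work locally and in a notebook that reads from S3.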

Data Analysis.pdf - Data analysis presentation
IMDB Dashboard.png - Sample image of the dashboard
PreprocessF.ipynb - Preprocesses the scraped movie details using pandas
SPARK, SQL, EMR.py - Uses AWS EMR to run Spark and SQL commands for big data analysis
Scrape IMDB through API.ipynb - Gets movie details as a dataset with the columns 'Title', 'Year', 'Revenue', 'Budget', 'Runtime', 'Actors', 'Rating', 'Production_company', 'Genre', and 'IMDb_code'
Update Scrape progress.ipynb - Checks whether new movies have been added to IMDB and adds them to the dataset. It loads the current data with the pickle module, then runs the scraper to check for updates, simulating online transaction processing (OLTP)
Use Python data cleaning in AWS s3.ipynb - Uses AWS S3 and the AWS CLI to remove stop words from the dataset before processing
data warehousing.ipynb - Transforms the cleaned data into a star schema
main.py - Uses AWS EMR to run Spark and SQL commands that generate insights from the data
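
The update step described for Update Scrape progress.ipynb (load the current data with pickle, fetch again, add only what is new) might look something like the sketch below. The function names, the file name `movies.pkl`, and the placeholder records are all hypothetical; the real notebook fetches records from the API rather than using a hard-coded list.

```python
import os
import pickle
import tempfile


def load_dataset(path):
    """Load the pickled list of movie records, or start empty if none exists."""
    if not os.path.exists(path):
        return []
    with open(path, "rb") as f:
        return pickle.load(f)


def save_dataset(path, movies):
    """Write the list of movie records back to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(movies, f)


def merge_new_movies(current, fetched, key="IMDb_code"):
    """Return the dataset plus any fetched records whose key is not yet present."""
    seen = {movie[key] for movie in current}
    added = []
    for movie in fetched:
        if movie[key] not in seen:
            seen.add(movie[key])
            added.append(movie)
    return current + added, added


# Demo with placeholder records (not real data).
path = os.path.join(tempfile.mkdtemp(), "movies.pkl")
save_dataset(path, [{"IMDb_code": "tt0000001", "Title": "Example A"}])

fetched = [
    {"IMDb_code": "tt0000001", "Title": "Example A"},  # already stored
    {"IMDb_code": "tt0000002", "Title": "Example B"},  # new record
]
movies, added = merge_new_movies(load_dataset(path), fetched)
save_dataset(path, movies)
print(len(movies), [m["Title"] for m in added])  # 2 ['Example B']
```

Keying on 'IMDb_code' is an assumption based on the dataset columns listed above; any column that uniquely identifies a movie would serve the same purpose.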

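The star-schema step handled by data warehousing.ipynb can be pictured with a small sketch: a flat table is split into a dimension table (here, genres) and a fact table that refers to it by ID. This is only an illustration in plain Python under assumed names (`to_star_schema`, `genre_id`) with placeholder rows; the notebook's actual schema and tooling may differ.

```python
def to_star_schema(rows):
    """Split flat movie rows into a genre dimension and a movie fact table."""
    genre_ids = {}
    facts = []
    for row in rows:
        # Assign the next surrogate ID the first time a genre is seen.
        genre_id = genre_ids.setdefault(row["Genre"], len(genre_ids) + 1)
        facts.append(
            {
                "IMDb_code": row["IMDb_code"],
                "genre_id": genre_id,
                "Revenue": row["Revenue"],
                "Budget": row["Budget"],
            }
        )
    dim_genre = [{"genre_id": gid, "Genre": name} for name, gid in genre_ids.items()]
    return dim_genre, facts


# Placeholder rows (not real data).
flat_rows = [
    {"IMDb_code": "tt0000001", "Genre": "Drama", "Revenue": 100, "Budget": 40},
    {"IMDb_code": "tt0000002", "Genre": "Comedy", "Revenue": 80, "Budget": 30},
    {"IMDb_code": "tt0000003", "Genre": "Drama", "Revenue": 60, "Budget": 50},
]
dim_genre, fact_movies = to_star_schema(flat_rows)
print(dim_genre)
print([fact["genre_id"] for fact in fact_movies])  # [1, 2, 1]
```

Storing the genre name once in the dimension table and referencing it by ID in the fact table is what lets OLAP-style queries group and aggregate the numeric measures cheaply.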