- To install all the required libraries and dependencies, run the following command in your virtual environment: `pip3 install -r requirements.txt`
- Download the initial Amazon dataset from here.
- Split the dataset into manageable chunks by running the `split_dataset.py` file as follows (see the sketch below for what such a script might do): `sudo ${SPARK_HOME}/bin/spark-submit split_dataset.py Electronics.json output`
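For reference, a minimal sketch of what such a splitting script could look like, assuming it simply reads the raw JSON reviews with PySpark and rewrites them as smaller part files. This is not the repository's actual `split_dataset.py`, and the partition count is an arbitrary assumption:

```python
# split_dataset.py -- hypothetical sketch, not the repository's actual script.
# Reads the raw Electronics.json review dump and rewrites it as many smaller
# JSON part files so the data can be uploaded and processed in chunks.
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]  # e.g. Electronics.json output

    spark = SparkSession.builder.appName("split_dataset").getOrCreate()

    reviews = spark.read.json(input_path)

    # Repartition into a fixed number of chunks (the count here is an assumption)
    # and write each partition out as its own JSON part file under output_path.
    reviews.repartition(50).write.mode("overwrite").json(output_path)

    spark.stop()
```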
- Upload the split files to S3.
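The upload can be done from the S3 console or the AWS CLI; as one hedged option, a small boto3 script along these lines would also work (the bucket name and key prefix below are taken from the `spark-submit` example further down, but treat them as placeholders for your own setup):

```python
# Hypothetical helper to upload the split part files to S3 -- the bucket name,
# key prefix and local directory are placeholders, not project configuration.
import os
import boto3

s3 = boto3.client("s3")
bucket = "amazon-product-recommender"   # assumed bucket name
prefix = "ElectronicProductDataZIP/"    # assumed key prefix

for name in os.listdir("output"):       # directory written by split_dataset.py
    s3.upload_file(os.path.join("output", name), bucket, prefix + name)
```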
- Set up an EMR cluster to run a PySpark script (refer to Assignment 5).
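The cluster can be created from the EMR console as in Assignment 5. If you prefer to script it, a boto3 sketch such as the following can spin up a small Spark cluster; the release label, instance types, and counts are illustrative assumptions, not required values:

```python
# Hypothetical EMR cluster creation -- release label, instance types and
# counts are illustrative, not the assignment's required configuration.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="amazon-product-recommender",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])
```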
- Run the `model_creation.py` file (to create a model based on the split training data) on the cluster with a suitable configuration, as follows: `spark-submit --deploy-mode cluster --conf spark.yarn.maxAppAttempts=1 s3://amazon-product-recommender/scripts/model_creation.py s3://amazon-product-recommender/ElectronicProductDataZIP/ s3://amazon-product-recommender/output`
- This will create a model and place it in S3.
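The actual training logic lives in `model_creation.py` and is not reproduced here. Purely as an illustration, a minimal Spark ML job that trains a review-sentiment classifier from the split data and saves the fitted pipeline to S3 might look like the sketch below; the column names, the rating-based labelling rule, and the choice of estimator are all assumptions:

```python
# Hypothetical sketch of a model-creation job -- column names, the rating-based
# labelling rule and the choice of estimator are assumptions, not the project's.
import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

if __name__ == "__main__":
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("model_creation").getOrCreate()

    reviews = spark.read.json(input_path)

    # Derive a binary sentiment label from the star rating (assumed field "overall").
    labelled = reviews.withColumn(
        "label", F.when(F.col("overall") >= 4.0, 1.0).otherwise(0.0)
    ).select("reviewText", "label").na.drop()

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="reviewText", outputCol="words"),
        HashingTF(inputCol="words", outputCol="rawFeatures"),
        IDF(inputCol="rawFeatures", outputCol="features"),
        LogisticRegression(maxIter=20),
    ])

    model = pipeline.fit(labelled)
    model.write().overwrite().save(output_path)  # e.g. an s3:// path

    spark.stop()
```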
- The scraper script was run as an AWS Lambda function. However, if required, it can be run locally as follows:
- First, create a queue in SQS, then add your AWS_ACCESS_KEY, AWS_SECRET_KEY and AWS_SQS_QUEUE_NAME to `scraper.py`.
- Run the Python scraper script with the command `python3 scraper.py`. This script scrapes the required data from Amazon.ca and pushes it to the SQS queue (see the sketch below).
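The scraping logic itself is specific to `scraper.py`; reduced to the SQS hand-off, a hedged sketch of how scraped reviews could be pushed onto the queue with boto3 looks like this (credentials, queue name, region, and message fields are placeholders):

```python
# Hypothetical sketch of the scraper's SQS hand-off -- credentials, queue name,
# region and message fields are placeholders, not the project's real values.
import json
import boto3

AWS_ACCESS_KEY = "YOUR_ACCESS_KEY"
AWS_SECRET_KEY = "YOUR_SECRET_KEY"
AWS_SQS_QUEUE_NAME = "your-queue-name"

sqs = boto3.resource(
    "sqs",
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name="us-east-1",
)
queue = sqs.get_queue_by_name(QueueName=AWS_SQS_QUEUE_NAME)

def push_review(review: dict) -> None:
    """Serialise one scraped review and push it onto the SQS queue."""
    queue.send_message(MessageBody=json.dumps(review))

# Example: one scraped review (fields are illustrative).
push_review({"product_id": "B00EXAMPLE", "review_text": "Great sound quality."})
```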
- Initialize an EC2 instance and set up a MySQL server. Add the DB credentials to the `sentiment_analyzer.py` file.
- Add your AWS_ACCESS_KEY, AWS_SECRET_KEY and AWS_SQS_QUEUE_NAME to `sentiment_analyzer.py`.
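The exact configuration layout inside `sentiment_analyzer.py` is not shown here; as an assumption, the values from the two steps above might simply be module-level constants that you fill in, for example:

```python
# Hypothetical configuration block for sentiment_analyzer.py -- all values
# are placeholders that you replace with your own credentials.
DB_HOST = "your-ec2-host"
DB_USER = "your-db-user"
DB_PASSWORD = "your-db-password"
DB_NAME = "your-db-name"

AWS_ACCESS_KEY = "YOUR_ACCESS_KEY"
AWS_SECRET_KEY = "YOUR_SECRET_KEY"
AWS_SQS_QUEUE_NAME = "your-queue-name"
```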
- Run `sentiment_analyzer.py` using the following command: `python3 sentiment_analyzer.py path_to_model`
- This file ingests all messages from the SQS queue and runs them through the model; the predicted output labels are then written to the corresponding rows in the MySQL DB (see the sketch below).
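For orientation, a hedged sketch of such an ingest-and-predict loop is shown below; it assumes the saved model is a Spark `PipelineModel`, that `pymysql` is used for the database, and it invents table and column names purely for illustration:

```python
# Hypothetical sketch of the analyzer loop -- credentials, table/column names
# and the model-loading details are assumptions, not the project's real code.
import json
import sys

import boto3
import pymysql
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

model_path = sys.argv[1]  # path_to_model from the command line

spark = SparkSession.builder.appName("sentiment_analyzer").getOrCreate()
model = PipelineModel.load(model_path)

sqs = boto3.resource("sqs", region_name="us-east-1")
queue = sqs.get_queue_by_name(QueueName="your-queue-name")  # placeholder

db = pymysql.connect(host="localhost", user="user", password="password",
                     database="reviews")                    # placeholder credentials

while True:
    for message in queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=20):
        review = json.loads(message.body)

        # Score the review text with the trained pipeline.
        df = spark.createDataFrame([(review["review_text"],)], ["reviewText"])
        prediction = model.transform(df).first()["prediction"]

        # Persist the predicted label for the corresponding product review.
        with db.cursor() as cursor:
            cursor.execute(
                "UPDATE reviews SET sentiment = %s WHERE product_id = %s",
                (int(prediction), review["product_id"]),
            )
        db.commit()

        message.delete()
```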
- On the same EC2 instance, set up a Grafana server.
- In the Grafana server, add your existing MySQL server as a data source.
- Import `grafana_dashboard.json`; you should now be able to visualize the latest data from the database.