This repo contains all of the code used in our comparisons between RAPIDS, Scikit-Learn, Spark, and Pandas, featured here: https://medium.com/sfu-cspmp/rapids-the-future-of-gpu-data-science-9e0524563019
These notebooks create the graphs used in the blog post.
- Rapids ETL Timing.ipynb
- Rapids ML Timing.ipynb
- Pandas Timing.ipynb
- Scikit-Learning Timing.ipynb
These notebooks perform the experiments and record the timings. The link to the data used by these notebooks is given below.
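The notebooks record wall-clock timings for each operation. A minimal harness along these lines can do that; `time_op` and the sample workload below are hypothetical illustrations, not code taken from this repo:

```python
import time

def time_op(fn, repeats=3):
    # Run fn several times and return the fastest wall-clock time in seconds.
    # Taking the minimum reduces noise from other processes on the machine.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Example: time a simple in-memory aggregation (a stand-in for an ETL step).
data = list(range(1_000_000))
elapsed = time_op(lambda: sum(data))
```

The same harness works for Pandas, Scikit-Learn, or RAPIDS calls by swapping in the relevant operation as `fn`.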
This script generates the data for the ML experiments (also available below; see ml_data.zip).
These PySpark scripts partition the data for Spark, which speeds up the experiments.
These PySpark scripts execute the experiments using Spark. Submit them as follows:
- `spark-submit spark_etl_tests.py bc_air_monitoring_stations.csv spark_etl_test_subsets spark_etl_results`
- `spark-submit spark_ml_tests.py spark_ml_test_subsets spark_ml_results`
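Scripts run via `spark-submit` receive those trailing values as ordinary command-line arguments. A sketch of how the ETL script might map them to named paths follows; the `parse_args` helper is hypothetical, not code from this repo:

```python
import sys

def parse_args(argv):
    # Map the positional spark-submit arguments to named paths.
    # Assumed shape: input CSV, subsets directory, results directory.
    if len(argv) != 3:
        raise SystemExit(
            "usage: spark-submit spark_etl_tests.py <input.csv> <subsets_dir> <results_dir>"
        )
    return {"input": argv[0], "subsets": argv[1], "results": argv[2]}

if __name__ == "__main__":
    print(parse_args(sys.argv[1:]))
```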
All data used can be found here: https://1sfu-my.sharepoint.com/:f:/g/personal/avickars_sfu_ca/Erj8utK-OatOiN9aOpWZZGABFWtGZyYPm29KrTQuc8_gWw?e=k5DXO9
The results for AWS are located in the "AWS Results" folder of this repo.