This repo contains all of the code used in our comparisons between RAPIDS, Scikit-Learn, Spark, and Pandas, featured here: https://medium.com/sfu-cspmp/rapids-the-future-of-gpu-data-science-9e0524563019
These notebooks create the graphs used in the blog post.
- Rapids ETL Timing.ipynb
- Rapids ML Timing.ipynb
- Pandas Timing.ipynb
- Scikit-Learning Timing.ipynb
These notebooks perform the experiments and record the timings. The link to the data used by these notebooks is given below.
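The notebooks record wall-clock timings for each operation. A minimal harness along these lines can do that; `time_op` and the sample workload below are hypothetical illustrations, not code taken from this repo:

```python
import time

def time_op(fn, repeats=3):
    # Run fn several times and return the fastest wall-clock time in seconds.
    # Taking the minimum reduces noise from other processes on the machine.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Example: time a simple in-memory aggregation (a stand-in for an ETL step).
data = list(range(1_000_000))
elapsed = time_op(lambda: sum(data))
```

The same harness works for Pandas, Scikit-Learn, or RAPIDS calls by swapping in the relevant operation as `fn`.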
This script generates the data for the ML experiments (also available below; see ml_data.zip).
These PySpark scripts partition the data for Spark, which speeds up the experiments.
These PySpark scripts execute the experiments using Spark. Submit them as follows:
- `spark-submit spark_etl_tests.py bc_air_monitoring_stations.csv spark_etl_test_subsets spark_etl_results`
- `spark-submit spark_ml_tests.py spark_ml_test_subsets spark_ml_results`
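Scripts run via `spark-submit` receive those trailing values as ordinary command-line arguments. A sketch of how the ETL script might map them to named paths follows; the `parse_args` helper is hypothetical, not code from this repo:

```python
import sys

def parse_args(argv):
    # Map the positional spark-submit arguments to named paths.
    # Assumed shape: input CSV, subsets directory, results directory.
    if len(argv) != 3:
        raise SystemExit(
            "usage: spark-submit spark_etl_tests.py <input.csv> <subsets_dir> <results_dir>"
        )
    return {"input": argv[0], "subsets": argv[1], "results": argv[2]}

if __name__ == "__main__":
    print(parse_args(sys.argv[1:]))
```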
All data used can be found here: https://1sfu-my.sharepoint.com/:f:/g/personal/avickars_sfu_ca/Erj8utK-OatOiN9aOpWZZGABFWtGZyYPm29KrTQuc8_gWw?e=k5DXO9
The results for AWS are located in the "AWS Results" folder of this repo.