TPC-DS is a decision support benchmark often used to evaluate the performance of OLAP databases and big data systems.
The notebook in this folder runs a user-specified subset of the TPC-DS queries on the Scale Factor 10 (GiB) dataset. It uses TPCDS PySpark to execute the queries with Spark SQL on both GPU and CPU, capturing the metrics as a pandas DataFrame. It then plots a bar chart comparing the two runs, showing the GPU acceleration that RAPIDS Spark achieved for each query.
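
As a rough illustration of the comparison step, the sketch below joins per-query CPU and GPU durations into a pandas DataFrame and derives a speedup factor. It is not the notebook's actual code: the function name `speedup_table`, the column names `cpu_ms` and `gpu_ms`, and the timing values are all hypothetical placeholders, not measured results.

```python
import pandas as pd


def speedup_table(cpu_times, gpu_times):
    """Join per-query CPU and GPU durations and compute CPU/GPU speedup.

    Both arguments are dicts mapping a query name to a duration in
    milliseconds. Queries missing from either run are dropped.
    """
    df = pd.DataFrame({
        "cpu_ms": pd.Series(cpu_times, dtype="float64"),
        "gpu_ms": pd.Series(gpu_times, dtype="float64"),
    })
    df = df.dropna()  # keep only queries that ran on both CPU and GPU
    df["speedup"] = df["cpu_ms"] / df["gpu_ms"]
    df.index.name = "query"
    return df


# Placeholder timings for illustration only (not benchmark results).
cpu = {"q1": 4000.0, "q2": 9000.0, "q3": 1500.0}
gpu = {"q1": 2000.0, "q2": 3000.0, "q3": 1500.0}

table = speedup_table(cpu, gpu)
print(table)
```

With matplotlib installed, `table["speedup"].plot.bar()` would draw a per-query bar chart from such a table; a speedup above 1 means the GPU run was faster.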
This notebook can be opened and executed using standard tools:
- Jupyter(Lab)
- VS Code with the Jupyter extension
It can also be opened and evaluated in hosted notebook environments. Use the link below to launch it on Google Colab and connect it to a GPU instance.
Here is the bar chart from a recent execution on Google Colab's T4 High RAM instance using RAPIDS Spark 24.12.0 with Apache Spark 3.5.0.