Fix Spark-DL notebooks for CI/CD and update to latest dependencies #439

Merged on Oct 15, 2024 (37 commits)
Changes from 30 commits

Commits
7c9ac29  Update Spark-DL examples (rishic3, Sep 30, 2024)
307f177  Update README.md (rishic3, Sep 30, 2024)
7679c6b  Update README.md (rishic3, Sep 30, 2024)
6677c01  update numpy versions (rishic3, Sep 30, 2024)
f74496c  Merge remote-tracking branch 'origin/branch-24.10' into branch-24.10 (rishic3, Sep 30, 2024)
6920dda  Update readme/reqs, cond_gen_tf works (rishic3, Oct 1, 2024)
68e5db9  update image_classif (rishic3, Oct 1, 2024)
197323a  update torch examples with dynamo compilation (rishic3, Oct 1, 2024)
2a048ba  Test modelopt warning (rishic3, Oct 1, 2024)
36de140  torch examples updated for aot compilation (rishic3, Oct 3, 2024)
56db56c  Separate conditional generation to tf/torch on triton (rishic3, Oct 3, 2024)
8b8bcab  Huggingface ex's updated with standalone (rishic3, Oct 3, 2024)
a8a29ae  Reran TF ex's with standalone (rishic3, Oct 3, 2024)
0b998ef  Remove setMaster for nbconvert (rishic3, Oct 4, 2024)
a0b0688  Update installation instructions (rishic3, Oct 4, 2024)
48c3b1e  Address TF warnings/errors (rishic3, Oct 4, 2024)
b2cc07b  Addressing comments (rishic3, Oct 4, 2024)
d9cc527  Update README (rishic3, Oct 4, 2024)
e399de8  Update Spark tensorrt compilation note, truncate keras outputs (rishic3, Oct 4, 2024)
1d21c16  Fix resource warnings, update arrow check (rishic3, Oct 5, 2024)
56de5f3  Update SetMaster, fix caching issue (rishic3, Oct 8, 2024)
546470c  Correctly set max_length for conditional generation (rishic3, Oct 8, 2024)
5c5ccef  Disable tokenizer parallelism (rishic3, Oct 8, 2024)
4d272eb  Update README to include DL inference (rishic3, Oct 8, 2024)
7d98ecf  Update README.md (rishic3, Oct 8, 2024)
87768d2  Enable tokenizer parallelism (rishic3, Oct 10, 2024)
0a6ce22  Update suffixes for CI/CD, add table to readme (rishic3, Oct 10, 2024)
3eb253a  Finish updating suffix (rishic3, Oct 10, 2024)
825d9fb  Update README.md (rishic3, Oct 10, 2024)
235b752  Update README.md (rishic3, Oct 10, 2024)
99894d5  Enable tokenizer parallelism (rishic3, Oct 11, 2024)
0723790  Merge branch 'branch-24.10' of https://github.com/NVIDIA/spark-rapids… (rishic3, Oct 11, 2024)
1517f49  Merge branch 'NVIDIA:branch-24.10' into branch-24.10 (rishic3, Oct 11, 2024)
a56b0f7  Merge branch 'branch-24.10' of http://github.com/rishic3/spark-rapids… (rishic3, Oct 11, 2024)
68a6032  Separate requirements files (rishic3, Oct 11, 2024)
f9f6a45  Fix typo (rishic3, Oct 11, 2024)
5dd7787  Reference requirements.txt (rishic3, Oct 11, 2024)
9 changes: 5 additions & 4 deletions README.md
@@ -5,10 +5,10 @@ RAPIDS Accelerator for Apache Spark accelerates Spark applications with no code
You can download the latest version of RAPIDS Accelerator [here](https://nvidia.github.io/spark-rapids/docs/download.html).
This repo contains examples and applications that showcases the performance and benefits of using
RAPIDS Accelerator in data processing and machine learning pipelines.
-There are broadly four categories of examples in this repo:
+There are broadly five categories of examples in this repo:
1. [SQL/Dataframe](./examples/SQL+DF-Examples)
2. [Spark XGBoost](./examples/XGBoost-Examples)
-3. [Deep Learning/Machine Learning](./examples/ML+DL-Examples)
+3. [Machine Learning/Deep Learning](./examples/ML+DL-Examples)
4. [RAPIDS UDF](./examples/UDF-Examples)
5. [Databricks Tools demo notebooks](./tools/databricks)

@@ -23,7 +23,8 @@ Here is the list of notebooks in this repo:
| 3 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
| 4 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
| 5 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-| 6 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset
+| 6 | ML/DL | PCA | [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) based PCA example to train and transform with a synthetic dataset
+| 7 | ML/DL | DL Inference | 11 notebooks demonstrating distributed model inference on Spark using the `predict_batch_udf` across various frameworks: PyTorch, HuggingFace, and TensorFlow

Here is the list of Apache Spark applications (Scala and PySpark) that
can be built for running on GPU with RAPIDS Accelerator in this repo:
@@ -33,7 +34,7 @@ can be built for running on GPU with RAPIDS Accelerator in this repo:
| 1 | XGBoost | Agaricus (Scala) | Uses XGBoost classifier function to create model that can accurately differentiate between edible and poisonous mushrooms with the [agaricus dataset](https://archive.ics.uci.edu/ml/datasets/mushroom)
| 2 | XGBoost | Mortgage (Scala) | End-to-end ETL + XGBoost example to predict mortgage default with [Fannie Mae Single-Family Loan Performance Data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data)
| 3 | XGBoost | Taxi (Scala) | End-to-end ETL + XGBoost example to predict taxi trip fare amount with [NYC taxi trips data set](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-| 4 | ML/DL | PCA End-to-End | Spark MLlib based PCA example to train and transform with a synthetic dataset
+| 4 | ML/DL | PCA | [Spark-Rapids-ML](https://github.com/NVIDIA/spark-rapids-ml) based PCA example to train and transform with a synthetic dataset
| 5 | UDF | URL Decode | Decodes URL-encoded strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
| 6 | UDF | URL Encode | URL-encodes strings using the [Java APIs of RAPIDS cudf](https://docs.rapids.ai/api/cudf-java/legacy/)
| 7 | UDF | [CosineSimilarity](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/java/com/nvidia/spark/rapids/udf/java/CosineSimilarity.java) | Computes the cosine similarity between two float vectors using [native code](./examples/UDF-Examples/RAPIDS-accelerated-UDFs/src/main/cpp/src)
44 changes: 40 additions & 4 deletions examples/ML+DL-Examples/Spark-DL/dl_inference/README.md
@@ -30,17 +30,53 @@ predictions = df.withColumn("preds", mnist("data")).collect()

In this simple case, the `predict_batch_fn` will use TensorFlow APIs to load the model and return a simple `predict` function which operates on numpy arrays. The `predict_batch_udf` will automatically convert the Spark DataFrame columns to the expected numpy inputs.
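
The contract described above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' code: `make_predict_batch_fn`, `load_keras_mnist`, and `build_mnist_udf` are hypothetical helper names, the model path is a placeholder, and the `return_type`, `batch_size`, and `input_tensor_shapes` values are assumptions for a flattened 28x28 MNIST input. Heavy imports are deferred so that only the executors need TensorFlow installed.

```python
def make_predict_batch_fn(load_model):
    """Wrap a model loader in the zero-argument function predict_batch_udf expects.

    `load_model` is any callable returning an object with a `predict` method
    (hypothetical helper; the notebooks define predict_batch_fn directly).
    """
    def predict_batch_fn():
        # Called on the executor; the model is loaded once and then reused
        # for every batch handled by the returned function.
        model = load_model()

        def predict(inputs):
            # `inputs` is one batch (a numpy array when called by Spark).
            return model.predict(inputs)

        return predict

    return predict_batch_fn


def load_keras_mnist():
    # Lazy import: the driver does not need TensorFlow to define the UDF.
    import tensorflow as tf
    return tf.keras.models.load_model("/path/to/mnist_model")  # placeholder path


def build_mnist_udf():
    # Requires PySpark 3.4 or newer, where predict_batch_udf was introduced.
    from pyspark.ml.functions import predict_batch_udf
    from pyspark.sql.types import ArrayType, FloatType

    return predict_batch_udf(
        make_predict_batch_fn(load_keras_mnist),
        return_type=ArrayType(FloatType()),
        batch_size=1024,               # assumed value
        input_tensor_shapes=[[784]],   # assumed: flattened 28x28 images
    )
```

With a Spark session available, `df.withColumn("preds", build_mnist_udf()("data"))` would apply the model to the `data` column. Separating the loader from the wrapper also makes it possible to exercise the wrapper with a stub model before involving Spark at all.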

-All notebooks have been saved with sample outputs for quick browsing.
+All notebooks have been saved with sample outputs for quick browsing.
Here is a full list of the notebooks with their published example links:

| | Category | Notebook Name | Description | Link
| ------------- | ------------- | ------------- | ------------- | -------------
| 1 | PyTorch | Image Classification | Training a model to predict clothing categories in FashionMNIST, including accelerated inference with Torch-TensorRT. | [Link](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)
| 2 | PyTorch | Regression | Training a model to predict housing prices in the California Housing Dataset, including accelerated inference with Torch-TensorRT. | [Link](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-create-a-neural-network-for-regression-with-pytorch.md)
| 3 | Tensorflow | Image Classification | Training a model to predict hand-written digits in MNIST. | [Link](https://www.tensorflow.org/tutorials/keras/save_and_load)
| 4 | Tensorflow | Feature Columns | Training a model with preprocessing layers to predict likelihood of pet adoption in the PetFinder mini dataset. | [Link](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers)
| 5 | Tensorflow | Keras Metadata | Training ResNet-50 to perform flower recognition on Databricks. | [Link](https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/keras-metadata.html)
| 6 | Tensorflow | Text Classification | Training a model to perform sentiment analysis on the IMDB dataset. | [Link](https://www.tensorflow.org/tutorials/keras/text_classification)
| 7+8 | HuggingFace | Conditional Generation | Sentence translation using the T5 text-to-text transformer, with notebooks demoing both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/model_doc/t5#t5)
| 9+10 | HuggingFace | Pipelines | Sentiment analysis using Huggingface pipelines, with notebooks demoing both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/quicktour#pipeline-usage)
| 11 | HuggingFace | Sentence Transformers | Sentence embeddings using the SentenceTransformers framework in Torch. | [Link](https://huggingface.co/sentence-transformers)

## Running the Notebooks

If you want to run the notebooks yourself, please follow these instructions.

-**Note**: for demonstration purposes, these examples just use a local Spark Standalone cluster with a single executor, but you should be able to run them on any distributed Spark cluster.
+**Notes**:
+- The notebooks require a GPU environment for the executors.
+- Please create separate environments for PyTorch and Tensorflow examples as specified below. This will avoid conflicts between the CUDA libraries bundled with their respective versions. The Huggingface examples will have a `_torch` or `_tf` suffix to specify the environment used.
+- The PyTorch notebooks include model compilation and accelerated inference with TensorRT. While not included in the notebooks, Tensorflow also supports [integration with TensorRT](https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html), but may require downgrading the TF version.
+- For demonstration purposes, these examples just use a local Spark Standalone cluster with a single executor, but you should be able to run them on any distributed Spark cluster.
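
As a small illustration of the suffix convention in the notes above, the sketch below maps a notebook filename to the conda environment it is meant to run in. The helper `conda_env_for` and the example filenames are hypothetical; only the `_torch`/`_tf` suffixes and the environment names `spark-dl-torch` and `spark-dl-tf` come from this README.

```python
from pathlib import Path

# Suffix -> conda environment, following the naming used in this README.
ENV_BY_SUFFIX = {
    "_torch": "spark-dl-torch",
    "_tf": "spark-dl-tf",
}


def conda_env_for(notebook):
    """Return the environment implied by a notebook's suffix, or None."""
    stem = Path(notebook).stem  # filename without directory or extension
    for suffix, env in ENV_BY_SUFFIX.items():
        if stem.endswith(suffix):
            return env
    return None  # no recognized suffix: the environment must be chosen manually
```

A CI script could use such a mapping to decide which environment to activate before running each notebook with `nbconvert`.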

#### Create environment

**For PyTorch:**
```
conda create -n spark-dl-torch python=3.11
conda activate spark-dl-torch
pip install -r requirements.txt
pip install torch torchvision torch-tensorrt tensorrt --extra-index-url https://download.pytorch.org/whl/cu121
pip install sentence_transformers sentencepiece
pip install "nvidia-modelopt[all]" --extra-index-url https://pypi.nvidia.com
```
-# install dependencies for example notebooks
**For TensorFlow:**
```
conda create -n spark-dl-tf python=3.11
conda activate spark-dl-tf
pip install -r requirements.txt
pip install tensorflow[and-cuda] tf-keras
```
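
After creating an environment, it can be useful to confirm that the installed framework actually sees a GPU before launching Spark. The following is a rough sketch, not part of the repository: `gpu_report` is a hypothetical helper that checks whichever of PyTorch and TensorFlow can be imported in the active environment.

```python
def gpu_report():
    """Report GPU visibility for each framework importable in this environment."""
    report = {}
    try:
        import torch
        report["torch"] = bool(torch.cuda.is_available())
    except ImportError:
        pass  # PyTorch is not installed in this environment
    try:
        import tensorflow as tf
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU")) > 0
    except ImportError:
        pass  # TensorFlow is not installed in this environment
    return report


if __name__ == "__main__":
    print(gpu_report())
```

In the `spark-dl-torch` environment one would expect only a `torch` entry, and in `spark-dl-tf` only a `tensorflow` entry; a value of `False` suggests a CUDA library or driver mismatch worth resolving before running the notebooks.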

#### Launch Jupyter + Spark

```
# setup environment variables
export SPARK_HOME=/path/to/spark
export MASTER=spark://$(hostname):7077
@@ -70,4 +106,4 @@ The example notebooks also demonstrate integration with [Triton Inference Server

**Note**: Some examples may require special configuration of server as highlighted in the notebooks.

-**Note**: for demonstration purposes, the Triton Inference Server integrations just launch the server in a docker container on the local host, so you will need to [install docker](https://docs.docker.com/engine/install/) on your local host. Most real-world deployments will likely be hosted on remote machines.
+**Note**: for demonstration purposes, the Triton Inference Server integrations just launch the server in a docker container on the local host, so you will need to [install docker](https://docs.docker.com/engine/install/) on your local host. Most real-world deployments will likely be hosted on remote machines.