
Fix Spark-DL notebooks for CI/CD and update to latest dependencies #439

Merged: 37 commits into NVIDIA:branch-24.10 on Oct 15, 2024

Conversation

@rishic3 (Contributor) commented on Sep 30, 2024:

Changes:

Version updates:

  • Updated all examples to run with the latest Triton (24.08), TensorFlow (2.17.0), and Torch (2.4.1), and to match the updated APIs.
  • Notable changes:
    • Using the new .keras model format in TensorFlow wherever applicable.
    • Added Torch-TensorRT compilation/inference, both locally and on Spark, to the PyTorch notebooks (see the sketch after this list).
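A minimal sketch of the kind of Torch-TensorRT compile-then-infer flow described above; the model, input shape, and precision are illustrative assumptions, not taken from this PR:

    import torch
    import torchvision
    import torch_tensorrt

    # Any GPU-resident model works; resnet50 is just a stand-in example.
    model = torchvision.models.resnet50(weights="DEFAULT").eval().cuda()

    # Compile to a TensorRT-backed module for a fixed input shape.
    trt_model = torch_tensorrt.compile(
        model,
        inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)],
        enabled_precisions={torch.float32},
    )

    x = torch.randn(1, 3, 224, 224, device="cuda")
    with torch.no_grad():
        out = trt_model(x)  # inference now runs through TensorRT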

Environment separation:

  • Updated the README with instructions for creating separate environments for Torch/TensorFlow to avoid CUDA conflicts.
  • Separated the Huggingface examples into _torch/_tf versions for environment separation and to demonstrate model interoperability.

CI/CD:

  • Included Spark session initialization to work with the CI/CD pipeline.
  • Verified that all notebooks run error-free with jupyter nbconvert (a sketch of this check follows).
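A sketch of that verification step using the nbconvert Python API; the notebook path is hypothetical, and the CLI equivalent is jupyter nbconvert --to notebook --execute <notebook>:

    import nbformat
    from nbconvert.preprocessors import ExecutePreprocessor

    # Execute every cell top to bottom; raises CellExecutionError if any cell fails.
    nb = nbformat.read("example_notebook.ipynb", as_version=4)  # hypothetical path
    ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
    ep.preprocess(nb, {"metadata": {"path": "."}})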

@rishic3 marked this pull request as ready for review on September 30, 2024 at 15:59.
@rishic3 (Contributor, Author) commented on Sep 30, 2024:

@eordentlich @leewyang requesting a review - thanks!

@eordentlich (Collaborator) left a comment:

Really nice. Here is a preliminary set of comments. Will do another round after revisions.

@leewyang (Collaborator) left a comment:

Minor nits, otherwise LGTM. Good work!

"2024-10-03 00:58:33.914094: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2024-10-03 00:58:33.919757: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
"To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
"2024-10-03 00:58:34.259847: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n"
A Collaborator commented on these lines:

Try to resolve (E)rrors and (W)arnings (here and in other notebooks), if possible. Otherwise, maybe add a quick comment why this is expected/OK.
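One common way to quiet this kind of benign TensorFlow startup noise, assuming the messages are indeed expected (e.g., TensorRT is intentionally absent from the TF environment under the PR's environment separation): raise TensorFlow's C++ log threshold before the first import. This is a general TensorFlow knob, not something prescribed by this PR:

    import os

    # Must be set before tensorflow is imported.
    # "1" hides INFO, "2" hides INFO and WARNING; error-level messages still print.
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

    import tensorflow as tf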

@eordentlich (Collaborator) left a comment:

Looks like a ton of work. Nice!
A few remaining minor comments/questions/suggestions.

"outputs": [],
"source": [
"spark.conf.set(\"spark.sql.execution.arrow.maxRecordsPerBatch\", \"512\")\n",
"# This line will fail if the vectorized reader runs out of memory\n",
@eordentlich (Collaborator) commented:
What is the 'vectorized reader' and in general what is this line for? @leewyang

"spark.conf.set(\"spark.sql.execution.arrow.maxRecordsPerBatch\", \"512\")\n",
"# This line will fail if the vectorized reader runs out of memory\n",
"if int(spark.conf.get(\"spark.sql.execution.arrow.maxRecordsPerBatch\")) < 512:\n",
" print(\"Increasing `spark.sql.execution.arrow.maxRecordsPerBatch` to ensure the vectorized reader won't run out of memory\")\n",
A Collaborator commented:
Default is 10000 so this is decreasing.

@rishic3 (Contributor, Author) replied:
Right, decreasing. I guess this is meant to ensure the batches fit in the internal Arrow buffer, since each record is fairly large?
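For illustration, a small sketch (assuming a live SparkSession named spark) showing that maxRecordsPerBatch caps the number of rows per Arrow batch handed to a pandas UDF, which is what bounds the per-batch buffer size:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "512")

    @pandas_udf("long")
    def batch_size(s: pd.Series) -> pd.Series:
        # Each invocation receives one Arrow batch; len(s) is at most 512 here.
        return pd.Series([len(s)] * len(s))

    # Batch sizes also depend on partition boundaries, so expect values <= 512.
    spark.range(2000).select(batch_size("id")).distinct().show()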

@eordentlich (Collaborator) left a comment:

👍

@eordentlich merged commit fc23a57 into NVIDIA:branch-24.10 on Oct 15, 2024 (1 of 2 checks passed).