[GPU] Driver installation not working and Dataproc 2.2 cluster creation is failing #1239
Comments
This is due to https://github.com/cjac/initialization-actions/blob/rapids-20240806/gpu/install_gpu_driver.sh#L1077. Santosh, did you say you've tried this workaround and that it's unblocked you?
Please review and test #1240.
@cjac Yes, I tried the workaround script you mentioned, but it still breaks with a similar error on Dataproc 2.2:
-----END PGP PUBLIC KEY BLOCK-----'
sed -i -e 's:deb https:deb [signed-by=/usr/share/keyrings/mysql.gpg] https:g' /etc/apt/sources.list.d/mysql.list
@cjac I have disabled secure boot in Dataproc. Is that okay, or should we enable it for this workaround?
To use secure boot, you'll need to build a custom image. Instructions are here: https://github.com/GoogleCloudDataproc/custom-images/tree/master/examples/secure-boot. You do not need secure boot enabled for the workaround to function. I think you may just be missing an apt-get update after the sources.list files are cleaned up and the trust keys are written to /usr/share/keyrings.
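The cleanup step mentioned above could look roughly like the sketch below. The sed expression and the mysql.gpg keyring path are taken from the log excerpt quoted earlier in this thread; the `add_signed_by` helper name, the sample repository line, and the scratch-file demo are hypothetical and only there so the snippet can be run safely outside a cluster node.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: pin every "deb https" entry in a sources list to a keyring.
#   $1 = path to the .list file, $2 = path to the keyring
add_signed_by() {
  local list_file="$1" keyring="$2"
  # Anchoring on ^ means entries that already carry "[signed-by=...]" are left alone.
  sed -i -e "s:^deb https:deb [signed-by=${keyring}] https:" "${list_file}"
}

# Demo against a scratch file so nothing under /etc is touched.
# The repository line is an illustrative example, not copied from the cluster.
demo_list="$(mktemp)"
echo 'deb https://repo.mysql.com/apt/debian/ bookworm mysql-8.0' > "${demo_list}"
add_signed_by "${demo_list}" /usr/share/keyrings/mysql.gpg
cat "${demo_list}"

# On the cluster node itself (as root), refresh the package index afterwards:
#   add_signed_by /etc/apt/sources.list.d/mysql.list /usr/share/keyrings/mysql.gpg
#   apt-get update
```

Running the helper twice on the same file leaves it unchanged, which matters if the init action is retried.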
@cjac I tried that, but it still breaks with the same error.
I forgot that I'm pinned to 2.2.20-debian12. I'll try to make it work with the latest from the 2.2 line.
Okay, thank you. I am getting the error on 2.2.32-debian12.
this might do it:
Yes, that last iteration does seem to get the installer working for me on 2.2 latest.
@cjac Thank you. I tried the above changes, but cluster creation still failed. The previous package installation error is gone and the init script logs look good. The last few lines of the install_gpu_driver.sh output are below:
update-alternatives: using /usr/lib/mesa-diverted to provide /usr/lib/glx (glx) in auto mode
I am seeing the following error in the Dataproc logs:
DEFAULT 2024-09-27T02:58:49.624652770Z Setting up xserver-xorg-video-nvidia (560.35.03-1) ...
I think this error caused the cluster creation failure.
@cjac We have been unable to create a Dataproc GPU cluster since the Dataproc 2.1/2.2 upgrade. Please let me know if there is any workaround that would let us proceed with cluster creation.
I did publish another version since last we spoke. Can you please review the code at https://github.com/GoogleCloudDataproc/initialization-actions/pull/1240/files? The tests passed on the last commit but took 2 hours and 1 minute to complete. This latest update should reduce the runtime significantly.
I received those messages as well, but they should just be warnings. Does the new change get things working?
@cjac I tried the latest script, but the Dataproc initialization action fails with a timeout error and the cluster does not start:
name: "gs://syn-development-kub/syn-cluster-config/install_gpu_driver.sh"
I couldn't find any error details in the init script output. I am attaching the init script output for your reference.
Can you increase your timeout by 5-10 minutes? I do have a fix that's in the works for the base image, and once it gets published, we should be able to skip the full upgrade in the init action.
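For reference, the init action timeout is set at cluster creation time with the `--initialization-action-timeout` flag of `gcloud dataproc clusters create`. A minimal sketch follows. The cluster name, region, bucket path, accelerator type, and the 30m value are all placeholders, not values taken from this thread. The script only prints the command by default and runs it when RUN=1 is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

# All values below are placeholders; substitute your own.
CLUSTER_NAME="${CLUSTER_NAME:-gpu-test-cluster}"
REGION="${REGION:-us-central1}"
INIT_ACTION="${INIT_ACTION:-gs://YOUR_BUCKET/install_gpu_driver.sh}"
INIT_TIMEOUT="${INIT_TIMEOUT:-30m}"

# Build the command as an array so arguments survive quoting intact.
cmd=(
  gcloud dataproc clusters create "${CLUSTER_NAME}"
  --region="${REGION}"
  --image-version=2.2-debian12
  --worker-accelerator=type=nvidia-tesla-t4,count=1
  --initialization-actions="${INIT_ACTION}"
  --initialization-action-timeout="${INIT_TIMEOUT}"
)

# Dry run by default: print the command; set RUN=1 to actually execute it.
if [[ "${RUN:-0}" == "1" ]]; then
  "${cmd[@]}"
else
  printf '%s ' "${cmd[@]}"; echo
fi
```

If the flag is omitted, Dataproc applies its documented default of 10 minutes per init action, so a driver install that runs longer than that would need the value raised.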
Here is a recent cluster build I did in my repro lab. It took 14m47.946s:
I see that I hard-coded a regional bucket path into the code. This will slow things down when running outside of us-west4; I'll fix that next.
@cjac Increasing the timeout fixed the error and the cluster was created. We are able to run GPU workloads on the cluster. Thank you so much for the support!
Glad I could help! I'll work on getting these changes integrated into the base image.
Hi,
I am trying to attach GPUs to a Dataproc 2.2 cluster, but the init action breaks and cluster creation fails. Secure boot is disabled and I am using the latest install_gpu_driver.sh from this repository. I am now getting the following error during cluster initialization:
++ tr '[:upper:]' '[:lower:]'
++ lsb_release -is
++ . /etc/os-release
+++ PRETTY_NAME='Debian GNU/Linux 12 (bookworm)'
+++ NAME='Debian GNU/Linux'
+++ VERSION_ID=12
+++ VERSION='12 (bookworm)'
+++ VERSION_CODENAME=bookworm
+++ ID=debian
+++ HOME_URL=https://www.debian.org/
+++ SUPPORT_URL=https://www.debian.org/support
+++ BUG_REPORT_URL=https://bugs.debian.org/
++ echo debian12
++ get_metadata_attribute dataproc-role
++ local -r attribute_name=dataproc-role
++ local -r default_value=
++ /usr/share/google/get_metadata_value attributes/dataproc-role
++ get_metadata_attribute rapids-runtime SPARK
++ local -r attribute_name=rapids-runtime
++ local -r default_value=SPARK
++ /usr/share/google/get_metadata_value attributes/rapids-runtime
++ echo -n SPARK
++ get_metadata_attribute cuda-version 12.4
++ local -r attribute_name=cuda-version
++ local -r default_value=12.4
++ /usr/share/google/get_metadata_value attributes/cuda-version
++ echo -n 12.4
++ get_metadata_attribute gpu-driver-version 550.54.14
++ local -r attribute_name=gpu-driver-version
++ local -r default_value=550.54.14
++ /usr/share/google/get_metadata_value attributes/gpu-driver-version
++ echo -n 550.54.14
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ os_version
++ xargs
++ cut -d= -f2
++ grep '^VERSION_ID=' /etc/os-release
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ os_id
++ xargs
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ get_metadata_attribute cudnn-version 9.1.0.70
++ local -r attribute_name=cudnn-version
++ local -r default_value=9.1.0.70
++ /usr/share/google/get_metadata_value attributes/cudnn-version
++ echo -n 9.1.0.70
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ cut -d= -f2
++ xargs
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_version
++ grep '^VERSION_ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ xargs
++ os_id
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ xargs
++ os_version
++ cut -d= -f2
++ grep '^VERSION_ID=' /etc/os-release
++ xargs
++ os_id
++ cut -d= -f2
++ grep '^ID=' /etc/os-release
++ xargs
++ os_version
++ grep '^VERSION_ID=' /etc/os-release
++ cut -d= -f2
++ xargs
++ get_metadata_attribute nccl-version 2.21.5
++ local -r attribute_name=nccl-version
++ local -r default_value=2.21.5
++ /usr/share/google/get_metadata_value attributes/nccl-version
++ echo -n 2.21.5
++ get_metadata_attribute gpu-driver-url https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
++ local -r attribute_name=gpu-driver-url
++ local -r default_value=https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
++ /usr/share/google/get_metadata_value attributes/gpu-driver-url
++ echo -n https://download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_vercat
++ is_ubuntu
+++ os_id
+++ grep '^ID=' /etc/os-release
+++ xargs
+++ cut -d= -f2
++ [[ debian == \u\b\u\n\t\u ]]
++ is_rocky
+++ os_id
+++ xargs
+++ cut -d= -f2
+++ grep '^ID=' /etc/os-release
++ [[ debian == \r\o\c\k\y ]]
++ os_version
++ xargs
++ cut -d= -f2
++ grep '^VERSION_ID=' /etc/os-release
++ get_metadata_attribute nccl-repo-url https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb
++ local -r attribute_name=nccl-repo-url
++ local -r default_value=https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb
++ /usr/share/google/get_metadata_value attributes/nccl-repo-url
++ echo -n https://developer.download.nvidia.com/compute/machine-learning/repos/debian12/x86_64/nvidia-machine-learning-repo-debian12_1.0.0-1_amd64.deb
++ get_metadata_attribute cuda-url https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
++ local -r attribute_name=cuda-url
++ local -r default_value=https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
++ /usr/share/google/get_metadata_value attributes/cuda-url
++ echo -n https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
++ echo -e '8.3.1.22\n9.1.0.70'
++ head -n1
++ sort -V
++ echo -e '9.1.0.70\n8.4.1.50'
++ head -n1
++ sort -V
++ echo -e '12.0\n12.4'
++ head -n1
++ sort -V
++ get_metadata_attribute gpu-driver-provider NVIDIA
++ local -r attribute_name=gpu-driver-provider
++ local -r default_value=NVIDIA
++ /usr/share/google/get_metadata_value attributes/gpu-driver-provider
++ echo -n NVIDIA
++ get_metadata_attribute install-gpu-agent false
++ local -r attribute_name=install-gpu-agent
++ local -r default_value=false
++ /usr/share/google/get_metadata_value attributes/install-gpu-agent
++ echo -n false
++ mktemp -u -d -p /run/tmp -t ca_dir-XXXX
++ get_metadata_attribute private_secret_name
++ local -r attribute_name=private_secret_name
++ local -r default_value=
++ /usr/share/google/get_metadata_value attributes/private_secret_name
++ echo -n ''
++ uname -r
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_version
++ grep '^VERSION_ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ os_id
++ grep '^ID=' /etc/os-release
++ xargs
++ cut -d= -f2
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
++ apt-get install -y -qq pciutils linux-headers-6.1.0-25-cloud-amd64
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
Please let me know if I am missing anything, or if there is any workaround that would let me proceed.
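A few read-only checks may help narrow down the `pkgProblemResolver::Resolve generated breaks` error shown above. This is a diagnostic sketch rather than a fix. The `headers_pkg` helper name is hypothetical, and it assumes the init action derives the headers package name from `uname -r`, as the trace suggests.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: name of the headers package for a given kernel release.
headers_pkg() {
  echo "linux-headers-${1}"
}

pkg="$(headers_pkg "$(uname -r)")"
echo "headers package for running kernel: ${pkg}"

# The checks below only read state; they are skipped on hosts without apt.
if command -v apt-mark >/dev/null 2>&1; then
  echo "held packages:"
  apt-mark showhold || true
  # Show which version (if any) apt would pick for the headers package.
  apt-cache policy "${pkg}" || true
  # Simulate the install to see what the resolver objects to, without changing anything.
  apt-get install --simulate pciutils "${pkg}" || true
fi
```

If `apt-mark showhold` lists nothing, the conflict is more likely a stale package index or a headers version that no longer matches the running kernel than an explicit hold.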