
show expected and problematic output produced by deviceQuery in GPU docs #139

Open · wants to merge 1 commit into main
docs/gpu.md: 28 changes (25 additions & 3 deletions)
@@ -152,10 +152,32 @@ The only scenario where this would be required is if `$LD_LIBRARY_PATH` is modif

### Testing the GPU support {: #gpu_cuda_testing }
@casparvl (Collaborator) commented on Dec 22, 2023:

Currently, this only covers testing whether you can run CUDA-enabled software from EESSI. Maybe we can also include a small instruction for testing whether building new CUDA software on top of EESSI works properly. Something like this:
First, create a file `hello_cuda.cu` with the following contents:

#include <stdio.h>

__global__ void helloCUDA()
{
    printf("Hello, CUDA!\n");
}

int main()
{
    helloCUDA<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Then

module load CUDA/<some_version>
nvcc hello_cuda.cu -o hello_cuda
chmod u+x hello_cuda
./hello_cuda 
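A successful run should simply print `Hello, CUDA!`. Since a failed kernel launch in the snippet above produces no output at all, an optional variation with basic error checking could look like the sketch below (an illustration only, not part of the original suggestion):

```cpp
// hello_cuda_check.cu -- hypothetical variation of the test program above,
// extended so that launch/runtime failures are reported instead of being silent.
#include <stdio.h>

__global__ void helloCUDA()
{
    printf("Hello, CUDA!\n");
}

int main()
{
    helloCUDA<<<1, 1>>>();

    // A kernel launch can fail without any output, e.g. when the driver is
    // missing or the GPU architecture is unsupported; check for that explicitly.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    return 0;
}
```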

Collaborator:

And mention they should test this for each version of CUDA they installed in `host_injections`.
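As a rough illustration of that per-version check (a sketch only; it assumes an Lmod-based environment where `module -t avail` prints a terse module list, and it reuses the hypothetical `hello_cuda.cu` from above):

```bash
# Sketch: rebuild and run hello_cuda with every CUDA module that is visible.
# Assumes Lmod is initialised in this shell; `module -t avail` writes its
# terse listing (one module per line) to stderr, hence the 2>&1 redirect.
for cuda_mod in $(module -t avail CUDA 2>&1 | grep '^CUDA/'); do
    echo ">>> Testing build with ${cuda_mod}"
    module load "${cuda_mod}"
    nvcc hello_cuda.cu -o hello_cuda && ./hello_cuda
    module unload "${cuda_mod}"
done
```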

Contributor (PR author):

Makes sense, but that should be done in a separate PR?

Collaborator:

If you want, sure. I won't block this one over it :) Although I would consider it to be an integral part of "Testing the GPU support" to be honest :)

@ocaisa (Member) commented on Aug 9, 2024:

I don't see it as so integral if we are focused on software consumers; it's only integral if you want to do development-type work.


The quickest way to test if software installations included in EESSI can access and use your GPU is to run the
`deviceQuery` executable that is part of the `CUDA-Samples` module:
```{ .bash .copy }
module load CUDA-Samples
deviceQuery
```
If both are successful, you should see information about your GPU printed to your terminal, for example:

```
$ deviceQuery
deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA A2"
CUDA Driver Version / Runtime Version 12.2 / 12.1
CUDA Capability Major/Minor version number: 8.6
...
```

If the `deviceQuery` command can not access your GPU, you will see an error message like:
Member:

This shouldn't actually happen though: because of the Lmod guards, the only scenario I can see where you would reach this is when you are using a container and the system drivers are too old.

Contributor (PR author):

I triggered it by cleaning out the host_injections directory after loading the module.

I agree it's very unlikely that it happens, but we should mention it in the docs regardless, if only to let people easily find this page when searching for error messages.

Member:

My concern here is that the placement makes it seem like failure is likely, whereas reaching this message is actually very unlikely.

Member:

Maybe a little box saying "What does it look like if the command fails?"

```
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
```
```
Comment on lines +177 to +183
Member:

Suggested change
If the `deviceQuery` command can not access your GPU, you will see an error message like:
```
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
```
```
!!! note "What if the `deviceQuery` command fails?"
    If the `deviceQuery` command cannot access your GPU, you will see an error message like:
    ```
    cudaGetDeviceCount returned 35
    -> CUDA driver version is insufficient for CUDA runtime version
    Result = FAIL
    ```