
Unable to get GPU inference working #31

Open · yx3zhang opened this issue Nov 21, 2024 · 7 comments

Comments

@yx3zhang
Hi, I am using the HF hub setup, trying to run the following:

import torch

gpus = list(range(torch.cuda.device_count()))
test(model, mode="test", dataset=dataset, eval_metrics=["mrr", "hits@10"], gpus=gpus)

The model does not run on the GPU for some reason.
nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                     Off| 00000000:00:16.0 Off |                    0 |
|  0%   29C    P0               60W / 300W|    678MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                     Off| 00000000:00:17.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                     Off| 00000000:00:18.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                     Off| 00000000:00:19.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A10G                     Off| 00000000:00:1A.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A10G                     Off| 00000000:00:1B.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A10G                     Off| 00000000:00:1C.0 Off |                    0 |
|  0%   22C    P8               16W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A10G                     Off| 00000000:00:1D.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

@migalkin
Collaborator

Hi, how do you run the code? The README file contains instructions on how to run in the multi-GPU setup; you need to use torch.distributed.launch, for example:

python -m torch.distributed.launch --nproc_per_node=4 script/run.py -c config/transductive/inference.yaml --gpus [0,1,2,3]
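
Note that torch.distributed.launch is deprecated in recent PyTorch releases; assuming the script takes the same arguments, the equivalent torchrun invocation should look like this (same script and flags, only the launcher changes):

torchrun --nproc_per_node=4 script/run.py -c config/transductive/inference.yaml --gpus [0,1,2,3]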

@yx3zhang
Author

Is there a way to make the GPU work using the Hugging Face setup from https://huggingface.co/mgalkin/ultra_50g?

from transformers import AutoModel
from ultra.datasets import CoDExSmall
from ultra.eval import test
model = AutoModel.from_pretrained("mgalkin/ultra_50g", trust_remote_code=True)
dataset = CoDExSmall(root="./datasets/")
test(model, mode="test", dataset=dataset, gpus=None)

Expected results for ULTRA 50g:

mrr: 0.498
hits@10: 0.685

@migalkin
Collaborator

Yes, please read the instructions in the README file.

  • For a single GPU, specifying gpus=[0] should suffice
  • For a multi-GPU setup, you have to run the Python script with the eval code via torch.distributed.launch

I would recommend first making sure that the single-GPU setup works and then moving to the multi-GPU setup; see the sketch below.
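
For reference, a minimal single-GPU sketch built from the Hugging Face snippet above; the exact signature of test (including eval_metrics) is assumed from the earlier messages in this thread:

from transformers import AutoModel
from ultra.datasets import CoDExSmall
from ultra.eval import test

model = AutoModel.from_pretrained("mgalkin/ultra_50g", trust_remote_code=True)
dataset = CoDExSmall(root="./datasets/")
# gpus=[0] should place evaluation on the first CUDA device
test(model, mode="test", dataset=dataset, eval_metrics=["mrr", "hits@10"], gpus=[0])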

@yx3zhang
Author

Thank you, let me give it a try and report my findings.

@yx3zhang
Author

It seems to copy the model and data, but no process is running. This is with just gpus=[0]:

nvidia-smi
Thu Nov 21 21:45:13 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                     Off| 00000000:00:16.0 Off |                    0 |
|  0%   29C    P0               60W / 300W|   1354MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                     Off| 00000000:00:17.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                     Off| 00000000:00:18.0 Off |                    0 |
|  0%   21C    P8               16W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                     Off| 00000000:00:19.0 Off |                    0 |
|  0%   21C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A10G                     Off| 00000000:00:1A.0 Off |                    0 |
|  0%   21C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A10G                     Off| 00000000:00:1B.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A10G                     Off| 00000000:00:1C.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A10G                     Off| 00000000:00:1D.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

@migalkin
Collaborator

Well, from your stats, GPU 0 seems to be busy and GPU RAM is allocated:

1354MiB / 23028MiB

The CoDEx-Small dataset is small and takes roughly this amount of memory; looks OK to me.

@yx3zhang
Author

It seems to take a similar time to the CPU run; something is definitely odd.
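
A quick way to confirm where the model actually sits (a rough sketch; this assumes the loaded model is a standard torch.nn.Module with parameters):

import torch

print(torch.cuda.is_available())               # should print True
print(next(model.parameters()).device)         # should print cuda:0, not cpu
print(torch.cuda.memory_allocated(0) / 2**20)  # MiB currently allocated on GPU 0

If the parameters report cpu, the evaluation may be falling back to the CPU even though some GPU memory was allocated.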
