
Unable to get GPU inference working #31

Open · yx3zhang opened this issue Nov 21, 2024 · 7 comments

Comments

@yx3zhang
Hi, I am using the HF hub setup, trying to run the following:

import torch

gpus = list(range(torch.cuda.device_count()))
test(model, mode="test", dataset=dataset, eval_metrics=["mrr", "hits@10"], gpus=gpus)

The model does not run on the GPU for some reason.
nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                     Off| 00000000:00:16.0 Off |                    0 |
|  0%   29C    P0               60W / 300W|    678MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                     Off| 00000000:00:17.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                     Off| 00000000:00:18.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                     Off| 00000000:00:19.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A10G                     Off| 00000000:00:1A.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A10G                     Off| 00000000:00:1B.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A10G                     Off| 00000000:00:1C.0 Off |                    0 |
|  0%   22C    P8               16W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A10G                     Off| 00000000:00:1D.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

@migalkin
Collaborator

Hi, how do you run the code? The README file contains instructions on how to run in the multi-GPU setup; you need to use torch.distributed.launch, for example:

python -m torch.distributed.launch --nproc_per_node=4 script/run.py -c config/transductive/inference.yaml --gpus [0,1,2,3]
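
Note that torch.distributed.launch is deprecated in recent PyTorch releases; assuming the script takes the same arguments, the equivalent torchrun invocation should look like this (same script and flags, only the launcher changes):

torchrun --nproc_per_node=4 script/run.py -c config/transductive/inference.yaml --gpus [0,1,2,3]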

@yx3zhang
Author

Is there a way to make the GPU work using the Hugging Face setup from https://huggingface.co/mgalkin/ultra_50g?

from transformers import AutoModel
from ultra.datasets import CoDExSmall
from ultra.eval import test
model = AutoModel.from_pretrained("mgalkin/ultra_50g", trust_remote_code=True)
dataset = CoDExSmall(root="./datasets/")
test(model, mode="test", dataset=dataset, gpus=None)

Expected results for ULTRA 50g:

mrr: 0.498
hits@10: 0.685

@migalkin
Collaborator

Yes, please read the instructions in the README file.

  • For a single GPU, specifying gpus=[0] should suffice
  • For a multi-GPU setup, you have to run the Python script with the eval code via torch.distributed.launch

I would recommend first making sure that the single-GPU setup works and then moving to the multi-GPU setup; see the sketch below.
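
For reference, a minimal single-GPU sketch built from the Hugging Face snippet above; the exact signature of test (including eval_metrics) is assumed from the earlier messages in this thread:

from transformers import AutoModel
from ultra.datasets import CoDExSmall
from ultra.eval import test

model = AutoModel.from_pretrained("mgalkin/ultra_50g", trust_remote_code=True)
dataset = CoDExSmall(root="./datasets/")
# gpus=[0] should place evaluation on the first CUDA device
test(model, mode="test", dataset=dataset, eval_metrics=["mrr", "hits@10"], gpus=[0])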

@yx3zhang
Author

Thank you, let me give it a try and report my findings.

@yx3zhang
Author

It seems to copy the model and data, but no process is running. This is with just gpus=[0]:

nvidia-smi
Thu Nov 21 21:45:13 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                     Off| 00000000:00:16.0 Off |                    0 |
|  0%   29C    P0               60W / 300W|   1354MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                     Off| 00000000:00:17.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                     Off| 00000000:00:18.0 Off |                    0 |
|  0%   21C    P8               16W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                     Off| 00000000:00:19.0 Off |                    0 |
|  0%   21C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A10G                     Off| 00000000:00:1A.0 Off |                    0 |
|  0%   21C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A10G                     Off| 00000000:00:1B.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A10G                     Off| 00000000:00:1C.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A10G                     Off| 00000000:00:1D.0 Off |                    0 |
|  0%   22C    P8               15W / 300W|      2MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

@migalkin
Collaborator

Well, from your stats, GPU 0 seems to be busy and GPU RAM is allocated:

1354MiB / 23028MiB

The CoDEx-Small dataset is small and takes roughly this amount of memory; looks OK to me.

@yx3zhang
Author

It seems to take a similar time to the CPU run; something is definitely odd.
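
A quick way to confirm where the model actually sits (a rough sketch; this assumes the loaded model is a standard torch.nn.Module with parameters):

import torch

print(torch.cuda.is_available())               # should print True
print(next(model.parameters()).device)         # should print cuda:0, not cpu
print(torch.cuda.memory_allocated(0) / 2**20)  # MiB currently allocated on GPU 0

If the parameters report cpu, the evaluation may be falling back to the CPU even though some GPU memory was allocated.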
