Unable to get GPU inference working #31
Hi, how do you run the code? The README file contains instructions on how to run in the multi-GPU setup; you need to use:

python -m torch.distributed.launch --nproc_per_node=4 script/run.py -c config/transductive/inference.yaml --gpus [0,1,2,3]
Is there a way to make the GPU work using the Hugging Face setup from https://huggingface.co/mgalkin/ultra_50g (from transformers import AutoModel)? Expected results for ULTRA 50g: mrr: 0.498, hits@10: 0.685.
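Roughly, the loading path I am referring to (a sketch of the model-card instructions; trust_remote_code=True is an assumption on my part, since the checkpoint ships custom modeling code):

import torch
from transformers import AutoModel

# Load the ULTRA 50g checkpoint from the HF Hub
model = AutoModel.from_pretrained("mgalkin/ultra_50g", trust_remote_code=True)
# Assumed manual device placement; unclear if the helper code does this itself
model = model.to("cuda:0" if torch.cuda.is_available() else "cpu")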
Yes, please read the instructions in the README file.
I would recommend first making sure that a single-GPU setup is working and then moving to a multi-GPU setup.
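A minimal way to check that PyTorch sees your GPUs at all, before touching the ULTRA code:

import torch

print(torch.cuda.is_available())      # should print True
print(torch.cuda.device_count())      # should match what nvidia-smi reports
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA A10G"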
Thank you, let me give it a try and report my findings.
It seems to copy the model and data, but no process is running; this is just with gpus=[0]. [truncated nvidia-smi output]
Well, from your stats it seems GPU 0 is busy and GPU RAM is allocated.
CoDEx-Small is a small dataset and takes roughly this amount of memory; looks OK to me.
It seems to take a similar amount of time as the CPU, though; something is definitely odd.
Hi, I am using the HF hub setup, trying to run the following:

import torch  # needed for torch.cuda.device_count()

# model and dataset come from the HF hub setup above
gpus = list(range(torch.cuda.device_count()))
test(model, mode="test", dataset=dataset, eval_metrics=["mrr", "hits@10"], gpus=gpus)

The model does not run on the GPU for some reason.
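One thing worth checking is whether the model was ever moved to the GPU; a minimal sketch, assuming the test() helper does not handle device placement itself (I have not verified its implementation):

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)                # explicit placement on the first GPU
print(next(model.parameters()).device)  # should print "cuda:0"

If test() only takes the gpus list and constructs batches internally, the input tensors would also need to end up on the same device inside that helper.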
nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G Off| 00000000:00:16.0 Off | 0 |
| 0% 29C P0 60W / 300W| 678MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G Off| 00000000:00:17.0 Off | 0 |
| 0% 22C P8 15W / 300W| 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G Off| 00000000:00:18.0 Off | 0 |
| 0% 22C P8 15W / 300W| 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G Off| 00000000:00:19.0 Off | 0 |
| 0% 22C P8 15W / 300W| 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A10G Off| 00000000:00:1A.0 Off | 0 |
| 0% 22C P8 15W / 300W| 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A10G Off| 00000000:00:1B.0 Off | 0 |
| 0% 22C P8 15W / 300W| 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A10G Off| 00000000:00:1C.0 Off | 0 |
| 0% 22C P8 16W / 300W| 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A10G Off| 00000000:00:1D.0 Off | 0 |
| 0% 22C P8 15W / 300W| 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+