
Multi GPU support for iw3 #59

Open
nagadomi opened this issue Oct 11, 2023 · 38 comments

Comments

@nagadomi
Owner

from #28 (comment)

I have 2 GPUs.
How can I use all the GPUs in iw3?

@nagadomi
Owner Author

pytorch/pytorch#8637
MiDaS has the problem linked above and cannot be used with nn.DataParallel.

@nagadomi
Owner Author

The above problem was fixed by nagadomi/MiDaS_iw3@22193f4,
but there still seems to be a register_forward_hook problem on multiple GPUs.

@nagadomi
Owner Author

register_forward_hook problem was fixed by nagadomi/MiDaS_iw3@0da1ad0 nagadomi/ZoeDepth_iw3@55bacaf

iw3 now works with multiple GPUs.

updating steps

for git,

# update source code
git pull
# update MiDaS and ZoeDepth
python -m iw3.download_models

for windows_package,
run update.bat.

examples

CLI

python -m iw3 -i ./tmp/test.mp4 -o ./tmp/ --gpu 0 1 --zoed-batch-size 8 

GUI

  1. Choose All CUDA Device in the device panel
  2. Increase Depth Batch Size (should be a multiple of the number of GPUs)

I have only tested the 2-GPU case on the Linux CLI.

@elecimage
Could you check to see it works?

@elecimage

elecimage commented Oct 14, 2023

> register_forward_hook problem was fixed … iw3 now works with multiple GPUs. … Could you check to see it works?

Oh, thank you. I'll test it soon.

@elecimage

> register_forward_hook problem was fixed … iw3 now works with multiple GPUs. … Could you check to see it works?

Yes, it works, but it is slower than using 1 GPU.

@nagadomi
Owner Author

nagadomi commented Oct 16, 2023

@elecimage
First, I do not have a Windows multi-GPU environment, so I may not be able to solve this problem.
On a Linux 2-GPU (Tesla T4 x2) environment, it is possible to achieve roughly 2x FPS.

Here are some possible causes and questions,

  1. How slow is it? I would like to know if it is a little slow or very slow.
  2. What GPUs are you using? If you are using two different GPU models, the slower one can bottleneck overall throughput.
  3. Have you increased Depth Batch Size? If the batch size is too small, it may be slower due to multi-GPU overhead.

@elecimage

> 3. Have you increased Depth Batch Size?

With 2 GPUs it is roughly 1.2x slower than with 1.

I'm using two 2080 Tis.

I've tried changing the Depth Batch Size several times, but it doesn't make much difference.

@nagadomi
Owner Author

OK, I will try to create a Windows VM in the cloud and check the behavior.

@nagadomi
Owner Author

nagadomi commented Oct 16, 2023

Maybe fixed by 2b7cbf9.
@elecimage
Would you please update and try again?
On the virtual machine I tried, it was about 1.5x faster with 2 GPUs.

@elecimage

> Maybe fixed by 2b7cbf9. @elecimage Would you please update and try again? On the virtual machine I tried, it was about 1.5x faster with 2 GPUs.

Oh, thank you. I'll test it soon.

@elecimage

> Maybe fixed by 2b7cbf9. @elecimage Would you please update and try again? On the virtual machine I tried, it was about 1.5x faster with 2 GPUs.

I'm still having problems.
It doesn't speed up, and even slows down.
When using 2 GPUs, each GPU only fills half its VRAM.
I've tested with 1 GPU: it fills all the VRAM and gets about 2 FPS.
With 2 GPUs, I also get about 2 FPS and only half the VRAM is used.
The speed is about the same or slightly slower when using 2 GPUs.

@nagadomi
Owner Author

nagadomi commented Oct 18, 2023

When using multiple GPUs, the batch is divided across the GPUs. So for the same batch size setting, each GPU's VRAM usage will be 1/(number of GPUs) of the single-GPU case.
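As an illustration of that arithmetic (a sketch of DataParallel-style input scattering, not iw3's actual code), this is how one batch gets split across devices:

```python
# Sketch: how a DataParallel-style wrapper splits one batch across devices,
# which is why per-GPU VRAM use drops to roughly 1/num_gpus.

def split_batch(batch_size: int, num_gpus: int) -> list[int]:
    """Divide a batch into near-equal per-GPU chunks;
    earlier devices receive the remainder."""
    base, rem = divmod(batch_size, num_gpus)
    return [base + (1 if i < rem else 0) for i in range(num_gpus)]

# With the GUI's "Depth Batch Size = 8" on 2 GPUs, each GPU sees 4 frames,
# so each uses about half the VRAM of the single-GPU case.
print(split_batch(8, 2))  # [4, 4]
print(split_batch(8, 1))  # [8]
```

This is also why a batch size that is not a multiple of the GPU count wastes capacity: `split_batch(7, 2)` gives `[4, 3]`, leaving one GPU underfilled.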

In my test above, I used the following settings. With a 720x720 video: 1 GPU = 2.5 FPS, 2 GPUs = 3.7 FPS.

Depth Model: ZoeD_N

Device: All CUDA Device
Depth Resolution: Default
Depth Batch Size: 8 or 16
Stereo Batch Size: 64
Low VRAM: False
TTA: False
FP16: True

The GPUs are Tesla T4 x2; the T4 is the same architecture generation as the RTX 2080 Ti and should perform slightly worse.
The OS is Windows Server 2022 with the latest NVIDIA driver installed.

For reference,
on Linux single RTX3070ti: 8 FPS.
on Linux 2x Tesla T4: 5 FPS.

@nagadomi
Owner Author

Recent Changes,

  • a little better FPS with minor improvements (Not related to Multi GPU)
  • Implemented GPU parallel mode that keeps replicas of models
  • Allow multiple iw3 GUI instances to be launched (each instance can run on a different GPU)

The issue of FPS not improving with multiple GPUs may be caused by the Windows NVIDIA driver mode (TCC/WDDM, which seems to differ between the Tesla and GeForce drivers), so it may not be fixable.

@ohjoij-ys

Multi GPU does not work on my PC.
System: Ubuntu 20.04
GPU: RTX 3080 x2
Driver: 560 (not open)

[screenshots: nvidia-smi with 1 GPU and with 2 GPUs]

@ohjoij-ys

Same result on Windows. I'm pretty sure it has identified all the cards.

@nagadomi
Owner Author

ZoeD_Any_N and ZoeD_Any_K do not support Multi GPU.
Which model did you try? Try ZoeD_N first.

@ohjoij-ys

Using ZoeD_N:
All CUDA: 3.8 FPS
Single GPU: 5.3 FPS

If I manually specify the number of worker threads, it results in an out-of-memory error.

Low VRAM mode, 16 threads:
All CUDA: 4.01 FPS
Single GPU: 4.01 FPS

[nvidia-smi screenshots]

@ohjoij-ys

Using Any_V2_N_S (doesn't seem to support multi GPU):
All CUDA: 4.85 FPS
Single GPU: 5.0 FPS

@ohjoij-ys

All CUDA vs single GPU
It seems none of them benefit. A driver/CUDA version problem?
Any_S: 4.91 vs 4.91
Any_B: 4.81 vs 4.81
Any_V2_N_B: 4.77 vs 4.85
ZoeD_K: 4.06 vs 4.2

@nagadomi
Owner Author

Multi-GPU DataParallel seems to be working (see the first nvidia-smi screenshot). It may just be slow.
Try increasing Depth Batch Size instead of Worker Threads; DataParallel distributes the batch across the GPUs.

Also, you can monitor GPU usage with the following commands:

watch -n 1 nvidia-smi

or

nvidia-smi -lms 500

@ohjoij-ys

It doesn't work.

[screenshot]

@nagadomi
Owner Author

Turn off Low VRAM and decrease Worker threads or set to 0.
Low VRAM limits batch size to 1.

@ohjoij-ys

The max batch size is 4 (it runs out of VRAM with anything higher).
Multi GPU: 3.84 FPS
Single GPU: 5.30 FPS
Does multi-GPU reduce performance?

@nagadomi
Owner Author

Try setting Stereo Processing Width to auto.
Also, if batch-size=4 works for single GPU, batch-size=8 should work for multi-GPU (batch-size=4 x 2GPU).

@ohjoij-ys

batch-size = 8:
Multi GPU: 4.2 FPS
Single GPU: 5.9 FPS

@nagadomi
Owner Author

Try closing the application once and then try again (to avoid out of memory).
Also, the Depth Anything model (Any_B) uses less VRAM, so you can try larger batch sizes.

Multi-GPU feature only supports depth estimation models, so if there are other bottlenecks, they will not be improved. Try low-resolution video as well.
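As a rough illustration of why this saturates (my own toy model, not measurements from iw3): if only the depth stage is parallelized, per-frame time is roughly the depth time divided by the GPU count plus the serial stages, i.e. Amdahl's law. The 120 ms / 80 ms figures below are made-up numbers.

```python
# Toy throughput model: only the depth-estimation stage scales with GPUs,
# while the stereo/encode stages stay serial, so the speedup saturates.

def overall_fps(depth_ms: float, other_ms: float, num_gpus: int) -> float:
    """Frames per second when only depth_ms is divided across num_gpus."""
    return 1000.0 / (depth_ms / num_gpus + other_ms)

# Example: depth takes 120 ms/frame, the remaining stages take 80 ms/frame.
print(round(overall_fps(120, 80, 1), 1))  # 5.0 FPS
print(round(overall_fps(120, 80, 2), 1))  # 7.1 FPS, not 10
```

This matches the pattern in the benchmarks above: the lighter the depth model relative to the rest of the pipeline, the smaller the multi-GPU gain.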

Also, when processing multiple videos, the following method is effective.

> Allow multiple iw3 GUI instances to be launched (each instance can run on a different GPU)

@ohjoij-ys

ohjoij-ys commented Aug 30, 2024

Any_B, low-resolution video
batch_size = 16:
Single GPU: 38 FPS
Multi GPU: 38 FPS

batch_size = 32:
Multi GPU: 38 FPS
Single GPU: 38 FPS
The CPU load pattern seems to be different.

[CPU/GPU load screenshots]

@nagadomi
Owner Author

I tried All CUDA in a Tesla T4 x 2 Linux environment.

% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [04:02<00:00,  7.36it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model ZoeD_N --zoed-batch-size 8 --yes 
1080p.mp4: 100%|█████████████████| 1786/1786 [05:42<00:00,  5.21it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model ZoeD_N --zoed-batch-size 4 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [05:47<00:00,  5.14it/s]

multi gpu fps: 7.36
single gpu fps: 5.14

With Depth Anything (Any_B), the difference is even smaller.

% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 1 --depth-model Any_B --zoed-batch-size 32 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:00<00:00, 14.83it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model Any_B --zoed-batch-size 32 --yes 
1080p.mp4: 100%|█████████████████| 1786/1786 [02:27<00:00, 12.14it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model Any_B --zoed-batch-size 16 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:26<00:00, 12.18it/s]

multi gpu fps: 14.83
single gpu fps: 12.18

I have an idea about another multi-GPU strategy.
I plan to test that. (GPU round robin on thread pool)

@ohjoij-ys

Maybe it’s because Nvidia has cut some features from gaming graphics cards compared to professional cards. Anyway, I’m looking forward to your new multi-GPU strategy.

@nagadomi
Owner Author

> I have an idea about another multi-GPU strategy.
> I plan to test that. (GPU round robin on thread pool)

I made this change.
Recommended settings: Worker Threads = 2 to 4 times the number of GPUs, and a small Batch Size.
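The round-robin idea can be sketched like this (an assumed design for illustration, not iw3's actual implementation; `infer` stands in for a hypothetical per-GPU depth-inference function):

```python
# Sketch of "GPU round robin on thread pool": batches are submitted to a
# thread pool, and each submission picks the next GPU in round-robin order,
# so all GPUs stay busy without DataParallel's scatter/gather overhead.
from concurrent.futures import ThreadPoolExecutor
from itertools import count

def make_round_robin_runner(gpu_ids, max_workers, infer):
    """Return a submit(batch) function; infer(gpu_id, batch) does the work."""
    counter = count()
    pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(batch):
        gpu = gpu_ids[next(counter) % len(gpu_ids)]  # round-robin device pick
        return pool.submit(infer, gpu, batch)

    return submit

# Toy usage: record which GPU each of 6 batches is assigned to.
order = []
submit = make_round_robin_runner([0, 1], max_workers=4,
                                 infer=lambda gpu, batch: order.append(gpu))
futures = [submit(b) for b in range(6)]
for f in futures:
    f.result()
print(sorted(order))  # three batches per GPU: [0, 0, 0, 1, 1, 1]
```

This is also why Worker Threads should be a small multiple of the GPU count: each GPU needs enough in-flight batches to hide per-batch latency.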

T4 x2 + Linux + 8 cores (when tested above, it was 2 cores...)

% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [01:29<00:00, 19.97it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out  --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [00:57<00:00, 30.90it/s]

Old code for comparison

% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 --depth-model Any_B --zoed-batch-size 4  --max-workers 8  --yes 
1080p.mp4: 100%|█████████████████| 1786/1786 [01:45<00:00, 16.87it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out  --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes 
1080p.mp4: 100%|█████████████████| 1786/1786 [01:22<00:00, 21.69it/s]

Single GPU performance is also improved.

On T4 x2 + Windows Server,
multi gpu fps: 22
single gpu fps: 18
Very little difference.

@ohjoij-ys

ohjoij-ys commented Aug 31, 2024

It doesn't seem to work on my PC.
Any_B
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [01:15<00:00, 29.47it/s]

python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [01:08<00:00, 32.55it/s]

ZoeD_N
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [02:49<00:00, 13.14it/s]

python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [03:23<00:00, 10.96it/s]

@nagadomi
Owner Author

Maybe the CPU or I/O is the bottleneck, and single-GPU performance is already high relative to them.
A single RTX 3080 is about 2x faster than a single T4.

Is the single GPU performance of --gpu 0 and --gpu 1 the same?

@nagadomi
Owner Author

nagadomi commented Sep 1, 2024

I changed part of the #59 (comment) change so that it is only enabled when the --cuda-stream option is specified. #213

@ohjoij-ys

GPU 0 performs a little differently from GPU 1, and multi-GPU still doesn't work.
Changing the SSD didn't help either.

Any_B:
--zoed-batch-size 4 --max-workers 8 --yes
CPU: 13600KF
SSD: RD20
GPU0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:31<00:00, 24.32it/s]
GPU1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:37<00:00, 22.85it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:14<00:00, 30.10it/s]
CPU load state: [screenshot]

add --cuda-stream
GPU0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:20<00:00, 27.60it/s]
GPU1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:04<00:00, 34.78it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:03<00:00, 35.27it/s]

CPU load state: [screenshot]

Changed the SSD to an Optane 900P, without --cuda-stream:
Gpu0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:32<00:00, 24.23it/s]
Gpu1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:28<00:00, 25.22it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:15<00:00, 29.61it/s]

add --cuda-stream
Gpu0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:16<00:00, 29.19it/s]
Gpu1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:07<00:00, 33.22it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:04<00:00, 34.47it/s]

CPU load state: [screenshot]

@ohjoij-ys

ZoeD_N:
GPU0:
fz.mkv: 100%|████████████████▉| 2230/2232 [02:20<00:00, 15.92it/s]
GPU1:
fz.mkv: 100%|████████████████▉| 2230/2232 [02:16<00:00, 16.36it/s]
MULTI GPU:
fz.mkv: 100%|████████████████▉| 2230/2232 [03:28<00:00, 10.70it/s]

Add --cuda-stream
Gpu0:
fz.mkv: 100%|████████████████▉| 2230/2232 [02:52<00:00, 12.91it/s]
Gpu1:
fz.mkv: 100%|████████████████▉| 2230/2232 [02:52<00:00, 12.95it/s]
MULTI GPU:
fz.mkv: 100%|████████████████▉| 2230/2232 [03:23<00:00, 10.95it/s]

@ohjoij-ys

ohjoij-ys commented Sep 1, 2024

I disabled the efficiency cores on the 13600K; it didn't help.

python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes --cuda-stream

fz.mkv: 100%|████████████████▉| 2230/2232 [03:13<00:00, 11.55it/s]

@nagadomi
Owner Author

nagadomi commented Sep 1, 2024

I think the multi-GPU feature is working, but it is simply not efficient.
Python threads are difficult to parallelize properly because of the Global Interpreter Lock.
multiprocessing might solve the problem, but I am hesitant to do it because it requires a lot of changes.
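A minimal sketch of the multiprocessing direction (illustrative only; `process_batch` and `gpu_id` are hypothetical stand-ins for per-GPU depth inference, not iw3 code):

```python
# Sketch: multiprocessing sidesteps the GIL because each worker process has
# its own interpreter, so CPU-bound work runs truly in parallel.
from multiprocessing import Pool

def process_batch(args):
    gpu_id, batch = args
    # In a real port, each worker would pin itself to its GPU once
    # (e.g. via torch.cuda.set_device(gpu_id)) and then run the depth
    # model; here we just double the values as a stand-in workload.
    return gpu_id, [x * 2 for x in batch]

if __name__ == "__main__":
    # Alternate batches between two hypothetical GPUs.
    jobs = [(i % 2, list(range(i, i + 4))) for i in range(4)]
    with Pool(processes=2) as pool:
        results = pool.map(process_batch, jobs)
    print(results[0])  # (0, [0, 2, 4, 6])
```

The cost the comment alludes to is real: every input batch and output depth map must be pickled across process boundaries, which is why this needs substantial restructuring compared with the thread-pool approach.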

@ohjoij-ys

> I think the multi-GPU feature is working, but it is simply not efficient. Python threads are difficult to parallelize properly because of the Global Interpreter Lock. multiprocessing might solve the problem, but it requires a lot of changes.

OK, I see. Thank you for your patient answer.
