
Multi GPU support for iw3 #59

Open
nagadomi opened this issue Oct 11, 2023 · 38 comments

Comments

@nagadomi
Owner

from #28 (comment)

I have 2 GPUs.
How can I use all the GPUs in iw3?

@nagadomi
Owner Author

pytorch/pytorch#8637
MiDaS has the problem linked above and cannot be used with nn.DataParallel.

@nagadomi
Owner Author

The above problem was fixed by nagadomi/MiDaS_iw3@22193f4,
but there still seems to be a register_forward_hook problem on multiple GPUs.

@nagadomi
Owner Author

register_forward_hook problem was fixed by nagadomi/MiDaS_iw3@0da1ad0 nagadomi/ZoeDepth_iw3@55bacaf

iw3 now works with multiple GPUs.

updating steps

for git,

# update source code
git pull
# update MiDaS and ZoeDepth
python -m iw3.download_models

for windows_package,
run update.bat.

examples

CLI

python -m iw3 -i ./tmp/test.mp4 -o ./tmp/ --gpu 0 1 --zoed-batch-size 8 

GUI

  1. Choose All CUDA Device in the device panel
  2. Increase Depth Batch Size (should be a multiple of the number of GPUs)

I have only tested the 2-GPU case on the Linux CLI.

@elecimage
Could you check to see it works?

@elecimage

elecimage commented Oct 14, 2023

> register_forward_hook problem was fixed … iw3 now works with multiple GPUs. … Could you check to see it works?

Oh, thank you. I'll test it soon.

@elecimage

> register_forward_hook problem was fixed … iw3 now works with multiple GPUs. … Could you check to see it works?

Yes, it works, but it is slower than using 1 GPU.

@nagadomi
Owner Author

nagadomi commented Oct 16, 2023

@elecimage
First, I do not have a Windows multi-GPU environment, so I may not be able to solve this problem.
On a Linux 2-GPU (Tesla T4 x2) environment, it is possible to achieve roughly 2x FPS.

Here are some possible causes and questions,

  1. How slow is it? I would like to know if it is a little slow or very slow.
  2. What GPUs are you using? If you are using two different GPU models, the slower one can bottleneck overall throughput.
  3. Have you increased Depth Batch Size? If the batch size is too small, it may be slower due to multi-GPU overhead.

@elecimage

> 3. Have you increased Depth Batch Size?

With 2 GPUs it is roughly 1.2x slower than with 1.

I'm using two 2080 Tis.

I've tried changing the Depth Batch Size several times, but it doesn't make much difference.

@nagadomi
Owner Author

OK, I will try to create a Windows VM in the cloud and check the behavior.

@nagadomi
Owner Author

nagadomi commented Oct 16, 2023

Maybe fixed by 2b7cbf9.
@elecimage
Would you please update and try again?
On the virtual machine I tried, it was about 1.5x faster with 2 GPUs.

@elecimage

> Maybe fixed by 2b7cbf9. @elecimage Would you please update and try again? On the virtual machine I tried, it was about 1.5x faster with 2 GPUs.

Oh, thank you. I'll test it soon.

@elecimage

> Maybe fixed by 2b7cbf9. @elecimage Would you please update and try again? On the virtual machine I tried, it was about 1.5x faster with 2 GPUs.

I'm still having problems.
It doesn't speed up, and even slows down.
When using 2 GPUs, each GPU only fills half its VRAM.
I've tested with 1 GPU: it fills all the VRAM and gets about 2 FPS.
With 2 GPUs, I also get about 2 FPS and only half the VRAM is used.
The speed is about the same or slightly slower when using 2 GPUs.

@nagadomi
Owner Author

nagadomi commented Oct 18, 2023

When using multiple GPUs, the batch is divided across the GPUs. So for the same batch size setting, each GPU's VRAM usage will be 1/(number of GPUs) of the single-GPU case.
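As an illustration of that arithmetic (a sketch of DataParallel-style input scattering, not iw3's actual code), this is how one batch gets split across devices:

```python
# Sketch: how a DataParallel-style wrapper splits one batch across devices,
# which is why per-GPU VRAM use drops to roughly 1/num_gpus.

def split_batch(batch_size: int, num_gpus: int) -> list[int]:
    """Divide a batch into near-equal per-GPU chunks;
    earlier devices receive the remainder."""
    base, rem = divmod(batch_size, num_gpus)
    return [base + (1 if i < rem else 0) for i in range(num_gpus)]

# With the GUI's "Depth Batch Size = 8" on 2 GPUs, each GPU sees 4 frames,
# so each uses about half the VRAM of the single-GPU case.
print(split_batch(8, 2))  # [4, 4]
print(split_batch(8, 1))  # [8]
```

This is also why a batch size that is not a multiple of the GPU count wastes capacity: `split_batch(7, 2)` gives `[4, 3]`, leaving one GPU underfilled.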

In my test above, I used the following settings. With a 720x720 video: 1 GPU = 2.5 FPS, 2 GPUs = 3.7 FPS.

Depth Model: ZoeD_N

Device: All CUDA Device
Depth Resolution: Default
Depth Batch Size: 8 or 16
Stereo Batch Size: 64
Low VRAM: False
TTA: False
FP16: True

The GPUs are Tesla T4 x2; the T4 is the same architecture generation as the RTX 2080 Ti and should perform slightly worse.
The OS is Windows Server 2022 with the latest NVIDIA driver installed.

For reference,
on Linux single RTX3070ti: 8 FPS.
on Linux 2x Tesla T4: 5 FPS.

@nagadomi
Owner Author

Recent Changes,

  • a little better FPS with minor improvements (Not related to Multi GPU)
  • Implemented GPU parallel mode that keeps replicas of models
  • Allow multiple iw3 GUI instances to be launched (each instance can run on a different GPU)

The issue of FPS not improving with multiple GPUs may be caused by the Windows NVIDIA driver mode (TCC/WDDM, which seems to differ between the Tesla and GeForce drivers), so it may not be fixable.

@ohjoij-ys

Multi GPU does not work on my PC.
System: Ubuntu 20.04
GPU: RTX 3080 x2
Driver: 560 (not open)

[screenshots: nvidia-smi with 1 GPU and with 2 GPUs]

@ohjoij-ys

Same result on Windows. I'm pretty sure it has identified all the cards.

@nagadomi
Owner Author

ZoeD_Any_N and ZoeD_Any_K do not support Multi GPU.
Which model did you try? Try ZoeD_N first.

@ohjoij-ys

Using ZoeD_N:
All CUDA: 3.8 FPS
Single GPU: 5.3 FPS

If I manually specify the number of worker threads, it results in an out-of-memory error.

Low VRAM mode, 16 threads:
All CUDA: 4.01 FPS
Single GPU: 4.01 FPS

[nvidia-smi screenshots]

@ohjoij-ys

Using Any_V2_N_S (doesn't seem to support multi GPU):
All CUDA: 4.85 FPS
Single GPU: 5.0 FPS

@ohjoij-ys

All CUDA vs single GPU
It seems none of them benefit. A driver/CUDA version problem?
Any_S: 4.91 vs 4.91
Any_B: 4.81 vs 4.81
Any_V2_N_B: 4.77 vs 4.85
ZoeD_K: 4.06 vs 4.2

@nagadomi
Owner Author

Multi-GPU DataParallel seems to be working (see the first nvidia-smi screenshot). It may just be slow.
Try increasing Depth Batch Size instead of Worker Threads; DataParallel distributes the batch across the GPUs.

Also, you can monitor GPU usage with the following commands:

watch -n 1 nvidia-smi

or

nvidia-smi -lms 500

@ohjoij-ys

It doesn't work.

[screenshot]

@nagadomi
Owner Author

Turn off Low VRAM and decrease Worker threads or set to 0.
Low VRAM limits batch size to 1.

@ohjoij-ys

The max batch size is 4 (it runs out of VRAM with anything higher).
Multi GPU: 3.84 FPS
Single GPU: 5.30 FPS
Does multi-GPU reduce performance?

@nagadomi
Owner Author

Try setting Stereo Processing Width to auto.
Also, if batch-size=4 works for single GPU, batch-size=8 should work for multi-GPU (batch-size=4 x 2GPU).

@ohjoij-ys

batch-size = 8:
Multi GPU: 4.2 FPS
Single GPU: 5.9 FPS

@nagadomi
Owner Author

Try closing the application once and then try again (to avoid out of memory).
Also, the Depth Anything model (Any_B) uses less VRAM, so you can try larger batch sizes.

Multi-GPU feature only supports depth estimation models, so if there are other bottlenecks, they will not be improved. Try low-resolution video as well.
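As a rough illustration of why this saturates (my own toy model, not measurements from iw3): if only the depth stage is parallelized, per-frame time is roughly the depth time divided by the GPU count plus the serial stages, i.e. Amdahl's law. The 120 ms / 80 ms figures below are made-up numbers.

```python
# Toy throughput model: only the depth-estimation stage scales with GPUs,
# while the stereo/encode stages stay serial, so the speedup saturates.

def overall_fps(depth_ms: float, other_ms: float, num_gpus: int) -> float:
    """Frames per second when only depth_ms is divided across num_gpus."""
    return 1000.0 / (depth_ms / num_gpus + other_ms)

# Example: depth takes 120 ms/frame, the remaining stages take 80 ms/frame.
print(round(overall_fps(120, 80, 1), 1))  # 5.0 FPS
print(round(overall_fps(120, 80, 2), 1))  # 7.1 FPS, not 10
```

This matches the pattern in the benchmarks above: the lighter the depth model relative to the rest of the pipeline, the smaller the multi-GPU gain.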

Also, when processing multiple videos, the following method is effective.

> Allow multiple iw3 GUI instances to be launched (each instance can run on a different GPU)

@ohjoij-ys

ohjoij-ys commented Aug 30, 2024

Any_B, low-resolution video
batch_size = 16:
Single GPU: 38 FPS
Multi GPU: 38 FPS

batch_size = 32:
Multi GPU: 38 FPS
Single GPU: 38 FPS
The CPU load pattern seems to be different.

[CPU/GPU load screenshots]

@nagadomi
Owner Author

I tried All CUDA in a Tesla T4 x 2 Linux environment.

% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [04:02<00:00,  7.36it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model ZoeD_N --zoed-batch-size 8 --yes 
1080p.mp4: 100%|█████████████████| 1786/1786 [05:42<00:00,  5.21it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model ZoeD_N --zoed-batch-size 4 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [05:47<00:00,  5.14it/s]

multi gpu fps: 7.36
single gpu fps: 5.14

With Depth Anything (Any_B), the difference is even smaller.

% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 1 --depth-model Any_B --zoed-batch-size 32 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:00<00:00, 14.83it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model Any_B --zoed-batch-size 32 --yes 
1080p.mp4: 100%|█████████████████| 1786/1786 [02:27<00:00, 12.14it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model Any_B --zoed-batch-size 16 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:26<00:00, 12.18it/s]

multi gpu fps: 14.83
single gpu fps: 12.18

I have an idea about another multi-GPU strategy.
I plan to test that. (GPU round robin on thread pool)

@ohjoij-ys

Maybe it’s because Nvidia has cut some features from gaming graphics cards compared to professional cards. Anyway, I’m looking forward to your new multi-GPU strategy.

@nagadomi
Owner Author

> I have an idea about another multi-GPU strategy.
> I plan to test that. (GPU round robin on thread pool)

I made this change.
Recommended settings: Worker Threads = 2 to 4 times the number of GPUs, and a small Batch Size.
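The round-robin idea can be sketched like this (an assumed design for illustration, not iw3's actual implementation; `infer` stands in for a hypothetical per-GPU depth-inference function):

```python
# Sketch of "GPU round robin on thread pool": batches are submitted to a
# thread pool, and each submission picks the next GPU in round-robin order,
# so all GPUs stay busy without DataParallel's scatter/gather overhead.
from concurrent.futures import ThreadPoolExecutor
from itertools import count

def make_round_robin_runner(gpu_ids, max_workers, infer):
    """Return a submit(batch) function; infer(gpu_id, batch) does the work."""
    counter = count()
    pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(batch):
        gpu = gpu_ids[next(counter) % len(gpu_ids)]  # round-robin device pick
        return pool.submit(infer, gpu, batch)

    return submit

# Toy usage: record which GPU each of 6 batches is assigned to.
order = []
submit = make_round_robin_runner([0, 1], max_workers=4,
                                 infer=lambda gpu, batch: order.append(gpu))
futures = [submit(b) for b in range(6)]
for f in futures:
    f.result()
print(sorted(order))  # three batches per GPU: [0, 0, 0, 1, 1, 1]
```

This is also why Worker Threads should be a small multiple of the GPU count: each GPU needs enough in-flight batches to hide per-batch latency.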

T4 x2 + Linux + 8 cores (when tested above, it was 2 cores...)

% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [01:29<00:00, 19.97it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out  --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [00:57<00:00, 30.90it/s]

Old code for comparison

% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 --depth-model Any_B --zoed-batch-size 4  --max-workers 8  --yes 
1080p.mp4: 100%|█████████████████| 1786/1786 [01:45<00:00, 16.87it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out  --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes 
1080p.mp4: 100%|█████████████████| 1786/1786 [01:22<00:00, 21.69it/s]

Single GPU performance is also improved.

On T4 x2 + Windows Server,
multi gpu fps: 22
single gpu fps: 18
Very little difference.

@ohjoij-ys

ohjoij-ys commented Aug 31, 2024

It doesn't seem to work on my PC.
Any_B
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [01:15<00:00, 29.47it/s]

python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [01:08<00:00, 32.55it/s]

ZoeD_N
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [02:49<00:00, 13.14it/s]

python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [03:23<00:00, 10.96it/s]

@nagadomi
Owner Author

Maybe the CPU or I/O is the bottleneck, and single-GPU performance is already high relative to them.
A single RTX 3080 is about 2x faster than a single T4.

Is the single GPU performance of --gpu 0 and --gpu 1 the same?

@nagadomi
Owner Author

nagadomi commented Sep 1, 2024

I changed part of the #59 (comment) change so that it is only enabled when the --cuda-stream option is specified. #213

@ohjoij-ys

GPU 0 performs a little differently from GPU 1, and multi-GPU still doesn't work.
Changing the SSD didn't help either.

Any_B:
--zoed-batch-size 4 --max-workers 8 --yes
CPU: 13600KF
SSD: RD20
GPU0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:31<00:00, 24.32it/s]
GPU1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:37<00:00, 22.85it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:14<00:00, 30.10it/s]
CPU load state: [screenshot]

add --cuda-stream
GPU0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:20<00:00, 27.60it/s]
GPU1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:04<00:00, 34.78it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:03<00:00, 35.27it/s]

CPU load state: [screenshot]

Changed the SSD to an Optane 900P, without --cuda-stream:
Gpu0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:32<00:00, 24.23it/s]
Gpu1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:28<00:00, 25.22it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:15<00:00, 29.61it/s]

add --cuda-stream
Gpu0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:16<00:00, 29.19it/s]
Gpu1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:07<00:00, 33.22it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:04<00:00, 34.47it/s]

CPU load state: [screenshot]

@ohjoij-ys

ZoeD_N:
GPU0:
fz.mkv: 100%|████████████████▉| 2230/2232 [02:20<00:00, 15.92it/s]
GPU1:
fz.mkv: 100%|████████████████▉| 2230/2232 [02:16<00:00, 16.36it/s]
MULTI GPU:
fz.mkv: 100%|████████████████▉| 2230/2232 [03:28<00:00, 10.70it/s]

Add --cuda-stream
Gpu0:
fz.mkv: 100%|████████████████▉| 2230/2232 [02:52<00:00, 12.91it/s]
Gpu1:
fz.mkv: 100%|████████████████▉| 2230/2232 [02:52<00:00, 12.95it/s]
MULTI GPU:
fz.mkv: 100%|████████████████▉| 2230/2232 [03:23<00:00, 10.95it/s]

@ohjoij-ys

ohjoij-ys commented Sep 1, 2024

I disabled the efficiency cores on the 13600K; it didn't help.

python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes --cuda-stream

fz.mkv: 100%|████████████████▉| 2230/2232 [03:13<00:00, 11.55it/s]

@nagadomi
Owner Author

nagadomi commented Sep 1, 2024

I think the multi-GPU feature is working, but it is simply not efficient.
Python threads are difficult to parallelize properly because of the Global Interpreter Lock.
multiprocessing might solve the problem, but I am hesitant to do it because it requires a lot of changes.
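A minimal sketch of the multiprocessing direction (illustrative only; `process_batch` and `gpu_id` are hypothetical stand-ins for per-GPU depth inference, not iw3 code):

```python
# Sketch: multiprocessing sidesteps the GIL because each worker process has
# its own interpreter, so CPU-bound work runs truly in parallel.
from multiprocessing import Pool

def process_batch(args):
    gpu_id, batch = args
    # In a real port, each worker would pin itself to its GPU once
    # (e.g. via torch.cuda.set_device(gpu_id)) and then run the depth
    # model; here we just double the values as a stand-in workload.
    return gpu_id, [x * 2 for x in batch]

if __name__ == "__main__":
    # Alternate batches between two hypothetical GPUs.
    jobs = [(i % 2, list(range(i, i + 4))) for i in range(4)]
    with Pool(processes=2) as pool:
        results = pool.map(process_batch, jobs)
    print(results[0])  # (0, [0, 2, 4, 6])
```

The cost the comment alludes to is real: every input batch and output depth map must be pickled across process boundaries, which is why this needs substantial restructuring compared with the thread-pool approach.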

@ohjoij-ys

> I think the multi-GPU feature is working, but it is simply not efficient. Python threads are difficult to parallelize properly because of the Global Interpreter Lock. multiprocessing might solve the problem, but it requires a lot of changes.

OK, I see. Thank you for your patient answer.
