Shared GPU memory working for one GPU, but not two? #2917
-
My rig: i9-14000k, 96GB RAM, 2x RTX A4000 16GB.

I noticed that when running on one GPU I can load models larger than 16GB using shared GPU memory. I'm not sure how it determines how much system memory to dedicate to this, but the default looks like 35.6GB, so I can load Llama 2 Chat 70B Q4, a nearly 40.9GB model, when I configure Jan to use one GPU. It's dog-slow (0.3 t/s), and that's fine... understandable when sharing system memory over the bus.

I figured that if I enable the second GPU it would still do this, but instead I get an error in the log saying "ggml_backend_cuda_buffer_type_alloc_buffer: allocating 20038.81 MiB on device 0: cudaMalloc failed: out of memory".

Might this be related to MMQ? I noticed "force MMQ" is enabled with one GPU and not with two, and I have no idea what this setting does.

Is that the expected behavior? I figured it would fill up both GPUs, then spill into system memory and share the processing load, even if it's ultimately limited by the shared memory.
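To show what I mean by the single-GPU case, here's a rough sketch using the llama-cpp-python bindings as a stand-in for Jan's bundled llama.cpp engine (that substitution is my assumption; the model path is just a placeholder, not Jan's actual config):

```python
# Illustrative sketch only: llama-cpp-python standing in for Jan's llama.cpp engine.
import os

# Restrict CUDA to a single device *before* the backend initializes.
# This is the standard CUDA environment variable, also honored by llama.cpp.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 = offload every layer; on Windows the WDDM driver
                      # may then spill past the card's 16 GB into shared
                      # system memory, which seems to match what I'm seeing
    n_ctx=4096,
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```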
Replies: 1 comment
-
You may consider the "n_gl" (or "n_gpu_layers") setting in the right panel:
What does this mean?
"n_gl" or "n_gpu_layers" is a setting that controls how many layers of the AI model are loaded into the GPU memory. It's basically a way to balance between speed and memory usage:
Suggestions