Shared GPU memory working for one GPU, but not two? #2917
-
My rig: i9-14000k, 96GB RAM, 2x RTX A4000 16GB.

I noticed that when running on one GPU I can load models larger than 16GB using shared GPU memory. I'm not sure how it determines how much system memory to dedicate to this, but the default looks like 35.6GB, so I can load Llama 2 Chat 70B Q4, a nearly 40.9GB model, when I configure Jan to use one GPU. It's dog-slow (0.3 t/s), and that's fine... understandable when sharing system memory over the bus.

I figured that if I enable the second GPU it would still do this, but instead I get an error in the log saying "ggml_backend_cuda_buffer_type_alloc_buffer: allocating 20038.81 MiB on device 0: cudaMalloc failed: out of memory".

Might this be related to MMQ? I noticed "force MMQ" is enabled with one GPU and not with two, and I have no idea what this setting does.

Is that the expected behavior? I figured it would fill up both GPUs, then spill into system memory and share the processing load, even if it's ultimately limited by the shared memory.
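To show what I mean by the single-GPU case, here's a rough sketch using the llama-cpp-python bindings as a stand-in for Jan's bundled llama.cpp engine (that substitution is my assumption; the model path is just a placeholder, not Jan's actual config):

```python
# Illustrative sketch only: llama-cpp-python standing in for Jan's llama.cpp engine.
import os

# Restrict CUDA to a single device *before* the backend initializes.
# This is the standard CUDA environment variable, also honored by llama.cpp.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 = offload every layer; on Windows the WDDM driver
                      # may then spill past the card's 16 GB into shared
                      # system memory, which seems to match what I'm seeing
    n_ctx=4096,
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```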
Replies: 1 comment
-
You may consider the "n_gl" (or "n_gpu_layers") setting in the right panel:
What does this mean?
"n_gl" or "n_gpu_layers" is a setting that controls how many layers of the AI model are loaded into the GPU memory. It's basically a way to balance between speed and memory usage:
Suggestions