ollama does not use GPU #111

Open
Blake110 opened this issue Jan 15, 2025 · 6 comments

Comments

@Blake110

Blake110 commented Jan 15, 2025

I tried to run web-ui with a remote Ollama server and the qwen2.5 model. It can connect to the server, but it only runs the model in CPU mode.
[screenshot: ollama1]

I tested open-webui against the same remote Ollama server and it runs fine with the GPU. Running the model on the remote server from the CLI also uses the GPU without any issues.

I re-pulled the model, restarted the Ollama server, and restarted the remote machine, but the result is still the same.

The browser web-ui works fine with Groq.

@MeshkatShB
Contributor

Hi there. When you want to use Ollama, you need to make sure you have enough vRAM for it. The model file itself (e.g. qwen2.5) may be 4.7 GB, but it needs more vRAM than that once it is loaded.

How much vRAM do you have available?

I have an 8 GB NVIDIA 3070 Ti GPU; qwen2:1.5b loads, but because of the model size relative to the available vRAM it is still far, far slower. So there is a tradeoff between your GPU's vRAM and the capability of the model.
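
If you are not sure whether the model ended up on the GPU at all, you can ask the Ollama server directly. Below is a rough sketch (assuming a reachable server and its /api/ps endpoint, which reports size and size_vram for each loaded model):

    import requests

    # Ask the Ollama server which models are loaded and how much of each
    # sits in GPU memory (size vs. size_vram). Adjust the address if the
    # server is remote.
    OLLAMA_URL = "http://localhost:11434"

    resp = requests.get(f"{OLLAMA_URL}/api/ps", timeout=10)
    resp.raise_for_status()

    for m in resp.json().get("models", []):
        size = m.get("size", 0)            # total bytes used by the model
        size_vram = m.get("size_vram", 0)  # bytes resident in GPU memory
        pct = 100 * size_vram / size if size else 0
        print(f"{m['name']}: {pct:.0f}% in vRAM "
              f"({size_vram / 2**30:.1f} of {size / 2**30:.1f} GiB)")

If size_vram is well below size (or zero), the model has spilled to system RAM and will run mostly on the CPU.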

@coolrazor007

It is hard to tell, but I'm seeing similar slowness when using Ollama as well. I use Ollama all the time for LLM prompting and it is quite quick (a response always takes less than 30 seconds) on my RTX 3090 with llama3.1:8b or other models once they are loaded. Yet no matter what model I use with web-ui, a task takes upwards of 30 MINUTES and often never finishes at all. Is this normal behavior? How many LLM calls is it making?

From what I can tell, though, Ollama is using the GPU, which makes sense because an external Ollama isn't controlled by web-ui; web-ui is just making API calls to it. Not to hijack this thread, but I get the feeling we are experiencing the same thing.

@MeshkatShB
Contributor

I tried it on an RTX 3090 desktop (24 GB vRAM, 32 GB RAM) and it ran smoothly (without Docker). It was using around ~70% GPU and 30% CPU (checked with ollama ps).

@coolrazor007

I'm running Ollama and Webui in Docker, so that's a difference. Dumb question, but Webui doesn't need a GPU itself, right? It just calls the remote API for everything LLM-related?

@MeshkatShB
Contributor

MeshkatShB commented Jan 20, 2025

Yeah, it's just an API call. For Ollama, web-ui uses your Ollama serving address at localhost:11434, or a remote server if you set the base_url parameter. There are a few modifications you can make to push Ollama onto your GPU:

  • Set num_gpu=1 to explicitly tell the model that you want to use the GPU,
  • Set keep_alive="10m" to keep the model in memory for 10 minutes, and
  • Set num_thread=0 to suppress the CPU load.

Below is the code snippet to do so:

    elif provider == "ollama":
        return ChatOllama(
            model=kwargs.get("model_name", "qwen2.5:7b"),
            temperature=kwargs.get("temperature", 0.0),
            num_ctx=128000,  # context window size
            base_url=kwargs.get("base_url", "http://localhost:11434"),  # remote Ollama server if set
            num_thread=0,      # suppress the CPU load
            num_gpu=1,         # explicitly request the GPU
            keep_alive="10m",  # keep the model loaded for 10 minutes
        )

You can add the mentioned options to the corresponding ollama return call in the source code.
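
For reference, a call site could look like the sketch below. The function name and import path are assumptions based on the snippet above (a provider factory that forwards **kwargs), so adjust them to the actual source:

    from src.utils.utils import get_llm_model  # assumed path/name of the factory shown above

    # Point web-ui at a remote Ollama server and pick the model; the GPU-related
    # options (num_gpu, num_thread, keep_alive) are set inside the ollama branch.
    llm = get_llm_model(
        provider="ollama",
        model_name="qwen2.5:7b",
        temperature=0.0,
        base_url="http://<remote-ollama-host>:11434",
    )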

@EvilFreelancer

The reason is here: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/ollama.py#L160-L173

It looks like a bug in ChatOllama in LangChain: too many parameters are passed to the Ollama API.

I commented all of them out (except temperature) and now it seems to use my GPU.
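
Something along these lines is roughly what is left after commenting the extra parameters out (a minimal sketch, assuming langchain_ollama's ChatOllama; the langchain_community variant accepts the same arguments):

    from langchain_ollama import ChatOllama  # or: from langchain_community.chat_models import ChatOllama

    # Only the essentials are passed; everything else stays at the server's
    # defaults, so Ollama itself decides how many layers to offload to the GPU.
    llm = ChatOllama(
        model="qwen2.5:7b",
        base_url="http://localhost:11434",
        temperature=0.0,
    )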
