ollama does not use GPU #111
Comments
It is hard to tell, but I'm seeing similar slowness when using Ollama as well. I use Ollama all the time for LLM prompting and it is quite quick (responses always take less than 30 seconds) on my RTX 3090 with llama3.1:8b or other models once they are loaded. Anyway, no matter what model I use with web-ui, it takes upwards of 30 minutes to do a task and often never finishes. Is this normal behavior? How many LLM calls is it making? From what I can tell, though, Ollama is using the GPU, which makes sense because an external Ollama isn't controlled by web-ui; web-ui is just making API calls to it. Not to hijack this thread, but I get the feeling we are experiencing the same thing.
I tried it on an RTX 3090 desktop (24GB VRAM, 32GB RAM) and it ran smoothly (without Docker). It was using around ~70% GPU and 30% CPU (logged with …).
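For anyone who wants to reproduce this kind of measurement, here is a rough sketch of one way to log GPU and CPU utilization side by side. It assumes the `nvidia-ml-py` (`pynvml`) and `psutil` packages, which are not part of web-ui itself:

```python
import psutil   # CPU utilization
import pynvml   # NVIDIA GPU utilization (nvidia-ml-py package)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU, e.g. the RTX 3090

# Sample utilization a few times while a task is running.
for _ in range(10):
    gpu = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    cpu = psutil.cpu_percent(interval=1)  # averaged over 1 second
    print(f"GPU {gpu:3d}%  CPU {cpu:5.1f}%")

pynvml.nvmlShutdown()
```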
I'm running Ollama and Webui in Docker, so that's a difference. Dumb question, but Webui doesn't need a GPU itself, right? It just calls the remote API for everything LLM-related?
Yeah, it's just an API call, but for Ollama you can pass a few extra options (such as `num_gpu`) when constructing the client. Below is the code snippet to do so:

```python
elif provider == "ollama":
    return ChatOllama(
        model=kwargs.get("model_name", "qwen2.5:7b"),
        temperature=kwargs.get("temperature", 0.0),
        num_ctx=128000,
        base_url=kwargs.get("base_url", "http://localhost:11434"),
        num_thread=0,
        num_gpu=1,
        keep_alive="10m",
    )
```

You can add the mentioned code to this return call in the source code.
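To illustrate that the UI side really is just HTTP, here is a minimal sketch of the kind of request web-ui ends up sending to an Ollama server. The endpoint, model name, and address are assumptions for the example (the standard `/api/chat` route on the default port), not code from this repo:

```python
import requests

# Hypothetical local endpoint; web-ui would use whatever base_url you configure.
OLLAMA_URL = "http://localhost:11434"

payload = {
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "stream": False,
}

# A single POST to the chat endpoint; no GPU is needed on the caller's side.
resp = requests.post(f"{OLLAMA_URL}/api/chat", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```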
The reason is here: https://github.com/langchain-ai/langchain/blob/master/libs/community/langchain_community/llms/ollama.py#L160-L173. It seems like a bug in ChatOllama in langchain: too many parameters are passed to the Ollama API. I've commented them all out (except temperature) and now it seems to use my GPU.
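For reference, a sketch of what that stripped-down call could look like, based on the snippet earlier in the thread with the extra Ollama options commented out; exactly which options you keep is a judgment call:

```python
elif provider == "ollama":
    return ChatOllama(
        model=kwargs.get("model_name", "qwen2.5:7b"),
        temperature=kwargs.get("temperature", 0.0),
        base_url=kwargs.get("base_url", "http://localhost:11434"),
        # num_ctx=128000,   # a 128k context can exceed VRAM and push layers to CPU
        # num_thread=0,
        # num_gpu=1,
        # keep_alive="10m",
    )
```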
I tried to run web-ui with a remote Ollama server and the qwen2.5 model. It can create a connection with the server, but the model only runs in CPU mode.

I tested open-webui with the same remote Ollama server and it runs fine with the GPU.
Running the model on the remote server from the CLI uses the GPU without any issues.
I re-pulled the model, restarted the Ollama server, and restarted the remote server, but the result is still the same.
The browser web-ui works fine with Groq.
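One way to check from the client side whether the remote server actually loaded the model onto the GPU is to query Ollama's running-models endpoint. This is a sketch under the assumption that your Ollama version exposes `GET /api/ps` with a `size_vram` field; the host name is a placeholder:

```python
import requests

OLLAMA_URL = "http://remote-host:11434"  # placeholder address of the remote server

# List the currently loaded models and how much of each sits in VRAM.
resp = requests.get(f"{OLLAMA_URL}/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    size = m.get("size", 0)
    size_vram = m.get("size_vram", 0)
    pct = 100 * size_vram / size if size else 0
    # ~100% means the model is fully on the GPU; 0% means it fell back to CPU.
    print(f"{m.get('name')}: {pct:.0f}% of the model in VRAM")
```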