OpenAI API completions endpoint - Not working as expected #2719
Comments
Hello @anandnandagiri You are trying to serve the model directly, with no continuous batching pipeline. In such a scenario the model is exposed for single inference via the standard TFS/KServe APIs, with no text generation loop. To fully use the text generation use case via the OpenAI completions API, please refer to the Continuous Batching Demo. Just follow the steps and swap the model to Llama 3.2 1B.
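For illustration, a minimal sketch of the kind of config.json the continuous batching demo uses to register a graph-based pipeline instead of a plain model (field names follow the demo and may differ between releases; the model name and path match the ones used later in this thread):

{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "llama3.2-1b",
            "base_path": "/ovmodels/llama3.2-1b"
        }
    ]
}

With this layout, the graph.pbtxt from the demo would sit in the base_path folder alongside the model files.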
Thank you @dkalinowski
I followed the steps in the Continuous Batching Demo. Below are the errors I am getting. I don't see any GPU resource issue (image attached for reference). I am testing on an Intel i7 11th generation processor. When I run the code using the GPU directly it works fine, but with Model Server it does not:

docker run --rm -it -v %cd%\ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000 openvino/model_server:latest-gpu --config_path ovmodels/model_config_list.json --rest_port 8000

[2024-10-03 14:13:45.386][1][serving][info][server.cpp:75] OpenVINO Model Server 2024.4.28219825c
[2024-10-03 14:15:05.783][1][serving][error][llmnoderesources.cpp:169] Error during llm node initialization for models_path: /ovmodels/llama3.2-1b/1 exception: Exception from src/inference/src/cpp/remote_context.cpp:68:
[2024-10-03 14:15:05.783][1][serving][error][mediapipegraphdefinition.cpp:468] Failed to process LLM node graph llama3.2-1b
@anandnandagiri How much memory do you have assigned to the WSL? Compared to native Linux, there might be less memory available to the GPU. Try reducing the cache size in graph.pbtxt, which in the demo is set to 8 GB. Try a lower value like 4 or even less.
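For reference, the cache size lives in the LLM node options inside graph.pbtxt; below is a trimmed sketch assuming the calculator and option names used by the demo (the exact fields may differ between releases, so check your copy of the file):

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  # ... input/output streams omitted ...
  node_options: {
    [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
      models_path: "./"
      # KV cache size in GB; the demo default is 8, lowered here for a memory-constrained GPU
      cache_size: 4
    }
  }
}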
@dtrawins it worked well when I changed the cache size to 2 in graph.pbtxt. Info: I am using WSL2 on Windows 10 Pro where no .wslconfig is present, so it is running with the default configuration. I am using Docker Desktop to run Model Server (no separate Linux distro) through the command prompt. Any help on the below?
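As a side note not taken from the thread: when no .wslconfig exists, WSL2 uses its default memory limit, which can be raised by creating %USERPROFILE%\.wslconfig on the Windows host and restarting WSL. A minimal sketch (the 8GB value is only an example to adjust to the host):

[wsl2]
memory=8GB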
@anandnandagiri The graph documentation can be found here: https://github.com/openvinotoolkit/model_server/blob/main/docs/llm/reference.md The mount parameters which you used are required to make the GPU accessible in the container on WSL. It is documented here: https://github.com/openvinotoolkit/model_server/blob/main/docs/accelerators.md#starting-a-docker-container-with-intel-integrated-gpu-intel-data-center-gpu-flex-series-and-intel-arc-gpu
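For comparison, a sketch of how the same container is typically started on native Linux, where the GPU is exposed through /dev/dri instead of the WSL-specific /dev/dxg mount (paths and image tag taken from this thread; consult the accelerators documentation linked above for the exact flags for your setup):

docker run --rm -it --device /dev/dri -v $(pwd)/ovmodels:/ovmodels -p 8000:8000 openvino/model_server:latest-gpu --config_path /ovmodels/model_config_list.json --rest_port 8000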
@dtrawins I have followed the Embeddings demo. I see a few issues with configuring graph.pbtxt, config.json and subconfig.json with the docker image. Did I miss anything?
config.json and folder structure:
graph.pbtxt:
subconfig.json:
I am using the docker GPU image; below is the command:
I am getting the following error:
@anandnandagiri I think the folder gte_large_en_1.5v/models/graph.pbtxt should contain the graph content specific to embeddings. I think you copied the same graph from the LLM pipeline. This graph file defines which calculators should be applied and how they should be connected. Try to copy the one from the embeddings demo.
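Once the embeddings graph is in place, a quick sanity check could look like the sketch below, assuming the OpenAI-compatible /v3/embeddings endpoint and the model name used later in this thread (adjust host, port and name to your deployment):

curl http://localhost:8000/v3/embeddings -H "Content-Type: application/json" -d '{"model": "gte-large-en-v1.5", "input": "hello world"}'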
@anandnandagiri There was a recent simplification of the docker command in the demo, dropping the --cpu_extension parameter (d720af7), which I assume you followed. However, it requires building the image from the latest main branch.
@dtrawins I followed all the steps mentioned above, but I renamed the models folder to gte-large-en-v1.5 to standardize things and run multiple models. Below is a screenshot of the folder structure and the docker command.
|
@anandnandagiri
I have downloaded the Llama 3.2 1B model from Hugging Face with optimum-cli:
optimum-cli export openvino --model meta-llama/Llama-3.2-1B-Instruct llama3.2-1b/1
Below are the files downloaded.
Note: I manually removed openvino_detokenizer.bin, openvino_detokenizer.xml, openvino_tokenizer.xml and openvino_tokenizer.bin to ensure there is only one .bin and one .xml file in the version 1 folder.
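As a side note not taken from the report: if GPU memory is tight, optimum-cli can also export the model with compressed weights, for example via the --weight-format option from optimum-intel (verify the flag against your installed version):

optimum-cli export openvino --model meta-llama/Llama-3.2-1B-Instruct --weight-format int4 llama3.2-1b/1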
Run Model Server with the below command, making sure the Windows WSL path is given correctly and including the docker parameters for the Intel Iris GPU:
docker run --rm -it -v %cd%/ovmodels/llama3.2-1b:/models/llama3.2-1b --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000 openvino/model_server:latest-gpu --model_path /models/llama3.2-1b --model_name llama3.2-1b --rest_port 8000
I have run the below command, which worked perfectly:
curl --request GET http://172.17.0.3:8000/v1/config
Below is the output:
{
"llama3.2-1b" :
{
"model_version_status": [
{
"version": "1",
"state": "AVAILABLE",
"status": {
"error_code": "OK",
"error_message": "OK"
}
}
]
}
}
But the below curl command for the OpenAI API completions endpoint did not work as expected:
curl http://172.17.0.3:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2-1b", "prompt": "This is a test", "stream": false }'
This gives the error:
{"error": "Model with requested name is not found"}