OpenAI API completions endpoint - Not working as expected #2719

Open
anandnandagiri opened this issue Oct 2, 2024 · 12 comments
Labels
help wanted Extra attention is needed

Comments

@anandnandagiri

anandnandagiri commented Oct 2, 2024

I downloaded the Llama 3.2 1B model from Hugging Face with optimum-cli:

optimum-cli export openvino --model meta-llama/Llama-3.2-1B-Instruct llama3.2-1b/1

Below are the downloaded files:

image

Note: I manually removed openvino_detokenizer.bin, openvino_detokenizer.xml, openvino_tokenizer.xml, and openvino_tokenizer.bin to ensure there is only one .bin and one .xml file in the version 1 folder.

I ran Model Server with the command below, ensuring the Windows WSL path is correct and passing the Docker parameters for the Intel Iris GPU:

docker run --rm -it -v %cd%/ovmodels/llama3.2-1b:/models/llama3.2-1b --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000 openvino/model_server:latest-gpu --model_path /models/llama3.2-1b --model_name llama3.2-1b --rest_port 8000

I ran the command below, which worked perfectly:
curl --request GET http://172.17.0.3:8000/v1/config

Below is the output:

{
  "llama3.2-1b": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}

But the curl command below for the OpenAI API completions endpoint did not work as expected:

curl http://172.17.0.3:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2-1b", "prompt": "This is a test", "stream": false }'

It returns the error:
{"error": "Model with requested name is not found"}

@anandnandagiri anandnandagiri added the bug Something isn't working label Oct 2, 2024
@atobiszei atobiszei added the help wanted Extra attention is needed label Oct 2, 2024
@dkalinowski
Collaborator

dkalinowski commented Oct 3, 2024

Hello @anandnandagiri

You are trying to serve the model directly, with no continuous batching pipeline. In that scenario the model is exposed for single inference via the standard TFS/KServe APIs, with no text generation loop. To use the text generation flow via the OpenAI completions API, please refer to the Continuous Batching Demo. Just follow the steps and swap the model for Llama 3.2 1B.
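For reference, here is a minimal sketch of the MediaPipe-style configuration the demo sets up (the servable name llama3.2-1b and the /ovmodels paths are assumptions matching the setup above, not the demo's exact files):

config.json:

{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "llama3.2-1b",
            "base_path": "/ovmodels/llama3.2-1b"
        }
    ]
}

The base_path folder holds the exported model files plus a graph.pbtxt containing the HttpLLMCalculator node from the demo. With that in place, the request goes to /v3/completions and the "model" field must match the servable name, i.e. "llama3.2-1b".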

@anandnandagiri
Author

Thank You @dkalinowski

@anandnandagiri
Author

@dkalinowski

I followed the steps in the Continuous Batching Demo. Below are the exceptions I am getting. I don't see any GPU resource issue (image attached for reference). I am testing on an Intel Core i7 11th-generation processor.

When I run code directly on the GPU it works fine, but with Model Server it does not:

docker run --rm -it -v %cd%\ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000 openvino/model_server:latest-gpu --config_path ovmodels/model_config_list.json --rest_port 8000

[2024-10-03 14:13:45.386][1][serving][info][server.cpp:75] OpenVINO Model Server 2024.4.28219825c
[2024-10-03 14:13:45.386][1][serving][info][server.cpp:76] OpenVINO backend c3152d32c9c7
[2024-10-03 14:13:45.386][1][serving][info][pythoninterpretermodule.cpp:35] PythonInterpreterModule starting
Python version
3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
[2024-10-03 14:13:45.559][1][serving][info][pythoninterpretermodule.cpp:46] PythonInterpreterModule started
[2024-10-03 14:13:45.766][1][modelmanager][info][modelmanager.cpp:125] Available devices for Open VINO: CPU, GPU
[2024-10-03 14:13:45.768][1][serving][info][grpcservermodule.cpp:122] GRPCServerModule starting
[2024-10-03 14:13:45.770][1][serving][info][grpcservermodule.cpp:191] GRPCServerModule started
[2024-10-03 14:13:45.770][1][serving][info][grpcservermodule.cpp:192] Started gRPC server on port 9178
[2024-10-03 14:13:45.770][1][serving][info][httpservermodule.cpp:33] HTTPServerModule starting
[2024-10-03 14:13:45.770][1][serving][info][httpservermodule.cpp:37] Will start 32 REST workers
[2024-10-03 14:13:45.776][1][serving][info][http_server.cpp:269] REST server listening on port 8000 with 32 threads
[evhttp_server.cc : 253] NET_LOG: Entering the event loop ...
[2024-10-03 14:13:45.776][1][serving][info][httpservermodule.cpp:47] HTTPServerModule started
[2024-10-03 14:13:45.777][1][serving][info][httpservermodule.cpp:48] Started REST server at 0.0.0.0:8000
[2024-10-03 14:13:45.777][1][serving][info][servablemanagermodule.cpp:51] ServableManagerModule starting
[2024-10-03 14:13:45.791][1][modelmanager][info][modelmanager.cpp:536] Configuration file doesn't have custom node libraries property.
[2024-10-03 14:13:45.791][1][modelmanager][info][modelmanager.cpp:579] Configuration file doesn't have pipelines property.
[2024-10-03 14:13:45.796][1][serving][info][mediapipegraphdefinition.cpp:419] MediapipeGraphDefinition initializing graph nodes
Inference requests aggregated statistic:
Paged attention % of inference execution: -nan
MatMul % of inference execution: -nan
Total inference execution secs: 0

[2024-10-03 14:15:05.783][1][serving][error][llmnoderesources.cpp:169] Error during llm node initialization for models_path: /ovmodels/llama3.2-1b/1 exception: Exception from src/inference/src/cpp/remote_context.cpp:68:
Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cpp:201:
[GPU] out of GPU resources

[2024-10-03 14:15:05.783][1][serving][error][mediapipegraphdefinition.cpp:468] Failed to process LLM node graph llama3.2-1b
[2024-10-03 14:15:05.783][1][modelmanager][info][pipelinedefinitionstatus.hpp:59] Mediapipe: llama3.2-1b state changed to: LOADING_PRECONDITION_FAILED after handling: ValidationFailedEvent:
[2024-10-03 14:15:05.784][1][serving][info][servablemanagermodule.cpp:55] ServableManagerModule started
[2024-10-03 14:15:05.785][115][modelmanager][info][modelmanager.cpp:1087] Started cleaner thread
[2024-10-03 14:15:05.784][114][modelmanager][info][modelmanager.cpp:1068] Started model manager thread

image

@dtrawins
Collaborator

dtrawins commented Oct 4, 2024

@anandnandagiri How much memory do you have assigned to the WSL? On the Linux side there might be less memory available for the GPU. Try reducing the cache size in graph.pbtxt, which in the demo is set to 8 GB. Try a lower value like 4 or even less.
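For illustration, the fragment of node_options in graph.pbtxt to edit looks roughly like this (field names as in the demo graph shown later in this thread; the value 4 is just an example):

  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "/ovmodels/llama3.2-1b/1",
          cache_size: 4,   # KV cache size in GB; lower it if the GPU runs out of memory
          device: "GPU"
      }
  }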

@anandnandagiri
Author

anandnandagiri commented Oct 5, 2024

@dtrawins It worked well when I changed the cache_size to 2 in graph.pbtxt.

Info: I am using WSL2 on Windows 10 Pro with no .wslconfig present, so it is running with the default configuration. I am using Docker Desktop to run Model Server (no Linux distro) through the command prompt.

Any help on the following?

  1. Is there a link to documentation on graph.pbtxt? (To my surprise, if I remove --volume /usr/lib/wsl:/usr/lib/wsl from the command, the Model Server GPU container does not run at all.)
  2. I am not able to run, or at least convert, text embedding models such as https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 to OpenVINO format with "optimum-cli export openvino" to serve text embeddings for a vector store. Can optimum-cli convert them, and if so, does Model Server support them?

@dtrawins
Collaborator

@anandnandagiri The graph documentation can be found here: https://github.com/openvinotoolkit/model_server/blob/main/docs/llm/reference.md The mount parameters you used are required to make the GPU accessible in the container on WSL; this is documented here: https://github.com/openvinotoolkit/model_server/blob/main/docs/accelerators.md#starting-a-docker-container-with-intel-integrated-gpu-intel-data-center-gpu-flex-series-and-intel-arc-gpu
Regarding embeddings, we have just added support for the OpenAI API embeddings endpoint. You can check the demo: https://github.com/openvinotoolkit/model_server/blob/main/demos/embeddings/README.md It documents the export from HF to deploy the model in OVMS. The nomic-embed-text model should work fine.
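As a rough sketch of that flow (the --task and --trust-remote-code flags, the output directory, and the request below are assumptions based on general optimum-cli and OpenAI API conventions, not the demo's exact commands; check the demo README for the authoritative steps):

optimum-cli export openvino --model nomic-ai/nomic-embed-text-v1.5 --task feature-extraction --trust-remote-code ovmodels/nomic-embed-text-v1.5

curl http://localhost:8000/v3/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text-v1.5", "input": "This is a test"}'

The "model" field must match the servable name registered in config.json.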

@anandnandagiri
Author

@dtrawins I have followed the embeddings demo. I see a few issues with configuring graph.pbtxt, config.json, and subconfig.json with the Docker image. Did I miss anything?

config.json and folder structure
image

graph.pbtxt

input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: 'LOOPBACK:0',
    back_edge: true
  }
  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "/ovmodels/llama3.2-1b/1",
          plugin_config: '{}',
          enable_prefix_caching: false
          cache_size: 4,
          block_size: 16,
          dynamic_split_fuse: false,
          max_num_seqs: 25,
          max_num_batched_tokens:2048,          
          device: "GPU"
      }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}

subconfig.json

{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "Alibaba-NLP/gte-large-en-v1.5-embeddings",
            "base_path": "/ovmodels/gte-large-en-v1.5/models/gte-large-en-v1.5-embeddings"
          },
      {
        "name": "Alibaba-NLP/gte-large-en-v1.5-tokenizer",
        "base_path": "/ovmodels/gte-large-en-v1.5/models/gte-large-en-v1.5-tokenizer"
      }
    ]
  }

I am using the Docker GPU image; below is the command:

docker run --rm -it -v  ./ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000  openvino/model_server:latest-gpu --config_path ovmodels/config.json --rest_port 8000

I am getting the following error:

image

@dtrawins
Collaborator

@anandnandagiri I think the folder gte-large-en-v1.5/models should contain a graph.pbtxt with the graph specific to embeddings. It looks like you copied the same graph from the LLM pipeline. The graph file defines which calculators are applied and how they are connected. Try copying the one from the embeddings demo.

@anandnandagiri
Author

I am still getting an error after switching to the graph.pbtxt from the embeddings demo:

image

@dtrawins
Collaborator

@anandnandagiri There was a recent simplification of the Docker command in the demo that dropped the --cpu_extension parameter (d720af7), which I assume you followed. It requires, however, building the image from the latest main branch.
You can either rebuild the image or add the parameter --cpu_extension /ovms/lib/libopenvino_tokenizers.so to the docker run command.

@anandnandagiri
Author

anandnandagiri commented Oct 24, 2024

@dtrawins I followed all the steps mentioned above, but I renamed the models folder to gte-large-en-v1.5 to standardize things and run multiple models. Below is a screenshot of the folder structure:

image

Docker Command

docker run --rm -it -v  ./ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000  openvino/model_server:latest-gpu --config_path ovmodels/configembed.json --rest_port 8000 --cpu_extension /ovms/lib/libopenvino_tokenizers.so

I see the error below:
image

@atobiszei
Collaborator

atobiszei commented Oct 25, 2024

@anandnandagiri
This message:
Unable to find Calculator EmbeddingsCalculator
suggests that you are using the latest release, which does not support embeddings yet. This work has not been released in the public Docker image; you need to build the Docker image from main.
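A rough sketch of building from main, assuming the flow described in the repository's build documentation (the exact make target and options may differ; treat this as illustrative only):

git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
make release_image GPU=1    # hypothetical invocation; see the build-from-source docs for the actual target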

@atobiszei atobiszei removed the bug Something isn't working label Oct 28, 2024