OpenAI API completions endpoint - Not working as expected #2719

Open
anandnandagiri opened this issue Oct 2, 2024 · 12 comments
Labels
help wanted Extra attention is needed

Comments

@anandnandagiri

anandnandagiri commented Oct 2, 2024

I downloaded the Llama 3.2 1B model from Hugging Face with optimum-cli:

optimum-cli export openvino --model meta-llama/Llama-3.2-1B-Instruct llama3.2-1b/1

Below are the downloaded files:

image

Note: I manually removed openvino_detokenizer.bin, openvino_detokenizer.xml, openvino_tokenizer.xml, and openvino_tokenizer.bin to ensure there is only one .bin and one .xml file in the version 1 folder.

I ran Model Server with the command below, ensuring the Windows WSL path is correct and passing the Docker parameters for the Intel Iris GPU:

docker run --rm -it -v %cd%/ovmodels/llama3.2-1b:/models/llama3.2-1b --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000 openvino/model_server:latest-gpu --model_path /models/llama3.2-1b --model_name llama3.2-1b --rest_port 8000

I ran the command below, which worked perfectly:
curl --request GET http://172.17.0.3:8000/v1/config

Below is the output:

{
  "llama3.2-1b": {
    "model_version_status": [
      {
        "version": "1",
        "state": "AVAILABLE",
        "status": {
          "error_code": "OK",
          "error_message": "OK"
        }
      }
    ]
  }
}

But the curl command below for the OpenAI API completions endpoint did not work as expected:

curl http://172.17.0.3:8000/v3/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2-1b", "prompt": "This is a test", "stream": false }'

It returns the error:
{"error": "Model with requested name is not found"}

@anandnandagiri anandnandagiri added the bug Something isn't working label Oct 2, 2024
@atobiszei atobiszei added the help wanted Extra attention is needed label Oct 2, 2024
@dkalinowski
Collaborator

dkalinowski commented Oct 3, 2024

Hello @anandnandagiri

You are trying to serve the model directly, with no continuous batching pipeline. In that scenario the model is exposed for single inference via the standard TFS/KServe APIs, with no text generation loop. To use the text generation flow via the OpenAI completions API, please refer to the Continuous Batching Demo. Just follow the steps and swap the model for Llama 3.2 1B.
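For reference, here is a minimal sketch of the MediaPipe-style configuration the demo sets up (the servable name llama3.2-1b and the /ovmodels paths are assumptions matching the setup above, not the demo's exact files):

config.json:

{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "llama3.2-1b",
            "base_path": "/ovmodels/llama3.2-1b"
        }
    ]
}

The base_path folder holds the exported model files plus a graph.pbtxt containing the HttpLLMCalculator node from the demo. With that in place, the request goes to /v3/completions and the "model" field must match the servable name, i.e. "llama3.2-1b".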

@anandnandagiri
Author

Thank You @dkalinowski

@anandnandagiri
Author

@dkalinowski

I followed the steps in the Continuous Batching Demo. Below are the exceptions I am getting. I don't see any GPU resource issue (image attached for reference). I am testing on an Intel Core i7 11th-generation processor.

When I run code directly on the GPU it works fine, but with Model Server it does not:

docker run --rm -it -v %cd%\ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000 openvino/model_server:latest-gpu --config_path ovmodels/model_config_list.json --rest_port 8000

[2024-10-03 14:13:45.386][1][serving][info][server.cpp:75] OpenVINO Model Server 2024.4.28219825c
[2024-10-03 14:13:45.386][1][serving][info][server.cpp:76] OpenVINO backend c3152d32c9c7
[2024-10-03 14:13:45.386][1][serving][info][pythoninterpretermodule.cpp:35] PythonInterpreterModule starting
Python version
3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
[2024-10-03 14:13:45.559][1][serving][info][pythoninterpretermodule.cpp:46] PythonInterpreterModule started
[2024-10-03 14:13:45.766][1][modelmanager][info][modelmanager.cpp:125] Available devices for Open VINO: CPU, GPU
[2024-10-03 14:13:45.768][1][serving][info][grpcservermodule.cpp:122] GRPCServerModule starting
[2024-10-03 14:13:45.770][1][serving][info][grpcservermodule.cpp:191] GRPCServerModule started
[2024-10-03 14:13:45.770][1][serving][info][grpcservermodule.cpp:192] Started gRPC server on port 9178
[2024-10-03 14:13:45.770][1][serving][info][httpservermodule.cpp:33] HTTPServerModule starting
[2024-10-03 14:13:45.770][1][serving][info][httpservermodule.cpp:37] Will start 32 REST workers
[2024-10-03 14:13:45.776][1][serving][info][http_server.cpp:269] REST server listening on port 8000 with 32 threads
[evhttp_server.cc : 253] NET_LOG: Entering the event loop ...
[2024-10-03 14:13:45.776][1][serving][info][httpservermodule.cpp:47] HTTPServerModule started
[2024-10-03 14:13:45.777][1][serving][info][httpservermodule.cpp:48] Started REST server at 0.0.0.0:8000
[2024-10-03 14:13:45.777][1][serving][info][servablemanagermodule.cpp:51] ServableManagerModule starting
[2024-10-03 14:13:45.791][1][modelmanager][info][modelmanager.cpp:536] Configuration file doesn't have custom node libraries property.
[2024-10-03 14:13:45.791][1][modelmanager][info][modelmanager.cpp:579] Configuration file doesn't have pipelines property.
[2024-10-03 14:13:45.796][1][serving][info][mediapipegraphdefinition.cpp:419] MediapipeGraphDefinition initializing graph nodes
Inference requests aggregated statistic:
Paged attention % of inference execution: -nan
MatMul % of inference execution: -nan
Total inference execution secs: 0

[2024-10-03 14:15:05.783][1][serving][error][llmnoderesources.cpp:169] Error during llm node initialization for models_path: /ovmodels/llama3.2-1b/1 exception: Exception from src/inference/src/cpp/remote_context.cpp:68:
Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cpp:201:
[GPU] out of GPU resources

[2024-10-03 14:15:05.783][1][serving][error][mediapipegraphdefinition.cpp:468] Failed to process LLM node graph llama3.2-1b
[2024-10-03 14:15:05.783][1][modelmanager][info][pipelinedefinitionstatus.hpp:59] Mediapipe: llama3.2-1b state changed to: LOADING_PRECONDITION_FAILED after handling: ValidationFailedEvent:
[2024-10-03 14:15:05.784][1][serving][info][servablemanagermodule.cpp:55] ServableManagerModule started
[2024-10-03 14:15:05.785][115][modelmanager][info][modelmanager.cpp:1087] Started cleaner thread
[2024-10-03 14:15:05.784][114][modelmanager][info][modelmanager.cpp:1068] Started model manager thread

image

@dtrawins
Collaborator

dtrawins commented Oct 4, 2024

@anandnandagiri How much memory do you have assigned to the WSL? On the Linux side there might be less memory available for the GPU. Try reducing the cache size in graph.pbtxt, which in the demo is set to 8 GB. Try a lower value like 4 or even less.
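For illustration, the fragment of node_options in graph.pbtxt to edit looks roughly like this (field names as in the demo graph shown later in this thread; the value 4 is just an example):

  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "/ovmodels/llama3.2-1b/1",
          cache_size: 4,   # KV cache size in GB; lower it if the GPU runs out of memory
          device: "GPU"
      }
  }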

@anandnandagiri
Author

anandnandagiri commented Oct 5, 2024

@dtrawins It worked well when I changed the cache_size to 2 in graph.pbtxt.

Info: I am using WSL2 on Windows 10 Pro with no .wslconfig present, so it is running with the default configuration. I am using Docker Desktop to run Model Server (no Linux distro) through the command prompt.

Any help on the following?

  1. Is there a link to documentation on graph.pbtxt? (To my surprise, if I remove --volume /usr/lib/wsl:/usr/lib/wsl from the command, the Model Server GPU container does not run at all.)
  2. I am not able to run, or at least convert, text embedding models such as https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 to OpenVINO format with "optimum-cli export openvino" to serve text embeddings for a vector store. Can optimum-cli convert them, and if so, does Model Server support them?

@dtrawins
Collaborator

@anandnandagiri The graph documentation can be found here: https://github.com/openvinotoolkit/model_server/blob/main/docs/llm/reference.md The mount parameters you used are required to make the GPU accessible in the container on WSL; this is documented here: https://github.com/openvinotoolkit/model_server/blob/main/docs/accelerators.md#starting-a-docker-container-with-intel-integrated-gpu-intel-data-center-gpu-flex-series-and-intel-arc-gpu
Regarding embeddings, we have just added support for the OpenAI API embeddings endpoint. You can check the demo: https://github.com/openvinotoolkit/model_server/blob/main/demos/embeddings/README.md It documents the export from HF to deploy the model in OVMS. The nomic-embed-text model should work fine.
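As a rough sketch of that flow (the --task and --trust-remote-code flags, the output directory, and the request below are assumptions based on general optimum-cli and OpenAI API conventions, not the demo's exact commands; check the demo README for the authoritative steps):

optimum-cli export openvino --model nomic-ai/nomic-embed-text-v1.5 --task feature-extraction --trust-remote-code ovmodels/nomic-embed-text-v1.5

curl http://localhost:8000/v3/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text-v1.5", "input": "This is a test"}'

The "model" field must match the servable name registered in config.json.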

@anandnandagiri
Author

@dtrawins I have followed the embeddings demo. I see a few issues with configuring graph.pbtxt, config.json, and subconfig.json with the Docker image. Did I miss anything?

config.json and folder structure
image

graph.pbtxt

input_stream: "HTTP_REQUEST_PAYLOAD:input"
output_stream: "HTTP_RESPONSE_PAYLOAD:output"

node: {
  name: "LLMExecutor"
  calculator: "HttpLLMCalculator"
  input_stream: "LOOPBACK:loopback"
  input_stream: "HTTP_REQUEST_PAYLOAD:input"
  input_side_packet: "LLM_NODE_RESOURCES:llm"
  output_stream: "LOOPBACK:loopback"
  output_stream: "HTTP_RESPONSE_PAYLOAD:output"
  input_stream_info: {
    tag_index: 'LOOPBACK:0',
    back_edge: true
  }
  node_options: {
      [type.googleapis.com / mediapipe.LLMCalculatorOptions]: {
          models_path: "/ovmodels/llama3.2-1b/1",
          plugin_config: '{}',
          enable_prefix_caching: false
          cache_size: 4,
          block_size: 16,
          dynamic_split_fuse: false,
          max_num_seqs: 25,
          max_num_batched_tokens:2048,          
          device: "GPU"
      }
  }
  input_stream_handler {
    input_stream_handler: "SyncSetInputStreamHandler",
    options {
      [mediapipe.SyncSetInputStreamHandlerOptions.ext] {
        sync_set {
          tag_index: "LOOPBACK:0"
        }
      }
    }
  }
}

subconfig.json

{
    "model_config_list": [],
    "mediapipe_config_list": [
        {
            "name": "Alibaba-NLP/gte-large-en-v1.5-embeddings",
            "base_path": "/ovmodels/gte-large-en-v1.5/models/gte-large-en-v1.5-embeddings"
          },
      {
        "name": "Alibaba-NLP/gte-large-en-v1.5-tokenizer",
        "base_path": "/ovmodels/gte-large-en-v1.5/models/gte-large-en-v1.5-tokenizer"
      }
    ]
  }

I am using the Docker GPU image; below is the command:

docker run --rm -it -v  ./ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000  openvino/model_server:latest-gpu --config_path ovmodels/config.json --rest_port 8000

I am getting the following error:

image

@dtrawins
Collaborator

@anandnandagiri I think the folder gte-large-en-v1.5/models should contain a graph.pbtxt with the graph specific to embeddings. It looks like you copied the same graph from the LLM pipeline. The graph file defines which calculators are applied and how they are connected. Try copying the one from the embeddings demo.

@anandnandagiri
Author

I am still getting an error after switching to the graph.pbtxt from the embeddings demo:

image

@dtrawins
Collaborator

@anandnandagiri There was a recent simplification of the Docker command in the demo that dropped the --cpu_extension parameter (d720af7), which I assume you followed. It requires, however, building the image from the latest main branch.
You can either rebuild the image or add the parameter --cpu_extension /ovms/lib/libopenvino_tokenizers.so to the docker run command.

@anandnandagiri
Author

anandnandagiri commented Oct 24, 2024

@dtrawins I followed all the steps mentioned above, but I renamed the models folder to gte-large-en-v1.5 to standardize things and run multiple models. Below is a screenshot of the folder structure:

image

Docker Command

docker run --rm -it -v  ./ovmodels:/ovmodels --device=/dev/dxg --volume /usr/lib/wsl:/usr/lib/wsl -p 8000:8000  openvino/model_server:latest-gpu --config_path ovmodels/configembed.json --rest_port 8000 --cpu_extension /ovms/lib/libopenvino_tokenizers.so

I see the error below:
image

@atobiszei
Collaborator

atobiszei commented Oct 25, 2024

@anandnandagiri
This message:
Unable to find Calculator EmbeddingsCalculator
suggests that you are using the latest release, which does not support embeddings yet. This work has not been released in the public Docker image; you need to build the Docker image from main.
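A rough sketch of building from main, assuming the flow described in the repository's build documentation (the exact make target and options may differ; treat this as illustrative only):

git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
make release_image GPU=1    # hypothetical invocation; see the build-from-source docs for the actual target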

@atobiszei atobiszei removed the bug Something isn't working label Oct 28, 2024