The 2024.5 release comes with support for embedding and rerank endpoints, as well as an experimental version with Windows support.
Changes and improvements
- The OpenAI API text embedding endpoint has been added, enabling OVMS to be used as a building block for AI applications like RAG.
- The rerank endpoint, based on the Cohere API, has been added, enabling easy similarity detection between a query and a set of documents. It is one of the building blocks for AI applications like RAG and makes integration with frameworks such as LangChain easy.
- The `echo` sampling parameter together with `logprobs` in the `completions` endpoint is now supported.
- Performance increase on both CPU and GPU for LLM text generation.
- LLM `dynamic_split_fuse` for the GPU target device boosts throughput in high-concurrency scenarios.
- The procedure for LLM service deployment and model repository preparation has been simplified.
- Improved LLM test coverage and stability.
- Instructions for building an experimental Windows binary package (a native model server for Windows) are now available. This experimental build has known limitations and limited test coverage. It is intended for testing, while the production-ready release is expected with 2025.0. All feedback is welcome.
- The OpenVINO Model Server C-API now supports asynchronous inference, improves performance by allowing outputs to be set, and enables the use of OpenCL and VA surfaces on both inputs and outputs for the GPU target device.
- The KServe REST API model metadata endpoint can now provide additional `model_info` references.
- Included support for NPU and iGPU on MTL and LNL platforms.
- Security and stability improvements.
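The new embedding, rerank, and `completions` capabilities listed above can be exercised as plain HTTP calls. The following is a minimal sketch, not an official client: the server address, `/v3` path prefix, and model names are placeholder assumptions to adjust for your deployment, and the request bodies follow the OpenAI and Cohere API shapes these endpoints mirror.

```python
import json
import urllib.request

# Assumed server address and path prefix; adjust to your deployment.
BASE = "http://localhost:8000/v3"

def embeddings_payload(texts, model="embedding-model"):
    # OpenAI-style embeddings request body ("embedding-model" is a placeholder name)
    return {"model": model, "input": texts}

def rerank_payload(query, documents, model="rerank-model"):
    # Cohere-style rerank request body ("rerank-model" is a placeholder name)
    return {"model": model, "query": query, "documents": documents}

def completions_payload(prompt, model="llm-model"):
    # Completions request using the newly supported echo + logprobs parameters
    return {"model": model, "prompt": prompt, "echo": True,
            "logprobs": 1, "max_tokens": 16}

def post(path, payload):
    # Minimal JSON POST helper with no third-party dependencies
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With a running model server (not executed here):
# embeddings = post("/embeddings", embeddings_payload(["What is OVMS?"]))
# ranked = post("/rerank", rerank_payload("What is OVMS?", ["doc A", "doc B"]))
# completion = post("/completions", completions_payload("OVMS is"))
```

Because the bodies are plain dictionaries, the same payload builders work with any HTTP client, e.g. `requests` or an async client, if you prefer one over `urllib`.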
Breaking changes
No breaking changes.
Bug fixes
- Fixed support for URL-encoded model names in the KServe REST API
- OpenAI text generation endpoints now accept requests with both `v3` and `v3/v1` path prefixes
- Fixed metrics reporting in the video stream benchmark client
- Fixed a sporadic INVALID_ARGUMENT error on the completions endpoint
- Fixed an incorrect LLM finish reason when expecting `stop` but getting `length`
Discontinuation plans
In a future release, support for the following build options will not be maintained:
- Ubuntu 20 as the base image
- OpenVINO NVIDIA plugin
You can use the OpenVINO Model Server public Docker images based on Ubuntu 22.04 via the following commands:
- `docker pull openvino/model_server:2024.5` - CPU device support
- `docker pull openvino/model_server:2024.5-gpu` - GPU, NPU and CPU device support

or use the provided binary packages.
The prebuilt image is also available on the Red Hat Ecosystem Catalog.