The 2024.5 release comes with support for embedding and rerank endpoints, as well as an experimental version with Windows support.
Changes and improvements
- The OpenAI API text embedding endpoint has been added, enabling OVMS to be used as a building block for AI applications like RAG.
- The rerank endpoint, based on the Cohere API, has been added, enabling easy similarity detection between a query and a set of documents. It is one of the building blocks for AI applications like RAG and makes integration with frameworks such as LangChain easy.
- The `echo` sampling parameter together with `logprobs` in the `completions` endpoint is now supported.
- Performance increase on both CPU and GPU for LLM text generation.
- LLM `dynamic_split_fuse` for the GPU target device boosts throughput in high-concurrency scenarios.
- The procedure for LLM service deployment and model repository preparation has been simplified.
- Improved LLM test coverage and stability.
- Instructions for building an experimental Windows binary package (a native model server for Windows) are now available. This experimental build has known limitations and limited test coverage. It is intended for testing, while the production-ready release is expected with 2025.0. All feedback is welcome.
- The OpenVINO Model Server C-API now supports asynchronous inference, improves performance by allowing outputs to be set, and enables the use of OpenCL and VA surfaces on both inputs and outputs for the GPU target device.
- The KServe REST API model metadata endpoint can now provide additional `model_info` references.
- Included support for NPU and iGPU on MTL and LNL platforms.
- Security and stability improvements.
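The new embedding, rerank, and `completions` capabilities listed above can be exercised as plain HTTP calls. The following is a minimal sketch, not an official client: the server address, `/v3` path prefix, and model names are placeholder assumptions to adjust for your deployment, and the request bodies follow the OpenAI and Cohere API shapes these endpoints mirror.

```python
import json
import urllib.request

# Assumed server address and path prefix; adjust to your deployment.
BASE = "http://localhost:8000/v3"

def embeddings_payload(texts, model="embedding-model"):
    # OpenAI-style embeddings request body ("embedding-model" is a placeholder name)
    return {"model": model, "input": texts}

def rerank_payload(query, documents, model="rerank-model"):
    # Cohere-style rerank request body ("rerank-model" is a placeholder name)
    return {"model": model, "query": query, "documents": documents}

def completions_payload(prompt, model="llm-model"):
    # Completions request using the newly supported echo + logprobs parameters
    return {"model": model, "prompt": prompt, "echo": True,
            "logprobs": 1, "max_tokens": 16}

def post(path, payload):
    # Minimal JSON POST helper with no third-party dependencies
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With a running model server (not executed here):
# embeddings = post("/embeddings", embeddings_payload(["What is OVMS?"]))
# ranked = post("/rerank", rerank_payload("What is OVMS?", ["doc A", "doc B"]))
# completion = post("/completions", completions_payload("OVMS is"))
```

Because the bodies are plain dictionaries, the same payload builders work with any HTTP client, e.g. `requests` or an async client, if you prefer one over `urllib`.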
Breaking changes
No breaking changes.
Bug fixes
- Fixed support for URL-encoded model names in the KServe REST API
- OpenAI text generation endpoints now accept requests with both `v3` and `v3/v1` path prefixes
- Fixed metrics reporting in the video stream benchmark client
- Fixed a sporadic INVALID_ARGUMENT error on the completions endpoint
- Fixed an incorrect LLM finish reason when expecting `stop` but getting `length`
Discontinuation plans
In a future release, support for the following build options will not be maintained:
- Ubuntu 20 as the base image
- OpenVINO NVIDIA plugin
You can use the OpenVINO Model Server public Docker images based on Ubuntu 22.04 via the following commands:
- `docker pull openvino/model_server:2024.5` - CPU device support
- `docker pull openvino/model_server:2024.5-gpu` - GPU, NPU and CPU device support

or use the provided binary packages.
The prebuilt image is also available on the Red Hat Ecosystem Catalog.