diff --git a/docs/blog/articles/2023-10-08-KServe-0.11-release.md b/docs/blog/articles/2023-10-08-KServe-0.11-release.md
new file mode 100644
index 000000000..45f219ff1
--- /dev/null
+++ b/docs/blog/articles/2023-10-08-KServe-0.11-release.md
@@ -0,0 +1,143 @@
+# Announcing: KServe v0.11
+
+We are excited to announce the release of KServe 0.11. In this release we introduced Large Language Model (LLM) runtimes and made enhancements to the KServe control plane, Python SDK Open Inference Protocol support, and dependency management.
+For ModelMesh we have added PVC, HPA, and payload logging features to ensure feature parity with KServe.
+
+
+Here is a summary of the key changes:
+
+## KServe Core Inference Enhancements
+
+- Support path based routing as an alternative to host based routing; the URL of the `InferenceService` could look like `http:///serving//`.
+  Please refer to the [doc](https://github.com/kserve/kserve/blob/294a10495b6b5cda9c64d3e1573b60aec62aceb9/config/configmap/inferenceservice.yaml#L237) for how to enable path based routing.
+
+- Introduced a priority field for the `ServingRuntime` custom resource to handle the case where multiple serving runtimes support the same model formats; see more details in [the serving runtime doc](https://kserve.github.io/website/0.11/modelserving/servingruntimes/#priority).
+
+- Introduced the Custom Storage Container CRD to allow customized implementations with supported storage URI prefixes; an example use case is private model registry integration:
+  ```yaml
+  apiVersion: "serving.kserve.io/v1alpha1"
+  kind: ClusterStorageContainer
+  metadata:
+    name: default
+  spec:
+    container:
+      name: storage-initializer
+      image: kserve/model-registry:latest
+      resources:
+        requests:
+          memory: 100Mi
+          cpu: 100m
+        limits:
+          memory: 1Gi
+          cpu: "1"
+    supportedUriFormats:
+      - prefix: model-registry://
+  ```
+
+- Inference Graph enhancements improve the API spec to support pod affinity and resource requirement fields.
+  A `Dependency` field with options `Soft` and `Hard` is introduced to decide whether to short-circuit the request when an inference step returns an error; see the following example with hard dependencies on the node steps:
+
+  ```yaml
+  apiVersion: serving.kserve.io/v1alpha1
+  kind: InferenceGraph
+  metadata:
+    name: graph_with_switch_node
+  spec:
+    nodes:
+      root:
+        routerType: Sequence
+        steps:
+          - name: "rootStep1"
+            nodeName: node1
+            dependency: Hard
+          - name: "rootStep2"
+            serviceName: {{ success_200_isvc_id }}
+      node1:
+        routerType: Switch
+        steps:
+          - name: "node1Step1"
+            serviceName: {{ error_404_isvc_id }}
+            condition: "[@this].#(decision_picker==ERROR)"
+            dependency: Hard
+  ```
+  For more details please refer to the [issue](https://github.com/kserve/kserve/issues/2484).
+
+- Improved the InferenceService debugging experience by adding the aggregated `RoutesReady` status and `LastDeploymentReady` condition to the InferenceService Status to differentiate the endpoint and deployment status.
+  This applies to the serverless mode; for more details refer to the [API docs](https://pkg.go.dev/github.com/kserve/kserve@v0.11.1/pkg/apis/serving/v1beta1#InferenceServiceStatus).
+
+### Enhanced Python SDK Dependency Management
+
+- KServe has adopted [poetry](https://python-poetry.org/docs/) to manage Python dependencies. You can now install the KServe SDK with locked dependencies using `poetry install`.
+While `pip install` still works, we highly recommend using poetry to ensure predictable dependency management.
+
+- The KServe SDK is also slimmed down by making the cloud storage dependency optional; if you require the storage dependency for custom serving runtimes you can still install it with `pip install kserve[storage]`.
+
+
+### KServe Python Runtimes Improvements
+- KServe Python Runtimes including [sklearnserver](../../modelserving/v1beta1/sklearn/v2/README.md), [lgbserver](../../modelserving/v1beta1/lightgbm/README.md), [xgbserver](../../modelserving/v1beta1/xgboost/README.md)
+  now support the open inference protocol for both REST and gRPC.
+
+- Logging improvements including adding Uvicorn access logging and a default KServe logger.
+
+- The `Postprocess` handler has been aligned with the open inference protocol, simplifying the underlying transport protocol complexities.
+
+
+### LLM Runtimes
+
+### TorchServe LLM Runtime
+KServe now integrates with TorchServe 0.8, offering support for [LLM models](https://pytorch.org/serve/large_model_inference.html) that may not fit onto a single GPU.
+Huggingface Accelerate and Deepspeed are available options to split the model into multiple partitions over multiple GPUs. You can see the [detailed example](../../modelserving/v1beta1/llm/) for how to serve the LLM on KServe with the TorchServe runtime.
+
+### vLLM Runtime
+Serving LLM models can be surprisingly slow even on high end GPUs. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers.
+It supports [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) for increased throughput and GPU utilization, and
+[paged attention](https://vllm.ai) to address the memory bottleneck where, in the autoregressive decoding process, all the attention key value tensors (KV cache) are kept in GPU memory to generate the next tokens.
+
+In the [example](../../modelserving/v1beta1/llm/vllm/README.md) we show how to deploy vLLM on KServe, and we expect further integration in KServe 0.12 with the proposed [generate endpoint](https://github.com/kserve/open-inference-protocol/pull/7) for the open inference protocol.
+
+## ModelMesh Updates
+
+### Storing Models on Kubernetes Persistent Volumes (PVC)
+ModelMesh now allows you to [directly mount model files onto serving runtime pods](https://github.com/kserve/modelmesh-serving/blob/main/docs/predictors/setup-storage.md#deploy-a-model-stored-on-a-persistent-volume-claim)
+using [Kubernetes Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/). Depending on the selected [storage solution](https://kubernetes.io/docs/concepts/storage/storage-classes/) this approach can significantly reduce latency when deploying new predictors,
+potentially removing the need for additional cloud object storage like AWS S3, GCS, or Azure Blob Storage altogether.
+
+
+### Horizontal Pod Autoscaling (HPA)
+Kubernetes [Horizontal Pod Autoscaling](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. Instead, a `HorizontalPodAutoscaler` automatically updates the serving
+runtime deployment with the number of Pods to best match the demand.
+
+### Model Metrics, Metrics Dashboard, Payload Event Logging
+ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or "heavy hitter" models and consequently fine-tune the deployments of those inference services, like allocating more resources or increasing the number of replicas for improved responsiveness or to avoid frequent cache misses.
+
+A new [Grafana dashboard](https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md#import-the-grafana-dashboard) was added to display the comprehensive set of [Prometheus metrics](https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md) like model loading
+and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment.
+
+The new [`PayloadProcessor` interface](https://github.com/kserve/modelmesh/blob/main/src/main/java/com/ibm/watson/modelmesh/payload/) can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.
+
+## What's Changed? :warning:
+- To allow longer InferenceService names within the DNS max length limit (see [issue](https://github.com/kserve/kserve/issues/1397)), the `Default` suffix in the inference service component (predictor/transformer/explainer) name has been removed for newly created InferenceServices.
+  This affects clients that use the component URL directly instead of the top-level InferenceService URL.
+
+- `Status.address.url` is now consistent for both serverless and raw deployment modes; the URL path portion is dropped in serverless mode.
+
+- Raw bytes are now accepted in the v1 protocol; if a `content-type` header is specified, it must be set to `application/json` for the JSON payload to be recognized and decoded.
+```bash
+curl -v -H "Content-Type: application/json" http://sklearn-iris.kserve-test.${CUSTOM_DOMAIN}/v1/models/sklearn-iris:predict -d @./iris-input.json
+```
+
+
+For a complete change list please read the release notes from [KServe v0.11](https://github.com/kserve/kserve/releases/tag/v0.11.0) and
+[ModelMesh v0.11](https://github.com/kserve/modelmesh-serving/releases/tag/v0.11.0).
+
+## Join the community
+
+- Visit our [Website](https://kserve.github.io/website/) or [GitHub](https://github.com/kserve)
+- Join the Slack ([#kserve](https://kubeflow.slack.com/?redir=%2Farchives%2FCH6E58LNP))
+- Attend our community meeting by subscribing to the [KServe calendar](https://wiki.lfaidata.foundation/display/kserve/calendars).
+- View our [community github repository](https://github.com/kserve/community) to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!
+
+
+Thanks to all the contributors who have made commits to the 0.11 release!
+
+The KServe Working Group
diff --git a/docs/modelserving/v1beta1/llm/vllm/README.md b/docs/modelserving/v1beta1/llm/vllm/README.md
new file mode 100644
index 000000000..d3976fb61
--- /dev/null
+++ b/docs/modelserving/v1beta1/llm/vllm/README.md
@@ -0,0 +1,81 @@
+## Deploy the LLaMA model with vLLM Runtime
+Serving LLM models can be surprisingly slow even on high end GPUs. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference engine. It can achieve 10x-20x higher throughput than Huggingface transformers.
+It supports [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) for increased throughput and GPU utilization, and
+[paged attention](https://vllm.ai) to address the memory bottleneck where, in the autoregressive decoding process, all the attention key value tensors (KV cache) are kept in GPU memory to generate the next tokens.
+
+You can deploy the LLaMA model with the built vLLM inference server container image using the `InferenceService` yaml API spec.
+We have work in progress integrating `vLLM` with the `Open Inference Protocol` and the KServe observability stack.
+
+The LLaMA model can be downloaded from [huggingface](https://huggingface.co/meta-llama/Llama-2-7b) and uploaded to your cloud storage.
+
+=== "Yaml"
+    ```yaml
+    kubectl apply -n kserve-test -f - < \
+        --tokenizer --dataset \
+        --request-rate
+"""
+import argparse
+import asyncio
+import json
+import logging
+import random
+import time
+from typing import Optional, Union, AsyncGenerator, List, Tuple
+
+import aiohttp
+import numpy as np
+from transformers import AutoTokenizer, PreTrainedTokenizerBase, PreTrainedTokenizer, PreTrainedTokenizerFast
+
+# Module-level logger used by get_tokenizer below.
+logger = logging.getLogger(__name__)
+
+# Fast LLaMA tokenizer suggested in the messages below. Assumption: this
+# mirrors the value used by vLLM's own tokenizer helper.
+_FAST_LLAMA_TOKENIZER = "hf-internal-testing/llama-tokenizer"
+
+# (prompt len, output len, latency)
+REQUEST_LATENCY: List[Tuple[int, int, float]] = []
+
+def get_tokenizer(
+    tokenizer_name: str,
+    *args,
+    tokenizer_mode: str = "auto",
+    trust_remote_code: bool = False,
+    tokenizer_revision: Optional[str] = None,
+    **kwargs,
+) -> Union[PreTrainedTokenizer, PreTrainedTokenizerFast]:
+    """Gets a tokenizer for the given model name via Huggingface."""
+    if tokenizer_mode == "slow":
+        if kwargs.get("use_fast", False):
+            raise ValueError(
+                "Cannot use the fast tokenizer in slow tokenizer mode.")
+        kwargs["use_fast"] = False
+
+    if ("llama" in tokenizer_name.lower() and kwargs.get("use_fast", True)
+            and tokenizer_name != _FAST_LLAMA_TOKENIZER):
+        logger.info(
+            "For some LLaMA V1 models, initializing the fast tokenizer may "
+            "take a long time. To reduce the initialization time, consider "
+            f"using '{_FAST_LLAMA_TOKENIZER}' instead of the original "
+            "tokenizer.")
+    try:
+        tokenizer = AutoTokenizer.from_pretrained(
+            tokenizer_name,
+            *args,
+            trust_remote_code=trust_remote_code,
+            tokenizer_revision=tokenizer_revision,
+            **kwargs)
+    except TypeError as e:
+        # The LLaMA tokenizer causes a protobuf error in some environments.
+        err_msg = (
+            "Failed to load the tokenizer. If you are using a LLaMA V1 model "
+            f"consider using '{_FAST_LLAMA_TOKENIZER}' instead of the "
+            "original tokenizer.")
+        raise RuntimeError(err_msg) from e
+    except ValueError as e:
+        # If the error pertains to the tokenizer class not existing or not
+        # currently being imported, suggest using the --trust-remote-code flag.
+        if (not trust_remote_code and
+            ("does not exist or is not currently imported." in str(e)
+             or "requires you to execute the tokenizer file" in str(e))):
+            err_msg = (
+                "Failed to load the tokenizer. If the tokenizer is a custom "
+                "tokenizer not yet available in the HuggingFace transformers "
+                "library, consider setting `trust_remote_code=True` in LLM "
+                "or using the `--trust-remote-code` flag in the CLI.")
+            raise RuntimeError(err_msg) from e
+        else:
+            raise e
+
+    if not isinstance(tokenizer, PreTrainedTokenizerFast):
+        logger.warning(
+            "Using a slow tokenizer. This might cause a significant "
+            "slowdown. Consider using a fast tokenizer instead.")
+    return tokenizer
+
+def sample_requests(
+    dataset_path: str,
+    num_requests: int,
+    tokenizer: PreTrainedTokenizerBase,
+) -> List[Tuple[str, int, int]]:
+    # Load the dataset.
+    with open(dataset_path) as f:
+        dataset = json.load(f)
+    # Filter out the conversations with less than 2 turns.
+    dataset = [
+        data for data in dataset
+        if len(data["conversations"]) >= 2
+    ]
+    # Only keep the first two turns of each conversation.
+    dataset = [
+        (data["conversations"][0]["value"], data["conversations"][1]["value"])
+        for data in dataset
+    ]
+
+    # Tokenize the prompts and completions.
+    prompts = [prompt for prompt, _ in dataset]
+    prompt_token_ids = tokenizer(prompts).input_ids
+    completions = [completion for _, completion in dataset]
+    completion_token_ids = tokenizer(completions).input_ids
+    tokenized_dataset = []
+    for i in range(len(dataset)):
+        output_len = len(completion_token_ids[i])
+        tokenized_dataset.append((prompts[i], prompt_token_ids[i], output_len))
+
+    # Filter out too long sequences.
+    filtered_dataset: List[Tuple[str, int, int]] = []
+    for prompt, prompt_token_ids, output_len in tokenized_dataset:
+        prompt_len = len(prompt_token_ids)
+        if prompt_len < 4 or output_len < 4:
+            # Prune too short sequences.
+            # This is because TGI causes errors when the input or output length
+            # is too short.
+            continue
+        if prompt_len > 1024 or prompt_len + output_len > 2048:
+            # Prune too long sequences.
+            continue
+        filtered_dataset.append((prompt, prompt_len, output_len))
+
+    # Sample the requests.
+    sampled_requests = random.sample(filtered_dataset, num_requests)
+    return sampled_requests
+
+
+async def get_request(
+    input_requests: List[Tuple[str, int, int]],
+    request_rate: float,
+) -> AsyncGenerator[Tuple[str, int, int], None]:
+    input_requests = iter(input_requests)
+    for request in input_requests:
+        yield request
+
+        if request_rate == float("inf"):
+            # If the request rate is infinity, then we don't need to wait.
+            continue
+        # Sample the request interval from the exponential distribution.
+        interval = np.random.exponential(1.0 / request_rate)
+        # The next request will be sent after the interval.
+        await asyncio.sleep(interval)
+
+
+async def send_request(
+    backend: str,
+    api_url: str,
+    prompt: str,
+    prompt_len: int,
+    output_len: int,
+    best_of: int,
+    use_beam_search: bool,
+) -> None:
+    request_start_time = time.perf_counter()
+
+    headers = {"User-Agent": "Benchmark Client"}
+    if backend == "vllm":
+        pload = {
+            "prompt": prompt,
+            "n": 1,
+            "best_of": best_of,
+            "use_beam_search": use_beam_search,
+            "temperature": 0.0 if use_beam_search else 1.0,
+            "top_p": 1.0,
+            "max_tokens": output_len,
+            "ignore_eos": True,
+            "stream": False,
+        }
+    elif backend == "tgi":
+        assert not use_beam_search
+        params = {
+            "best_of": best_of,
+            "max_new_tokens": output_len,
+            "do_sample": True,
+        }
+        pload = {
+            "inputs": prompt,
+            "parameters": params,
+        }
+    else:
+        raise ValueError(f"Unknown backend: {backend}")
+
+    timeout = aiohttp.ClientTimeout(total=3 * 3600)
+    async with aiohttp.ClientSession(timeout=timeout) as session:
+        while True:
+            async with session.post(api_url, headers=headers, json=pload) as response:
+                chunks = []
+                async for chunk, _ in response.content.iter_chunks():
+                    chunks.append(chunk)
+            output = b"".join(chunks).decode("utf-8")
+            output = json.loads(output)
+
+            # Re-send the request if it failed.
+            if "error" not in output:
+                break
+
+    request_end_time = time.perf_counter()
+    request_latency = request_end_time - request_start_time
+    REQUEST_LATENCY.append((prompt_len, output_len, request_latency))
+
+
+async def benchmark(
+    backend: str,
+    api_url: str,
+    input_requests: List[Tuple[str, int, int]],
+    best_of: int,
+    use_beam_search: bool,
+    request_rate: float,
+) -> None:
+    tasks: List[asyncio.Task] = []
+    async for request in get_request(input_requests, request_rate):
+        prompt, prompt_len, output_len = request
+        task = asyncio.create_task(send_request(backend, api_url, prompt,
+                                                prompt_len, output_len,
+                                                best_of, use_beam_search))
+        tasks.append(task)
+    await asyncio.gather(*tasks)
+
+
+def main(args: argparse.Namespace):
+    print(args)
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+
+    # Host and port are joined with ":" (not "/") to form a valid URL.
+    api_url = f"http://{args.host}:{args.port}/generate"
+    tokenizer = get_tokenizer(args.tokenizer, trust_remote_code=args.trust_remote_code)
+    input_requests = sample_requests(args.dataset, args.num_prompts, tokenizer)
+
+    benchmark_start_time = time.perf_counter()
+    asyncio.run(benchmark(args.backend, api_url, input_requests, args.best_of,
+                          args.use_beam_search, args.request_rate))
+    benchmark_end_time = time.perf_counter()
+    benchmark_time = benchmark_end_time - benchmark_start_time
+    print(f"Total time: {benchmark_time:.2f} s")
+    print(f"Throughput: {args.num_prompts / benchmark_time:.2f} requests/s")
+
+    # Compute the latency statistics.
+    avg_latency = np.mean([latency for _, _, latency in REQUEST_LATENCY])
+    print(f"Average latency: {avg_latency:.2f} s")
+    avg_per_token_latency = np.mean([
+        latency / (prompt_len + output_len)
+        for prompt_len, output_len, latency in REQUEST_LATENCY
+    ])
+    print(f"Average latency per token: {avg_per_token_latency:.2f} s")
+    avg_per_output_token_latency = np.mean([
+        latency / output_len
+        for _, output_len, latency in REQUEST_LATENCY
+    ])
+    print("Average latency per output token: "
+          f"{avg_per_output_token_latency:.2f} s")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Benchmark the online serving throughput.")
+    parser.add_argument("--backend", type=str, default="vllm",
+                        choices=["vllm", "tgi"])
+    parser.add_argument("--host", type=str, default="localhost")
+    parser.add_argument("--port", type=int, default=8000)
+    parser.add_argument("--dataset", type=str, required=True,
+                        help="Path to the dataset.")
+    parser.add_argument("--tokenizer", type=str, required=True,
+                        help="Name or path of the tokenizer.")
+    parser.add_argument("--best-of", type=int, default=1,
+                        help="Generates `best_of` sequences per prompt and "
+                             "returns the best one.")
+    parser.add_argument("--use-beam-search", action="store_true")
+    parser.add_argument("--num-prompts", type=int, default=1000,
+                        help="Number of prompts to process.")
+    parser.add_argument("--request-rate", type=float, default=float("inf"),
+                        help="Number of requests per second. If this is inf, "
+                             "then all the requests are sent at time 0. "
+                             "Otherwise, we use Poisson process to synthesize "
+                             "the request arrival times.")
+    parser.add_argument("--seed", type=int, default=0)
+    parser.add_argument('--trust-remote-code', action='store_true',
+                        help='trust remote code from huggingface')
+    args = parser.parse_args()
+    main(args)
diff --git a/docs/modelserving/v1beta1/llm/vllm/vllm.yaml b/docs/modelserving/v1beta1/llm/vllm/vllm.yaml
new file mode 100644
index 000000000..902bac4cf
--- /dev/null
+++ b/docs/modelserving/v1beta1/llm/vllm/vllm.yaml
@@ -0,0 +1,33 @@
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  name: llama-2-7b
+spec:
+  predictor:
+    containers:
+      - args:
+          - --port
+          - "8080"
+          - --model
+          - /mnt/models
+        command:
+          - python3
+          - -m
+          - vllm.entrypoints.api_server
+        env:
+          - name: STORAGE_URI
+            value: gs://kfserving-examples/models/huggingface/llama # Upload the llama model on your cloud storage
+        image: kserve/vllmserver:latest
+        name: kserve-container
+        resources:
+          limits:
+            cpu: "4"
+            memory: 50Gi
+            nvidia.com/gpu: "1"
+          requests:
+            cpu: "1"
+            memory: 50Gi
+            nvidia.com/gpu: "1"
+    maxReplicas: 1
+    minReplicas: 1
+
diff --git a/mkdocs.yml b/mkdocs.yml
index dfd79d586..cf171d1c3 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -97,6 +97,7 @@ nav:
     - Debugging guide: developer/debug.md
   - Blog:
     - Releases:
+      - KServe 0.11 Release: blog/articles/2023-10-08-KServe-0.11-release.md
       - KServe 0.10 Release: blog/articles/2023-02-05-KServe-0.10-release.md
       - KServe 0.9 Release: blog/articles/2022-07-21-KServe-0.9-release.md
       - KServe 0.8 Release: blog/articles/2022-02-18-KServe-0.8-release.md
diff --git a/overrides/main.html b/overrides/main.html
index 8eeba009a..f57d96775 100644
--- a/overrides/main.html
+++ b/overrides/main.html
@@ -2,6 +2,6 @@
 {% block announce %}

-    KServe v0.10 is Released, Read blog >>
+    KServe v0.11 is Released, Read blog >>

{% endblock %}
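
The request pacing used by `get_request` in `benchmark.py` can be sketched in isolation. The snippet below is a minimal illustration, not part of the change set: the function name `arrival_times` and its parameters are hypothetical, and it uses the standard library's `random.expovariate` in place of `np.random.exponential` so that it runs without NumPy. It shows that drawing each gap from an exponential distribution with mean `1 / request_rate` yields arrivals at roughly `request_rate` requests per second, and that an infinite rate sends everything at time 0.

```python
import random
from typing import List


def arrival_times(num_requests: int, request_rate: float, seed: int = 0) -> List[float]:
    """Return the send time (seconds from start) of each request.

    Hypothetical helper mirroring the pacing logic of get_request: the
    first request goes out immediately, and each later gap is drawn from
    an exponential distribution with mean 1 / request_rate, which makes
    the arrivals a Poisson process.
    """
    rng = random.Random(seed)
    times: List[float] = []
    now = 0.0
    for _ in range(num_requests):
        times.append(now)
        if request_rate != float("inf"):
            # expovariate takes the rate (lambda), not the mean gap.
            now += rng.expovariate(request_rate)
    return times


if __name__ == "__main__":
    times = arrival_times(num_requests=10_000, request_rate=5.0)
    observed_rate = (len(times) - 1) / times[-1]
    print(f"observed rate: {observed_rate:.2f} requests/s")
```

Because the gaps are random, the observed rate only approaches the requested rate as the number of requests grows, which is worth keeping in mind when benchmarking with a small `--num-prompts`.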